# Latent-Factor Models

Like change of basis but instead of hand-picking the features, we learn them from data.

*Part weights* are a change of basis from $x_i$ to some $z_i$. The canonical example is Principal Component Analysis (PCA)

## # PCA

PCA is parametric and does *not* provide unique global optimum.

Takes in a matrix $X$ and an input $k$ and outputs two matrices such that $X \approx ZW$:

- $Z$ is a (n,k) matrix. Each row $z_i$ is a set of features/part weights
- W is a (k,d) matrix. Each row $w_c$ is a part/factor/principle component
- We can think of $W$ as
*rotating*data so that the slope is zero

- We can think of $W$ as
- Approximation of one $\hat x_{ij}$ is $(w^{j^T}z_i) = \langle w^j, z_i \rangle$
- Each $x_i$ can be thought of as a linear combination of all the factors

Assumptions:

- Assumes $X$ is centered (each column of $X$ has a mean of zero)

Use cases:

- Dimension reductionality: Effectively allows us to reduce the dimensionality of X if $k < d$
- Actually, it only ever makes sense if $k \leq d$

- Outlier detection: if PCA gives a poor approximation, $x_i$ could be an outlier
- Data visualization: $k=2$ to visualize high-dimensional objects

We minimize

$$\begin{aligned} f(W,Z)&= \sum_{i=1}^n \sum_{j=1}^d (\langle w^j, z_i \rangle - x_{ij})^2 & \textrm{Approximating } x_{ij} \textrm{ by } \langle w^j, z_i \rangle \\ &= \sum_{i=1}^n \lVert W^Tz_i - x_i \rVert^2 & \textrm{Approximating } x_i \textrm{ by } W^Tz_i\\ &= \lVert ZW - X \rVert_F^2 & \textrm{Approximating } X \textrm{ by } ZW \end{aligned}$$

If we do alternating minimization,

- Fix Z and optimize W: $\nabla_wf(W,Z)=Z^TZW-Z^TX$
- Fix W and optimize Z: $\nabla_wf(W,Z)=ZWW^T-XW^T$

We converge to a local optimum which will be a global optimum if W and Z are randomly initialized (if you don’t pick a saddle point)

For large X, we can also just use SGD and cost per iteration is only $O(k)$

### # Choose $k$ by variance explained

How much of the variance can be explained by the choice of factors?

For a given $k$, we compute the variance of the errors over the variable of each given $x_{ij}$

$$\frac{\lVert ZW-X \rVert_F^2}{\lVert X \rVert_F^2}$$

### # Uniqueness

Optimal $W$ is non-unique:

- Scaling problem: Can multiply any $w_c$ by any non-zero $\alpha$
- Rotation problem: Can rotate any $w_c$ within the span
- Label switching problem: Can switch any $w_c$ with any other $w_c$

To help with uniqueness,

- Normalization: We ensure $\lVert w_c \rVert = 1$
- Orthogonality: We enforce $w_c^Tw_{c’}=0$ for all $c \neq c'$
- Sequential fitting, we fit each $w_i$ in sequence

## # Multi-Dimensional Scaling

Gradient descent on points on a scatter point; try to make scatterplot distances match high-dimensional distances

$$f(Z) = \sum_{i=1}^n\sum_{j=i+1}^n (\lVert z_i - z_j \rVert - \lVert x_i - x_j \rVert)^2$$

No $W$ matrix needed! However, cannot be done using singular value decomposition (a matrix factoring technique). We need gradient descent.

- Non convex
- Sensitive to initialization
- Unfortunately, MDS often does not work well in practice; MDS tends to “crowd/squash” all the data points together like PCA.

## # t-SNE

t-Distributed Stochastic Neighbour Embedding

However, using Euclidean (L2-norm) may not be a great representation of data that lives on low-dimensional manifolds. In these cases, Euclidean distances make sense “locally” but Euclidean distances may not make sense “globally”.

t-SNE is actually a special case of . The key idea is to focus on distance to “neighbours”, allowing gaps between distances to grow

## # Word2Vec

Each word $i$ is represented by a vector of real numbers $z_i$

Trained using a masking technique.

- Takes sentence fragments and hides/masks a middle word
- Train so that $z_i$ of hidden word is similar to $z_i$ of surrounding words

$$p(z_i) = \prod_{j \in \textrm{surrounding}} \frac{\exp(z_i^Tz_j)}{\sum_{c=1}^\textrm{# words} \exp(z_c^Tz_j)}$$

Gradient descent on for $-\log(p(z_i))$