Search IconIcon to open search

Latent-Factor Models

Last updated Nov 23, 2022 Edit Source

Like change of basis but instead of hand-picking the features, we learn them from data.

Part weights are a change of basis from $x_i$ to some $z_i$. The canonical example is Principal Component Analysis (PCA)


PCA is parametric and does not provide unique global optimum.

Takes in a matrix $X$ and an input $k$ and outputs two matrices such that $X \approx ZW$:


Use cases:

We minimize

$$\begin{aligned} f(W,Z)&= \sum_{i=1}^n \sum_{j=1}^d (\langle w^j, z_i \rangle - x_{ij})^2 & \textrm{Approximating } x_{ij} \textrm{ by } \langle w^j, z_i \rangle \\ &= \sum_{i=1}^n \lVert W^Tz_i - x_i \rVert^2 & \textrm{Approximating } x_i \textrm{ by } W^Tz_i\\ &= \lVert ZW - X \rVert_F^2 & \textrm{Approximating } X \textrm{ by } ZW \end{aligned}$$

If we do alternating minimization,

  1. Fix Z and optimize W: $\nabla_wf(W,Z)=Z^TZW-Z^TX$
  2. Fix W and optimize Z: $\nabla_wf(W,Z)=ZWW^T-XW^T$

We converge to a local optimum which will be a global optimum if W and Z are randomly initialized (if you don’t pick a saddle point)

For large X, we can also just use SGD and cost per iteration is only $O(k)$

# Choose $k$ by variance explained

How much of the variance can be explained by the choice of factors?

For a given $k$, we compute the variance of the errors over the variable of each given $x_{ij}$

$$\frac{\lVert ZW-X \rVert_F^2}{\lVert X \rVert_F^2}$$

# Uniqueness

Optimal $W$ is non-unique:

  1. Scaling problem: Can multiply any $w_c$ by any non-zero $\alpha$
  2. Rotation problem: Can rotate any $w_c$ within the span
  3. Label switching problem: Can switch any $w_c$ with any other $w_c$

To help with uniqueness,

  1. Normalization: We ensure $\lVert w_c \rVert = 1$
  2. Orthogonality: We enforce $w_c^Tw_{c’}=0$ for all $c \neq c'$
  3. Sequential fitting, we fit each $w_i$ in sequence

# Multi-Dimensional Scaling

Gradient descent on points on a scatter point; try to make scatterplot distances match high-dimensional distances

$$f(Z) = \sum_{i=1}^n\sum_{j=i+1}^n (\lVert z_i - z_j \rVert - \lVert x_i - x_j \rVert)^2$$

No $W$ matrix needed! However, cannot be done using singular value decomposition (a matrix factoring technique). We need gradient descent.

# t-SNE

t-Distributed Stochastic Neighbour Embedding

However, using Euclidean (L2-norm) may not be a great representation of data that lives on low-dimensional manifolds. In these cases, Euclidean distances make sense “locally” but Euclidean distances may not make sense “globally”.

t-SNE is actually a special case of . The key idea is to focus on distance to “neighbours”, allowing gaps between distances to grow

# Word2Vec

Each word $i$ is represented by a vector of real numbers $z_i$

Trained using a masking technique.

$$p(z_i) = \prod_{j \in \textrm{surrounding}} \frac{\exp(z_i^Tz_j)}{\sum_{c=1}^\textrm{# words} \exp(z_c^Tz_j)}$$

Gradient descent on for $-\log(p(z_i))$