Latent-Factor Models
Like a change of basis, but instead of hand-picking the features, we learn them from data.
Part weights give a change of basis from $x_i$ to some $z_i$. The canonical example is Principal Component Analysis (PCA).
# PCA
PCA is parametric and does not have a unique global optimum.
Takes in a matrix $X$ and an input $k$ and outputs two matrices such that $X \approx ZW$:
- $Z$ is an (n,k) matrix. Each row $z_i$ is a set of features/part weights
- $W$ is a (k,d) matrix. Each row $w_c$ is a part/factor/principal component
- We can think of $W$ as rotating the data so that the slope is zero
- The approximation of one entry is $\hat x_{ij} = (w^j)^Tz_i = \langle w^j, z_i \rangle$, where $w^j$ is column $j$ of $W$
- Each $x_i$ can be thought of as a linear combination of all the factors
Assumptions:
- Assumes $X$ is centered (each column of $X$ has a mean of zero)
Use cases:
- Dimensionality reduction: effectively allows us to reduce the dimensionality of $X$ if $k < d$
- Actually, it only ever makes sense if $k \leq d$
- Outlier detection: if PCA gives a poor approximation, $x_i$ could be an outlier
- Data visualization: $k=2$ to visualize high-dimensional objects
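As a concrete example, here is a minimal numpy sketch that centers some made-up data and builds one optimal $Z$, $W$ pair for a given $k$ via the truncated SVD (one standard way to compute PCA, among others; the data, sizes, and $k$ are assumptions for illustration):

```python
import numpy as np

# Hypothetical data: n=100 examples, d=5 features (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Centering assumption: subtract the mean of each column.
X = X - X.mean(axis=0)

# Truncated SVD gives one optimal Z, W pair for a given k.
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:k, :]          # (k, d): each row is a principal component
Z = U[:, :k] * s[:k]   # (n, k): part weights for each example

print(X.shape, Z.shape, W.shape)       # (100, 5) (100, 2) (2, 5)
print(np.linalg.norm(Z @ W - X)**2)    # squared reconstruction error
```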
We minimize
$$\begin{aligned} f(W,Z)&= \sum_{i=1}^n \sum_{j=1}^d (\langle w^j, z_i \rangle - x_{ij})^2 & \textrm{Approximating } x_{ij} \textrm{ by } \langle w^j, z_i \rangle \\ &= \sum_{i=1}^n \lVert W^Tz_i - x_i \rVert^2 & \textrm{Approximating } x_i \textrm{ by } W^Tz_i\\ &= \lVert ZW - X \rVert_F^2 & \textrm{Approximating } X \textrm{ by } ZW \end{aligned}$$
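A quick numpy check (on made-up matrices) that the three forms of the objective are the same quantity:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 6, 4, 2                       # small illustrative sizes
X = rng.normal(size=(n, d))
Z = rng.normal(size=(n, k))
W = rng.normal(size=(k, d))

# Elementwise form: sum over i, j of (<w^j, z_i> - x_ij)^2
f1 = sum((W[:, j] @ Z[i] - X[i, j])**2 for i in range(n) for j in range(d))

# Row form: sum over i of ||W^T z_i - x_i||^2
f2 = sum(np.linalg.norm(W.T @ Z[i] - X[i])**2 for i in range(n))

# Matrix form: ||ZW - X||_F^2
f3 = np.linalg.norm(Z @ W - X, 'fro')**2

print(f1, f2, f3)   # all three agree (up to floating-point error)
```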
If we do alternating minimization,
- Fix $Z$ and optimize $W$: $\nabla_Wf(W,Z)=Z^TZW-Z^TX$
- Fix $W$ and optimize $Z$: $\nabla_Zf(W,Z)=ZWW^T-XW^T$
We converge to a local optimum, which will be a global optimum if $W$ and $Z$ are randomly initialized (so that we don't start at a saddle point).
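A minimal sketch of alternating minimization on synthetic data, solving each "gradient equals zero" condition in closed form (sizes, data, and iteration count are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 100, 5, 2
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)                  # centering assumption

# Random initialization (as discussed above)
Z = rng.normal(size=(n, k))
W = rng.normal(size=(k, d))

for _ in range(50):
    # Fix Z: set Z^T Z W - Z^T X = 0 and solve for W
    W = np.linalg.solve(Z.T @ Z, Z.T @ X)
    # Fix W: set Z W W^T - X W^T = 0 and solve for Z
    Z = np.linalg.solve(W @ W.T, W @ X.T).T

print(np.linalg.norm(Z @ W - X, 'fro')**2)   # objective after alternating updates
```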
For large $X$, we can also just use SGD; the cost per iteration is only $O(k)$.
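A sketch of the stochastic-gradient version, sampling one random entry $x_{ij}$ per iteration (step size, sizes, and iteration count are arbitrary); each update only touches the $k$ numbers in $z_i$ and $w^j$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 1000, 50, 5
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)

Z = 0.01 * rng.normal(size=(n, k))
W = 0.01 * rng.normal(size=(k, d))
alpha = 0.01                             # step size (illustrative)

for _ in range(100_000):
    i, j = rng.integers(n), rng.integers(d)
    r = Z[i] @ W[:, j] - X[i, j]         # residual for this single entry
    g_z = 2 * r * W[:, j]                # gradient of (<w^j, z_i> - x_ij)^2 w.r.t. z_i
    g_w = 2 * r * Z[i]                   # gradient w.r.t. w^j
    Z[i]    -= alpha * g_z               # each update costs O(k)
    W[:, j] -= alpha * g_w
```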
# Choose $k$ by variance explained
How much of the variance can be explained by the choice of factors?
For a given $k$, we compute the variance of the errors relative to the variance of the $x_{ij}$ values; this ratio is the variance "remaining", and the variance explained is one minus it:
$$\frac{\lVert ZW-X \rVert_F^2}{\lVert X \rVert_F^2}$$
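A small sketch that sweeps $k$ and reports the variance explained, using the SVD solution on made-up data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
for k in range(1, 6):
    Z, W = U[:, :k] * s[:k], Vt[:k, :]
    remaining = np.linalg.norm(Z @ W - X, 'fro')**2 / np.linalg.norm(X, 'fro')**2
    print(k, 1 - remaining)              # variance explained grows with k
```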
# Uniqueness
Optimal $W$ is non-unique:
- Scaling problem: Can multiply any $w_c$ by any non-zero $\alpha$
- Rotation problem: Can rotate any $w_c$ within the span
- Label switching problem: Can switch any $w_c$ with any other $w_c$
To help with uniqueness,
- Normalization: We ensure $\lVert w_c \rVert = 1$
- Orthogonality: We enforce $w_c^Tw_{c'}=0$ for all $c \neq c'$
- Sequential fitting: we fit each $w_c$ in sequence
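The $W$ produced by the SVD already satisfies the normalization and orthogonality constraints; a quick check on made-up data:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:3, :]                  # first k=3 principal components
print(np.round(W @ W.T, 8))    # identity matrix: rows are unit-norm and mutually orthogonal
```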
# Multi-Dimensional Scaling
Gradient descent on the locations of points in a scatterplot; try to make the scatterplot distances match the high-dimensional distances.
$$f(Z) = \sum_{i=1}^n\sum_{j=i+1}^n (\lVert z_i - z_j \rVert - \lVert x_i - x_j \rVert)^2$$
No $W$ matrix is needed! However, this cannot be solved using the singular value decomposition (a matrix factorization technique); we need gradient descent (a sketch follows the list below).
- Non convex
- Sensitive to initialization
- Unfortunately, MDS often does not work well in practice; like PCA, MDS tends to "crowd/squash" all the data points together.
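A minimal gradient-descent sketch for MDS on made-up data (step size, iteration count, and the random initialization are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, k = 30, 10, 2
X = rng.normal(size=(n, d))
D_high = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # high-dimensional distances

Z = rng.normal(size=(n, k))      # MDS is sensitive to this initialization
alpha = 0.01

for _ in range(500):
    diff = Z[:, None, :] - Z[None, :, :]              # (n, n, k) pairwise differences z_i - z_j
    D_low = np.linalg.norm(diff, axis=2)              # scatterplot distances
    np.fill_diagonal(D_low, 1.0)                      # avoid dividing by zero on the diagonal
    coef = (D_low - D_high) / D_low                   # (||z_i - z_j|| - ||x_i - x_j||) / ||z_i - z_j||
    np.fill_diagonal(coef, 0.0)
    grad = 2 * (coef[:, :, None] * diff).sum(axis=1)  # gradient of f(Z) w.r.t. each z_i
    Z -= alpha * grad

D_low = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
print(np.sum(np.triu(D_low - D_high, 1) ** 2))        # final value of f(Z)
```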
# t-SNE
t-Distributed Stochastic Neighbour Embedding
Using Euclidean (L2-norm) distances may not be a great representation of data that lives on a low-dimensional manifold. In these cases, Euclidean distances make sense "locally" but may not make sense "globally".
t-SNE is actually a special case of MDS. The key idea is to focus on the distances to "neighbours", allowing the gaps between far-apart points to grow.
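In practice we typically reach for a library implementation; a minimal sketch using scikit-learn's `TSNE` on made-up data (the parameter values shown are just the defaults):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 50))           # made-up high-dimensional data

# perplexity roughly controls how many points count as "neighbours"
Z = TSNE(n_components=2, perplexity=30).fit_transform(X)
print(Z.shape)                           # (200, 2): low-dimensional embedding for plotting
```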
# Word2Vec
Each word $i$ is represented by a vector of real numbers $z_i$
Trained using a masking technique.
- Takes sentence fragments and hides/masks a middle word
- Train so that the $z_i$ of the hidden word is similar to the $z_j$ of the surrounding words
$$p(z_i) = \prod_{j \in \textrm{surrounding}} \frac{\exp(z_i^Tz_j)}{\sum_{c=1}^{\textrm{\# words}} \exp(z_c^Tz_j)}$$
Gradient descent is done on $-\log(p(z_i))$.
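A small numpy sketch that evaluates $-\log(p(z_i))$ from the formula above and takes one gradient step on $z_i$ only (the vocabulary size, word indices, and step size are made up; a full implementation would also update the other word vectors):

```python
import numpy as np

rng = np.random.default_rng(8)
num_words, k = 1000, 50
Z = 0.1 * rng.normal(size=(num_words, k))   # one embedding vector per word

i = 42                                      # index of the hidden/masked word (arbitrary)
surrounding = [7, 99, 310, 512]             # indices of the surrounding words (arbitrary)

loss = 0.0
grad_zi = np.zeros(k)
for j in surrounding:
    scores = Z @ Z[j]                       # z_c^T z_j for every word c in the vocabulary
    p = np.exp(scores - scores.max())
    p /= p.sum()                            # softmax over the vocabulary
    loss += -np.log(p[i])                   # contribution of this context word to -log p(z_i)
    grad_zi += -(1 - p[i]) * Z[j]           # gradient of that term with respect to z_i

Z[i] -= 0.1 * grad_zi                       # one gradient-descent step on z_i
print(loss)
```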