Like change of basis but instead of hand-picking the features, we learn them from data.

*Part weights* are a change of basis from $x_{i}$ to some $z_{i}$. The canonical example is Principal Component Analysis (PCA).

## PCA

PCA is parametric and does *not* have a unique global optimum.

PCA takes an $(n,d)$ matrix $X$ and an integer $k$, and outputs two matrices such that $X≈ZW$:

- $Z$ is an $(n,k)$ matrix. Each row $z_{i}$ is a set of features/part weights
- $W$ is a $(k,d)$ matrix. Each row $w_{c}$ is a part/factor/principal component
- We can think of $W$ as *rotating* the data so that the slope is zero
- The approximation of one $x_{ij}$ is $\hat{x}_{ij}=(w^{j})^{T}z_{i}=⟨w^{j},z_{i}⟩$, where $w^{j}$ is column $j$ of $W$
- Each $x_{i}$ can be thought of as a linear combination of all the factors
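A concrete sketch of this factorization on synthetic data, computing the components via SVD (one standard way to fit PCA; the data here is made up for illustration):

```python
import numpy as np

# Toy data: n=100 points in d=3 dimensions, with one near-flat direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ np.diag([2.0, 1.0, 0.1])
X = X - X.mean(axis=0)          # PCA assumes X is centered

k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:k]                      # (k, d): each row is a principal component
Z = X @ W.T                     # (n, k): each row z_i holds the part weights

X_hat = Z @ W                   # rank-k approximation, X ≈ ZW
print(X_hat.shape)              # (100, 3)
```

Each row of `X_hat` is a linear combination of the $k$ rows of `W`, weighted by the corresponding row of `Z`.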

Assumptions:

- Assumes $X$ is centered (each column of $X$ has a mean of zero)

Use cases:

- Dimensionality reduction: effectively allows us to reduce the dimensionality of $X$ if $k<d$
    - It only ever makes sense to pick $k≤d$

- Outlier detection: if PCA gives a poor approximation, $x_{i}$ could be an outlier
- Data visualization: $k=2$ to visualize high-dimensional objects

We minimize the squared reconstruction error, $f(W,Z)=\frac{1}{2}∥ZW−X∥_{F}^{2}$.

If we do alternating minimization,

- Fix $Z$ and optimize $W$: $∇_{W}f(W,Z)=Z^{T}ZW−Z^{T}X$
- Fix $W$ and optimize $Z$: $∇_{Z}f(W,Z)=ZWW^{T}−XW^{T}$

We converge to a local optimum, which will be a global optimum if $W$ and $Z$ are randomly initialized (so you don’t start at a saddle point).

For large $X$, we can also just use stochastic gradient descent, updating one randomly chosen entry $x_{ij}$ at a time; the cost per iteration is only $O(k)$.
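The alternating-minimization steps above can be sketched directly: each step sets the corresponding gradient to zero by solving a small linear system (synthetic rank-$k$ data; iteration count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 50, 4, 2
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))   # exactly rank-k data
X = X - X.mean(axis=0)

# Random initialization (avoids starting exactly at a saddle point)
W = rng.normal(size=(k, d))
Z = rng.normal(size=(n, k))

for _ in range(100):
    # Fix Z, optimize W: solve Z^T Z W = Z^T X
    W = np.linalg.solve(Z.T @ Z, Z.T @ X)
    # Fix W, optimize Z: solve W W^T Z^T = W X^T
    Z = np.linalg.solve(W @ W.T, W @ X.T).T

err = np.linalg.norm(Z @ W - X)
```

Because this $X$ is exactly rank $k$, the reconstruction error drops to (numerically) zero.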

### Choose $k$ by variance explained

How much of the variance can be explained by the choice of factors?

For a given $k$, we compute the variance of the errors over the variance of the $x_{ij}$:

$\frac{∥ZW−X∥_{F}^{2}}{∥X∥_{F}^{2}}$

This ratio is the variance *remaining*; variance explained is one minus it.
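A small sketch of sweeping $k$ and reporting variance explained (synthetic data with decreasing variance per axis):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
explained = []
for k in range(1, 6):
    W = Vt[:k]                 # top-k principal components
    Z = X @ W.T
    remaining = np.linalg.norm(Z @ W - X) ** 2 / np.linalg.norm(X) ** 2
    explained.append(1 - remaining)
print(explained)               # increases toward 1 as k grows
```

A common heuristic is to pick the smallest $k$ whose variance explained crosses some threshold (e.g. 90%).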

### Uniqueness

Optimal $W$ is non-unique:

- Scaling problem: Can multiply any $w_{c}$ by any non-zero $α$
- Rotation problem: Can rotate any $w_{c}$ within the span
- Label switching problem: Can switch any $w_{c}$ with any other $w_{c}$

To help with uniqueness,

- Normalization: We ensure $∥w_{c}∥=1$
- Orthogonality: We enforce $w_{c}^{T}w_{c'}=0$ for all $c≠c'$
- Sequential fitting: We fit each $w_{c}$ in sequence (the first component explains the most variance, then the next, and so on)
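The scaling and label-switching problems are easy to verify numerically: rescaling or permuting the components, with the matching compensation in $Z$, leaves $ZW$ unchanged (toy matrices for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.normal(size=(10, 2))
W = rng.normal(size=(2, 4))
X_hat = Z @ W

# Scaling problem: multiply a component by α, divide its weights by α
alpha = 5.0
W2 = W.copy(); W2[0] *= alpha
Z2 = Z.copy(); Z2[:, 0] /= alpha

# Label switching problem: swap the two components and their weights
W3 = W[::-1]
Z3 = Z[:, ::-1]

same_scaled = np.allclose(Z2 @ W2, X_hat)
same_swapped = np.allclose(Z3 @ W3, X_hat)
```

Both checks come out `True`, which is exactly why extra constraints are needed for uniqueness.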

## Multi-Dimensional Scaling

Gradient descent on the locations of points in a scatterplot; we try to make the scatterplot distances match the high-dimensional distances.

$f(Z)=∑_{i=1}^{n}∑_{j=i+1}^{n}(∥z_{i}−z_{j}∥−∥x_{i}−x_{j}∥)^{2}$

No $W$ matrix needed! However, cannot be done using singular value decomposition (a matrix factoring technique). We need gradient descent.

- Non convex
- Sensitive to initialization
- Unfortunately, MDS often does not work well in practice; MDS tends to “crowd/squash” all the data points together like PCA.
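A minimal gradient-descent sketch of this objective in plain numpy (step size and iteration count are arbitrary choices, and results depend on the random initialization):

```python
import numpy as np

def mds_gradient_descent(X, k=2, steps=500, lr=0.01, seed=0):
    """Minimize sum_{i<j} (||z_i - z_j|| - ||x_i - x_j||)^2 by gradient descent."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # high-dim distances
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, k))                           # sensitive to init
    for _ in range(steps):
        diff = Z[:, None] - Z[None, :]                    # (n, n, k): z_i - z_j
        Dz = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(Dz, 1.0)                         # avoid divide-by-zero
        coef = (Dz - D) / Dz
        np.fill_diagonal(coef, 0.0)
        grad = 2 * (coef[:, :, None] * diff).sum(axis=1)
        Z -= lr * grad
    return Z

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
Z = mds_gradient_descent(X)
```

Note the objective is non-convex, so different seeds can give different embeddings.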

## t-SNE

t-Distributed Stochastic Neighbour Embedding

Using Euclidean distance (the L2-norm) may not represent data well when it lives on a low-dimensional manifold. In these cases, Euclidean distances make sense “locally” but may not make sense “globally”.

t-SNE is actually a special case of Multi-Dimensional Scaling. The key idea is to focus on distances to “neighbours”, allowing the gaps between non-neighbours to grow.
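For example, using scikit-learn’s implementation (assuming scikit-learn is available; the cluster data and parameter values here are arbitrary):

```python
import numpy as np
from sklearn.manifold import TSNE

# Two well-separated clusters in 10 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(30, 10)),
               rng.normal(8, 1, size=(30, 10))])

# Embed into k=2 dimensions for visualization
Z = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(Z.shape)
```

`perplexity` roughly controls how many points count as “neighbours”.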

## Word2Vec

Each word $i$ is represented by a vector of real numbers $z_{i}$

Trained using a masking technique.

- Takes sentence fragments and hides/masks a middle word
- Train so that the $z_{i}$ of the hidden word is similar to the $z_{j}$ of the surrounding words

Training uses gradient descent on the negative log-probability of the hidden word given its context.
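A toy numpy sketch of the masking idea on an integer “corpus” — this is not real word2vec (which uses skip-gram/CBOW with negative sampling over a large vocabulary); it only illustrates pulling the hidden word’s vector toward its neighbours’ vectors and away from a random word’s:

```python
import numpy as np

# Hypothetical corpus of word ids; vocab, dim, window chosen arbitrarily.
corpus = [0, 1, 2, 3, 0, 1, 4, 3, 0, 2, 4, 1] * 20
vocab, dim, window = 5, 6, 1
rng = np.random.default_rng(0)
Z = rng.normal(scale=0.1, size=(vocab, dim))   # one vector z_i per word

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

lr = 0.05
for _ in range(30):
    for t in range(window, len(corpus) - window):
        hidden = corpus[t]                     # the "masked" middle word
        for ctx in (corpus[t - 1], corpus[t + 1]):
            # pull the hidden word toward a surrounding word (positive pair)
            g = 1 - sigmoid(Z[hidden] @ Z[ctx])
            Z[hidden] += lr * g * Z[ctx]
            Z[ctx]    += lr * g * Z[hidden]
            # push it away from a random "negative" word
            neg = rng.integers(vocab)
            g = sigmoid(Z[hidden] @ Z[neg])
            Z[hidden] -= lr * g * Z[neg]
```

After training, rows of `Z` for words that appear in similar contexts end up with higher inner products.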