Like change of basis but instead of hand-picking the features, we learn them from data.

Part weights are a change of basis from to some . The canonical example is Principal Component Analysis (PCA)


PCA is parametric and does not provide unique global optimum.

Takes in a matrix and an input and outputs two matrices such that :

  • is a (n,k) matrix. Each row is a set of features/part weights
  • W is a (k,d) matrix. Each row is a part/factor/principle component
    • We can think of as rotating data so that the slope is zero
  • Approximation of one is
    • Each can be thought of as a linear combination of all the factors


  • Assumes is centered (each column of has a mean of zero)

Use cases:

  • Dimension reductionality: Effectively allows us to reduce the dimensionality of X if
    • Actually, it only ever makes sense if
  • Outlier detection: if PCA gives a poor approximation, could be an outlier
  • Data visualization: to visualize high-dimensional objects

We minimize

If we do alternating minimization,

  1. Fix Z and optimize W:
  2. Fix W and optimize Z:

We converge to a local optimum which will be a global optimum if W and Z are randomly initialized (if you don’t pick a saddle point)

For large X, we can also just use SGD and cost per iteration is only

Choose by variance explained

How much of the variance can be explained by the choice of factors?

For a given , we compute the variance of the errors over the variable of each given


Optimal is non-unique:

  1. Scaling problem: Can multiply any by any non-zero
  2. Rotation problem: Can rotate any within the span
  3. Label switching problem: Can switch any with any other

To help with uniqueness,

  1. Normalization: We ensure
  2. Orthogonality: We enforce for all
  3. Sequential fitting, we fit each in sequence

Multi-Dimensional Scaling

Gradient descent on points on a scatter point; try to make scatterplot distances match high-dimensional distances

No matrix needed! However, cannot be done using singular value decomposition (a matrix factoring technique). We need gradient descent.

  • Non convex
  • Sensitive to initialization
  • Unfortunately, MDS often does not work well in practice; MDS tends to “crowd/squash” all the data points together like PCA.


t-Distributed Stochastic Neighbour Embedding

However, using Euclidean (L2-norm) may not be a great representation of data that lives on low-dimensional manifolds. In these cases, Euclidean distances make sense “locally” but Euclidean distances may not make sense “globally”.

t-SNE is actually a special case of Multi-Dimensional Scaling. The key idea is to focus on distance to “neighbours”, allowing gaps between distances to grow


Each word is represented by a vector of real numbers

Trained using a masking technique.

  • Takes sentence fragments and hides/masks a middle word
  • Train so that of hidden word is similar to of surrounding words

Gradient descent on for