Assumption that we know how many clusters there are as a prior ($k$ in K-Means). Designed for vector quantization: replacing examples with the mean of their cluster (collapsing a bunch of examples of a class down to a single example)

Can also be seen as a really bad latent-factor model

K-means partitions the space into convex regions, *but* clusters in the data might not be convex

Minimize $∑_{i∈clusters}{∑_{j∈i_{th}cluster}∣∣x_{j}−μ_{i}∣∣_{2}}$

- Pick some $k$
- Assign a cluster to $k$ different points randomly
- Iterate
- Center update → calculate average for each cluster (using euclidian distance)
- Label update → re-assign the data to the closest cluster center
- If no labels changed, finish (model has converged)

Warning: the clustering is initialization dependent and converges to a local minimum. Often requires some amount of random runs to approximate a good solution, pick best one.

Limited to compact/spherical clusters in high-dimensions (which is poor for modeling clusters with the same mean but different distributions)

Advantages

- easy to implement and interpret
- simple to understand
- computationally more efficient than other clustering algorithms

Disadvantages

- need to specify K
- dependent on initialization
- sensitive to scale of features (need to normalize/standardize)

Cost

- Dominated by calculating distance from each $x_{i}$ to each mean $w_{c}$

K-means, unlike the classification and regression models we studied in previous chapters, can get “stuck” in a bad solution. For example, if we were unlucky and initialized K-means with the following labels. To solve this problem when clustering data using K-means, we should randomly re-initialize the labels a few times, run K-means for each initialization, and pick the clustering that has the lowest final total WSSD.