
# K-means

Last updated Sep 28, 2022

K-means assumes we know the number of clusters as a prior ($k$ in K-means). It was designed for vector quantization: replacing examples with the mean of their cluster (collapsing many examples of a class down to a single representative)
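The quantization step can be sketched in a few lines of numpy. This is a toy illustration with made-up data and assignments (all names here are assumptions, not from the original):

```python
import numpy as np

# Toy data: two tight groups of points (assumed example).
points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
labels = np.array([0, 0, 1, 1])  # assumed cluster assignments

# Compute each cluster's mean, then replace every example with its mean.
means = np.array([points[labels == c].mean(axis=0) for c in np.unique(labels)])
quantized = means[labels]  # every example collapsed to its cluster mean
```

After quantization only $k$ distinct vectors remain, which is the compression that vector quantization buys.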

It can also be viewed as a very crude latent-factor model

K-means partitions the space into convex regions, but clusters in the data might not be convex

Minimize $\sum_{i=1}^{k} \sum_{j \in \text{cluster } i} \|x_j - \mu_i\|^2$
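This objective is just the total within-cluster sum of squared distances, and is straightforward to compute directly. A minimal sketch (the function name and example data are assumptions):

```python
import numpy as np

def kmeans_objective(X, labels, means):
    # Sum over clusters of squared Euclidean distances to the cluster mean.
    return float(sum(np.sum((X[labels == c] - means[c]) ** 2)
                     for c in range(len(means))))

# Tiny assumed example: two points in one cluster whose mean lies midway.
X = np.array([[0.0, 0.0], [2.0, 0.0]])
kmeans_objective(X, np.array([0, 0]), np.array([[1.0, 0.0]]))  # 2.0
```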

1. Pick some $k$
2. Pick $k$ data points at random as the initial cluster centers
3. Iterate
	1. Label update → assign each example to the closest cluster center (using Euclidean distance)
	2. Center update → recompute each center as the mean of its assigned examples
	3. If no labels changed, finish (the model has converged)
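The steps above (Lloyd's algorithm) can be sketched in numpy. Function name, defaults, and the empty-cluster handling are assumptions, not a definitive implementation:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: k random examples as the initial centers.
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Label update: assign each point to its nearest center.
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no labels changed -> converged
        labels = new_labels
        # Center update: mean of each cluster (keep old center if a cluster is empty).
        for c in range(k):
            if np.any(labels == c):
                means[c] = X[labels == c].mean(axis=0)
    return means, labels
```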

Warning: the clustering depends on the initialization and only converges to a local minimum. In practice, run it several times from random initializations and keep the solution with the lowest objective.
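The random-restart strategy amounts to scoring each run by the objective and keeping the best. A self-contained sketch (helper names and the run count are assumptions):

```python
import numpy as np

def kmeans_once(X, k, rng, iters=50):
    # One run of Lloyd's algorithm from a random initialization (assumed helper).
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = ((X[:, None] - means[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if np.any(labels == c):
                means[c] = X[labels == c].mean(axis=0)
    labels = ((X[:, None] - means[None]) ** 2).sum(-1).argmin(1)
    cost = float(((X - means[labels]) ** 2).sum())  # the K-means objective
    return cost, means, labels

def kmeans_restarts(X, k, n_runs=10, seed=0):
    # Run from several random starts; keep the lowest-objective solution.
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_runs)]
    return min(runs, key=lambda run: run[0])
```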

Limited to compact/spherical clusters in high dimensions, which makes it a poor model for clusters that share a mean but have different distributions

• easy to implement, understand, and interpret
• computationally more efficient than many other clustering algorithms

• Runtime is dominated by computing the distance from each example $x_i$ to each mean $\mu_c$