Hierarchical clustering produces a tree of clusterings.
 Each node in the tree splits the data into 2 or more clusters.
 This gives much more information than a single fixed clustering.
 Often the individual data points are the leaves.
Often applied in phylogenetics.
Agglomerative Clustering (Bottom-up)
Requires a “distance” measure between two clusters.
Cluster distance measures

Distance between the closest members of $C_{1}$ and $C_{2}$. Also called single-link clustering: $\min_{a \in C_{1},\, b \in C_{2}} d(a,b)$

Distance between the farthest members of $C_{1}$ and $C_{2}$. Also called complete-link clustering: $\max_{a \in C_{1},\, b \in C_{2}} d(a,b)$

Average distance between members of $C_{1}$ and $C_{2}$. Also called group-average clustering: $\frac{1}{|C_{1}||C_{2}|} \sum_{a \in C_{1}} \sum_{b \in C_{2}} d(a,b)$
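The three cluster distance measures above can be sketched directly in plain Python. This is a minimal illustration (the function names `single_link`, `complete_link`, and `group_average` are my own; clusters are lists of coordinate tuples, and `d` is Euclidean distance):

```python
import math

def single_link(C1, C2):
    """Single-link: distance between the closest members of C1 and C2."""
    return min(math.dist(a, b) for a in C1 for b in C2)

def complete_link(C1, C2):
    """Complete-link: distance between the farthest members of C1 and C2."""
    return max(math.dist(a, b) for a in C1 for b in C2)

def group_average(C1, C2):
    """Group average: mean distance over all |C1|*|C2| cross-cluster pairs."""
    return sum(math.dist(a, b) for a in C1 for b in C2) / (len(C1) * len(C2))
```

For example, with `C1 = [(0,0), (1,0)]` and `C2 = [(3,0), (5,0)]`, single-link gives 2.0, complete-link gives 5.0, and group average gives 3.5.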

Starts with each point in its own cluster.

Each step merges the two “closest” clusters.

Stop when one big cluster contains all points.
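The merge loop above can be sketched as a naive implementation (names are illustrative; here it stops at `k` clusters rather than one, and `cluster_dist` can be any of the cluster distance measures):

```python
import math

def single_link(C1, C2):
    # single-link: distance between the closest members of the two clusters
    return min(math.dist(a, b) for a in C1 for b in C2)

def agglomerate(points, k, cluster_dist):
    """Naive agglomerative clustering: start with each point in its own
    cluster, then repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # scan all O(n^2) cluster pairs for the closest pair
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters
```

For example, `agglomerate([(0,0), (0,1), (5,5), (5,6)], 2, single_link)` merges the two nearby pairs, leaving clusters `[(0,0), (0,1)]` and `[(5,5), (5,6)]`.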
Naive implementation cost is $O(n^{3}d)$: there are $O(n)$ merge steps, each step scans $O(n^{2})$ cluster pairs, and each distance computation costs $O(d)$.
Divisive Clustering (Top-down)
Start with all examples in one cluster, then recursively divide (e.g., run k-means on a cluster, then run it again on each resulting cluster).
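A minimal sketch of this recursive splitting, assuming a basic 2-means (Lloyd's algorithm) as the splitter; the names `two_means`, `divisive`, and the `min_size` stopping rule are illustrative choices, not a fixed recipe:

```python
import math
import random

def two_means(points, iters=20):
    """Split points into 2 clusters with a basic 2-means (Lloyd's algorithm)."""
    random.seed(0)  # fixed seed for a reproducible sketch
    centers = random.sample(points, 2)
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            # assign each point to its nearest center
            i = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            groups[i].append(p)
        # recompute each center as the mean of its group (keep old if empty)
        centers = [tuple(sum(x) / len(g) for x in zip(*g)) if g else c
                   for g, c in zip(groups, centers)]
    return [g for g in groups if g]

def divisive(points, min_size=2):
    """Recursively split clusters until they are small enough."""
    if len(points) <= min_size:
        return [points]
    out = []
    for part in two_means(points):
        if part == points:  # split failed to separate anything; stop here
            return [points]
        out.extend(divisive(part, min_size))
    return out
```

On `[(0,0), (0,1), (5,5), (5,6)]` this recovers the two natural pairs. In practice the stopping rule (cluster size, within-cluster variance, etc.) is a design choice.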
Biclustering
Cluster both the training examples and the features. Helps to figure out why things are clustered together:
 Run clustering method on $X$
 Run clustering method on $X^{T}$
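The two steps above only differ in the matrix handed to the clustering method: the rows of $X$ are the examples, and the rows of $X^{T}$ are the features. A minimal sketch of the transpose (the data in `X` is made up for illustration):

```python
# X: rows are examples, columns are features (toy data)
X = [[1, 0, 9],
     [1, 1, 8],
     [0, 1, 9]]

# To cluster examples, run any clustering method on the rows of X.
# To cluster features, transpose first so each feature becomes a "point",
# then run the same method on the rows of X^T.
XT = [list(col) for col in zip(*X)]
```

Here `XT` is `[[1, 1, 0], [0, 1, 1], [9, 8, 9]]`; its third row (the third feature) would naturally cluster apart from the first two.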
A dendrogram describes the hierarchy of clusters generated by the clustering methods.