This is the 14th in a series of class notes as I go through the Georgia Tech/Udacity Machine Learning course. The class textbook is Machine Learning by Tom Mitchell.

Defining the basic Clustering Problem

We want to take a set of objects and put them into groups.

Formally,

Given:
- set of objects X
- inter object distances D(x,y)

Output:
- Partition Pd(x) = Pd(y)
  if x & y in same cluster

D, The distance algorithm, can be any similarity metric you choose depending on your domain knowledge.

Single Linkage Clustering

The simplest clustering algorithm is Single Linkage.

start by considering each object a cluster (so you have n objects and n clusters)
define intercluster distance as the distance between the closest 2 points in the two clusters
merge the two closest clusters
repeat n-k times to arrive at k clusters.

This helps you build a Hierarchical Agglomerative Cluster Structure. By cutting off the tree at k levels from the top you can get a predetermined number of clusters, which is handy.

Instead of the closest 2 points, you can also modify this algorithm to use the furthest 2 points or the average (mean, median) point of each cluster.

SLC is deterministic, and also generates a minimum spanning tree. The running time is O(n^3). However it generates some nonintuitive clusters because of the "closest points" rule. So we need something better.

K-Means Clustering

Try:

Pick k "centers" at random
Each center "claims" all its closest points
recompute centers by averaging the clustered points
repeat until converged (recomputing doesn't have any change)

Centers do not have to be any of the points, its just the "center" of the cluster.

K-Means can be viewed as an optimization problem, in that we are always trying to get better and better answers with each round. To use optimization terminology, we are saying that we can score each set of configurations (the centers), and we try to move from neighborhood to neighborhood trying to improve.

If we assume we know the "true" partitions, P(x), we can sum the squared errors between center of P(x) - x for an Error score to minimize. "Neighborhood" is a bit less intuitive - its the set of possible moves at the end of a round, basically, you can either move the partition, or you can move the centers.

You can prove that K-means Clustering:

always improves or statys the same at each step (when you move partitions, you only move if error goes down, and when you move centers, by recomputing the center you immediately jump to the least error possible given the partitions) aka "monotonically non-increasing in Error"
converges (because you have a finite number of configurations) as long as you have a way to do consistent tie breaking

If you recall the Randomized Optimization chapter, this sounds a lot like hill-climbing, where you pick and move to neighbors based on their score. This also means it can get stuck in local optima (bad clustering based on the random starting point), so the only fix is random restarts.

There's also an edge case where if a point is somehow equidistant between two perfectly converged clusters, it would nondeterministically be either part of one or the other. To deal with that we'll use an algorithm where sharing is allowed...

Soft Clustering (Best!) also known as Expectation Maximization or Gaussian Mixing

The main trick here is to do like we did with MIMIC and assume points come from a probability distribution. Try:

Assume the data was generated by 1 of k possible Gaussians with known variance
Sample Xi from that Gaussian
Repeat n times

The task is to find a hypothesis h=<u1, u2,... uk> of "k means" that maximizes the probability of the data (aka Maximum Likelihood). The Max Likelihood mean of the gaussian is just the mean of the data, so there's nothing really new here. However you can use k Hidden Variables ([0,1] membership in Partition 1 to k) run the algorithm in a similar way to k-Means Clustering. Instead of always improving in the Error metric, you're improving in the probabilistic metric. Running the algorithm is a process of expectation maximization.

Properties of Expectation Maximization:

monotonically non-decreasing likelihood
theoretically might not converge (because Gaussian has an infinite extent, but in practice it does)
cannot diverge
can get stuck
works with any distribution

Practical issue: Picking `k`: Elbows, Silhouettes, and more

Although you can lean on domain knowledge to pick k, there can algorithms to help pick the "Elbow", the "best" point given a tradeoff between number of clusters vs explanatory power.

You can choose any number of "good" metrics to quantify cluster quality: adj. Mutual Information, Homogeneity, Completeness, and V_Measure (see original paper).

Many ways to do this exist, see the Wikipedia article for more.

Silhouette profiles can help visually inspect the appropriateness of clusters.

Practical issue: Visualizing Clusters

Using t-SNE is very helpful:

A better alternative may have emerged in Feb 2018: UMAP, and the benefits are worth considering. Here is the author explaining the idea:

(theoretical aside) Kleinberg Impossibility Theorem

There are three desirable properties of clustering algorithms:

Richness (can describe any clustering as long as we tweak variables)
Scale invariance (works the same regardless of units)
Consistency (clusters the same every time)

SLC is consistent and but not Rich.
You can modify SLC's stopping point to make it richer, but you will lose scale invariance.

So on and so forth. This is a proven theorem, that these three properties are mutually contradictory.

Jon Kleinberg's original paper is called An impossibility theorem for clustering.

Next in our series

Further notes on this topic:

The Expectation Maximization Algorithm
K-Means vs Gaussian Mixture Models - a nice slide deck
Supervised Learning of Gaussian Mixture Models for Visual Vocabulary Generation - a useful application

Hopefully that was a good introduction to Clustering. I am planning more primers and would love your feedback and questions on:

Overview
Supervised Learning
Unsupervised Learning
- Randomized Optimization
- Information Theory
- Clustering
- Feature Selection
- Feature Transformation - week of Mar 11
Reinforcement Learning
- Markov Decision Processes - week of Mar 25
- "True" RL - week of Apr 1
- Game Theory - week of Apr 15

Unsupervised Learning: Clustering