k-Means Clustering
Definition and Purpose
k-Means Clustering is an unsupervised machine learning algorithm used to partition a dataset into k
distinct, non-overlapping clusters. Each cluster is defined by its centroid, which is the mean of the data points within that cluster. The main purpose of k-means clustering is to group similar data points together and discover underlying patterns or structures in the data without predefined labels.
Key Objectives:
-
Cluster Formation: Partitioning the data into
k
clusters based on similarity. - Centroid Calculation: Determining the mean position of all the points in a cluster.
- Minimizing Variance: Reducing the variance within each cluster to ensure that points in the same cluster are as similar as possible.
How k-Means Clustering Works
1. Initialization: The algorithm starts by randomly selecting k
initial centroids.
-
Random Initialization: Choose
k
data points at random from the dataset to serve as initial centroids.
2. Assignment Step: Each data point is assigned to the nearest centroid, forming k
clusters.
-
Distance Metric: Commonly uses Euclidean distance to measure the closeness of data points to centroids.
-
Euclidean Distance:
d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
- Measures the straight-line distance between two points
p
andq
in n-dimensional space.
-
Euclidean Distance:
3. Update Step: The centroids are recalculated as the mean of all data points assigned to each cluster.
-
Centroid Update:
- For each cluster, compute the mean of all the data points in that cluster.
- Update the centroid to this mean position.
4. Iteration: Steps 2 and 3 are repeated until convergence, i.e., when the centroids no longer change significantly.
- Convergence: The algorithm stops when the centroids stabilize or the changes are minimal.
Key Concepts
Centroid: The central point of a cluster, calculated as the mean of all points in the cluster.
-
Inertia: A measure of how internally coherent the clusters are. It is the sum of squared distances of samples to their closest centroid.
- Inertia Minimization: The goal of k-means is to minimize inertia, thus creating compact clusters.
Cluster Assignment: The process of assigning data points to the nearest centroid based on a distance metric.
-
Number of Clusters (k): The value of
k
is a critical hyperparameter. It can be chosen based on prior knowledge, domain expertise, or methods like the elbow method or silhouette analysis.-
Elbow Method: Plot inertia for different values of
k
and look for an "elbow point" where the rate of decrease slows. - Silhouette Analysis: Measure how similar a data point is to its own cluster compared to other clusters.
-
Elbow Method: Plot inertia for different values of
Iterative Refinement: The process of repeatedly updating cluster assignments and centroids until convergence.
k-Means Clustering Example
k-Means Clustering is an unsupervised learning algorithm that partitions a dataset into k
distinct clusters. This example demonstrates how to implement k-Means clustering using synthetic data, evaluate the clustering results, and visualize the clusters along with their centroids.
Python Code Example
1. Import Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
This block imports the necessary libraries for data manipulation and plotting.
2. Generate Sample Data with 3 Clusters
np.random.seed(42) # For reproducibility
n_samples = 300
# Cluster 1: Centered at (0, 0)
X1 = np.random.randn(n_samples // 3, 2) * 0.5
# Cluster 2: Centered at (3, 3)
X2 = np.random.randn(n_samples // 3, 2) * 0.5 + [3, 3]
# Cluster 3: Centered at (-2, 3)
X3 = np.random.randn(n_samples // 3, 2) * 0.5 + [-2, 3]
# Combine all clusters
X = np.vstack((X1, X2, X3))
This block generates synthetic data for three clusters located in different regions of the feature space.
3. Perform k-Means Clustering
n_clusters = 3
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
cluster_labels = kmeans.fit_predict(X)
This block initializes the k-Means algorithm with the specified number of clusters and fits it to the dataset, predicting the cluster labels for each data point.
4. Visualize the Clustering Results
plt.figure(figsize=(12, 8))
# Plot the data points
scatter = plt.scatter(X[:, 0], X[:, 1], c=cluster_labels, cmap='viridis', alpha=0.7)
# Plot the cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', marker='x', s=200, linewidths=3)
plt.title(f'K-means Clustering (k={n_clusters})')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(scatter)
# Add text annotations for each cluster
for i, center in enumerate(centers):
plt.annotate(f'Cluster {i}', center, fontsize=12, fontweight='bold',
xytext=(5, 5), textcoords='offset points')
plt.show()
This block visualizes the clustering results, showing the data points colored by their cluster labels and marking the cluster centroids with red 'X' symbols.
Output:
This visualization helps to illustrate how k-Means clustering groups the data points into distinct clusters, providing insights into the structure and distribution of the data.
By following this structured approach, you can effectively implement and visualize k-Means clustering for various datasets, gaining valuable insights into their underlying patterns.