k-Nearest Neighbors Classification
Definition and Purpose
k-Nearest Neighbors (k-NN) classification is a non-parametric, instance-based learning algorithm used in machine learning to classify data points based on the classes of their nearest neighbors in the feature space. It assigns a class to a data point by considering the classes of its k
closest neighbors. The main purpose of k-NN classification is to predict the class of new data points by leveraging the similarity to existing labeled data.
Key Objectives:
- Classification: Assigning new data points to one of the predefined classes based on the majority vote or weighted vote of the nearest neighbors.
- Estimation: Determining the likelihood of a data point belonging to a particular class.
- Understanding Relationships: Identifying which data points are similar in the feature space.
How k-NN Classification Works
1. Distance Metric: The algorithm uses a distance metric (commonly Euclidean distance) to determine the "closeness" of data points.
-
Euclidean Distance:
d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
- Measures the straight-line distance between two points
p
andq
in n-dimensional space.
2. Choosing k: The parameter k
specifies the number of nearest neighbors to consider for making the classification decision.
- Small k: Can lead to overfitting, where the model is too sensitive to the training data.
- Large k: Can lead to underfitting, where the model is too generalized and may miss finer patterns in the data.
3. Majority Voting: The predicted class for a new data point is the class that is most common among its k
nearest neighbors.
-
Majority Vote:
- Count the number of occurrences of each class among the
k
neighbors. - Assign the class with the highest count to the new data point.
- Count the number of occurrences of each class among the
4. Weighted Voting: In some cases, neighbors are weighted according to their distance, with closer neighbors having more influence on the classification.
-
Weighted Vote:
- Weigh each neighbor's vote by the inverse of its distance.
- Sum the weighted votes for each class.
- Assign the class with the highest weighted sum to the new data point.
Key Concepts
Non-Parametric: k-NN is a non-parametric method, meaning it makes no assumptions about the underlying distribution of the data. This makes it flexible in handling various types of data.
Instance-Based Learning: The algorithm stores the entire training dataset and makes predictions based on the local patterns in the data. It is also known as a "lazy" learning algorithm because it delays processing until a query is made.
Distance Calculation: The choice of distance metric can significantly affect the model's performance. Common metrics include Euclidean, Manhattan, and Minkowski distances.
Choice of k: The value of
k
is a critical hyperparameter. Cross-validation is often used to determine the optimal value ofk
for a given dataset.
k-Nearest Neighbors (k-NN) Classification Example
k-Nearest Neighbors (k-NN) classification is a non-parametric, instance-based learning algorithm used to classify data points based on the classes of their nearest neighbors. This example demonstrates how to implement k-NN for multiclass classification using synthetic data, evaluate the model's performance, and visualize the decision boundary for three classes.
Python Code Example
1. Import Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
This block imports the necessary libraries for data manipulation, plotting, and machine learning.
2. Generate Sample Data with 3 Classes
np.random.seed(42) # For reproducibility
n_samples = 300
# Class 0: Cluster at the top-left corner
X0 = np.random.randn(n_samples // 3, 2) * 0.5 + [-2, 2]
# Class 1: Cluster at the top-right corner
X1 = np.random.randn(n_samples // 3, 2) * 0.5 + [2, 2]
# Class 2: Cluster at the bottom-center
X2 = np.random.randn(n_samples // 3, 2) * 0.5 + [0, -2]
# Combine all classes
X = np.vstack((X0, X1, X2))
y = np.array([0] * (n_samples // 3) + [1] * (n_samples // 3) + [2] * (n_samples // 3))
This block generates synthetic data for three classes located in different regions of the feature space.
3. Split the Dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
This block splits the dataset into training and testing sets for model evaluation.
4. Create and Train the k-NN Classifier
k = 5 # Number of neighbors
knn_classifier = KNeighborsClassifier(n_neighbors=k)
knn_classifier.fit(X_train, y_train)
This block initializes the k-NN classifier with the specified number of neighbors and trains it using the training dataset.
5. Make Predictions
y_pred = knn_classifier.predict(X_test)
This block uses the trained model to make predictions on the test set.
6. Evaluate the Model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Output:
Accuracy: 1.00
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 22
1 1.00 1.00 1.00 16
2 1.00 1.00 1.00 22
accuracy 1.00 60
macro avg 1.00 1.00 1.00 60
weighted avg 1.00 1.00 1.00 60
This block calculates and prints the accuracy and classification report, providing insights into the model's performance.
7. Visualize the Decision Boundary
h = 0.02 # Step size in the mesh
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = knn_classifier.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(figsize=(12, 8))
plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolors='black')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title(f'k-NN Classification (k={k})')
plt.colorbar()
plt.show()
This block visualizes the decision boundaries created by the k-NN classifier, illustrating how the model separates the three classes in the feature space.
Output:
This structured approach demonstrates how to implement and evaluate k-NN for multiclass classification tasks, providing a clear understanding of its capabilities and the effectiveness of visualizing decision boundaries.