Logistic Regression, Classification: Supervised Machine Learning

Harsh Mishra - Jul 15 - Dev Community

What is Classification?

Definition and Purpose

Classification is a supervised learning technique used in machine learning and data science to categorize data into predefined classes or labels. It involves training a model to assign input data points to one of several discrete categories based on their features. The main purpose of classification is to accurately predict the class or category of new, unseen data points.

Key Objectives:

  • Prediction: Assigning new data points to one of the predefined classes.
  • Estimation: Determining the probability that a data point belongs to a particular class.
  • Understanding Relationships: Identifying which features are significant in predicting the class of the data points.

Types of Classification

1. Binary Classification

  • Description: Categorizes data into one of two classes.
    • Examples: Spam detection (spam or not spam), disease diagnosis (disease or no disease).
    • Purpose: Distinguishes between two distinct classes.

2. Multiclass Classification

  • Description: Categorizes data into one of three or more classes.
    • Examples: Handwritten digit recognition (digits 0-9), flower species classification (multiple species).
    • Purpose: Handles problems where there are more than two classes to predict.

What are Linear Classifiers?

Linear classifiers are a category of classification algorithms that use a linear decision boundary to separate different classes in the feature space. They make predictions by combining the input features through a linear equation, typically representing the relationship between the features and the target class labels. The main purpose of linear classifiers is to efficiently classify data points by finding a hyperplane that divides the feature space into distinct classes.
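
For intuition, here is a minimal sketch of a linear classifier (the weights w and bias b below are made-up values, not fitted parameters): each point is assigned a class according to which side of the hyperplane w·x + b = 0 it falls on.

import numpy as np

def linear_classifier_predict(X, w, b):
    # Class is decided by the side of the hyperplane w.x + b = 0
    scores = X @ w + b  # linear combination of the input features
    return (scores > 0).astype(int)

w = np.array([1.0, -2.0])  # hypothetical weights
b = 0.5                    # hypothetical bias
X = np.array([[3.0, 1.0], [0.0, 2.0]])
print(linear_classifier_predict(X, w, b))  # [1 0]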

Logistic Regression

Definition and Purpose

Logistic Regression is a statistical method used for binary classification tasks in machine learning and data science. It belongs to the family of linear classifiers and differs from linear regression in that it predicts the probability that an event occurs by fitting the data to a logistic curve.

Key Objectives:

  • Binary Classification: Predicting a binary outcome (e.g., yes/no, true/false).
  • Probability Estimation: Estimating the probability of an event occurring based on input variables.
  • Decision Boundary: Determining a threshold to classify data into different classes.

Logistic Regression Model

1. Logistic Function (Sigmoid Function)

  • Description: The logistic function transforms any real-valued input into a value between 0 and 1, making it suitable for modeling probabilities.
    • Equation: σ(z) = 1 / (1 + e^(-z))
    • Purpose: Maps the input values to probabilities.

2. Logistic Regression Equation

  • Description: The logistic regression model applies the logistic function to a linear combination of input variables.
    • Equation: P(y=1|x) = σ(w0 + w1x1 + w2x2 + ... + wnxn)
    • Purpose: Predicts the probability P(y=1|x) of the binary outcome y=1 given input variables x.
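
A minimal NumPy sketch of these two equations, plus a 0.5 decision threshold (the weights here are illustrative, not fitted):

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real value into (0, 1)
    return 1 / (1 + np.exp(-z))

def predict_proba(X, w, b):
    # P(y=1|x) = sigmoid(w0 + w1*x1 + ... + wn*xn)
    return sigmoid(X @ w + b)

w = np.array([0.8, -0.5])  # illustrative coefficients w1, w2
b = 0.1                    # illustrative intercept w0
X = np.array([[1.0, 2.0], [-1.0, 0.5]])
probs = predict_proba(X, w, b)
print(probs)                       # probabilities in (0, 1)
print((probs >= 0.5).astype(int))  # classes from a 0.5 decision threshold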

Maximum Likelihood Estimation (MLE)

MLE is used to estimate the parameters (coefficients) of the logistic regression model by maximizing the likelihood of observing the data given the model.

Equation: the log-likelihood to be maximized is ℓ(w) = Σ [yi * log(ŷi) + (1 - yi) * log(1 - ŷi)], where ŷi = σ(w·xi) is the predicted probability for the ith data point. MLE finds the parameters w that make the observed labels most probable; in practice this is equivalent to minimizing the log loss described below.

Cost Function and Loss Minimization in Logistic Regression

Cost Function

The cost function in logistic regression measures the difference between predicted probabilities and actual class labels. The goal is to minimize this function to improve the model's predictive accuracy.

Log Loss (Binary Cross-Entropy):
The log loss function is commonly used in logistic regression for binary classification tasks.

Log Loss = -(1/n) * Σ [yi * log(ŷi) + (1 - yi) * log(1 - ŷi)]

where:

  • yi is the actual class label (0 or 1) for the ith data point,
  • ŷi is the predicted probability for the ith data point,
  • n is the number of data points.

The log loss penalizes predictions that are far from the actual class label, encouraging the model to produce accurate probabilities.
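
A short sketch of this formula in NumPy (the label and probability arrays are made-up examples; predictions are clipped slightly away from 0 and 1 to avoid log(0)):

import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    # Binary cross-entropy averaged over n data points
    y_prob = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.7, 0.4])
print(log_loss(y_true, y_prob))  # lower is better; confident wrong predictions are penalized heavily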

Loss Minimization (Optimization)

Loss minimization in logistic regression involves finding the parameter values that minimize the cost function; this process is also known as optimization. The most common method for loss minimization in logistic regression is the Gradient Descent algorithm.

Gradient Descent

Gradient Descent is an iterative optimization algorithm used to minimize the cost function in logistic regression. It adjusts the model parameters in the direction of the steepest descent of the cost function.

Steps of Gradient Descent:

  1. Initialize Parameters: Start with initial values for the model parameters (e.g., coefficients w0, w1, ..., wn).

  2. Calculate Gradient: Compute the gradient of the cost function with respect to each parameter. The gradient is the partial derivative of the cost function.

  3. Update Parameters: Adjust the parameters in the opposite direction of the gradient. The adjustment is controlled by the learning rate (α), which determines the size of the steps taken towards the minimum.

  4. Repeat: Iterate the process until the cost function converges to a minimum value (or a pre-defined number of iterations is reached).

Parameter Update Rule:
For each parameter wj:
wj = wj - α * (∂/∂wj) Log Loss

where:

  • α is the learning rate,
  • (∂/∂wj) Log Loss is the partial derivative of the log loss with respect to wj.

Because the derivative of the sigmoid, ŷi * (1 - ŷi), cancels when the chain rule is applied, the partial derivative of the log loss with respect to wj simplifies to:
(∂/∂wj) Log Loss = (1/n) * Σ [ (ŷi - yi) * xij ]

where:

  • xij is the value of the jth independent variable for the ith data point,
  • ŷi is the predicted probability of the class label for the ith data point.
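
Putting the pieces together, here is a minimal from-scratch sketch of batch gradient descent for logistic regression (the learning rate and iteration count are arbitrary illustrative choices):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logistic_gd(X, y, alpha=0.1, n_iters=1000):
    n, d = X.shape
    w = np.zeros(d)  # step 1: initialize parameters
    b = 0.0
    for _ in range(n_iters):
        y_hat = sigmoid(X @ w + b)      # predicted probabilities
        grad_w = X.T @ (y_hat - y) / n  # step 2: gradient of the log loss
        grad_b = np.mean(y_hat - y)
        w -= alpha * grad_w             # step 3: move against the gradient
        b -= alpha * grad_b
    return w, b                         # step 4: stop after a fixed number of iterations

# Toy data: label is 1 when the sum of the two features is positive
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(fit_logistic_gd(X, y))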

Logistic Regression (Binary Classification) Example

Logistic regression is a technique used for binary classification tasks, modeling the probability that a given input belongs to a particular class. This example demonstrates how to implement logistic regression using synthetic data, evaluate the model's performance, and visualize the decision boundary.

Python Code Example

1. Import Libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

This block imports the necessary libraries for data manipulation, plotting, and machine learning.

2. Generate Sample Data

np.random.seed(42)  # For reproducibility
X = np.random.randn(1000, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

This block generates sample data with two features, where the target variable y is defined based on whether the sum of the features is greater than zero, simulating a binary classification scenario.

3. Split the Dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

This block splits the dataset into training and testing sets for model evaluation.

4. Create and Train the Logistic Regression Model

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

This block initializes the logistic regression model and trains it using the training dataset.

5. Make Predictions

y_pred = model.predict(X_test)

This block uses the trained model to make predictions on the test set.

6. Evaluate the Model

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)

Output:

Accuracy: 0.9950

Confusion Matrix:
[[ 92   0]
 [  1 107]]

Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99        92
           1       1.00      0.99      1.00       108

    accuracy                           0.99       200
   macro avg       0.99      1.00      0.99       200
weighted avg       1.00      0.99      1.00       200

This block calculates and prints the accuracy, confusion matrix, and classification report, providing insights into the model's performance.

7. Visualize the Decision Boundary

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 8))
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.8)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Logistic Regression Decision Boundary")
plt.show()

This block visualizes the decision boundary created by the logistic regression model, illustrating how the model separates the two classes in the feature space.

Output:

[Figure: Logistic Regression Binary Classification - decision boundary plot]

This structured approach demonstrates how to implement and evaluate logistic regression, providing a clear understanding of its capabilities for binary classification tasks. The visualization of the decision boundary aids in interpreting the model's predictions.

Logistic Regression (Multiclass Classification) Example

Logistic regression can also be applied to multiclass classification tasks. This example demonstrates how to implement logistic regression using synthetic data, evaluate the model's performance, and visualize the decision boundary for three classes.

Python Code Example

1. Import Libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

This block imports the necessary libraries for data manipulation, plotting, and machine learning.

2. Generate Sample Data with 3 Classes

np.random.seed(42)  # For reproducibility
n_samples = 999                       # Total number of samples
n_samples_per_class = n_samples // 3  # 333 samples per class

# Class 0: Top-left corner
X0 = np.random.randn(n_samples_per_class, 2) * 0.5 + [-2, 2]

# Class 1: Top-right corner
X1 = np.random.randn(n_samples_per_class, 2) * 0.5 + [2, 2]

# Class 2: Bottom center
X2 = np.random.randn(n_samples_per_class, 2) * 0.5 + [0, -2]

# Combine the data
X = np.vstack([X0, X1, X2])
y = np.hstack([np.zeros(n_samples_per_class), 
               np.ones(n_samples_per_class), 
               np.full(n_samples_per_class, 2)])

# Shuffle the dataset
shuffle_idx = np.random.permutation(n_samples)
X, y = X[shuffle_idx], y[shuffle_idx]

This block generates synthetic data for three classes located in different regions of the feature space.

3. Split the Dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

This block splits the dataset into training and testing sets for model evaluation.

4. Create and Train the Logistic Regression Model

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

This block initializes the logistic regression model and trains it using the training dataset.

5. Make Predictions

y_pred = model.predict(X_test)

This block uses the trained model to make predictions on the test set.

6. Evaluate the Model

accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(class_report)

Output:

Accuracy: 1.0000

Confusion Matrix:
[[54  0  0]
 [ 0 65  0]
 [ 0  0 81]]

Classification Report:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        54
         1.0       1.00      1.00      1.00        65
         2.0       1.00      1.00      1.00        81

    accuracy                           1.00       200
   macro avg       1.00      1.00      1.00       200
weighted avg       1.00      1.00      1.00       200

This block calculates and prints the accuracy, confusion matrix, and classification report, providing insights into the model's performance.

7. Visualize the Decision Boundary

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 8))
plt.contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu')
scatter = plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', edgecolor='black')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Multiclass Logistic Regression Decision Boundary")
plt.colorbar(scatter)
plt.show()

This block visualizes the decision boundaries created by the logistic regression model, illustrating how the model separates the three classes in the feature space.

Output:

[Figure: Logistic Regression Multiclass Classification - decision boundary plot]

This structured approach demonstrates how to implement and evaluate logistic regression for multiclass classification tasks, providing a clear understanding of its capabilities and the effectiveness of visualizing decision boundaries.

Evaluating Logistic Regression Model

Evaluating a logistic regression model involves assessing its performance in predicting binary or multiclass outcomes. Below are key methods for evaluation:

1. Performance Metrics

  • Accuracy: The proportion of correctly classified instances out of the total instances. It provides a general sense of the model's performance.
    • Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
  from sklearn.metrics import accuracy_score

  accuracy = accuracy_score(y_test, y_pred)
  print(f'Accuracy: {accuracy:.4f}')
  • Confusion Matrix: A table that summarizes the performance of the classification model by showing the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
  from sklearn.metrics import confusion_matrix

  conf_matrix = confusion_matrix(y_test, y_pred)
  print("\nConfusion Matrix:")
  print(conf_matrix)
  • Precision: Measures the accuracy of the positive predictions. It is the ratio of true positives to the sum of true and false positives.
    • Formula: Precision = TP / (TP + FP)
  from sklearn.metrics import precision_score

  precision = precision_score(y_test, y_pred, average='weighted')
  print(f'Precision: {precision:.4f}')
  • Recall (Sensitivity): Measures the model's ability to identify all relevant instances (true positives). It is the ratio of true positives to the sum of true positives and false negatives.
    • Formula: Recall = TP / (TP + FN)
  from sklearn.metrics import recall_score

  recall = recall_score(y_test, y_pred, average='weighted')
  print(f'Recall: {recall:.4f}')
  • F1 Score: The harmonic mean of precision and recall, providing a balance between the two metrics. It is useful when the class distribution is imbalanced.
    • Formula: F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
  from sklearn.metrics import f1_score

  f1 = f1_score(y_test, y_pred, average='weighted')
  print(f'F1 Score: {f1:.4f}')

2. Cross-Validation

Cross-validation techniques provide a more reliable evaluation of model performance by assessing it across different subsets of the dataset.

  • K-Fold Cross-Validation: The dataset is divided into k subsets, and the model is trained on k-1 subsets while validating on the remaining subset. This is repeated k times, and the average metric provides a robust evaluation.
  from sklearn.model_selection import KFold, cross_val_score

  kf = KFold(n_splits=5, shuffle=True, random_state=42)
  scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
  print(f'Cross-Validation Accuracy: {np.mean(scores):.4f}')
  • Stratified K-Fold Cross-Validation: Similar to K-Fold but ensures that each fold maintains the class distribution, which is particularly beneficial for imbalanced datasets.
  from sklearn.model_selection import StratifiedKFold

  skf = StratifiedKFold(n_splits=5)
  scores = cross_val_score(model, X, y, cv=skf, scoring='accuracy')
  print(f'Stratified K-Fold Cross-Validation Accuracy: {np.mean(scores):.4f}')

By utilizing these evaluation methods and cross-validation techniques, practitioners can gain insights into the effectiveness of their logistic regression model and its ability to generalize to unseen data.

Regularization in Logistic Regression

Regularization helps mitigate overfitting in logistic regression by adding a penalty term to the loss function, encouraging simpler models. The two primary forms of regularization in logistic regression are L1 regularization (Lasso) and L2 regularization (Ridge).

L2 Regularization (Ridge Logistic Regression)

Concept: L2 regularization adds a penalty equal to the sum of the squared coefficients to the loss function.

Loss Function: The modified loss function for Ridge logistic regression is expressed as:

Loss = -Σ[yi * log(ŷi) + (1 - yi) * log(1 - ŷi)] + λ * Σ(wj^2)

Where:

  • yi is the actual class label.
  • ŷi is the predicted probability of the positive class.
  • wj are the model coefficients.
  • λ is the regularization parameter.

Effects:

  • Ridge regularization shrinks the coefficients toward zero but does not eliminate them. All features remain in the model, which is beneficial for cases with many predictors or multicollinearity.
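
In scikit-learn, L2 is the default penalty, and the C parameter is the inverse of the regularization strength (roughly 1/λ), so a smaller C means stronger shrinkage. A minimal sketch, reusing X_train and y_train from the earlier examples:

from sklearn.linear_model import LogisticRegression

# L2 (ridge) penalty; smaller C = stronger regularization
ridge_clf = LogisticRegression(penalty='l2', C=0.1, random_state=42)
ridge_clf.fit(X_train, y_train)
print(ridge_clf.coef_)  # coefficients shrunk toward zero, but none exactly zero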

L1 Regularization (Lasso Logistic Regression)

Concept: L1 regularization adds a penalty equal to the sum of the absolute values of the coefficients to the loss function.

Loss Function: The modified loss function for Lasso logistic regression can be expressed as:

Loss = -Σ[yi * log(ŷi) + (1 - yi) * log(1 - ŷi)] + λ * Σ|wj|

Where:

  • yi is the actual class label.
  • ŷi is the predicted probability of the positive class.
  • wj are the model coefficients.
  • λ is the regularization parameter.

Effects:

  • Lasso regularization can set some coefficients to exactly zero, effectively performing variable selection. This is advantageous in high-dimensional datasets where interpretability is essential.
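
In scikit-learn, the L1 penalty requires a compatible solver such as 'liblinear' or 'saga'. A minimal sketch, again reusing X_train and y_train:

from sklearn.linear_model import LogisticRegression

# L1 (lasso) penalty; some coefficients may be driven exactly to zero
lasso_clf = LogisticRegression(penalty='l1', C=0.1, solver='liblinear', random_state=42)
lasso_clf.fit(X_train, y_train)
print(lasso_clf.coef_)  # zeroed coefficients indicate dropped features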

By applying regularization techniques in logistic regression, practitioners can enhance model generalization and manage the bias-variance tradeoff effectively.
