Lasso Regression
Lasso regression, or Least Absolute Shrinkage and Selection Operator, is a type of linear regression that adds a penalty term to the loss function to perform both regularization and variable selection. The penalty can shrink some coefficients to exactly zero, effectively selecting a simpler model that retains only the most significant predictors.
The Lasso regression loss function is given by:
Loss = Σ(yi - ŷi)^2 + λ * Σ|wj|
where:
- yi is the actual value,
- ŷi is the predicted value,
- wj represents the coefficients,
- λ (lambda) is the regularization parameter.
In this equation:
- The term Σ(yi - ŷi)^2 is the Ordinary Least Squares (OLS) part, which represents the sum of squared residuals (the differences between observed and predicted values).
- The term λ * Σ|wj| is the L1 penalty term, which penalizes the absolute size of the coefficients.
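To make the two terms concrete, here is a small numeric sketch of the loss computation. The values below are made up purely for illustration (in practice the intercept is typically excluded from the penalty):
import numpy as np

y_actual = np.array([3.0, 5.0, 7.0])   # yi: actual values (illustrative)
y_hat = np.array([2.5, 5.5, 6.0])      # ŷi: predicted values (illustrative)
w = np.array([1.2, -0.7, 0.0])         # wj: coefficients (illustrative)
lam = 0.5                              # λ: regularization parameter

ols_term = np.sum((y_actual - y_hat) ** 2)   # Σ(yi - ŷi)^2 = 1.5
l1_penalty = lam * np.sum(np.abs(w))         # λ * Σ|wj| = 0.95
loss = ols_term + l1_penalty                 # total Lasso loss = 2.45
print(ols_term, l1_penalty, loss)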
Key Concepts
Ordinary Least Squares (OLS):
In standard linear regression, the goal is to minimize the sum of squared residuals. The loss function for OLS is the sum of squared errors.
Adding L1 Penalty:
Lasso regression modifies the OLS loss function by adding an L1 penalty term, which is the sum of the absolute values of the coefficients multiplied by the regularization parameter (lambda). This penalty encourages sparsity in the coefficient estimates.
Regularization Parameter (λ):
The value of lambda controls the strength of the penalty. A larger lambda increases the penalty on the size of the coefficients, leading to more regularization and potentially more coefficients being shrunk to zero. A smaller lambda allows for larger coefficients, approaching the OLS solution. When lambda is zero, lasso regression becomes equivalent to ordinary least squares.
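A minimal sketch of this effect, using scikit-learn's Lasso on synthetic data (note that scikit-learn names the regularization parameter alpha rather than lambda; the dataset and alpha values here are arbitrary choices for illustration):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 20 features, only 5 of which are actually informative
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# Larger alpha (lambda) should drive more coefficients to exactly zero
for alpha in [0.01, 1.0, 10.0, 100.0]:
    coef = Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_
    print(f"alpha={alpha:>6}: {np.sum(coef == 0)} of {len(coef)} coefficients are exactly zero")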
Coefficients in L1 Regularization (Lasso Regression)
- Penalty Term: The L1 penalty term is the sum of the absolute values of the coefficients.
- Equation: Loss = Σ(yi - ŷi)^2 + λ * Σ|wj|
- Effect on Coefficients: L1 regularization can shrink some coefficients to exactly zero, effectively performing variable selection by excluding certain features from the model.
- Usage: It is beneficial when a sparse model is desired, retaining only the most significant features, which enhances interpretability.
- Pattern in Coefficient Plotting: In coefficient plots for L1 regularization, as the regularization parameter increases, some coefficients quickly drop to zero while others remain significant, creating a sparse model (see the sketch after this list).
- As λ Approaches Zero: As lambda approaches zero, the penalty vanishes and the model approaches ordinary least squares (OLS) regression, allowing coefficients to take on larger values.
- As λ Approaches Infinity: As lambda moves towards infinity, all coefficients will be driven to zero, resulting in a model that is overly simplistic and fails to capture the underlying data structure.
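A sketch of such a coefficient plot, using scikit-learn's lasso_path on synthetic data (the dataset parameters below are arbitrary). As alpha grows, the coefficients of uninformative features are driven to zero first:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import lasso_path

# Synthetic data with only a few informative features
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

# lasso_path fits Lasso over a range of alpha values and returns the coefficients
alphas, coefs, _ = lasso_path(X, y)

for coef in coefs:                      # one curve per feature
    plt.plot(np.log10(alphas), coef)
plt.xlabel('log10(alpha)')
plt.ylabel('Coefficient value')
plt.title('Lasso coefficient paths')
plt.show()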
Lasso Regression Example
Lasso regression applies L1 regularization to linear regression, which helps mitigate overfitting by adding a penalty term to the loss function. This example combines polynomial features with Lasso regression to demonstrate how to model complex relationships while encouraging sparsity in the coefficients.
Python Code Example
1. Import Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
This block imports the necessary libraries for data manipulation, plotting, and machine learning.
2. Generate Sample Data
np.random.seed(42) # For reproducibility
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3 * X.ravel() + np.sin(2 * X.ravel()) * 5 + np.random.normal(0, 1, 100)
This block generates sample data representing a relationship with some noise, simulating real-world data variations.
3. Split the Dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
This block splits the dataset into training and testing sets for model evaluation.
4. Create Polynomial Features
degree = 12 # Change this value for different polynomial degrees
poly = PolynomialFeatures(degree=degree)
X_poly_train = poly.fit_transform(X_train)
X_poly_test = poly.transform(X_test)
This block generates polynomial features from the training and testing datasets, allowing the model to capture non-linear relationships.
5. Create and Train the Lasso Regression Model
model = Lasso(alpha=1.0) # Alpha is the regularization strength
model.fit(X_poly_train, y_train)
This block initializes the Lasso regression model and trains it using the polynomial features derived from the training dataset.
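One practical caveat, not part of the original steps: with degree-12 polynomial features on inputs between 0 and 10, the feature columns span many orders of magnitude, and Lasso's coordinate descent solver may converge slowly or emit convergence warnings. A common remedy is to standardize the features first, for example with a pipeline (the alpha value would likely need re-tuning on the scaled features):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize each polynomial feature column before fitting Lasso
scaled_model = make_pipeline(StandardScaler(), Lasso(alpha=1.0, max_iter=10000))
scaled_model.fit(X_poly_train, y_train)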
6. Make Predictions
y_pred = model.predict(X_poly_test)
This block uses the trained model to make predictions on the test set.
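The metrics imported in step 1 (mean_squared_error and r2_score) can be used here to quantify the fit on the test set, for example:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Test MSE: {mse:.3f}")
print(f"Test R^2: {r2:.3f}")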
7. Plot the Results
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', alpha=0.5, label='Data Points')
X_grid = np.linspace(0, 10, 1000).reshape(-1, 1)
y_grid = model.predict(poly.transform(X_grid))
plt.plot(X_grid, y_grid, color='red', linewidth=2, label=f'Fitted Polynomial (Degree {degree})')
plt.title(f'Lasso Regression (Polynomial Degree {degree})')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.grid(True)
plt.show()
This block creates a scatter plot of the actual data points and overlays the fitted polynomial curve predicted by the Lasso regression model, visualizing how well the fit captures the underlying pattern.
Note: When alpha is set to 0, Lasso regression effectively reduces to ordinary least squares (OLS) regression, since no penalty is applied (scikit-learn advises using LinearRegression rather than Lasso with alpha=0 for numerical reasons). For any alpha greater than 0, the L1 penalty can drive some coefficients exactly to zero, promoting sparsity and performing feature selection. This is the key difference from Ridge regression, whose L2 penalty only shrinks coefficients toward zero but does not eliminate them.
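To see the contrast with Ridge directly, one can fit both models on the same data with the same penalty strength and count the coefficients that land exactly at zero. This is a minimal sketch on synthetic data; the dataset and alpha value are arbitrary choices for illustration:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X_demo, y_demo = make_regression(n_samples=200, n_features=15, n_informative=4,
                                 noise=5.0, random_state=1)
lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
print("Lasso coefficients at exactly zero:", int(np.sum(lasso.coef_ == 0)))   # typically several
print("Ridge coefficients at exactly zero:", int(np.sum(ridge.coef_ == 0)))   # typically none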
This structured approach demonstrates how to implement and evaluate Lasso regression with polynomial features. By encouraging sparsity through L1 regularization, Lasso regression effectively models complex relationships in data while selectively retaining the most important features, enhancing both the robustness and interpretability of predictions.