Detect and Treat Multicollinearity in Regression with Python
Multicollinearity occurs when independent variables in a regression model are highly correlated with each other. This can make model coefficients unstable and difficult to interpret, as it becomes unclear which variable is truly driving changes in the dependent variable. Let's explore how to detect and treat multicollinearity using Python.
What is Multicollinearity?
Multicollinearity happens when predictor variables share linear relationships. For example, if you're predicting house prices using both "square footage" and "number of rooms," these variables are likely correlated: larger houses typically have more rooms.
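To see why this is a problem, here is a small simulated illustration (the variable names and numbers are invented for this sketch, not taken from a real dataset). Fitting ordinary least squares on two overlapping subsamples of collinear data can produce very different coefficients for the correlated predictors, even though the overall fit barely changes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 60
sqft = rng.normal(1500, 300, n)              # square footage
rooms = sqft / 250 + rng.normal(0, 0.3, n)   # almost a linear function of sqft
price = 100 * sqft + rng.normal(0, 5000, n)
X = np.column_stack([sqft, rooms])

# Fit the same model on two overlapping halves of the data
coef_a = LinearRegression().fit(X[:40], price[:40]).coef_
coef_b = LinearRegression().fit(X[20:], price[20:]).coef_
print("Subsample A coefficients:", coef_a.round(1))
print("Subsample B coefficients:", coef_b.round(1))
```

Because rooms is nearly a linear function of sqft, the model cannot separate their contributions, so the coefficient attributed to each predictor can swing widely from one subsample to the next.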
Detecting Multicollinearity
Using Correlation Matrix
The correlation matrix shows how strongly variables are related. Values close to 1 or -1 indicate high correlation −
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Create sample data with multicollinearity
np.random.seed(42)
n = 100
x1 = np.random.normal(0, 1, n)
x2 = x1 + np.random.normal(0, 0.1, n) # Highly correlated with x1
x3 = np.random.normal(0, 1, n) # Independent
y = 2*x1 + 3*x3 + np.random.normal(0, 0.1, n)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3, 'y': y})
# Calculate correlation matrix
corr_matrix = df.corr()
print("Correlation Matrix:")
print(corr_matrix.round(3))
Correlation Matrix:
x1 x2 x3 y
x1 1.000 0.995 -0.081 0.864
x2 0.995 1.000 -0.078 0.858
x3 -0.081 -0.078 1.000 0.849
y 0.864 0.858 0.849 1.000
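Since seaborn and matplotlib are already imported, the same matrix can also be visualized as a heatmap, which makes the highly correlated x1/x2 pair easy to spot. A sketch (the output filename is just an example; the data is regenerated so the snippet runs on its own):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# Rebuild the same sample data as above
np.random.seed(42)
n = 100
x1 = np.random.normal(0, 1, n)
x2 = x1 + np.random.normal(0, 0.1, n)
x3 = np.random.normal(0, 1, n)
y = 2*x1 + 3*x3 + np.random.normal(0, 0.1, n)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3, 'y': y})

corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, square=True)
plt.title("Correlation Heatmap")
plt.tight_layout()
plt.savefig("correlation_heatmap.png")  # example filename
```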
Using Variance Inflation Factor (VIF)
VIF measures how much the variance of a regression coefficient is inflated by collinearity with the other predictors. As a rule of thumb, a VIF above 5 indicates moderate multicollinearity, and a VIF above 10 indicates high multicollinearity −
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Prepare independent variables
X = df[['x1', 'x2', 'x3']]
# Calculate VIF for each variable
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print("VIF Results:")
print(vif_data)
VIF Results:
  Variable         VIF
0       x1  199.501247
1       x2  199.501247
2       x3    1.006608
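The formula behind these numbers is VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on all the other predictors. A quick sketch verifying the order of magnitude for x1 (the exact value can differ slightly from statsmodels, which handles the intercept differently):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same sample data as above
np.random.seed(42)
n = 100
x1 = np.random.normal(0, 1, n)
x2 = x1 + np.random.normal(0, 0.1, n)
x3 = np.random.normal(0, 1, n)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

# Regress x1 on the other predictors and compute R^2
others = df[['x2', 'x3']]
r2 = LinearRegression().fit(others, df['x1']).score(others, df['x1'])
vif_x1 = 1 / (1 - r2)
print(f"R^2 of x1 on (x2, x3): {r2:.4f}")
print(f"VIF for x1: {vif_x1:.2f}")
```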
Treating Multicollinearity
Method 1: Remove Highly Correlated Variables
Remove one of the highly correlated variables. Choose based on domain knowledge or statistical significance −
# Remove x2 (highly correlated with x1)
X_reduced = df[['x1', 'x3']]
# Calculate VIF after removal
vif_reduced = pd.DataFrame()
vif_reduced["Variable"] = X_reduced.columns
vif_reduced["VIF"] = [variance_inflation_factor(X_reduced.values, i) for i in range(X_reduced.shape[1])]
print("VIF After Removing x2:")
print(vif_reduced)
VIF After Removing x2:
  Variable       VIF
0       x1  1.006608
1       x3  1.006608
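Removal can also be automated. A common greedy heuristic is to repeatedly drop the predictor with the highest VIF until all VIFs fall below a chosen threshold (5 or 10 are typical cutoffs). A sketch of that loop, assuming statsmodels is available:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X, threshold=10.0):
    """Iteratively drop the column with the highest VIF above threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() <= threshold:
            break
        X = X.drop(columns=vifs.idxmax())  # drop the worst offender
    return X

# Same sample data as above
np.random.seed(42)
n = 100
x1 = np.random.normal(0, 1, n)
x2 = x1 + np.random.normal(0, 0.1, n)
x3 = np.random.normal(0, 1, n)
X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

X_clean = drop_high_vif(X)
print("Remaining columns:", list(X_clean.columns))
```

Note that this is purely statistical: when domain knowledge says which of two correlated variables is more meaningful, prefer that judgment over the heuristic.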
Method 2: Ridge Regression
Ridge regression adds a penalty term that reduces the impact of multicollinearity −
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.metrics import r2_score
# Original data with multicollinearity
X_original = df[['x1', 'x2', 'x3']]
y_target = df['y']
# Standard linear regression
lr = LinearRegression()
lr.fit(X_original, y_target)
# Ridge regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_original, y_target)
print("Linear Regression Coefficients:", lr.coef_.round(3))
print("Ridge Regression Coefficients:", ridge.coef_.round(3))
Linear Regression Coefficients: [ 1.081  0.896  2.999]
Ridge Regression Coefficients: [ 0.983  0.992  2.999]
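The penalty strength alpha was fixed at 1.0 above; in practice it is usually chosen by cross-validation. A sketch using scikit-learn's RidgeCV (the log-spaced alpha grid here is an arbitrary choice):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV

# Same sample data as above
np.random.seed(42)
n = 100
x1 = np.random.normal(0, 1, n)
x2 = x1 + np.random.normal(0, 0.1, n)
x3 = np.random.normal(0, 1, n)
y = 2*x1 + 3*x3 + np.random.normal(0, 0.1, n)
X = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3})

# Search a log-spaced grid of penalty strengths with built-in cross-validation
ridge_cv = RidgeCV(alphas=np.logspace(-3, 3, 13))
ridge_cv.fit(X, y)
print("Best alpha:", ridge_cv.alpha_)
print("Coefficients:", ridge_cv.coef_.round(3))
```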
Method 3: Principal Component Analysis (PCA)
PCA transforms correlated variables into uncorrelated components −
from sklearn.decomposition import PCA
# Apply PCA to the first two correlated variables
pca = PCA(n_components=1)
x1_x2_combined = pca.fit_transform(df[['x1', 'x2']])
# Create new dataset with PCA component
X_pca = pd.DataFrame({
    'x1_x2_component': x1_x2_combined.flatten(),
    'x3': df['x3']
})
# Calculate VIF for PCA-transformed data
vif_pca = pd.DataFrame()
vif_pca["Variable"] = X_pca.columns
vif_pca["VIF"] = [variance_inflation_factor(X_pca.values, i) for i in range(X_pca.shape[1])]
print("VIF After PCA:")
print(vif_pca)
VIF After PCA:
Variable VIF
0 x1_x2_component 1.006608
1 x3 1.006608
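How much information the single component retains can be checked with explained_variance_ratio_; for two near-duplicate variables it is typically close to 1, so little is lost by the reduction. A quick check (data regenerated so the snippet runs on its own):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Same sample data as above (only the correlated pair is needed)
np.random.seed(42)
n = 100
x1 = np.random.normal(0, 1, n)
x2 = x1 + np.random.normal(0, 0.1, n)
df = pd.DataFrame({'x1': x1, 'x2': x2})

pca = PCA(n_components=1)
pca.fit(df[['x1', 'x2']])
ratio = pca.explained_variance_ratio_[0]
print(f"Variance explained by first component: {ratio:.4f}")
```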
Comparison of Methods
| Method | Pros | Cons | Best For |
|---|---|---|---|
| Variable Removal | Simple, interpretable | Loss of information | Clear redundant variables |
| Ridge Regression | Keeps all variables | Biased coefficients | Prediction focus |
| PCA | Retains most variance in fewer features | Components hard to interpret | Many correlated variables |
Conclusion
Multicollinearity can be detected using correlation matrices and VIF calculations. Treatment options include removing variables, using Ridge regression, or applying PCA. Choose the method based on your priorities: interpretability, prediction accuracy, or information retention.
