Detect and Treat Multicollinearity in Regression with Python

Multicollinearity occurs when independent variables in a regression model are highly correlated with each other. This can make model coefficients unstable and difficult to interpret, as it becomes unclear which variable is truly driving changes in the dependent variable. Let's explore how to detect and treat multicollinearity using Python.

What is Multicollinearity?

Multicollinearity happens when predictor variables share linear relationships. For example, if you're predicting house prices using both "square footage" and "number of rooms," these variables are likely correlated: larger houses typically have more rooms.

Detecting Multicollinearity

Using Correlation Matrix

The correlation matrix shows how strongly each pair of variables is related. Values close to 1 or -1 indicate high correlation:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Create sample data with multicollinearity
np.random.seed(42)
n = 100
x1 = np.random.normal(0, 1, n)
x2 = x1 + np.random.normal(0, 0.1, n)  # Highly correlated with x1
x3 = np.random.normal(0, 1, n)  # Independent
y = 2*x1 + 3*x3 + np.random.normal(0, 0.1, n)

df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3, 'y': y})

# Calculate correlation matrix
corr_matrix = df.corr()
print("Correlation Matrix:")
print(corr_matrix.round(3))

Output:

Correlation Matrix:
      x1     x2     x3      y
x1  1.000  0.995 -0.081  0.864
x2  0.995  1.000 -0.078  0.858
x3 -0.081 -0.078  1.000  0.849
y   0.864  0.858  0.849  1.000
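Since seaborn and matplotlib are already imported, the correlation matrix can also be visualized as a heatmap, which makes highly correlated pairs easy to spot at a glance. A minimal sketch (the DataFrame is recreated here so the snippet runs on its own; the Agg backend and the filename correlation_heatmap.png are illustrative choices):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripts
import matplotlib.pyplot as plt

# Recreate the same sample data as above
np.random.seed(42)
n = 100
x1 = np.random.normal(0, 1, n)
x2 = x1 + np.random.normal(0, 0.1, n)  # highly correlated with x1
x3 = np.random.normal(0, 1, n)
y = 2*x1 + 3*x3 + np.random.normal(0, 0.1, n)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3, 'y': y})

# Annotated heatmap of the correlation matrix
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.savefig('correlation_heatmap.png')
```

In the resulting plot, the near-1.0 cell between x1 and x2 stands out immediately.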

Using Variance Inflation Factor (VIF)

VIF measures how much the variance of a coefficient is inflated by collinearity. For a given predictor, VIF = 1 / (1 - R²), where R² comes from regressing that predictor on all the others. As a rule of thumb, VIF > 5 indicates moderate multicollinearity and VIF > 10 indicates high multicollinearity:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Prepare independent variables
X = df[['x1', 'x2', 'x3']]

# Calculate VIF for each variable
vif_data = pd.DataFrame()
vif_data["Variable"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print("VIF Results:")
print(vif_data)

Output:

VIF Results:
  Variable         VIF
0       x1  199.501247
1       x2  199.501247
2       x3    1.006608

Treating Multicollinearity

Method 1: Remove Highly Correlated Variables

Remove one of the highly correlated variables, choosing which to drop based on domain knowledge or statistical significance:

# Remove x2 (highly correlated with x1)
X_reduced = df[['x1', 'x3']]

# Calculate VIF after removal
vif_reduced = pd.DataFrame()
vif_reduced["Variable"] = X_reduced.columns
vif_reduced["VIF"] = [variance_inflation_factor(X_reduced.values, i) for i in range(X_reduced.shape[1])]

print("VIF After Removing x2:")
print(vif_reduced)

Output:

VIF After Removing x2:
  Variable       VIF
0       x1  1.006608
1       x3  1.006608
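After dropping x2, it is worth confirming that the reduced model predicts about as well as the full one and that the coefficients now land near the true values used to generate the data (2 for x1, 3 for x3). A quick check, rebuilding the same simulated data so the snippet is self-contained:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same simulated data as above; true model is y = 2*x1 + 3*x3 + noise
np.random.seed(42)
n = 100
x1 = np.random.normal(0, 1, n)
x2 = x1 + np.random.normal(0, 0.1, n)
x3 = np.random.normal(0, 1, n)
y = 2*x1 + 3*x3 + np.random.normal(0, 0.1, n)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3, 'y': y})

full = LinearRegression().fit(df[['x1', 'x2', 'x3']], df['y'])
reduced = LinearRegression().fit(df[['x1', 'x3']], df['y'])

print("R^2 full:   ", round(full.score(df[['x1', 'x2', 'x3']], df['y']), 4))
print("R^2 reduced:", round(reduced.score(df[['x1', 'x3']], df['y']), 4))
print("Reduced coefficients:", reduced.coef_.round(3))
```

Dropping x2 costs almost nothing in fit, and the reduced coefficients sit close to 2 and 3 instead of being split arbitrarily between x1 and x2.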

Method 2: Ridge Regression

Ridge regression adds an L2 penalty on the coefficients, which shrinks and stabilizes them in the presence of multicollinearity:

from sklearn.linear_model import Ridge, LinearRegression
from sklearn.metrics import r2_score

# Original data with multicollinearity
X_original = df[['x1', 'x2', 'x3']]
y_target = df['y']

# Standard linear regression
lr = LinearRegression()
lr.fit(X_original, y_target)

# Ridge regression
ridge = Ridge(alpha=1.0)
ridge.fit(X_original, y_target)

print("Linear Regression Coefficients:", lr.coef_.round(3))
print("Ridge Regression Coefficients:", ridge.coef_.round(3))

Output:

Linear Regression Coefficients: [ 1.081  0.896  2.999]
Ridge Regression Coefficients: [ 0.983  0.992  2.999]
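The alpha=1.0 above is an arbitrary choice; in practice the penalty strength is usually picked by cross-validation. A sketch using scikit-learn's RidgeCV over a log-spaced grid (the grid bounds and 5-fold CV are illustrative choices):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV

# Same simulated data as above
np.random.seed(42)
n = 100
x1 = np.random.normal(0, 1, n)
x2 = x1 + np.random.normal(0, 0.1, n)
x3 = np.random.normal(0, 1, n)
y = 2*x1 + 3*x3 + np.random.normal(0, 0.1, n)
df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3, 'y': y})
X = df[['x1', 'x2', 'x3']]

# Try penalties from 0.001 to 1000 and keep the best by 5-fold CV
alphas = np.logspace(-3, 3, 13)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X, df['y'])

print("Chosen alpha:", ridge_cv.alpha_)
print("Coefficients:", ridge_cv.coef_.round(3))
```

Note that Ridge is sensitive to feature scale, so on real data you would normally standardize the predictors before fitting.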

Method 3: Principal Component Analysis (PCA)

PCA transforms correlated variables into uncorrelated components:

from sklearn.decomposition import PCA

# Apply PCA to the first two correlated variables
pca = PCA(n_components=1)
x1_x2_combined = pca.fit_transform(df[['x1', 'x2']])

# Create new dataset with PCA component
X_pca = pd.DataFrame({
    'x1_x2_component': x1_x2_combined.flatten(),
    'x3': df['x3']
})

# Calculate VIF for PCA-transformed data
vif_pca = pd.DataFrame()
vif_pca["Variable"] = X_pca.columns
vif_pca["VIF"] = [variance_inflation_factor(X_pca.values, i) for i in range(X_pca.shape[1])]

print("VIF After PCA:")
print(vif_pca)

Output:

VIF After PCA:
            Variable       VIF
0  x1_x2_component  1.006608
1               x3  1.006608
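When collapsing x1 and x2 into a single component, it is worth checking how much of their joint variance that component retains. Because x2 differs from x1 only by small noise, nearly all of it should survive:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Same correlated pair as above
np.random.seed(42)
n = 100
x1 = np.random.normal(0, 1, n)
x2 = x1 + np.random.normal(0, 0.1, n)
df = pd.DataFrame({'x1': x1, 'x2': x2})

# One component over the two correlated variables
pca = PCA(n_components=1).fit(df)
print("Explained variance ratio:", pca.explained_variance_ratio_.round(4))
```

If this ratio were much lower, a single component would be discarding real information and keeping more components (or another method) would be the safer choice.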

Comparison of Methods

| Method | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Variable Removal | Simple, interpretable | Loses information | Clearly redundant variables |
| Ridge Regression | Keeps all variables | Biased coefficients | Prediction focus |
| PCA | Retains most variance | Components are hard to interpret | Many correlated variables |

Conclusion

Multicollinearity can be detected using correlation matrices and VIF calculations. Treatment options include removing variables, using Ridge regression, or applying PCA. Choose the method based on your priorities: interpretability, prediction accuracy, or information retention.

Updated on: 2026-03-27T09:35:27+05:30
