Detect and Treat Multicollinearity in Regression with Python


Multicollinearity occurs when the independent variables in a regression model exhibit a high degree of interdependence. It can make the model's coefficient estimates unreliable, making it difficult to gauge how each independent variable affects the dependent variable. In such cases, it is necessary to detect and treat the multicollinearity in the regression model, and in this article we cover both, step by step, along with sample programs and their outputs.

Approaches

  • Detecting multicollinearity

  • Treating multicollinearity

Algorithm

Step 1 − Import the necessary libraries

Step 2 − Load the data into a pandas DataFrame

Step 3 − Create a correlation matrix of the predictor variables

Step 4 − Create a heatmap of the correlation matrix to visualize the correlations

Step 5 − Calculate the Variance Inflation Factor (VIF) for each predictor variable

Step 6 − Determine which predictors have high VIF values

Step 7 − Remove or combine those predictors

Step 8 − Re-run the regression model

Step 9 − Check the VIF values again to confirm that the multicollinearity has been reduced

Approach I: Detecting Multicollinearity

Utilise the corr() function from the pandas package to compute the correlation matrix of the independent variables, and use the seaborn library to generate a heatmap that visualises it. Then utilise the variance_inflation_factor() function from the statsmodels package to calculate the Variance Inflation Factor (VIF) for each independent variable. A VIF larger than 5 or 10 indicates high multicollinearity.
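
Before the full example, here is a minimal sketch of the correlation-matrix and heatmap step; the DataFrame, its column names and its values below are made up purely for illustration:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical predictors; x2 is roughly twice x1, so the two are highly correlated
X = pd.DataFrame({
   'x1': [1.0, 2.1, 3.2, 4.1, 5.3],
   'x2': [2.1, 4.0, 6.5, 8.1, 10.4],
   'x3': [5.0, 3.0, 6.0, 2.0, 7.0]
})

# Correlation matrix of the predictors
corr_matrix = X.corr()

# Heatmap of the correlation matrix
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()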

Example-1

In this code, once the data has been loaded into a pandas DataFrame, the predictor variables are selected into X. To calculate the VIF for each predictor variable, we use the variance_inflation_factor() function from the statsmodels package. Finally, the VIF values and the names of the predictor variables are stored in a new pandas DataFrame and the results are displayed. Running this code produces a table containing the variable name and VIF value for each predictor variable. When a variable has a high VIF value (above 5 or 10, depending on the situation), it should be analysed further.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Load data into a pandas DataFrame
data = pd.read_csv("mydata.csv")

# Select independent variables
X = data[['independent_var1', 'independent_var2', 'independent_var3']]

# Calculate VIF for each independent variable
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

# Print the VIF results
print(vif)

Output

   VIF Factor          features
0    3.068988  independent_var1
1    3.870567  independent_var2
2    3.843753  independent_var3

Approach II: Treating Multicollinearity

Remove one or more of the strongly correlated independent variables from the model. Principal Component Analysis (PCA) may be used to combine highly correlated independent variables into a single variable. Regularisation methods such as Ridge or Lasso regression can reduce the influence of strongly correlated independent variables on the model coefficients. Using the methods above, the following sample code may be used to identify and address multicollinearity −

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

# Load the data into a pandas DataFrame
data = pd.read_csv('data.csv')

# Calculate the correlation matrix
corr_matrix = data.corr()

# Create a heatmap to visualize the correlation matrix
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()

# Check the VIF for each independent variable
# (the dependent variable is assumed to be the last column)
for i in range(data.shape[1]-1):
   vif = variance_inflation_factor(data.values, i)
   print('VIF for variable {}: {:.2f}'.format(i, vif))

# Use PCA to combine two highly correlated independent variables into a single component
pca = PCA(n_components=1)
data['pca'] = pca.fit_transform(data[['var1', 'var2']])

# Remove the original highly correlated independent variables
data = data.drop(['var1', 'var2'], axis=1)

# Use Ridge regression to reduce the impact of highly correlated independent variables
X = data.drop('dependent_var', axis=1)
y = data['dependent_var']
ridge = Ridge(alpha=0.1)
ridge.fit(X, y)

Running this code prints the VIF value for each independent variable and displays the correlation heatmap; beyond that it produces no further output, and no model performance metrics are printed.

In this example, the data is first loaded into a pandas DataFrame, the correlation matrix is computed, and a heatmap is created to show the correlation matrix. After checking the VIF of each independent variable, we use PCA to merge two highly correlated independent variables into a single component and then drop the originals from the DataFrame. Finally, we fit a Ridge regression to lessen the influence of any remaining correlated independent variables on the model coefficients.
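
The approach above also mentions Lasso regression as an alternative regularisation method. A minimal sketch, reusing the X and y defined in the code above (alpha=0.1 is an arbitrary example value, not a tuned setting):

from sklearn.linear_model import Lasso

# Lasso adds an L1 penalty that can shrink some coefficients to exactly zero,
# effectively dropping redundant, highly correlated predictors
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print(lasso.coef_)

Inspecting lasso.coef_ shows which coefficients the L1 penalty has shrunk to zero.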

import pandas as pd

#create DataFrame
df = pd.DataFrame({'rating': [90, 85, 82, 18, 14, 90, 16, 75, 87, 86],
         'points': [22, 10, 34, 46, 27, 20, 12, 15, 14, 19],
         'assists': [1, 3, 5, 6, 5, 7, 6, 9, 9, 5],
         'rebounds': [11, 8, 10, 6, 3, 4, 4, 10, 10, 7]})

#view DataFrame
print(df)

Output

   rating  points  assists  rebounds
0      90      22        1        11
1      85      10        3         8
2      82      34        5        10
3      18      46        6         6
4      14      27        5         3
5      90      20        7         4
6      16      12        6         4
7      75      15        9        10
8      87      14        9        10
9      86      19        5         7

This Python programme uses the pandas package to create a tabular data structure known as a DataFrame, with four columns: rating, points, assists, and rebounds. The library is imported on the opening line of code and aliased as "pd" for brevity. The DataFrame itself is constructed with the pd.DataFrame() call in the second statement.

The dictionary passed to pd.DataFrame() uses the column names as keys and lists of values as the corresponding values, so each list defines one column. The final statement prints the DataFrame to the console using the print() function. The information is displayed in table format, with each row representing a player and the rating, points, assists, and rebounds statistics arranged in columns.
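
To connect this example back to Approach I, the same variance_inflation_factor() routine can be applied to this DataFrame. A minimal sketch, reusing the df created above and assuming rating is the response variable while the other three columns are the predictors (a split chosen purely for illustration):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Treat 'rating' as the response; the remaining columns are the predictors (an assumption)
X = df[['points', 'assists', 'rebounds']]

# Compute the VIF for each predictor, exactly as in Approach I
vif = pd.DataFrame()
vif['VIF Factor'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['features'] = X.columns
print(vif)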

Conclusion

In summary, multicollinearity arises when two or more predictor variables in a model are strongly correlated with one another. This can make the model's findings difficult to interpret, because it becomes challenging to ascertain how each individual predictor affects the outcome variable.
