Detect and Treat Multicollinearity in Regression with Python
Multicollinearity occurs when the independent variables in a regression model exhibit a high degree of interdependence. It can make the model's coefficient estimates unreliable, making it difficult to gauge how each independent variable affects the dependent variable. In such cases it is necessary to detect and treat the multicollinearity in the regression model, and in this article we cover how to do so step by step, along with example programs and their outputs.
Approaches
Detecting multicollinearity
Treating multicollinearity
Algorithm
Step 1 − Import the necessary libraries
Step 2 − Load the data into a pandas DataFrame
Step 3 − Create a correlation matrix from the predictor variables
Step 4 − Create a heatmap of the correlation matrix to visualize the correlations
Step 5 − Calculate the Variance Inflation Factor (VIF) for each predictor variable
Step 6 − Identify the predictors with a high VIF
Step 7 − Remove those predictors
Step 8 − Re-run the regression model
Step 9 − Check the VIF values again
Approach I: Detecting Multicollinearity
Utilise the corr() function of the pandas package to determine the correlation matrix of the independent variables. Use the seaborn library to generate a heatmap to display the correlation matrix. Utilise the variance_inflation_factor() function from the statsmodels package to determine the Variance Inflation Factor (VIF) for each independent variable. High multicollinearity is indicated by a VIF larger than 5 or 10.
Example-1
In this code, the predictor variables X and the dependent variable y are separated once the data has been loaded into a Pandas DataFrame. To calculate the VIF for each predictor variable, we use the variance_inflation_factor() function from the statsmodels package. The final step in the process is to display the results after storing the VIF values as well as the names of the predictor variables in a fresh Pandas DataFrame. Using this code, a table containing the variable names and VIF values for each of the predictor variables will be generated. When a variable has high VIF values (higher than 5 or 10, depending on the situation), it is important to further analyze the variable.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Load data into a pandas DataFrame
data = pd.read_csv("mydata.csv")

# Select the independent variables
X = data[['independent_var1', 'independent_var2', 'independent_var3']]

# Calculate the VIF for each independent variable
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns

# Print the VIF results
print(vif)
Output
   VIF Factor          features
0    3.068988  independent_var1
1    3.870567  independent_var2
2    3.843753  independent_var3
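The same screening can also be done with the correlation matrix alone, as described above. The sketch below is self-contained: it builds a small synthetic dataset in place of the hypothetical mydata.csv (the column names var1–var3 are illustrative), then flags every pair of predictors whose absolute correlation exceeds 0.8.

```python
import numpy as np
import pandas as pd

# Synthetic data: var2 is deliberately built from var1, so the
# two columns are almost perfectly correlated
rng = np.random.default_rng(0)
var1 = rng.normal(size=100)
data = pd.DataFrame({
    'var1': var1,
    'var2': var1 * 2 + rng.normal(scale=0.1, size=100),
    'var3': rng.normal(size=100),
})

# Correlation matrix of the predictors
corr = data.corr()

# Flag every pair of predictors whose absolute correlation exceeds 0.8
pairs = [(a, b, corr.loc[a, b])
         for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:]
         if abs(corr.loc[a, b]) > 0.8]
print(pairs)  # the (var1, var2) pair should be flagged
```

A pairwise scan like this only catches correlation between two variables at a time; the VIF is still needed to catch a predictor that is well explained by several others jointly.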
Approach II: Treating Multicollinearity
Remove one or more of the strongly correlated independent variables from the model. Principal component analysis (PCA) may be used to combine highly correlated independent variables into a single variable. Regularisation methods such as Ridge or Lasso regression can reduce the influence of strongly correlated independent variables on the model coefficients. Using the methods above, the following sample code may be used to identify and address multicollinearity −
import pandas as pd
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

# Load the data into a pandas DataFrame
data = pd.read_csv('data.csv')

# Calculate the correlation matrix
corr_matrix = data.corr()

# Create a heatmap to visualize the correlation matrix
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')

# Check the VIF for each independent variable
for i in range(data.shape[1] - 1):
    vif = variance_inflation_factor(data.values, i)
    print('VIF for variable {}: {:.2f}'.format(i, vif))

# Use PCA to combine the highly correlated independent
# variables into a single component ...
pca = PCA(n_components=1)
data['pca'] = pca.fit_transform(data[['var1', 'var2']])

# ... and then remove the original, highly correlated columns
data = data.drop(['var1', 'var2'], axis=1)

# Use Ridge regression to reduce the impact of any remaining
# correlation on the model coefficients
X = data.drop('dependent_var', axis=1)
y = data['dependent_var']
ridge = Ridge(alpha=0.1)
ridge.fit(X, y)
Apart from displaying the correlation heatmap, running this code only prints the VIF value for each independent variable; no model performance metrics are printed.
In this example, the data is first loaded into a pandas DataFrame, the correlation matrix is computed, and a heatmap is created to visualize it. Then, after checking each independent variable's VIF, we deal with the highly correlated independent variables: PCA merges them into a single component and the original columns are removed. Finally, we employ Ridge regression to lessen the influence of correlated predictors on the model coefficients.
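To see why Ridge regression helps under multicollinearity, its shrinkage effect can be reproduced directly from the closed-form solution w = (XᵀX + αI)⁻¹Xᵀy using only NumPy. This is a sketch on synthetic, nearly collinear data; the choice of α = 0.1 and the data itself are illustrative.

```python
import numpy as np

# Synthetic, nearly collinear design matrix: x2 ≈ 2 * x1
rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = 2 * x1 + rng.normal(scale=0.01, size=50)
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.1, size=50)

# Ordinary least squares: w = (X'X)^-1 X'y  (unstable when X'X is
# nearly singular, as it is here)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: w = (X'X + alpha*I)^-1 X'y  (the penalty term stabilizes
# the inverse and shrinks the coefficients)
alpha = 0.1
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)

print("OLS coefficients:  ", w_ols)
print("Ridge coefficients:", w_ridge)
```

The ridge coefficient vector always has an L2 norm no larger than the OLS one, which is exactly the damping of inflated coefficients that the treatment approach above relies on.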
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'rating': [90, 85, 82, 18, 14, 90, 16, 75, 87, 86],
                   'points': [22, 10, 34, 46, 27, 20, 12, 15, 14, 19],
                   'assists': [1, 3, 5, 6, 5, 7, 6, 9, 9, 5],
                   'rebounds': [11, 8, 10, 6, 3, 4, 4, 10, 10, 7]})

# View the DataFrame
print(df)
Output
   rating  points  assists  rebounds
0      90      22        1        11
1      85      10        3         8
2      82      34        5        10
3      18      46        6         6
4      14      27        5         3
5      90      20        7         4
6      16      12        6         4
7      75      15        9        10
8      87      14        9        10
9      86      19        5         7
This Python program uses the pandas package to build a DataFrame, a tabular data structure. It has four columns: rating, points, assists, and rebounds. The first line imports the library under the alias "pd" for brevity, and the second statement constructs the DataFrame via the pd.DataFrame() method, passing a dictionary whose keys are the column names and whose values are lists of column values.
The third statement prints the DataFrame to the console using the print() function. The result is a table in which each row represents a player and the columns hold that player's rating, points, assists, and rebounds.
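To tie this example back to multicollinearity detection, the VIF for each predictor can also be computed without statsmodels: for a set of predictors, VIF_j equals the j-th diagonal entry of the inverse of the predictors' correlation matrix. The sketch below treats rating as the response and the remaining columns as predictors (an assumption for illustration; the article's data does not specify a response variable).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'rating': [90, 85, 82, 18, 14, 90, 16, 75, 87, 86],
                   'points': [22, 10, 34, 46, 27, 20, 12, 15, 14, 19],
                   'assists': [1, 3, 5, 6, 5, 7, 6, 9, 9, 5],
                   'rebounds': [11, 8, 10, 6, 3, 4, 4, 10, 10, 7]})

# Predictors only (treat 'rating' as the response variable)
X = df[['points', 'assists', 'rebounds']]

# VIF_j is the j-th diagonal entry of the inverse of the
# predictors' correlation matrix
corr_inv = np.linalg.inv(X.corr().values)
vif = pd.Series(np.diag(corr_inv), index=X.columns, name='VIF')
print(vif)
```

A VIF is always at least 1; values above the 5-or-10 thresholds mentioned earlier would mark a predictor for removal or combination.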
Conclusion
In summary, when two or more predictor variables in a model have a strong correlation with one another, it is known as multicollinearity. This occurrence can make it difficult to interpret the model findings. In this situation, it becomes challenging to ascertain how each unique predictor affects the result variable.