How To Calculate Studentized Residuals In Python?


Studentized residuals are typically used in regression analysis to identify potential outliers in the data. An outlier is a point that differs markedly from the overall trend of the data, and it can have a strong influence on the fitted model. By identifying and analyzing outliers, you can better understand the underlying patterns in your data and improve the accuracy of your model. In this post, we look closely at studentized residuals and how to compute them in Python.

What are Studentized Residuals?

A studentized residual is a residual that has been divided by an estimate of its standard deviation. In regression analysis, a residual is the difference between an observed value of the response variable and the value predicted by the model. Raw residuals do not all have the same variance, so scaling each one by its own estimated standard deviation puts them on a common scale. This makes unusually large residuals, and therefore probable outliers that can significantly affect the fitted model, easier to spot.

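The following formula is typically used to calculate studentized residuals −

studentized residual = residual / (s * sqrt(1 - h_ii))

where "residual" is the difference between the observed and predicted response values, s is an estimate of the standard deviation of the residuals, and h_ii is the leverage of the i-th data point (the i-th diagonal element of the hat matrix). When s is estimated from all observations, the result is called an internally studentized residual. When s is re-estimated with the i-th observation left out, the result is an externally studentized residual, which is what the statsmodels outlier_test() method used later in this post reports.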
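To make the formula concrete, here is a minimal sketch that evaluates it by hand with NumPy, using the same small rating/points dataset that appears later in this post. The variable names (hat, sigma, sigma_loo, r_internal, r_external) are illustrative choices, not part of any library. The sketch computes both variants: one where the standard deviation is estimated from all observations, and one where it is re-estimated with the observation in question left out.

```python
import numpy as np

# Dataset used in this article
points = np.array([22, 25, 17, 19, 26, 24, 9, 19, 11, 16], dtype=float)
rating = np.array([95, 82, 92, 90, 97, 85, 80, 70, 82, 83], dtype=float)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(points), points])
n, p = X.shape

# Ordinary least squares fit and raw residuals
beta, *_ = np.linalg.lstsq(X, rating, rcond=None)
resid = rating - X @ beta

# Leverage h_ii: diagonal of the hat matrix X (X'X)^-1 X'
hat = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

# Internally studentized: s estimated from all n observations
sse = resid @ resid
sigma = np.sqrt(sse / (n - p))
r_internal = resid / (sigma * np.sqrt(1.0 - hat))

# Externally studentized: s re-estimated with observation i left out
sigma_loo = np.sqrt((sse - resid**2 / (1.0 - hat)) / (n - p - 1))
r_external = resid / (sigma_loo * np.sqrt(1.0 - hat))

print(np.round(r_internal, 6))
print(np.round(r_external, 6))
```

The externally studentized values should agree with the student_resid column that statsmodels prints in the Output section of this post, which is a convenient way to check the hand calculation.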

Calculating Studentized Residuals in Python

The statsmodels package can be used to compute studentized residuals in Python. As an illustration, consider the following −

Syntax

OLSResults.outlier_test()

Here, OLSResults is the results object returned by calling fit() on a linear model created with statsmodels' ols() function.

df = pd.DataFrame({'rating': [95, 82, 92, 90, 97, 85, 80, 70, 82, 83],
                   'points': [22, 25, 17, 19, 26, 24, 9, 19, 11, 16]})
model = ols('rating ~ points', data=df).fit()
stud_res = model.outlier_test()

Here, 'rating' is the response variable and 'points' is the predictor variable of the simple linear regression.

Algorithm

  • Import numpy, pandas, and the statsmodels APIs.

  • Create a dataset.

  • Fit a simple linear regression model to the dataset.

  • Calculate the studentized residuals.

  • Print the studentized residuals.

Example

The following example imports the required packages and creates the dataset −

#import necessary packages and functions
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

#create dataset
df = pd.DataFrame({'rating': [95, 82, 92, 90, 97, 85, 80, 70, 82, 83],
                   'points': [22, 25, 17, 19, 26, 24, 9, 19, 11, 16]})

Next, create a linear regression model using the ols() function from the statsmodels formula API −

#fit simple linear regression model
model = ols('rating ~ points', data=df).fit()

Using the outlier_test() method, the studentized residuals for each observation in the dataset can be generated in a DataFrame −

#calculate studentized residuals
stud_res = model.outlier_test()

#display studentized residuals
print(stud_res)

Output

   student_resid   unadj_p   bonf(p)
0       1.048218  0.329376  1.000000
1      -1.018535  0.342328  1.000000
2       0.994962  0.352896  1.000000
3       0.548454  0.600426  1.000000
4       1.125756  0.297380  1.000000
5      -0.465472  0.655728  1.000000
6      -0.029670  0.977158  1.000000
7      -2.940743  0.021690  0.216903
8       0.100759  0.922567  1.000000
9      -0.134123  0.897080  1.000000
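One way to act on this table is to filter it for observations that stand out. The sketch below is only an illustration: the cutoff of 2 for the absolute studentized residual and the 0.05 significance level are common rules of thumb chosen here as assumptions, not values prescribed by statsmodels, and the names cutoff, alpha, large, and significant are hypothetical.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Same dataset and model as in the example above
df = pd.DataFrame({'rating': [95, 82, 92, 90, 97, 85, 80, 70, 82, 83],
                   'points': [22, 25, 17, 19, 26, 24, 9, 19, 11, 16]})
model = ols('rating ~ points', data=df).fit()
stud_res = model.outlier_test()

cutoff = 2.0   # assumed rule-of-thumb threshold on |studentized residual|
alpha = 0.05   # assumed significance level for the Bonferroni-adjusted p-value

# Observations whose studentized residual is large in absolute value
large = stud_res[stud_res['student_resid'].abs() > cutoff]

# Observations that remain significant after the Bonferroni correction
significant = stud_res[stud_res['bonf(p)'] < alpha]

print(large)
print(significant)
```

With the output shown above, only observation 7 exceeds the cutoff of 2 in absolute value, and its Bonferroni-adjusted p-value (0.216903) is still above 0.05, so it is worth a closer look but is not flagged as a significant outlier by the stricter test.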

We can also quickly plot the predictor variable values against the studentized residuals −

Syntax

x = df['points']
y = stud_res['student_resid']

plt.scatter(x, y)
plt.axhline(y=0, color='black', linestyle='--')
plt.xlabel('Points')
plt.ylabel('Studentized Residuals')

Here, the matplotlib library is used to draw the scatterplot, and axhline() adds a horizontal reference line at y = 0 with color='black' and linestyle='--'.

Algorithm

  • Importing matplotlib’s pyplot library

  • Defining predictor variable values

  • Defining studentized residuals

  • Creating scatterplot of predictor variable vs. studentized residuals

Example

import matplotlib.pyplot as plt

#define predictor variable values and studentized residuals
x = df['points']
y = stud_res['student_resid']

#create scatterplot of predictor variable vs. studentized residuals
plt.scatter(x, y)
plt.axhline(y=0, color='black', linestyle='--')
plt.xlabel('Points')
plt.ylabel('Studentized Residuals')

#display the plot
plt.show()

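Output

[Scatterplot: points on the x-axis, studentized residuals on the y-axis, with a dashed black horizontal line at y = 0.]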

Conclusion

Studentized residuals are useful in several ways. First, they help identify and evaluate possible outliers: by examining them, you can find points that deviate considerably from the overall trend of the data and explore why. Second, they help in assessing influential observations, which are points that have a substantial effect on the fitted model. Third, they take leverage into account. Leverage measures how far an observation's predictor values lie from those of the other observations, and therefore how much potential it has to affect the fit; because the leverage h_ii enters the formula directly, a residual at a high-leverage point is scaled accordingly. Overall, using studentized residuals can help you analyze and improve a regression model.
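Leverage and influence can also be inspected directly alongside the studentized residuals. The following is a minimal sketch, assuming the same dataset and model as above, that uses the get_influence() method of the fitted results object; the DataFrame name diagnostics and its column labels are illustrative choices.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Same dataset and model as in the examples above
df = pd.DataFrame({'rating': [95, 82, 92, 90, 97, 85, 80, 70, 82, 83],
                   'points': [22, 25, 17, 19, 26, 24, 9, 19, 11, 16]})
model = ols('rating ~ points', data=df).fit()

# Influence diagnostics for the fitted model
influence = model.get_influence()

diagnostics = pd.DataFrame({
    'leverage': influence.hat_matrix_diag,                      # h_ii
    'cooks_d': influence.cooks_distance[0],                     # Cook's distance
    'student_resid': influence.resid_studentized_external,      # as in outlier_test()
}, index=df.index)

print(diagnostics.round(4))
```

Reading the three columns together helps separate the cases: a point can have high leverage without a large residual, or a large residual without much leverage, and Cook's distance combines both into a single measure of influence.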

Updated on: 28-Dec-2022
