How to Perform Grubbs Test in Python


Introduction

The Grubbs test is a statistical hypothesis test used to detect outliers in a dataset. Outliers, also known as anomalies, are observations that distort the data distribution. They can skew summary statistics such as the mean and standard deviation and degrade the fit of a model, so it is important to deal with them before Machine Learning modeling. Before handling outliers, we must detect and locate them in the dataset. Popular outlier detection techniques include the Q-Q plot, the interquartile range, and the Grubbs statistical test. This article discusses only the Grubbs test. You will learn what a Grubbs test is and how to implement it in Python.

What are Outliers?

Outliers are observations that lie numerically far from the other values in the data. In normally distributed data, about 68% of records fall within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three, so values far outside this range are unusual. A common rule of thumb based on quartiles flags records that lie more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile as outliers or anomalies.
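As a rough illustration of the quartile-based rule of thumb, the sketch below computes the 1.5 x IQR fences with NumPy. The variable names (lower_fence, upper_fence, iqr_outliers) are chosen here for illustration, and the array holds the same values used in the examples later in this article.

```python
import numpy as np

# Illustrative data (same values as the later examples in this article)
data = np.array([5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40])

# First and third quartiles (NumPy's default linear interpolation)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences of the 1.5 * IQR rule of thumb
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers
iqr_outliers = data[(data < lower_fence) | (data > upper_fence)]
print("Fences:", lower_fence, upper_fence)
print("Flagged values:", iqr_outliers)
```

This rule is only a screening heuristic. Unlike the Grubbs test discussed next, it does not come with a significance level.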

Grubbs Statistical Hypothesis Test

Like any other statistical hypothesis test, the Grubbs test evaluates a null hypothesis (H0) against an alternative hypothesis (H1) and either rejects or fails to reject the null hypothesis. It is used to detect a single outlier in a dataset.

We can perform the Grubbs test in two ways, as a one-sided test or as a two-sided test, on a univariate dataset or sample that is approximately normally distributed and has at least seven observations. This test is also known as the extreme studentized deviate test or the maximum normalized residual test.

The Grubbs test uses the following hypotheses:

Null (H0): The dataset has no outliers.
Alternate (H1): The dataset has exactly one outlier.

Grubbs Test in Python

Python's large library ecosystem provides ready-made functions for many statistical tests, and the Grubbs test is no exception. We will explore two ways to implement the Grubbs test in Python: using a built-in function from a library, and implementing the formula from scratch.

The outlier_utils Library and smirnov_grubbs

Let's first install the outlier_utils library using the following command.

!pip install outlier_utils

Now let's make a dataset with outliers and perform the Grubbs test.

Two-Sided Grubbs Test

Syntax

grubbs.test(data, alpha=.05)

Parameters

data: Numeric vector of data values.

alpha: Significance level for the test.

Explanation

In this approach, we call the smirnov_grubbs.test() function from the outliers package, passing the data and the significance level as inputs, to run the Grubbs test.

Example

import numpy as np
from outliers import smirnov_grubbs as grubbs
 
#define data
data = np.array([ 5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40])
 
#perform Grubbs' test
grubbs.test(data, alpha=.05)

Output

array([ 5, 14, 15, 15, 14, 19, 17, 16, 20, 22,  8, 21, 28, 11,  9, 29])

The above code loads the libraries, defines the data, and then performs the Grubbs test on the data using the "test" method. This test checks both ends of the data, so it can flag either an unusually small or an unusually large value. Here the data has a single outlier, 40, and the returned array omits it.

One-Sided Grubbs Test

Syntax

grubbs.max_test(data, alpha=.05)

Explanation

To run the one-sided Grubbs test, call grubbs.min_test() to check whether the minimum value of the supplied dataset is an outlier, or grubbs.max_test() to check whether the maximum value is an outlier. Each function returns the data with any detected outlier removed.

Example

import numpy as np
from outliers import smirnov_grubbs as grubbs
 
#define data
data = np.array([5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40])

#perform Grubbs' test to check whether the minimum value is an outlier
print(grubbs.min_test(data, alpha=.05))

#perform Grubbs' test to check whether the maximum value is an outlier
grubbs.max_test(data, alpha=.05)

Output

[ 5 14 15 15 14 19 17 16 20 22  8 21 28 11  9 29 40]
array([ 5, 14, 15, 15, 14, 19, 17, 16, 20, 22,  8, 21, 28, 11,  9, 29])

The one-sided Grubbs test checks only one end of the data: either the minimum or the maximum. Here min_test returns the data unchanged because the smallest value, 5, is not flagged as an outlier, while max_test removes the largest value, 40.
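To make the one-sided test more concrete, here is a minimal from-scratch sketch using NumPy and SciPy. It is not the outlier_utils implementation. The function name one_sided_grubbs and its side parameter are hypothetical, and the sketch assumes the textbook form of the test: the sample standard deviation (ddof=1) and a t critical value at level alpha / n with n - 2 degrees of freedom.

```python
import numpy as np
from scipy import stats

def one_sided_grubbs(values, side="max", alpha=0.05):
    """Hypothetical helper: one-sided Grubbs test for the min or max value."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    mean = values.mean()
    sd = values.std(ddof=1)  # sample standard deviation (assumption of this sketch)

    # Distance of the tested extreme from the mean, in standard deviations
    if side == "max":
        g = (values.max() - mean) / sd
    else:
        g = (mean - values.min()) / sd

    # One-sided critical value: t quantile at alpha / n with n - 2 degrees of freedom
    t_crit = stats.t.ppf(1 - alpha / n, n - 2)
    g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return g, g_crit, bool(g > g_crit)

data = [5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40]
print("max:", one_sided_grubbs(data, side="max"))
print("min:", one_sided_grubbs(data, side="min"))
```

With these assumptions, the statistic for the maximum (about 2.57) exceeds its critical value (about 2.48), while the statistic for the minimum (about 1.49) does not, which agrees with the library output above.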

Formula Implementation

Here we will implement the Grubbs test formula in Python ourselves. We will use the NumPy and SciPy libraries for the implementation.

Syntax

g_calculated = numerator/sd_x
g_critical = ((n - 1) * np.sqrt(np.square(t_value_1))) / (np.sqrt(n) * np.sqrt(n - 2 + np.square(t_value_1)))
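In mathematical notation, the two expressions above correspond to the usual two-sided Grubbs statistic and its critical value. The symbols are introduced here for illustration: s is the standard deviation of the sample, n the sample size, alpha the significance level, and t the upper critical value of the t distribution with n - 2 degrees of freedom. Note that np.sqrt(np.square(t)) is simply |t|.

```latex
G = \frac{\max_{i} \lvert x_i - \bar{x} \rvert}{s},
\qquad
G_{\text{critical}} = \frac{n - 1}{\sqrt{n}}
  \sqrt{\frac{t^{2}}{n - 2 + t^{2}}},
\qquad
t = t_{\alpha/(2n),\; n-2}
```

The null hypothesis of no outliers is rejected when G exceeds G_critical.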

Algorithm

The steps of implementation are as follows:

Calculate the mean of the dataset values.
Calculate the standard deviation of the dataset values.
Calculate the numerator as the largest absolute difference between a value in the dataset and the mean.
Divide the numerator by the standard deviation to get the calculated score.
Calculate the critical score for the same sample size and significance level.
If the critical value is greater than the calculated value, there is no evidence of an outlier in the dataset; otherwise, the most extreme value is flagged as an outlier.

Example

import numpy as np
import scipy.stats as stats

## define data
x = np.array([12, 13, 14, 19, 21, 23])
y = np.array([12, 13, 14, 19, 21, 23, 45])

## implement the two-sided Grubbs test
def grubbs_test(x, alpha=0.05):
    n = len(x)
    mean_x = np.mean(x)
    # Grubbs' test is defined with the sample standard deviation (ddof=1)
    sd_x = np.std(x, ddof=1)
    # largest absolute deviation from the mean
    numerator = max(abs(x - mean_x))
    g_calculated = numerator / sd_x
    print(f"Grubbs Calculated Value: {g_calculated:.4f}")
    # upper critical value of the t distribution with n - 2 degrees of freedom
    t_value_1 = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_critical = ((n - 1) * np.sqrt(np.square(t_value_1))) / (np.sqrt(n) * np.sqrt(n - 2 + np.square(t_value_1)))
    print(f"Grubbs Critical Value: {g_critical:.4f}")
    if g_critical > g_calculated:
        print("The calculated value is less than the critical value. Fail to reject the null hypothesis and conclude that there are no outliers\n")
    else:
        print("The calculated value exceeds the critical value. Reject the null hypothesis and conclude that there is an outlier\n")

grubbs_test(x)
grubbs_test(y)

Output

Grubbs Calculated Value: 1.3031
Grubbs Critical Value: 1.8871
The calculated value is less than the critical value. Fail to reject the null hypothesis and conclude that there are no outliers

Grubbs Calculated Value: 2.1076
Grubbs Critical Value: 2.0200
The calculated value exceeds the critical value. Reject the null hypothesis and conclude that there is an outlier

The results of the Grubbs test indicate that array x does not have any outliers but array y has one outlier.
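The comparison above only reports whether the single most extreme value is an outlier. A common extension is to remove the flagged value and repeat the test until nothing more is flagged. The sketch below is one possible way to do that, not a reference implementation: the function name remove_outliers_grubbs is hypothetical, and it assumes the two-sided test with the sample standard deviation (ddof=1). Keep in mind that the Grubbs test is designed for a single outlier, so repeated application is a heuristic.

```python
import numpy as np
from scipy import stats

def remove_outliers_grubbs(values, alpha=0.05):
    """Hypothetical helper: repeat the two-sided Grubbs test, dropping one value per pass."""
    values = np.asarray(values, dtype=float)
    while len(values) > 2:  # the t distribution needs n - 2 >= 1 degrees of freedom
        n = len(values)
        sd = values.std(ddof=1)  # sample standard deviation (assumption of this sketch)
        if sd == 0:
            break  # all values identical, nothing to test

        # Most extreme value and its Grubbs statistic
        deviations = np.abs(values - values.mean())
        idx = deviations.argmax()
        g = deviations[idx] / sd

        # Two-sided critical value at level alpha
        t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
        g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))

        if g <= g_crit:
            break  # most extreme value is not significant, so stop
        values = np.delete(values, idx)
    return values

cleaned = remove_outliers_grubbs([12, 13, 14, 19, 21, 23, 45])
print(cleaned)
```

For the array y from the example, the first pass drops 45 and the second pass finds nothing further to remove, so the remaining values are the same as in array x.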

Conclusion

In this article, we learned about outliers and the Grubbs test in Python. Let's summarize this article with a few takeaways.

Outliers are records that lie far outside the range of the rest of the data.
Outliers do not fit the normal distribution followed by the rest of the dataset.
We can detect outliers using the Grubbs statistical hypothesis test.
We can perform the Grubbs test using built-in methods available in the outlier_utils library.
The two-sided Grubbs test checks for an outlier at both the minimum and maximum ends of the data.
The one-sided Grubbs test checks only one end of the data, either the minimum or the maximum.

Gourav Bais

Updated on: 2023-04-28T15:52:21+05:30
