How to Perform Grubbs Test in Python
Introduction
The Grubbs test is a statistical hypothesis test for detecting outliers in a dataset. Outliers, also known as anomalies, are observations that distort the data distribution. A model trained on data containing outliers can fit those extreme values and generalize poorly, so it is necessary to deal with outliers before machine learning modeling. Before handling them, we must detect and locate them in the dataset. Popular outlier detection techniques include the Q-Q plot, the interquartile range, and the Grubbs statistical test. This article discusses only the Grubbs test: what it is and how to implement it in Python.
What are Outliers?
Outliers are observations that lie numerically far from the other values in the data. They fall outside the range expected for normally distributed data. In a normal distribution, about 68% of the records lie within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations. A common rule of thumb based on quartiles treats records that lie more than 1.5 times the interquartile range below the first quartile or above the third quartile as outliers or anomalies.
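A quartile-based check can be sketched in a few lines. The sketch below assumes the common 1.5 × IQR convention and the "inclusive" quartile method of Python's statistics module; the function name iqr_outliers and the multiplier k are illustrative choices, and other quartile conventions can give slightly different fences.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    # quantiles() sorts the data itself and returns [Q1, median, Q3] for n=4
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# the dataset used in the library examples of this article
data = [5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40]
print(iqr_outliers(data))  # [40]
```

For this data the quartiles are 14 and 21, so the fences are 3.5 and 31.5 and only 40 falls outside them.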
Grubbs Statistical Hypothesis Test
Like any other statistical hypothesis test, the Grubbs test either rejects or fails to reject the null hypothesis (H0) in favor of the alternate hypothesis (H1). Here, the hypotheses concern the presence of an outlier in a dataset.
We can perform the Grubbs test in two ways, as a one-sided test or a two-sided test, on a univariate dataset or sample that is approximately normally distributed and has at least seven observations. This test is also known as the extreme studentized deviate test or the maximum normalized residual test.
The Grubbs test uses the following hypotheses −
Null (H0): The dataset has no outliers.
Alternate (H1): The dataset has exactly one outlier.
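The one-sided and two-sided variants differ only in which deviation from the mean they measure. As a minimal sketch, assuming the usual textbook definition with the sample standard deviation, the three statistics can be computed with the standard library alone. The function name grubbs_statistics and the sample values are hypothetical and chosen only for illustration.

```python
import statistics

def grubbs_statistics(values):
    """Return (g_min, g_max, g_two_sided) for a sequence of numbers."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    g_min = (mean - min(values)) / s  # one-sided: is the minimum an outlier?
    g_max = (max(values) - mean) / s  # one-sided: is the maximum an outlier?
    g_two_sided = max(abs(v - mean) for v in values) / s  # two-sided
    return g_min, g_max, g_two_sided

sample = [10, 12, 11, 13, 12, 11, 30]  # hypothetical data, not from the article
g_min, g_max, g_two = grubbs_statistics(sample)
print(f"G_min = {g_min:.3f}, G_max = {g_max:.3f}, G = {g_two:.3f}")
```

The two-sided statistic is always the larger of the two one-sided statistics. Each statistic is then compared with a critical value to decide whether to reject H0.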
Grubbs Test in Python
Python's large collection of libraries provides built-in functions for many operations and statistical tests, and the Grubbs test is no exception. We will explore two ways to implement the Grubbs test in Python: the built-in functions from a library and an implementation of the formula from scratch.
Outliers Library and Smirnov_grubbs
Let’s first install the outlier_utils library using the following command.
!pip install outlier_utils
Now let’s make a dataset with outliers and perform the Grubbs test.
Two-Sided Grubbs Test
Syntax
grubbs.test(data, alpha=.05)
Parameters
data − Numeric vector of data values.
alpha − Significance level for the test.
Explanation
In this approach, we call the smirnov_grubbs.test() function from the outliers package (imported here as grubbs) and pass it the data to run the two-sided Grubbs test.
Example
import numpy as np
from outliers import smirnov_grubbs as grubbs

#define data
data = np.array([5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40])

#perform Grubbs' test
grubbs.test(data, alpha=.05)
Output
array([ 5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29])
The code above loads the libraries and the data, then performs the Grubbs test on the data using the test method. The two-sided test checks both ends of the data, the smallest and the largest values, for outliers. Here the test flags a single outlier, 40, and returns the data with that value removed.
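Because the output shown above is the filtered data rather than the outliers themselves, it can help to list what was dropped. The helper below is a hypothetical convenience, not part of outlier_utils. It compares the input and output as multisets using only the standard library, so repeated values are handled correctly.

```python
from collections import Counter

def removed_values(original, filtered):
    """Return the values present in `original` but missing from `filtered`."""
    # Counter subtraction keeps only positive counts, i.e. the dropped items
    leftover = Counter(original) - Counter(filtered)
    return sorted(leftover.elements())

# input and output of the two-sided example above, written as plain lists
original = [5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40]
filtered = [5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29]
print(removed_values(original, filtered))  # [40]
```

With NumPy arrays, calling .tolist() on each array first yields plain Python numbers.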
One-Sided Grubbs Test
Syntax
grubbs.max_test(data, alpha=.05)
Explanation
For the one-sided Grubbs test, call grubbs.min_test() to test whether the minimum value in the data is an outlier, or grubbs.max_test() to test whether the maximum value is an outlier.
Example
import numpy as np
from outliers import smirnov_grubbs as grubbs

#define data
data = np.array([5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29, 40])

#perform Grubbs' test to check whether the minimum value is an outlier
print(grubbs.min_test(data, alpha=.05))

#perform Grubbs' test to check whether the maximum value is an outlier
grubbs.max_test(data, alpha=.05)
Output
[ 5 14 15 15 14 19 17 16 20 22 8 21 28 11 9 29 40]
array([ 5, 14, 15, 15, 14, 19, 17, 16, 20, 22, 8, 21, 28, 11, 9, 29])
The one-sided Grubbs test checks only one end of the data: min_test checks the minimum value and max_test checks the maximum value. In the output above, min_test returns the data unchanged because the minimum value, 5, is not flagged as an outlier, while max_test removes 40.
Formula Implementation
Here we will implement the Grubbs test formula in Python using the NumPy and SciPy libraries.
Syntax
g_calculated = numerator/sd_x
g_critical = ((n - 1) * np.sqrt(np.square(t_value_1))) / (np.sqrt(n) * np.sqrt(n - 2 + np.square(t_value_1)))
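In mathematical notation, the two expressions above correspond to the usual textbook statement of the two-sided test. Here n is the sample size, x̄ the sample mean, s the sample standard deviation, α the significance level, and t the upper critical value of the t distribution with n − 2 degrees of freedom. The symbol names are chosen here for illustration.

```latex
G = \frac{\max_{i} \lvert x_i - \bar{x} \rvert}{s},
\qquad
G_{\text{critical}} = \frac{n - 1}{\sqrt{n}} \sqrt{\frac{t^{2}}{n - 2 + t^{2}}},
\qquad
t = t_{\alpha/(2n),\, n-2}
```

The null hypothesis is rejected when G exceeds the critical value.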
Algorithm
The steps of implementation are as follows −
Calculate the mean of the dataset values.
Calculate the standard deviation of the dataset values.
To implement the Grubbs test formula, calculate the numerator as the largest absolute difference between a value in the dataset and the mean.
Divide the numerator by the standard deviation to get the calculated score.
Calculate the critical score from the sample size and the significance level.
If the critical value is greater than the calculated value, there is no outlier in the dataset; otherwise, an outlier is present.
Example
import numpy as np
import scipy.stats as stats

## define data
x = np.array([12, 13, 14, 19, 21, 23])
y = np.array([12, 13, 14, 19, 21, 23, 45])

## implement the two-sided Grubbs test at a 5% significance level
def grubbs_test(x):
    n = len(x)
    mean_x = np.mean(x)
    # Grubbs' statistic uses the sample standard deviation (n - 1 denominator)
    sd_x = np.std(x, ddof=1)
    numerator = max(abs(x - mean_x))
    g_calculated = numerator / sd_x
    print(f"Grubbs Calculated Value: {g_calculated:.4f}")
    # upper critical value of the t distribution with n - 2 degrees of freedom
    t_value_1 = stats.t.ppf(1 - 0.05 / (2 * n), n - 2)
    g_critical = ((n - 1) * np.sqrt(np.square(t_value_1))) / (np.sqrt(n) * np.sqrt(n - 2 + np.square(t_value_1)))
    print(f"Grubbs Critical Value: {g_critical:.4f}")
    if g_critical > g_calculated:
        print("The calculated value is less than the critical value. Fail to reject the null hypothesis: there are no outliers\n")
    else:
        print("The calculated value exceeds the critical value. Reject the null hypothesis: there is an outlier\n")

grubbs_test(x)
grubbs_test(y)
Output
Grubbs Calculated Value: 1.3031
Grubbs Critical Value: 1.8871
The calculated value is less than the critical value. Fail to reject the null hypothesis: there are no outliers

Grubbs Calculated Value: 2.1076
Grubbs Critical Value: 2.0200
The calculated value exceeds the critical value. Reject the null hypothesis: there is an outlier
The results of the Grubbs test indicate that array x does not have any outliers, but array y has one outlier.
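The grubbs_test function above hard-codes a two-sided test at a significance level of 0.05. As a hedged sketch, the critical value can be factored into its own function so that the significance level and the sidedness become parameters. The name grubbs_critical_value and the two_sided flag are hypothetical. The one-sided branch assumes the usual convention of replacing α/(2n) with α/n.

```python
import numpy as np
import scipy.stats as stats

def grubbs_critical_value(n, alpha=0.05, two_sided=True):
    """Critical value of the Grubbs statistic for a sample of size n."""
    # two-sided tests use alpha / (2n); one-sided tests use alpha / n
    tail = alpha / (2 * n) if two_sided else alpha / n
    # upper critical value of the t distribution with n - 2 degrees of freedom
    t = stats.t.ppf(1 - tail, n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

for n in (6, 7):
    print(n, grubbs_critical_value(n), grubbs_critical_value(n, two_sided=False))
```

For the same n and α, the one-sided critical value is smaller than the two-sided one, because the whole significance level is spent on a single tail.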
Conclusion
We learned about outliers and the Grubbs test in Python in this article. Let's summarize with a few takeaways.
Outliers are records that lie far from the rest of the data, for example well outside the interquartile range.
Outliers fall outside the range expected from the normal distribution of the dataset.
We can detect outliers using the Grubbs statistical hypothesis test.
We can perform the Grubbs test using the built-in methods available in the outlier_utils library.
The two-sided Grubbs test detects and removes outliers at both the minimum and maximum ends of the data.
The one-sided Grubbs test detects outliers at only one end of the data, either the minimum or the maximum.