Kolmogorov-Smirnov Test (KS Test)


Introduction

Statistical analysis relies on many tools and methods to turn raw data into insight. The Kolmogorov-Smirnov Test (KS Test) is one such tool, valued for its versatility and robustness. This non-parametric test is a mainstay of data analysis: it can compare a sample to a reference probability distribution (the one-sample KS Test) or compare two samples to each other (the two-sample KS Test). In this post we explain the concept behind the KS Test, how it works, and where it is used, with Python examples for easy comprehension.

Decoding the Kolmogorov-Smirnov Test

The KS Test, named after Andrey Kolmogorov and Nikolai Smirnov, is a non-parametric technique used to evaluate how well data fit a given distribution or to compare two cumulative distribution functions (CDFs). Because it is non-parametric, it does not assume that the data follow any particular distribution, which adds to its flexibility.

The fundamental idea behind the KS Test is to quantify the largest gap, D, between the empirical distribution function (EDF) of the sample and the CDF of the reference distribution (one-sample test), or between the EDFs of two samples (two-sample test). Formally, for the one-sample case, D = sup_x |F_n(x) − F(x)|, where F_n is the EDF and F is the reference CDF.
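
To make this concrete, here is a minimal sketch that computes D by hand from the EDF and checks it against SciPy. The data, seed, and sample size below are assumed purely for illustration.

# Manual computation of the one-sample D statistic (illustrative sketch)
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)              # assumed seed, for reproducibility
data = rng.normal(loc=0, scale=1, size=100)  # simulated standard-normal data

x = np.sort(data)
n = len(x)
cdf = stats.norm.cdf(x)                      # reference CDF at the ordered observations

# The EDF jumps from (i-1)/n to i/n at the i-th ordered observation
d_plus = np.max(np.arange(1, n + 1) / n - cdf)
d_minus = np.max(cdf - np.arange(0, n) / n)
D = max(d_plus, d_minus)

print("Manual D:", D)
print("SciPy D: ", stats.kstest(data, 'norm').statistic)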

Python Examples for the Kolmogorov-Smirnov Test

Example 1: One-sample KS Test in Python

Imagine that you have a dataset of 50 people's weights and you believe these weights are normally distributed. A one-sample KS Test can check this hypothesis (for illustration, the code below simulates such a sample). The Python code to do it is as follows:

# Import necessary libraries
from scipy import stats
import numpy as np

# Generate a sample of size 50 from a normal distribution
np.random.seed(0)
sample = np.random.normal(loc=0, scale=1, size=50)

# One-sample KS Test
d_statistic, p_value = stats.kstest(sample, 'norm')

print("One-sample KS Test:")
print("D statistic:", d_statistic)
print("p-value:", p_value)

Output

One-sample KS Test:
D statistic: 0.10706475374815838
p-value: 0.5781417630622738

In this code, the 'norm' argument to the kstest function compares the sample against the standard normal distribution. The null hypothesis (that the sample follows that distribution) is rejected if the p-value is less than the significance level, typically 0.05, which would indicate the data may not follow a normal distribution. Here the p-value is about 0.58, well above 0.05, so we do not reject the null hypothesis.
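
In practice, real measurements such as body weights are not on a standard normal scale, so the reference distribution's parameters must be supplied. One way to do this, sketched here with simulated weights rather than data from the example above, is to estimate the mean and standard deviation and pass them through kstest's args parameter:

# Sketch: testing non-standardized data against a fitted normal distribution
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)                    # assumed seed
weights = rng.normal(loc=70, scale=10, size=50)   # hypothetical weights in kg

mu, sigma = weights.mean(), weights.std(ddof=1)   # parameters estimated from the sample
d_stat, p_val = stats.kstest(weights, 'norm', args=(mu, sigma))

print("D statistic:", d_stat)
print("p-value:", p_val)

Keep in mind that estimating the parameters from the same sample makes the standard KS Test conservative; when the parameters are unknown, a correction such as the Lilliefors test is often preferred.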

Example 2: Two-sample KS Test in Python

Let's say you wish to compare the weights of people from City A and City B to see whether they are drawn from the same distribution. This is exactly what the two-sample KS Test is for. The Python code is as follows:

# Generate a second sample of size 50 from a different normal distribution (shifted mean, larger spread)
sample_2 = np.random.normal(loc=0.5, scale=1.5, size=50)

# Two-sample KS Test
d_statistic_2, p_value_2 = stats.ks_2samp(sample, sample_2)

print("\nTwo-sample KS Test:")
print("D statistic:", d_statistic_2)
print("p-value:", p_value_2)

The ks_2samp function compares the distributions of the two samples. If the p-value is less than our significance level, we reject the null hypothesis and conclude that the weights from City A and City B come from different distributions.
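
Continuing from the two-sample snippet above, the decision is usually phrased in code with an explicit significance level (0.05 is assumed here, matching the earlier discussion):

# Turn the two-sample result into an explicit decision
alpha = 0.05

if p_value_2 < alpha:
    print("Reject H0: the samples appear to come from different distributions.")
else:
    print("Fail to reject H0: no evidence that the distributions differ.")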

Harnessing the Power of the Kolmogorov-Smirnov Test

The KS Test's adaptability makes it useful in many domains. Financial analysts use it to check whether the returns of a particular stock follow a normal distribution, and environmental scientists might use it to compare the rainfall patterns of two regions.

The KS Test is also very helpful in data science and machine learning. For instance, when building a model to predict binary events, the KS Test can compare the distributions of predicted probabilities for the positive and negative outcomes. A large KS statistic means these distributions are well separated, which points to a model that discriminates well between the two classes.
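
As a rough illustration of that idea, the predicted probabilities can be split by true class and compared with a two-sample KS Test. The beta-distributed scores below are simulated stand-ins for model outputs, not results from a real classifier.

# Sketch: KS statistic as a separation measure for a binary classifier
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)              # assumed seed
scores_pos = rng.beta(a=4, b=2, size=500)   # hypothetical scores for the positive class
scores_neg = rng.beta(a=2, b=4, size=500)   # hypothetical scores for the negative class

ks_stat, p_val = stats.ks_2samp(scores_pos, scores_neg)
print("KS statistic (separation):", ks_stat)
print("p-value:", p_val)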

The KS Test also helps the digital advertising sector understand user behaviour. For example, it can be used to check whether the time users spend on a webpage follows a particular distribution, enabling organisations to make data-driven decisions.

Conclusion

In the field of statistical analysis, the Kolmogorov-Smirnov Test is a powerful non-parametric technique for assessing goodness of fit and for comparing samples. Its broad applicability across industries underlines its importance in today's data-driven environment.

With Python, the KS Test is accessible and simple to apply, offering solid statistical insights. Whether you're a data scientist validating the performance of a machine learning model, a financial analyst checking assumptions about your data, or a researcher comparing datasets, the KS Test can be your go-to tool for rigorous statistical testing.

Updated on: 17-Jul-2023
