Welch’s T-Test in Python


Python, through libraries such as NumPy and SciPy, can perform a wide range of statistical tests. One of them is Welch’s t-test.

When two datasets have equal variances and you want to know whether their means are the same, a standard two-sample (Student’s) t-test is appropriate. However, if the variances of the two datasets are not the same, Welch’s t-test should be used to compare the means.
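To make the difference concrete, here is a minimal sketch of how Welch’s statistic can be computed by hand and checked against SciPy. The helper name welch_t and the two small datasets are hypothetical and used only for illustration. Each sample contributes its own variance to the standard error, and the degrees of freedom come from the Welch-Satterthwaite approximation.

```python
import numpy as np
from scipy import stats

def welch_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom, two-sided p-value)."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    # Variance of each sample mean, using the unbiased sample variance (ddof=1)
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    # The standard error is NOT pooled: each sample keeps its own variance
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    # Two-sided p-value from the t distribution
    p = 2 * stats.t.sf(abs(t), df)
    return t, df, p

# Hypothetical data, chosen only for illustration
group_a = [12.1, 14.3, 11.8, 15.2, 13.5, 12.9]
group_b = [10.2, 18.7, 7.4, 21.3, 9.9, 16.8, 5.6]

t_manual, df_manual, p_manual = welch_t(group_a, group_b)
result = stats.ttest_ind(group_a, group_b, equal_var=False)

print("manual:", t_manual, p_manual)
print("scipy: ", result.statistic, result.pvalue)
```

The two printed lines should agree up to floating-point rounding.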

Syntax

stats.ttest_ind(dataset_one, dataset_two, equal_var=False)

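Here, ttest_ind is the function that performs the test. With equal_var set to False it runs Welch’s t-test; with equal_var set to True (the default) it runs the standard two-sample t-test. The call uses three parameters: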

  • The first dataset as an array or list

  • The second dataset as an array or list

  • A boolean, equal_var, that tells the function whether to assume equal variances; pass False for Welch’s t-test

The function returns a result object that holds two values: the test statistic and the p-value.
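As a small illustration of working with the returned values, the sketch below reads them by attribute name and also unpacks them like a tuple. The arrays sample_one and sample_two are hypothetical measurements, not data from this article.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements, for illustration only
sample_one = np.array([5.1, 4.9, 6.2, 5.8, 5.5])
sample_two = np.array([4.2, 7.9, 3.1, 8.4, 6.0, 2.7])

result = stats.ttest_ind(sample_one, sample_two, equal_var=False)

# Read the two values by attribute name ...
print("t statistic:", result.statistic)
print("p-value:", result.pvalue)

# ... or unpack the result like a two-element tuple
t_stat, p_value = stats.ttest_ind(sample_one, sample_two, equal_var=False)
```

Unpacking is convenient when only the p-value is needed for a decision; the attribute form tends to read more clearly in longer scripts.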

Algorithm

  • Step 1 − Import Python’s numpy and scipy libraries.

  • Step 2 − Use the array() method to form two datasets.

  • Step 3 − Use the var() method to compare the variances of the two datasets. As a rule of thumb, if the ratio of the larger variance to the smaller one is more than 4:1, the variances can’t be assumed equal and we move to the next step to perform Welch’s t-test.

  • Step 4 − Use the stats.ttest_ind() method with equal_var set to False to find the p-value. If the p-value is less than 0.05, the difference in the means is considered statistically significant at the 5% level.
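The steps above can be sketched as one small helper. The function name compare_means, its parameters ratio_threshold and alpha, and the sample data are all hypothetical; the 4:1 cut-off is the rule of thumb from Step 3, not a formal test of equal variances. The sketch assumes neither dataset has zero variance.

```python
import numpy as np
from scipy import stats

def compare_means(data_one, data_two, ratio_threshold=4.0, alpha=0.05):
    """Hypothetical helper that follows the four steps above."""
    # Step 2: form the two datasets as arrays
    one = np.asarray(data_one, dtype=float)
    two = np.asarray(data_two, dtype=float)
    # Step 3: ratio of the larger sample variance to the smaller one
    var_one, var_two = np.var(one, ddof=1), np.var(two, ddof=1)
    ratio = max(var_one, var_two) / min(var_one, var_two)
    equal_var = ratio <= ratio_threshold
    # Step 4: Welch's t-test when the variances look unequal
    result = stats.ttest_ind(one, two, equal_var=equal_var)
    return {
        "variance_ratio": float(ratio),
        "test": "Student" if equal_var else "Welch",
        "statistic": float(result.statistic),
        "pvalue": float(result.pvalue),
        "significant": bool(result.pvalue < alpha),
    }

# Hypothetical data: one tightly clustered sample, one widely spread sample
tight = [10.0, 10.5, 9.8, 10.2, 10.1, 9.9]
wide = [4.0, 16.0, 7.5, 13.0, 2.0, 18.5]

report = compare_means(tight, wide)
print(report)
```

Returning a dictionary keeps the chosen test visible alongside the p-value, so a reader can see which branch was taken.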

Example 1

In this example, we will take two arrays containing the number of leaves on 10 plants from each of two different species and perform Welch’s t-test on them. This is done using the stats.ttest_ind() function, but first we check whether the variances of the two arrays are the same.

These are the hypotheses to be tested:

  • Null hypothesis (H0) − μ1 = μ2, meaning the population means of the two datasets are equal.

  • Alternative hypothesis (H1) − μ1 ≠ μ2, meaning the population means of the two datasets differ.

# import the numpy and scipy libraries
import numpy as np
import scipy.stats as stats

# form two datasets as array_one and array_two
array_one = np.array([25, 55, 59, 24, 21, 54, 32, 43, 54, 65])
array_two = np.array([23, 12, 24, 10, 18, 17, 22, 15, 16, 25])

# find the ratio of the sample variances (ddof=1) of the two datasets
variance_ratio = np.var(array_one, ddof=1) / np.var(array_two, ddof=1)

# if the ratio is greater than 4, perform Welch's t-test
if variance_ratio > 4:
    print(stats.ttest_ind(array_one, array_two, equal_var=False))

Output

Ttest_indResult(statistic=4.602699733067644, pvalue=0.0008049287678035495)

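Since the p-value is less than 0.05, we reject the null hypothesis and conclude that the difference between the means of the two datasets is statistically significant. The exact name of the result object and the fields it prints can vary across SciPy versions.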

Example 2

In this example, we will take arrays with the values of runs scored by two batsmen in 10 matches and perform Welch’s t-test on them.

# import the numpy and scipy libraries
import numpy as np
import scipy.stats as stats

# form two datasets as batsman_one and batsman_two
batsman_one = [30, 91, 0, 64, 42, 80, 30, 5, 117, 71]
batsman_two = [53, 46, 48, 50, 53, 53, 58, 60, 57, 52]

# find the ratio of the sample variances (ddof=1) of the two datasets
variance_ratio = np.var(batsman_one, ddof=1) / np.var(batsman_two, ddof=1)

# if the ratio is greater than 4, perform Welch's t-test
if variance_ratio > 4:
    print(stats.ttest_ind(batsman_one, batsman_two, equal_var=False))

Output

Ttest_indResult(statistic=0.0, pvalue=1.0)

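Since the p-value is greater than 0.05 (it is in fact 1.0), we cannot reject the null hypothesis. A test statistic of 0.0 means the two sample means are identical, so the data give no evidence that the average scores of the two batsmen differ.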
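A quick check of the summary statistics shows why the statistic comes out as exactly zero. This is only a sanity check on the same two lists, not an additional test.

```python
import numpy as np

batsman_one = [30, 91, 0, 64, 42, 80, 30, 5, 117, 71]
batsman_two = [53, 46, 48, 50, 53, 53, 58, 60, 57, 52]

# Both batsmen scored 530 runs in 10 matches, so the sample means coincide
mean_one = np.mean(batsman_one)
mean_two = np.mean(batsman_two)
print(mean_one, mean_two)  # 53.0 53.0

# The spread is very different, which is why Welch's test was chosen
print(np.var(batsman_one, ddof=1), np.var(batsman_two, ddof=1))
```

The numerator of the t statistic is the difference of the sample means, so equal means give a statistic of 0 regardless of how different the variances are.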

Conclusion

Welch’s t-test is more reliable than the standard two-sample t-test when the variances differ, and it loses very little when the variances are in fact equal. Thus, one can use Welch’s t-test directly, irrespective of the variances. Like the standard t-test, it assumes roughly normal data, so with skewed distributions it is best applied to large samples. It is also not limited to Python; languages like R and Julia support Welch’s t-test too.
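One way to see how the two tests relate is to run both on the same data. The sketch below reuses the leaf counts from Example 1; it illustrates a single case and is not evidence about error rates in general.

```python
import numpy as np
from scipy import stats

# Leaf counts from Example 1 (equal sample sizes, unequal variances)
array_one = np.array([25, 55, 59, 24, 21, 54, 32, 43, 54, 65])
array_two = np.array([23, 12, 24, 10, 18, 17, 22, 15, 16, 25])

# Standard two-sample t-test (pooled variance)
student = stats.ttest_ind(array_one, array_two, equal_var=True)
# Welch's t-test (separate variances)
welch = stats.ttest_ind(array_one, array_two, equal_var=False)

print("Student:", student.statistic, student.pvalue)
print("Welch:  ", welch.statistic, welch.pvalue)
```

Because both samples have the same size, the two test statistics coincide. Welch’s test uses fewer degrees of freedom here, so its p-value is somewhat larger, which makes it the more cautious of the two.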

Updated on: 07-Aug-2023
