A complete guide to resampling methods

Resampling is a family of statistical techniques that repeatedly draw new samples from an existing dataset in order to make inferences about the population, or the process, that produced the data. These methods are widely used when a population parameter has to be estimated from the data at hand, when only a few data points are available, or when the assumptions behind classical formulas are doubtful. Common resampling approaches include bootstrapping, jackknifing, and permutation testing, which are used to estimate standard errors, confidence intervals, and p-values.

Analyzing and interpreting data is one of a data scientist's most important responsibilities. A single sample, however, gives only one view of the population, and it is not always obvious how much an estimate would change if the data were collected again. Resampling techniques address this by generating many new samples from the existing data, which makes it possible to quantify the uncertainty of an estimate or to test a hypothesis. This article gives an overview of two resampling strategies, bootstrapping and permutation tests, including their benefits and drawbacks.

Bootstrapping

Bootstrapping is a resampling technique in which a dataset is repeatedly sampled with replacement to produce new samples of the same size as the original. The statistic of interest is computed on each new sample, and the distribution of these values is used to estimate the sampling variability of the statistic. Standard errors, confidence intervals, and p-values for a wide variety of statistical models and estimators can be estimated this way.
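
The resampling loop described above can be sketched in a few lines. This is a minimal illustration on synthetic data, separate from the iris example later in this section; the sample, the seeds, and the function name bootstrap_standard_error are assumptions made here. It compares the bootstrap estimate of the standard error of the mean with the usual formula, s / sqrt(n).

```python
import numpy as np

def bootstrap_standard_error(data, statistic, n_bootstraps=2000, seed=0):
    """Estimate the standard error of `statistic` by resampling with replacement."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    estimates = np.empty(n_bootstraps)
    for i in range(n_bootstraps):
        # draw n indices with replacement: some points repeat, others are left out
        resample = data[rng.integers(0, n, size=n)]
        estimates[i] = statistic(resample)
    # the spread of the bootstrap estimates approximates the sampling variability
    return estimates.std(ddof=1)

# hypothetical sample: 200 draws from a skewed (exponential) distribution
sample = np.random.default_rng(42).exponential(scale=2.0, size=200)

se_boot = bootstrap_standard_error(sample, np.mean)
se_formula = sample.std(ddof=1) / np.sqrt(len(sample))
print(f"bootstrap SE of the mean: {se_boot:.3f}")
print(f"formula SE of the mean:   {se_formula:.3f}")
```

For the mean the two numbers should be close, which is a useful sanity check; the bootstrap becomes more valuable for statistics such as the median, where no simple formula is available.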

Advantages

Bootstrapping is a non-parametric approach: it does not assume that the population follows a particular distribution.
It remains usable when the data are non-normal, a setting where formulas based on normality can be unreliable. Its sensitivity to outliers depends on the statistic being bootstrapped.
The variability of many different statistics, such as the mean, median, correlation, and regression coefficients, can be estimated using this method.
It is a useful tool for hypothesis testing and confidence interval calculation because it estimates statistical uncertainty directly from the data, without requiring a closed-form formula.

Disadvantages

Bootstrapping can be computationally demanding, particularly if the dataset is large or the statistic of interest requires complex calculations.
When the sample size is small or the population is heavily skewed, bootstrap estimates can be biased and confidence intervals can have poorer coverage than their nominal level.
Because the standard bootstrap assumes that the data points are independent, it may not be appropriate for dependent data, such as time series.
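
For dependent data, one common workaround is to resample contiguous blocks rather than individual points, so that short-range dependence inside each block is preserved. Below is a rough sketch of a moving block bootstrap. The block length, the synthetic AR(1) series, and the function name moving_block_bootstrap are assumptions chosen for illustration, and choosing a good block length is a separate problem not covered here.

```python
import numpy as np

def moving_block_bootstrap(series, block_length, rng):
    """Return one resampled series built from randomly chosen contiguous blocks."""
    series = np.asarray(series)
    n = len(series)
    n_blocks = int(np.ceil(n / block_length))
    # each block may start anywhere that leaves room for a full block
    starts = rng.integers(0, n - block_length + 1, size=n_blocks)
    blocks = [series[s:s + block_length] for s in starts]
    # glue the blocks together and trim to the original length
    return np.concatenate(blocks)[:n]

rng = np.random.default_rng(0)

# hypothetical autocorrelated series: AR(1) with coefficient 0.8
n = 500
noise = rng.normal(size=n)
series = np.empty(n)
series[0] = noise[0]
for t in range(1, n):
    series[t] = 0.8 * series[t - 1] + noise[t]

# bootstrap distribution of the mean, using blocks of 25 consecutive points
block_means = np.array([
    moving_block_bootstrap(series, block_length=25, rng=rng).mean()
    for _ in range(1000)
])
print(f"block-bootstrap SE of the mean: {block_means.std(ddof=1):.3f}")
```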

Example

The iris dataset from scikit-learn will be used in this example.

import numpy as np
from sklearn.datasets import load_iris

# load iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# bootstrap function
def bootstrap(data, n_bootstraps, statistic):
   """Generate new samples by bootstrapping the data and calculate the statistic."""
   
   # initialize array to store statistic
   boot_statistic = np.zeros(n_bootstraps)
  
   # generate new samples by bootstrapping the data
   for i in range(n_bootstraps):
       bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
       boot_statistic[i] = statistic(bootstrap_sample)
  
   return boot_statistic

# calculate mean sepal length by bootstrapping
mean_sepal_length = np.mean(X[:, 0])
boot_means = bootstrap(X[:, 0], n_bootstraps=1000, statistic=np.mean)
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"Mean Sepal Length: {mean_sepal_length:.2f}")
print(f"95% Confidence Interval: ({lower:.2f}, {upper:.2f})")

Output

Mean Sepal Length: 5.84
95% Confidence Interval: (5.72, 5.99)

The code first loads the iris dataset; the sepal length values are the first column of the feature matrix X. A bootstrap function is then defined, which accepts three arguments: the data, the number of bootstrap samples, and a statistic function. The function draws each bootstrap sample with replacement and computes the given statistic on it.

We then compute the mean sepal length for the original dataset and use the bootstrap function to compute the mean of each of 1000 bootstrap samples. The 2.5th and 97.5th percentiles of these bootstrap means give a 95% confidence interval for the mean sepal length. Because no random seed is set, the interval bounds will vary slightly from run to run.
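
The bootstrap function above resamples a one-dimensional array. For statistics that depend on paired observations, such as a correlation, each pair has to be resampled together. One way to do that, sketched here with synthetic data and a hypothetical helper named bootstrap_correlation_ci, is to resample row indices and apply them to both variables:

```python
import numpy as np

def bootstrap_correlation_ci(x, y, n_bootstraps=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the Pearson correlation."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    correlations = np.empty(n_bootstraps)
    for i in range(n_bootstraps):
        # resample row indices so each (x, y) pair stays together
        idx = rng.integers(0, n, size=n)
        correlations[i] = np.corrcoef(x[idx], y[idx])[0, 1]
    lower, upper = np.percentile(
        correlations, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return lower, upper

# hypothetical paired data with a positive linear relationship
rng = np.random.default_rng(1)
x = rng.normal(size=150)
y = 0.6 * x + rng.normal(scale=0.8, size=150)

observed_r = np.corrcoef(x, y)[0, 1]
lower, upper = bootstrap_correlation_ci(x, y)
print(f"Observed correlation: {observed_r:.2f}")
print(f"95% Confidence Interval: ({lower:.2f}, {upper:.2f})")
```

Resampling x and y independently would destroy the relationship being measured, which is why the indices are shared.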

Permutation Tests

Permutation tests are a flexible resampling method that can stand in for a number of classical statistical tests. Whereas bootstrapping samples with replacement, a permutation test creates new samples by randomly shuffling the values of one or more variables in the original dataset, most often the group labels. This makes permutation tests especially useful for testing hypotheses about differences between two or more groups, or for assessing whether an observed difference between two measurements is statistically significant. They do not depend on the population following a particular distribution, but they do assume that the observations are exchangeable under the null hypothesis.

Advantages

Permutation tests make no assumptions about the shape of the population distribution. As a consequence, they are flexible tools that work with a range of data types and experimental designs.
Because the null distribution is built from the observed data itself, the resulting p-values are exact when every permutation is enumerated and are close approximations when a large random subset of permutations is used. They do not rely on large-sample approximations in the way many traditional tests do.
Permutation versions exist for a variety of statistical tests, including t-tests, ANOVAs, and correlation analyses.
Permutation tests are often more trustworthy than traditional parametric tests when the sample size is small or the data are not normally distributed.
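
As a hedged illustration of the point about correlation analyses, the sketch below tests for an association by shuffling one variable while leaving the other fixed, which breaks any pairing between them. The data, the seeds, and the function name permutation_test_correlation are illustrative assumptions, as is the add-one adjustment to the p-value.

```python
import numpy as np

def permutation_test_correlation(x, y, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    observed = np.corrcoef(x, y)[0, 1]
    count = 0
    for _ in range(n_permutations):
        # shuffling y destroys the pairing, simulating "no association"
        permuted_r = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(permuted_r) >= abs(observed):
            count += 1
    # add one to numerator and denominator so the estimate is never exactly zero
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value

# hypothetical data: y depends linearly on x, plus noise
rng = np.random.default_rng(2)
x = rng.normal(size=40)
y_related = 0.7 * x + rng.normal(scale=0.7, size=40)

r, p = permutation_test_correlation(x, y_related)
print(f"correlation: {r:.2f}, p-value: {p:.4f}")
```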

Disadvantages

When the sample size or the number of permutations is large, permutation testing can be computationally demanding.
Some types of data or experimental designs are not suitable for permutation testing, particularly when observations are not exchangeable under the null hypothesis, for example with dependent observations. Missing values must also be handled before the data are permuted.
Permutation tests can be harder to explain to non-experts than standard tests, since the logic of building a null distribution by shuffling is less familiar.
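
To get a feel for the computational cost, the sketch below counts how many distinct ways 100 pooled values can be split into two groups of 50 (roughly 10^29, far too many to enumerate) and then runs an exact test by full enumeration on a tiny made-up sample, where only 252 splits exist. The numbers and the helper name exact_permutation_test are assumptions for illustration only.

```python
from itertools import combinations
from math import comb

import numpy as np

# number of distinct ways to split 100 pooled values into two groups of 50
print(f"{comb(100, 50):.3e}")

def exact_permutation_test(group_a, group_b):
    """Two-sided exact p-value for a difference in means, by full enumeration."""
    pooled = np.concatenate([group_a, group_b])
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = pooled.sum()
    observed = abs(np.mean(group_a) - np.mean(group_b))
    count = 0
    n_splits = 0
    # visit every way of choosing which pooled values form the first group
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = pooled[list(idx)].sum()
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # small tolerance guards against floating-point ties
        if diff >= observed - 1e-12:
            count += 1
        n_splits += 1
    return count / n_splits

# hypothetical measurements for two small groups
a = np.array([4.1, 5.0, 4.7, 5.3, 4.9])
b = np.array([5.9, 6.4, 6.1, 5.8, 6.6])
print(exact_permutation_test(a, b))
```

With larger groups, a random subset of permutations is used instead of full enumeration, which is the approach taken in the example that follows.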

Example

This example shows how to conduct a permutation test in Python using NumPy. It uses the iris dataset, which contains measurements of the sepal length, sepal width, petal length, and petal width of three species of iris.

import numpy as np
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
setosa_petal_length = iris.data[:50, 2]  # Select the petal length of the first species
versicolor_petal_length = iris.data[50:100, 2]  # Select the petal length of the second species

# Calculate the observed difference in means between the two groups
obs_diff = np.mean(setosa_petal_length) - np.mean(versicolor_petal_length)

# Permutation test
n_permutations = 10000
diffs = []
for i in range(n_permutations):

   # Randomly permute the data
   permuted_data = np.random.permutation(np.concatenate([setosa_petal_length, versicolor_petal_length]))
   
   # Split the permuted data into two groups
   permuted_setosa = permuted_data[:50]
   permuted_versicolor = permuted_data[50:]
   
   # Calculate the difference in means between the two groups
   permuted_diff = np.mean(permuted_setosa) - np.mean(permuted_versicolor)
   diffs.append(permuted_diff)

# Calculate the two-sided p-value as the proportion of permuted differences
# that are at least as large in absolute value as the observed difference
p_value = np.sum(np.abs(diffs) >= np.abs(obs_diff)) / n_permutations
print('Observed difference in means:', obs_diff)
print('p-value:', p_value)

Output

Observed difference in means: -2.7979999999999996
p-value: 0.0

In this example, we apply a permutation test to the null hypothesis that there is no difference in mean petal length between the iris species setosa and versicolor. After calculating the observed difference in means between the two groups, the pooled data is randomly permuted and split back into two groups of 50, and the difference in means is recorded; this is repeated 10,000 times. The p-value is the proportion of permuted differences that are at least as extreme, in absolute value, as the observed difference. The comparison uses absolute values because the observed difference is negative; a one-sided count of permuted differences greater than or equal to it would include essentially every permutation. If the p-value is lower than the predetermined significance level (for example, 0.05), the null hypothesis is rejected and we conclude that there is a significant difference in petal length between the two species. A reported value of 0.0 means that none of the 10,000 permutations produced a difference as large as the observed one, so the p-value is best read as smaller than 1 in 10,000 rather than exactly zero.

Conclusion

In conclusion, resampling techniques are an important part of the data scientist's toolbox for estimating uncertainty and assessing statistical significance. They allow conclusions to be drawn about the underlying population with few assumptions about the shape of its distribution, which matters because classical tests can be inaccurate when their assumptions are not met, for example with small samples. Resampling does not, however, correct for a sample that is biased to begin with. Resampling ideas are also used to assess the stability of machine learning models and to estimate their performance on new, unseen data, as in cross-validation. By employing techniques such as bootstrapping and permutation testing, data scientists can better judge how robust and reproducible their findings are.

Jay Singh

Updated on: 2023-04-25T11:36:35+05:30
