Demonstrating the 68-95-99.7 Rule in Statistics Using Python


Statistics provides us with powerful tools to analyze and understand data. One of the fundamental concepts in statistics is the 68-95-99.7 rule, also known as the empirical rule or the three-sigma rule. This rule allows us to make important inferences about the distribution of data based on its standard deviation. In this blog post, we will explore the 68-95-99.7 rule and demonstrate how to apply it using Python.

Overview of the 68-95-99.7 Rule

The 68-95-99.7 rule provides a way to estimate the percentage of data that falls within a certain number of standard deviations from the mean in a normal distribution. According to this rule:

  • Approximately 68% of the data falls within one standard deviation of the mean.

  • Approximately 95% of the data falls within two standard deviations of the mean.

  • Approximately 99.7% of the data falls within three standard deviations of the mean.

These percentages hold true for a dataset that follows a normal distribution, also known as a bell curve. Understanding this rule allows us to quickly assess the spread of data and identify outliers or unusual observations.
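The three percentages are not arbitrary: for a normal distribution, the probability of landing within k standard deviations of the mean is erf(k/√2), where erf is the Gauss error function. As a quick sketch using only Python's standard library, we can recover all three figures:

```python
import math

def within_k_std(k):
    """Probability that a normal variable lies within k standard
    deviations of its mean: erf(k / sqrt(2))."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"Within {k} std: {within_k_std(k):.2%}")
# Within 1 std: 68.27%
# Within 2 std: 95.45%
# Within 3 std: 99.73%
```

Rounding these exact values (68.27%, 95.45%, 99.73%) gives the rule its name.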

Implementing the 68-95-99.7 Rule in Python

To demonstrate the 68-95-99.7 rule in action, we will use Python with two popular libraries: NumPy for efficient numerical operations and statistical functions, and Matplotlib for plotting. Let's start by importing the required libraries:

import numpy as np
import matplotlib.pyplot as plt

Next, we will generate a random dataset that follows a normal distribution using the numpy.random.normal() function. We'll use a mean of 0 and a standard deviation of 1:

np.random.seed(42)  # Set the random seed for reproducibility
data = np.random.normal(0, 1, 10000)

Now, we can calculate the mean and standard deviation of the dataset:

mean = np.mean(data)
std = np.std(data)

To visualize the data and the areas covered by the 68-95-99.7 rule, we can create a histogram using the matplotlib.pyplot.hist() function:

plt.hist(data, bins=30, density=True, alpha=0.7)

# Plot the mean and standard deviations
plt.axvline(mean, color='r', linestyle='dashed', linewidth=1, label='Mean')
plt.axvline(mean - std, color='g', linestyle='dashed', linewidth=1, label='1 STD')
plt.axvline(mean + std, color='g', linestyle='dashed', linewidth=1)
plt.axvline(mean - 2*std, color='b', linestyle='dashed', linewidth=1, label='2 STD')
plt.axvline(mean + 2*std, color='b', linestyle='dashed', linewidth=1)
plt.axvline(mean - 3*std, color='m', linestyle='dashed', linewidth=1, label='3 STD')
plt.axvline(mean + 3*std, color='m', linestyle='dashed', linewidth=1)

plt.legend()
plt.xlabel('Value')
plt.ylabel('Density')
plt.title('Histogram of the Dataset')
plt.show()

The resulting histogram will display the distribution of the data along with the mean and the standard deviations marked by dashed lines.

To calculate the percentage of data covered by each range, we can simply count the observations that fall within each interval. (The theoretical values come from the cumulative distribution function of the normal distribution; if SciPy is available, scipy.stats.norm.cdf computes it directly.)

# Calculate the percentage within one standard deviation
pct_within_1_std = np.sum(np.logical_and(data >= mean - std, data <= mean + std)) / len(data)

# Calculate the percentage within two standard deviations
pct_within_2_std = np.sum(np.logical_and(data >= mean - 2*std, data <= mean + 2*std)) / len(data)

# Calculate the percentage within three standard deviations
pct_within_3_std = np.sum(np.logical_and(data >= mean - 3*std, data <= mean + 3*std)) / len(data)

print("Percentage within one standard deviation: {:.2%}".format(pct_within_1_std))
print("Percentage within two standard deviations: {:.2%}".format(pct_within_2_std))
print("Percentage within three standard deviations: {:.2%}".format(pct_within_3_std))

When you run this code, you will see the percentages of data falling within one, two, and three standard deviations from the mean.

Percentage within one standard deviation: 68.27%
Percentage within two standard deviations: 95.61%
Percentage within three standard deviations: 99.70%

These results closely align with the expected percentages according to the 68-95-99.7 rule.

Interpretation of the 68-95-99.7 Rule

The percentages covered by each range have specific interpretations. Data falling within one standard deviation of the mean is relatively common, while data falling beyond three standard deviations is considered rare. Understanding these interpretations helps in making meaningful inferences about the data.

Limitations of the 68-95-99.7 Rule

While the 68-95-99.7 rule is a valuable guideline, it may not accurately apply to datasets that deviate significantly from a normal distribution. It's crucial to consider other statistical techniques and conduct further analysis when dealing with such datasets.
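To see this limitation concretely, here is a minimal sketch that applies the same within-k-standard-deviations count to a strongly right-skewed dataset (drawn from an exponential distribution); the fractions it prints differ noticeably from 68-95-99.7:

```python
import numpy as np

rng = np.random.default_rng(0)
# An exponential distribution is strongly right-skewed, not bell-shaped.
data = rng.exponential(scale=1.0, size=100_000)

mean, std = data.mean(), data.std()
for k in (1, 2, 3):
    frac = np.mean(np.abs(data - mean) <= k * std)
    print(f"Within {k} std: {frac:.2%}")
# Roughly 86%, 95%, and 98% -- not 68%, 95%, and 99.7%.
```

The one- and three-sigma figures are well off the rule's predictions, which is why checking the shape of the distribution should come before applying the rule.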

Outliers and the 68-95-99.7 Rule

Outliers can greatly impact the accuracy of the percentages covered by each range. These extreme values can skew the distribution and affect the validity of the rule. It is important to identify and handle outliers appropriately to ensure accurate statistical analysis.
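One common way to operationalize this is z-score screening: flag any point more than three standard deviations from the mean, the region the rule says should hold only about 0.3% of normal data. The helper below (flag_outliers is a name chosen here for illustration) is a minimal sketch of that idea:

```python
import numpy as np

def flag_outliers(data, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    data = np.asarray(data, dtype=float)
    z = (data - data.mean()) / data.std()
    return np.abs(z) > threshold

rng = np.random.default_rng(1)
# Inject one extreme value into otherwise standard-normal data.
data = np.append(rng.normal(0, 1, 1000), 8.0)
print(np.flatnonzero(flag_outliers(data)))
# The injected point (index 1000) is flagged; a few genuine tail
# points may appear as well, since ~0.3% of normal data lies beyond 3 std.
```

Note that extreme values inflate the mean and standard deviation they are measured against, so for heavily contaminated data, robust variants (e.g. using the median and median absolute deviation) are often preferred.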

Real-Life Examples

The 68-95-99.7 rule finds application in various fields. For example, it is relevant in quality control processes to identify defective products, in financial analysis to assess risk and return on investments, in healthcare research to understand patient characteristics, and in many other domains where understanding data distributions is essential.

As you go deeper into statistics, consider exploring other concepts that complement the 68-95-99.7 rule. Skewness, kurtosis, confidence intervals, hypothesis testing, and regression analysis are just a few examples of statistical tools that can further enhance your understanding and analysis of data.

Conclusion

The 68-95-99.7 rule is a powerful concept in statistics that allows us to understand the distribution of data based on its standard deviation. By applying this rule, we can estimate the proportions of data falling within specific ranges around the mean. In this blog, we used Python and the NumPy library to generate a random dataset, visualize it, and calculate the percentages covered by each range. Understanding this rule enables us to make meaningful inferences about our data and identify potential outliers or unusual observations.

Updated on: 16-Aug-2023
