All of the statistics functions are located in the sub-package **scipy.stats**, and a fairly complete listing of these functions can be obtained using the **info(stats)** function or Python's built-in help(stats). A list of the available random variables can also be obtained from the **docstring** of the stats sub-package. This module contains a large number of probability distributions as well as a growing library of statistical functions.

Each univariate distribution has its own subclass as described in the following table −

Sr. No. | Class & Description |
---|---|
1 | **rv_continuous** − A generic continuous random variable class meant for subclassing |
2 | **rv_discrete** − A generic discrete random variable class meant for subclassing |
3 | **rv_histogram** − Generates a distribution given by a histogram |

A continuous random variable is a random variable X that can take any value within a range. For the normal distribution, the location (loc) keyword specifies the mean and the scale (scale) keyword specifies the standard deviation.

As an instance of the **rv_continuous** class, the **norm** object inherits from it a collection of generic methods and completes them with details specific to this particular distribution.
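As a minimal sketch of how loc and scale act on the normal distribution, the following freezes **norm** with a mean of 10 and a standard deviation of 2. Both values are arbitrary and chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical parameters, chosen only for illustration.
mu, sigma = 10.0, 2.0

# For norm, loc is the mean and scale is the standard deviation.
dist = norm(loc=mu, scale=sigma)  # a "frozen" distribution

mean = dist.mean()
std = dist.std()

# By symmetry, the CDF evaluated at the mean is 0.5.
cdf_at_mean = dist.cdf(mu)

# Standardizing by hand gives the same CDF value as passing loc/scale.
x = 13.0
cdf_shifted = norm.cdf(x, loc=mu, scale=sigma)
cdf_standard = norm.cdf((x - mu) / sigma)

print(mean, std, cdf_at_mean)
print(np.isclose(cdf_shifted, cdf_standard))
```

Freezing the distribution avoids repeating loc and scale in every method call.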

To compute the CDF at a number of points, we can pass a list or a NumPy array. Let us consider the following example.

```python
from scipy.stats import norm
import numpy as np

# Evaluate the standard normal CDF at several points at once
print(norm.cdf(np.array([1, -1., 0, 1, 3, 4, -2, 6])))
```

The above program will generate the following output.

array([ 0.84134475, 0.15865525, 0.5 , 0.84134475, 0.9986501 , 0.99996833, 0.02275013, 1. ])

To find the median of a distribution, we can use the Percent Point Function (PPF), which is the inverse of the CDF. Let us understand by using the following example.

```python
from scipy.stats import norm

# The median is the 0.5 quantile of the distribution
print(norm.ppf(0.5))
```

The above program will generate the following output.

0.0
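Because the PPF is the inverse of the CDF, applying one after the other should return the starting values. The following sketch checks this for a few probabilities picked for illustration.

```python
import numpy as np
from scipy.stats import norm

# Illustrative probabilities: lower tail, median, upper tail.
q = np.array([0.025, 0.5, 0.975])

x = norm.ppf(q)        # quantiles of the standard normal distribution
q_back = norm.cdf(x)   # applying the CDF recovers the probabilities

median = norm.ppf(0.5)

print(x)
print(np.allclose(q, q_back))
```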

To generate a sequence of random variates, we should use the size keyword argument, which is shown in the following example.

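```python
from scipy.stats import norm

# Draw five random variates from the standard normal distribution
print(norm.rvs(size=5))
```

The above program will generate output similar to the following; the exact values vary between runs.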

array([ 0.20929928, -1.91049255, 0.41264672, -0.7135557 , -0.03833048])

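The above output is not reproducible. To generate the same random numbers on every run, fix the seed, for example by passing the random_state keyword argument to rvs or by calling numpy.random.seed beforehand.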
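One way to make the draw repeatable is the random_state keyword of rvs. The sketch below uses the arbitrary seed 42; any integer would serve.

```python
import numpy as np
from scipy.stats import norm

# The seed value 42 is an arbitrary choice for illustration.
a = norm.rvs(size=5, random_state=42)
b = norm.rvs(size=5, random_state=42)

# The same seed yields the same variates.
print(np.array_equal(a, b))

# A NumPy Generator can be passed instead of an integer seed.
rng = np.random.default_rng(42)
c = norm.rvs(size=5, random_state=rng)
print(c)
```

Passing the seed per call keeps the reproducibility local to that call instead of changing NumPy's global random state.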

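A uniform distribution is available through the **uniform** object. With the loc and scale keywords, the distribution is uniform on the interval from loc to loc + scale. Let us consider the following example.

```python
from scipy.stats import uniform

# CDF of the uniform distribution on [1, 5]
print(uniform.cdf([0, 1, 2, 3, 4, 5], loc=1, scale=4))
```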

The above program will generate the following output.

array([ 0. , 0. , 0.25, 0.5 , 0.75, 1. ])
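To see where these values come from, the CDF of a uniform distribution on [loc, loc + scale] can be written as (x - loc) / scale, clipped to the range 0 to 1. The following sketch compares that hand-written formula with the SciPy result for the same inputs.

```python
import numpy as np
from scipy.stats import uniform

loc, scale = 1, 4
x = np.array([0, 1, 2, 3, 4, 5])

cdf_scipy = uniform.cdf(x, loc=loc, scale=scale)

# Closed-form CDF of the uniform distribution, clipped to [0, 1].
cdf_manual = np.clip((x - loc) / scale, 0.0, 1.0)

# The 0 and 1 quantiles give the end points of the support.
lower, upper = uniform.ppf([0, 1], loc=loc, scale=scale)

print(cdf_scipy)
print(np.allclose(cdf_scipy, cdf_manual))
print(lower, upper)
```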

For a discrete distribution, a random sample can be generated and its observed frequencies compared with the theoretical probabilities.

As an instance of the **rv_discrete** class, the **binom** object inherits from it a collection of generic methods and completes them with details specific to this particular distribution.
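As a hedged sketch of the binomial case, the following draws a random sample from **binom** and compares the observed frequencies with the probabilities given by pmf. The parameters n and p, the sample size, and the seed are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import binom

# Hypothetical parameters, chosen only for illustration.
n, p = 10, 0.3

# Draw a reproducible sample of binomial counts.
sample = binom.rvs(n, p, size=10000, random_state=0)

# Relative frequency of each possible outcome 0..n in the sample.
k = np.arange(n + 1)
observed = np.bincount(sample, minlength=n + 1) / sample.size

# Theoretical probabilities from the probability mass function.
expected = binom.pmf(k, n, p)

for ki, o, e in zip(k, observed, expected):
    print(f"{ki:2d}  observed={o:.4f}  pmf={e:.4f}")
```

With a sample of this size, the two columns should agree to about two decimal places.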

Basic statistics such as the minimum, maximum, mean, and variance take a NumPy array as input and return the respective results. A few basic statistical functions available in the **scipy.stats** package are described in the following table.

Sr. No. | Function & Description |
---|---|
1 | **describe** − Computes several descriptive statistics of the passed array |
2 | **gmean** − Computes the geometric mean along the specified axis |
3 | **hmean** − Calculates the harmonic mean along the specified axis |
4 | **kurtosis** − Computes the kurtosis |
5 | **mode** − Returns the modal value |
6 | **skew** − Computes the skewness of the data |
7 | **f_oneway** − Performs a 1-way ANOVA |
8 | **iqr** − Computes the interquartile range of the data along the specified axis |
9 | **zscore** − Calculates the z score of each value in the sample, relative to the sample mean and standard deviation |
10 | **sem** − Calculates the standard error of the mean (or standard error of measurement) of the values in the input array |
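The following sketch applies a few of the functions from the table to a small array. The data is arbitrary and serves only as an illustration.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# describe returns several statistics at once; its variance uses ddof=1.
summary = stats.describe(x)
print(summary.nobs, summary.minmax, summary.mean, summary.variance)

gm = stats.gmean(x)    # geometric mean
hm = stats.hmean(x)    # harmonic mean
z = stats.zscore(x)    # z scores relative to the sample mean and std
se = stats.sem(x)      # standard error of the mean (ddof=1 by default)
iqr = stats.iqr(x)     # interquartile range

print(gm, hm, se, iqr)
print(z)
```

Note that describe reports the unbiased variance (ddof=1), so its value differs from the one returned by the var method of a NumPy array, which uses ddof=0 by default.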

Several of these functions have a similar version in **scipy.stats.mstats** that works on masked arrays. The example below computes the basic statistics of a NumPy array using the array's own methods.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Maximum, minimum, mean and (population) variance of the array
print(x.max(), x.min(), x.mean(), x.var())
```

The above program will generate the following output.

9 1 5.0 6.666666666666667
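For the masked-array versions mentioned above, a minimal sketch might look like the following. The sentinel value -999 that marks missing data is a hypothetical convention chosen for this example.

```python
import numpy as np
from scipy.stats import mstats

# -999.0 is a hypothetical sentinel that marks a missing measurement.
raw = np.array([1.0, 2.0, -999.0, 4.0, 5.0])

# Mask the sentinel so that it is ignored in the statistics.
masked = np.ma.masked_values(raw, -999.0)

result = mstats.describe(masked)
print(result.nobs, result.minmax, result.mean)

# Without masking, the sentinel distorts the mean.
print(raw.mean())
```

Only the four unmasked values contribute to the masked statistics.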

Let us understand how the T-test is used in SciPy.

**ttest_1samp** − Calculates the T-test for the mean of ONE group of scores. This is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations **a** is equal to the given population mean, **popmean**. Let us consider the following example.

```python
from scipy import stats

# Two columns of 50 observations each, drawn from N(5, 10)
rvs = stats.norm.rvs(loc=5, scale=10, size=(50, 2))

# Test each column against the population mean 5.0
print(stats.ttest_1samp(rvs, 5.0))
```

The above program will generate output similar to the following; the exact values vary between runs.

Ttest_1sampResult(statistic = array([-1.40184894, 2.70158009]), pvalue = array([ 0.16726344, 0.00945234]))
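To make the statistic less of a black box, the sketch below recomputes it by hand as the difference between the sample mean and popmean, divided by the standard error of the mean, and then compares it with the result of ttest_1samp. The seed and sample size are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

# Reproducible sample; the seed 0 is an arbitrary choice.
sample = stats.norm.rvs(loc=5, scale=10, size=50, random_state=0)
popmean = 5.0

n = sample.size

# t = (sample mean - popmean) / (sample std / sqrt(n)), with ddof=1
t_manual = (sample.mean() - popmean) / (sample.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from Student's t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

result = stats.ttest_1samp(sample, popmean)

print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```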

In the following example, there are two samples, which can come either from the same distribution or from different distributions, and we want to test whether these samples have the same statistical properties.

**ttest_ind** − Calculates the T-test for the means of two independent samples of scores. This is a two-sided test for the null hypothesis that two independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default.

We can use this test if we observe two independent samples from the same or different populations. Let us consider the following example.

```python
from scipy import stats

# Two independent samples drawn from the same distribution, N(5, 10)
rvs1 = stats.norm.rvs(loc=5, scale=10, size=500)
rvs2 = stats.norm.rvs(loc=5, scale=10, size=500)

print(stats.ttest_ind(rvs1, rvs2))
```

The above program will generate output similar to the following; the exact values vary between runs.

Ttest_indResult(statistic = -0.67406312233650278, pvalue = 0.50042727502272966)

You can repeat the test with a new array of the same length but a different mean. Pass a different value for **loc** and run the test again.
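A sketch of that experiment could look like the following. The second sample uses loc = 10, and both that value and the seeds are assumptions made for illustration. The equal_var keyword of ttest_ind switches from the default equal-variance test to Welch's test, which does not assume identical variances.

```python
from scipy import stats

# Seeds and the shifted mean (loc=10) are arbitrary illustrative choices.
rvs1 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=1)
rvs3 = stats.norm.rvs(loc=10, scale=10, size=500, random_state=2)

# Default: assumes that both populations have the same variance.
same_var = stats.ttest_ind(rvs1, rvs3)

# Welch's t-test: does not assume equal population variances.
welch = stats.ttest_ind(rvs1, rvs3, equal_var=False)

print(same_var.statistic, same_var.pvalue)
print(welch.statistic, welch.pvalue)
```

With means this far apart, the p-value should be small, which leads to rejecting the null hypothesis of equal means.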
