Which is faster, NumPy or pandas?


Both NumPy and pandas are essential tools for data science and machine learning. pandas provides DataFrames, which resemble SQL tables and support tabular data analysis, while NumPy performs vector and matrix operations very efficiently.

pandas provides a number of C- or Cython-optimized functions that can be faster than the equivalent NumPy function (for example, reading data from text files).

For mathematical operations such as a dot product or calculating a mean, pandas DataFrames are generally slower than NumPy arrays, because pandas does extra work such as aligning labels and handling heterogeneous data.
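To make the label-alignment cost concrete, here is a minimal sketch. The Series names, values, and index labels are made up for illustration. It shows that pandas matches values by label before adding, while NumPy adds strictly by position.

```python
import numpy as np
import pandas as pd

# Two Series holding values under the same labels, but in a different order.
# (The names, values, and labels are illustrative only.)
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])

# pandas aligns on the index first, so "x" is paired with "x": 1 + 30.
aligned_sum = a + b

# NumPy ignores labels and pairs elements by position: 1 + 10, 2 + 20, 3 + 30.
positional_sum = a.to_numpy() + b.to_numpy()

print(aligned_sum.to_dict())
print(positional_sum.tolist())
```

The alignment step is useful for correctness on labelled data, but it is work that NumPy never has to do.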

Example

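# Run this in IPython or Jupyter: %timeit is an IPython magic, not plain Python.
import numpy as np
import pandas as pd

# 100 integers: 100, 101, ..., 199
array = np.arange(100, 200)

# Wrap the same data in a pandas Series
s = pd.Series(array)

print('Series object time: ', end='')
%timeit -n10 -r2 s.mean()

print('Numpy array time: ', end='')
%timeit -n10 -r2 np.mean(array)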

Explanation

Here we have created a NumPy array with 100 values ranging from 100 to 199, and then created a pandas Series object from that array.

We used the built-in IPython magic function %timeit to find the average time each call takes to calculate the mean of its object.

-n10 sets the number of loops per run, and -r2 sets the number of runs.
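Outside IPython, %timeit is not available. A rough equivalent can be sketched with the standard-library timeit module. The helper name time_per_loop is hypothetical, and its loops and runs parameters are chosen here to mirror -n10 and -r2.

```python
import timeit

import numpy as np
import pandas as pd

array = np.arange(100, 200)
s = pd.Series(array)


def time_per_loop(func, loops=10, runs=2):
    """Hypothetical helper: per-loop time in seconds for each run."""
    # timeit.repeat returns one total time per run, each covering `loops` calls.
    totals = timeit.repeat(func, number=loops, repeat=runs)
    return [total / loops for total in totals]


series_times = time_per_loop(s.mean)
array_times = time_per_loop(lambda: np.mean(array))

print("Series object time (s per loop):", series_times)
print("NumPy array time (s per loop):", array_times)
```

The absolute numbers depend on the machine and library versions, so they will not match the output shown in this article exactly.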

Output

Series object time: 225 µs ± 83 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Numpy array time: 33.1 µs ± 10.8 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)

In this run, the Series object took about 225 µs per loop to calculate the mean, compared with about 33 µs for the NumPy array, which is roughly seven times longer.

Example

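# Run this in IPython or Jupyter: %timeit is an IPython magic, not plain Python.
import numpy as np
import pandas as pd

# 100 integers: 100, 101, ..., 199
array = np.arange(100, 200)

# Wrap the same data in a pandas Series
s = pd.Series(array)

print('Series object time: ', end='')
%timeit -n10 -r2 s.std()

print('Numpy array time: ', end='')
%timeit -n10 -r2 np.std(array)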

Explanation

Here we have measured the time taken by both the NumPy array and the pandas Series object to calculate the standard deviation. The timings are shown in the output block below.

Output

Series object time: 443 µs ± 26.6 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Numpy array time: 104 µs ± 12.1 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
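One caveat about this comparison: the two calls do not compute quite the same quantity. Series.std() defaults to the sample standard deviation (ddof=1), while np.std() defaults to the population standard deviation (ddof=0). The sketch below, using the same data as the example, shows how to make them agree. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

array = np.arange(100, 200)
s = pd.Series(array)

sample_std = s.std()            # pandas default: ddof=1 (divide by n - 1)
population_std = np.std(array)  # NumPy default: ddof=0 (divide by n)

# Passing the same ddof to both makes the results match.
pandas_matches_numpy = np.isclose(s.std(ddof=0), population_std)
numpy_matches_pandas = np.isclose(np.std(array, ddof=1), sample_std)

print(sample_std, population_std)
print(pandas_matches_numpy, numpy_matches_pandas)
```

For a like-for-like benchmark, it is reasonable to set ddof explicitly in both calls.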

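As the two examples above show, pandas took longer on average than NumPy for the same calculation. With only 100 elements, much of that gap is likely pandas' fixed per-call overhead rather than the arithmetic itself.

Construction is slower as well. In one measurement, creating a pandas DataFrame took approximately 6000 times longer than creating a NumPy array, because pandas spends extra time setting up the index labels.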
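The construction cost can be checked on your own machine with a short sketch like the one below. The input list, the column name "value", and the loop count are assumptions made for illustration, and the ratio you see will depend on data size, hardware, and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Illustrative input: a plain Python list of 100 integers.
values = list(range(100, 200))
loops = 200

# Total time for `loops` constructions of each object.
array_time = timeit.timeit(lambda: np.array(values), number=loops)
frame_time = timeit.timeit(lambda: pd.DataFrame({"value": values}), number=loops)

print("NumPy array (s per construction):", array_time / loops)
print("pandas DataFrame (s per construction):", frame_time / loops)
print("DataFrame / array ratio:", frame_time / array_time)
```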

raja
Updated on 18-Nov-2021 06:24:19
