Which is faster, NumPy or pandas?
Both NumPy and pandas are essential tools for data science and machine learning. pandas provides DataFrames, which resemble SQL tables and support tabular data analysis, while NumPy performs vector and matrix operations very efficiently.
pandas provides a number of C- or Cython-optimized functions that can be faster than their NumPy equivalents (e.g. reading data from text files).
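As a rough illustration of the text-reading case, the sketch below times pd.read_csv against np.loadtxt on the same in-memory CSV. The data (5,000 rows of random numbers), the helper names read_with_pandas and read_with_numpy, and the repeat counts are assumptions made for this example. Which reader wins, and by how much, depends on the file size and on the installed NumPy and pandas versions.

```python
import io
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 5,000 rows with 3 numeric columns, held in memory as CSV text.
rng = np.random.default_rng(0)
data = rng.random((5_000, 3))
csv_text = "\n".join(",".join(f"{v:.6f}" for v in row) for row in data)


def read_with_pandas():
    # pandas parses the text with its C-based CSV reader.
    return pd.read_csv(io.StringIO(csv_text), header=None).to_numpy()


def read_with_numpy():
    # NumPy's general-purpose text loader.
    return np.loadtxt(io.StringIO(csv_text), delimiter=",")


# Both readers should recover the same values.
parsed_pd = read_with_pandas()
parsed_np = read_with_numpy()

# Best of 3 runs, 5 calls per run; divide by 5 to get the time per call.
t_pd = min(timeit.repeat(read_with_pandas, number=5, repeat=3)) / 5
t_np = min(timeit.repeat(read_with_numpy, number=5, repeat=3)) / 5

print(f"pandas read_csv: {t_pd * 1e3:.2f} ms per call")
print(f"numpy loadtxt:   {t_np * 1e3:.2f} ms per call")
```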
For mathematical operations such as a dot product or a mean, pandas DataFrames are generally slower than NumPy arrays, because pandas does extra work such as aligning labels and handling heterogeneous data.
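To make "aligning labels" concrete, here is a small sketch with two Series that hold the same labels in a different order. The labels and values are made up for illustration. pandas matches values by label before adding, while NumPy simply adds by position, which is part of why the NumPy path has less work to do.

```python
import numpy as np
import pandas as pd

# Two Series with the same labels in a different order (illustrative values).
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])

# pandas pairs values by label: x -> 1 + 30, y -> 2 + 20, z -> 3 + 10.
aligned = a + b

# NumPy ignores labels and adds element by element in stored order.
positional = a.to_numpy() + b.to_numpy()

print(aligned.to_dict())
print(positional.tolist())
```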
Example
import numpy as np
import pandas as pd
array = np.arange(100, 200)
s = pd.Series(array)
print('Series object time: ', end='')
%timeit -n10 -r2 s.mean()
print('Numpy array time: ', end='')
%timeit -n10 -r2 np.mean(array)
Explanation
Here we have created a NumPy array with 100 values ranging from 100 to 199 and created a pandas Series object from that array.
We used the built-in IPython magic function %timeit to find the average time each call takes to calculate the mean of its object's data.
-n10 sets the number of loops per run to 10, and -r2 sets the number of runs to 2.
Output
Series object time: 225 µs ± 83 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Numpy array time: 33.1 µs ± 10.8 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
In this run, the Series object took about 225 µs per loop to calculate the mean, while the NumPy array took about 33 µs, which is roughly seven times faster.
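The %timeit magic is only available inside IPython or Jupyter. In a plain Python script, a similar measurement can be sketched with the standard-library timeit module, where number=10 plays the role of -n10 and repeat=2 the role of -r2. This sketch reports the best run rather than the mean ± std. dev. that %timeit prints, so the figures will not match the output above exactly.

```python
import timeit

import numpy as np
import pandas as pd

array = np.arange(100, 200)
s = pd.Series(array)

# number=10 mirrors -n10 (calls per run); repeat=2 mirrors -r2 (number of runs).
series_runs = timeit.repeat(s.mean, number=10, repeat=2)
numpy_runs = timeit.repeat(lambda: np.mean(array), number=10, repeat=2)

# Each entry is the total time for 10 calls, so divide to get the time per call.
series_per_call = [t / 10 for t in series_runs]
numpy_per_call = [t / 10 for t in numpy_runs]

print(f"Series object time: {min(series_per_call) * 1e6:.1f} µs per call")
print(f"Numpy array time:   {min(numpy_per_call) * 1e6:.1f} µs per call")
```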
Example
import numpy as np
import pandas as pd
array = np.arange(100, 200)
s = pd.Series(array)
print('Series object time: ', end='')
%timeit -n10 -r2 s.std()
print('Numpy array time: ', end='')
%timeit -n10 -r2 np.std(array)
Explanation
Here we have measured the time taken by both the NumPy array and the pandas Series object to calculate the standard deviation. The timings appear in the output block below.
Output
Series object time: 443 µs ± 26.6 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
Numpy array time: 104 µs ± 12.1 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
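One caveat about the standard deviation comparison: the two calls do not compute quite the same quantity by default. Series.std() uses ddof=1 (sample standard deviation), while np.std() uses ddof=0 (population standard deviation). The sketch below shows the difference on the same data and how passing ddof=1 to NumPy makes the results agree. The variable names are chosen for this example only.

```python
import numpy as np
import pandas as pd

array = np.arange(100, 200)
s = pd.Series(array)

sample_std = s.std()                 # pandas default: ddof=1
population_std = np.std(array)       # NumPy default: ddof=0
matched_std = np.std(array, ddof=1)  # same definition as the pandas default

print(f"pandas s.std():          {sample_std:.4f}")
print(f"numpy np.std():          {population_std:.4f}")
print(f"numpy np.std(ddof=1):    {matched_std:.4f}")
```

For a like-for-like timing, it may be fairer to compare s.std() with np.std(array, ddof=1).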
As the two examples above show, pandas takes longer on average than NumPy for these calculations.
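Construction cost differs as well. Creating a pandas DataFrame is reported to take approximately 6000 times longer than creating a NumPy array, because pandas spends extra time setting up the index labels. The exact ratio depends on the data size and library versions.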
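The construction cost can be checked on your own machine with a sketch like the one below. It builds a NumPy array and a one-column DataFrame from the same 100 values and times both with the standard-library timeit module. The helper names make_array and make_frame, the column name "value", and the repeat counts are assumptions for this example, and the ratio it prints may differ considerably from the figure quoted above.

```python
import timeit

import numpy as np
import pandas as pd

# The same 100 values used in the earlier examples, as a plain Python list.
values = list(range(100, 200))


def make_array():
    return np.array(values)


def make_frame():
    # pandas also builds a default RangeIndex and column labels here.
    return pd.DataFrame({"value": values})


# Best of 3 runs, 100 calls per run; divide by 100 to get the time per call.
t_array = min(timeit.repeat(make_array, number=100, repeat=3)) / 100
t_frame = min(timeit.repeat(make_frame, number=100, repeat=3)) / 100

print(f"NumPy array creation: {t_array * 1e6:.1f} µs per call")
print(f"DataFrame creation:   {t_frame * 1e6:.1f} µs per call")
print(f"Ratio (DataFrame / array): {t_frame / t_array:.1f}x")
```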