Pandas is one of the most widely used tools in data science and machine learning, where it is used for data cleaning and analysis.
Real-world data is often messy, and pandas is well suited to handling it. Pandas is an open-source Python package built on top of NumPy.
Pandas handles data quickly and effectively through its two core data structures, the Series and the DataFrame, which let you manipulate data in various ways.
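To make the two data structures concrete, here is a minimal sketch. The values are the first three Life Ladder scores from the happiness report shown later in this post; the variable names scores and frame are made up for illustration.

```python
import pandas as pd

# A Series is a one-dimensional labeled array; here the labels are years.
scores = pd.Series(
    [3.724, 4.402, 4.758],
    index=[2008, 2009, 2010],
    name="Life Ladder",
)

# A DataFrame is a two-dimensional table; each of its columns is a Series.
frame = pd.DataFrame({
    "year": [2008, 2009, 2010],
    "Life Ladder": [3.724, 4.402, 4.758],
})

print(scores.loc[2009])               # label-based lookup on a Series
print(type(frame["Life Ladder"]))     # selecting one column gives a Series
print(frame.shape)                    # (rows, columns)
```

Selecting a single column from a DataFrame gives back a Series, so most of what you learn about one structure carries over to the other.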
These features make pandas a strong choice for handling data. It can detect and fill in missing data, clean up data, and work with many formats and sources, such as CSV files, Excel files, and SQL databases.
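As a rough illustration of the missing-data handling mentioned above, the sketch below reads a tiny CSV from an in-memory string instead of a file. The column names and values are invented for the example, and filling with the column mean is only one possible strategy.

```python
import io
import pandas as pd

# Hypothetical CSV content; the second row has no score.
csv_text = """country,year,score
A,2008,3.7
A,2009,
B,2008,5.1
"""

df = pd.read_csv(io.StringIO(csv_text))

# Count the missing values in each column.
missing_per_column = df.isna().sum()

# Option 1: fill the missing score with the mean of the known scores.
filled = df.fillna({"score": df["score"].mean()})

# Option 2: drop every row that contains a missing value.
dropped = df.dropna()

print(missing_per_column)
print(filled)
print(dropped)
```

Whether to fill or drop depends on the data set; both calls return a new DataFrame and leave df unchanged.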
Let’s take an example and see how pandas reads CSV data.
import pandas as pd

data = pd.read_csv('world-happiness-report.csv')
print(data.shape)
data.head()
In the code above, the variable data stores the CSV data, a world happiness report (downloaded from Kaggle datasets), loaded with the read_csv function from the pandas package. data.shape returns a tuple with the number of rows and the number of columns, in that order.
  Country name  year  Life Ladder  Log GDP per capita  Social support  \
0  Afghanistan  2008        3.724               7.370           0.451
1  Afghanistan  2009        4.402               7.540           0.552
2  Afghanistan  2010        4.758               7.647           0.539
3  Afghanistan  2011        3.832               7.620           0.521
4  Afghanistan  2012        3.783               7.705           0.521

   Healthy life expectancy at birth  Freedom to make life choices  Generosity  \
0                             50.80                         0.718       0.168
1                             51.20                         0.679       0.190
2                             51.60                         0.600       0.121
3                             51.92                         0.496       0.162
4                             52.24                         0.531       0.236

   Perceptions of corruption  Positive affect  Negative affect
0                      0.882            0.518            0.258
1                      0.850            0.584            0.237
2                      0.707            0.618            0.275
3                      0.731            0.611            0.267
4                      0.776            0.710            0.268
The block above shows the first 5 rows of the world happiness report data set, as displayed by the pandas DataFrame.head() method.
Pandas has many more features that help us work with large data sets in both machine learning and data science, including merging and joining data sets, visualization, grouping, and masking. It is also very helpful for performing mathematical operations on our data sets.
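A short sketch of three of these features (merging, grouping, and masking) might look like the following. Both tables and all of their values are invented for illustration and are not taken from the happiness report.

```python
import pandas as pd

# Hypothetical scores per country and year.
happiness = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2008, 2009, 2008, 2009],
    "score": [3.7, 4.4, 5.1, 5.3],
})

# Hypothetical lookup table mapping each country to a region.
regions = pd.DataFrame({
    "country": ["A", "B"],
    "region": ["South", "North"],
})

# Merging: attach the region to every row by matching on "country".
merged = happiness.merge(regions, on="country", how="left")

# Grouping: average score per region.
mean_by_region = merged.groupby("region")["score"].mean()

# Masking: keep only the rows where a boolean condition holds.
high = merged[merged["score"] > 5.0]

print(merged)
print(mean_by_region)
print(high)
```

The boolean expression merged["score"] > 5.0 is itself a Series of True and False values, and indexing with it keeps only the rows marked True.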
Let’s take another example and see how to create an output file using pandas.
data.to_json('output_file.json')
DataFrame.to_json is a pandas method that creates a JSON file from our pandas DataFrame object (data). When a file path is passed, it writes the file and returns None.
The resulting JSON file is created in our working directory with the name output_file and the extension .json (in the example above).
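To check that the file was written correctly, one option is to read it back with pd.read_json. The sketch below uses a small made-up DataFrame and a temporary directory so that it does not touch your working directory; orient="records" is just one of the layouts that to_json supports.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data standing in for the happiness report.
data = pd.DataFrame({"country": ["A", "B"], "score": [3.7, 5.1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output_file.json")

    # Writing to a path returns None; the JSON goes to the file.
    result = data.to_json(path, orient="records")

    # Read the file back into a new DataFrame.
    restored = pd.read_json(path)

print(result)
print(restored)
```

With orient="records", the file holds a list of objects, one per row, which is a layout many other tools can read directly.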
These are some of the reasons why we need Python pandas.