How can data be summarized in Pandas Python?
Pandas provides several methods for summarizing data and extracting basic statistics. The most general is describe(), which generates descriptive statistics for the numerical columns of a DataFrame.
For each numerical column, describe() reports the count, mean, standard deviation, minimum, quartiles (25th, 50th, and 75th percentiles), and maximum.
Syntax
DataFrame.describe(percentiles=None, include=None, exclude=None)
Basic Data Summarization
Here's how to use describe() to get a complete statistical summary:
import pandas as pd
# Create sample data
data = {
    'Name': pd.Series(['Tom', 'Jane', 'Vin', 'Eve', 'Will']),
    'Age': pd.Series([45, 67, 89, 12, 23]),
    'Value': pd.Series([8.79, 23.24, 31.98, 78.56, 90.20])
}
df = pd.DataFrame(data)
print("The DataFrame is:")
print(df)
print("\nThe description of data is:")
print(df.describe())
The DataFrame is:
   Name  Age  Value
0   Tom   45   8.79
1  Jane   67  23.24
2   Vin   89  31.98
3   Eve   12  78.56
4  Will   23  90.20

The description of data is:
             Age      Value
count   5.000000   5.000000
mean   47.200000  46.554000
std    31.499206  35.747102
min    12.000000   8.790000
25%    23.000000  23.240000
50%    45.000000  31.980000
75%    67.000000  78.560000
max    89.000000  90.200000
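The percentiles parameter shown in the syntax controls which percentile rows appear in the summary. The following is a minimal sketch that reuses the Age and Value columns from the example above; the choice of the 10th and 90th percentiles is only an illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [45, 67, 89, 12, 23],
    'Value': [8.79, 23.24, 31.98, 78.56, 90.20],
})

# Request the 10th, 50th, and 90th percentiles instead of the default quartiles.
# Percentiles are given as fractions between 0 and 1.
summary = df.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```

The resulting rows are labeled 10%, 50%, and 90% in place of 25%, 50%, and 75%.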
Including All Data Types
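By default, describe() summarizes only numerical columns. Pass include='all' to summarize every column. For non-numerical columns it reports count, unique, top (the most frequent value), and freq; when several values tie for most frequent, the one shown as top is chosen arbitrarily: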
import pandas as pd
data = {
    'Name': ['Tom', 'Jane', 'Vin', 'Eve', 'Will'],
    'Age': [45, 67, 89, 12, 23],
    'City': ['NY', 'LA', 'NY', 'Boston', 'LA']
}
df = pd.DataFrame(data)
print("Description of all columns:")
print(df.describe(include='all'))
Description of all columns:
       Name        Age City
count     5   5.000000    5
unique    5        NaN    3
top     Eve        NaN   LA
freq      1        NaN    2
mean    NaN  47.200000  NaN
std     NaN  31.499206  NaN
min     NaN  12.000000  NaN
25%     NaN  23.000000  NaN
50%     NaN  45.000000  NaN
75%     NaN  67.000000  NaN
max     NaN  89.000000  NaN
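The include and exclude parameters also accept dtype selections, which avoids the NaN-filled rows that include='all' produces. The sketch below assumes the same Name, Age, and City columns as above and uses np.number as the dtype selector; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tom', 'Jane', 'Vin', 'Eve', 'Will'],
    'Age': [45, 67, 89, 12, 23],
    'City': ['NY', 'LA', 'NY', 'Boston', 'LA'],
})

# Summarize only the non-numerical columns (count, unique, top, freq)
text_summary = df.describe(exclude=[np.number])
print(text_summary)

# Summarize only the numerical columns (same as the default behaviour here)
numeric_summary = df.describe(include=[np.number])
print(numeric_summary)
```

Splitting the summary this way keeps each table limited to the statistics that apply to its columns.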
Other Summary Methods
Pandas offers additional methods for specific summary statistics:
import pandas as pd
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50]
})
print("Mean:")
print(df.mean())
print("\nSum:")
print(df.sum())
print("\nInfo about DataFrame:")
# info() prints its report itself and returns None, so it is not wrapped in print()
df.info()
Mean:
A     3.0
B    30.0
dtype: float64

Sum:
A     15
B    150
dtype: int64

Info about DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       5 non-null      int64
 1   B       5 non-null      int64
dtypes: int64(2)
memory usage: 208.0 bytes
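When several specific statistics are needed at once, agg() can compute them in a single call. This is a small sketch on the same A and B columns; the particular statistics chosen here are only an example.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
})

# Apply the same list of statistics to every column;
# the result is a DataFrame with one row per statistic
stats = df.agg(['min', 'median', 'max'])
print(stats)

# Apply a different statistic to each column;
# the result is a Series indexed by column name
custom = df.agg({'A': 'sum', 'B': 'mean'})
print(custom)
```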
Understanding the Statistics
| Statistic | Description |
|---|---|
| count | Number of non-null values |
| mean | Average value |
| std | Sample standard deviation |
| min | Smallest value |
| 25% | First quartile (Q1) |
| 50% | Median (Q2) |
| 75% | Third quartile (Q3) |
| max | Largest value |
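Each row of the describe() output can be cross-checked against the corresponding individual method. The sketch below does this for the Age values used earlier; note that std() in pandas defaults to ddof=1, so the value reported is the sample standard deviation. The checks dictionary is a hypothetical helper for this illustration.

```python
import pandas as pd

age = pd.Series([45, 67, 89, 12, 23], name='Age')
summary = age.describe()

# Recompute each statistic with its dedicated method
checks = {
    'count': age.count(),
    'mean': age.mean(),
    'std': age.std(),            # sample standard deviation (ddof=1 by default)
    '25%': age.quantile(0.25),
    '50%': age.median(),
    '75%': age.quantile(0.75),
}

# Print the describe() value next to the independently computed one
for name, value in checks.items():
    print(name, summary[name], value)
```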
Conclusion
The describe() function gives a quick statistical overview of a DataFrame. Pass include='all' to summarize non-numerical columns alongside numerical ones, and use methods such as mean(), sum(), and info() when you need a specific statistic or a look at the dataset's structure.
