How can data be summarized in Pandas Python?

Pandas provides several methods for summarizing data and extracting statistical insights. The most comprehensive is describe(), which generates descriptive statistics for the numerical columns of a DataFrame.

For each numerical column, describe() reports the count, mean, standard deviation, minimum, quartiles (25th, 50th, and 75th percentiles), and maximum.

Syntax

DataFrame.describe(percentiles=None, include=None, exclude=None)
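
As a minimal sketch of the percentiles parameter, the snippet below requests the 10th, 50th, and 90th percentiles in place of the default quartiles. The column name and values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({'Score': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})

# Request the 10th, 50th, and 90th percentiles instead of the default quartiles
summary = df.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```

Percentiles are given as fractions between 0 and 1, and each one appears as its own row in the result.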

Basic Data Summarization

Here's how to use describe() to get a complete statistical summary:

import pandas as pd

# Create sample data
data = {
    'Name': pd.Series(['Tom', 'Jane', 'Vin', 'Eve', 'Will']),
    'Age': pd.Series([45, 67, 89, 12, 23]),
    'Value': pd.Series([8.79, 23.24, 31.98, 78.56, 90.20])
}

df = pd.DataFrame(data)
print("The DataFrame is:")
print(df)
print("\nThe description of data is:")
print(df.describe())

The DataFrame is:
   Name  Age   Value
0   Tom   45    8.79
1  Jane   67   23.24
2   Vin   89   31.98
3   Eve   12   78.56
4  Will   23   90.20

The description of data is:
             Age       Value
count   5.000000    5.000000
mean   47.200000   46.554000
std    31.499206   35.747102
min    12.000000    8.790000
25%    23.000000   23.240000
50%    45.000000   31.980000
75%    67.000000   78.560000
max    89.000000   90.200000

Including All Data Types

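By default, describe() only summarizes numerical columns. Pass include='all' to summarize every column. For text columns it reports count, unique (number of distinct values), top (most frequent value), and freq (how often top occurs). When several values tie for the highest count, the one reported as top is chosen arbitrarily, so your output may show a different value: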

import pandas as pd

data = {
    'Name': ['Tom', 'Jane', 'Vin', 'Eve', 'Will'],
    'Age': [45, 67, 89, 12, 23],
    'City': ['NY', 'LA', 'NY', 'Boston', 'LA']
}

df = pd.DataFrame(data)
print("Description of all columns:")
print(df.describe(include='all'))

Description of all columns:
         Name        Age       City
count       5   5.000000          5
unique      5        NaN          3
top       Eve        NaN         LA
freq        1        NaN          2
mean      NaN  47.200000        NaN
std       NaN  31.499206        NaN
min       NaN  12.000000        NaN
25%       NaN  23.000000        NaN
50%       NaN  45.000000        NaN
75%       NaN  67.000000        NaN
max       NaN  89.000000        NaN
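
The include and exclude parameters also accept lists of dtypes. The following sketch, which reuses the sample data above, assumes NumPy is available and uses np.number to select or skip the numeric columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Tom', 'Jane', 'Vin', 'Eve', 'Will'],
    'Age': [45, 67, 89, 12, 23],
    'City': ['NY', 'LA', 'NY', 'Boston', 'LA']
})

# Summarize only the numeric columns (the same columns as the default behavior)
numeric_summary = df.describe(include=[np.number])

# Summarize everything except the numeric columns
text_summary = df.describe(exclude=[np.number])

print(numeric_summary)
print(text_summary)
```

Splitting the summary this way avoids the NaN padding that include='all' produces when numeric and text statistics share one table.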

Other Summary Methods

Pandas offers additional methods for specific summary statistics:

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50]
})

print("Mean:")
print(df.mean())
print("\nSum:")
print(df.sum())
print("\nInfo about DataFrame:")
# info() prints its report directly and returns None, so call it without print()
df.info()

Mean:
A     3.0
B    30.0
dtype: float64

Sum:
A     15
B    150
dtype: int64

Info about DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       5 non-null      int64
 1   B       5 non-null      int64
dtypes: int64(2)
memory usage: 208.0 bytes
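
When only a few specific statistics are needed, one option is agg(), which accepts a list of function names. This is a small sketch using the same sample DataFrame; the choice of statistics is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50]
})

# Compute a chosen set of statistics for every column in a single call
summary = df.agg(['min', 'max', 'median'])
print(summary)
```

The result is a DataFrame with one row per requested statistic and one column per original column.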

Understanding the Statistics

Statistic	Description
`count`	Number of non-null values
`mean`	Arithmetic mean
`std`	Sample standard deviation (ddof=1)
`min`	Smallest value
`25%`	First quartile (Q1)
`50%`	Median (Q2)
`75%`	Third quartile (Q3)
`max`	Largest value
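
To make these definitions concrete, the sketch below recomputes a few of the statistics by hand for the Age values from the first example and compares them with the built-in Series methods. The variable names are illustrative only.

```python
import math
import pandas as pd

ages = pd.Series([45, 67, 89, 12, 23])

# std is the sample standard deviation: squared deviations divided by n - 1
manual_std = math.sqrt(((ages - ages.mean()) ** 2).sum() / (len(ages) - 1))
print(manual_std, ages.std())

# The 25%, 50%, and 75% rows correspond to these three quantiles
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
print(q1, median, q3)

# count ignores missing values
with_gap = pd.Series([1.0, None, 3.0])
print(with_gap.count())
```

The hand-computed standard deviation agrees with ages.std(), and the three quantiles match the 25%, 50%, and 75% rows reported by describe() for the Age column.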

Conclusion

The describe() function is essential for quick data exploration in Pandas. Use include='all' to analyze both numerical and categorical columns, providing comprehensive insights into your dataset's structure and distribution.

AmitDiwan

Updated on: 2026-03-25T13:15:48+05:30
