Article Categories
- All Categories
-
Data Structure
-
Networking
-
RDBMS
-
Operating System
-
Java
-
MS Excel
-
iOS
-
HTML
-
CSS
-
Android
-
Python
-
C Programming
-
C++
-
C#
-
MongoDB
-
MySQL
-
Javascript
-
PHP
-
Economics & Finance
Python Pandas - Filtering columns from a DataFrame on the basis of sum
In Pandas, you can filter DataFrame columns based on their sum values using the loc[] indexer with conditional logic. This technique is useful when you want to select only columns that meet certain aggregate criteria.
Creating the DataFrame
First, let's create a sample DataFrame with student marks ?
import pandas as pd
# Create a DataFrame with student marks
df = pd.DataFrame({
'Jacob_Marks': [95, 90, 75, 85, 88],
'Ted_Marks': [60, 50, 65, 85, 70],
'Jamie_Marks': [77, 76, 65, 45, 50]
})
print("Original DataFrame:")
print(df)
Original DataFrame: Jacob_Marks Ted_Marks Jamie_Marks 0 95 60 77 1 90 50 76 2 75 65 65 3 85 85 45 4 88 70 50
Checking Column Sums
Let's examine the sum of each column to understand our filtering criteria ?
import pandas as pd
df = pd.DataFrame({
'Jacob_Marks': [95, 90, 75, 85, 88],
'Ted_Marks': [60, 50, 65, 85, 70],
'Jamie_Marks': [77, 76, 65, 45, 50]
})
print("Column sums:")
print(df.sum(axis=0))
Column sums: Jacob_Marks 433 Ted_Marks 330 Jamie_Marks 313 dtype: int64
Filtering Columns by Sum
Now we'll filter to keep only columns where the total sum is greater than 400 ?
import pandas as pd
df = pd.DataFrame({
'Jacob_Marks': [95, 90, 75, 85, 88],
'Ted_Marks': [60, 50, 65, 85, 70],
'Jamie_Marks': [77, 76, 65, 45, 50]
})
print("Original DataFrame:")
print(df)
# Filter columns where sum > 400
filtered_df = df.loc[:, df.sum(axis=0) > 400]
print("\nFiltered DataFrame (columns with sum > 400):")
print(filtered_df)
Original DataFrame: Jacob_Marks Ted_Marks Jamie_Marks 0 95 60 77 1 90 50 76 2 75 65 65 3 85 85 45 4 88 70 50 Filtered DataFrame (columns with sum > 400): Jacob_Marks 0 95 1 90 2 75 3 85 4 88
How It Works
The filtering process works in three steps:
-
df.sum(axis=0)calculates the sum for each column -
df.sum(axis=0) > 400creates a boolean Series indicating which columns meet the criteria -
df.loc[:, boolean_series]selects all rows and only the columns where the condition is True
Alternative Approach
You can also store the condition separately for better readability ?
import pandas as pd
df = pd.DataFrame({
'Jacob_Marks': [95, 90, 75, 85, 88],
'Ted_Marks': [60, 50, 65, 85, 70],
'Jamie_Marks': [77, 76, 65, 45, 50]
})
# Create boolean condition
condition = df.sum(axis=0) > 400
print("Condition (sum > 400):")
print(condition)
# Apply the filter
result = df.loc[:, condition]
print("\nFiltered result:")
print(result)
Condition (sum > 400): Jacob_Marks True Ted_Marks False Jamie_Marks False dtype: bool Filtered result: Jacob_Marks 0 95 1 90 2 75 3 85 4 88
Conclusion
Use df.loc[:, df.sum(axis=0) > threshold] to filter DataFrame columns based on their sum values. This approach combines column aggregation with boolean indexing for efficient data filtering.
