Article Categories
- All Categories
-
Data Structure
-
Networking
-
RDBMS
-
Operating System
-
Java
-
MS Excel
-
iOS
-
HTML
-
CSS
-
Android
-
Python
-
C Programming
-
C++
-
C#
-
MongoDB
-
MySQL
-
Javascript
-
PHP
-
Economics & Finance
How to Collapse Multiple Columns in Python Pandas?
Pandas is a popular data manipulation library in Python that is widely used for working with structured data. One common task when working with data is to clean and transform it in order to prepare it for analysis.
Sometimes, the data might contain multiple columns that have similar information or are related to each other. In such cases, it might be useful to collapse these columns into a single column for easier analysis or visualization.
Pandas provides several methods to collapse multiple columns into a single column. In this tutorial, we will explore the two most common methods: melt() and stack().
Using the melt() Method
The melt() method in Pandas is a powerful tool for transforming wide data into long data. It unpivots a DataFrame from wide format to long format by collapsing multiple columns into key-value pairs.
Syntax
DataFrame.melt(id_vars=None, value_vars=None, var_name=None, value_name='value')
Parameters
id_vars Columns to keep as identifier variables
value_vars Columns to collapse into a single column
var_name Name for the new variable column
value_name Name for the new value column
Example
Let's create a DataFrame with income data across different years and collapse the income columns ?
import pandas as pd
# Create sample data
data = {
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'income_2019': [50000, 60000, 70000],
'income_2020': [55000, 65000, 75000]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Melt the income columns
melted_df = df.melt(
id_vars=['name', 'age'],
value_vars=['income_2019', 'income_2020'],
var_name='year',
value_name='income'
)
print("\nAfter melting income columns:")
print(melted_df)
Original DataFrame:
name age income_2019 income_2020
0 Alice 25 50000 55000
1 Bob 30 60000 65000
2 Charlie 35 70000 75000
After melting income columns:
name age year income
0 Alice 25 income_2019 50000
1 Bob 30 income_2019 60000
2 Charlie 35 income_2019 70000
3 Alice 25 income_2020 55000
4 Bob 30 income_2020 65000
5 Charlie 35 income_2020 75000
Using the stack() Method
The stack() method is another way to collapse multiple columns. It pivots a level of column labels to create a hierarchical index and returns a Series or DataFrame with a MultiIndex.
Example
Let's demonstrate stack() with a DataFrame that has hierarchical column structure ?
import pandas as pd
# Create DataFrame with hierarchical columns
data = {
('Sales', 'Q1'): [100, 120, 90],
('Sales', 'Q2'): [110, 130, 95],
('Profit', 'Q1'): [20, 25, 18],
('Profit', 'Q2'): [22, 28, 19]
}
df = pd.DataFrame(data, index=['Store_A', 'Store_B', 'Store_C'])
df.columns = pd.MultiIndex.from_tuples(df.columns)
print("Original DataFrame with MultiIndex columns:")
print(df)
# Stack the columns
stacked_df = df.stack()
print("\nAfter stacking:")
print(stacked_df)
Original DataFrame with MultiIndex columns:
Sales Profit
Q1 Q2 Q1 Q2
Store_A 100 110 20 22
Store_B 120 130 25 28
Store_C 90 95 18 19
After stacking:
Profit Sales
Store_A Q1 20 100
Q2 22 110
Store_B Q1 25 120
Q2 28 130
Store_C Q1 18 90
Q2 19 95
Simple stack() Example
Here's a simpler example using stack() with regular columns ?
import pandas as pd
# Create simple DataFrame
data = {
'Product_A': [10, 20, 30],
'Product_B': [15, 25, 35],
'Product_C': [12, 22, 32]
}
df = pd.DataFrame(data, index=['Jan', 'Feb', 'Mar'])
print("Original DataFrame:")
print(df)
# Stack the columns
stacked = df.stack()
print("\nAfter stacking:")
print(stacked)
# Convert back to DataFrame
stacked_df = stacked.reset_index()
stacked_df.columns = ['Month', 'Product', 'Sales']
print("\nAs DataFrame:")
print(stacked_df)
Original DataFrame:
Product_A Product_B Product_C
Jan 10 15 12
Feb 20 25 22
Mar 30 35 32
After stacking:
Jan Product_A 10
Product_B 15
Product_C 12
Feb Product_A 20
Product_B 25
Product_C 22
Mar Product_A 30
Product_B 35
Product_C 32
dtype: int64
As DataFrame:
Month Product Sales
0 Jan Product_A 10
1 Jan Product_B 15
2 Jan Product_C 12
3 Feb Product_A 20
4 Feb Product_B 25
5 Feb Product_C 22
6 Mar Product_A 30
7 Mar Product_B 35
8 Mar Product_C 32
Comparison
| Method | Best For | Output Type | Control |
|---|---|---|---|
melt() |
Wide to long transformation | DataFrame | More control over column names |
stack() |
Hierarchical data structures | Series/DataFrame | Preserves MultiIndex structure |
Conclusion
Use melt() for transforming wide data to long format with full control over column naming. Use stack() when working with hierarchical data or when you need to preserve MultiIndex structures. Both methods are essential tools for data reshaping in pandas.
