Article Categories
- All Categories
-
Data Structure
-
Networking
-
RDBMS
-
Operating System
-
Java
-
MS Excel
-
iOS
-
HTML
-
CSS
-
Android
-
Python
-
C Programming
-
C++
-
C#
-
MongoDB
-
MySQL
-
Javascript
-
PHP
-
Economics & Finance
Write a program in Python to merge two dataframes based on matching data in a column
In data analysis, you often need to combine data from multiple DataFrames based on common columns. Pandas provides the merge() function to join DataFrames similar to SQL joins.
Understanding DataFrame Merging
When you have two DataFrames with a common column, you can merge them to create a single DataFrame containing matching records. The merge operation finds rows where the specified column values match between both DataFrames.
Creating Sample DataFrames
Let's start by creating two DataFrames with a common 'id' column ?
import pandas as pd
# First DataFrame
first_df = pd.DataFrame({
'id': [1, 2, 3, 4],
'country': ['India', 'UK', 'US', 'China']
})
print("First DataFrame:")
print(first_df)
# Second DataFrame
second_df = pd.DataFrame({
'id': [1, 11, 22, 4],
'City': ['Chennai', 'Cambridge', 'Chicago', 'Chengdu']
})
print("\nSecond DataFrame:")
print(second_df)
First DataFrame: id country 0 1 India 1 2 UK 2 3 US 3 4 China Second DataFrame: id City 0 1 Chennai 1 11 Cambridge 2 22 Chicago 3 4 Chengdu
Merging DataFrames on Common Column
Use pd.merge() with the on parameter to specify the common column ?
import pandas as pd
# Create DataFrames
first_df = pd.DataFrame({
'id': [1, 2, 3, 4],
'country': ['India', 'UK', 'US', 'China']
})
second_df = pd.DataFrame({
'id': [1, 11, 22, 4],
'City': ['Chennai', 'Cambridge', 'Chicago', 'Chengdu']
})
# Merge based on 'id' column
merged_df = pd.merge(first_df, second_df, how='inner', on='id')
print("Merged DataFrame:")
print(merged_df)
Merged DataFrame: id country City 0 1 India Chennai 1 4 China Chengdu
Types of Joins
The how parameter determines the join type ?
import pandas as pd
first_df = pd.DataFrame({
'id': [1, 2, 3, 4],
'country': ['India', 'UK', 'US', 'China']
})
second_df = pd.DataFrame({
'id': [1, 11, 22, 4],
'City': ['Chennai', 'Cambridge', 'Chicago', 'Chengdu']
})
# Inner join (default) - only matching records
inner_join = pd.merge(first_df, second_df, how='inner', on='id')
print("Inner Join:")
print(inner_join)
# Left join - all records from first DataFrame
left_join = pd.merge(first_df, second_df, how='left', on='id')
print("\nLeft Join:")
print(left_join)
Inner Join: id country City 0 1 India Chennai 1 4 China Chengdu Left Join: id country City 0 1 India Chennai 1 2 UK NaN 2 3 US NaN 3 4 China Chengdu
Join Types Comparison
| Join Type | Description | Result |
|---|---|---|
inner |
Only matching records | Records with common IDs only |
left |
All from left DataFrame | All first_df records + matches |
right |
All from right DataFrame | All second_df records + matches |
outer |
All records from both | All records with NaN for missing |
Conclusion
Use pd.merge() with the on parameter to join DataFrames on common columns. Choose the appropriate join type based on whether you want only matching records (inner) or all records from one or both DataFrames.
