Article Categories
- All Categories
-
Data Structure
-
Networking
-
RDBMS
-
Operating System
-
Java
-
MS Excel
-
iOS
-
HTML
-
CSS
-
Android
-
Python
-
C Programming
-
C++
-
C#
-
MongoDB
-
MySQL
-
Javascript
-
PHP
-
Economics & Finance
How to merge two csv files by specific column using Pandas in Python?
CSV (Comma Separated Values) files are widely used for storing and exchanging data in a simple format. In many data processing tasks, it is necessary to merge two or more CSV files based on a specific column. Fortunately, this can be easily achieved using the Pandas library in Python.
In this article, we will learn how to merge two CSV files by a specific column using Pandas in Python.
What is Pandas Library?
Pandas is an open-source library for data manipulation and analysis in Python. It provides high-performance data structures and tools for working with structured data, such as tabular, time-series, and multidimensional data. Pandas is widely used in finance, data science, machine learning, and other fields that require data manipulation.
Using merge() Function
The most common way to merge CSV files is using the merge() function. Let's create sample data and demonstrate the merge process ?
import pandas as pd
# Create sample employee data
employees_data = {
'EmployeeID': [1, 2, 3, 4, 5],
'Name': ['John', 'Sarah', 'David', 'Alex', 'Emily'],
'DepartmentID': [1, 2, 1, 3, 2],
'Salary': [50000, 60000, 55000, 65000, 55000]
}
# Create sample department data
departments_data = {
'DepartmentID': [1, 2, 3, 4],
'DepartmentName': ['IT', 'Sales', 'Marketing', 'HR']
}
# Create DataFrames
employees = pd.DataFrame(employees_data)
departments = pd.DataFrame(departments_data)
print("Employees DataFrame:")
print(employees)
print("\nDepartments DataFrame:")
print(departments)
Employees DataFrame: EmployeeID Name DepartmentID Salary 0 1 John 1 50000 1 2 Sarah 2 60000 2 3 David 1 55000 3 4 Alex 3 65000 4 5 Emily 2 55000 Departments DataFrame: DepartmentID DepartmentName 0 1 IT 1 2 Sales 2 3 Marketing 3 4 HR
Now let's merge the two DataFrames based on the DepartmentID column ?
import pandas as pd
# Create sample data (same as above)
employees_data = {
'EmployeeID': [1, 2, 3, 4, 5],
'Name': ['John', 'Sarah', 'David', 'Alex', 'Emily'],
'DepartmentID': [1, 2, 1, 3, 2],
'Salary': [50000, 60000, 55000, 65000, 55000]
}
departments_data = {
'DepartmentID': [1, 2, 3, 4],
'DepartmentName': ['IT', 'Sales', 'Marketing', 'HR']
}
employees = pd.DataFrame(employees_data)
departments = pd.DataFrame(departments_data)
# Merge DataFrames based on DepartmentID column
merged_df = pd.merge(employees, departments, on='DepartmentID')
print("Merged DataFrame:")
print(merged_df)
Merged DataFrame: EmployeeID Name DepartmentID Salary DepartmentName 0 1 John 1 50000 IT 1 3 David 1 55000 IT 2 2 Sarah 2 60000 Sales 3 5 Emily 2 55000 Sales 4 4 Alex 3 65000 Marketing
Different Types of Joins
Pandas merge() supports different types of joins. Here's how to perform different merge types ?
import pandas as pd
# Create sample data with some non-matching values
employees_data = {
'EmployeeID': [1, 2, 3, 4],
'Name': ['John', 'Sarah', 'David', 'Alex'],
'DepartmentID': [1, 2, 1, 5] # Note: DepartmentID 5 doesn't exist in departments
}
departments_data = {
'DepartmentID': [1, 2, 3, 4],
'DepartmentName': ['IT', 'Sales', 'Marketing', 'HR']
}
employees = pd.DataFrame(employees_data)
departments = pd.DataFrame(departments_data)
# Inner join (default) - only matching records
inner_join = pd.merge(employees, departments, on='DepartmentID', how='inner')
print("Inner Join:")
print(inner_join)
# Left join - all records from left DataFrame
left_join = pd.merge(employees, departments, on='DepartmentID', how='left')
print("\nLeft Join:")
print(left_join)
# Right join - all records from right DataFrame
right_join = pd.merge(employees, departments, on='DepartmentID', how='right')
print("\nRight Join:")
print(right_join)
Inner Join: EmployeeID Name DepartmentID DepartmentName 0 1 John 1 IT 1 2 Sarah 2 Sales 2 3 David 1 IT Left Join: EmployeeID Name DepartmentID DepartmentName 0 1 John 1 IT 1 2 Sarah 2 Sales 2 3 David 1 IT 3 4 Alex 5 NaN Right Join: EmployeeID Name DepartmentID DepartmentName 0 1.0 John 1 IT 1 2.0 Sarah 2 Sales 2 3.0 David 1 IT 3 NaN NaN 3 Marketing 4 NaN NaN 4 HR
Using join() Method
The join() method is another way to combine DataFrames, typically using the index ?
import pandas as pd
# Create sample order and customer data
orders_data = {
'OrderID': [1, 2, 3, 4, 5],
'CustomerID': [1, 3, 2, 1, 4],
'OrderDate': ['2022-05-01', '2022-05-02', '2022-05-03', '2022-05-04', '2022-05-05'],
'TotalAmount': [100.0, 150.0, 200.0, 75.0, 120.0]
}
customers_data = {
'CustomerID': [1, 2, 3, 4],
'CustomerName': ['John', 'Sarah', 'David', 'Emily'],
'Email': ['john@example.com', 'sarah@example.com', 'david@example.com', 'emily@example.com']
}
orders = pd.DataFrame(orders_data)
customers = pd.DataFrame(customers_data)
# Join DataFrames using CustomerID as index
joined_df = orders.set_index('CustomerID').join(customers.set_index('CustomerID'))
print("Joined DataFrame:")
print(joined_df)
Joined DataFrame:
OrderID OrderDate TotalAmount CustomerName Email
CustomerID
1 1 2022-05-01 100.0 John john@example.com
3 2 2022-05-02 150.0 David david@example.com
2 3 2022-05-03 200.0 Sarah sarah@example.com
1 4 2022-05-04 75.0 John john@example.com
4 5 2022-05-05 120.0 Emily emily@example.com
Handling Missing Values
When merging DataFrames, missing values may occur. You can handle them using fillna() or dropna() methods ?
import pandas as pd
# Create data with potential missing values after merge
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 3, 4], 'Age': [25, 30, 35]})
# Merge with outer join to show all records
merged = pd.merge(df1, df2, on='ID', how='outer')
print("Merged with missing values:")
print(merged)
# Fill missing values
merged_filled = merged.fillna('Unknown')
print("\nAfter filling missing values:")
print(merged_filled)
# Drop rows with missing values
merged_dropped = merged.dropna()
print("\nAfter dropping missing values:")
print(merged_dropped)
Merged with missing values: ID Name Age 0 1 Alice 25.0 1 2 Bob NaN 2 3 Charlie 30.0 3 4 NaN 35.0 After filling missing values: ID Name Age 0 1 Alice 25 1 2 Bob Unknown 2 3 Charlie 30 3 4 Unknown 35 After dropping missing values: ID Name Age 0 1 Alice 25.0 2 3 Charlie 30.0
Comparison
| Method | Best For | Key Feature |
|---|---|---|
merge() |
Column-based joins | Most flexible, supports all join types |
join() |
Index-based joins | Simpler syntax for index merges |
Conclusion
Merging CSV files by specific columns is straightforward using Pandas' merge() and join() methods. Use merge() for column-based joins and join() for index-based operations. Handle missing values appropriately based on your data requirements.
