Article Categories

Selected Reading

How to merge two csv files by specific column using Pandas in Python?

Pandas Python Server Side Programming Programming

CSV (Comma Separated Values) files are widely used for storing and exchanging data in a simple format. In many data processing tasks, it is necessary to merge two or more CSV files based on a specific column. Fortunately, this can be easily achieved using the Pandas library in Python.

In this article, we will learn how to merge two CSV files by a specific column using Pandas in Python.

What is Pandas Library?

Pandas is an open-source library for data manipulation and analysis in Python. It provides high-performance data structures and tools for working with structured data, such as tabular, time-series, and multidimensional data. Pandas is widely used in finance, data science, machine learning, and other fields that require data manipulation.

Using merge() Function

The most common way to merge CSV files is using the merge() function. Let's create sample data and demonstrate the merge process ?

import pandas as pd

# Create sample employee data
employees_data = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Name': ['John', 'Sarah', 'David', 'Alex', 'Emily'],
    'DepartmentID': [1, 2, 1, 3, 2],
    'Salary': [50000, 60000, 55000, 65000, 55000]
}

# Create sample department data  
departments_data = {
    'DepartmentID': [1, 2, 3, 4],
    'DepartmentName': ['IT', 'Sales', 'Marketing', 'HR']
}

# Create DataFrames
employees = pd.DataFrame(employees_data)
departments = pd.DataFrame(departments_data)

print("Employees DataFrame:")
print(employees)
print("\nDepartments DataFrame:")
print(departments)

Employees DataFrame:
   EmployeeID   Name  DepartmentID  Salary
0           1   John             1   50000
1           2  Sarah             2   60000
2           3  David             1   55000
3           4   Alex             3   65000
4           5  Emily             2   55000

Departments DataFrame:
   DepartmentID DepartmentName
0             1             IT
1             2          Sales
2             3      Marketing
3             4             HR

Now let's merge the two DataFrames based on the DepartmentID column ?

import pandas as pd

# Create sample data (same as above)
employees_data = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Name': ['John', 'Sarah', 'David', 'Alex', 'Emily'],
    'DepartmentID': [1, 2, 1, 3, 2],
    'Salary': [50000, 60000, 55000, 65000, 55000]
}

departments_data = {
    'DepartmentID': [1, 2, 3, 4],
    'DepartmentName': ['IT', 'Sales', 'Marketing', 'HR']
}

employees = pd.DataFrame(employees_data)
departments = pd.DataFrame(departments_data)

# Merge DataFrames based on DepartmentID column
merged_df = pd.merge(employees, departments, on='DepartmentID')

print("Merged DataFrame:")
print(merged_df)

Merged DataFrame:
   EmployeeID   Name  DepartmentID  Salary DepartmentName
0           1   John             1   50000             IT
1           3  David             1   55000             IT
2           2  Sarah             2   60000          Sales
3           5  Emily             2   55000          Sales
4           4   Alex             3   65000      Marketing

Different Types of Joins

Pandas merge() supports different types of joins. Here's how to perform different merge types ?

import pandas as pd

# Create sample data with some non-matching values
employees_data = {
    'EmployeeID': [1, 2, 3, 4],
    'Name': ['John', 'Sarah', 'David', 'Alex'],
    'DepartmentID': [1, 2, 1, 5]  # Note: DepartmentID 5 doesn't exist in departments
}

departments_data = {
    'DepartmentID': [1, 2, 3, 4],
    'DepartmentName': ['IT', 'Sales', 'Marketing', 'HR']
}

employees = pd.DataFrame(employees_data)
departments = pd.DataFrame(departments_data)

# Inner join (default) - only matching records
inner_join = pd.merge(employees, departments, on='DepartmentID', how='inner')
print("Inner Join:")
print(inner_join)

# Left join - all records from left DataFrame
left_join = pd.merge(employees, departments, on='DepartmentID', how='left')
print("\nLeft Join:")
print(left_join)

# Right join - all records from right DataFrame  
right_join = pd.merge(employees, departments, on='DepartmentID', how='right')
print("\nRight Join:")
print(right_join)

Inner Join:
   EmployeeID   Name  DepartmentID DepartmentName
0           1   John             1             IT
1           2  Sarah             2          Sales
2           3  David             1             IT

Left Join:
   EmployeeID   Name  DepartmentID DepartmentName
0           1   John             1             IT
1           2  Sarah             2          Sales
2           3  David             1             IT
3           4   Alex             5            NaN

Right Join:
   EmployeeID   Name  DepartmentID DepartmentName
0          1.0   John             1             IT
1          2.0  Sarah             2          Sales
2          3.0  David             1             IT
3          NaN    NaN             3      Marketing
4          NaN    NaN             4             HR

Using join() Method

The join() method is another way to combine DataFrames, typically using the index ?

import pandas as pd

# Create sample order and customer data
orders_data = {
    'OrderID': [1, 2, 3, 4, 5],
    'CustomerID': [1, 3, 2, 1, 4],
    'OrderDate': ['2022-05-01', '2022-05-02', '2022-05-03', '2022-05-04', '2022-05-05'],
    'TotalAmount': [100.0, 150.0, 200.0, 75.0, 120.0]
}

customers_data = {
    'CustomerID': [1, 2, 3, 4],
    'CustomerName': ['John', 'Sarah', 'David', 'Emily'],
    'Email': ['john@example.com', 'sarah@example.com', 'david@example.com', 'emily@example.com']
}

orders = pd.DataFrame(orders_data)
customers = pd.DataFrame(customers_data)

# Join DataFrames using CustomerID as index
joined_df = orders.set_index('CustomerID').join(customers.set_index('CustomerID'))

print("Joined DataFrame:")
print(joined_df)

Joined DataFrame:
            OrderID   OrderDate  TotalAmount CustomerName              Email
CustomerID                                                                  
1                 1  2022-05-01        100.0         John   john@example.com
3                 2  2022-05-02        150.0        David  david@example.com
2                 3  2022-05-03        200.0        Sarah  sarah@example.com
1                 4  2022-05-04         75.0         John   john@example.com
4                 5  2022-05-05        120.0        Emily  emily@example.com

Handling Missing Values

When merging DataFrames, missing values may occur. You can handle them using fillna() or dropna() methods ?

import pandas as pd

# Create data with potential missing values after merge
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 3, 4], 'Age': [25, 30, 35]})

# Merge with outer join to show all records
merged = pd.merge(df1, df2, on='ID', how='outer')
print("Merged with missing values:")
print(merged)

# Fill missing values
merged_filled = merged.fillna('Unknown')
print("\nAfter filling missing values:")
print(merged_filled)

# Drop rows with missing values
merged_dropped = merged.dropna()
print("\nAfter dropping missing values:")
print(merged_dropped)

Merged with missing values:
   ID     Name   Age
0   1    Alice  25.0
1   2      Bob   NaN
2   3  Charlie  30.0
3   4      NaN  35.0

After filling missing values:
   ID     Name   Age
0   1    Alice    25
1   2      Bob  Unknown
2   3  Charlie    30
3   4  Unknown    35

After dropping missing values:
   ID     Name   Age
0   1    Alice  25.0
2   3  Charlie  30.0

Comparison

Method	Best For	Key Feature
`merge()`	Column-based joins	Most flexible, supports all join types
`join()`	Index-based joins	Simpler syntax for index merges

Conclusion

Merging CSV files by specific columns is straightforward using Pandas' merge() and join() methods. Use merge() for column-based joins and join() for index-based operations. Handle missing values appropriately based on your data requirements.

Tarun Singh

Updated on: 2026-03-27T10:27:31+05:30

5K+ Views

Previous Next