How to merge two csv files by specific column using Pandas in Python?


CSV (Comma Separated Values) files are widely used for storing and exchanging data in a simple format. In many data processing tasks, it is necessary to merge two or more CSV files based on a specific column. Fortunately, this can be easily achieved using the Pandas library in Python.

In this article, we will learn how to merge two CSV files by a specific column using Pandas in Python.

What is Pandas Library?

Pandas is an open-source library for information control and examination in Python. It offers tools for working with structured data, such as tabular, time-series, and multidimensional data, as well as high-performance data structures. Pandas is widely used in finance, data science, machine learning, and other fields that require data manipulation.

Steps to Merge two CSV Files by Specific Column in Python

Below are the complete steps to merge the two CSV files by their specific column in Python using the Pandas library −

Step 1: Import Pandas library

The first step to merging two CSV files is to import the pandas library. Pandas is a powerful data analysis library for Python. It provides data structures for efficiently storing and manipulating large datasets. To use Pandas, we first need to import it in our Python program. We can do this using the following command −

import pandas as pd

Step 2: Read CSV Files

Our next step is to read the two CSV files that we want to merge. We can use the read_csv() function of Pandas to read CSV files into a Pandas DataFrame. We need to provide the file path or URL of the CSV file as an argument to the read_csv() function. For example −

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')

Step 3: Merging CSV Files

After reading the CSV files into Pandas DataFrames, now its time to merge them based on a specific column using the merge() function. The merge() function takes two DataFrames as input and merges them based on a common column. Here's the basic syntax −

merged_df = pd.merge(df1, df2, on='column_name')

In this example, merged_df is the new DataFrame that contains the merged data from df1 and df2. The on parameter specifies the name of the common column on which the two DataFrames are merged.

For example, let's say we have two CSV files sales.csv and customers.csv. Both files contain a column named CustomerID. We want to merge the two files based on the CustomerID column. Here's how we can do it −

sales = pd.read_csv('sales.csv')
customers = pd.read_csv('customers.csv')
merged_df = pd.merge(sales, customers, on='CustomerID')

Step 4: Handling Missing Values (if any)

When merging two CSV files based on a specific column, there may be missing values in the merged DataFrame. These missing values may occur if there are no matching values in the corresponding column of one of the CSV files.

To handle missing values, we can use the fillna() function of Pandas to replace them with a default value or a calculated value. For example, we can replace missing values with a string 'Unknown' using the following code −

merged_df.fillna('Unknown', inplace=True)

In this example, we have used the fillna() function to replace missing values in the merged DataFrame with the string 'Unknown'. We have also specified the inplace parameter as True, which means that the original DataFrame will be modified instead of creating a new one.

Example 1: Using the Merge() Function

In this example, we will use the merge() function of Pandas to merge two CSV files based on a specific column.

CSV Files

Suppose we have two CSV files: employees.csv and departments.csv. The employees.csv file contains the following data −

EmployeeID Name DepartmentID Salary
1 John 1 50000
2 Sarah 2 60000
3 David 1 55000
4 Alex 3 65000
5 Emily 2 55000

The departments.csv file contains the following data −

DepartmentID DepartmentName
1 IT
2 Sales
3 Marketing
4 HR

Example

import pandas as pd

# Load CSV files
employees = pd.read_csv('employees.csv')
departments = pd.read_csv('departments.csv')

# Merge dataframes based on DepartmentID column
merged_df = pd.merge(employees, departments, on='DepartmentID')

# Print merged dataframe
print(merged_df.head())

# Save merged dataframe to a new CSV file
merged_df.to_csv('merged_employees_departments.csv', index=False)

Output

   EmployeeID   Name  DepartmentID  Salary DepartmentName
0           1   John             1   50000             IT
1           3  David             1   55000             IT
2           2  Sarah             2   60000          Sales
3           5  Emily             2   55000          Sales
4           4   Alex             3   65000      Marketing

Example 2: Using the Join() Function

In this example, we will use the join() method of Pandas to merge two CSV files based on a specific column.

CSV Files

Suppose we have two CSV files: orders.csv and customers.csv. The orders.csv file contains the following data −

OrderID CustomerID OrderDate TotalAmount
1 1 2022-05-01 100.0
2 3 2022-05-02 150.0
3 2 2022-05-03 200.0
4 1 2022-05-04 75.0
5 4 2022-05-05 120.0

The customers.csv file contains the following data −

CustomerID CustomerName Email
1 John john@example.com
2 Sarah sarah@example.com
3 David david@example.com
4 Emily emily@example.com

Example

import pandas as pd

# Load CSV files
orders = pd.read_csv('orders.csv')
customers = pd.read_csv('customers.csv')

# Join dataframes based on CustomerID column
joined_df = orders.set_index('CustomerID').join(customers.set_index('CustomerID'))

# Print joined dataframe
print(joined_df.head())

# Save joined dataframe to a new CSV file
joined_df.to_csv('joined_orders_customers.csv')

Output

            OrderID   OrderDate  TotalAmount CustomerName              Email
CustomerID                                                                  
1                 1  2022-05-01        100.0         John   john@example.com
1                 4  2022-05-04         75.0         John   john@example.com
2                 3  2022-05-03        200.0        Sarah  sarah@example.com
3                 2  2022-05-02        150.0        David  david@example.com
4                 5  2022-05-05        120.0        Emily  emily@example.com

Using Pandas' merge() function, we have merged two CSV files based on the "id" column in this example. The DataFrame that is the result of merging the two CSV files includes the "id" column in addition to the "name_x," "email_x," "name_y," and "email_y" columns.

Note that there are missing values in the combined DataFrame for the 'name_y' and 'email_y' segments, which relate to the lines where there were no matching qualities in the second CSV record. As demonstrated in the previous step, the Pandas fillna() and dropna() functions can be used to handle these missing values.

Conclusion

Merging two CSV files based on a specific column is a common data processing task, and it can be easily achieved using the Pandas library in Python. In this article, we learned how to merge two CSV files using the merge() function of Pandas. We have also discussed how to handle missing values and how to save the merged DataFrame to a new CSV file.

Updated on: 31-Jul-2023

3K+ Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements