Article Categories
- All Categories
-
Data Structure
-
Networking
-
RDBMS
-
Operating System
-
Java
-
MS Excel
-
iOS
-
HTML
-
CSS
-
Android
-
Python
-
C Programming
-
C++
-
C#
-
MongoDB
-
MySQL
-
Javascript
-
PHP
-
Economics & Finance
How to merge multiple excel files into a single file with Python?
Excel is one of the most popular tools for data analysis and management. Often, we need to merge multiple Excel files into a single file for analysis or sharing with others. Manually merging these files can be time?consuming and prone to errors, especially when dealing with large datasets. Luckily, Python provides an efficient and flexible way to merge multiple Excel files into a single file.
In this article, we will learn how to merge multiple Excel files using Python. We will be using the Pandas library, which is a powerful and easy?to?use data analysis library for Python to merge the different Excel files in Python.
What is Pandas library in Python?
Pandas is a Python library used for working with data sets. It consists of functions for analyzing, cleaning, exploring, and manipulating data. Apart from this, it allows us to analyze big data and make conclusions based on statistical theories. It can also clean messy data sets, and make them readable and relevant.
When merging Excel files with Pandas, we can use the read_excel() function to read the data from each file into a Pandas DataFrame object. Once we have all the data in separate DataFrames, we can use Pandas' built?in functions like merge(), join(), and concat() to combine the data frames into a single DataFrame.
Pandas' merge() function provides the ability to merge DataFrames based on specific columns, similar to the VLOOKUP function in Excel. We can specify which columns to use as keys to join the data frames and the type of join to perform (inner, outer, left, or right).
Steps to merge multiple Excel files into a single file
Step 1: Import the libraries
The first step to merge the different Excel files is to import the required libraries. We will be using Pandas to read and merge the Excel files, and OS to navigate the file system ?
import pandas as pd import os
Step 2: Identifying the Files to Merge
The next step is to identify the files that we want to merge. We can either specify the exact file names or use a pattern to match multiple files. In this example, we will use the latter approach, where we will match all files in a specific directory that have the extension ".xlsx" ?
import pandas as pd
import os
path = "/path/to/files/"
files = os.listdir(path)
files_xlsx = [f for f in files if f.endswith('.xlsx')]
print("Excel files found:", files_xlsx)
Excel files found: ['file1.xlsx', 'file2.xlsx', 'file3.xlsx']
Step 3: Reading your Excel Files
After identifying the files to merge, we will now read the different Excel files into Pandas data frames. We can use a loop to read all files into separate data frames and then concatenate them into a single data frame.
Let's create sample data to demonstrate the merging process ?
import pandas as pd
# Create sample DataFrames representing different Excel files
df1 = pd.DataFrame({
'Product': ['Apple', 'Banana'],
'Quantity': [10, 5],
'Price': [0.5, 0.25]
})
df2 = pd.DataFrame({
'Product': ['Orange', 'Grapes'],
'Quantity': [8, 12],
'Price': [0.4, 0.3]
})
df3 = pd.DataFrame({
'Product': ['Mango', 'Pineapple'],
'Quantity': [15, 7],
'Price': [0.6, 0.8]
})
print("Sample data from file 1:")
print(df1)
Sample data from file 1: Product Quantity Price 0 Apple 10 0.50 1 Banana 5 0.25
Step 4: Create a new merged Excel file
The final step to merge the files is to write the merged data frame to a new Excel file. We can use the Pandas to_excel() method to do this ?
# Merge all DataFrames
df_list = [df1, df2, df3]
merged_df = pd.concat(df_list, ignore_index=True)
print("Merged DataFrame:")
print(merged_df)
Merged DataFrame:
Product Quantity Price
0 Apple 10 0.50
1 Banana 5 0.25
2 Orange 8 0.40
3 Grapes 12 0.30
4 Mango 15 0.60
5 Pineapple 7 0.80
Method 1: Using pd.concat() with List Comprehension
This is the most efficient approach for merging multiple Excel files. We read all files into a list and concatenate them in one operation ?
import pandas as pd
import os
# Identifying the files to merge
path = "/path/to/files/"
files = os.listdir(path)
files_xlsx = [f for f in files if f.endswith('.xlsx')]
# Reading and concatenating Excel files in one step
df_list = [pd.read_excel(os.path.join(path, f)) for f in files_xlsx]
merged_df = pd.concat(df_list, ignore_index=True)
# Writing the merged data frame to a new Excel file
output_path = "/path/to/merged_file.xlsx"
merged_df.to_excel(output_path, index=False)
Method 2: Using a for loop with pd.concat()
This approach processes files one by one, which can be useful when dealing with very large files or when you need to perform operations on each file before merging ?
import pandas as pd
# Sample DataFrames representing Excel files
files_data = [
pd.DataFrame({'Product': ['Apple', 'Banana'], 'Quantity': [10, 5], 'Price': [0.5, 0.25]}),
pd.DataFrame({'Product': ['Orange', 'Grapes'], 'Quantity': [8, 12], 'Price': [0.4, 0.3]}),
pd.DataFrame({'Product': ['Mango', 'Pineapple'], 'Quantity': [15, 7], 'Price': [0.6, 0.8]})
]
# Initialize an empty DataFrame
merged_df = pd.DataFrame()
# Reading and concatenating files using a loop
for i, data in enumerate(files_data):
print(f"Processing file {i+1}:")
print(data)
merged_df = pd.concat([merged_df, data], ignore_index=True)
print("\nFinal merged DataFrame:")
print(merged_df)
Processing file 1:
Product Quantity Price
0 Apple 10 0.50
1 Banana 5 0.25
Processing file 2:
Product Quantity Price
0 Orange 8 0.40
1 Grapes 12 0.30
Processing file 3:
Product Quantity Price
0 Mango 15 0.60
1 Pineapple 7 0.80
Final merged DataFrame:
Product Quantity Price
0 Apple 10 0.50
1 Banana 5 0.25
2 Orange 8 0.40
3 Grapes 12 0.30
4 Mango 15 0.60
5 Pineapple 7 0.80
Method 3: Using pd.merge() for Column-based Merging
When you need to merge files based on common columns (similar to SQL joins), use pd.merge() instead of concat() ?
import pandas as pd
# Create sample DataFrames with common columns
df1 = pd.DataFrame({
'Product': ['Apple', 'Banana', 'Orange'],
'Quantity': [10, 5, 8]
})
df2 = pd.DataFrame({
'Product': ['Apple', 'Banana', 'Orange'],
'Price': [0.5, 0.25, 0.4]
})
# Merge based on common 'Product' column
merged_df = pd.merge(df1, df2, on='Product')
print("Merged DataFrame based on Product column:")
print(merged_df)
Merged DataFrame based on Product column: Product Quantity Price 0 Apple 10 0.50 1 Banana 5 0.25 2 Orange 8 0.40
Comparison
| Method | Best For | Performance | Use Case |
|---|---|---|---|
pd.concat() with list comprehension |
Multiple files with same structure | Fastest | Stacking data vertically |
For loop with pd.concat()
|
Large files or custom processing | Moderate | When memory is limited |
pd.merge() |
Files with common columns | Depends on data size | Joining based on keys |
Conclusion
Merging multiple Excel files into a single file using Python saves time and reduces errors when dealing with large datasets. Use pd.concat() for stacking similar structured files and pd.merge() for combining files based on common columns. With Python's Pandas library, merging Excel files becomes an efficient and automated process.
