Joining Excel Data from Multiple Files using Python Pandas


Joining Excel Data From Multiple Files Using Python Pandas: Introduction

Effectively merging and analyzing this data becomes essential as firms deal with enormous amounts of data contained in numerous Excel files. An elegant method for combining Excel data from various files is provided by Python's potent Pandas data processing package. With step-by-step directions and an explanation of the grammar, this article will walk you through the procedure. We'll also look at two distinct methods for doing this work, their examples, executable code, and results.

Excel Data From Multiple Files

The process of combining or concatenating data from various Excel files into a single dataset is known as joining Excel data. Data analysts, scientists, and business people can run thorough analyses and get insightful information thanks to this consolidation. Finding links, spotting trends, and drawing insightful conclusions are all made feasible by merging data from many sources, like as sales records, customer information, or financial data.

Additionally, merging Excel data makes it easier to clean and standardise the data because any discrepancies or missing values may be fixed during the consolidation process. This guarantees data integrity and accuracy, setting the groundwork for data-driven decision-making, strategic planning, and suggestions that can actually be implemented to increase organisational effectiveness. In the end, combining Excel data enables users to fully use their data assets and draw out useful intelligence from complicated datasets.

Syntax and Explanation

We'll mostly utilise the pandas.read_excel() function to read the Excel files and the pandas.concat() function to concatenate the dataframes in order to connect Excel data using Python Pandas.

The syntax for reading an Excel file using Pandas is as follows −

import pandas as pd
df = pd.read_excel('file_path.xlsx')	

Pd is used here to refer to the pandas library. The read_excel() function extracts the data from the Excel file indicated by the 'file_path.xlsx' parameter and returns a dataframe (df).

The syntax for concatenating multiple dataframes using Pandas is as follows −

combined_df = pd.concat([df1, df2, df3, ...])

A list of dataframes ([df1, df2, df3,...]) is sent to the concat() function, which concatenates the dataframes vertically. All the rows from the input dataframes will be included in the final dataframe, called combined_df.

Algorithm

  • Step 1 − Pandas and any other essential libraries should be imported.

  • Step 2 − The file paths of the Excel files you want to connect must be specified.

  • Step 3 − Use the read_excel() method to open each Excel file, then store the dataframes in different variables.

  • Step 4 − Use the concat() function to combine the dataframes, passing them as a list.

  • Step 5 − If more analysis is required, export the merged dataframe to a new Excel file.

Approach

  • Approach 1 − Joining Excel Data using a For Loop

  • Approach 2 − Joining Excel Data using List Comprehension

Approach 1: Joining Excel Data Using a For Loop

To store individual dataframes, we first define an empty list, dfs, in Approach 1. Then, using a for loop, we read each Excel file using pd.read_excel() by iterating over the file_paths array. The outcome is a dataframe that is added to the dfs list. Finally, we combine all of the dataframes in dfs into a single dataframe called combined_df using pd.concat().

Let's assume we have three Excel files (file1.xlsx, file2.xlsx, and file3.xlsx) with the following data −

file1.xlsx
   ID  Name    Age
0   1   Tom     28
1   2   Emma    35
file2.xlsx
   ID   Name    Age
0   3   John    42
1   4   Sarah   31
file3.xlsx
   ID    Name   Age
0   5    Liam   29
1   6    Lily   27

Example

import pandas as pd

file_paths = ['file1.xlsx', 'file2.xlsx', 'file3.xlsx']

dfs = []  # To store individual dataframes

for file in file_paths:
   df = pd.read_excel(file)
   dfs.append(df)

combined_df = pd.concat(dfs)

Output

   ID   Name    Age
0   1   Tom     28
1   2   Emma    35
0   3   John    42
1   4   Sarah   31
0   5   Liam    29
1   6   Lily    27

The rows from each file are combined vertically in the resulting dataframe. The first row of the data in file1.xlsx has an ID of 1, the name "Tom," and the age 28. The ID 2 on the second row is listed as "Emma," with a 35-year-old age.

After that, the resulting dataframe is supplemented with the data from file2.xlsx. The third row bears the ID 3, "John," and a 42-year-old age. The fourth row contains an ID of 4, the name "Sarah," and a 31-year-old age.

The information from file3.xlsx is then added. The fifth row has the ID 5 and the values "Liam," "29," and "Name." The sixth and last row's ID is 6, and its namesake, "Lily," is 27 years old.

This results in a comprehensive dataset that is prepared for additional analysis, processing, or exporting as required. As a consequence, we have a combined dataframe that contains all the rows from the three Excel files.

Approach 2: Joining Excel Data Using List Comprehension

The second approach, which uses a list comprehension, offers a clear and sophisticated approach to make lists in Python. Here, we use pd.read_excel(file) as the expression within the list comprehension to build a list of dataframes directly from the file_paths list. By doing away with an explicit for loop, this method results in more compact code.

Let's assume we have three Excel files (file1.xlsx, file2.xlsx, and file3.xlsx) with the following data −

file1.xlsx
   ID  Name    Age
0   1   Tom     28
1   2   Emma    35
file2.xlsx
   ID   Name    Age
0   3   John    42
1   4   Sarah   31
file3.xlsx
   ID    Name   Age
0   5    Liam   29
1   6    Lily   27

Example

import pandas as pd

file_paths = ['file1.xlsx', 'file2.xlsx', 'file3.xlsx']

dfs = [pd.read_excel(file) for file in file_paths]

combined_df = pd.concat(dfs)

Output

   ID   Name    Age
0   1   Tom     28
1   2   Emma    35
0   3   John    42
1   4   Sarah   31
0   5   Liam    29
1   6   Lily    27

The explanation for Approach 2 will be the same as Approach 1 for the output since both approaches yield the same output.

Conclusion

A strong and adaptable method to combine and analyze big datasets is to use Python Pandas to join Excel data from many sources. In this post, we looked at two different methods for accomplishing this operation as well as the syntax and step-by-step instructions for integrating Excel data. Data professionals may effectively extract useful insights from complicated datasets by utilizing the capabilities of Pandas, giving them the ability to make data-driven decisions and promote company success.

Updated on: 11-Oct-2023

94 Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements