Joining two Pandas DataFrames using merge()


Joining two Pandas DataFrames using merge():Introduction

Pandas is a well-known Python data manipulation package that offers a variety of data structures for working with data, including Series and DataFrame. To unite two or more data frames based on shared columns or indices, utilise Pandas' merge() function. With the help of this function, you may modify the join process and combine the data frames in a variety of ways. This article will offer a thorough tutorial on utilising the merge() function to unite two Pandas data frames.

Joining two Pandas

Definition

To unite two or more data frames based on shared columns or indices, utilise Pandas' merge() function. Based on a defined join criteria, it joins the rows of two data frames into a single data frame.

Syntax

pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
  • `left`− A data frame object.

  • `right`− Another data frame object.

  • `how`− The type of join to be performed (inner, outer, left, or right).

  • `on`− The column or index name to join the data frames on.

  • `left_on`− The column(s) from the left data frame to use as keys.

  • `right_on`− The column(s) from the right data frame to use as keys.

  • `left_index` − If True, use the index (row labels) from the left data frame as the join key(s).

  • `right_index` − If True, use the index (row labels) from the right data frame as the join key(s).

  • `sort` − Sort the joined data frame by the join keys. Defaults to False.

  • `suffixes` − A tuple of suffixes to apply to overlapping column names.

  • `copy` − If True, create a new data frame rather than modifying the existing ones.

  • `indicator` − Adds a column to output DataFrame called "_merge" with information on the source of each row.

  • `validate` − Checks if merge keys exist in both left and right datasets.

Algorithm

  • Step 1 − Add the Pandas library.

  • Step 2 − Two data frames with comparable columns or indexes should be created.

  • Step 3 − Use the merge() method to specify the join criteria.

  • Step 4 − Select the join type that will be used.

  • Step 5 − Use the merge() function to combine the data frames, then display the outcome.

Approach

  • Approach 1 − Using shared columns

  • Approach 2 − Using shared indices

Approach 1: Using Shared Columns

Utilising shared columns is the first method for utilising the merge() function to unite two Pandas data frames. With this method, two data frames are joined based on shared columns. Suppose we have two data frames `df1` and `df2`, as follows −

Example

import pandas as pd

df1 = pd.DataFrame({
   'ID': ['001', '002', '003', '004'],
   'Name': ['John', 'Marry', 'Peter', 'Jack'],
   'Age': [21, 32, 25, 19],
   'Gender': ['Male', 'Female', 'Male', 'Male']
})

df2 = pd.DataFrame({
   'ID': ['002', '003', '005', '006'],
   'Salary': [55000, 65000, 75000, 85000],
   'Department': ['IT', 'Marketing', 'Sales', 'HR']
})
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)

Output

    ID   Name   Age  Gender   Salary   Department
0  002  Marry   32   Female   55000         IT
1  003  Peter   25   Male     65000    Marketing

Information on the employees is contained in the 'df1' data frame, while details about their pay and departments are contained in the 'df2' data frame. In order to connect the data frames depending on the ID column, we supply the 'on' argument in this code as ''ID''. In order to execute an inner join, we also specify the 'how' parameter as ''inner,'' which implies that only the rows from both data frames with matching IDs will be added to the combined data frame. Only the rows from both data frames with matching IDs are included in this merged data frame, which also includes all of the columns from both data frames.

Approach 2: Using Shared Indices

Utilising common indices is the second method for utilising the merge() function to unite two Pandas data frames. With this method, two data frames are joined based on shared indices. Suppose we have two data frames `df1` and `df2`, as follows −

Example

import pandas as pd

df1 = pd.DataFrame({
   'ID': ['001', '002', '003', '004'],
   'Name': ['John', 'Marry', 'Peter', 'Jack'],
   'Age': [21, 32, 25, 19],
   'Gender': ['Male', 'Female', 'Male', 'Male']
}, index=['a', 'b', 'c', 'd'])

df2 = pd.DataFrame({
   'Salary': [55000, 65000, 75000, 85000],
   'Department': ['IT', 'Marketing', 'Sales', 'HR']
}, index=['b', 'c', 'e', 'f'])
merged_df = pd.merge(df1, df2, left_index=True, right_index=True, how='inner')
print(merged_df)

Output

    ID   Name   Age  Gender   Salary  Department
b  002  Marry   32   Female   55000         IT
c  003  Peter   25    Male    65000   Marketing

The 'df1' data frame in this instance comprises data about the employees, but the 'df2' data frame contains data on their salaries and departments. The index of the 'df1' data frame is '['a', 'b', 'c', 'd']', while that of the 'df2' data frame is '['b', 'c', 'e', 'f']'. To combine the data frames based on their indices, we set the 'left_index' and 'right_index' parameters in this code to 'True'. In order to execute an inner join, we also specify the 'how' parameter as ''inner,'' which implies that only the rows from both data frames that have matching indices will be added to the combined data frame. This merged data frame includes all the columns from both data frames and only the rows with matching indices from both data frames.

Conclusion

This article has covered using the merge() function to combine two Pandas data frames. We discussed the merge() function's syntax and parameters and offered a straightforward 5-step strategy for combining data frames. Additionally, we've shown how to combine data frames using two different methods: one that uses shared columns and another that uses shared indices. Additionally, we have offered complete executable code examples for both strategies, together with output and in-depth justifications. In data analysis, joining data frames is a frequent activity that can be accomplished with the merge() method in Pandas. By combining data frames with the merge() method, we can do so based on shared columns or shared indices and we can choose whether an inner join, outer join, left join, or right join is made.

Overall, any data analyst or data scientist working with Pandas should be familiar with the syntax and capabilities of the merge() function because it is a crucial tool for working with data frames. With the ability to link data frames using the merge() function, we may do more intricate and perceptive analysis on our data and draw more accurate conclusions from them.

Updated on: 12-Oct-2023

138 Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements