How to merge many TSV files by common key using Python Pandas?


If you work with data, you've probably had to deal with the challenge of merging multiple files into one cohesive dataset. This task can be particularly difficult if you're working with tab−separated values (TSV) files. Fortunately, the Python Pandas library provides a straightforward solution for merging TSV files by a common key.

In this article, we'll learn how to merge multiple TSV files using Python Pandas. To begin with, we will first see what TSV files are and how they differ from CSV files. Next, we'll see the Pandas library and explain its capabilities for working with TSV files. And finally, we'll go through a step−by−step process to merge multiple TSV files using a common key with the help of Pandas.

What are TSV Files?

TSV files are a type of delimited text file that uses tabs as the delimiter between values. This format is similar to comma−separated values (CSV) files, which use commas as the delimiter. TSV files are often used to represent data that contains commas, such as addresses or names that include a comma. By using tabs instead of commas, TSV files can avoid the confusion that might arise from the presence of commas in the data.

Pandas provides a wide range of functions for manipulating and analyzing data, including functions for reading and writing TSV files.

Steps to merge different TSV files by common key using Pandas

Now that we've introduced TSV files and Pandas, let's dive into the process of merging multiple TSV files by a common key. In this article, we assume that you have several TSV files that end with. tsv (like youdatafile.tsv, etc.) that you want to merge into a single cohesive dataset using Python pandas.

Let’s see the steps to merge different TSV files by common key using Pandas:

Step 1: Load different TSV Files into Pandas DataFrames

Our first step to merging the files is to load each TSV file into a Pandas DataFrame. We can do this by using the Pandas read_csv() function, which can read TSV files as well as CSV files.

Now assume that we have three TSV files that we want to merge: persons.tsv, orderdetails.tsv, and productdetails.tsv. Each of these files contains a different aspect of our data:

  • persons.tsv contains information about our persons, such as their name, email address, and phone number

  • orderdetails.tsv contains information about the order details placed by our persons, such as the date of the order and the total cost

  • productdetails.tsv contains information about the product details that we sell, such as the product name and the price

Now we will load each file into a Pandas DataFrame. To do this, we'll use the following code:

import pandas as pd
persons = pd.read_csv('persons.tsv', sep='\t')
orderdetails = pd.read_csv('orderdetails.tsv', sep='\t')
productdetails = pd.read_csv('productdetails.tsv', sep='\t')

Step 2: Inspect the DataFrames

The next step after loading each TSV file into a separate Pandas DataFrame is to inspect each DataFrame to ensure that the data was loaded correctly. To do this, we will use the head() function used in displaying the first few rows of the data frame.

print(persons.head())
print(orderdetails.head())
print(productdetails.head())

The above code will display the first five rows of each DataFrame in the console. We can use this information to verify that the data was loaded correctly and to get a sense of what each DataFrame contains.

Step 3: Merge the DataFrames

Now that we have each TSV file loaded into a separate Pandas DataFrame, we can merge the data into a single cohesive dataset. To do this, we'll use the merge() function provided by Pandas.

The merge() function combines two DataFrames into a single DataFrame based on a common key. In our case, the common key is the customer ID. We'll merge the persons DataFrame with the orderdetails DataFrame first, and then merge the resulting DataFrame with the productdetails DataFrame.

# Merge persons and orderdetails
customer_orderdetails = pd.merge(persons, orderdetails, on='customer_id')

# Merge customer_orderdetails and productdetails
merged_data = pd.merge(customer_orderdetails, productdetails, on='product_id')

The merge() function takes two DataFrames as input and an argument that specifies the common key to use for the merge. In our case, the common key is the customer_id column in the persons DataFrame and the orderdetails DataFrame, and the product_id column in the customer_orderdetails DataFrame and the productdetails DataFrame.

The resulting merged_data DataFrame contains all of the information from the original TSV files, merged into a single cohesive dataset.

Step 4: Saving the File

Finally, we can save the merged data to a TSV file using the to_csv() function provided by Pandas.

merged_data.to_csv('merged_data.tsv', sep='\t', index=False)

Example

Let's consider the following example. We have three TSV files, 'mydata1.tsv, 'mydata2.tsv, and 'mydata3.tsv. Each file contains the same columns: id, name, and contact.

In the below example, we have used the concat() function which takes a list of DataFrames as input and combines them into a single DataFrame. By default, concat() combines DataFrames vertically, meaning that it stacks them on top of each other. In this example, since the three TSV files have the same columns, the resulting merged DataFrame will have the same columns as well.

import pandas as pd
# Load different TSV files into DataFrames
mydf1 = pd.read_csv('mydata1.tsv', sep='\t')
mydf2 = pd.read_csv('mydata2.tsv', sep='\t')
mydf3 = pd.read_csv('mydata3.tsv', sep='\t')

# Merge DataFrames using concat()
mymerged_data = pd.concat([mydf1, mydf2, mydf3])

# Save merged data to TSV file
mymerged_data.to_csv('merged_data.tsv', sep='\t', index=False)

Output

Name    Age    City
John    25     London
Emily   32     New York
David   41     Paris
Sophie  29     Berlin
Michael 37     Sydney
Emma    31     Toronto

Example

Let's consider the following example. We have two TSV files, myproductdetails.tsv and mysales.tsv. The myproductdetails.tsv file contains information about productdetails, while the mysales.tsv file contains information about sales transactions. Both files contain a product_id column that we will use as the common key to merge the files.

import pandas as pd

# Load TSV files into DataFrames
myproductdetailsdf = pd.read_csv('myproductdetails.tsv', sep='\t')
mysalesdf = pd.read_csv('mysales.tsv', sep='\t')

# Merge DataFrames using merge()
mymerged_data = pd.merge(myproductdetailsdf, mysalesdf, on='product_id')

# Save merged data to TSV file
mymerged_data.to_csv('mymerged_data.tsv', sep='\t', index=False)

Output

Name    Age    City
John    25     London
Emily   32     New York
David   41     Paris
Sophie  29     Berlin
Michael 37     Sydney
Emma    31     Toronto

In the above example, we have used the merge() function to take two DataFrames as input and an argument that specifies the common key to use for the merge. In this example, we are merging the myproductdetailsdf and mysalesdf DataFrames based on the product_id column. The resulting merged DataFrame will contain all the columns from both DataFrames, with the rows matched based on the common key.

Conclusion

In this article, we have learned how to merge multiple TSV files by a common key using Python Pandas through a step−by−step tutorial on how to merge multiple TSV files by a common key using Pandas.

Updated on: 31-Aug-2023

311 Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements