Join Pandas Dataframes matching by substring

Joining Pandas DataFrames based on substring matching allows you to merge datasets where exact matches aren't possible. This technique is useful when dealing with text data that may have variations in spelling, formatting, or when you need to match based on partial text content.

Understanding Substring-Based Joins

A substring-based join combines two or more DataFrames by matching portions of text within specified columns, rather than requiring exact matches. This approach is particularly valuable when working with messy text data or when you need flexible matching criteria.

Basic Syntax

# General pattern for substring-based joins
filtered_df = df1[df1['column'].str.contains('substring')]
result = pd.merge(filtered_df, df2, on='key_column', how='join_type')

Using str.contains() Method

The str.contains() method is the most straightforward approach for substring matching. It returns a boolean mask indicating which rows contain the specified substring ?

import pandas as pd

# Create sample DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice Johnson', 'Bob Smith', 'Claire Wilson', 'David Brown'],
    'Age': [25, 30, 28, 32]
})

df2 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Department': ['Marketing', 'HR', 'Finance', 'IT']
})

print("Original DataFrames:")
print("df1:")
print(df1)
print("\ndf2:")
print(df2)
Original DataFrames:
df1:
   ID          Name  Age
0   1  Alice Johnson   25
1   2     Bob Smith   30
2   3  Claire Wilson   28
3   4   David Brown   32

df2:
   ID Department
0   1  Marketing
1   2         HR
2   3    Finance
3   4         IT

Now let's join based on names containing the substring 'son' ?

# Filter rows where Name contains 'son'
filtered_df1 = df1[df1['Name'].str.contains('son', case=False)]

# Merge with df2
result = pd.merge(filtered_df1, df2, on='ID', how='inner')
print("Joined DataFrame (names containing 'son'):")
print(result)
Joined DataFrame (names containing 'son'):
   ID          Name  Age Department
0   1  Alice Johnson   25  Marketing
1   3  Claire Wilson   28    Finance

Using Regular Expressions for Complex Matching

Regular expressions provide more sophisticated pattern matching capabilities for complex substring operations ?

import pandas as pd
import re

# Create DataFrames with email-like data
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Email': ['alice@company.com', 'bob@gmail.com', 'claire@company.org', 'david@yahoo.com']
})

df2 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Status': ['Active', 'Inactive', 'Active', 'Pending']
})

# Use regex to find emails from company domain
pattern = r'@company\.'
filtered_df1 = df1[df1['Email'].str.contains(pattern, regex=True)]

result = pd.merge(filtered_df1, df2, on='ID', how='inner')
print("Employees with company emails:")
print(result)
Employees with company emails:
   ID               Email  Status
0   1    alice@company.com  Active
1   3  claire@company.org  Active

Handling Multiple Substrings

You can match multiple substrings using the pipe operator (|) in regex patterns ?

# Create product DataFrame
products_df = pd.DataFrame({
    'ProductID': [1, 2, 3, 4, 5],
    'ProductName': ['iPhone 12', 'Samsung Galaxy', 'iPad Pro', 'MacBook Air', 'Surface Pro'],
    'Price': [699, 599, 799, 999, 899]
})

sales_df = pd.DataFrame({
    'ProductID': [1, 2, 3, 4, 5],
    'Quantity': [50, 30, 25, 40, 20]
})

# Filter products containing 'iPhone' or 'iPad'
apple_products = products_df[products_df['ProductName'].str.contains('iPhone|iPad', regex=True)]

result = pd.merge(apple_products, sales_df, on='ProductID', how='inner')
print("Apple products with sales data:")
print(result)
Apple products with sales data:
   ProductID ProductName  Price  Quantity
0          1    iPhone 12    699        50
1          3     iPad Pro    799        25

Comparison of Methods

Method Use Case Complexity Performance
str.contains() Simple substring matching Low Fast
Regular Expressions Complex pattern matching High Moderate
Multiple patterns Matching several substrings Medium Moderate

Best Practices

When performing substring-based joins, consider these important points ?

# Handle case sensitivity and null values
df_example = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Text': ['Hello World', 'hello python', None, 'HELLO THERE']
})

# Case-insensitive matching with null handling
result = df_example[df_example['Text'].str.contains('hello', case=False, na=False)]
print("Case-insensitive matching (excluding nulls):")
print(result)
Case-insensitive matching (excluding nulls):
   ID          Text
0   1   Hello World
1   2  hello python
3   4   HELLO THERE

Conclusion

Substring-based joins in Pandas provide flexible data integration capabilities for text-heavy datasets. Use str.contains() for simple matching and regular expressions for complex patterns. Always consider case sensitivity and null value handling for robust data processing.

Updated on: 2026-03-27T14:59:30+05:30

2K+ Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements