How to do Fuzzy Matching on Pandas Dataframe Column Using Python?

Fuzzy matching is a technique for finding approximate string matches in datasets where exact matches may not exist due to variations in spelling, formatting, or data entry errors. In pandas DataFrames, fuzzy matching helps identify similar entries across columns or datasets using similarity algorithms.

The fuzzywuzzy library provides fuzzy string matching capabilities using Levenshtein distance to calculate similarity ratios between strings. We'll demonstrate matching car names between two DataFrames using a similarity threshold of 70%.

Setting Up the Data

First, let's create two DataFrames with similar but not identical car names:

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# dictionaries
d1 = {'Car': ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]}
d2 = {'Car': ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]}

# convert dictionaries to pandas dataframes
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

Convert the DataFrame columns to lists for easier processing:

myList1 = df1['Car'].tolist()
myList2 = df2['Car'].tolist()

How Fuzzy Matching Works

Fuzzy String Matching Process BMW Calculate Ratio BM (66%) No Match Threshold = 70% Matches above 70% similarity are accepted Mercedes ? MERCEDES (90%) Rolls ? Rolls Royce (85%)

Complete Implementation

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# dictionaries
d1 = {'Car': ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]}
d2 = {'Car': ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]}

# convert dictionaries to pandas dataframes
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

# printing the pandas dataframes
print("Dataframe 1 = <br>", df1)
print("Dataframe 2 = <br>", df2)

# empty lists for storing the matches later
match1 = []
match2 = []
k = []

# converting dataframe column to list of elements for fuzzy matching
myList1 = df1['Car'].tolist()
myList2 = df2['Car'].tolist()

threshold = 70

# iterating myList1 to extract closest match from myList2
for i in myList1:
    match1.append(process.extractOne(i, myList2, scorer=fuzz.ratio))
df1['matches'] = match1
for j in df1['matches']:
    if j[1] >= threshold:
        k.append(j[0])
    match2.append(",".join(k))
    k = []

# saving matches to df1 
df1['matches'] = match2 
print("\nMatches...")
print(df1)

Output

Dataframe 1 = 
      Car
0     BMW
1    Audi
2   Lexus
3 Mercedes
4   Rolls
Dataframe 2 = 
         Car
0         BM
1       Audi
2         Le
3  MERCEDES
4 Rolls Royce

Matches...
        Car   matches
0       BMW        BM
1      Audi      Audi
2     Lexus          
3  Mercedes  MERCEDES
4     Rolls          

Key Features

  • Similarity scoring − Uses Levenshtein distance to calculate string similarity ratios

  • Threshold filtering − Only matches above the specified threshold (70%) are considered valid

  • Best match extraction − The process.extractOne() function finds the single best match for each string

Common Use Cases

  • Data cleaning − Identifying and merging duplicate records with slight variations

  • Record linkage − Matching entities across different databases or datasets

  • Search functionality − Implementing approximate search in applications

Conclusion

Fuzzy matching in pandas DataFrames enables identification of similar strings using similarity thresholds and algorithms like Levenshtein distance. This technique is essential for data cleaning, deduplication, and approximate matching tasks where exact string matches are unlikely.

Updated on: 2026-03-16T23:36:12+05:30

2K+ Views

Kickstart Your Career

Get certified by completing the course

Get Started
Advertisements