Article Categories

Selected Reading

How to do Fuzzy Matching on Pandas Dataframe Column Using Python?

Computer Network Internet MCA

Fuzzy matching is a technique for finding approximate string matches in datasets where exact matches may not exist due to variations in spelling, formatting, or data entry errors. In pandas DataFrames, fuzzy matching helps identify similar entries across columns or datasets using similarity algorithms.

The fuzzywuzzy library provides fuzzy string matching capabilities using Levenshtein distance to calculate similarity ratios between strings. We'll demonstrate matching car names between two DataFrames using a similarity threshold of 70%.

Setting Up the Data

First, let's create two DataFrames with similar but not identical car names:

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# dictionaries
d1 = {'Car': ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]}
d2 = {'Car': ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]}

# convert dictionaries to pandas dataframes
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

Convert the DataFrame columns to lists for easier processing:

myList1 = df1['Car'].tolist()
myList2 = df2['Car'].tolist()

How Fuzzy Matching Works

Complete Implementation

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# dictionaries
d1 = {'Car': ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]}
d2 = {'Car': ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]}

# convert dictionaries to pandas dataframes
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

# printing the pandas dataframes
print("Dataframe 1 = <br>", df1)
print("Dataframe 2 = <br>", df2)

# empty lists for storing the matches later
match1 = []
match2 = []
k = []

# converting dataframe column to list of elements for fuzzy matching
myList1 = df1['Car'].tolist()
myList2 = df2['Car'].tolist()

threshold = 70

# iterating myList1 to extract closest match from myList2
for i in myList1:
    match1.append(process.extractOne(i, myList2, scorer=fuzz.ratio))
df1['matches'] = match1
for j in df1['matches']:
    if j[1] >= threshold:
        k.append(j[0])
    match2.append(",".join(k))
    k = []

# saving matches to df1 
df1['matches'] = match2 
print("\nMatches...")
print(df1)

Output

Dataframe 1 = 
      Car
0     BMW
1    Audi
2   Lexus
3 Mercedes
4   Rolls
Dataframe 2 = 
         Car
0         BM
1       Audi
2         Le
3  MERCEDES
4 Rolls Royce

Matches...
        Car   matches
0       BMW        BM
1      Audi      Audi
2     Lexus          
3  Mercedes  MERCEDES
4     Rolls

Key Features

Similarity scoring − Uses Levenshtein distance to calculate string similarity ratios
Threshold filtering − Only matches above the specified threshold (70%) are considered valid
Best match extraction − The process.extractOne() function finds the single best match for each string

Common Use Cases

Data cleaning − Identifying and merging duplicate records with slight variations
Record linkage − Matching entities across different databases or datasets
Search functionality − Implementing approximate search in applications

Conclusion

Fuzzy matching in pandas DataFrames enables identification of similar strings using similarity thresholds and algorithms like Levenshtein distance. This technique is essential for data cleaning, deduplication, and approximate matching tasks where exact string matches are unlikely.

AmitDiwan

Updated on: 2026-03-16T23:36:12+05:30

2K+ Views

Previous Next