Article Categories
- All Categories
-
Data Structure
-
Networking
-
RDBMS
-
Operating System
-
Java
-
MS Excel
-
iOS
-
HTML
-
CSS
-
Android
-
Python
-
C Programming
-
C++
-
C#
-
MongoDB
-
MySQL
-
Javascript
-
PHP
-
Economics & Finance
How to do Fuzzy Matching on Pandas Dataframe Column Using Python?
Fuzzy matching is a technique for finding approximate string matches in datasets where exact matches may not exist due to variations in spelling, formatting, or data entry errors. In pandas DataFrames, fuzzy matching helps identify similar entries across columns or datasets using similarity algorithms.
The fuzzywuzzy library provides fuzzy string matching capabilities using Levenshtein distance to calculate similarity ratios between strings. We'll demonstrate matching car names between two DataFrames using a similarity threshold of 70%.
Setting Up the Data
First, let's create two DataFrames with similar but not identical car names:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# dictionaries
d1 = {'Car': ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]}
d2 = {'Car': ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]}
# convert dictionaries to pandas dataframes
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
Convert the DataFrame columns to lists for easier processing:
myList1 = df1['Car'].tolist() myList2 = df2['Car'].tolist()
How Fuzzy Matching Works
Complete Implementation
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# dictionaries
d1 = {'Car': ["BMW", "Audi", "Lexus", "Mercedes", "Rolls"]}
d2 = {'Car': ["BM", "Audi", "Le", "MERCEDES", "Rolls Royce"]}
# convert dictionaries to pandas dataframes
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
# printing the pandas dataframes
print("Dataframe 1 = <br>", df1)
print("Dataframe 2 = <br>", df2)
# empty lists for storing the matches later
match1 = []
match2 = []
k = []
# converting dataframe column to list of elements for fuzzy matching
myList1 = df1['Car'].tolist()
myList2 = df2['Car'].tolist()
threshold = 70
# iterating myList1 to extract closest match from myList2
for i in myList1:
match1.append(process.extractOne(i, myList2, scorer=fuzz.ratio))
df1['matches'] = match1
for j in df1['matches']:
if j[1] >= threshold:
k.append(j[0])
match2.append(",".join(k))
k = []
# saving matches to df1
df1['matches'] = match2
print("\nMatches...")
print(df1)
Output
Dataframe 1 =
Car
0 BMW
1 Audi
2 Lexus
3 Mercedes
4 Rolls
Dataframe 2 =
Car
0 BM
1 Audi
2 Le
3 MERCEDES
4 Rolls Royce
Matches...
Car matches
0 BMW BM
1 Audi Audi
2 Lexus
3 Mercedes MERCEDES
4 Rolls
Key Features
-
Similarity scoring − Uses Levenshtein distance to calculate string similarity ratios
-
Threshold filtering − Only matches above the specified threshold (70%) are considered valid
-
Best match extraction − The
process.extractOne()function finds the single best match for each string
Common Use Cases
-
Data cleaning − Identifying and merging duplicate records with slight variations
-
Record linkage − Matching entities across different databases or datasets
-
Search functionality − Implementing approximate search in applications
Conclusion
Fuzzy matching in pandas DataFrames enables identification of similar strings using similarity thresholds and algorithms like Levenshtein distance. This technique is essential for data cleaning, deduplication, and approximate matching tasks where exact string matches are unlikely.
