How to Find and Filter Duplicate Rows in Pandas?

Sometimes during data analysis, we need to examine duplicate rows to understand patterns in our data rather than dropping them immediately. Pandas provides several methods to find, filter, and handle duplicate rows effectively.

The duplicated() Method

The duplicated() method identifies duplicate rows in a DataFrame. Let's work with an HR dataset to demonstrate this functionality:

import pandas as pd
import numpy as np

# Import HR Dataset with certain columns
df = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/HRDataset.csv",
                 usecols=["Employee_Name", "PerformanceScore", "Position", "CitizenDesc"])

# Sort by employee name in place
df.sort_values("Employee_Name", inplace=True)
print(df.head(3))
  Employee_Name                  Position CitizenDesc PerformanceScore
0      Adinolfi   Production Technician I  US Citizen          Exceeds
1      Adinolfi                   Sr. DBA  US Citizen      Fully Meets
2      Adinolfi  Production Technician II  US Citizen      Fully Meets

The duplicated() method marks the first occurrence of each value as non-duplicate and all subsequent occurrences as duplicates. Here's how it works:

# Check first 3 employee names
print("Employee names:")
print(df["Employee_Name"].head(3))

print("\nDuplicate status:")
print(df["Employee_Name"].duplicated().head(3))
Employee names:
0    Adinolfi
1    Adinolfi
2    Adinolfi
Name: Employee_Name, dtype: object

Duplicate status:
0    False
1     True
2     True
Name: Employee_Name, dtype: bool

Finding Duplicate Rows

To extract rows with duplicate names, we filter the DataFrame using the boolean mask:

# Get duplicate rows based on Employee_Name
duplicate_rows = df[df["Employee_Name"].duplicated()]
print(f"Total rows: {df.shape[0]}")
print(f"Duplicate rows: {duplicate_rows.shape[0]}")
print(duplicate_rows.head())
Total rows: 310
Duplicate rows: 79
  Employee_Name                  Position CitizenDesc PerformanceScore
1      Adinolfi                   Sr. DBA  US Citizen      Fully Meets
2      Adinolfi  Production Technician II  US Citizen      Fully Meets
3      Adinolfi   Production Technician I  US Citizen      Fully Meets
4      Adinolfi        Production Manager  US Citizen      Fully Meets
6      Anderson   Production Technician I  US Citizen          Exceeds

Using keep Parameter

The keep parameter controls which occurrences are marked as duplicates. Options are 'first' (the default), 'last', or False:

keep='last'

# Mark all but the last occurrence as duplicates
duplicate_last = df[df["Employee_Name"].duplicated(keep="last")]
print(f"Duplicates keeping last: {duplicate_last.shape[0]}")
Duplicates keeping last: 79

keep=False

# Mark all occurrences of duplicated values as duplicates
all_duplicates = df[df["Employee_Name"].duplicated(keep=False)]
print(f"All duplicate occurrences: {all_duplicates.shape[0]}")
print(all_duplicates.head())
All duplicate occurrences: 118
  Employee_Name                  Position CitizenDesc PerformanceScore
0      Adinolfi   Production Technician I  US Citizen          Exceeds
1      Adinolfi                   Sr. DBA  US Citizen      Fully Meets
2      Adinolfi  Production Technician II  US Citizen      Fully Meets
3      Adinolfi   Production Technician I  US Citizen      Fully Meets
4      Adinolfi        Production Manager  US Citizen      Fully Meets

Extracting Unique Rows

Use the tilde (~) operator to negate the duplicated() mask. Combined with keep=False, this keeps only rows whose employee name appears exactly once:

# Get rows with unique employee names only
df_unique = df[~df["Employee_Name"].duplicated(keep=False)]
print(f"Unique employee records: {df_unique.shape[0]}")
print(df_unique.head())
Unique employee records: 192
   Employee_Name                  Position CitizenDesc PerformanceScore
7       Andreola         Software Engineer  US Citizen      Fully Meets
25         Bozzi        Production Manager  US Citizen      Fully Meets
26      Bramante    Director of Operations  US Citizen          Exceeds
27         Brill   Production Technician I  US Citizen      Fully Meets
34       Burkett  Production Technician II  US Citizen      Fully Meets

The drop_duplicates() Method

The drop_duplicates() method removes duplicate rows from a DataFrame. By default it compares all columns, so only fully identical rows are dropped:

# Remove duplicate rows across all columns
df_no_duplicates = df.drop_duplicates()
print(f"Original rows: {len(df)}")
print(f"After removing duplicates: {len(df_no_duplicates)}")
Original rows: 310
After removing duplicates: 290

Using subset Parameter

The subset parameter allows you to check for duplicates based on specific columns rather than the entire row:

# Remove duplicates based on Employee_Name only
df_unique_names = df.drop_duplicates(subset=["Employee_Name"], keep="first")
print(f"Unique employee names: {len(df_unique_names)}")
print(df_unique_names.head())
Unique employee names: 231
   Employee_Name                 Position CitizenDesc PerformanceScore
0       Adinolfi  Production Technician I  US Citizen          Exceeds
5       Anderson  Production Technician I  US Citizen      Fully Meets
7       Andreola        Software Engineer  US Citizen      Fully Meets
14        Athwal  Production Technician I  US Citizen      Fully Meets
20          Beak  Production Technician I  US Citizen      Fully Meets

Methods for Finding Unique Values

unique() Method

Returns an array of unique values in a Series, in order of appearance:

# Get unique employee names
unique_names = df["Employee_Name"].unique()
print(f"Total employee records: {len(df['Employee_Name'])}")
print(f"Unique employee names: {len(unique_names)}")
print(f"First 10 unique names: {unique_names[:10]}")
Total employee records: 310
Unique employee names: 231
First 10 unique names: ['Adinolfi' 'Anderson' 'Andreola' 'Athwal' 'Beak' 'Bondwell' 'Bozzi'
 'Bramante' 'Brill' 'Brown']

nunique() Method

Returns the count of unique values, excluding NaN by default:

# Count unique values
unique_count = df["Employee_Name"].nunique()
print(f"Number of unique employees: {unique_count}")

# Include NaN values in count
unique_with_nan = df["Employee_Name"].nunique(dropna=False)
print(f"Unique count including NaN: {unique_with_nan}")
Number of unique employees: 231
Unique count including NaN: 231
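In the HR dataset the two counts match because Employee_Name contains no missing values. With a NaN present, the dropna parameter makes a difference, as this minimal sketch on a toy Series shows:

```python
import numpy as np
import pandas as pd

# Toy Series with one missing value
s = pd.Series(["a", "b", "a", np.nan])

# NaN is excluded by default
print(s.nunique())              # 2

# Counting NaN as its own value
print(s.nunique(dropna=False))  # 3
```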

Comparison of Methods

Method             Purpose               Returns          Use Case
-----------------  --------------------  ---------------  -----------------------
duplicated()       Identify duplicates   Boolean Series   Filter duplicate rows
drop_duplicates()  Remove duplicates     DataFrame        Clean dataset
unique()           Get unique values     Array            Explore unique values
nunique()          Count unique values   Integer          Data summary statistics
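To see all four methods side by side without downloading the HR dataset, here is a self-contained sketch using a small, made-up DataFrame (the names and columns are illustrative only):

```python
import pandas as pd

# Hypothetical DataFrame with one exact duplicate row (Alice/HR)
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Carol", "Bob"],
    "dept": ["HR", "IT", "HR", "IT", "Sales"],
})

# duplicated(): boolean mask; first occurrence is not flagged
print(df["name"].duplicated().tolist())   # [False, False, True, False, True]

# drop_duplicates(): full-row comparison removes only the repeated Alice/HR row
print(len(df.drop_duplicates()))          # 4

# unique(): array of distinct values in order of appearance
print(df["name"].unique())                # ['Alice' 'Bob' 'Carol']

# nunique(): count of distinct values
print(df["name"].nunique())               # 3
```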

Conclusion

Pandas provides comprehensive tools for handling duplicates: duplicated() for identification, drop_duplicates() for removal, and unique()/nunique() for exploring unique values. Use the keep and subset parameters to customize behavior based on your analysis needs.

Updated on: 2026-03-25T11:50:51+05:30
