How to Find and Filter Duplicate Rows in Pandas?
Sometimes during data analysis, we need to examine duplicate rows to understand patterns in our data rather than dropping them immediately. Pandas provides several methods to find, filter, and handle duplicate rows effectively.
The duplicated() Method
The duplicated() method identifies duplicate rows in a DataFrame. Let's work with an HR dataset to demonstrate this functionality:
import pandas as pd
import numpy as np
# Import HR Dataset with certain columns
df = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/HRDataset.csv",
usecols=["Employee_Name", "PerformanceScore", "Position", "CitizenDesc"])
# Sort the values on employee name and make it permanent
df.sort_values("Employee_Name", inplace=True)
print(df.head(3))
  Employee_Name                  Position CitizenDesc PerformanceScore
0      Adinolfi   Production Technician I  US Citizen          Exceeds
1      Adinolfi                   Sr. DBA  US Citizen      Fully Meets
2      Adinolfi  Production Technician II  US Citizen      Fully Meets
The duplicated() method marks the first occurrence of each value as a non-duplicate (False) and every subsequent occurrence as a duplicate (True). Here's how it works:
# Check first 3 employee names
print("Employee names:")
print(df["Employee_Name"].head(3))
print("\nDuplicate status:")
print(df["Employee_Name"].duplicated().head(3))
Employee names:
0    Adinolfi
1    Adinolfi
2    Adinolfi
Name: Employee_Name, dtype: object

Duplicate status:
0    False
1     True
2     True
Name: Employee_Name, dtype: bool
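Because the HR dataset above must be downloaded over the network, the same behavior can be reproduced offline on a small hand-built Series (hypothetical names):

```python
import pandas as pd

# Hypothetical mini-dataset with repeated names
s = pd.Series(["Adinolfi", "Adinolfi", "Brown", "Adinolfi", "Brown", "Chen"])

# The first occurrence of each value is False; every repeat is True
print(s.duplicated().tolist())
# [False, True, False, True, True, False]
```

Note that "Brown" at position 4 is flagged even though a different name appears between its two occurrences; duplicated() compares values, not adjacency.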
Finding Duplicate Rows
To extract rows with duplicate names, we filter the DataFrame using the boolean mask:
# Get duplicate rows based on Employee_Name
duplicate_rows = df[df["Employee_Name"].duplicated()]
print(f"Total rows: {df.shape[0]}")
print(f"Duplicate rows: {duplicate_rows.shape[0]}")
print(duplicate_rows.head())
Total rows: 310
Duplicate rows: 79
  Employee_Name                  Position CitizenDesc PerformanceScore
1      Adinolfi                   Sr. DBA  US Citizen      Fully Meets
2      Adinolfi  Production Technician II  US Citizen      Fully Meets
3      Adinolfi   Production Technician I  US Citizen      Fully Meets
4      Adinolfi        Production Manager  US Citizen      Fully Meets
6      Anderson   Production Technician I  US Citizen          Exceeds
Using keep Parameter
The keep parameter controls which occurrences are marked as duplicates. Options are 'first' (the default), 'last', and False:
keep='last'
# Mark all but the last occurrence as duplicates
duplicate_last = df[df["Employee_Name"].duplicated(keep="last")]
print(f"Duplicates keeping last: {duplicate_last.shape[0]}")
Duplicates keeping last: 79
keep=False
# Mark all occurrences of duplicated values as duplicates
all_duplicates = df[df["Employee_Name"].duplicated(keep=False)]
print(f"All duplicate occurrences: {all_duplicates.shape[0]}")
print(all_duplicates.head())
All duplicate occurrences: 118
  Employee_Name                  Position CitizenDesc PerformanceScore
0      Adinolfi   Production Technician I  US Citizen          Exceeds
1      Adinolfi                   Sr. DBA  US Citizen      Fully Meets
2      Adinolfi  Production Technician II  US Citizen      Fully Meets
3      Adinolfi   Production Technician I  US Citizen      Fully Meets
4      Adinolfi        Production Manager  US Citizen      Fully Meets
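The three keep options are easiest to compare side by side on a tiny Series (hypothetical values, not from the HR dataset):

```python
import pandas as pd

s = pd.Series(["a", "a", "b", "a", "c"])

# keep="first" (default): repeats after the first occurrence are True
print(s.duplicated(keep="first").tolist())  # [False, True, False, True, False]

# keep="last": repeats before the last occurrence are True
print(s.duplicated(keep="last").tolist())   # [True, True, False, False, False]

# keep=False: every occurrence of a repeated value is True
print(s.duplicated(keep=False).tolist())    # [True, True, False, True, False]
```

This is why keep='first' and keep='last' both flag 79 rows on the HR dataset, while keep=False flags 118: the first two each spare one occurrence per name, while keep=False spares none.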
Extracting Unique Rows
Use the tilde (~) operator to negate the duplicated condition and get unique rows:
# Get rows with unique employee names only
df_unique = df[~df["Employee_Name"].duplicated(keep=False)]
print(f"Unique employee records: {df_unique.shape[0]}")
print(df_unique.head())
Unique employee records: 192
Employee_Name Position CitizenDesc PerformanceScore
7 Andreola Software Engineer US Citizen Fully Meets
25 Bozzi Production Manager US Citizen Fully Meets
26 Bramante Director of Operations US Citizen Exceeds
27 Brill Production Technician I US Citizen Fully Meets
34 Burkett Production Technician II US Citizen Fully Meets
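The negation pattern can be verified on a small hypothetical DataFrame, where only rows whose name appears exactly once survive the filter:

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["a", "a", "b", "c", "c"],
    "score": [1, 2, 3, 4, 5],
})

# keep=False marks every occurrence of a repeated name;
# negating with ~ keeps only names that appear exactly once
unique_rows = df[~df["name"].duplicated(keep=False)]
print(unique_rows["name"].tolist())  # ['b']
```

Note the distinction from drop_duplicates(): this filter discards all copies of a repeated name, rather than retaining one representative.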
The drop_duplicates() Method
The drop_duplicates() method removes rows whose values are identical across every column of the DataFrame:
# Remove duplicate rows across all columns
df_no_duplicates = df.drop_duplicates()
print(f"Original rows: {len(df)}")
print(f"After removing duplicates: {len(df_no_duplicates)}")
Original rows: 310
After removing duplicates: 290
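Only 20 rows are dropped here, even though 79 names repeat, because drop_duplicates() with no arguments requires the entire row to match. A minimal sketch with hypothetical data makes the distinction concrete:

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["a", "a", "a"],
    "score": [1, 1, 2],
})

# Only rows identical across *all* columns are dropped:
# the two (a, 1) rows collapse into one; (a, 2) survives
deduped = df.drop_duplicates()
print(len(deduped))  # 2
```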
Using subset Parameter
The subset parameter allows you to check for duplicates based on specific columns:
# Remove duplicates based on Employee_Name only
df_unique_names = df.drop_duplicates(subset=["Employee_Name"], keep="first")
print(f"Unique employee names: {len(df_unique_names)}")
print(df_unique_names.head())
Unique employee names: 231
   Employee_Name                 Position CitizenDesc PerformanceScore
0       Adinolfi  Production Technician I  US Citizen          Exceeds
5       Anderson  Production Technician I  US Citizen      Fully Meets
7       Andreola        Software Engineer  US Citizen      Fully Meets
14        Athwal  Production Technician I  US Citizen      Fully Meets
20          Beak  Production Technician I  US Citizen      Fully Meets
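subset also accepts multiple columns, and combines with keep. A small sketch with hypothetical data, where duplicates are judged on the (name, position) pair only:

```python
import pandas as pd

df = pd.DataFrame({
    "name":     ["a", "a", "b"],
    "position": ["x", "x", "y"],
    "score":    [1, 2, 3],
})

# Rows 0 and 1 match on (name, position); keep="last" retains row 1
out = df.drop_duplicates(subset=["name", "position"], keep="last")
print(out["score"].tolist())  # [2, 3]
```

The score column is ignored when deciding what counts as a duplicate, but the surviving row keeps its score.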
Methods for Finding Unique Values
unique() Method
The unique() method returns a NumPy array of the unique values in a Series, in order of first appearance:
# Get unique employee names
unique_names = df["Employee_Name"].unique()
print(f"Total employee records: {len(df['Employee_Name'])}")
print(f"Unique employee names: {len(unique_names)}")
print(f"First 10 unique names: {unique_names[:10]}")
Total employee records: 310
Unique employee names: 231
First 10 unique names: ['Adinolfi' 'Anderson' 'Andreola' 'Athwal' 'Beak' 'Bondwell' 'Bozzi' 'Bramante' 'Brill' 'Brown']
nunique() Method
The nunique() method returns the count of unique values, excluding NaN by default:
# Count unique values
unique_count = df["Employee_Name"].nunique()
print(f"Number of unique employees: {unique_count}")
# Include NaN values in count
unique_with_nan = df["Employee_Name"].nunique(dropna=False)
print(f"Unique count including NaN: {unique_with_nan}")
Number of unique employees: 231
Unique count including NaN: 231
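The two counts match above because Employee_Name contains no missing values. With hypothetical data that does contain a NaN, the dropna parameter makes a visible difference:

```python
import pandas as pd
import numpy as np

s = pd.Series(["a", "b", np.nan, "a"])

print(s.nunique())              # 2 -- NaN is excluded by default
print(s.nunique(dropna=False))  # 3 -- NaN counted as its own value
```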
Comparison of Methods
| Method | Purpose | Returns | Use Case |
|---|---|---|---|
| duplicated() | Identify duplicates | Boolean Series | Filter duplicate rows |
| drop_duplicates() | Remove duplicates | DataFrame | Clean dataset |
| unique() | Get unique values | Array | Explore unique values |
| nunique() | Count unique values | Integer | Data summary statistics |
Conclusion
Pandas provides comprehensive tools for handling duplicates: duplicated() for identification, drop_duplicates() for removal, and unique()/nunique() for exploring unique values. Use the keep and subset parameters to customize behavior based on your analysis needs.
