How to find and filter duplicate rows in Pandas?


Sometimes during our data analysis, we need to look at the duplicate rows to understand more about our data rather than dropping them straight away.

Luckily, pandas gives us a few methods to work with duplicates.

.duplicated()

This method allows us to extract the duplicate rows in a DataFrame. We will use a dataset that contains duplicates: an HR dataset, loaded directly from the URL in the code below.

import pandas as pd
import numpy as np

# Import the HR dataset with only the columns we need
df = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/HRDataset.csv",
                 usecols=["Employee_Name", "PerformanceScore", "Position", "CitizenDesc"])

# Sort the values on Employee_Name and make the change permanent
df.sort_values("Employee_Name", inplace=True)
df.head(3)



   Employee_Name  Position                  CitizenDesc  PerformanceScore
0  Adinolfi       Production Technician I   US Citizen   Exceeds
1  Adinolfi       Sr. DBA                   US Citizen   Fully Meets
2  Adinolfi       Production Technician II  US Citizen   Fully Meets


By default, the behaviour of duplicated() is controlled by the keep parameter, which marks the very first occurrence of each value as a non-duplicate.

In other words, the method does not mark every occurrence of a repeated value as a duplicate; it marks each occurrence after the first one. Confused? Let me try to explain with an example: suppose there are 3 identical apples in a basket. This method marks the first apple as a non-duplicate and the remaining two apples as duplicates.
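The apple analogy can be written directly as a minimal sketch (the toy Series below is made up, not from the HR dataset):

```python
import pandas as pd

# Three identical apples in a basket
basket = pd.Series(["apple", "apple", "apple"])

# The first apple is kept as the non-duplicate; the other two are flagged
flags = basket.duplicated()
print(flags.tolist())  # [False, True, True]
```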

Example

df["Employee_Name"].head(3)

Output

0 Adinolfi
1 Adinolfi
2 Adinolfi
Name: Employee_Name, dtype: object

Example

df["Employee_Name"].duplicated().head(3)

Output

0 False
1 True
2 True
Name: Employee_Name, dtype: bool

Now, to extract the duplicates (remember: the first occurrence is not a duplicate; the subsequent occurrences are, and those are what this method flags), we pass the resulting boolean mask back to the DataFrame.

df.shape


(310, 4)


df[df["Employee_Name"].duplicated()]



     Employee_Name  Position                  CitizenDesc  PerformanceScore
1    Adinolfi       Sr. DBA                   US Citizen   Fully Meets
2    Adinolfi       Production Technician II  US Citizen   Fully Meets
3    Adinolfi       Production Technician I   US Citizen   Fully Meets
4    Adinolfi       Production Manager        US Citizen   Fully Meets
6    Anderson       Production Technician I   US Citizen   Exceeds
...  ...            ...                       ...          ...
303  Wang           Production Technician II  US Citizen   Fully Meets
304  Wang           Production Technician II  US Citizen   Fully Meets
305  Wang           Production Technician I   US Citizen   PIP
306  Wang           CIO                       US Citizen   Exceeds
307  Wang           Data Analyst              US Citizen   Fully Meets

79 rows × 4 columns

From the output above, of the 310 rows in the DataFrame, 79 are duplicates, and those are what the .duplicated() method extracts.
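A quick way to confirm the duplicate count without inspecting the whole output is to sum the boolean mask, since each True counts as 1. A minimal sketch with a made-up stand-in for the Employee_Name column:

```python
import pandas as pd

# Toy stand-in for the Employee_Name column (names are made up)
names = pd.Series(["Adinolfi", "Adinolfi", "Anderson", "Wang", "Wang", "Wang"])

# True counts as 1, so summing the mask gives the number of duplicate rows
dup_count = names.duplicated().sum()
print(dup_count)  # 3: one extra "Adinolfi" and two extra "Wang"
```

On the HR dataset, `df["Employee_Name"].duplicated().sum()` would return 79.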

ARGUMENT - LAST

By default, this method marks the first occurrence of a value as the non-duplicate; we can change this behaviour by passing the argument keep="last".

In the apple example, this marks the first two apples as duplicates and the last one as the non-duplicate.

df[df["Employee_Name"].duplicated(keep="last")]



     Employee_Name  Position                  CitizenDesc  PerformanceScore
0    Adinolfi       Production Technician I   US Citizen   Exceeds
1    Adinolfi       Sr. DBA                   US Citizen   Fully Meets
2    Adinolfi       Production Technician II  US Citizen   Fully Meets
3    Adinolfi       Production Technician I   US Citizen   Fully Meets
5    Anderson       Production Technician I   US Citizen   Fully Meets
...  ...            ...                       ...          ...
302  Wang           Production Technician II  US Citizen   Exceeds
303  Wang           Production Technician II  US Citizen   Fully Meets
304  Wang           Production Technician II  US Citizen   Fully Meets
305  Wang           Production Technician I   US Citizen   PIP
306  Wang           CIO                       US Citizen   Exceeds

ARGUMENT - FALSE

The keep parameter also accepts False, which marks every occurrence of a repeated value as a duplicate. In the apple example, all 3 apples are marked as duplicates, rather than keeping the first or the last as in the examples above.

Note – False here is the Python boolean, so do not wrap it in quotes.

df[df["Employee_Name"].duplicated(keep=False)]



     Employee_Name  Position                  CitizenDesc  PerformanceScore
0    Adinolfi       Production Technician I   US Citizen   Exceeds
1    Adinolfi       Sr. DBA                   US Citizen   Fully Meets
2    Adinolfi       Production Technician II  US Citizen   Fully Meets
3    Adinolfi       Production Technician I   US Citizen   Fully Meets
4    Adinolfi       Production Manager        US Citizen   Fully Meets
...  ...            ...                       ...          ...
303  Wang           Production Technician II  US Citizen   Fully Meets
304  Wang           Production Technician II  US Citizen   Fully Meets
305  Wang           Production Technician I   US Citizen   PIP
306  Wang           CIO                       US Citizen   Exceeds
307  Wang           Data Analyst              US Citizen   Fully Meets

Now finally, to extract only the rows whose values occur once in the dataset, we can use the "~" (tilde) operator to negate the boolean mask:

df_unique = ~df["Employee_Name"].duplicated(keep=False)
df[df_unique]



     Employee_Name  Position                  CitizenDesc  PerformanceScore
7    Andreola       Software Engineer         US Citizen   Fully Meets
25   Bozzi          Production Manager        US Citizen   Fully Meets
26   Bramante       Director of Operations    US Citizen   Exceeds
27   Brill          Production Technician I   US Citizen   Fully Meets
34   Burkett        Production Technician II  US Citizen   Fully Meets
...  ...            ...                       ...          ...
276  Sweetwater     Software Engineer         US Citizen   Exceeds
277  Szabo          Production Technician I   Non-Citizen  Fully Meets
278  Tavares        Production Technician II  US Citizen   Fully Meets
308  Zhou           Production Technician I   US Citizen   Fully Meets
309  Zima           NaN                       NaN          NaN

drop_duplicates()

This method is pretty similar to the previous one; however, it can be used on an entire DataFrame rather than on a single Series.

NOTE :- By default, this method looks for rows that are duplicated across all the columns of a DataFrame and drops them.

len(df)

Output

310


len(df.drop_duplicates())

Output

290
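To see why only 20 of the 310 rows are dropped here, note that a row is removed only when it matches another row on every column. A minimal sketch with a made-up toy DataFrame:

```python
import pandas as pd

# Toy DataFrame (made-up values): only rows 0 and 1 match on every column
df_toy = pd.DataFrame({
    "name": ["Wang", "Wang", "Wang"],
    "role": ["CIO", "CIO", "Data Analyst"],
})

# The second "Wang"/"CIO" row is dropped; the "Data Analyst" row survives
deduped = df_toy.drop_duplicates()
print(len(deduped))  # 2
```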

SUBSET PARAMETER

The subset parameter accepts a list of column names (as strings) in which to check for duplicates.

df1 = df.drop_duplicates(subset=["Employee_Name"], keep="first")
df1

     Employee_Name   Position                  CitizenDesc          PerformanceScore
0    Adinolfi        Production Technician I   US Citizen           Exceeds
5    Anderson        Production Technician I   US Citizen           Fully Meets
7    Andreola        Software Engineer         US Citizen           Fully Meets
14   Athwal          Production Technician I   US Citizen           Fully Meets
20   Beak            Production Technician I   US Citizen           Fully Meets
...  ...             ...                       ...                  ...
293  Von Massenbach  Production Technician II  US Citizen           Fully Meets
295  Wallace         Production Technician I   US Citizen           Needs Improvement
300  Wang            Production Technician I   Eligible NonCitizen  Fully Meets
308  Zhou            Production Technician I   US Citizen           Fully Meets
309  Zima            NaN                       NaN                  NaN

We can specify multiple columns, and all the keep arguments discussed in the previous section work here as well.

df1 = df.drop_duplicates(subset=["Employee_Name", "CitizenDesc"], keep=False)
df1



     Employee_Name  Position                  CitizenDesc          PerformanceScore
7    Andreola       Software Engineer         US Citizen           Fully Meets
16   Beak           Production Technician I   Eligible NonCitizen  Fully Meets
25   Bozzi          Production Manager        US Citizen           Fully Meets
26   Bramante       Director of Operations    US Citizen           Exceeds
27   Brill          Production Technician I   US Citizen           Fully Meets
...  ...            ...                       ...                  ...
287  Tejeda         Network Engineer          Eligible NonCitizen  Fully Meets
286  Tejeda         Software Engineer         Non-Citizen          Fully Meets
300  Wang           Production Technician I   Eligible NonCitizen  Fully Meets
308  Zhou           Production Technician I   US Citizen           Fully Meets
309  Zima           NaN                       NaN                  NaN

unique() Method

The unique() method finds the unique values in a Series and returns them as an array. Note that this method does not exclude missing values.

len(df["Employee_Name"])

Output

310


df["Employee_Name"].unique()


array(['Adinolfi', 'Anderson', 'Andreola', 'Athwal', 'Beak', 'Bondwell',
'Bozzi', 'Bramante', 'Brill', 'Brown', 'Burkett', 'Butler',
'Carabbio', 'Carey', 'Carr', 'Carter', 'Chace', 'Champaigne',
'Chan', 'Chang', 'Chivukula', 'Cierpiszewski', 'Cisco', 'Clayton',
'Cloninger', 'Close', 'Clukey', 'Cockel', 'Cole', 'Cornett',
'Costa', 'Crimmings', 'Daneault', 'Daniele', 'Darson', 'Davis',
'DeGweck', 'Del Bosque', 'Demita', 'Desimone', 'DiNocco',
'Dickinson', 'Dietrich', 'Digitale', 'Dobrin', 'Dolan', 'Dougall',
'Dunn', 'Eaton', 'Employee_Name', 'Engdahl', 'England', 'Erilus',
'Estremera', 'Evensen', 'Exantus', 'Faller', 'Fancett', 'Favis',
'Ferguson', 'Fernandes', 'Ferreira', 'Fidelia', 'Fitzpatrick',
'Foreman', 'Foss', 'Foster-Baker', 'Fraval', 'Friedman', 'Galia',
'Garcia', 'Garneau', 'Gaul', 'Gentry', 'Gerke', 'Gill', 'Gonzales',
'Gonzalez', 'Good', 'Handschiegl', 'Hankard', 'Harrison',
'Heitzman', 'Horton', 'Houlihan', 'Howard', 'Hubert', 'Hunts',
'Hutter', 'Huynh', 'Immediato', 'Ivey', 'Jackson', 'Jacobi',
'Jeannite', 'Jeremy Prater', 'Jhaveri', 'Johnson', 'Johnston',
'Jung', 'Kampew', 'Keatts', 'Khemmich', 'King', 'Kinsella',
'Kirill', 'Knapp', 'Kretschmer', 'LaRotonda', 'Lajiri', 'Langford',
'Langton', 'Latif', 'Le', 'LeBel', 'LeBlanc', 'Leach', 'Leruth',
'Liebig', 'Linares', 'Linden', 'Lindsay', 'Lundy', 'Lunquist',
'Lydon', 'Lynch', 'MacLennan', 'Mahoney', 'Manchester', 'Mancuso',
'Mangal', 'Martin', 'Martins', 'Maurice', 'McCarthy', 'McKinzie',
'Mckenna', 'Meads', 'Medeiros', 'Merlos', 'Miller', 'Monkfish',
'Monroe', 'Monterro', 'Moran', 'Morway', 'Motlagh', 'Moumanil',
'Mullaney', 'Murray', 'Navathe', 'Ndzi', 'Newman', 'Ngodup',
'Nguyen', 'Nowlan', "O'hare", 'Oliver', 'Onque', 'Osturnka',
'Owad', 'Ozark', 'Panjwani', 'Patronick', 'Pearson', 'Pelech',
'Pelletier', 'Perry', 'Peters', 'Peterson', 'Petingill',
'Petrowsky', 'Pham', 'Pitt', 'Potts', 'Power', 'Punjabhi',
'Purinton', 'Quinn', 'Rachael', 'Rarrick', 'Rhoads', 'Riordan',
'Rivera', 'Roberson', 'Robertson', 'Robinson', 'Roby', 'Roehrich',
'Rogers', 'Roper', 'Rose', 'Rossetti', 'Roup', 'Ruiz', 'Saada',
'Saar-Beckles', 'Sadki', 'Sahoo', 'Salter', 'Sander', 'Semizoglou',
'Sewkumar', 'Shepard', 'Shields', 'Simard', 'Singh', 'Sloan',
'Smith', 'Soto', 'South', 'Sparks', 'Spirea', 'Squatrito',
'Stanford', 'Stanley', 'Steans', 'Stoica', 'Strong', 'Sullivan',
'Sutwell', 'Sweetwater', 'Szabo', 'Tavares', 'Tejeda', 'Veera',
'Von Massenbach', 'Wallace', 'Wang', 'Zhou', 'Zima'], dtype=object)


len(df["Employee_Name"].unique())

Output

231

.nunique() Method

This method returns the number of unique values in a Series. By default it excludes missing values, via the parameter dropna=True.

You can pass dropna=False to count missing values as well.

df["Employee_Name"].nunique()

Output

231


df["Employee_Name"].nunique(dropna=False)

Output

231
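Both calls return 231 here because Employee_Name has no missing values. On a Series that does contain NaN, the dropna argument makes a difference; a minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Toy Series with missing values (made-up data)
s = pd.Series(["Zima", "Zhou", np.nan, "Zhou", np.nan])

print(s.nunique())              # 2 -> NaN excluded by default
print(s.nunique(dropna=False))  # 3 -> NaN counted as one extra distinct value
```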

Updated on: 10-Nov-2020
