How to find and filter duplicate rows in Pandas?


Sometimes during our data analysis, we need to look at the duplicate rows to understand more about our data rather than dropping them straight away.

Luckily, pandas gives us a few methods to work with duplicates.

.duplicated()

This method allows us to extract the duplicate rows in a DataFrame. We will use a dataset that contains duplicates: an HR dataset, loaded directly from the URL in the code below.

import pandas as pd
import numpy as np

# Import the HR dataset with only the columns we need
df = pd.read_csv("https://raw.githubusercontent.com/sasankac/TestDataSet/master/HRDataset.csv",
                 usecols=["Employee_Name", "PerformanceScore", "Position", "CitizenDesc"])

# Sort the values on Employee_Name and make the change permanent
df.sort_values("Employee_Name", inplace=True)
df.head(3)



   Employee_Name  Position                  CitizenDesc  PerformanceScore
0  Adinolfi       Production Technician I   US Citizen   Exceeds
1  Adinolfi       Sr. DBA                   US Citizen   Fully Meets
2  Adinolfi       Production Technician II  US Citizen   Fully Meets


By default, the behaviour of duplicated() is controlled by the keep parameter, which marks the very first occurrence of each value as a non-duplicate.

In other words, the method does not mark every occurrence of a repeated value as a duplicate; it marks each occurrence after the first one. Confused? Let me try to explain with an example: suppose there are 3 identical apples in a basket. This method marks the first apple as a non-duplicate and the remaining two apples as duplicates.
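The apple analogy can be written directly as a minimal sketch (the toy Series below is made up, not from the HR dataset):

```python
import pandas as pd

# Three identical apples in a basket
basket = pd.Series(["apple", "apple", "apple"])

# The first apple is kept as the non-duplicate; the other two are flagged
flags = basket.duplicated()
print(flags.tolist())  # [False, True, True]
```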

Example

df["Employee_Name"].head(3)

Output

0 Adinolfi
1 Adinolfi
2 Adinolfi
Name: Employee_Name, dtype: object

Example

df["Employee_Name"].duplicated().head(3)

Output

0 False
1 True
2 True
Name: Employee_Name, dtype: bool

Now, to extract the duplicates (remember: the first occurrence is not a duplicate; the subsequent occurrences are, and those are what this method flags), we pass the resulting boolean mask back to the DataFrame.

df.shape


(310, 4)


df[df["Employee_Name"].duplicated()]



     Employee_Name  Position                  CitizenDesc  PerformanceScore
1    Adinolfi       Sr. DBA                   US Citizen   Fully Meets
2    Adinolfi       Production Technician II  US Citizen   Fully Meets
3    Adinolfi       Production Technician I   US Citizen   Fully Meets
4    Adinolfi       Production Manager        US Citizen   Fully Meets
6    Anderson       Production Technician I   US Citizen   Exceeds
...  ...            ...                       ...          ...
303  Wang           Production Technician II  US Citizen   Fully Meets
304  Wang           Production Technician II  US Citizen   Fully Meets
305  Wang           Production Technician I   US Citizen   PIP
306  Wang           CIO                       US Citizen   Exceeds
307  Wang           Data Analyst              US Citizen   Fully Meets

79 rows × 4 columns

From the output above, of the 310 rows in the DataFrame, 79 are duplicates, and those are what the .duplicated() method extracts.
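A quick way to confirm the duplicate count without inspecting the whole output is to sum the boolean mask, since each True counts as 1. A minimal sketch with a made-up stand-in for the Employee_Name column:

```python
import pandas as pd

# Toy stand-in for the Employee_Name column (names are made up)
names = pd.Series(["Adinolfi", "Adinolfi", "Anderson", "Wang", "Wang", "Wang"])

# True counts as 1, so summing the mask gives the number of duplicate rows
dup_count = names.duplicated().sum()
print(dup_count)  # 3: one extra "Adinolfi" and two extra "Wang"
```

On the HR dataset, `df["Employee_Name"].duplicated().sum()` would return 79.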

ARGUMENT - LAST

By default, this method marks the first occurrence of a value as the non-duplicate; we can change this behaviour by passing the argument keep="last".

In the apple example, this marks the first two apples as duplicates and the last one as the non-duplicate.

df[df["Employee_Name"].duplicated(keep="last")]



     Employee_Name  Position                  CitizenDesc  PerformanceScore
0    Adinolfi       Production Technician I   US Citizen   Exceeds
1    Adinolfi       Sr. DBA                   US Citizen   Fully Meets
2    Adinolfi       Production Technician II  US Citizen   Fully Meets
3    Adinolfi       Production Technician I   US Citizen   Fully Meets
5    Anderson       Production Technician I   US Citizen   Fully Meets
...  ...            ...                       ...          ...
302  Wang           Production Technician II  US Citizen   Exceeds
303  Wang           Production Technician II  US Citizen   Fully Meets
304  Wang           Production Technician II  US Citizen   Fully Meets
305  Wang           Production Technician I   US Citizen   PIP
306  Wang           CIO                       US Citizen   Exceeds

ARGUMENT - FALSE

The keep parameter also accepts False, which marks every occurrence of a repeated value as a duplicate. In the apple example, all 3 apples are marked as duplicates, rather than keeping the first or the last as in the examples above.

Note – False here is the Python boolean, so do not wrap it in quotes.

df[df["Employee_Name"].duplicated(keep=False)]



     Employee_Name  Position                  CitizenDesc  PerformanceScore
0    Adinolfi       Production Technician I   US Citizen   Exceeds
1    Adinolfi       Sr. DBA                   US Citizen   Fully Meets
2    Adinolfi       Production Technician II  US Citizen   Fully Meets
3    Adinolfi       Production Technician I   US Citizen   Fully Meets
4    Adinolfi       Production Manager        US Citizen   Fully Meets
...  ...            ...                       ...          ...
303  Wang           Production Technician II  US Citizen   Fully Meets
304  Wang           Production Technician II  US Citizen   Fully Meets
305  Wang           Production Technician I   US Citizen   PIP
306  Wang           CIO                       US Citizen   Exceeds
307  Wang           Data Analyst              US Citizen   Fully Meets

Now finally, to extract only the rows whose values occur once in the dataset, we can use the "~" (tilde) operator to negate the boolean mask:

df_unique = ~df["Employee_Name"].duplicated(keep=False)
df[df_unique]



     Employee_Name  Position                  CitizenDesc  PerformanceScore
7    Andreola       Software Engineer         US Citizen   Fully Meets
25   Bozzi          Production Manager        US Citizen   Fully Meets
26   Bramante       Director of Operations    US Citizen   Exceeds
27   Brill          Production Technician I   US Citizen   Fully Meets
34   Burkett        Production Technician II  US Citizen   Fully Meets
...  ...            ...                       ...          ...
276  Sweetwater     Software Engineer         US Citizen   Exceeds
277  Szabo          Production Technician I   Non-Citizen  Fully Meets
278  Tavares        Production Technician II  US Citizen   Fully Meets
308  Zhou           Production Technician I   US Citizen   Fully Meets
309  Zima           NaN                       NaN          NaN

drop_duplicates()

This method is pretty similar to the previous one; however, it can be used on an entire DataFrame rather than on a single Series.

NOTE :- By default, this method looks for rows that are duplicated across all the columns of a DataFrame and drops them.

len(df)

Output

310


len(df.drop_duplicates())

Output

290
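To see why only 20 of the 310 rows are dropped here, note that a row is removed only when it matches another row on every column. A minimal sketch with a made-up toy DataFrame:

```python
import pandas as pd

# Toy DataFrame (made-up values): only rows 0 and 1 match on every column
df_toy = pd.DataFrame({
    "name": ["Wang", "Wang", "Wang"],
    "role": ["CIO", "CIO", "Data Analyst"],
})

# The second "Wang"/"CIO" row is dropped; the "Data Analyst" row survives
deduped = df_toy.drop_duplicates()
print(len(deduped))  # 2
```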

SUBSET PARAMETER

The subset parameter accepts a list of column names (as strings) in which to check for duplicates.

df1 = df.drop_duplicates(subset=["Employee_Name"], keep="first")
df1

     Employee_Name   Position                  CitizenDesc          PerformanceScore
0    Adinolfi        Production Technician I   US Citizen           Exceeds
5    Anderson        Production Technician I   US Citizen           Fully Meets
7    Andreola        Software Engineer         US Citizen           Fully Meets
14   Athwal          Production Technician I   US Citizen           Fully Meets
20   Beak            Production Technician I   US Citizen           Fully Meets
...  ...             ...                       ...                  ...
293  Von Massenbach  Production Technician II  US Citizen           Fully Meets
295  Wallace         Production Technician I   US Citizen           Needs Improvement
300  Wang            Production Technician I   Eligible NonCitizen  Fully Meets
308  Zhou            Production Technician I   US Citizen           Fully Meets
309  Zima            NaN                       NaN                  NaN

We can specify multiple columns, and all the keep arguments discussed in the previous section work here as well.

df1 = df.drop_duplicates(subset=["Employee_Name", "CitizenDesc"], keep=False)
df1



     Employee_Name  Position                  CitizenDesc          PerformanceScore
7    Andreola       Software Engineer         US Citizen           Fully Meets
16   Beak           Production Technician I   Eligible NonCitizen  Fully Meets
25   Bozzi          Production Manager        US Citizen           Fully Meets
26   Bramante       Director of Operations    US Citizen           Exceeds
27   Brill          Production Technician I   US Citizen           Fully Meets
...  ...            ...                       ...                  ...
287  Tejeda         Network Engineer          Eligible NonCitizen  Fully Meets
286  Tejeda         Software Engineer         Non-Citizen          Fully Meets
300  Wang           Production Technician I   Eligible NonCitizen  Fully Meets
308  Zhou           Production Technician I   US Citizen           Fully Meets
309  Zima           NaN                       NaN                  NaN

unique() Method

The unique() method finds the unique values in a Series and returns them as an array. Note that this method does not exclude missing values.

len(df["Employee_Name"])

Output

310


df["Employee_Name"].unique()


array(['Adinolfi', 'Anderson', 'Andreola', 'Athwal', 'Beak', 'Bondwell',
'Bozzi', 'Bramante', 'Brill', 'Brown', 'Burkett', 'Butler',
'Carabbio', 'Carey', 'Carr', 'Carter', 'Chace', 'Champaigne',
'Chan', 'Chang', 'Chivukula', 'Cierpiszewski', 'Cisco', 'Clayton',
'Cloninger', 'Close', 'Clukey', 'Cockel', 'Cole', 'Cornett',
'Costa', 'Crimmings', 'Daneault', 'Daniele', 'Darson', 'Davis',
'DeGweck', 'Del Bosque', 'Demita', 'Desimone', 'DiNocco',
'Dickinson', 'Dietrich', 'Digitale', 'Dobrin', 'Dolan', 'Dougall',
'Dunn', 'Eaton', 'Employee_Name', 'Engdahl', 'England', 'Erilus',
'Estremera', 'Evensen', 'Exantus', 'Faller', 'Fancett', 'Favis',
'Ferguson', 'Fernandes', 'Ferreira', 'Fidelia', 'Fitzpatrick',
'Foreman', 'Foss', 'Foster-Baker', 'Fraval', 'Friedman', 'Galia',
'Garcia', 'Garneau', 'Gaul', 'Gentry', 'Gerke', 'Gill', 'Gonzales',
'Gonzalez', 'Good', 'Handschiegl', 'Hankard', 'Harrison',
'Heitzman', 'Horton', 'Houlihan', 'Howard', 'Hubert', 'Hunts',
'Hutter', 'Huynh', 'Immediato', 'Ivey', 'Jackson', 'Jacobi',
'Jeannite', 'Jeremy Prater', 'Jhaveri', 'Johnson', 'Johnston',
'Jung', 'Kampew', 'Keatts', 'Khemmich', 'King', 'Kinsella',
'Kirill', 'Knapp', 'Kretschmer', 'LaRotonda', 'Lajiri', 'Langford',
'Langton', 'Latif', 'Le', 'LeBel', 'LeBlanc', 'Leach', 'Leruth',
'Liebig', 'Linares', 'Linden', 'Lindsay', 'Lundy', 'Lunquist',
'Lydon', 'Lynch', 'MacLennan', 'Mahoney', 'Manchester', 'Mancuso',
'Mangal', 'Martin', 'Martins', 'Maurice', 'McCarthy', 'McKinzie',
'Mckenna', 'Meads', 'Medeiros', 'Merlos', 'Miller', 'Monkfish',
'Monroe', 'Monterro', 'Moran', 'Morway', 'Motlagh', 'Moumanil',
'Mullaney', 'Murray', 'Navathe', 'Ndzi', 'Newman', 'Ngodup',
'Nguyen', 'Nowlan', "O'hare", 'Oliver', 'Onque', 'Osturnka',
'Owad', 'Ozark', 'Panjwani', 'Patronick', 'Pearson', 'Pelech',
'Pelletier', 'Perry', 'Peters', 'Peterson', 'Petingill',
'Petrowsky', 'Pham', 'Pitt', 'Potts', 'Power', 'Punjabhi',
'Purinton', 'Quinn', 'Rachael', 'Rarrick', 'Rhoads', 'Riordan',
'Rivera', 'Roberson', 'Robertson', 'Robinson', 'Roby', 'Roehrich',
'Rogers', 'Roper', 'Rose', 'Rossetti', 'Roup', 'Ruiz', 'Saada',
'Saar-Beckles', 'Sadki', 'Sahoo', 'Salter', 'Sander', 'Semizoglou',
'Sewkumar', 'Shepard', 'Shields', 'Simard', 'Singh', 'Sloan',
'Smith', 'Soto', 'South', 'Sparks', 'Spirea', 'Squatrito',
'Stanford', 'Stanley', 'Steans', 'Stoica', 'Strong', 'Sullivan',
'Sutwell', 'Sweetwater', 'Szabo', 'Tavares', 'Tejeda', 'Veera',
'Von Massenbach', 'Wallace', 'Wang', 'Zhou', 'Zima'], dtype=object)


len(df["Employee_Name"].unique())

Output

231

.nunique() Method

This method returns the number of unique values in a Series. By default it excludes missing values, via the parameter dropna=True.

You can pass dropna=False to count missing values as well.

df["Employee_Name"].nunique()

Output

231


df["Employee_Name"].nunique(dropna=False)

Output

231
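Both calls return 231 here because Employee_Name has no missing values. On a Series that does contain NaN, the dropna argument makes a difference; a minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Toy Series with missing values (made-up data)
s = pd.Series(["Zima", "Zhou", np.nan, "Zhou", np.nan])

print(s.nunique())              # 2 -> NaN excluded by default
print(s.nunique(dropna=False))  # 3 -> NaN counted as one extra distinct value
```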

Updated on: 10-Nov-2020
