How can we detect duplicate labels using the Python Pandas library?


Pandas is used to work with large data sets. In these data tables, rows and columns are indexed by names, and those names are called labels. When we work with a data set, some of its labels may be duplicated.

Duplicate labels can lead to incorrect conclusions about our data and can affect the output we expect. Label duplication simply means that a row or column index name appears more than once.

Let’s take an example to identify the duplicate labels in a DataFrame.

Identifying duplicates in column labels

Example

import pandas as pd

df1 = pd.DataFrame([[6, 1, 2, 7], [8, 4, 5, 9]], columns=["A", "A", "B", "C"])
print(df1)
print(df1.columns.is_unique)  # False when any column label is repeated

Explanation

We created a DataFrame with a 2x4 shape. To check whether the column labels contain duplicates, we use DataFrame.columns.is_unique. It returns a boolean: True if every column label is unique, and False if any label is repeated.

Output

   A  A  B  C
0  6  1  2  7
1  8  4  5  9
False

The output shows the DataFrame df1, and the boolean False indicates that the columns of df1 contain a duplicate label.
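The same is_unique check works on the row index as well. As a minimal sketch, both axes can be checked in one place; the helper name duplicate_label_report is hypothetical and not part of pandas.

```python
import pandas as pd


def duplicate_label_report(df):
    """Hypothetical helper: report which axes of df contain repeated labels."""
    return {
        "index": not df.index.is_unique,
        "columns": not df.columns.is_unique,
    }


df1 = pd.DataFrame([[6, 1, 2, 7], [8, 4, 5, 9]], columns=["A", "A", "B", "C"])
report = duplicate_label_report(df1)
print(report)
```

Here the row labels 0 and 1 are unique, so only the "columns" entry is True.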

Using the duplicated method, we can also filter the repeated labels out of the column labels, as in the line below.

df1.columns[~df1.columns.duplicated()]

df1.columns returns the column names as an Index, and the duplicated() method returns a boolean array that marks each repeat of a label after its first occurrence as True. Negating that array with ~ and using it to index df1.columns gives the unique column labels.

Index(['A', 'B', 'C'], dtype='object')
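The boolean mask from duplicated() can also be used without negation to list which labels are repeated, and with DataFrame.loc to drop the repeated columns from the data itself. The following is a small sketch of that idea; the variable names mask, dup_labels and deduped are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame([[6, 1, 2, 7], [8, 4, 5, 9]], columns=["A", "A", "B", "C"])

# True for every repeat of a label after its first occurrence
mask = df1.columns.duplicated()

# The labels that occur more than once
dup_labels = df1.columns[mask].unique().tolist()
print(dup_labels)

# Keep only the first column for each label
deduped = df1.loc[:, ~mask]
print(deduped)
```

Note that this keeps the first "A" column and discards the second, so it is only appropriate when the dropped data is not needed.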

Identifying duplicates in index labels

In the same way as for column labels, we can identify duplicates in the index (row labels).

Example

import pandas as pd

f = pd.DataFrame({"A": [0, 1, 2, 3, 4]}, index=["x", "y", "x", "y", "z"])
print(f)
print()
print(f.index.duplicated())  # boolean array: True marks a repeated label
unique_f = f[~f.index.duplicated()]  # keep the first row for each label
print()
print(unique_f)  # DataFrame without the repeated index labels

Explanation

The DataFrame “f” is created with some duplicate labels in its index. f.index.duplicated() returns a boolean array in which True marks each repeat of a label after its first occurrence. Using the negated array as a filter, we can drop the rows with repeated labels from our DataFrame.

Output

   A
x  0
y  1
x  2
y  3
z  4

[False False  True  True False]

   A
x  0
y  1
z  4

The first block is the DataFrame “f” with duplicated index labels, and the array of boolean values marks the repeated labels as True. The final block is the DataFrame with only the first row kept for each index label.
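By default, duplicated() treats the first occurrence of a label as the original and flags only the later ones. Its keep parameter changes this. The sketch below, using the same data as “f”, shows keep=False to select every row whose label is repeated and keep="last" to retain the last row per label instead of the first; the variable names all_dups and last_only are illustrative.

```python
import pandas as pd

f = pd.DataFrame({"A": [0, 1, 2, 3, 4]}, index=["x", "y", "x", "y", "z"])

# keep=False flags every occurrence of a repeated label, including the first
all_dups = f[f.index.duplicated(keep=False)]
print(all_dups)

# keep="last" flags all but the last occurrence, so negating keeps the last row
last_only = f[~f.index.duplicated(keep="last")]
print(last_only)
```

Inspecting all_dups before filtering can help decide which of the conflicting rows should be kept.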

raja
Updated on 18-Nov-2021 10:22:30
