How to Count Duplicates in a Pandas DataFrame?


Pandas is a popular Python library used for data manipulation and analysis. One common task in data analysis is to count the number of duplicate values in a Pandas DataFrame. Duplicates can occur when multiple rows have the same values in all columns, or in a subset of columns.

There are different ways to count duplicates in a Pandas DataFrame, depending on the specific requirements of the analysis. One common approach is to use the duplicated() method, which returns a Boolean Series indicating whether each row is a duplicate of a previous row. By default, the method considers all columns when checking for duplicates, but this behaviour can be customised by passing a subset of columns to the subset parameter.
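As a minimal sketch of the subset parameter (the DataFrame and column names here are illustrative, not from the examples below):

```python
import pandas as pd

# a small DataFrame whose rows repeat only in the 'name' column
df = pd.DataFrame({'name': ['Ann', 'Bob', 'Ann'],
                   'city': ['Oslo', 'Rome', 'Lima']})

# all columns considered: no row is a full duplicate, so every value is False
print(df.duplicated())

# only 'name' considered: the second 'Ann' row is flagged as a duplicate
print(df.duplicated(subset=['name']))
```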

Once duplicates have been identified, they can be counted using the sum() method of the Boolean Series returned by duplicated(). Alternatively, we can also make use of the pivot_table() method to calculate the count of the duplicates as well.

Another way to count duplicates is to use the groupby() method to group the DataFrame by the columns of interest and then count the number of occurrences of each group. This can be achieved using the size() method of the resulting grouped DataFrame.
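The groupby() approach can be sketched as follows, using the same sample data as the examples later in this article:

```python
import pandas as pd

# sample DataFrame with repeated values in columns A and B
df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 3],
                   'B': ['a', 'a', 'b', 'b', 'b', 'c']})

# group by the columns of interest and count the rows in each group
group_counts = df.groupby(['A', 'B']).size()
print(group_counts)

# keep only the combinations that occur more than once, i.e. the duplicates
print(group_counts[group_counts > 1])
```

Filtering the resulting Series with group_counts > 1 leaves only the duplicated combinations, which groupby() gives us essentially for free.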

Let's first make use of the pivot_table() method to count the duplicates that might be present in a Pandas DataFrame.

pivot_table

The pivot_table() method is a powerful tool for summarising and aggregating data in a Pandas DataFrame. It allows us to group the data by one or more columns and calculate summary statistics for each group. In the case of counting duplicates, we can group the data by the columns that we want to check for duplicates, and then count the number of rows in each group by passing aggfunc='size'.

Consider the code shown below in which we will use the pivot_table() method.

Example

import pandas as pd

# create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 3],
                   'B': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'C': ['x', 'x', 'y', 'y', 'z', 'z']})

# group the data by columns A and B, and count the number of occurrences of each group
dup_counts = df.pivot_table(index=['A', 'B'], aggfunc='size')

print(dup_counts)

Explanation

In this example, we grouped the DataFrame by columns A and B using pivot_table() and counted the rows in each group by passing aggfunc='size'. The resulting dup_counts variable is a Series that shows the number of occurrences of each unique combination of values in columns A and B; any count greater than 1 indicates a duplicated combination. This information can be useful for identifying and analysing duplicate data in the DataFrame.

To run the code, we first need to install the pandas library and then run the commands shown below.

Command

pip3 install pandas
python3 main.py

The output of the above program will be:

Output

A  B
1  a    1
2  a    1
   b    1
3  b    2
   c    1
dtype: int64

duplicated()

The duplicated() method is a function in Python's pandas library that is used to identify duplicate rows in a DataFrame. It returns a Boolean Series in which True marks a row that is a duplicate of an earlier row; by default (keep='first'), the first occurrence of each set of duplicates is marked False.

Now let's make use of the duplicated() method with the help of an example.

Consider the code shown below.

Example

import pandas as pd

# create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 3],
                   'B': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'C': ['x', 'x', 'y', 'y', 'z', 'z']})

# identify the duplicated rows based on column A only
dups = df.duplicated(subset=['A'])

# count the number of duplicated rows
dup_count = dups.sum()

print("Number of duplicated rows: ", dup_count)

Explanation

In this example, we used the duplicated() method to identify the duplicated rows in the DataFrame. By default, the method considers all columns when checking for duplicates; since every row in this DataFrame is unique across columns A, B, and C together, we pass subset=['A'] so that only column A is compared. The values 2 and 3 each repeat in column A, giving three rows that duplicate an earlier value. We then used the sum() method to count them, which works because True counts as 1 when summing a Boolean Series.

Output

Number of duplicated rows:  3
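The count depends on the keep parameter of duplicated(). A small sketch using the same column A values as a Series shows the difference:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3])

# keep='first' (the default): the first occurrence of each value is not flagged
print(s.duplicated().sum())             # 3

# keep=False: every member of a duplicated group is flagged
print(s.duplicated(keep=False).sum())   # 5
```

With keep=False, both 2s and all three 3s are counted, which is useful when we want the total number of rows involved in duplication rather than just the repeats.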

Conclusion

In conclusion, counting duplicates in a Pandas DataFrame is an important step in data cleaning and analysis. With the help of various methods such as groupby(), value_counts(), pivot_table(), and duplicated(), identifying and handling duplicate data can be made easier and more efficient.
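Since value_counts() is mentioned above but not demonstrated, here is a brief sketch of how it counts occurrences of each unique value in a single column:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 3]})

# count how many times each unique value appears in column A
counts = df['A'].value_counts()
print(counts)
```

Any value with a count greater than 1 in the resulting Series is duplicated in that column.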

By using these methods, it is possible to identify and remove duplicate rows, count the number of occurrences of each unique value in a column, and group data by specific criteria to better understand patterns and relationships in the data.

It is important to note that different methods may be more appropriate depending on the specific task at hand, and it is always a good practice to check the documentation and experiment with different approaches to find the most effective solution.

Updated on: 03-Aug-2023
