How to Count Duplicates in a Pandas DataFrame?
Pandas is a popular Python library used for data manipulation and analysis. One common task in data analysis is to count the number of duplicate values in a Pandas DataFrame. Duplicates can occur when multiple rows have the same values in all columns, or in a subset of columns.
There are different ways to count duplicates in a Pandas DataFrame, depending on the specific requirements of the analysis. One common approach is to use the duplicated() method, which returns a Boolean Series indicating whether each row is a duplicate of a previous row. By default, the method considers all columns when checking for duplicates, but this behaviour can be customised by passing a subset of columns to the subset parameter.
Once duplicates have been identified, they can be counted using the sum() method of the Boolean Series returned by duplicated(). Alternatively, the pivot_table() method can be used to calculate the duplicate counts.
Another way to count duplicates is to use the groupby() method to group the DataFrame by the columns of interest and then count the number of occurrences of each group. This can be achieved using the size() method of the resulting grouped DataFrame.
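As a sketch of this groupby() approach, using sample data of the same shape as the examples that follow:

```python
import pandas as pd

# sample DataFrame with repeated combinations of A and B
df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 3],
                   'B': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'C': ['x', 'x', 'y', 'y', 'z', 'z']})

# group by the columns of interest and count the rows in each group
group_counts = df.groupby(['A', 'B']).size()
print(group_counts)
```

Each entry of the resulting Series is the number of rows sharing that (A, B) combination; any entry greater than 1 marks a duplicated combination.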
Let's first make use of the pivot_table() method to count the duplicates that might be present in a Pandas DataFrame.
pivot_table()
The pivot_table() method is a powerful tool for summarising and aggregating data in a Pandas DataFrame. It allows us to group the data by one or more columns and calculate summary statistics for each group. In the case of counting duplicates, we can group the data by the columns that we want to check for duplicates and count the number of occurrences of each group by passing aggfunc='size'.
Consider the code shown below in which we will use the pivot_table() method.
Example
import pandas as pd

# create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 3],
                   'B': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'C': ['x', 'x', 'y', 'y', 'z', 'z']})

# group the data by columns A and B, and count the number of occurrences of each group
dup_counts = df.pivot_table(index=['A', 'B'], aggfunc='size')
print(dup_counts)
Explanation
In this example, we grouped the DataFrame by columns A and B using pivot_table() and calculated the size of each group by passing aggfunc='size'. The resulting dup_counts variable is a Series that shows the number of occurrences of each unique combination of values in columns A and B. This information can be useful for identifying and analysing duplicate data in the DataFrame.
To run the code we need to first install the pandas library and then run the command shown below.
Command
pip3 install pandas
python3 main.py
The output of the above command will be:
Output
A  B
1  a    1
2  a    1
   b    1
3  b    2
   c    1
dtype: int64
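To see only the combinations that actually occur more than once, the counts Series can be filtered with a Boolean condition; a short sketch:

```python
import pandas as pd

# sample DataFrame as in the pivot_table() example
df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 3],
                   'B': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'C': ['x', 'x', 'y', 'y', 'z', 'z']})

dup_counts = df.pivot_table(index=['A', 'B'], aggfunc='size')

# keep only the (A, B) combinations that occur more than once
duplicates_only = dup_counts[dup_counts > 1]
print(duplicates_only)
```

Here only the combination (3, 'b'), which appears twice, survives the filter.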
duplicated()
The duplicated() method is a function in Python's pandas library that is used to identify duplicate rows in a DataFrame. It marks a row as a duplicate if an identical row has already appeared earlier in the DataFrame, and it returns a Boolean Series where True indicates that the row is a duplicate.
Now let's make use of the duplicated() method with the help of an example.
Consider the code shown below.
Example
import pandas as pd

# create a sample DataFrame containing fully duplicated rows
df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 3],
                   'B': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'C': ['x', 'x', 'x', 'y', 'y', 'y']})

# identify the duplicated rows based on all columns
dups = df.duplicated()

# count the number of duplicated rows
dup_count = dups.sum()
print("Number of duplicated rows: ", dup_count)
Explanation
In this example, we used the duplicated() method to identify the duplicated rows in the DataFrame. By default, the method considers all columns when checking for duplicates and, with the default keep='first', flags every repeat of an earlier row while leaving the first occurrence unflagged. We then used the sum() method to count the number of duplicated rows, which is 3 in this case.
Output
Number of duplicated rows: 3
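For a single column, a related one-liner is the value_counts() method, which counts the occurrences of each unique value; a brief sketch using the same style of sample data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 3],
                   'B': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'C': ['x', 'x', 'y', 'y', 'z', 'z']})

# count how many times each value appears in column A
counts = df['A'].value_counts()
print(counts)

# values that occur more than once are duplicated in that column
print(counts[counts > 1])
```

value_counts() sorts the result in descending order of frequency, so the most heavily duplicated values appear first.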
Conclusion
In conclusion, counting duplicates in a Pandas DataFrame is an important step in data cleaning and analysis. With the help of various methods such as groupby(), value_counts(), pivot_table(), and duplicated(), identifying and handling duplicate data can be made easier and more efficient.
By using these methods, it is possible to identify and remove duplicate rows, count the number of occurrences of each unique value in a column, and group data by specific criteria to better understand patterns and relationships in the data.
It is important to note that different methods may be more appropriate depending on the specific task at hand, and it is always a good practice to check the documentation and experiment with different approaches to find the most effective solution.