Handling Duplicate Values in Datasets in Python


Introduction

This article covers handling duplicate values in datasets using Python. It defines duplicate values, shows how to spot them in a Pandas DataFrame, and presents several ways of dealing with them, including removing duplicates, keeping the first or last occurrence, and substituting alternative values for duplicates. Throughout, it emphasizes why managing duplicates matters for correct data analysis and machine learning models.

Cleaning the data is a crucial step in every data analysis or machine learning project, and duplicate values are one of the most common data quality problems. Duplicates can introduce bias and inaccuracies into analyses and machine learning models, so it is important to identify and manage them. This article walks through how to handle duplicate values in datasets using Python.


What are Duplicate Values?

Duplicate values are data points in a dataset that share the same values for all or some of their attributes. They can arise from data entry mistakes, flaws in data collection, or other causes.

Identifying Duplicate Values

The first step in addressing duplicates is finding them. The pandas library provides several functions for this. The duplicated() method returns a Boolean Series indicating whether each row is a duplicate of an earlier row, and the drop_duplicates() method removes duplicate rows from a dataset.

Below is an example of how to spot duplicate values in a Pandas DataFrame −

Example

import pandas as pd

# Create a sample DataFrame with duplicate values
data = pd.DataFrame({
   'name': ['John', 'Emily', 'John', 'Jane', 'John'],
   'age': [25, 28, 25, 30, 25],
   'salary': [50000, 60000, 50000, 70000, 50000]
})

# Identify duplicate rows
duplicates = data.duplicated()

# Print the duplicate rows
print(data[duplicates])

Output

   name  age  salary
2  John   25   50000
4  John   25   50000

The Python code above finds and prints duplicate values in a Pandas DataFrame. It breaks down as follows −

  • First, the Pandas library is imported as pd.

  • A sample DataFrame is created with duplicate entries across three columns: name, age, and salary.

  • The Pandas duplicated() method is called to find duplicate rows in the DataFrame. It returns a Boolean Series containing True for each row that is a duplicate of an earlier row.

  • The Boolean Series is used to index the original DataFrame with square brackets, which returns only the duplicate rows.

  • Finally, the DataFrame of duplicate rows is printed to the console.

This code outputs a DataFrame containing the rows that duplicate earlier rows across all columns.
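The example above checks for duplicates across every column. duplicated() also accepts a subset parameter to compare only certain columns, and a keep parameter to control which occurrences are flagged. A short sketch on the same sample data:

```python
import pandas as pd

data = pd.DataFrame({
   'name': ['John', 'Emily', 'John', 'Jane', 'John'],
   'age': [25, 28, 25, 30, 25],
   'salary': [50000, 60000, 50000, 70000, 50000]
})

# Flag rows whose 'name' repeats, regardless of the other columns
by_name = data.duplicated(subset=['name'])

# keep=False marks every member of a duplicate group, not just the later copies
all_copies = data.duplicated(keep=False)

print(data[by_name])
print(data[all_copies])
```

keep='first' (the default) flags all but the first occurrence, keep='last' flags all but the last, and keep=False flags every member of a duplicate group.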

Handling Duplicate Values

After locating the duplicate rows, we must deal with them. How to do so depends on the particular use case. Here are a few common approaches −

  • Drop duplicate rows − This is the most straightforward strategy. Duplicates can be removed from the DataFrame with the drop_duplicates() method.

Example

# Drop duplicate rows
data = data.drop_duplicates()

# Print the updated DataFrame
print(data)

Output

    name  age  salary
0   John   25   50000
1  Emily   28   60000
3   Jane   30   70000
  • Keep the first or last duplicate − Either the first or the last occurrence in a group of duplicates may be kept. Use the keep option of the drop_duplicates() method to choose which occurrence to retain.

Example

# Keep the first occurrence of the duplicates
data = data.drop_duplicates(keep='first')

# Print the updated DataFrame
print(data)

Output

    name  age  salary
0   John   25   50000
1  Emily   28   60000
3   Jane   30   70000
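keep='first' is the default for drop_duplicates(), which is why the output is unchanged from the earlier call. A sketch of the other options, run on a fresh copy of the sample data:

```python
import pandas as pd

data = pd.DataFrame({
   'name': ['John', 'Emily', 'John', 'Jane', 'John'],
   'age': [25, 28, 25, 30, 25],
   'salary': [50000, 60000, 50000, 70000, 50000]
})

# Keep the last occurrence of each duplicate group instead of the first
last = data.drop_duplicates(keep='last')

# keep=False drops every row that has a duplicate anywhere in the data
none = data.drop_duplicates(keep=False)

print(last)
print(none)
```

With keep='last', the John row that survives is the final one (index 4); with keep=False, no John row survives at all.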
  • Replace duplicate values − Duplicate values can be replaced with alternative values, such as the mean or median of the column. The groupby() function groups the data by a given column so that the mean or median can be computed per group.

Example

# Replace duplicate values with the median of the column
data['salary'] = data.groupby('name')['salary'].transform('median')

# Print the updated DataFrame
print(data)

Output

    name  age  salary
0   John   25   50000
1  Emily   28   60000
3   Jane   30   70000
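In the walkthrough above, drop_duplicates() has already been applied, so the transform leaves the salaries unchanged. To see the replacement take effect, the transform has to run on data that still contains the repeated groups. A minimal sketch, using hypothetical data in which the same name appears with different salaries:

```python
import pandas as pd

# Hypothetical data: 'John' appears three times with different salaries
raw = pd.DataFrame({
   'name': ['John', 'Emily', 'John', 'Jane', 'John'],
   'salary': [40000, 60000, 50000, 70000, 60000]
})

# Replace each salary with the median salary for that name...
raw['salary'] = raw.groupby('name')['salary'].transform('median')

# ...which makes the John rows identical, so they can now be dropped
deduped = raw.drop_duplicates()
print(deduped)
```

transform('median') broadcasts each group's median back to every row of that group, so rows that differed only in salary become identical and can then be removed with drop_duplicates().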

Conclusion

In conclusion, managing duplicate values in datasets is crucial for correct data analysis and machine learning models. The Python pandas package provides functions to locate and handle duplicates: the duplicated() method returns a Boolean Series showing whether each row is a duplicate of another row, and the drop_duplicates() method removes duplicate rows from a dataset. Duplicates can be handled in several ways, including removing them, keeping the first or last occurrence, and replacing duplicate values with other values such as the mean or median of the column.

Updated on: 10-Mar-2023
