Python - Remove duplicate values from a Pandas DataFrame
To remove duplicate values from a Pandas DataFrame, use the drop_duplicates() method. This method identifies rows with identical values across all columns and removes the duplicates, keeping only the first occurrence of each unique row by default.
Creating a DataFrame with Duplicates
Let's create a sample DataFrame containing duplicate rows:
import pandas as pd

# Create a DataFrame with duplicate rows
dataFrame = pd.DataFrame({
    'Car': ['BMW', 'Mercedes', 'Lamborghini', 'BMW', 'Mercedes', 'Porsche'],
    'Place': ['Delhi', 'Hyderabad', 'Chandigarh', 'Delhi', 'Hyderabad', 'Mumbai'],
    'UnitsSold': [95, 70, 80, 95, 70, 90]
})

print("Original DataFrame...")
print(dataFrame)
Original DataFrame...
           Car       Place  UnitsSold
0          BMW       Delhi         95
1     Mercedes   Hyderabad         70
2  Lamborghini  Chandigarh         80
3          BMW       Delhi         95
4     Mercedes   Hyderabad         70
5      Porsche      Mumbai         90
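Before removing anything, you can inspect which rows pandas considers duplicates with the related duplicated() method, which returns a boolean mask. Here is a quick sketch using the same sample data (the variable name mask is just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Car': ['BMW', 'Mercedes', 'Lamborghini', 'BMW', 'Mercedes', 'Porsche'],
    'Place': ['Delhi', 'Hyderabad', 'Chandigarh', 'Delhi', 'Hyderabad', 'Mumbai'],
    'UnitsSold': [95, 70, 80, 95, 70, 90]
})

# duplicated() marks a row True if it repeats an earlier row;
# the first occurrence of each row is marked False
mask = df.duplicated()
print(mask.tolist())   # [False, False, False, True, True, False]

# Number of rows that would be dropped
print(int(mask.sum()))  # 2
```

Rows 3 and 4 are flagged because they repeat rows 0 and 1, which matches the duplicates visible in the output above.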
Removing Duplicate Rows
Use drop_duplicates() to remove rows that have identical values across all columns:
import pandas as pd

dataFrame = pd.DataFrame({
    'Car': ['BMW', 'Mercedes', 'Lamborghini', 'BMW', 'Mercedes', 'Porsche'],
    'Place': ['Delhi', 'Hyderabad', 'Chandigarh', 'Delhi', 'Hyderabad', 'Mumbai'],
    'UnitsSold': [95, 70, 80, 95, 70, 90]
})

print("Before removing duplicates:")
print("Car column counts:")
print(dataFrame['Car'].value_counts())

# Remove duplicate rows, keeping the first occurrence
dataFrame_clean = dataFrame.drop_duplicates()

print("\nAfter removing duplicates:")
print(dataFrame_clean)
print("\nCar column counts:")
print(dataFrame_clean['Car'].value_counts())
Before removing duplicates:
Car column counts:
BMW            2
Mercedes       2
Porsche        1
Lamborghini    1
Name: Car, dtype: int64

After removing duplicates:
           Car       Place  UnitsSold
0          BMW       Delhi         95
1     Mercedes   Hyderabad         70
2  Lamborghini  Chandigarh         80
5      Porsche      Mumbai         90

Car column counts:
BMW            1
Mercedes       1
Porsche        1
Lamborghini    1
Name: Car, dtype: int64
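Notice that the surviving rows keep their original index labels (0, 1, 2, 5). If you want a clean 0-based index instead, drop_duplicates() accepts an ignore_index parameter (available in pandas 1.0 and later); a short sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Car': ['BMW', 'Mercedes', 'Lamborghini', 'BMW', 'Mercedes', 'Porsche'],
    'Place': ['Delhi', 'Hyderabad', 'Chandigarh', 'Delhi', 'Hyderabad', 'Mumbai'],
    'UnitsSold': [95, 70, 80, 95, 70, 90]
})

# By default the surviving rows keep their original index labels
clean = df.drop_duplicates()
print(clean.index.tolist())            # [0, 1, 2, 5]

# ignore_index=True renumbers the result from 0
clean_reindexed = df.drop_duplicates(ignore_index=True)
print(clean_reindexed.index.tolist())  # [0, 1, 2, 3]
```

The same effect can be achieved with a chained reset_index(drop=True), but the keyword is more direct.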
Removing Duplicates Based on Specific Columns
You can also remove duplicates based on specific columns using the subset parameter:
import pandas as pd

dataFrame = pd.DataFrame({
    'Car': ['BMW', 'Mercedes', 'BMW', 'BMW'],
    'Place': ['Delhi', 'Hyderabad', 'Mumbai', 'Delhi'],
    'UnitsSold': [95, 70, 85, 95]
})

print("Original DataFrame:")
print(dataFrame)

# Remove duplicates based only on the 'Car' column
unique_cars = dataFrame.drop_duplicates(subset=['Car'])

print("\nUnique cars (first occurrence):")
print(unique_cars)
Original DataFrame:
        Car      Place  UnitsSold
0       BMW      Delhi         95
1  Mercedes  Hyderabad         70
2       BMW     Mumbai         85
3       BMW      Delhi         95

Unique cars (first occurrence):
        Car      Place  UnitsSold
0       BMW      Delhi         95
1  Mercedes  Hyderabad         70
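The keep parameter combines with subset as well. As a sketch, keep='last' retains the final occurrence of each car instead of the first:

```python
import pandas as pd

df = pd.DataFrame({
    'Car': ['BMW', 'Mercedes', 'BMW', 'BMW'],
    'Place': ['Delhi', 'Hyderabad', 'Mumbai', 'Delhi'],
    'UnitsSold': [95, 70, 85, 95]
})

# keep='last' retains the last occurrence of each Car value,
# so the BMW row kept is index 3 (Delhi, 95) rather than index 0
last_cars = df.drop_duplicates(subset=['Car'], keep='last')
print(last_cars.index.tolist())       # [1, 3]
print(last_cars['Place'].tolist())    # ['Hyderabad', 'Delhi']
```

The result still preserves the original row order; only which occurrence survives changes.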
Key Parameters

| Parameter | Description | Default |
|---|---|---|
| subset | Columns to consider when identifying duplicates | All columns |
| keep | Which duplicate to retain: 'first', 'last', or False (drop all) | 'first' |
| inplace | Modify the original DataFrame instead of returning a copy | False |
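To illustrate the remaining options, here is a small sketch: keep=False discards every row that has a duplicate (keeping none of the copies), and inplace=True modifies the DataFrame in place instead of returning a new one.

```python
import pandas as pd

df = pd.DataFrame({
    'Car': ['BMW', 'Mercedes', 'Lamborghini', 'BMW', 'Mercedes', 'Porsche'],
    'Place': ['Delhi', 'Hyderabad', 'Chandigarh', 'Delhi', 'Hyderabad', 'Mumbai'],
    'UnitsSold': [95, 70, 80, 95, 70, 90]
})

# keep=False drops every row that appears more than once,
# leaving only rows that were unique to begin with
only_unique = df.drop_duplicates(keep=False)
print(only_unique['Car'].tolist())   # ['Lamborghini', 'Porsche']

# inplace=True deduplicates df itself and returns None
df.drop_duplicates(inplace=True)
print(len(df))                       # 4
```

Note that with inplace=True there is no return value to assign, which is why the call stands alone on its own line.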
Conclusion
The drop_duplicates() method efficiently removes duplicate rows from DataFrames. Use the subset parameter to focus on specific columns, and keep to control which duplicate to retain.