Python Pandas – Create a subset and display only the last entry from duplicate values
To create a subset and display only the last entry from duplicate values, use the drop_duplicates() method with the keep parameter set to 'last'. This method removes duplicate rows based on specified columns and keeps only the last occurrence of each duplicate.
Creating the DataFrame
Let us first create a DataFrame with duplicate entries:
import pandas as pd
# Create DataFrame with duplicate Car-Place combinations
dataFrame = pd.DataFrame({
    'Car': ['BMW', 'Mercedes', 'Lamborghini', 'BMW', 'Mercedes', 'Porsche'],
    'Place': ['Delhi', 'Hyderabad', 'Chandigarh', 'Delhi', 'Hyderabad', 'Mumbai'],
    'UnitsSold': [85, 70, 80, 95, 55, 90]
})
print("Original DataFrame:")
print(dataFrame)
Original DataFrame:
           Car       Place  UnitsSold
0          BMW       Delhi         85
1     Mercedes   Hyderabad         70
2  Lamborghini  Chandigarh         80
3          BMW       Delhi         95
4     Mercedes   Hyderabad         55
5      Porsche      Mumbai         90
Using drop_duplicates() with keep='last'
Now we'll remove duplicates based on the Car and Place columns, keeping only the last occurrence of each combination:
import pandas as pd
# Create DataFrame
dataFrame = pd.DataFrame({
    'Car': ['BMW', 'Mercedes', 'Lamborghini', 'BMW', 'Mercedes', 'Porsche'],
    'Place': ['Delhi', 'Hyderabad', 'Chandigarh', 'Delhi', 'Hyderabad', 'Mumbai'],
    'UnitsSold': [85, 70, 80, 95, 55, 90]
})
# Remove duplicates and keep last entry
# Using subset parameter to specify columns for duplicate detection
dataFrame2 = dataFrame.drop_duplicates(subset=['Car', 'Place'], keep='last').reset_index(drop=True)
print("DataFrame after removing duplicates (keeping last):")
print(dataFrame2)
DataFrame after removing duplicates (keeping last):
           Car       Place  UnitsSold
0  Lamborghini  Chandigarh         80
1          BMW       Delhi         95
2     Mercedes   Hyderabad         55
3      Porsche      Mumbai         90
How It Works
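The example chains drop_duplicates() with reset_index(). Each piece does the following:
- subset: Specifies which columns to compare when identifying duplicates. If it is omitted, all columns are compared
- keep='last': Keeps the last occurrence in each group of duplicates and drops the earlier ones
- reset_index(drop=True): A separate method chained after drop_duplicates(). It renumbers the remaining rows from 0 and discards the old index labels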
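The roles of subset and reset_index() can be checked with a small sketch that reuses the same data. The variable names no_subset, kept, and renumbered are illustrative and not part of the original example.

```python
import pandas as pd

dataFrame = pd.DataFrame({
    'Car': ['BMW', 'Mercedes', 'Lamborghini', 'BMW', 'Mercedes', 'Porsche'],
    'Place': ['Delhi', 'Hyderabad', 'Chandigarh', 'Delhi', 'Hyderabad', 'Mumbai'],
    'UnitsSold': [85, 70, 80, 95, 55, 90]
})

# Without subset, all three columns are compared. UnitsSold differs
# between the repeated Car-Place pairs, so no row counts as a duplicate.
no_subset = dataFrame.drop_duplicates(keep='last')
print(len(no_subset))

# With subset, only Car and Place are compared. The surviving rows
# keep their original index labels until the index is reset.
kept = dataFrame.drop_duplicates(subset=['Car', 'Place'], keep='last')
print(kept.index.tolist())

# reset_index(drop=True) renumbers the rows from 0.
renumbered = kept.reset_index(drop=True)
print(renumbered.index.tolist())
```

With this data, the first call leaves all six rows in place, the second keeps the rows labelled 2, 3, 4, and 5, and the reset relabels them 0 through 3. Note that drop_duplicates() returns a new DataFrame by default, so dataFrame itself is unchanged.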
Comparison of keep Parameter Values
| Parameter | Description | BMW-Delhi Result |
|---|---|---|
| keep='first' | Keep first occurrence | Index 0 (UnitsSold: 85) |
| keep='last' | Keep last occurrence | Index 3 (UnitsSold: 95) |
| keep=False | Remove all duplicates | Neither (both removed) |
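The three settings in the table can be compared side by side on the same data. This is a minimal sketch, and the result names first, last, and none_kept are chosen here for illustration.

```python
import pandas as pd

dataFrame = pd.DataFrame({
    'Car': ['BMW', 'Mercedes', 'Lamborghini', 'BMW', 'Mercedes', 'Porsche'],
    'Place': ['Delhi', 'Hyderabad', 'Chandigarh', 'Delhi', 'Hyderabad', 'Mumbai'],
    'UnitsSold': [85, 70, 80, 95, 55, 90]
})

cols = ['Car', 'Place']

# keep='first' retains the earliest row of each Car-Place group
first = dataFrame.drop_duplicates(subset=cols, keep='first')
print(first['UnitsSold'].tolist())      # [85, 70, 80, 90]

# keep='last' retains the latest row of each group
last = dataFrame.drop_duplicates(subset=cols, keep='last')
print(last['UnitsSold'].tolist())       # [80, 95, 55, 90]

# keep=False drops every row that belongs to a duplicated group
none_kept = dataFrame.drop_duplicates(subset=cols, keep=False)
print(none_kept['UnitsSold'].tolist())  # [80, 90]
```

Only the Lamborghini and Porsche rows survive keep=False, because BMW-Delhi and Mercedes-Hyderabad each appear twice.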
Conclusion
Use drop_duplicates(subset=[...], keep='last'), listing the relevant column names in subset, to keep only the last occurrence of duplicate values. The subset parameter defines which columns determine duplicates, while keep='last' preserves the final entry from each duplicate group.
