Write a program in Python to remove first duplicate rows in a given dataframe
Duplicate rows in a DataFrame can clutter your data analysis. In pandas, you can remove duplicate rows using the drop_duplicates() method. When you set keep='last', it removes the first occurrence of duplicates and keeps the last one.
Understanding the Problem
Let's start by creating a DataFrame with duplicate rows to see how duplicate removal works:
import pandas as pd
df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6, 2, 7, 3, 9, 10],
    'Age': [12, 13, 14, 13, 14, 12, 13, 16, 14, 15, 14]
})
print("Original DataFrame:")
print(df)
Original DataFrame:
Id Age
0 1 12
1 2 13
2 3 14
3 4 13
4 5 14
5 6 12
6 2 13
7 7 16
8 3 14
9 9 15
10 10 14
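Before removing anything, it can help to see exactly which rows pandas considers duplicates. The related duplicated() method returns a boolean mask; with keep='last' it flags every occurrence except the last, which is exactly the set of rows that drop_duplicates(keep='last') will remove. A short sketch using the same DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6, 2, 7, 3, 9, 10],
    'Age': [12, 13, 14, 13, 14, 12, 13, 16, 14, 15, 14]
})

# duplicated(keep='last') marks all but the last occurrence of each
# duplicate group - i.e. the rows drop_duplicates(keep='last') removes
mask = df.duplicated(subset=['Id', 'Age'], keep='last')
print("Rows that will be removed:")
print(df[mask])
```

For this data the mask flags index 1 (Id=2, Age=13) and index 2 (Id=3, Age=14), the first occurrences of the two duplicate pairs.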
Removing First Duplicate Rows
To remove the first occurrence of duplicate rows, use drop_duplicates() with keep='last'. You can specify which columns to check for duplicates using the subset parameter:
import pandas as pd
df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6, 2, 7, 3, 9, 10],
    'Age': [12, 13, 14, 13, 14, 12, 13, 16, 14, 15, 14]
})
# Remove first duplicate rows based on Id and Age columns
result = df.drop_duplicates(subset=['Id', 'Age'], keep='last')
print("DataFrame after removing first duplicate rows:")
print(result)
DataFrame after removing first duplicate rows:
Id Age
0 1 12
3 4 13
4 5 14
5 6 12
6 2 13
7 7 16
8 3 14
9 9 15
10 10 14
How It Works
The drop_duplicates() method identifies duplicate rows based on the specified columns:
- subset=['Id', 'Age']: check for duplicates based on both the Id and Age columns
- keep='last': keep the last occurrence and remove the earlier duplicates
- Rows with Id=2, Age=13 (index 1 and 6): keeps the row at index 6
- Rows with Id=3, Age=14 (index 2 and 8): keeps the row at index 8
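The keep parameter accepts three values, and comparing them side by side makes the behavior concrete. This sketch contrasts keep='first' (the default, which removes the later occurrences instead) and keep=False (which drops every occurrence of a duplicated row):

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6, 2, 7, 3, 9, 10],
    'Age': [12, 13, 14, 13, 14, 12, 13, 16, 14, 15, 14]
})

# keep='first' (the default): keeps index 1 and 2, removes index 6 and 8
first_kept = df.drop_duplicates(subset=['Id', 'Age'], keep='first')

# keep=False: removes all four rows involved in duplication (1, 2, 6, 8)
none_kept = df.drop_duplicates(subset=['Id', 'Age'], keep=False)

print("keep='first':")
print(first_kept)
print("\nkeep=False:")
print(none_kept)
```

With keep='first' you get 9 rows (indexes 0-5, 7, 9, 10); with keep=False you get only the 7 rows that were never duplicated.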
Alternative Approaches
You can also remove duplicates based on all columns, or on a single column:
import pandas as pd
df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6, 2, 7, 3, 9, 10],
    'Age': [12, 13, 14, 13, 14, 12, 13, 16, 14, 15, 14]
})
# Method 1: Remove duplicates based on all columns
all_cols_result = df.drop_duplicates(keep='last')
# Method 2: Remove duplicates based only on 'Id' column
id_only_result = df.drop_duplicates(subset=['Id'], keep='last')
print("Removing duplicates based on all columns:")
print(all_cols_result)
print("\nRemoving duplicates based on Id only:")
print(id_only_result)
Removing duplicates based on all columns:
Id Age
0 1 12
3 4 13
4 5 14
5 6 12
6 2 13
7 7 16
8 3 14
9 9 15
10 10 14
(Since this DataFrame has only the Id and Age columns, checking all columns is equivalent to subset=['Id', 'Age'], so indexes 1 and 2 are again removed.)
Removing duplicates based on Id only:
Id Age
0 1 12
3 4 13
4 5 14
5 6 12
6 2 13
7 7 16
8 3 14
9 9 15
10 10 14
Conclusion
Use drop_duplicates(subset=['column1', 'column2'], keep='last') to remove first occurrences of duplicate rows. The subset parameter controls which columns to check, and keep='last' preserves the final occurrence of duplicates.
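One last practical note: drop_duplicates() preserves the original index labels, which is why the results above show gaps such as 0, 3, 4, ... If you want a clean 0-based index, you can pass ignore_index=True (available in pandas 1.0 and later), as sketched here:

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6, 2, 7, 3, 9, 10],
    'Age': [12, 13, 14, 13, 14, 12, 13, 16, 14, 15, 14]
})

# ignore_index=True renumbers the result 0..n-1 instead of
# keeping the surviving rows' original index labels
result = df.drop_duplicates(subset=['Id', 'Age'], keep='last', ignore_index=True)
print(result)
```

The result contains the same 9 rows as before, but indexed 0 through 8. Calling .reset_index(drop=True) on the result achieves the same thing in older pandas versions.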
