Write a program in Python to remove first duplicate rows in a given dataframe
Duplicate rows in a DataFrame can clutter your data analysis. In pandas, you can remove duplicate rows using the drop_duplicates() method. When you set keep='last', it removes the first occurrence of duplicates and keeps the last one.
Understanding the Problem
Let's start by creating a DataFrame with duplicate rows to see how duplicate removal works:
import pandas as pd
df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6, 2, 7, 3, 9, 10],
    'Age': [12, 13, 14, 13, 14, 12, 13, 16, 14, 15, 14]
})
print("Original DataFrame:")
print(df)
Original DataFrame:
Id Age
0 1 12
1 2 13
2 3 14
3 4 13
4 5 14
5 6 12
6 2 13
7 7 16
8 3 14
9 9 15
10 10 14
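Before removing anything, it can help to see exactly which rows pandas considers duplicates. The related duplicated() method returns a boolean mask; with keep='last' it flags every occurrence except the last, which is exactly the set of rows that drop_duplicates(keep='last') will remove. A short sketch using the same DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6, 2, 7, 3, 9, 10],
    'Age': [12, 13, 14, 13, 14, 12, 13, 16, 14, 15, 14]
})

# duplicated(keep='last') marks all but the last occurrence of each
# duplicate group - i.e. the rows drop_duplicates(keep='last') removes
mask = df.duplicated(subset=['Id', 'Age'], keep='last')
print("Rows that will be removed:")
print(df[mask])
```

For this data the mask flags index 1 (Id=2, Age=13) and index 2 (Id=3, Age=14), the first occurrences of the two duplicate pairs.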
Removing First Duplicate Rows
To remove the first occurrence of duplicate rows, use drop_duplicates() with keep='last'. You can specify which columns to check for duplicates using the subset parameter:
import pandas as pd
df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6, 2, 7, 3, 9, 10],
    'Age': [12, 13, 14, 13, 14, 12, 13, 16, 14, 15, 14]
})
# Remove first duplicate rows based on Id and Age columns
result = df.drop_duplicates(subset=['Id', 'Age'], keep='last')
print("DataFrame after removing first duplicate rows:")
print(result)
DataFrame after removing first duplicate rows:
Id Age
0 1 12
3 4 13
4 5 14
5 6 12
6 2 13
7 7 16
8 3 14
9 9 15
10 10 14
How It Works
The drop_duplicates() method identifies duplicate rows based on the specified columns:
- subset=['Id', 'Age']: check for duplicates based on both the Id and Age columns
- keep='last': keep the last occurrence and remove the earlier duplicates
- Rows with Id=2, Age=13 (index 1 and 6): keeps the row at index 6
- Rows with Id=3, Age=14 (index 2 and 8): keeps the row at index 8
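The keep parameter accepts three values, and comparing them side by side makes the behavior concrete. This sketch contrasts keep='first' (the default, which removes the later occurrences instead) and keep=False (which drops every occurrence of a duplicated row):

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6, 2, 7, 3, 9, 10],
    'Age': [12, 13, 14, 13, 14, 12, 13, 16, 14, 15, 14]
})

# keep='first' (the default): keeps index 1 and 2, removes index 6 and 8
first_kept = df.drop_duplicates(subset=['Id', 'Age'], keep='first')

# keep=False: removes all four rows involved in duplication (1, 2, 6, 8)
none_kept = df.drop_duplicates(subset=['Id', 'Age'], keep=False)

print("keep='first':")
print(first_kept)
print("\nkeep=False:")
print(none_kept)
```

With keep='first' you get 9 rows (indexes 0-5, 7, 9, 10); with keep=False you get only the 7 rows that were never duplicated.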
Alternative Approaches
You can also remove duplicates based on all columns, or on a single column:
import pandas as pd
df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6, 2, 7, 3, 9, 10],
    'Age': [12, 13, 14, 13, 14, 12, 13, 16, 14, 15, 14]
})
# Method 1: Remove duplicates based on all columns
all_cols_result = df.drop_duplicates(keep='last')
# Method 2: Remove duplicates based only on 'Id' column
id_only_result = df.drop_duplicates(subset=['Id'], keep='last')
print("Removing duplicates based on all columns:")
print(all_cols_result)
print("\nRemoving duplicates based on Id only:")
print(id_only_result)
Removing duplicates based on all columns:
Id Age
0 1 12
3 4 13
4 5 14
5 6 12
6 2 13
7 7 16
8 3 14
9 9 15
10 10 14
(Since this DataFrame has only the Id and Age columns, checking all columns is equivalent to subset=['Id', 'Age'], so indexes 1 and 2 are again removed.)
Removing duplicates based on Id only:
Id Age
0 1 12
3 4 13
4 5 14
5 6 12
6 2 13
7 7 16
8 3 14
9 9 15
10 10 14
Conclusion
Use drop_duplicates(subset=['column1', 'column2'], keep='last') to remove first occurrences of duplicate rows. The subset parameter controls which columns to check, and keep='last' preserves the final occurrence of duplicates.
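One last practical note: drop_duplicates() preserves the original index labels, which is why the results above show gaps such as 0, 3, 4, ... If you want a clean 0-based index, you can pass ignore_index=True (available in pandas 1.0 and later), as sketched here:

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3, 4, 5, 6, 2, 7, 3, 9, 10],
    'Age': [12, 13, 14, 13, 14, 12, 13, 16, 14, 15, 14]
})

# ignore_index=True renumbers the result 0..n-1 instead of
# keeping the surviving rows' original index labels
result = df.drop_duplicates(subset=['Id', 'Age'], keep='last', ignore_index=True)
print(result)
```

The result contains the same 9 rows as before, but indexed 0 through 8. Calling .reset_index(drop=True) on the result achieves the same thing in older pandas versions.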
