Drop Duplicate Rows - Problem
You work as a data analyst for an e-commerce company and have received a customer database that contains duplicate entries. Your task is to clean the data by removing duplicate rows based on email addresses.
Given a DataFrame customers with columns:
customer_id(int) - Unique identifier for each customer recordname(object) - Customer's nameemail(object) - Customer's email address
Goal: Remove all duplicate rows where the same email appears multiple times, keeping only the first occurrence of each unique email.
This is a common data preprocessing task in machine learning pipelines and business analytics where data quality is crucial for accurate insights.
Input & Output
example_1.py โ Basic duplicate removal
$
Input:
customers = pd.DataFrame({
'customer_id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Alice'],
'email': ['alice@email.com', 'bob@email.com', 'alice@email.com']
})
โบ
Output:
customer_id name email
0 1 Alice alice@email.com
1 2 Bob bob@email.com
๐ก Note:
The third row is removed because 'alice@email.com' already appeared in the first row. We keep the first occurrence and remove subsequent duplicates.
example_2.py โ Multiple duplicates
$
Input:
customers = pd.DataFrame({
'customer_id': [1, 2, 3, 4, 5],
'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
'email': ['alice@email.com', 'bob@email.com', 'alice@email.com', 'charlie@email.com', 'bob@email.com']
})
โบ
Output:
customer_id name email
0 1 Alice alice@email.com
1 2 Bob bob@email.com
2 4 Charlie charlie@email.com
๐ก Note:
Rows 3 and 5 are removed as duplicates. Row 3 has the same email as row 1, and row 5 has the same email as row 2. Only the first occurrence of each unique email is retained.
example_3.py โ No duplicates edge case
$
Input:
customers = pd.DataFrame({
'customer_id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie'],
'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com']
})
โบ
Output:
customer_id name email
0 1 Alice alice@email.com
1 2 Bob bob@email.com
2 3 Charlie charlie@email.com
๐ก Note:
All emails are unique, so no rows are removed. The original DataFrame is returned unchanged.
Visualization
Tap to expand
Understanding the Visualization
1
Start with customer database
Begin with a DataFrame containing customer records with potential duplicates
2
Track seen emails
Use a hash set to remember which email addresses we've already encountered
3
Process each row
For each customer record, check if their email is already in our seen set
4
Keep unique records
If email is new, add it to seen set and include record in result
5
Skip duplicates
If email already exists, skip this record to eliminate the duplicate
Key Takeaway
๐ฏ Key Insight: Hash-based deduplication achieves O(n) time complexity by using constant-time lookups to track previously seen email addresses, making it optimal for large datasets.
Time & Space Complexity
Time Complexity
O(n)
Single pass through the data with hash table lookups
โ Linear Growth
Space Complexity
O(k)
Where k is the number of unique emails (typically much less than n)
โ Linear Space
Constraints
- 1 โค customers.length โค 104
- customer_id is a positive integer
- name and email are non-empty strings
- Email addresses are case-sensitive
- The first occurrence of each unique email should be preserved
๐ก
Explanation
AI Ready
๐ก Suggestion
Tab
to accept
Esc
to dismiss
// Output will appear here after running code