Drop Duplicate Rows - Problem

You work as a data analyst for an e-commerce company and have received a customer database that contains duplicate entries. Your task is to clean the data by removing duplicate rows based on email addresses.

Given a DataFrame customers with columns:

  • customer_id (int) - Unique identifier for each customer record
  • name (object) - Customer's name
  • email (object) - Customer's email address

Goal: Remove all duplicate rows where the same email appears multiple times, keeping only the first occurrence of each unique email.

This is a common data preprocessing task in machine learning pipelines and business analytics where data quality is crucial for accurate insights.
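
One direct way to solve this in pandas is DataFrame.drop_duplicates, which implements exactly this keep-the-first-occurrence behavior. A minimal sketch (the wrapper name drop_duplicate_emails is illustrative; the reset_index call matches the sequential row index shown in the examples below):

solution.py – pandas-based deduplication
import pandas as pd

def drop_duplicate_emails(customers: pd.DataFrame) -> pd.DataFrame:
    # subset='email': only the email column determines what counts as a duplicate.
    # keep='first': retain the earliest row for each unique email.
    # reset_index(drop=True): renumber rows 0..k-1, as in the expected outputs.
    return (customers
            .drop_duplicates(subset='email', keep='first')
            .reset_index(drop=True))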

Input & Output

example_1.py – Basic duplicate removal
$ Input: customers = pd.DataFrame({ 'customer_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Alice'], 'email': ['alice@email.com', 'bob@email.com', 'alice@email.com'] })
› Output:
   customer_id   name            email
0            1  Alice  alice@email.com
1            2    Bob    bob@email.com
💡 Note: The third row is removed because 'alice@email.com' already appeared in the first row. We keep the first occurrence and remove subsequent duplicates.
example_2.py – Multiple duplicates
$ Input: customers = pd.DataFrame({ 'customer_id': [1, 2, 3, 4, 5], 'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'], 'email': ['alice@email.com', 'bob@email.com', 'alice@email.com', 'charlie@email.com', 'bob@email.com'] })
› Output:
   customer_id     name              email
0            1    Alice    alice@email.com
1            2      Bob      bob@email.com
2            4  Charlie  charlie@email.com
💡 Note: Rows 3 and 5 are removed as duplicates. Row 3 has the same email as row 1, and row 5 has the same email as row 2. Only the first occurrence of each unique email is retained.
example_3.py – No duplicates edge case
$ Input: customers = pd.DataFrame({ 'customer_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie'], 'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com'] })
› Output:
   customer_id     name              email
0            1    Alice    alice@email.com
1            2      Bob      bob@email.com
2            3  Charlie  charlie@email.com
💡 Note: All emails are unique, so no rows are removed. The original DataFrame is returned unchanged.
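
To check any of these examples against the sketch above, construct the DataFrame and print the result (shown here for example_2.py, using the drop_duplicate_emails wrapper defined earlier):

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob'],
    'email': ['alice@email.com', 'bob@email.com', 'alice@email.com',
              'charlie@email.com', 'bob@email.com']
})
print(drop_duplicate_emails(customers))
#    customer_id     name              email
# 0            1    Alice    alice@email.com
# 1            2      Bob      bob@email.com
# 2            4  Charlie  charlie@email.com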

Visualization

[Figure: Customer Database Deduplication Process. The original five-row database (1 Alice, 2 Bob, 3 Alice, 4 Charlie, 5 Bob) is scanned against a hash set of seen emails; rows 3 and 5 are flagged as duplicates and removed, leaving a clean database of rows 1, 2, and 4. ✨ Result: 3 unique customers instead of 5 (a 40% reduction).]
Understanding the Visualization

1. Start with customer database – Begin with a DataFrame containing customer records with potential duplicates.
2. Track seen emails – Use a hash set to remember which email addresses have already been encountered.
3. Process each row – For each customer record, check whether its email is already in the seen set.
4. Keep unique records – If the email is new, add it to the seen set and include the record in the result.
5. Skip duplicates – If the email is already in the set, skip the record to eliminate the duplicate.
Key Takeaway
🎯 Key Insight: Hash-based deduplication achieves O(n) time complexity by using constant-time lookups to track previously seen email addresses, making it optimal for large datasets. A sketch of this approach appears below.
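
The steps above translate into a short pure-Python pass over the rows. This is a sketch of the hash-set approach the visualization describes, not how pandas implements drop_duplicates internally (that is vectorized), although the asymptotics are the same:

import pandas as pd

def drop_duplicate_emails_manual(customers: pd.DataFrame) -> pd.DataFrame:
    seen = set()      # emails encountered so far; O(1) average membership checks
    keep_rows = []    # positional indices of first occurrences
    for pos, email in enumerate(customers['email']):
        if email not in seen:       # first time seeing this email
            seen.add(email)
            keep_rows.append(pos)   # keep the first occurrence
        # otherwise: duplicate, skip the row
    return customers.iloc[keep_rows].reset_index(drop=True)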

Time & Space Complexity

⏱️ Time Complexity: O(n)
A single pass through the data with constant-time hash-table lookups.

Space Complexity: O(k)
Where k is the number of unique emails. Since k ≤ n, this is linear in the worst case (all emails unique) and typically much smaller.
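
The linear-growth claim is easy to check empirically. A quick benchmark sketch (the dataset sizes and the roughly 50% duplicate rate are arbitrary choices for illustration):

import time
import numpy as np
import pandas as pd

for n in (100_000, 200_000, 400_000):
    # Draw emails from a pool of n/2 values so about half the rows are duplicates.
    emails = np.random.randint(0, n // 2, size=n).astype(str)
    df = pd.DataFrame({'customer_id': range(n), 'name': 'x', 'email': emails})
    start = time.perf_counter()
    df.drop_duplicates(subset='email', keep='first')
    print(f"n={n}: {time.perf_counter() - start:.4f}s")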

Constraints

  • 1 ≤ customers.length ≤ 10⁴
  • customer_id is a positive integer
  • name and email are non-empty strings
  • Email addresses are case-sensitive (see the note after this list)
  • The first occurrence of each unique email should be preserved
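
Because emails are case-sensitive here, 'Alice@email.com' and 'alice@email.com' count as two distinct addresses and neither is dropped. If a case-insensitive variant were ever wanted (an assumption beyond this problem's constraints), one way is to deduplicate on a normalized key:

# Hypothetical case-insensitive variant; NOT required by the constraints above.
normalized = customers['email'].str.lower()
deduped = customers.loc[~normalized.duplicated(keep='first')].reset_index(drop=True)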