Drop Duplicate Rows - Problem

Database Easy

You have a DataFrame called customers with the following structure:

Column Name	Type
customer_id	int
name	object
email	object

There are some duplicate rows in the DataFrame based on the email column.

Write a solution to remove these duplicate rows and keep only the first occurrence.

Return the cleaned DataFrame in the same format.

Input & Output

Example 1 — Basic Duplicate Removal

$ Input: customers = [{"customer_id": 1, "name": "Ella", "email": "emily@example.com"}, {"customer_id": 2, "name": "David", "email": "michael@example.com"}, {"customer_id": 3, "name": "Zachary", "email": "sarah@example.com"}, {"customer_id": 4, "name": "Alice", "email": "emily@example.com"}]

› Output: [{"customer_id": 1, "name": "Ella", "email": "emily@example.com"}, {"customer_id": 2, "name": "David", "email": "michael@example.com"}, {"customer_id": 3, "name": "Zachary", "email": "sarah@example.com"}]

💡 Note: Row with customer_id=4 (Alice) is removed because emily@example.com already exists in row with customer_id=1 (Ella). We keep the first occurrence.

Example 2 — Multiple Duplicates

$ Input: customers = [{"customer_id": 1, "name": "John", "email": "john@email.com"}, {"customer_id": 2, "name": "Bob", "email": "bob@email.com"}, {"customer_id": 3, "name": "Johnny", "email": "john@email.com"}, {"customer_id": 4, "name": "Robert", "email": "bob@email.com"}]

› Output: [{"customer_id": 1, "name": "John", "email": "john@email.com"}, {"customer_id": 2, "name": "Bob", "email": "bob@email.com"}]

💡 Note: Both john@email.com and bob@email.com have duplicates. We keep only the first occurrence of each email: John (ID=1) and Bob (ID=2).

Example 3 — No Duplicates

$ Input: customers = [{"customer_id": 1, "name": "Alice", "email": "alice@email.com"}, {"customer_id": 2, "name": "Bob", "email": "bob@email.com"}]

› Output: [{"customer_id": 1, "name": "Alice", "email": "alice@email.com"}, {"customer_id": 2, "name": "Bob", "email": "bob@email.com"}]

💡 Note: All emails are unique, so no rows are removed. The DataFrame remains unchanged.

Constraints

1 ≤ customers.length ≤ 10⁴
customer_id, name, and email are non-empty
All customer_id values are unique

Asked in

Netflix 25, Spotify 20, Uber 15

The key insight is to track the emails already seen in a hash set, which gives O(1) expected lookup per row. Both pandas' drop_duplicates() and manual hash-set filtering then run in O(n) expected time and O(n) space.

Common Approaches

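Manual Duplicate Detection

⏱️ Time: O(n²) Space: O(n)

Walk the rows in order and compare each row's email against the emails already kept, retaining a row only if its email has not appeared before. When the seen emails are stored in a plain list and scanned linearly, as in the C code below, the total cost is O(n²); storing them in a hash set brings the expected cost down to O(n).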

Pandas drop_duplicates()

⏱️ Time: O(n) Space: O(n)

Call customers.drop_duplicates(subset='email', keep='first'). The subset argument restricts the duplicate check to the email column, and keep='first' (the default) retains the first occurrence of each email and drops the later ones.

Manual Duplicate Detection — Algorithm Steps

Step 1: Create an empty set to track seen emails
Step 2: Iterate through the DataFrame rows in order
Step 3: For each row, check whether its email is already in the set
Step 4: If it is not, add the row to the result and mark the email as seen
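
The steps above can be sketched in C with a small hash set, so that step 3 is an O(1) expected lookup rather than a scan over every earlier email. This is a minimal sketch under assumptions, not the page's reference solution: the Row type, hash_email, drop_duplicate_emails, and TABLE_SIZE are hypothetical names introduced here, FNV-1a is just one common hash choice, and emails are assumed to fit in 99 characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical row type for this sketch; the field sizes are assumptions. */
typedef struct {
    int customer_id;
    char name[100];
    char email[100];
} Row;

/* 32-bit FNV-1a string hash (an illustrative choice, not what pandas uses). */
static uint32_t hash_email(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Power of two above 2 * 10^4, so the load factor stays under 0.5. */
#define TABLE_SIZE 32768

/* Copies the first row for each distinct email from rows[0..n) into out,
   preserving input order, and returns the number of rows kept.
   Requires n < TABLE_SIZE so probing always reaches an empty slot. */
size_t drop_duplicate_emails(const Row *rows, size_t n, Row *out) {
    /* Each slot stores (row index + 1); 0 marks an empty slot. */
    static int slots[TABLE_SIZE];
    size_t kept = 0;

    assert(n < TABLE_SIZE);
    memset(slots, 0, sizeof slots);               /* Step 1: empty set */

    for (size_t i = 0; i < n; i++) {              /* Step 2: iterate rows */
        size_t pos = hash_email(rows[i].email) & (TABLE_SIZE - 1);
        int seen = 0;

        /* Step 3: probe linearly until the email or an empty slot is found */
        while (slots[pos] != 0) {
            if (strcmp(rows[slots[pos] - 1].email, rows[i].email) == 0) {
                seen = 1;
                break;
            }
            pos = (pos + 1) & (TABLE_SIZE - 1);
        }

        if (!seen) {                              /* Step 4: keep, mark seen */
            slots[pos] = (int)i + 1;
            out[kept++] = rows[i];
        }
    }
    return kept;
}
```

Storing row indices in the table instead of copies of the email strings keeps the set small. The fixed table size is a simplification that relies on the stated bound of 10⁴ rows.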

Step-by-Step Walkthrough

Initialize

Create empty set for seen emails

Check Each Row

Compare email against seen set

Keep First

Add to result if email not seen before

Code -

solution.c — C

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int customer_id;
    char name[100];
    char email[100];
} Customer;

// Sized for the stated constraint of up to 10^4 rows; a capacity of 1000 would overflow.
static Customer customers[10000];
static Customer result[10000];
static char seenEmails[10000][100];

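// Keeps the first row for each email. The seen list is scanned linearly,
// so the whole pass is O(n^2) in the worst case.
int solution(Customer* customers, int n, Customer* result) {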
    int seenCount = 0;
    int resultCount = 0;
    
    for (int i = 0; i < n; i++) {
        int found = 0;
        for (int j = 0; j < seenCount; j++) {
            if (strcmp(customers[i].email, seenEmails[j]) == 0) {
                found = 1;
                break;
            }
        }
        
        if (!found) {
            strcpy(seenEmails[seenCount], customers[i].email);
            seenCount++;
            result[resultCount] = customers[i];
            resultCount++;
        }
    }
    
    return resultCount;
}

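// Returns the string value for `key` in a flat JSON object, or NULL if absent.
// The result lives in a static buffer that the next call overwrites.
char* extractString(char* json, char* key) {
    static char buffer[100];  // same size as Customer.name / Customer.email
    char searchKey[50];
    snprintf(searchKey, sizeof(searchKey), "\"%s\": \"", key);

    char* start = strstr(json, searchKey);
    if (!start) return NULL;

    start += strlen(searchKey);
    char* end = strchr(start, '"');
    if (!end) return NULL;

    // Truncate over-long values instead of overflowing the buffer
    size_t len = (size_t)(end - start);
    if (len >= sizeof(buffer)) len = sizeof(buffer) - 1;
    memcpy(buffer, start, len);
    buffer[len] = '\0';
    return buffer;
}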

int extractInt(char* json, char* key) {
    char searchKey[50];
    sprintf(searchKey, "\"%s\": ", key);
    
    char* start = strstr(json, searchKey);
    if (!start) return 0;
    
    start += strlen(searchKey);
    return atoi(start);
}

int main() {
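    // Static so a large input does not overflow the stack, and zero-initialized
    // so `input` is an empty string if nothing is read. The 4 MiB size is an
    // assumed bound for 10^4 records, not a limit stated in the problem.
    static char input[1 << 22];
    static char line[1 << 22];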
    
    // Read the full input line
    if (fgets(line, sizeof(line), stdin)) {
        line[strcspn(line, "\n")] = '\0';
        
        // Extract the JSON array from the input format
        char* start = strchr(line, '[');
        if (start) {
            strcpy(input, start);
        } else {
            strcpy(input, line);
        }
    }
    
    int n = 0;
    char* ptr = input;
    
    // Count objects by counting opening braces
    while (*ptr) {
        if (*ptr == '{') n++;
        ptr++;
    }
    
    // Parse each JSON object
    ptr = input;
    int customerIndex = 0;
    
    while (*ptr && customerIndex < n) {
        // Find next object
        char* objStart = strchr(ptr, '{');
        if (!objStart) break;
        
        char* objEnd = strchr(objStart, '}');
        if (!objEnd) break;
        
        // Extract object
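        int objLen = (int)(objEnd - objStart) + 1;
        char objStr[1000];
        // Clamp oversized objects rather than writing past objStr
        if (objLen >= (int)sizeof(objStr)) objLen = (int)sizeof(objStr) - 1;
        strncpy(objStr, objStart, objLen);
        objStr[objLen] = '\0';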
        
        // Parse fields
        customers[customerIndex].customer_id = extractInt(objStr, "customer_id");
        
        char* nameStr = extractString(objStr, "name");
        if (nameStr) strcpy(customers[customerIndex].name, nameStr);
        
        char* emailStr = extractString(objStr, "email");
        if (emailStr) strcpy(customers[customerIndex].email, emailStr);
        
        customerIndex++;
        ptr = objEnd + 1;
    }
    
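    // Use the number of rows actually parsed, which can be less than n
    // if parsing stopped early.
    int resultCount = solution(customers, customerIndex, result);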
    
    // Output JSON
    printf("[");
    for (int i = 0; i < resultCount; i++) {
        if (i > 0) printf(",");
        printf("{\"customer_id\":%d,\"name\":\"%s\",\"email\":\"%s\"}", 
               result[i].customer_id, result[i].name, result[i].email);
    }
    printf("]\n");
    
    return 0;
}

Time & Space Complexity

Time Complexity

⏱️

O(n²)

For each row, we potentially compare its email against every email kept so far, which adds up to n(n−1)/2 comparisons when all emails are distinct.

⚠ Quadratic Growth
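
To make the quadratic growth concrete, the sketch below counts the email comparisons a linear-scan seen list performs in the worst case, where every email is distinct. The function name worst_case_comparisons is hypothetical and introduced only for this illustration.

```c
#include <stddef.h>

/* Counts the email comparisons made by a linear-scan "seen" list when all
   n emails are distinct: row i is compared against the i emails kept so far. */
unsigned long long worst_case_comparisons(size_t n) {
    unsigned long long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += i;  /* row i scans i earlier entries without finding a match */
    }
    return total;    /* equals n * (n - 1) / 2 */
}
```

At the constraint's upper bound of n = 10⁴ this comes to 49,995,000 comparisons, whereas a hash set would perform about n lookups.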

Space Complexity

O(n)

A set of seen emails plus a new DataFrame for the results, each holding at most n entries

⚡ Linear Space
