Check Duplicate in a Stream of Strings
A stream of strings is a sequential flow of string data where each element represents an individual string. In Python, we can efficiently check for duplicates in a stream using data structures like sets or dictionaries to track previously seen strings.
Using a Set to Track Duplicates
The most efficient approach uses a set to store unique strings we've already encountered. Sets provide O(1) average-case lookup time, making duplicate detection fast −
Example
def check_duplicate_in_stream(strings):
    seen = set()
    results = []
    for string in strings:
        if string in seen:
            results.append(f"'{string}': Duplicate found")
        else:
            seen.add(string)
            results.append(f"'{string}': Unique element")
    return results

# Test with a stream of strings
stream = ["A", "B", "C", "C", "A", "D"]
results = check_duplicate_in_stream(stream)
for result in results:
    print(result)
Output
'A': Unique element
'B': Unique element
'C': Unique element
'C': Duplicate found
'A': Duplicate found
'D': Unique element
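If you only need a yes/no answer rather than a label for every element, the same set-based idea can stop as soon as the first repeat appears. The sketch below is a minimal variant of the function above; the name first_duplicate is our own choice, not part of any library.

```python
def first_duplicate(strings):
    # Track strings we have already seen; stop at the first repeat
    seen = set()
    for string in strings:
        if string in seen:
            return string  # earliest duplicate in the stream
        seen.add(string)
    return None  # no duplicates found

print(first_duplicate(["A", "B", "C", "C", "A", "D"]))  # C
print(first_duplicate(["X", "Y", "Z"]))                 # None
```

Stopping early means the rest of the stream is never consumed, which can matter when elements are expensive to produce.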
Using a Dictionary to Count Occurrences
If you need to track how many times each string appears, use a dictionary to maintain counts −
def check_duplicates_with_count(strings):
    counts = {}
    results = []
    for string in strings:
        if string in counts:
            counts[string] += 1
            results.append(f"'{string}': Duplicate (appears {counts[string]} times)")
        else:
            counts[string] = 1
            results.append(f"'{string}': First occurrence")
    return results

# Test with the same stream
stream = ["A", "B", "C", "C", "A", "D"]
results = check_duplicates_with_count(stream)
for result in results:
    print(result)
Output
'A': First occurrence
'B': First occurrence
'C': First occurrence
'C': Duplicate (appears 2 times)
'A': Duplicate (appears 2 times)
'D': First occurrence
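When the whole stream is already available in memory, the standard library's collections.Counter builds the same counts in a single call. This is a batch alternative rather than a true element-by-element check, so it suits post-processing rather than live streams.

```python
from collections import Counter

stream = ["A", "B", "C", "C", "A", "D"]
counts = Counter(stream)

# Strings that appeared more than once, in first-seen order
duplicates = [s for s, c in counts.items() if c > 1]
print(duplicates)  # ['A', 'C']
```

Counter is a dict subclass, so it preserves the order in which strings were first encountered.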
Generator Function for Large Streams
For memory-efficient processing of large streams, use a generator function that yields results one at a time −
def duplicate_detector(stream):
    seen = set()
    for string in stream:
        if string in seen:
            yield f"'{string}': Duplicate"
        else:
            seen.add(string)
            yield f"'{string}': Unique"

# Process stream efficiently
stream = ["hello", "world", "hello", "python", "world"]
for result in duplicate_detector(stream):
    print(result)
Output
'hello': Unique
'world': Unique
'hello': Duplicate
'python': Unique
'world': Duplicate
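Because the generator pulls one string at a time, it also works on streams that are never fully materialized in memory. The sketch below feeds it an unbounded source and reads only the first four results with itertools.islice; the infinite_stream helper is a hypothetical stand-in for a real data source such as a socket or log tail.

```python
from itertools import islice

def duplicate_detector(stream):
    seen = set()
    for string in stream:
        if string in seen:
            yield f"'{string}': Duplicate"
        else:
            seen.add(string)
            yield f"'{string}': Unique"

def infinite_stream():
    # Hypothetical unbounded source that cycles through three strings
    while True:
        yield "a"
        yield "b"
        yield "a"

# Only the first four results are ever computed
for result in islice(duplicate_detector(infinite_stream()), 4):
    print(result)
```

The loop terminates even though the source never does, because islice stops requesting values after four items.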
Comparison of Methods
| Method | Time Complexity | Space Complexity | Best For |
|---|---|---|---|
| Set-based | O(n) | O(k) | Simple duplicate detection |
| Dictionary counts | O(n) | O(k) | Tracking occurrence counts |
| Generator | O(n) | O(k) | Memory-efficient processing |
Note: n = total strings, k = unique strings
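The O(k) space figures in the table reflect that every method stores only the distinct strings, not the whole stream. A quick sketch makes this concrete: a stream of 3000 elements with only 3 distinct values needs a set of size 3.

```python
# n = 3000 total strings, but only k = 3 unique values
stream = ["A", "B", "C"] * 1000

seen = set()
for string in stream:
    seen.add(string)

# The set holds only the k = 3 unique strings, not all 3000
print(len(stream), len(seen))  # 3000 3
```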
Conclusion
Use sets for efficient duplicate detection in string streams; their O(1) average-case lookups keep the whole scan linear in the stream length. For large or unbounded streams, generator functions provide memory-efficient processing while maintaining the same time complexity.
