Check Duplicate in a Stream of Strings
A stream of strings is a sequential flow of string data where each element represents an individual string. In Python, we can efficiently check for duplicates in a stream using data structures like sets or dictionaries to track previously seen strings.
Using a Set to Track Duplicates
The most efficient approach uses a set to store unique strings we've already encountered. Sets provide O(1) average-case lookup time, making duplicate detection fast −
Example
def check_duplicate_in_stream(strings):
    seen = set()
    results = []
    for string in strings:
        if string in seen:
            results.append(f"'{string}': Duplicate found")
        else:
            seen.add(string)
            results.append(f"'{string}': Unique element")
    return results

# Test with a stream of strings
stream = ["A", "B", "C", "C", "A", "D"]
results = check_duplicate_in_stream(stream)
for result in results:
    print(result)
Output
'A': Unique element
'B': Unique element
'C': Unique element
'C': Duplicate found
'A': Duplicate found
'D': Unique element
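If you only need a yes/no answer rather than a label for every element, the same set-based idea can stop as soon as the first repeat appears. The sketch below is a minimal variant of the function above; the name first_duplicate is our own choice, not part of any library.

```python
def first_duplicate(strings):
    # Track strings we have already seen; stop at the first repeat
    seen = set()
    for string in strings:
        if string in seen:
            return string  # earliest duplicate in the stream
        seen.add(string)
    return None  # no duplicates found

print(first_duplicate(["A", "B", "C", "C", "A", "D"]))  # C
print(first_duplicate(["X", "Y", "Z"]))                 # None
```

Stopping early means the rest of the stream is never consumed, which can matter when elements are expensive to produce.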
Using a Dictionary to Count Occurrences
If you need to track how many times each string appears, use a dictionary to maintain counts −
def check_duplicates_with_count(strings):
    counts = {}
    results = []
    for string in strings:
        if string in counts:
            counts[string] += 1
            results.append(f"'{string}': Duplicate (appears {counts[string]} times)")
        else:
            counts[string] = 1
            results.append(f"'{string}': First occurrence")
    return results

# Test with the same stream
stream = ["A", "B", "C", "C", "A", "D"]
results = check_duplicates_with_count(stream)
for result in results:
    print(result)
Output
'A': First occurrence
'B': First occurrence
'C': First occurrence
'C': Duplicate (appears 2 times)
'A': Duplicate (appears 2 times)
'D': First occurrence
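When the whole stream is already available in memory, the standard library's collections.Counter builds the same counts in a single call. This is a batch alternative rather than a true element-by-element check, so it suits post-processing rather than live streams.

```python
from collections import Counter

stream = ["A", "B", "C", "C", "A", "D"]
counts = Counter(stream)

# Strings that appeared more than once, in first-seen order
duplicates = [s for s, c in counts.items() if c > 1]
print(duplicates)  # ['A', 'C']
```

Counter is a dict subclass, so it preserves the order in which strings were first encountered.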
Generator Function for Large Streams
For memory-efficient processing of large streams, use a generator function that yields results one at a time −
def duplicate_detector(stream):
    seen = set()
    for string in stream:
        if string in seen:
            yield f"'{string}': Duplicate"
        else:
            seen.add(string)
            yield f"'{string}': Unique"

# Process stream efficiently
stream = ["hello", "world", "hello", "python", "world"]
for result in duplicate_detector(stream):
    print(result)
Output
'hello': Unique
'world': Unique
'hello': Duplicate
'python': Unique
'world': Duplicate
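Because the generator pulls one string at a time, it also works on streams that are never fully materialized in memory. The sketch below feeds it an unbounded source and reads only the first four results with itertools.islice; the infinite_stream helper is a hypothetical stand-in for a real data source such as a socket or log tail.

```python
from itertools import islice

def duplicate_detector(stream):
    seen = set()
    for string in stream:
        if string in seen:
            yield f"'{string}': Duplicate"
        else:
            seen.add(string)
            yield f"'{string}': Unique"

def infinite_stream():
    # Hypothetical unbounded source that cycles through three strings
    while True:
        yield "a"
        yield "b"
        yield "a"

# Only the first four results are ever computed
for result in islice(duplicate_detector(infinite_stream()), 4):
    print(result)
```

The loop terminates even though the source never does, because islice stops requesting values after four items.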
Comparison of Methods
| Method | Time Complexity | Space Complexity | Best For |
|---|---|---|---|
| Set-based | O(n) | O(k) | Simple duplicate detection |
| Dictionary counts | O(n) | O(k) | Tracking occurrence counts |
| Generator | O(n) | O(k) | Memory-efficient processing |
Note: n = total strings, k = unique strings
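The O(k) space figures in the table reflect that every method stores only the distinct strings, not the whole stream. A quick sketch makes this concrete: a stream of 3000 elements with only 3 distinct values needs a set of size 3.

```python
# n = 3000 total strings, but only k = 3 unique values
stream = ["A", "B", "C"] * 1000

seen = set()
for string in stream:
    seen.add(string)

# The set holds only the k = 3 unique strings, not all 3000
print(len(stream), len(seen))  # 3000 3
```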
Conclusion
Use sets for efficient duplicate detection in string streams; their O(1) average-case lookups keep the whole scan linear in the stream length. For large or unbounded streams, generator functions provide memory-efficient processing while maintaining the same time complexity.
