Python Program to Find the Number of Unique Words in a Text File

Counting the unique words in a text file is a common text-processing task. Python offers several ways to do it. The simplest rely on sets and dictionaries, which store each distinct word only once and check membership in constant time on average.

Sample Text File Content

For demonstration, we'll create a sample text file with repeated content:

# Create a sample text file
sample_text = """This is a new file.
This is made for testing purposes only.
There are four lines in this file.
There are four lines in this file.
There are four lines in this file.
There are four lines in this file.
Oh! No.. there are seven lines now."""

# Write to file
with open('sample.txt', 'w') as f:
    f.write(sample_text)

print("Sample file created successfully!")

Output:

Sample file created successfully!

Method 1: Using Python Sets

A set stores each element only once, so converting the word list to a set removes duplicates in a single pass. This is the most direct approach when only the count matters:

# Open and read the text file
with open('sample.txt', 'r') as file:
    content = file.read().lower()

# Split content into words
words = content.split()

print("Total words:", len(words))
print("First 10 words:", words[:10])

# Convert to set to get unique words
unique_words = set(words)

print("\nUnique words:", sorted(unique_words))
print("Number of unique words:", len(unique_words))

Output:

Total words: 47
First 10 words: ['this', 'is', 'a', 'new', 'file.', 'this', 'is', 'made', 'for', 'testing']

Unique words: ['a', 'are', 'file.', 'for', 'four', 'in', 'is', 'lines', 'made', 'new', 'no..', 'now.', 'oh!', 'only.', 'purposes', 'seven', 'testing', 'there', 'this']
Number of unique words: 19
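
For large files, reading everything into memory with read() may be undesirable. The following is a minimal sketch of the same set-based idea applied line by line. The function name count_unique_words and the demo file are hypothetical, chosen only for illustration, and the sketch assumes UTF-8 text.

```python
import tempfile
from pathlib import Path


def count_unique_words(path, encoding="utf-8"):
    """Return the number of distinct lowercased, whitespace-separated words."""
    unique_words = set()
    with open(path, "r", encoding=encoding) as file:
        for line in file:  # the file object yields one line at a time
            unique_words.update(line.lower().split())
    return len(unique_words)


# Demo on a small temporary file so the sketch does not depend on sample.txt
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.txt"
    demo_path.write_text("one two two\nthree One\n", encoding="utf-8")
    demo_count = count_unique_words(demo_path)

print(demo_count)  # 3 ('one', 'two', 'three')
```

Only the set of distinct words is kept in memory, so the footprint grows with the vocabulary rather than with the file size.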

Method 2: Using Dictionary Keys

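Dictionary keys are unique and, since Python 3.7, keep their insertion order. dict.fromkeys() therefore removes duplicates while preserving the order in which words first appear: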

# Open and read the text file
with open('sample.txt', 'r') as file:
    content = file.read().lower()

# Split content into words
words = content.split()

print("Original word count:", len(words))

# Use dict.fromkeys() to remove duplicates while preserving order
unique_words_list = list(dict.fromkeys(words))

print("Unique words (order preserved):", unique_words_list)
print("Number of unique words:", len(unique_words_list))

Output:

Original word count: 47
Unique words (order preserved): ['this', 'is', 'a', 'new', 'file.', 'made', 'for', 'testing', 'purposes', 'only.', 'there', 'are', 'four', 'lines', 'in', 'oh!', 'no..', 'seven', 'now.']
Number of unique words: 19
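
To make the difference from the set approach concrete, here is a small sketch on a made-up sentence (not the sample file). It assumes Python 3.7 or later, where dictionaries are guaranteed to keep insertion order.

```python
# A made-up sentence used only for illustration
words = "to be or not to be that is the question".split()

# dict.fromkeys() keeps the first occurrence of each word, in reading order
ordered_unique = list(dict.fromkeys(words))
print(ordered_unique)
# ['to', 'be', 'or', 'not', 'that', 'is', 'the', 'question']

# A set holds the same words, but its iteration order is arbitrary,
# so it has to be sorted (or otherwise ordered) for stable display
alphabetical_unique = sorted(set(words))
print(alphabetical_unique)
# ['be', 'is', 'not', 'or', 'question', 'that', 'the', 'to']
```

Both containers hold the same eight words. Only the dictionary remembers where each word first appeared.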

Method 3: Word Frequency Counter

For detailed analysis, we can count the frequency of each unique word:

from collections import Counter

# Open and read the text file
with open('sample.txt', 'r') as file:
    content = file.read().lower()

# Split content into words
words = content.split()

# Count word frequencies
word_counter = Counter(words)

print("Word frequencies:")
for word, count in word_counter.most_common(5):
    print(f"'{word}': {count}")

print(f"\nTotal unique words: {len(word_counter)}")
print(f"Most common word: '{word_counter.most_common(1)[0][0]}' appears {word_counter.most_common(1)[0][1]} times")

Output:

Word frequencies:
'this': 6
'file.': 5
'there': 5
'are': 5
'lines': 5

Total unique words: 19
Most common word: 'this' appears 6 times
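
Two details of Counter are easy to miss: most_common() lists words with equal counts in the order they were first seen, and the counts can be filtered like any dictionary. The sketch below illustrates both on a made-up word list. The variable names are illustrative only.

```python
from collections import Counter

# A made-up word list used only for illustration
words = "there are four lines there are seven lines now".split()
counts = Counter(words)

# Ties keep first-seen order: 'there', 'are' and 'lines' all occur twice
top_three = counts.most_common(3)
print(top_three)  # [('there', 2), ('are', 2), ('lines', 2)]

# Words that occur exactly once, in first-seen order
singletons = [word for word, count in counts.items() if count == 1]
print(singletons)  # ['four', 'seven', 'now']

# The number of unique words is simply the number of keys
print(len(counts))  # 6
```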

Comparison of Methods

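Method           | Preserves order?                  | Memory usage | Best for
Sets             | No                                | Low          | Simple unique count
dict.fromkeys()  | Yes (Python 3.7+)                 | Medium       | Order-preserved uniqueness
Counter          | Yes, first-seen order (Python 3.7+) | Higher     | Frequency analysis

The memory column is a rough guide. Actual usage depends on the Python version and the number of distinct words.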
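
The memory differences between these containers are small and vary by Python version, so it can be worth measuring them on your own data. The following is a rough sketch using sys.getsizeof, which reports the size of the container itself and not of the strings it references. The word list is made up for illustration, and no particular ordering of the results is assumed.

```python
import sys
from collections import Counter

# A made-up word list: 7 distinct words, each repeated 4 times
words = ("there are four lines in this file " * 4).split()

containers = {
    "set": set(words),
    "dict.fromkeys": dict.fromkeys(words),
    "Counter": Counter(words),
}

container_sizes = {}
for name, container in containers.items():
    # getsizeof measures only the container object, not its string keys
    container_sizes[name] = sys.getsizeof(container)
    print(f"{name}: {len(container)} unique words, {container_sizes[name]} bytes")
```

All three containers report the same number of unique words. The byte counts printed will differ between interpreters and input sizes.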

Handling Punctuation

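The outputs above show that split() leaves punctuation attached, so 'file.' and 'file' would be counted as different words. For more accurate counting, remove punctuation before splitting: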

import string

# Open and read the text file
with open('sample.txt', 'r') as file:
    content = file.read().lower()

# Remove punctuation
translator = str.maketrans('', '', string.punctuation)
clean_content = content.translate(translator)

# Split into words
clean_words = clean_content.split()
unique_clean_words = set(clean_words)

print("Unique words (without punctuation):", sorted(unique_clean_words))
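print("Count without punctuation:", len(unique_clean_words))

Output:

Unique words (without punctuation): ['a', 'are', 'file', 'for', 'four', 'in', 'is', 'lines', 'made', 'new', 'no', 'now', 'oh', 'only', 'purposes', 'seven', 'testing', 'there', 'this']
Count without punctuation: 19

The count stays at 19 for this file because no word appears both with and without attached punctuation. In text where it does (for example 'file' and 'file.'), removing punctuation merges the variants and lowers the count.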
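
Deleting every punctuation character also alters words that contain apostrophes or hyphens. As a hedged alternative, the sketch below compares three treatments on a made-up sentence: no cleaning, deleting all punctuation with translate(), and stripping punctuation only from the ends of each word with str.strip(). Which one is appropriate depends on the text being analyzed.

```python
import string

# A made-up sentence used only for illustration
text = "The file. This file isn't the well-known file!".lower()

# 1. No cleaning: 'file.', 'file' and 'file!' count as three different words
raw_unique = set(text.split())

# 2. Delete every punctuation character: "isn't" becomes 'isnt'
translator = str.maketrans("", "", string.punctuation)
deleted_unique = set(text.translate(translator).split())

# 3. Strip punctuation only from the ends of each word; drop empty results
stripped = (word.strip(string.punctuation) for word in text.split())
stripped_unique = {word for word in stripped if word}

print(len(raw_unique))          # 7
print(sorted(deleted_unique))   # ['file', 'isnt', 'the', 'this', 'wellknown']
print(sorted(stripped_unique))  # ['file', "isn't", 'the', 'this', 'well-known']
```

Both cleaning strategies reduce the count from 7 to 5 here. Only the strip-based one keeps "isn't" and "well-known" intact.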

Conclusion

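Use a set when you only need the number of unique words. It is the simplest option and does no extra bookkeeping. Use dict.fromkeys() when you need to preserve the order in which words first appear. For detailed text analysis with frequency information, Counter from collections is the natural choice. In every case, normalize case and remove punctuation first if variants such as 'File', 'file' and 'file.' should count as one word.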

Updated on: 2026-03-27T07:21:10+05:30
