Find the k most frequent words from a data set in Python

To find the k most frequent words in a data set, we can use Python's collections module. Its Counter class counts how often each word occurs in a list of words passed to it, and its most_common() method returns the requested number of most frequent words along with their counts.

Basic Approach Using Counter

In the example below, we take a paragraph, create a list of words using split(), and pass the list to Counter() to count word frequencies. Finally, most_common() returns the top k most frequent words:

from collections import Counter

word_set = "This is a series of strings to count " \
    "many words. They sometime hurt and words sometime inspire " \
    "Also sometime fewer words convey more meaning than a bag of words " \
    "Be careful what you speak or what you write or even what you think of."

# Create list of all the words in the string
word_list = word_set.split()

# Get the count of each word
word_count = Counter(word_list)

# Use most_common() method to get top 3 words
print(word_count.most_common(3))

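The output of the above code is:

[('sometime', 3), ('words', 3), ('what', 3)]

Note that split() separates on whitespace only, so "words." (with its trailing period) is counted separately from "words". Words with equal counts are returned in the order in which they were first encountered.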
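One way to merge tokens such as "words." and "words" is to strip punctuation from each token before counting. The following is a minimal sketch, assuming that removing leading and trailing punctuation and lowercasing is acceptable for your data; the shortened sample text and the variable names are illustrative only.

```python
from collections import Counter
import string

# Shortened sample text (illustrative, taken from the paragraph above)
text = "many words. They sometime hurt and words sometime inspire"

# Strip leading/trailing punctuation from each token and lowercase it
tokens = [w.strip(string.punctuation).lower() for w in text.split()]

# Drop tokens that consisted of punctuation only
tokens = [t for t in tokens if t]

counts = Counter(tokens)
print(counts.most_common(2))  # [('words', 2), ('sometime', 2)]
```

With this normalization, "words." and "words" are counted as the same word.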

Finding Top K Words from a File

For larger data sets stored in files, we can clean the text with a regular expression before counting. The example below uses a multi-line string to simulate the file content:

from collections import Counter
import re

# Sample text data (simulating file content)
text_data = """
Python is a powerful programming language. Python is easy to learn.
Many developers choose Python for data science and machine learning.
Python has excellent libraries for data analysis.
"""

# Clean and split text into words (remove punctuation)
words = re.findall(r'\b\w+\b', text_data.lower())

# Count word frequencies
word_counter = Counter(words)

# Find top 5 most frequent words
k = 5
top_k_words = word_counter.most_common(k)

print(f"Top {k} most frequent words:")
for word, count in top_k_words:
    print(f"{word}: {count}")

The output of the above code is:

Top 5 most frequent words:
python: 4
is: 2
for: 2
data: 2
a: 1
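The example above keeps the text in a string. For an actual file, the counter can be updated one line at a time so the whole file never has to be loaded into memory. The following is a minimal sketch, assuming a UTF-8 text file; the helper name top_k_words_in_file and the temporary sample file are illustrative and not part of the original example.

```python
from collections import Counter
import os
import re
import tempfile

def top_k_words_in_file(path, k, encoding="utf-8"):
    """Return the k most frequent words in the text file at `path` (hypothetical helper)."""
    counter = Counter()
    with open(path, encoding=encoding) as handle:
        for line in handle:
            # Lowercase the line, extract its words, and add them to the running counts
            counter.update(re.findall(r"\w+", line.lower()))
    return counter.most_common(k)

# Write a small temporary file purely so that the sketch is runnable
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Python is easy to learn.\nPython is powerful.\n")
    sample_path = tmp.name

try:
    top_two = top_k_words_in_file(sample_path, 2)
    print(top_two)  # [('python', 2), ('is', 2)]
finally:
    os.remove(sample_path)
```

Counter.update() adds to the existing counts instead of replacing them, which is what makes line-by-line processing possible.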

Using Dictionary Approach

Alternatively, we can count word frequencies manually with a dictionary and sort the result:

def find_k_frequent_words(text, k):
    # Convert to lowercase and split into words
    words = text.lower().split()
    
    # Count frequencies using dictionary
    word_freq = {}
    for word in words:
        word_freq[word] = word_freq.get(word, 0) + 1
    
    # Sort by frequency (descending)
    sorted_words = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)
    
    return sorted_words[:k]

# Example usage
sample_text = "apple banana apple orange banana apple grape orange apple"
result = find_k_frequent_words(sample_text, 3)

print("Top 3 frequent words:")
for word, freq in result:
    print(f"{word}: {freq}")

The output of the above code is:

Top 3 frequent words:
apple: 4
banana: 2
orange: 2
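The function above sorts every distinct word even though only k of them are needed, and it orders ties by first appearance. As an alternative sketch (a swapped-in technique, not the original function), heapq.nsmallest can select the top k entries directly, with ties on count broken alphabetically. The function name top_k_with_heap is hypothetical.

```python
import heapq

def top_k_with_heap(text, k):
    """Return the k most frequent words; ties on count are ordered alphabetically."""
    word_freq = {}
    for word in text.lower().split():
        word_freq[word] = word_freq.get(word, 0) + 1

    # Negating the count makes the most frequent word the "smallest" key,
    # and the word itself breaks ties in alphabetical order
    return heapq.nsmallest(k, word_freq.items(),
                           key=lambda item: (-item[1], item[0]))

sample_text = "apple banana apple orange banana apple grape orange apple"
print(top_k_with_heap(sample_text, 3))  # [('apple', 4), ('banana', 2), ('orange', 2)]
```

This avoids a full sort when k is much smaller than the number of distinct words, and the explicit tie-break makes the result independent of word order in the input.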

Comparison of Methods

Method          | Performance | Best For
Counter         | Fast        | Simple word counting
Dictionary      | Moderate    | Custom counting logic
Counter + regex | Fast        | Text preprocessing needed
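The performance ratings above are qualitative. To check them on your own data, a rough timing comparison can be sketched with the timeit module. The input size, repeat count, and function names below are arbitrary assumptions, and the timings will vary by machine and Python version, so no expected numbers are shown.

```python
from collections import Counter
import timeit

# Synthetic word list (illustrative): a short sentence repeated many times
words = "apple banana apple orange banana apple grape orange apple".split() * 1_000

def count_with_counter(tokens):
    return Counter(tokens)

def count_with_dict(tokens):
    freq = {}
    for token in tokens:
        freq[token] = freq.get(token, 0) + 1
    return freq

# Both approaches should produce identical counts
assert dict(count_with_counter(words)) == count_with_dict(words)

for func in (count_with_counter, count_with_dict):
    seconds = timeit.timeit(lambda: func(words), number=50)
    print(f"{func.__name__}: {seconds:.4f} s for 50 runs")
```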

Conclusion

The Counter class from the collections module is usually the simplest and most efficient way to find the k most frequent words. Use most_common(k) to get the top k results, and combine it with a regular expression when the text needs preprocessing such as lowercasing and punctuation removal.

Updated on: 2026-03-15T17:02:35+05:30
