Find the k most frequent words from data set in Python
To find the k most frequent words in a data set, we can use Python's collections module. Its Counter class counts how often each word occurs in a list of words passed to it, and its most_common() method returns the specified number of most frequent words.
Basic Approach Using Counter
In the example below, we take a paragraph, split it into a list of words with split(), and pass that list to Counter() to count word frequencies. Finally, most_common() returns the k most frequent words:
from collections import Counter
word_set = "This is a series of strings to count " \
"many words. They sometime hurt and words sometime inspire " \
"Also sometime fewer words convey more meaning than a bag of words " \
"Be careful what you speak or what you write or even what you think of."
# Create list of all the words in the string
word_list = word_set.split()
# Get the count of each word
word_count = Counter(word_list)
# Use most_common() method to get top 3 words
print(word_count.most_common(3))
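The output of the above code is:
[('sometime', 3), ('words', 3), ('what', 3)]
Note that split() separates on whitespace only, so "words." (with the trailing period) is counted as a different token from "words". Words with equal counts are listed in the order they were first encountered.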
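To make the behavior of most_common() concrete, here is a minimal sketch on a small made-up list (the fruit names are illustrative and not part of the example above). It relies on the documented behavior that elements with equal counts come back in the order they were first encountered, and that calling most_common() without an argument returns every element.

```python
from collections import Counter

# Hypothetical sample data, chosen so that two words tie for first place
words = ["pear", "fig", "pear", "kiwi", "fig", "plum"]
counts = Counter(words)

# Top 2 entries: "pear" and "fig" both occur twice; "pear" was seen first
top_two = counts.most_common(2)

# No argument: every word, from most to least frequent
all_ranked = counts.most_common()

print(top_two)     # [('pear', 2), ('fig', 2)]
print(all_ranked)  # [('pear', 2), ('fig', 2), ('kiwi', 1), ('plum', 1)]
```

If k is larger than the number of distinct words, most_common(k) simply returns all of them.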
Finding Top K Words from a File
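For larger datasets stored in files, the same idea applies once the text has been read. The example below uses a sample string in place of file content, lowercases it, and uses a regular expression to extract words without punctuation: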
from collections import Counter
import re
# Sample text data (simulating file content)
text_data = """
Python is a powerful programming language. Python is easy to learn.
Many developers choose Python for data science and machine learning.
Python has excellent libraries for data analysis.
"""
# Clean and split text into words (remove punctuation)
words = re.findall(r'\b\w+\b', text_data.lower())
# Count word frequencies
word_counter = Counter(words)
# Find top 5 most frequent words
k = 5
top_k_words = word_counter.most_common(k)
print(f"Top {k} most frequent words:")
for word, count in top_k_words:
    print(f"{word}: {count}")
The output of the above code is:
Top 5 most frequent words:
python: 4
is: 2
for: 2
data: 2
a: 1
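The example above works on a string held in memory. For an actual file, one possible approach is to read it line by line and feed each line's words to Counter.update(), so the whole file never has to be loaded at once. The sketch below is an illustration under assumptions: the function name top_k_words_in_file, the UTF-8 encoding default, and the temporary demo file are all hypothetical and not part of the original example.

```python
import os
import re
import tempfile
from collections import Counter

# Same word pattern as the in-memory example
WORD_PATTERN = re.compile(r"\b\w+\b")

def top_k_words_in_file(path, k, encoding="utf-8"):
    """Return the k most frequent words in the text file at `path`."""
    counter = Counter()
    with open(path, encoding=encoding) as handle:
        for line in handle:
            # update() adds the counts for this line to the running totals
            counter.update(WORD_PATTERN.findall(line.lower()))
    return counter.most_common(k)

# Demo: write a small temporary file (hypothetical content), then count it
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Python is fun.\nPython is fast.\n")
    demo_path = tmp.name

demo_result = top_k_words_in_file(demo_path, 2)
os.remove(demo_path)

print(demo_result)  # [('python', 2), ('is', 2)]
```

Because the pattern is applied per line, a word split across two lines would be counted as two tokens; for ordinary prose this is rarely a concern.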
Using Dictionary Approach
As an alternative, we can count word frequencies manually with a dictionary and sort the result ourselves:
def find_k_frequent_words(text, k):
    # Convert to lowercase and split into words
    words = text.lower().split()
    # Count frequencies using dictionary
    word_freq = {}
    for word in words:
        word_freq[word] = word_freq.get(word, 0) + 1
    # Sort by frequency (descending)
    sorted_words = sorted(word_freq.items(), key=lambda x: x[1], reverse=True)
    return sorted_words[:k]

# Example usage
sample_text = "apple banana apple orange banana apple grape orange apple"
result = find_k_frequent_words(sample_text, 3)
print("Top 3 frequent words:")
for word, freq in result:
    print(f"{word}: {freq}")
The output of the above code is:
Top 3 frequent words:
apple: 4
banana: 2
orange: 2
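One reason to count manually is control over the ordering. In the dictionary approach, words with equal counts keep the order in which they first appeared in the text. If a fixed, input-independent order is preferred, a possible variation is to sort by descending count and then alphabetically. The function name find_k_frequent_words_sorted and the sample sentence below are hypothetical, introduced only for this sketch.

```python
def find_k_frequent_words_sorted(text, k):
    """Top k words by count; ties are broken alphabetically."""
    word_freq = {}
    for word in text.lower().split():
        word_freq[word] = word_freq.get(word, 0) + 1
    # Negating the count sorts high counts first; the word itself breaks ties
    ranked = sorted(word_freq.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

# "orange" appears before "banana" in the text, but both occur twice
sample = "orange banana apple orange banana apple apple"
ranked_result = find_k_frequent_words_sorted(sample, 3)

print(ranked_result)  # [('apple', 3), ('banana', 2), ('orange', 2)]
```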
Comparison of Methods
| Method | Performance | Best For |
|---|---|---|
| Counter | Fast | Simple word counting |
| Dictionary | Moderate | Custom counting logic |
| Counter + regex | Fast | Text preprocessing needed |
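The performance column is qualitative, and actual timings depend on the Python version, the machine, and the data. A rough way to check it on your own setup is to time both counting strategies with the standard timeit module. The sketch below is only an illustration: the helper names count_with_counter and count_with_dict and the repeated sample sentence are assumptions made for this example, and no particular timing result is implied.

```python
import timeit
from collections import Counter

# Hypothetical workload: a short sentence repeated many times
words = ("apple banana apple orange banana apple grape " * 10_000).split()

def count_with_counter(items):
    # Counter counts every element of the iterable in one call
    return Counter(items)

def count_with_dict(items):
    # Manual counting with dict.get(), as in the dictionary approach
    freq = {}
    for item in items:
        freq[item] = freq.get(item, 0) + 1
    return freq

# Run each strategy 20 times and report the total elapsed seconds
counter_seconds = timeit.timeit(lambda: count_with_counter(words), number=20)
dict_seconds = timeit.timeit(lambda: count_with_dict(words), number=20)

print(f"Counter: {counter_seconds:.4f}s, dict: {dict_seconds:.4f}s")
```

Both functions produce the same counts, so the comparison measures only the counting step, not sorting or text cleanup.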
Conclusion
The Counter class from the collections module is a concise and efficient way to find the k most frequent words. Use most_common(k) to get the top k results, and combine it with a regular expression when the text needs preprocessing, such as stripping punctuation.
