N-gram Language Modeling with NLTK
Machine translation, voice recognition, and text prediction all benefit significantly from language modeling, which is an integral aspect of NLP. The well-known statistical technique N-gram language modeling predicts the next word in a sequence given the previous n words. This tutorial explores N-gram language modeling using the Natural Language Toolkit (NLTK), a robust Python library for natural language processing tasks.
Understanding N-grams and Language Modeling
N-grams are sequences of n consecutive elements (usually words) from a text. Different types include:
- Unigrams (n=1): Individual words like "the", "cat", "runs"
- Bigrams (n=2): Word pairs like "the cat", "cat runs"
- Trigrams (n=3): Three-word sequences like "the cat runs"
N-gram models are based on the Markov assumption: the probability of the next word depends only on the previous n-1 words, not the entire history.
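The sliding-window idea behind N-grams can be sketched in a few lines of plain Python before turning to NLTK (make_ngrams is a hypothetical helper written for illustration, not part of NLTK):

```python
def make_ngrams(tokens, n):
    # Slide a window of size n across the token list;
    # zip stops at the shortest slice, so no padding is needed
    return list(zip(*(tokens[i:] for i in range(n))))

words = ["the", "cat", "sat", "on", "the", "mat"]
print(make_ngrams(words, 2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```

NLTK's ngrams() function, shown below, produces the same tuples.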
Setting Up NLTK
First, install and import the required libraries:
import nltk
from nltk import FreqDist
from collections import defaultdict
# Download required NLTK data
nltk.download('punkt')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipped to /root/nltk_data/tokenizers/punkt.zip.
True
Tokenization and Text Preprocessing
Tokenization breaks text into individual words or sentences. This is essential for N-gram generation:
# Sample text corpus
text = "The cat sat on the mat. The dog ran in the park. The cat loves the park."
# Tokenize into words
tokens = nltk.word_tokenize(text.lower())
print("Tokens:", tokens)
Tokens: ['the', 'cat', 'sat', 'on', 'the', 'mat', '.', 'the', 'dog', 'ran', 'in', 'the', 'park', '.', 'the', 'cat', 'loves', 'the', 'park', '.']
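For comparison, a naive whitespace tokenizer can be sketched in plain Python. It handles simple text, but unlike nltk.word_tokenize it discards punctuation rather than keeping it as separate tokens (a rough sketch, not a replacement):

```python
import string

text = "The cat sat on the mat. The dog ran in the park."
# Lowercase, split on whitespace, then strip surrounding punctuation
naive_tokens = [w.strip(string.punctuation) for w in text.lower().split()]
naive_tokens = [t for t in naive_tokens if t]  # drop empty strings
print(naive_tokens)
```

Keeping punctuation as tokens, as NLTK does, lets sentence boundaries participate in the N-gram statistics, which matters for the models below.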
Generating N-grams with NLTK
NLTK provides the ngrams() function to generate N-grams from tokenized text:
# Generate different types of n-grams
unigrams = list(nltk.ngrams(tokens, 1))
bigrams = list(nltk.ngrams(tokens, 2))
trigrams = list(nltk.ngrams(tokens, 3))
print("Unigrams (first 5):", unigrams[:5])
print("Bigrams (first 5):", bigrams[:5])
print("Trigrams (first 5):", trigrams[:5])
Unigrams (first 5): [('the',), ('cat',), ('sat',), ('on',), ('the',)]
Bigrams (first 5): [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
Trigrams (first 5): [('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ('sat', 'on', 'the'), ('on', 'the', 'mat'), ('the', 'mat', '.')]
Building a Unigram Language Model
A unigram model calculates word probabilities based on their frequency in the corpus:
# Create frequency distribution for unigrams
unigram_freq = FreqDist(unigrams)
# Calculate probability for a specific word
word = 'the'
total_words = len(unigrams)
word_count = unigram_freq[('the',)]
probability = word_count / total_words
print(f"Word '{word}' appears {word_count} times out of {total_words}")
print(f"Probability of '{word}': {probability:.3f}")
# Show top 5 most frequent words
print("\nTop 5 most frequent words:")
for word, count in unigram_freq.most_common(5):
    prob = count / total_words
    print(f"{word[0]}: {prob:.3f}")
Word 'the' appears 6 times out of 20
Probability of 'the': 0.300

Top 5 most frequent words:
the: 0.300
.: 0.150
cat: 0.100
park: 0.100
mat: 0.050
Building a Bigram Language Model
A bigram model predicts the next word based on the previous word:
# Create bigram frequency distribution
bigram_freq = FreqDist(bigrams)
# Create conditional frequency distribution
# This counts how often word2 follows word1
conditional_freq = defaultdict(lambda: defaultdict(int))
for word1, word2 in bigrams:
    conditional_freq[word1][word2] += 1

# Function to calculate bigram probability
def bigram_probability(word1, word2):
    if word1 not in conditional_freq:
        return 0
    word1_count = sum(conditional_freq[word1].values())
    word2_given_word1 = conditional_freq[word1][word2]
    return word2_given_word1 / word1_count

# Test bigram probabilities
test_pairs = [('the', 'cat'), ('cat', 'sat'), ('the', 'dog')]
for w1, w2 in test_pairs:
    prob = bigram_probability(w1, w2)
    print(f"P('{w2}' | '{w1}') = {prob:.3f}")
P('cat' | 'the') = 0.333
P('sat' | 'cat') = 0.500
P('dog' | 'the') = 0.167
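Raw counts assign probability 0 to any pair never seen in training, such as ('the', 'ran'). A common remedy is add-one (Laplace) smoothing: add 1 to every count and the vocabulary size to the denominator. The sketch below rebuilds the counts inline so it is self-contained; smoothed_bigram_probability is a hypothetical helper name:

```python
from collections import defaultdict

tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', '.',
          'the', 'dog', 'ran', 'in', 'the', 'park', '.',
          'the', 'cat', 'loves', 'the', 'park', '.']

# Bigram counts, built the same way as in the tutorial
conditional_freq = defaultdict(lambda: defaultdict(int))
for w1, w2 in zip(tokens, tokens[1:]):
    conditional_freq[w1][w2] += 1
vocab = set(tokens)

def smoothed_bigram_probability(word1, word2):
    # Add-one smoothing: (count + 1) / (total + |V|)
    count = conditional_freq[word1][word2]
    total = sum(conditional_freq[word1].values())
    return (count + 1) / (total + len(vocab))

print(f"{smoothed_bigram_probability('the', 'ran'):.3f}")  # unseen pair, now nonzero
print(f"{smoothed_bigram_probability('the', 'cat'):.3f}")
```

The trade-off is that probability mass is shifted from seen pairs to unseen ones, so frequent pairs score slightly lower than under the raw estimate.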
Text Generation with N-gram Models
Use the trained model to generate new text by predicting the most likely next word:
import random
def generate_text(start_word, length=5):
    current_word = start_word
    result = [current_word]
    for _ in range(length - 1):
        if current_word not in conditional_freq:
            break
        # Get possible next words and their counts
        next_words = conditional_freq[current_word]
        if not next_words:
            break
        # Choose next word based on probability (weighted random)
        words = list(next_words.keys())
        weights = list(next_words.values())
        current_word = random.choices(words, weights=weights)[0]
        result.append(current_word)
    return ' '.join(result)
# Generate text starting with "the"
generated = generate_text("the", 8)
print(f"Generated text: {generated}")
Generated text: the cat sat on the park . the
Model Evaluation
Evaluate model performance using perplexity, a measure of how well the model predicts the test data:
import math
def calculate_perplexity(test_bigrams):
    log_prob_sum = 0
    n = 0
    for word1, word2 in test_bigrams:
        prob = bigram_probability(word1, word2)
        if prob > 0:
            # Use log base 2 to match the 2 ** (...) formula below
            log_prob_sum += math.log2(prob)
            n += 1
    if n == 0:
        return float('inf')
    # Perplexity = 2^(-log_prob_sum / n)
    return 2 ** (-log_prob_sum / n)
# Test on our existing bigrams
perplexity = calculate_perplexity(bigrams)
print(f"Model perplexity: {perplexity:.2f}")
print("(Lower perplexity indicates better performance)")
Model perplexity: 1.64
(Lower perplexity indicates better performance)
Comparison of N-gram Models
| Model Type | Context Size | Best For | Limitation |
|---|---|---|---|
| Unigram | 0 (no context) | Simple word frequency | No word order |
| Bigram | 1 word | Local dependencies | Limited context |
| Trigram | 2 words | Better context | Data sparsity |
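The bigram generator above extends naturally to a trigram model by conditioning on a two-word context instead of a single word. A minimal sketch on the same corpus (generate_trigram_text is a hypothetical name):

```python
import random
from collections import defaultdict

tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', '.',
          'the', 'dog', 'ran', 'in', 'the', 'park', '.',
          'the', 'cat', 'loves', 'the', 'park', '.']

# Count which word follows each two-word context
tri_freq = defaultdict(lambda: defaultdict(int))
for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
    tri_freq[(w1, w2)][w3] += 1

def generate_trigram_text(w1, w2, length=8):
    result = [w1, w2]
    for _ in range(length - 2):
        context = (result[-2], result[-1])
        if context not in tri_freq:
            break  # dead end: context never seen in training
        candidates = tri_freq[context]
        words = list(candidates.keys())
        weights = list(candidates.values())
        result.append(random.choices(words, weights=weights)[0])
    return ' '.join(result)

print(generate_trigram_text('the', 'cat'))
```

The data-sparsity limitation in the table shows up immediately here: on a corpus this small, most two-word contexts have only one observed successor, so generation mostly replays the training text.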
Conclusion
N-gram language modeling with NLTK provides a foundation for understanding statistical language patterns. Bigram and trigram models capture local word dependencies, while unigram models focus on individual word frequencies. These models form the basis for more advanced NLP applications like text generation, machine translation, and spell checking.
