N-gram Language Modeling with NLTK
Machine translation, voice recognition, and text prediction all benefit significantly from language modeling, which is an integral aspect of NLP. The well-known statistical technique N-gram language modeling predicts the next word in a sequence given the previous n words. This tutorial explores N-gram language modeling using the Natural Language Toolkit (NLTK), a robust Python library for natural language processing tasks.
Understanding N-grams and Language Modeling
N-grams are sequences of n consecutive elements (usually words) from a text. Different types include:
- Unigrams (n=1): Individual words like "the", "cat", "runs"
- Bigrams (n=2): Word pairs like "the cat", "cat runs"
- Trigrams (n=3): Three-word sequences like "the cat runs"
N-gram models are based on the Markov assumption: the probability of the next word depends only on the previous n-1 words, not the entire history.
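The sliding-window idea behind N-grams can be sketched in a few lines of plain Python before turning to NLTK (make_ngrams is a hypothetical helper written for illustration, not part of NLTK):

```python
def make_ngrams(tokens, n):
    # Slide a window of size n across the token list;
    # zip stops at the shortest slice, so no padding is needed
    return list(zip(*(tokens[i:] for i in range(n))))

words = ["the", "cat", "sat", "on", "the", "mat"]
print(make_ngrams(words, 2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```

NLTK's ngrams() function, shown below, produces the same tuples.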
Setting Up NLTK
First, install and import the required libraries:
import nltk
from nltk import FreqDist
from collections import defaultdict
# Download required NLTK data
nltk.download('punkt')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipped to /root/nltk_data/tokenizers/punkt.zip.
True
Tokenization and Text Preprocessing
Tokenization breaks text into individual words or sentences. This is essential for N-gram generation:
# Sample text corpus
text = "The cat sat on the mat. The dog ran in the park. The cat loves the park."
# Tokenize into words
tokens = nltk.word_tokenize(text.lower())
print("Tokens:", tokens)
Tokens: ['the', 'cat', 'sat', 'on', 'the', 'mat', '.', 'the', 'dog', 'ran', 'in', 'the', 'park', '.', 'the', 'cat', 'loves', 'the', 'park', '.']
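For comparison, a naive whitespace tokenizer can be sketched in plain Python. It handles simple text, but unlike nltk.word_tokenize it discards punctuation rather than keeping it as separate tokens (a rough sketch, not a replacement):

```python
import string

text = "The cat sat on the mat. The dog ran in the park."
# Lowercase, split on whitespace, then strip surrounding punctuation
naive_tokens = [w.strip(string.punctuation) for w in text.lower().split()]
naive_tokens = [t for t in naive_tokens if t]  # drop empty strings
print(naive_tokens)
```

Keeping punctuation as tokens, as NLTK does, lets sentence boundaries participate in the N-gram statistics, which matters for the models below.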
Generating N-grams with NLTK
NLTK provides the ngrams() function to generate N-grams from tokenized text:
# Generate different types of n-grams
unigrams = list(nltk.ngrams(tokens, 1))
bigrams = list(nltk.ngrams(tokens, 2))
trigrams = list(nltk.ngrams(tokens, 3))
print("Unigrams (first 5):", unigrams[:5])
print("Bigrams (first 5):", bigrams[:5])
print("Trigrams (first 5):", trigrams[:5])
Unigrams (first 5): [('the',), ('cat',), ('sat',), ('on',), ('the',)]
Bigrams (first 5): [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
Trigrams (first 5): [('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ('sat', 'on', 'the'), ('on', 'the', 'mat'), ('the', 'mat', '.')]
Building a Unigram Language Model
A unigram model calculates word probabilities based on their frequency in the corpus:
# Create frequency distribution for unigrams
unigram_freq = FreqDist(unigrams)
# Calculate probability for a specific word
word = 'the'
total_words = len(unigrams)
word_count = unigram_freq[('the',)]
probability = word_count / total_words
print(f"Word '{word}' appears {word_count} times out of {total_words}")
print(f"Probability of '{word}': {probability:.3f}")
# Show top 5 most frequent words
print("\nTop 5 most frequent words:")
for word, count in unigram_freq.most_common(5):
    prob = count / total_words
    print(f"{word[0]}: {prob:.3f}")
Word 'the' appears 6 times out of 20
Probability of 'the': 0.300

Top 5 most frequent words:
the: 0.300
.: 0.150
cat: 0.100
park: 0.100
mat: 0.050
Building a Bigram Language Model
A bigram model predicts the next word based on the previous word:
# Create bigram frequency distribution
bigram_freq = FreqDist(bigrams)
# Create conditional frequency distribution
# This counts how often word2 follows word1
conditional_freq = defaultdict(lambda: defaultdict(int))
for word1, word2 in bigrams:
    conditional_freq[word1][word2] += 1

# Function to calculate bigram probability
def bigram_probability(word1, word2):
    if word1 not in conditional_freq:
        return 0
    word1_count = sum(conditional_freq[word1].values())
    word2_given_word1 = conditional_freq[word1][word2]
    return word2_given_word1 / word1_count

# Test bigram probabilities
test_pairs = [('the', 'cat'), ('cat', 'sat'), ('the', 'dog')]
for w1, w2 in test_pairs:
    prob = bigram_probability(w1, w2)
    print(f"P('{w2}' | '{w1}') = {prob:.3f}")
P('cat' | 'the') = 0.333
P('sat' | 'cat') = 0.500
P('dog' | 'the') = 0.167
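Raw counts assign probability 0 to any pair never seen in training, such as ('the', 'ran'). A common remedy is add-one (Laplace) smoothing: add 1 to every count and the vocabulary size to the denominator. The sketch below rebuilds the counts inline so it is self-contained; smoothed_bigram_probability is a hypothetical helper name:

```python
from collections import defaultdict

tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', '.',
          'the', 'dog', 'ran', 'in', 'the', 'park', '.',
          'the', 'cat', 'loves', 'the', 'park', '.']

# Bigram counts, built the same way as in the tutorial
conditional_freq = defaultdict(lambda: defaultdict(int))
for w1, w2 in zip(tokens, tokens[1:]):
    conditional_freq[w1][w2] += 1
vocab = set(tokens)

def smoothed_bigram_probability(word1, word2):
    # Add-one smoothing: (count + 1) / (total + |V|)
    count = conditional_freq[word1][word2]
    total = sum(conditional_freq[word1].values())
    return (count + 1) / (total + len(vocab))

print(f"{smoothed_bigram_probability('the', 'ran'):.3f}")  # unseen pair, now nonzero
print(f"{smoothed_bigram_probability('the', 'cat'):.3f}")
```

The trade-off is that probability mass is shifted from seen pairs to unseen ones, so frequent pairs score slightly lower than under the raw estimate.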
Text Generation with N-gram Models
Use the trained model to generate new text by predicting the most likely next word:
import random
def generate_text(start_word, length=5):
    current_word = start_word
    result = [current_word]
    for _ in range(length - 1):
        if current_word not in conditional_freq:
            break
        # Get possible next words and their counts
        next_words = conditional_freq[current_word]
        if not next_words:
            break
        # Choose next word based on probability (weighted random)
        words = list(next_words.keys())
        weights = list(next_words.values())
        current_word = random.choices(words, weights=weights)[0]
        result.append(current_word)
    return ' '.join(result)
# Generate text starting with "the"
generated = generate_text("the", 8)
print(f"Generated text: {generated}")
Generated text: the cat sat on the park . the
Model Evaluation
Evaluate model performance using perplexity, a measure of how well the model predicts the test data:
import math
def calculate_perplexity(test_bigrams):
    log_prob_sum = 0
    n = 0
    for word1, word2 in test_bigrams:
        prob = bigram_probability(word1, word2)
        if prob > 0:
            # Use log base 2 to match the 2 ** (...) formula below
            log_prob_sum += math.log2(prob)
            n += 1
    if n == 0:
        return float('inf')
    # Perplexity = 2^(-log_prob_sum / n)
    return 2 ** (-log_prob_sum / n)
# Test on our existing bigrams
perplexity = calculate_perplexity(bigrams)
print(f"Model perplexity: {perplexity:.2f}")
print("(Lower perplexity indicates better performance)")
Model perplexity: 1.64
(Lower perplexity indicates better performance)
Comparison of N-gram Models
| Model Type | Context Size | Best For | Limitation |
|---|---|---|---|
| Unigram | 0 (no context) | Simple word frequency | No word order |
| Bigram | 1 word | Local dependencies | Limited context |
| Trigram | 2 words | Better context | Data sparsity |
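The bigram generator above extends naturally to a trigram model by conditioning on a two-word context instead of a single word. A minimal sketch on the same corpus (generate_trigram_text is a hypothetical name):

```python
import random
from collections import defaultdict

tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', '.',
          'the', 'dog', 'ran', 'in', 'the', 'park', '.',
          'the', 'cat', 'loves', 'the', 'park', '.']

# Count which word follows each two-word context
tri_freq = defaultdict(lambda: defaultdict(int))
for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
    tri_freq[(w1, w2)][w3] += 1

def generate_trigram_text(w1, w2, length=8):
    result = [w1, w2]
    for _ in range(length - 2):
        context = (result[-2], result[-1])
        if context not in tri_freq:
            break  # dead end: context never seen in training
        candidates = tri_freq[context]
        words = list(candidates.keys())
        weights = list(candidates.values())
        result.append(random.choices(words, weights=weights)[0])
    return ' '.join(result)

print(generate_trigram_text('the', 'cat'))
```

The data-sparsity limitation in the table shows up immediately here: on a corpus this small, most two-word contexts have only one observed successor, so generation mostly replays the training text.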
Conclusion
N-gram language modeling with NLTK provides a foundation for understanding statistical language patterns. Bigram and trigram models capture local word dependencies, while unigram models focus on individual word frequencies. These models form the basis for more advanced NLP applications like text generation, machine translation, and spell checking.
