NLP Language Models: What Are They? Example with N-grams
Language models in NLP are computational models that capture relationships between words and phrases to predict and generate text. They calculate the probability of the next word in a sequence and determine how likely an entire sequence of words is to occur naturally.
These models power many everyday applications: autocomplete on your phone, grammar checkers, translation tools, and voice assistants. When your keyboard suggests the next word or corrects a typo, it is using a probabilistic language model behind the scenes.
This article explores how language models work, focusing on N-gram models, one of the fundamental statistical approaches to language modeling.
What are N-gram Language Models?
N-gram models are statistical language models that use the Markov assumption to simplify probability calculations. Instead of conditioning on the entire history of preceding words, they assume that each word depends only on the previous N-1 words.
Mathematical Foundation
Using the chain rule of probability, the probability of a word sequence is:
P(W1, W2, ..., Wn) = P(W1) × P(W2|W1) × P(W3|W1,W2) × ... × P(Wn|W1,W2,...,Wn-1)
However, calculating this exactly requires enormous computational resources. The N-gram approach simplifies it using the Markov assumption:
P(Wi | Wi-1, Wi-2, ..., W1) ≈ P(Wi | Wi-1, Wi-2, ..., Wi-N+1)
For a bigram model (N=2), this becomes:
P(W1, W2, ..., Wn) ≈ P(W1) × P(W2|W1) × P(W3|W2) × ... × P(Wn|Wn-1)
Types of N-gram Models
| Model Type | N Value | Dependencies | Example |
|---|---|---|---|
| Unigram | 1 | No context | P(word) |
| Bigram | 2 | Previous 1 word | P(word \| prev_word) |
| Trigram | 3 | Previous 2 words | P(word \| prev_word1, prev_word2) |
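The table above can be made concrete with a small sketch: a single helper that slides a window of size N over a tokenized sentence to produce its unigrams, bigrams, or trigrams (the function name `extract_ngrams` is illustrative, not from any library):

```python
def extract_ngrams(words, n):
    """Return the list of n-grams (as tuples) from a list of word tokens."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

tokens = "the cat sat on the mat".split()
print(extract_ngrams(tokens, 1))  # unigrams: [('the',), ('cat',), ...]
print(extract_ngrams(tokens, 2))  # bigrams:  [('the', 'cat'), ('cat', 'sat'), ...]
print(extract_ngrams(tokens, 3))  # trigrams: [('the', 'cat', 'sat'), ...]
```

Note that a sentence of length L yields L - N + 1 n-grams, so higher N values produce fewer, but more context-rich, units.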
Example: Bigram Model Implementation
Let's implement a simple bigram model to understand how probabilities are calculated:
```python
from collections import Counter

# Sample training text
text = "the cat sat on the mat the dog ran"
words = text.split()

# Add start and end tokens
words = ['<START>'] + words + ['<END>']

# Count bigrams and unigrams
bigram_counts = Counter()
unigram_counts = Counter()

for i in range(len(words) - 1):
    bigram = (words[i], words[i + 1])
    bigram_counts[bigram] += 1
    unigram_counts[words[i]] += 1

print("Bigram counts:", dict(bigram_counts))
print("Unigram counts:", dict(unigram_counts))
```

```
Bigram counts: {('<START>', 'the'): 1, ('the', 'cat'): 1, ('cat', 'sat'): 1, ('sat', 'on'): 1, ('on', 'the'): 1, ('the', 'mat'): 1, ('mat', 'the'): 1, ('the', 'dog'): 1, ('dog', 'ran'): 1, ('ran', '<END>'): 1}
Unigram counts: {'<START>': 1, 'the': 3, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1, 'dog': 1, 'ran': 1}
```
Calculating Bigram Probabilities
The probability of a word given the previous word is calculated as:
```python
def calculate_bigram_probability(word1, word2):
    bigram = (word1, word2)
    if bigram in bigram_counts and word1 in unigram_counts:
        return bigram_counts[bigram] / unigram_counts[word1]
    return 0

# Calculate some probabilities
prob_the_cat = calculate_bigram_probability('the', 'cat')
prob_the_dog = calculate_bigram_probability('the', 'dog')
prob_the_mat = calculate_bigram_probability('the', 'mat')

print(f"P(cat | the) = {prob_the_cat:.3f}")
print(f"P(dog | the) = {prob_the_dog:.3f}")
print(f"P(mat | the) = {prob_the_mat:.3f}")
```

```
P(cat | the) = 0.333
P(dog | the) = 0.333
P(mat | the) = 0.333
```
Practical Example: Sentence Probability
Let's calculate the probability of the sentence "the cat sat" using our bigram model:
```python
def sentence_probability(sentence):
    sentence_words = ['<START>'] + sentence.split() + ['<END>']
    probability = 1.0
    for i in range(len(sentence_words) - 1):
        word1, word2 = sentence_words[i], sentence_words[i + 1]
        prob = calculate_bigram_probability(word1, word2)
        probability *= prob
        print(f"P({word2} | {word1}) = {prob:.3f}")
    return probability

sentence = "the cat sat"
final_prob = sentence_probability(sentence)
print(f"\nOverall probability of '{sentence}': {final_prob:.6f}")
```

```
P(the | <START>) = 1.000
P(cat | the) = 0.333
P(sat | cat) = 1.000
P(<END> | sat) = 0.000

Overall probability of 'the cat sat': 0.000000
```
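The overall probability collapses to zero because the bigram ('sat', '<END>') never appears in the training text, even though "the cat sat" is a perfectly plausible sentence. This data-sparsity problem is usually addressed with smoothing. A minimal sketch of add-one (Laplace) smoothing, which pretends every possible bigram was seen one extra time (the function name `smoothed_bigram_probability` is illustrative):

```python
from collections import Counter

# Rebuild the counts from the same training text as above
words = ['<START>'] + "the cat sat on the mat the dog ran".split() + ['<END>']
bigram_counts = Counter(zip(words, words[1:]))
unigram_counts = Counter(words[:-1])
vocab = set(words)

def smoothed_bigram_probability(word1, word2):
    """Add-one (Laplace) smoothing: add 1 to every bigram count and
    grow the denominator by the vocabulary size to keep a valid distribution."""
    return (bigram_counts[(word1, word2)] + 1) / (unigram_counts[word1] + len(vocab))

# The unseen bigram now gets a small but nonzero probability: (0+1)/(1+9)
print(f"P(<END> | sat) = {smoothed_bigram_probability('sat', '<END>'):.3f}")  # 0.100
```

With smoothing applied at every step, the sentence probability above becomes small but nonzero, which is the desired behavior for plausible-but-unseen word sequences.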
Choosing the Right N-gram Size
The choice of N depends on your dataset size and application requirements:
- Small datasets: Use bigrams or trigrams to avoid sparsity
- Large datasets: Higher N values (4-grams, 5-grams) capture more context
- Real-time applications: Lower N for faster computation
- High accuracy needs: Higher N for better context understanding
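Whatever N you choose, the same counts also let you generate text, which is exactly how autocomplete-style suggestions work. A sketch of sampling from the bigram model trained above (the `generate` helper and `seed` parameter are illustrative):

```python
import random
from collections import Counter, defaultdict

words = ['<START>'] + "the cat sat on the mat the dog ran".split() + ['<END>']

# Map each word to a Counter of the words observed to follow it
successors = defaultdict(Counter)
for w1, w2 in zip(words, words[1:]):
    successors[w1][w2] += 1

def generate(max_words=10, seed=0):
    """Sample a sentence by repeatedly drawing the next word
    with probability proportional to its bigram count."""
    rng = random.Random(seed)
    current, out = '<START>', []
    for _ in range(max_words):
        choices = successors[current]
        current = rng.choices(list(choices), weights=list(choices.values()))[0]
        if current == '<END>':
            break
        out.append(current)
    return ' '.join(out)

print(generate())
```

Because P(the | &lt;START&gt;) = 1 in this toy corpus, every generated sentence begins with "the"; with a larger corpus the sampler produces more varied continuations.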
Conclusion
N-gram models provide a foundational understanding of statistical language modeling using the Markov assumption to simplify probability calculations. While modern neural language models have largely superseded them, N-grams remain important for understanding the mathematical principles behind language modeling and are still useful for resource-constrained applications.
