NLP language models: What are they? Example with N-grams

Language models in NLP are computational models that capture relationships between words and phrases to predict and generate text. They calculate the probability of the next word in a sequence and determine how likely an entire sequence of words is to occur naturally.

These models power many everyday applications: autocomplete on your phone, grammar checkers, translation tools, and voice assistants. When your keyboard suggests the next word or corrects a typo, a probabilistic language model is working behind the scenes.

This article explores how language models work, focusing on N-gram models, one of the fundamental statistical approaches to language modeling.

What are N-gram Language Models?

N-gram models are statistical language models that use the Markov assumption to simplify probability calculations. Instead of conditioning on the entire history of preceding words, they assume that each word depends only on the previous N-1 words.

Mathematical Foundation

Using the chain rule of probability, the probability of a word sequence is:

P(W1, W2, ..., Wn) = P(W1) × P(W2|W1) × P(W3|W1,W2) × ... × P(Wn|W1,W2,...,Wn-1)

However, calculating this exactly is infeasible: the longer histories become so rare that their probabilities cannot be estimated reliably. The N-gram approach simplifies the calculation using the Markov assumption:

P(Wi | Wi-1, Wi-2, ..., W1) ≈ P(Wi | Wi-1, Wi-2, ..., Wi-N+1)

For a bigram model (N=2), this becomes:

P(W1, W2, ..., Wn) ≈ P(W1) × P(W2|W1) × P(W3|W2) × ... × P(Wn|Wn-1)

Types of N-gram Models

Model Type   N Value   Dependencies       Example
Unigram      1         No context         P(word)
Bigram       2         Previous 1 word    P(word | prev_word)
Trigram      3         Previous 2 words   P(word | prev_word1, prev_word2)
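To make the table concrete, here is a minimal sketch of how each model type slices a sentence into N-grams (the helper name `extract_ngrams` is ours, chosen for illustration):

```python
def extract_ngrams(tokens, n):
    """Return all n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()

print(extract_ngrams(tokens, 1))  # unigrams: [('the',), ('cat',), ...]
print(extract_ngrams(tokens, 2))  # bigrams:  [('the', 'cat'), ('cat', 'sat'), ...]
print(extract_ngrams(tokens, 3))  # trigrams: [('the', 'cat', 'sat'), ...]
```

A unigram model ignores context entirely, while each step up in N widens the conditioning window by one word.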

Example: Bigram Model Implementation

Let's implement a simple bigram model to understand how probabilities are calculated:

from collections import defaultdict, Counter

# Sample training text
text = "the cat sat on the mat the dog ran"
words = text.split()

# Add start and end tokens
words = ['<START>'] + words + ['<END>']

# Create bigram counts
bigram_counts = Counter()
unigram_counts = Counter()

for i in range(len(words) - 1):
    bigram = (words[i], words[i + 1])
    bigram_counts[bigram] += 1
    unigram_counts[words[i]] += 1

print("Bigram counts:", dict(bigram_counts))
print("Unigram counts:", dict(unigram_counts))
Output:

Bigram counts: {('<START>', 'the'): 1, ('the', 'cat'): 1, ('cat', 'sat'): 1, ('sat', 'on'): 1, ('on', 'the'): 1, ('the', 'mat'): 1, ('mat', 'the'): 1, ('the', 'dog'): 1, ('dog', 'ran'): 1, ('ran', '<END>'): 1}
Unigram counts: {'<START>': 1, 'the': 3, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1, 'dog': 1, 'ran': 1}

Calculating Bigram Probabilities

The probability of a word given the previous word is the bigram count divided by the count of the previous word, P(W2 | W1) = count(W1, W2) / count(W1):

def calculate_bigram_probability(word1, word2):
    bigram = (word1, word2)
    if bigram in bigram_counts and word1 in unigram_counts:
        return bigram_counts[bigram] / unigram_counts[word1]
    return 0

# Calculate some probabilities
prob_the_cat = calculate_bigram_probability('the', 'cat')
prob_the_dog = calculate_bigram_probability('the', 'dog')
prob_the_mat = calculate_bigram_probability('the', 'mat')

print(f"P(cat | the) = {prob_the_cat:.3f}")
print(f"P(dog | the) = {prob_the_dog:.3f}")
print(f"P(mat | the) = {prob_the_mat:.3f}")
Output:

P(cat | the) = 0.333
P(dog | the) = 0.333
P(mat | the) = 0.333
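These conditional probabilities are exactly what an autocomplete feature needs: given the previous word, rank the candidate next words. A minimal self-contained sketch (the helper name `predict_next` is ours) using the same training text:

```python
from collections import Counter

text = "the cat sat on the mat the dog ran"
words = ['<START>'] + text.split() + ['<END>']

# Same counts as before, built with Counter over adjacent pairs
bigram_counts = Counter(zip(words, words[1:]))
unigram_counts = Counter(words[:-1])

def predict_next(word):
    """Rank candidate next words for `word` by bigram probability."""
    candidates = {
        w2: count / unigram_counts[w1]
        for (w1, w2), count in bigram_counts.items()
        if w1 == word
    }
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)

print(predict_next('the'))  # 'cat', 'mat' and 'dog' each tie at 1/3
```

In this tiny corpus every continuation of "the" is equally likely; with more training data the ranking would reflect real usage frequencies.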

Practical Example: Sentence Probability

Let's calculate the probability of the sentence "the cat sat" using our bigram model:

def sentence_probability(sentence):
    sentence_words = ['<START>'] + sentence.split() + ['<END>']
    probability = 1.0
    
    for i in range(len(sentence_words) - 1):
        word1, word2 = sentence_words[i], sentence_words[i + 1]
        prob = calculate_bigram_probability(word1, word2)
        probability *= prob
        print(f"P({word2} | {word1}) = {prob:.3f}")
    
    return probability

sentence = "the cat sat"
final_prob = sentence_probability(sentence)
print(f"\nOverall probability of '{sentence}': {final_prob:.6f}")
Output:

P(the | <START>) = 1.000
P(cat | the) = 0.333
P(sat | cat) = 1.000
P(<END> | sat) = 0.000

Overall probability of 'the cat sat': 0.000000

The result is zero because the bigram ('sat', '<END>') never occurs in the training text: a single unseen bigram drives the entire product to zero. This sparsity problem is the main weakness of unsmoothed N-gram models.
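The standard remedy is smoothing, which reserves some probability mass for unseen events. Below is a sketch of add-one (Laplace) smoothing applied to the same counts; the function name is ours, and V denotes the vocabulary size (number of distinct word types, including the special tokens):

```python
from collections import Counter

text = "the cat sat on the mat the dog ran"
words = ['<START>'] + text.split() + ['<END>']

bigram_counts = Counter(zip(words, words[1:]))
unigram_counts = Counter(words[:-1])
V = len(set(words))  # vocabulary size: 9 types in this corpus

def smoothed_bigram_probability(word1, word2):
    """Add-one (Laplace) smoothed estimate of P(word2 | word1)."""
    return (bigram_counts[(word1, word2)] + 1) / (unigram_counts[word1] + V)

# The unseen bigram now gets a small but nonzero probability
print(f"P(<END> | sat) = {smoothed_bigram_probability('sat', '<END>'):.3f}")
# prints P(<END> | sat) = 0.100
```

With smoothing, "the cat sat" receives a small positive probability instead of zero, at the cost of slightly discounting the bigrams that were actually observed.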

Choosing the Right N-gram Size

The choice of N depends on your dataset size and application requirements:

  • Small datasets: Use bigrams or trigrams to avoid sparsity
  • Large datasets: Higher N values (4-grams, 5-grams) capture more context
  • Real-time applications: Lower N for faster computation
  • High accuracy needs: Higher N for better context understanding
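To experiment with this trade-off, the bigram code above can be generalized to any order N. A sketch (function names are ours; boundary tokens are omitted for brevity) that builds counts for an order-N model and computes maximum-likelihood probabilities:

```python
from collections import Counter

def build_ngram_model(tokens, n):
    """Return (ngram_counts, context_counts) for an order-n model."""
    positions = range(len(tokens) - n + 1)
    ngram_counts = Counter(tuple(tokens[i:i + n]) for i in positions)
    context_counts = Counter(tuple(tokens[i:i + n - 1]) for i in positions)
    return ngram_counts, context_counts

def ngram_probability(ngram_counts, context_counts, ngram):
    """Maximum-likelihood estimate P(last word | preceding n-1 words)."""
    context = ngram[:-1]
    if context_counts[context] == 0:
        return 0.0
    return ngram_counts[ngram] / context_counts[context]

tokens = "the cat sat on the mat the dog ran".split()
tri_counts, tri_contexts = build_ngram_model(tokens, 3)
print(ngram_probability(tri_counts, tri_contexts, ('the', 'cat', 'sat')))
```

Raising N makes each context rarer in the training data, so the counts become sparser and smoothing becomes even more important — exactly the tension the list above describes.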

Conclusion

N-gram models provide a foundational understanding of statistical language modeling using the Markov assumption to simplify probability calculations. While modern neural language models have largely superseded them, N-grams remain important for understanding the mathematical principles behind language modeling and are still useful for resource-constrained applications.

Updated on: 2026-03-27T07:39:58+05:30
