NLP Language Models: What Are They? Example with N-grams


Language models in NLP are statistically derived computational models that capture the relations between words and phrases in order to generate new text. Essentially, they can estimate the probability of the next word in a given sequence of words, as well as the probability of an entire sequence of words.

Language models are important because they underpin various NLP tasks such as machine translation, language generation, and word completion, among others. You may not realize it, but when you type on a computer or phone, the corrections and suggestions you see are often guided by NLP: word completion and, in some cases, spelling and phrase error detection are based on probabilistic language models.

This article covers how language models work, their statistical and probabilistic background, the techniques used, and how they are applied in different contexts. Specifically, we will go over N-gram models, one of the most basic types.

What is the N-gram Language Model and How is it Derived?

There are two types of language models: statistical models and neural-network-based language models. N-grams belong to the statistical family; they define a probability distribution over sentences, and over the words within them, with the help of the Markov property. Let’s gain an in-depth understanding of how these models are formed. From conditional probability, we can compute the probability of a sequence of words with the chain rule of probability, and the probability of the next word given its history, using the formulas below, where Wi represents the i-th word in a sentence:

The probability of a sequence will be as follows −

$$\mathrm{Equation\:1\::\:P(W_1,\:W_2,\:.....,\:W_n)\:=\:P(W_1)\:\times\:P(W_2\:|\:W_1)\:\times\:P(W_3\:|\:W_1,\:W_2)\:\times\:.....\:\times\:P(W_n\:|\:W_1,\:W_2,\:.....,\:W_{n-1})}$$

The probability of a given word in a sequence will be −

$$\mathrm{Equation\:2\::\:P(W_i\:|\:W_{i-1},\:W_{i-2},\:.....,\:W_1)\:=\:\frac{Count(w_i,\:w_{i-1},\:w_{i-2},\:.....,\:w_1)}{Count(w_{i-1},\:w_{i-2},\:.....,\:w_1)}}$$
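
To make Equation 2 concrete, here is a minimal Python sketch of the counting estimate, assuming a tiny whitespace-tokenized toy corpus (the sentences and the prob_next_word helper are made up purely for illustration):

```python
# A minimal sketch of the counting estimate in Equation 2: the probability of a
# word given its full history is the count of the history followed by the word,
# divided by the count of the history alone. The toy corpus is illustrative only.
from collections import Counter

corpus = [
    "this is a test sentence".split(),
    "this is another test".split(),
    "this is a short example".split(),
]

# Count every prefix (history) and every prefix extended by one more word.
history_counts = Counter()
extended_counts = Counter()
for sentence in corpus:
    for i in range(1, len(sentence)):
        history = tuple(sentence[:i])
        history_counts[history] += 1
        extended_counts[history + (sentence[i],)] += 1

def prob_next_word(history, word):
    """P(word | full history), estimated by relative counts."""
    if history_counts[tuple(history)] == 0:
        return 0.0
    return extended_counts[tuple(history) + (word,)] / history_counts[tuple(history)]

print(prob_next_word(["this", "is"], "a"))  # 2 of the 3 sentences continue with "a"
```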

However, this kind of method, applied exactly as in the equation above, would require too much time and too many resources to compute on a large-scale corpus. We can instead use our knowledge of probability to find ways of approximating the probabilities of words and sentences.

Conditioning on every single preceding word quickly becomes impractical, so we instead assume the aforementioned Markov property.

This means that instead of conditioning on every preceding word, we condition on only a fixed number of them, which stand in for the entire history. Hence we get the following estimation −

$$\mathrm{Equation\:3\::\:P(W_i\:=\:w_i\:|\:W_{i-1}\:=\:w_{i-1},\:W_{i-2}\:=\:w_{i-2},\:.....,\:W_1\:=\:w_1)\:\approx\:P(W_i\:=\:w_i\:|\:W_{i-1}\:=\:w_{i-1})}$$

Now we can combine “Equation 1” and “Equation 3” to create an approximate formula that is much easier to work with −

$$\mathrm{Equation\:4\::\:P(W_1,\:W_2,\:.....,\:W_n)\:\approx\:P(W_1)\:\times\:P(W_2\:|\:W_1)\:\times\:P(W_3\:|\:W_2)\:\times\:.....\:\times\:P(W_n\:|\:W_{n-1})}$$

Notice how, in the equation above, each probability involves only two words at a time? Within n-gram models this is called a bigram. If we focused on each word in isolation, it would be a unigram, and if we took three words in sequence we would get a trigram. The naming convention continues with 4-gram, 5-gram, and so on.
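
To make the naming concrete, here is a short sketch of extracting unigrams, bigrams, and trigrams by sliding a window over a tokenized sentence (the ngrams helper and the example sentence are illustrative assumptions, not part of any particular library):

```python
# "n-gram" in practice: slide a window of size n over a tokenized sentence.
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is a test sentence".split()
print(ngrams(tokens, 1))  # unigrams: ('this',), ('is',), ...
print(ngrams(tokens, 2))  # bigrams:  ('this', 'is'), ('is', 'a'), ...
print(ngrams(tokens, 3))  # trigrams: ('this', 'is', 'a'), ...
```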

When to Use Which Model

Which type of model to use, say a bigram or a 5-gram model, depends on the size of the dataset. Bigram models are a good fit for small datasets because any given bigram is far more likely to recur in a small corpus than a longer n-gram, which would otherwise suffer from sparsity. For larger datasets, on the other hand, a larger value of n works better.
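
As a rough illustration of this sparsity argument, the sketch below counts how many distinct bigrams versus 5-grams occur in a tiny toy corpus, and how many of them occur only once (the corpus is invented for illustration; real figures depend entirely on your data):

```python
# On a small corpus, most high-order n-grams occur exactly once, so their
# probability estimates are unreliable, while bigrams repeat far more often.
from collections import Counter

corpus = [
    "this is a test sentence".split(),
    "this is another test sentence".split(),
    "this is a short test".split(),
]

def ngram_counts(sentences, n):
    counts = Counter()
    for s in sentences:
        for i in range(len(s) - n + 1):
            counts[tuple(s[i:i + n])] += 1
    return counts

for n in (2, 5):
    counts = ngram_counts(corpus, n)
    singletons = sum(1 for c in counts.values() if c == 1)
    print(f"{n}-grams: {len(counts)} distinct, {singletons} seen only once")
```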

To understand this concretely, let’s replace the variables in the equations above with actual words. Below is an example of a bigram model applied to a specific sentence (note that START and END are markers indicating that the next word begins the sentence or that the previous word ends it) −

P(This is a test sentence) = P(This | START) x P(is | This) x P(a | This is) x P(test | This is a) x P(sentence | This is a test) x P(END | This is a test sentence) ≈ P(This | START) x P(is | This) x P(a | is) x P(test | a) x P(sentence | test) x P(END | sentence)
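
Putting the pieces together, here is a minimal sketch of estimating bigram probabilities from counts and multiplying them out as in the factorization above, using <START> and <END> markers (the toy corpus, marker strings, and helper names are assumptions made for illustration, not a standard API):

```python
# A minimal bigram model: estimate P(word | previous word) by maximum-likelihood
# counts (no smoothing), then multiply the bigram probabilities along a sentence.
from collections import Counter

corpus = [
    "this is a test sentence".split(),
    "this is a test".split(),
    "a test sentence".split(),
]

bigram_counts = Counter()
context_counts = Counter()
for sentence in corpus:
    tokens = ["<START>"] + sentence + ["<END>"]
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def bigram_prob(prev, word):
    """P(word | prev) by relative counts; unseen bigrams get probability 0."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def sentence_prob(sentence):
    tokens = ["<START>"] + sentence.split() + ["<END>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(sentence_prob("this is a test sentence"))
```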

Conclusion

In this tutorial we went over how a particular type of language model can be computed without deep learning, which has grown sharply in popularity over the past few years. This should give you a good sense of how different ideas within a constantly changing field come about.
