Training Unigram Tagger in NLP


Introduction

A single token is called a unigram. A unigram tagger is a type of tagger that uses only one word of context to infer the part of speech of a word. The NLTK library provides the UnigramTagger class, which inherits from NgramTagger.

In this article, let us understand the training process of the Unigram Tagger in NLP.

Unigram Tagger and its training using NLTK

Working

  • The UnigramTagger inherits from ContextTagger and implements a context() method, which takes the same arguments as choose_tag().

  • The context() method returns the word token itself: a single word is the entire context used to look up the best tag.

  • During training, the UnigramTagger builds a model that maps each context (word) to its most frequent tag, as sketched below.
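
To make this concrete, here is a minimal sketch of the unigram idea in plain Python (a simplified illustration, not NLTK's actual internals; the toy training data is hypothetical):

from collections import Counter, defaultdict

# Hypothetical toy training data: (word, tag) pairs
train = [("the", "DT"), ("dog", "NN"), ("barks", "VBZ"),
         ("the", "DT"), ("dog", "NN")]

# Count how often each tag occurs for each word context
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

# The model maps each word to its single most frequent tag
model = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def unigram_tag(tokens):
    # Unseen words get None, just as NLTK's UnigramTagger returns None
    return [(tok, model.get(tok)) for tok in tokens]

print(unigram_tag(["the", "dog", "runs"]))
# [('the', 'DT'), ('dog', 'NN'), ('runs', None)]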

Python Implementation

import nltk
nltk.download('treebank')
from nltk.tag import UnigramTagger
from nltk.corpus import treebank as tb

# Train on the first 4000 tagged sentences of the Treebank sample
sentences_trained = tb.tagged_sents()[:4000]
uni_tagger = UnigramTagger(sentences_trained)

print("Sample Sentence : ", tb.sents()[1])
print("Tag sample sentence : ", uni_tagger.tag(tb.sents()[1]))

Output

Sample Sentence :  ['Mr.', 'Vinken', 'is', 'chairman', 'of', 'Elsevier', 'N.V.', ',', 'the', 'Dutch', 'publishing', 'group', '.']
Tag sample sentence :  [('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), ('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'), ('Dutch', 'JJ'), ('publishing', 'NN'), ('group', 'NN'), ('.', '.')] 

In the above code example, the Unigram Tagger is first trained on the first 4000 sentences from the Treebank corpus. Once trained, the same tagger can tag any sentence; here, sentence 1 of the corpus is tagged.

The code example below can be used to evaluate the Unigram Tagger on test sentences.

from nltk.tag import UnigramTagger
from nltk.corpus import treebank as tb

# Train on the first 4000 tagged sentences, then score on sentences from index 3000 onward
sentences_trained = tb.tagged_sents()[:4000]
uni_tagger = UnigramTagger(sentences_trained)
sent_tested = tb.tagged_sents()[3000:]
print("Test score : ", uni_tagger.evaluate(sent_tested))

Output

Test score :  0.96

In the above code example, the Unigram Tagger is trained on the first 4000 sentences and then evaluated on the sentences from index 3000 onward. Note that the NLTK Treebank sample contains only 3914 tagged sentences, so this test set overlaps the training data, which inflates the score; for a fair evaluation, the training and test sets should be disjoint.
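
A minimal sketch of a disjoint split (the 90/10 boundary is an arbitrary choice for illustration, not part of the original example):

from nltk.tag import UnigramTagger
from nltk.corpus import treebank as tb

tagged = tb.tagged_sents()
split = int(len(tagged) * 0.9)  # 90% train / 10% test; arbitrary choice
train_sents, test_sents = tagged[:split], tagged[split:]

tagger = UnigramTagger(train_sents)
# evaluate() reports tagging accuracy on the held-out sentences
print("Held-out score : ", tagger.evaluate(test_sents))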

Smoothing Techniques

In NLP we often need to build statistical models, for example to predict the next word from training data or to autocomplete sentences. Given the enormous number of possible word combinations, it is essential that the model predicts the most likely words. Smoothing helps here: it adjusts the probabilities in the trained model so that it predicts words more accurately and can even assign a non-zero probability to words not present in the training corpus.

Types of Smoothing

Laplace Smoothing

It is also known as add-one smoothing: we add 1 to each count in the numerator and the vocabulary size to the denominator, so that no n-gram receives a zero probability and we never divide by zero.

For example,

Prob_laplace(wi | w(i-1)) = (count(w(i-1) wi) + 1) / (count(w(i-1)) + V)
V = vocabulary size, i.e. the number of distinct words in the training corpus

With hypothetical counts from a small training corpus of vocabulary size V = 6:

Prob("I likes coffee")
= Prob( I | <S>) * Prob( likes | I) * Prob( coffee | likes) * Prob( <E> | coffee)
= ((1+1) / (4+6)) * ((1+1) / (1+6)) * ((0+1) / (1+6)) * ((1+1) / (4+6))
≈ 0.00163
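
The same computation can be expressed as a short Python sketch (the toy corpus below is hypothetical, chosen only to illustrate the formula):

from collections import Counter

def laplace_bigram_model(corpus):
    # Count bigrams and context unigrams over sentences padded with <S> and <E>
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for sent in corpus:
        tokens = ["<S>"] + sent + ["<E>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    V = len(vocab)  # vocabulary size, including the <S> and <E> markers
    def prob(cur, prev):
        # Add-one smoothing: +1 in the numerator, +V in the denominator
        return (bigrams[(prev, cur)] + 1) / (contexts[prev] + V)
    return prob

# Hypothetical toy corpus
p = laplace_bigram_model([["I", "likes", "coffee"], ["I", "likes", "tea"]])
print(p("coffee", "likes"))  # non-zero even for rarely seen bigrams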

Backoff and Interpolation

It involves two related techniques:

Back-off process

  • We begin with the n-gram model,

  • If its observations are insufficient, we back off to the (n-1)-gram model,

  • If the observations are still insufficient, we back off to the (n-2)-gram model, and so on.

Interpolation process

  • We use a weighted combination of different n-gram models, for example:
    Prob_interp(wi | w(i-2) w(i-1)) = λ3 * Prob(wi | w(i-2) w(i-1)) + λ2 * Prob(wi | w(i-1)) + λ1 * Prob(wi), where λ1 + λ2 + λ3 = 1

For example, consider the phrase "he went xxx": if the trigram "he went to" occurred once in the training data and no other trigram beginning with "he went" was observed, then the probability of the next word given "he went" is 1 for "to" and 0 for all other words. Back-off would fall back to the bigram "went xxx" when the trigram context was never seen at all.
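
The same back-off idea applies to NLTK taggers: a tagger can delegate to a simpler tagger when it has no answer for a given context. A minimal sketch chaining a bigram tagger down to a default tag (the choice of "NN" as the fallback tag is an arbitrary illustration):

from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger
from nltk.corpus import treebank as tb

train = tb.tagged_sents()[:3000]

# Chain: bigram -> unigram -> default tag "NN"
default = DefaultTagger("NN")
unigram = UnigramTagger(train, backoff=default)
bigram = BigramTagger(train, backoff=unigram)

print(bigram.tag(["He", "went", "to", "school"]))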

Conclusion

UnigramTagger is a useful NLTK tool for training a tagger that uses just a single word as context to determine the part of speech of each word in a sentence. UnigramTagger is made available in the NLTK toolkit and uses NgramTagger as its parent class.
