 
Training Unigram Tagger in NLP
Introduction
A single token is called a unigram. A unigram tagger is a tagger that uses only a single word, the word being tagged, as context for inferring its part of speech. The NLTK library provides this as the UnigramTagger class, which inherits from NgramTagger.
In this article let us understand the training process of Unigram Tagger in NLP.
Unigram Tagger and its training using NLTK

Working
- The UnigramTagger inherits from NgramTagger, which in turn inherits from ContextTagger. A context() method is implemented, and it takes the same arguments as the choose_tag() method.
- The context() method returns the word token, which is used to create the model. Once the model is built, the same word token is used to look up the best tag.
- In this way, the UnigramTagger builds a context model that maps each word to its most likely tag.
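The idea behind these steps can be sketched in plain Python without NLTK. This is a simplified illustration, not NLTK's actual implementation: the function names train_unigram and tag_unigram and the toy training data are hypothetical. The model is just a dictionary from each word to the tag it carries most often, and words never seen in training get None.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_unigram(model, tokens):
    """Tag each token by lookup on the word alone; unseen words get None."""
    return [(tok, model.get(tok)) for tok in tokens]

# Hypothetical toy training data, not drawn from Treebank.
toy_train = [
    [("the", "DT"), ("run", "NN"), ("ended", "VBD")],
    [("they", "PRP"), ("run", "VBP")],
    [("we", "PRP"), ("run", "VBP"), ("daily", "RB")],
]

toy_model = train_unigram(toy_train)
toy_tagged = tag_unigram(toy_model, ["the", "run", "started"])
print(toy_tagged)
```

Because the only context is the word itself, "run" always receives its most frequent training tag, even in a sentence where it is used as a noun.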
Python Implementation
import nltk
nltk.download('treebank')
from nltk.tag import UnigramTagger
from nltk.corpus import treebank as tb
# Train on the first 4000 tagged sentences of the Treebank sample
sentences_trained = tb.tagged_sents()[:4000]
uni_tagger = UnigramTagger(sentences_trained)
# Tag an untagged sentence from the same corpus
print("Sample Sentence : ", tb.sents()[1])
print("Tag sample sentence : ", uni_tagger.tag(tb.sents()[1]))
Output
Sample Sentence :  ['Mr.', 'Vinken', 'is', 'chairman', 'of', 'Elsevier', 'N.V.', ',', 'the', 'Dutch', 'publishing', 'group', '.']
Tag sample sentence :  [('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), ('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'), ('Dutch', 'JJ'), ('publishing', 'NN'), ('group', 'NN'), ('.', '.')] 
In the above code example, a Unigram Tagger is first trained on the first 4000 sentences from Treebank. Once trained, the same tagger can be used to tag any sentence; here, the sentence at index 1 is used.
The code example below can be used to evaluate the Unigram Tagger on a set of tagged test sentences.
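from nltk.tag import UnigramTagger
from nltk.corpus import treebank as tb
# Train on the first 4000 tagged sentences
sentences_trained = tb.tagged_sents()[:4000]
uni_tagger = UnigramTagger(sentences_trained)
# Evaluate on the tagged sentences from index 3000 onward
sent_tested = tb.tagged_sents()[3000:]
# evaluate() returns token-level accuracy; newer NLTK releases prefer accuracy()
print("Test score : ", uni_tagger.evaluate(sent_tested))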
Output
Test score : 0.96
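In the above code example, the unigram tagger is trained on the first 4000 sentences and then evaluated on the sentences from index 3000 onward. Because these test sentences overlap with the training sentences, the reported score is optimistic compared with an evaluation on unseen data.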
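Since the training slice [:4000] and the test slice [3000:] share sentences, it may help to see how much a held-out split can change the score. The following is a minimal sketch in plain Python, assuming a hypothetical five-sentence dataset and a hypothetical dictionary-based unigram model; it is not the NLTK evaluation code. Accuracy is computed as the fraction of tokens whose predicted tag matches the gold tag.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def accuracy(model, gold_sents):
    """Fraction of tokens whose looked-up tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        for word, gold_tag in sent:
            total += 1
            if model.get(word) == gold_tag:
                correct += 1
    return correct / total if total else 0.0

# Hypothetical toy data, used only to illustrate the split.
data = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("barks", "VBZ")],
    [("a", "DT"), ("bird", "NN"), ("sings", "VBZ")],
]

train, held_out = data[:3], data[3:]   # disjoint slices
model = train_unigram(train)

seen_score = accuracy(model, train)        # scored on sentences the model was trained on
held_out_score = accuracy(model, held_out)  # scored on sentences it has never seen
print("Score on training sentences :", seen_score)
print("Score on held-out sentences :", held_out_score)
```

The held-out score is lower because words such as "bird" and "sings" never appeared in training, which is exactly the effect an overlapping test set hides.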
Smoothing Techniques
In NLP we often need to build statistical models, for example models that predict the next word from training data or autocomplete sentences. Given the enormous number of possible word combinations, many valid sequences never appear in the training corpus, and an unsmoothed model assigns them zero probability. Smoothing adjusts the probabilities in the trained model so that it can predict words more accurately, including appropriate words not present in the training corpus.
Types of Smoothing
Laplace Smoothing
It is also known as add-one smoothing: we add 1 to every count in the numerator and add the vocabulary size to the denominator, so that no probability is 0 and we never divide by 0.
For example,
P_Laplace(w_i | w_(i-1)) = (count(w_(i-1) w_i) + 1) / (count(w_(i-1)) + N)
N = number of distinct words (the vocabulary size) in the training corpus
Prob("He likes coffee")
= Prob(He | <S>) * Prob(likes | He) * Prob(coffee | likes) * Prob(<E> | coffee)
= ((1+1) / (4+6)) * ((1+1) / (1+8)) * ((0+1) / (1+5)) * ((1+1) / (4+8))
= 0.00123
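To make the add-one computation concrete, the sketch below applies it to a hypothetical three-sentence corpus. The corpus, the function names laplace and sentence_prob, and the choice to count the <S> and <E> markers as vocabulary items are all assumptions of this sketch. Here the value added to each denominator is the number of distinct tokens, so the smoothed probabilities for any given previous word sum to 1.

```python
from collections import Counter

# Hypothetical toy corpus with sentence start and end markers.
corpus = [
    ["<S>", "he", "likes", "coffee", "<E>"],
    ["<S>", "he", "likes", "tea", "<E>"],
    ["<S>", "she", "likes", "coffee", "<E>"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)
vocab_size = len(unigrams)  # distinct tokens, markers included (assumption)

def laplace(word, prev):
    """Add-one estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sentence_prob(tokens):
    """Product of add-one bigram probabilities over a marked-up sentence."""
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= laplace(word, prev)
    return prob

seen = sentence_prob(["<S>", "he", "likes", "coffee", "<E>"])
unseen_bigram = laplace("tea", "she")  # "she tea" never occurs in the corpus
print("P(he likes coffee) =", seen)
print("P(tea | she)       =", unseen_bigram)
```

The bigram "she tea" has a count of 0, yet it still receives a small nonzero probability, which is the purpose of the smoothing.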
Backoff and Interpolation
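These are two related ways of combining n-gram models of different orders.
Backoff process
- We begin with the n-gram.
- If observations are insufficient, we check the (n-1)-gram.
- If observations are still insufficient, we use the (n-2)-gram, and so on.
Interpolation process
- We use a weighted combination of different n-gram models.
For example, consider the sentence "he went xxx". If the trigram "he went to" has occurred once in the training data, then under the trigram model alone the probability of the word following "he went" is 1 if the word is "to" and 0 for all other words. Backoff or interpolation brings in bigram and unigram evidence so that other words can still receive a nonzero probability.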
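The "he went" example can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the training text, the function names, and the interpolation weights (0.6, 0.3, 0.1) are hypothetical, and the backoff shown here does no discounting, so unlike Katz backoff its scores are not guaranteed to form a normalized probability distribution.

```python
from collections import Counter

# Hypothetical training text in which "he went to" occurs exactly once.
tokens = "he went to school and she went home and she went to work".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
total = len(tokens)

def p_unigram(w):
    return uni[w] / total

def p_bigram(w, prev):
    return bi[(prev, w)] / uni[prev] if uni[prev] else 0.0

def p_trigram(w, prev2, prev1):
    context = bi[(prev2, prev1)]
    return tri[(prev2, prev1, w)] / context if context else 0.0

def backoff(w, prev2, prev1):
    """Use the highest-order n-gram that was observed (no discounting)."""
    if tri[(prev2, prev1, w)] > 0:
        return p_trigram(w, prev2, prev1)
    if bi[(prev1, w)] > 0:
        return p_bigram(w, prev1)
    return p_unigram(w)

def interpolate(w, prev2, prev1, weights=(0.6, 0.3, 0.1)):
    """Weighted mix of trigram, bigram and unigram estimates."""
    w3, w2, w1 = weights
    return (w3 * p_trigram(w, prev2, prev1)
            + w2 * p_bigram(w, prev1)
            + w1 * p_unigram(w))

tri_to = p_trigram("to", "he", "went")      # only continuation seen after "he went"
tri_home = p_trigram("home", "he", "went")  # never seen after "he went"
backoff_home = backoff("home", "he", "went")
interp_home = interpolate("home", "he", "went")
print(tri_to, tri_home, backoff_home, interp_home)
```

With the trigram model alone, "home" after "he went" gets probability 0; backoff falls through to the bigram "went home", and interpolation blends all three estimates, so both give it a nonzero score.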
Conclusion
UnigramTagger is a useful NLTK tool for training a tagger that uses just a single word as context to determine the part of speech of each word in a sentence. UnigramTagger is available in the NLTK toolkit and has NgramTagger as its parent class.
