Training Unigram Tagger in NLP
Introduction
A single token is called a unigram. A unigram tagger is a type of tagger that uses only one word as context when inferring the part of speech of that word. The NLTK library provides this through the UnigramTagger class, which inherits from NgramTagger.
In this article, let us understand how a unigram tagger is trained in NLP.
Unigram Tagger and its training using NLTK
Working
UnigramTagger inherits (via NgramTagger) from ContextTagger, which defines a context() method taking the same arguments as choose_tag(). For a unigram tagger, the context is simply the word token itself. During training, this single-word context is used to build a model that records the most likely tag for each word; during tagging, the same context is used to look up the best tag.
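The idea behind this single-word context can be sketched in a few lines of plain Python. This is a hypothetical, simplified illustration of what the trained model looks like, not NLTK's actual implementation:

```python
from collections import Counter, defaultdict

# Toy training data in NLTK's list-of-(word, tag) sentence format.
train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

# For each word (the "context"), count how often each tag occurs.
tag_counts = defaultdict(Counter)
for sent in train:
    for word, tag in sent:
        tag_counts[word][tag] += 1

# The model maps each word to its single most frequent tag.
model = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}

def unigram_tag(words):
    # Unknown words get None, just as a plain UnigramTagger would tag them.
    return [(w, model.get(w)) for w in words]

print(unigram_tag(["the", "dog", "sleeps", "quickly"]))
# [('the', 'DT'), ('dog', 'NN'), ('sleeps', 'VBZ'), ('quickly', None)]
```

Because the context is just one word, the model is essentially a dictionary from word to most frequent tag, which is why unigram taggers are fast but cannot disambiguate words with multiple common tags.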
Python Implementation
import nltk
nltk.download('treebank')
from nltk.tag import UnigramTagger
from nltk.corpus import treebank as tb

sentences_trained = tb.tagged_sents()[:4000]
uni_tagger = UnigramTagger(sentences_trained)
print("Sample Sentence : ", tb.sents()[1])
print("Tag sample sentence : ", uni_tagger.tag(tb.sents()[1]))
Output
Sample Sentence :  ['Mr.', 'Vinken', 'is', 'chairman', 'of', 'Elsevier', 'N.V.', ',', 'the', 'Dutch', 'publishing', 'group', '.']
Tag sample sentence :  [('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), ('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'), ('Dutch', 'JJ'), ('publishing', 'NN'), ('group', 'NN'), ('.', '.')]
In the above code example, the UnigramTagger is trained on the first 4000 sentences of the Treebank corpus. Once trained, the same tagger can tag any sentence; here it is applied to sentence 1 of the corpus.
The below code example can be used to test the Unigram Tagger and evaluate it.
import nltk
nltk.download('treebank')
from nltk.tag import UnigramTagger
from nltk.corpus import treebank as tb

sentences_trained = tb.tagged_sents()[:4000]
uni_tagger = UnigramTagger(sentences_trained)
sent_tested = tb.tagged_sents()[3000:]
print("Test score : ", uni_tagger.evaluate(sent_tested))
Output
Test score : 0.96
In the above code example, the unigram tagger is trained on the first 4000 sentences and then evaluated on the sentences from index 3000 onwards. Note that these test sentences overlap with the training data, which inflates the score; a disjoint train/test split would give a more realistic estimate.
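What evaluate() computes can be sketched by hand: it is simply token-level accuracy, the fraction of tokens whose predicted tag matches the gold tag. The sentences below are hypothetical examples, not from Treebank:

```python
# Gold-standard tags versus a tagger's predictions for the same tokens.
gold = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]]
predicted = [[("the", "DT"), ("dog", "NN"), ("barks", "NN")]]

correct = total = 0
for g_sent, p_sent in zip(gold, predicted):
    for (_, g_tag), (_, p_tag) in zip(g_sent, p_sent):
        correct += (g_tag == p_tag)  # True counts as 1
        total += 1

accuracy = correct / total
print(accuracy)  # 2 of 3 tags match -> 0.666...
```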
Smoothing Techniques
In many NLP tasks, we need to build statistical models that, for example, predict the next word from training data or autocomplete sentences. Given the enormous number of possible word combinations, it is essential that predictions be as accurate as possible. Smoothing helps here: it is a method of adjusting the probabilities in a trained model so that it predicts words more accurately, and can even assign sensible probabilities to words not present in the training corpus.
Types of Smoothing
Laplace Smoothing
It is also known as add-one smoothing: we add 1 to each count in the numerator and a constant N to the denominator, so that no probability is ever zero and we never divide by zero.
For example,
Prob_laplace(wi | w(i-1)) = (count(w(i-1) wi) + 1) / (count(w(i-1)) + N)

where N is the number of distinct words in the training corpus. For example, with hypothetical counts for the sentence "He likes coffee":

Prob("He likes coffee") = Prob(He | <S>) * Prob(likes | He) * Prob(coffee | likes) * Prob(<E> | coffee)
= ((1+1) / (4+6)) * ((1+1) / (1+8)) * ((0+1) / (1+5)) * ((1+1) / (4+8))
≈ 0.00123
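Laplace smoothing for a bigram model can be sketched on a hypothetical toy corpus (the counts here are illustrative and unrelated to the numbers above):

```python
from collections import Counter

# Toy corpus with sentence-boundary markers.
corpus = [["<S>", "he", "likes", "coffee", "<E>"],
          ["<S>", "he", "likes", "tea", "<E>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
vocab = set(unigrams)
V = len(vocab)

def p_laplace(prev, word):
    # (count(prev word) + 1) / (count(prev) + V): never zero.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

# An unseen bigram still gets a small non-zero probability.
print(p_laplace("likes", "water") > 0)  # True
# The smoothed probabilities over the vocabulary still sum to 1.
print(round(sum(p_laplace("likes", w) for w in vocab), 10))  # 1.0
```

Adding the vocabulary size V in the denominator is what keeps the distribution normalized after adding 1 to every count.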
Backoff and Interpolation
It involves two processes.
Backoff Process
We begin with the n-gram model
If the n-gram has insufficient observations, we back off to the (n-1)-gram
If observations are still insufficient, we back off further to the (n-2)-gram, and so on
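The backoff steps above can be sketched with hypothetical counts and an arbitrary support threshold; this is a simplified illustration without the discounting a production backoff model would use:

```python
from collections import Counter

# Hypothetical n-gram counts for illustration only.
trigrams = Counter({("he", "went", "to"): 1})
bigrams = Counter({("went", "to"): 3, ("he", "went"): 2})
unigrams = Counter({"to": 5, "went": 3, "he": 4})
total_words = sum(unigrams.values())
MIN_COUNT = 2  # support threshold, chosen arbitrarily for illustration

def backoff_prob(w1, w2, w3):
    # Use the trigram estimate only if it has enough support,
    # otherwise fall back to the bigram, then the unigram estimate.
    if trigrams[(w1, w2, w3)] >= MIN_COUNT:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    if bigrams[(w2, w3)] >= MIN_COUNT:
        return bigrams[(w2, w3)] / unigrams[w2]
    return unigrams[w3] / total_words

# The trigram ("he", "went", "to") occurs only once (< MIN_COUNT),
# so we back off to the bigram ("went", "to"): 3 / 3 = 1.0
print(backoff_prob("he", "went", "to"))
```

NLTK's sequential taggers implement a related idea through the backoff= parameter, e.g. UnigramTagger(train, backoff=DefaultTagger('NN')), which consults the fallback tagger whenever the primary one has no answer.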
Interpolation process
We combine (interpolate) the estimates of different n-gram models
For example, consider the phrase "he went xxx". If the trigram "he went to" has been observed once and is the only observed continuation of "he went", a pure trigram model assigns probability 1 to the word "to" and 0 to every other word. Interpolation mixes in bigram and unigram estimates so that other words still receive some probability.
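Linear interpolation can be sketched with hypothetical counts and weights; the weights (lambdas) are illustrative choices that must sum to 1, and a real model would tune them on held-out data:

```python
# Hypothetical n-gram counts for illustration only.
trigrams = {("he", "went", "to"): 1}
bigrams = {("he", "went"): 1, ("went", "to"): 1}
unigrams = {"he": 2, "went": 2, "to": 2, "home": 1}
total = sum(unigrams.values())

l3, l2, l1 = 0.6, 0.3, 0.1  # interpolation weights, summing to 1

def interp_prob(w1, w2, w3):
    # Mix trigram, bigram, and unigram maximum-likelihood estimates.
    p3 = trigrams.get((w1, w2, w3), 0) / bigrams.get((w1, w2), 1)
    p2 = bigrams.get((w2, w3), 0) / unigrams.get(w2, 1)
    p1 = unigrams.get(w3, 0) / total
    return l3 * p3 + l2 * p2 + l1 * p1

print(interp_prob("he", "went", "to"))        # mixes all three estimates
print(interp_prob("he", "went", "home") > 0)  # True: unigram term rescues it
```

Unlike pure backoff, interpolation always uses all the models at once, so even a word never seen after "he went" keeps a small probability via its unigram count.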
Conclusion
UnigramTagger is a useful NLTK tool for training a tagger that uses just a single word as context to determine the part of speech of each word in a sentence. It is made available in the NLTK toolkit and has NgramTagger as its parent class.