Combining taggers or chaining taggers with each other is one of the important features of NLTK. The main concept behind combining taggers is that, in case if one tagger doesn’t know how to tag a word, it would be passed to the chained tagger. To achieve this purpose, SequentialBackoffTagger provides us the Backoff tagging feature.
As told earlier, backoff tagging is one of the important features of SequentialBackoffTagger, which allows us to combine taggers in a way that if one tagger doesn’t know how to tag a word, the word would be passed to the next tagger and so on until there are no backoff taggers left to check.
Actually, every subclass of SequentialBackoffTagger can take a ‘backoff’ keyword argument. The value of this keyword argument is another instance of a SequentialBackoffTagger. Now whenever this SequentialBackoffTagger class is initialized, an internal list of backoff taggers (with itself as the first element) will be created. Moreover, if a backoff tagger is given, the internal list of this backoff taggers would be appended.
In the example below, we are taking DefaulTagger as the backoff tagger in the above Python recipe with which we have trained the UnigramTagger.
In this example, we are using DefaulTagger as the backoff tagger. Whenever the UnigramTagger is unable to tag a word, backoff tagger, i.e. DefaulTagger, in our case, will tag it with ‘NN’.
from nltk.tag import UnigramTagger from nltk.tag import DefaultTagger from nltk.corpus import treebank train_sentences = treebank.tagged_sents()[:2500] back_tagger = DefaultTagger('NN') Uni_tagger = UnigramTagger(train_sentences, backoff = back_tagger) test_sentences = treebank.tagged_sents()[1500:] Uni_tagger.evaluate(test_sentences)
From the above output, you can observe that by adding a backoff tagger the accuracy is increased by around 2%.
As we have seen that training a tagger is very cumbersome and also takes time. To save time, we can pickle a trained tagger for using it later. In the example below, we are going to do this to our already trained tagger named ‘Uni_tagger’.
import pickle f = open('Uni_tagger.pickle','wb') pickle.dump(Uni_tagger, f) f.close() f = open('Uni_tagger.pickle','rb') Uni_tagger = pickle.load(f)
From the hierarchy diagram discussed in previous unit, UnigramTagger is inherited from NgarmTagger class but we have two more subclasses of NgarmTagger class −
Actually an ngram is a subsequence of n items, hence, as name implies, BigramTagger subclass looks at the two items. First item is the previous tagged word and the second item is current tagged word.
On the same note of BigramTagger, TrigramTagger subclass looks at the three items i.e. two previous tagged words and one current tagged word.
Practically if we apply BigramTagger and TrigramTagger subclasses individually as we did with UnigramTagger subclass, they both perform very poorly. Let us see in the examples below:
from nltk.tag import BigramTagger from nltk.corpus import treebank train_sentences = treebank.tagged_sents()[:2500] Bi_tagger = BigramTagger(train_sentences) test_sentences = treebank.tagged_sents()[1500:] Bi_tagger.evaluate(test_sentences)
from nltk.tag import TrigramTagger from nltk.corpus import treebank train_sentences = treebank.tagged_sents()[:2500] Tri_tagger = TrigramTagger(train_sentences) test_sentences = treebank.tagged_sents()[1500:] Tri_tagger.evaluate(test_sentences)
You can compare the performance of UnigramTagger, we used previously (gave around 89% accuracy) with BigramTagger (gave around 44% accuracy) and TrigramTagger (gave around 41% accuracy). The reason is that Bigram and Trigram taggers cannot learn context from the first word(s) in a sentence. On the other hand, UnigramTagger class doesn’t care about the previous context and guesses the most common tag for each word, hence able to have high baseline accuracy.
As from the above examples, it is obvious that Bigram and Trigram taggers can contribute when we combine them with backoff tagging. In the example below, we are combining Unigram, Bigram and Trigram taggers with backoff tagging. The concept is same as the previous recipe while combining the UnigramTagger with backoff tagger. The only difference is that we are using the function named backoff_tagger() from tagger_util.py, given below, for backoff operation.
def backoff_tagger(train_sentences, tagger_classes, backoff=None): for cls in tagger_classes: backoff = cls(train_sentences, backoff=backoff) return backoff
from tagger_util import backoff_tagger from nltk.tag import UnigramTagger from nltk.tag import BigramTagger from nltk.tag import TrigramTagger from nltk.tag import DefaultTagger from nltk.corpus import treebank train_sentences = treebank.tagged_sents()[:2500] back_tagger = DefaultTagger('NN') Combine_tagger = backoff_tagger(train_sentences, [UnigramTagger, BigramTagger, TrigramTagger], backoff = back_tagger) test_sentences = treebank.tagged_sents()[1500:] Combine_tagger.evaluate(test_sentences)
From the above output, we can see it increases the accuracy by around 3%.