Natural Language Toolkit - Combining Taggers



Combining Taggers

Combining taggers or chaining taggers with each other is one of the important features of NLTK. The main concept behind combining taggers is that, in case if one tagger doesn’t know how to tag a word, it would be passed to the chained tagger. To achieve this purpose, SequentialBackoffTagger provides us the Backoff tagging feature.

Backoff Tagging

As told earlier, backoff tagging is one of the important features of SequentialBackoffTagger, which allows us to combine taggers in a way that if one tagger doesn’t know how to tag a word, the word would be passed to the next tagger and so on until there are no backoff taggers left to check.

How does it work?

Actually, every subclass of SequentialBackoffTagger can take a ‘backoff’ keyword argument. The value of this keyword argument is another instance of a SequentialBackoffTagger. Now whenever this SequentialBackoffTagger class is initialized, an internal list of backoff taggers (with itself as the first element) will be created. Moreover, if a backoff tagger is given, the internal list of this backoff taggers would be appended.

In the example below, we are taking DefaulTagger as the backoff tagger in the above Python recipe with which we have trained the UnigramTagger.

Example

In this example, we are using DefaulTagger as the backoff tagger. Whenever the UnigramTagger is unable to tag a word, backoff tagger, i.e. DefaulTagger, in our case, will tag it with ‘NN’.

from nltk.tag import UnigramTagger
from nltk.tag import DefaultTagger
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:2500]
back_tagger = DefaultTagger('NN')
Uni_tagger = UnigramTagger(train_sentences, backoff = back_tagger)
test_sentences = treebank.tagged_sents()[1500:]
Uni_tagger.evaluate(test_sentences)

Output

0.9061975746536931

From the above output, you can observe that by adding a backoff tagger the accuracy is increased by around 2%.

Saving taggers with pickle

As we have seen that training a tagger is very cumbersome and also takes time. To save time, we can pickle a trained tagger for using it later. In the example below, we are going to do this to our already trained tagger named ‘Uni_tagger’.

Example

import pickle
f = open('Uni_tagger.pickle','wb')
pickle.dump(Uni_tagger, f)
f.close()
f = open('Uni_tagger.pickle','rb')
Uni_tagger = pickle.load(f)

NgramTagger Class

From the hierarchy diagram discussed in previous unit, UnigramTagger is inherited from NgarmTagger class but we have two more subclasses of NgarmTagger class −

BigramTagger subclass

Actually an ngram is a subsequence of n items, hence, as name implies, BigramTagger subclass looks at the two items. First item is the previous tagged word and the second item is current tagged word.

TrigramTagger subclass

On the same note of BigramTagger, TrigramTagger subclass looks at the three items i.e. two previous tagged words and one current tagged word.

Practically if we apply BigramTagger and TrigramTagger subclasses individually as we did with UnigramTagger subclass, they both perform very poorly. Let us see in the examples below:

Using BigramTagger Subclass

from nltk.tag import BigramTagger
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:2500]
Bi_tagger = BigramTagger(train_sentences)
test_sentences = treebank.tagged_sents()[1500:]
Bi_tagger.evaluate(test_sentences)

Output

0.44669191071913594

Using TrigramTagger Subclass

from nltk.tag import TrigramTagger
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:2500]
Tri_tagger = TrigramTagger(train_sentences)
test_sentences = treebank.tagged_sents()[1500:]
Tri_tagger.evaluate(test_sentences)

Output

0.41949863394526193

You can compare the performance of UnigramTagger, we used previously (gave around 89% accuracy) with BigramTagger (gave around 44% accuracy) and TrigramTagger (gave around 41% accuracy). The reason is that Bigram and Trigram taggers cannot learn context from the first word(s) in a sentence. On the other hand, UnigramTagger class doesn’t care about the previous context and guesses the most common tag for each word, hence able to have high baseline accuracy.

Combining ngram taggers

As from the above examples, it is obvious that Bigram and Trigram taggers can contribute when we combine them with backoff tagging. In the example below, we are combining Unigram, Bigram and Trigram taggers with backoff tagging. The concept is same as the previous recipe while combining the UnigramTagger with backoff tagger. The only difference is that we are using the function named backoff_tagger() from tagger_util.py, given below, for backoff operation.

def backoff_tagger(train_sentences, tagger_classes, backoff=None):
   for cls in tagger_classes:
      backoff = cls(train_sentences, backoff=backoff)
   return backoff

Example

from tagger_util import backoff_tagger
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger
from nltk.tag import DefaultTagger
from nltk.corpus import treebank
train_sentences = treebank.tagged_sents()[:2500]
back_tagger = DefaultTagger('NN')
Combine_tagger = backoff_tagger(train_sentences,
[UnigramTagger, BigramTagger, TrigramTagger], backoff = back_tagger)
test_sentences = treebank.tagged_sents()[1500:]
Combine_tagger.evaluate(test_sentences)

Output

0.9234530029238365

From the above output, we can see it increases the accuracy by around 3%.

Advertisements