Training Tokenizer & Filtering Stopwords



Why train your own sentence tokenizer?

This is a very important question: if NLTK already provides a default sentence tokenizer, why do we need to train our own? The answer lies in the nature of NLTK’s default sentence tokenizer, which is a general-purpose tokenizer. Although it works very well, it may not be a good choice for nonstandard text, which our text perhaps is, or for text that has unique formatting. To tokenize such text and get the best results, we should train our own sentence tokenizer.

Implementation Example

For this example, we will be using the webtext corpus. The text file that we are going to use from this corpus contains text formatted as dialogues, as shown below −

Guy: How old are you?
Hipster girl: You know, I never answer that question. Because to me, it's about
how mature you are, you know? I mean, a fourteen year old could be more mature
than a twenty-five year old, right? I'm sorry, I just never answer that question.
Guy: But, uh, you're older than eighteen, right?
Hipster girl: Oh, yeah.

We have saved this text file with the name training_tokenizer.txt. NLTK provides a class named PunktSentenceTokenizer with the help of which we can train on raw text to produce a custom sentence tokenizer. We can get raw text either by reading in a file or from an NLTK corpus using the raw() method, as illustrated in the sketch below.
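Here is a minimal sketch of those two options. The local file name below is only a placeholder for wherever you saved the dialogue text; 'overheard.txt' is the dialogue file that ships with NLTK’s webtext corpus −

# Option 1: read raw text directly from a local file (placeholder name).
with open('training_tokenizer.txt', encoding='utf8') as f:
   raw_from_file = f.read()

# Option 2: read raw text from an NLTK corpus using its raw() method.
from nltk.corpus import webtext
raw_from_corpus = webtext.raw('overheard.txt')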

Let us see the example below to get more insight into it −

First, import the PunktSentenceTokenizer class from the nltk.tokenize package −

from nltk.tokenize import PunktSentenceTokenizer

Now, import the webtext corpus from the nltk.corpus package −

from nltk.corpus import webtext

Next, by using the raw() method, get the raw text from the training_tokenizer.txt file as follows −

text = webtext.raw('C://Users/Leekha/training_tokenizer.txt')

Now, create an instance of PunktSentenceTokenizer and print the tokenized sentences from the text file as follows −

sent_tokenizer = PunktSentenceTokenizer(text)
sents_1 = sent_tokenizer.tokenize(text)
print(sents_1[0])

Output

White guy: So, do you have any plans for this evening?

We can inspect a few more sentences in the same way −

print(sents_1[1])

Output

Asian girl: Yeah, being angry!

print(sents_1[670])

Output

Guy: A hundred bucks?

print(sents_1[675])

Output

Girl: But you already have a Big Mac...

Complete implementation example

from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext
text = webtext.raw('C://Users/Leekha/training_tokenizer.txt')
sent_tokenizer = PunktSentenceTokenizer(text)
sents_1 = sent_tokenizer.tokenize(text)
print(sents_1[0])

Output

White guy: So, do you have any plans for this evening?

To understand the difference between NLTK’s default sentence tokenizer and our own trained sentence tokenizer, let us tokenize the same file with the default sentence tokenizer, i.e. sent_tokenize().

from nltk.tokenize import sent_tokenize
from nltk.corpus import webtext
text = webtext.raw('C://Users/Leekha/training_tokenizer.txt')
sents_2 = sent_tokenize(text)

print(sents_2[0])

Output

White guy: So, do you have any plans for this evening?

print(sents_2[675])

Output

Hobo: Y'know what I'd do if I was rich?

The difference in output at index 675 shows why it is useful to train our own sentence tokenizer: the two tokenizers split the same text at different sentence boundaries. We can compare them further with the short sketch below.
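Assuming both of the examples above have been run, the following sketch simply prints how many sentences each tokenizer produces; the exact counts will depend on the text being used −

# Compare how many sentences each tokenizer produces.
print(len(sents_1))   # sentences from our trained PunktSentenceTokenizer
print(len(sents_2))   # sentences from NLTK's default sent_tokenize()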

What are stopwords?

Stopwords are common words that are present in text but do not contribute to the meaning of a sentence. Such words are not at all important for the purpose of information retrieval or natural language processing. The most common stopwords are ‘the’ and ‘a’.

NLTK stopwords corpus

The Natural Language Toolkit comes with a stopwords corpus containing word lists for many languages. Let us understand its usage with the help of the following example −

First, import the stopwords corpus from the nltk.corpus package −

from nltk.corpus import stopwords

Now, we will be using the stopwords for the English language −

english_stops = set(stopwords.words('english'))
words = ['I', 'am', 'a', 'writer']
[word for word in words if word not in english_stops]

Output

['I', 'writer']

Complete implementation example

from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))
words = ['I', 'am', 'a', 'writer']
[word for word in words if word not in english_stops]

Output

['I', 'writer']
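Note that ‘I’ survives the filter because the English stopword list is all lowercase. If we want case-insensitive filtering, a minimal sketch (the variable names here are ours) is to lowercase each word before checking it −

from nltk.corpus import stopwords

english_stops = set(stopwords.words('english'))
words = ['I', 'am', 'a', 'writer']

# Lowercase each word before testing membership in the stopword set.
print([word for word in words if word.lower() not in english_stops])

With this change, ‘I’ is also removed and only ['writer'] remains.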

Finding the complete list of supported languages

With the help of the following Python script, we can find the complete list of languages supported by the NLTK stopwords corpus −

from nltk.corpus import stopwords
stopwords.fileids()

Output

[
   'arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french',
   'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali',
   'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish',
   'swedish', 'tajik', 'turkish'
]
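Any of these file ids can be passed to stopwords.words() to load the corresponding word list. As a quick illustration, the following sketch (the sample German words are ours, not from the corpus) loads the German stopword list and filters a small word list, which should leave only 'Buch' −

from nltk.corpus import stopwords

# Load the German stopword list and filter a small sample of words.
german_stops = set(stopwords.words('german'))
words = ['das', 'ist', 'ein', 'Buch']
print([word for word in words if word.lower() not in german_stops])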