A natural question is: if NLTK already has a default sentence tokenizer, why do we need to train one? The answer lies in what the default tokenizer is designed for. NLTK’s default tokenizer is a general-purpose tokenizer. It works well on standard prose, but it may not be a good choice for nonstandard text or for text with unusual formatting. To tokenize such text and get the best results, we should train our own sentence tokenizer.
For this example, we will use the webtext corpus. The text file we are going to use from this corpus is formatted as dialogs, as shown below −
Guy: How old are you?
Hipster girl: You know, I never answer that question. Because to me, it's about how mature you are, you know? I mean, a fourteen year old could be more mature than a twenty-five year old, right? I'm sorry, I just never answer that question.
Guy: But, uh, you're older than eighteen, right?
Hipster girl: Oh, yeah.
We have saved this text file under the name training_tokenizer. NLTK provides a class named PunktSentenceTokenizer, which can be trained on raw text to produce a custom sentence tokenizer. We can get the raw text either by reading in a file or from an NLTK corpus using the raw() method.
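As a minimal sketch of the "reading in a file" option, the snippet below reads raw text with Python's built-in open() and trains a PunktSentenceTokenizer on it. It assumes NLTK is installed. The helper name train_from_file, the temporary file, and the short sample text (taken from the dialog shown above) are illustrative stand-ins, not part of the original example.

```python
import os
import tempfile

from nltk.tokenize import PunktSentenceTokenizer

# Short dialog-style sample standing in for the real training file.
SAMPLE = (
    "Guy: How old are you?\n"
    "Hipster girl: You know, I never answer that question. "
    "Because to me, it's about how mature you are, you know?\n"
    "Guy: But, uh, you're older than eighteen, right?\n"
    "Hipster girl: Oh, yeah.\n"
)


def train_from_file(path, encoding="utf-8"):
    """Read raw text from a file and train a Punkt tokenizer on it.

    Hypothetical helper; returns the trained tokenizer and the raw text.
    """
    with open(path, encoding=encoding) as handle:
        raw_text = handle.read()
    # Passing text to the constructor trains the tokenizer on that text.
    return PunktSentenceTokenizer(raw_text), raw_text


# Write the sample to a temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "training_tokenizer.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write(SAMPLE)
    tokenizer, raw_text = train_from_file(sample_path)

sentences = tokenizer.tokenize(raw_text)
for sentence in sentences:
    print(repr(sentence))
```

A sample this small is only enough to show the mechanics; a tokenizer trained on so little text should not be expected to split sentences reliably.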
Let us see the example below to get more insight into it −
First, import the PunktSentenceTokenizer class from the nltk.tokenize package −
from nltk.tokenize import PunktSentenceTokenizer
Now, import the webtext corpus from the nltk.corpus package −
from nltk.corpus import webtext
Next, using the raw() method, get the raw text from the training_tokenizer.txt file as follows −
text = webtext.raw('C://Users/Leekha/training_tokenizer.txt')
Now, create an instance of PunktSentenceTokenizer, passing it the raw text so that it is trained on that text, and print a tokenized sentence from the text file as follows −
sent_tokenizer = PunktSentenceTokenizer(text)
sents_1 = sent_tokenizer.tokenize(text)
print(sents_1[0])
Output:
White guy: So, do you have any plans for this evening?
Printing other elements of sents_1 gives sentences such as the following −
Asian girl: Yeah, being angry!
Guy: A hundred bucks?
Girl: But you already have a Big Mac...
The complete example is as follows −
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext
text = webtext.raw('C://Users/Leekha/training_tokenizer.txt')
sent_tokenizer = PunktSentenceTokenizer(text)
sents_1 = sent_tokenizer.tokenize(text)
print(sents_1[0])
Output:
White guy: So, do you have any plans for this evening?
To understand the difference between NLTK’s default sentence tokenizer and our own trained sentence tokenizer, let us tokenize the same file with the default sentence tokenizer, i.e. sent_tokenize() −
from nltk.tokenize import sent_tokenize
from nltk.corpus import webtext
text = webtext.raw('C://Users/Leekha/training_tokenizer.txt')
sents_2 = sent_tokenize(text)
print(sents_2[0])
Output:
White guy: So, do you have any plans for this evening?
Printing an element from further into sents_2 gives a sentence such as the following −
Hobo: Y'know what I'd do if I was rich?
Comparing the output of the two tokenizers shows why it can be useful to train our own sentence tokenizer on text with unusual formatting.
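One way to find where two tokenizations diverge is to walk both sentence lists in parallel and report the first position at which they differ. The sketch below is a hypothetical helper, not part of NLTK. The two short lists are hand-written stand-ins for sents_1 and sents_2, not real tokenizer output, and they illustrate one tokenizer splitting at a line break that the other runs together.

```python
from itertools import zip_longest


def first_disagreement(sents_a, sents_b):
    """Return (index, a, b) for the first position where the lists differ.

    Returns None when both lists are identical. If one list is shorter,
    its missing entry is reported as None.
    """
    for index, (a, b) in enumerate(zip_longest(sents_a, sents_b)):
        if a != b:
            return index, a, b
    return None


# Hand-written stand-ins for the output of two different tokenizers.
custom = [
    "Guy: A hundred bucks?",
    "Girl: But you already have a Big Mac...",
    "Hobo: Y'know what I'd do if I was rich?",
]
default = [
    "Guy: A hundred bucks?",
    "Girl: But you already have a Big Mac...\nHobo: Y'know what I'd do if I was rich?",
]

result = first_disagreement(custom, default)
print(result)
```

With real data, the same helper could be called as first_disagreement(sents_1, sents_2) to jump straight to the first sentence on which the two tokenizers disagree.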
Stopwords are common words that are present in a text but contribute little to the meaning of a sentence. Such words are usually of little importance for information retrieval or natural language processing. Typical examples of stopwords are ‘the’ and ‘a’.
The Natural Language Toolkit (NLTK) comes with a stopwords corpus containing word lists for many languages. Let us understand its usage with the help of the following example −
First, import the stopwords corpus from the nltk.corpus package −
from nltk.corpus import stopwords
Now, we will use the stopwords for the English language −
english_stops = set(stopwords.words('english'))
words = ['I', 'am', 'a', 'writer']
[word for word in words if word not in english_stops]
The complete example is as follows −
from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))
words = ['I', 'am', 'a', 'writer']
[word for word in words if word not in english_stops]
Output:
['I', 'writer']
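The English word list in the stopwords corpus is lowercase, so a capitalized word such as 'I' is not matched by the filter above. The sketch below shows one possible way to handle this by lowercasing each word before the lookup. To keep it runnable without the corpus data, it uses SAMPLE_STOPS, a tiny hypothetical subset standing in for set(stopwords.words('english')); remove_stopwords is likewise an illustrative helper, not an NLTK function.

```python
# Tiny hypothetical subset standing in for set(stopwords.words('english')).
SAMPLE_STOPS = {"i", "am", "a", "the", "is"}


def remove_stopwords(words, stops):
    """Drop words whose lowercase form is in stops, keeping order and casing."""
    return [word for word in words if word.lower() not in stops]


words = ['I', 'am', 'a', 'writer']

# Case-sensitive lookup: 'I' is kept because only 'i' is in the set.
case_sensitive = [word for word in words if word not in SAMPLE_STOPS]

# Case-insensitive lookup: 'I' is lowercased before the membership test.
case_insensitive = remove_stopwords(words, SAMPLE_STOPS)

print(case_sensitive)    # ['I', 'writer']
print(case_insensitive)  # ['writer']
```

Whether to lowercase depends on the task: for some applications the capitalized form carries information that is worth keeping.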
With the help of the following Python script, we can also find the complete list of languages supported by the NLTK stopwords corpus (the exact list depends on the installed version of the corpus) −
from nltk.corpus import stopwords
stopwords.fileids()
[ 'arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish' ]