Training a tokenizer and filtering stop words in a sentence


Introduction

In NLP, tokenizing text into sentences is a crucial preprocessing task. Sentence tokenization is the process of breaking a text corpus into individual sentences. The default tokenizer in NLTK does a good job on standard text, but it can fail when the text contains non-standard punctuation, symbols, or formatting. In such cases, we need to train a tokenizer.

In this article, let us explore how to train a tokenizer and also see how to filter out stop words.

Tokenizing a Sentence in NLP

The default tokenizer in NLTK can be used on the text sample given below.

Ram − Where have gone last Sunday?

Mohan − I went to see the Taj Mahal.

Ram − Where is the Taj Mahal located

Mohan − It is located in Agra. It is considered to be one of the wonders of the world.

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

textual_data = """
Ram : Where have gone last Sunday?
Mohan : I went to see the Taj Mahal.
Ram: Where is the Taj Mahal located                  

Mohan: It is located in Agra.It is considered to be one of the wonders of the world.
"""
sentences = sent_tokenize(textual_data)
  
print(sentences[0])
print("\n",sentences)

Output

Ram : Where have gone last Sunday?

 ['\nRam : Where have gone last Sunday?', 'Mohan : I went to see the Taj Mahal.', 'Ram: Where is the Taj Mahal located                  \n\n\nMohan: It is located in Agra.It is considered to be one of the wonders of the world.']

The output does not look correct: the tokenizer failed to split the last part into separate sentences because the text does not follow a normal paragraph structure.

This is a situation where a tokenizer can be trained.

data.txt link: https://drive.google.com/file/d/1bs2eBbSxTSeaAuDlpoDqGB89Ej9HAqPz/view?usp=sharing

Training a tokenizer

We will use the Punkt Sentence Tokenizer for this example.

import nltk
nltk.download('webtext')

from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext

# Read the raw training text (the conversation saved in data.txt)
data = webtext.raw('/content/data.txt')

# Train a Punkt tokenizer on the text and use it to split that same text into sentences
tokenizer_sentence = PunktSentenceTokenizer(data)
sentences = tokenizer_sentence.tokenize(data)

print(sentences[0])
print("\n", sentences)

Output

Ram : Where have gone last Sunday?

 ['Ram : Where have gone last Sunday?', 'Mohan : I went to see the Taj Mahal.', 'Ram: Where is the Taj Mahal located?', 'Mohan: It is located in Agra.It is considered to be one of the wonders of the world.']
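Once trained, the tokenizer can be reused on other text with a similar conversational structure. As a minimal sketch (assuming the tokenizer_sentence object from the code above, and a hypothetical file name), the trained tokenizer can be saved with Python's pickle module and loaded back later:

import pickle

# Save the trained tokenizer to disk (hypothetical file name)
with open('trained_punkt.pickle', 'wb') as f:
    pickle.dump(tokenizer_sentence, f)

# Load it back and tokenize new text written in the same conversational style
with open('trained_punkt.pickle', 'rb') as f:
    loaded_tokenizer = pickle.load(f)

print(loaded_tokenizer.tokenize("Ram: Are you coming today? Mohan: Yes, I will be there."))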

Filtering Stop words in a sentence

The words that do not add much meaning to a sentence in a text corpus are called stop words. They are generally removed from the original text during preprocessing since they are not crucial to the NLP task. The NLTK library provides collections of stop words for several languages.

Let us see the process of filtering stop words through a code example.

Example sentence: "A new actor is born every generation and is worshipped by many fans"

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords as sw
from nltk.tokenize import word_tokenize

sentence = "A new actor is born every generation and is worshipped by many fans"

# Build the set of English stop words and split the sentence into words
stopwords_en = set(sw.words('english'))
word_list = word_tokenize(sentence)

# Keep only the words that are not in the stop word set
filtered_words = [w for w in word_list if w not in stopwords_en]
print("Words present in the sentence initially : ", word_list)
print("\nWords after stopword removal process : ", filtered_words)

Output

Words present in the sentence initially :  ['A', 'new', 'actor', 'is', 'born', 'every', 'generation', 'and', 'is', 'worshipped', 'by', 'many', 'fans']

Words after stopword removal process :  ['A', 'new', 'actor', 'born', 'every', 'generation', 'worshipped', 'many', 'fans']
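Note that 'A' remains in the filtered list because the NLTK stop word list is lowercase while the comparison above is case sensitive. A minimal tweak (reusing word_list and stopwords_en from the code above) is to lowercase each token before checking it:

# Compare the lowercased token against the stop word set
filtered_words = [w for w in word_list if w.lower() not in stopwords_en]
print(filtered_words)  # 'A' is now filtered out along with the other stop words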

Different types of Tokenization

TFIDF Tokenization

TF-IDF stands for Term Frequency - Inverse Document Frequency. It is a weighting scheme that uses the frequency of words to determine and weigh how important a word is in a particular document relative to the rest of the corpus. It has two terms: TF (Term Frequency) and IDF (Inverse Document Frequency).

TF denotes how frequently a term occurs in a particular document and is given as

tf(t, d) = (number of occurrences of term t in document d) / (total number of words in d)

IDF measures how rare a term t is across the set of documents D in the corpus (the fewer documents contain t, the higher the IDF) and is denoted as

idf(t, D) = log(N / number of documents containing t), where N is the total number of documents in the corpus

TF-IDF is the product of the TF and IDF terms:

tf-idf(t, d, D) = tf(t, d) * idf(t, D)
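As a small worked sketch of these formulas (using a hypothetical two-document corpus and plain Python, not any particular library's exact weighting):

import math

# Hypothetical corpus of two tiny "documents"
docs = [
    ["the", "fox", "saw", "the", "tiger"],
    ["the", "fox", "ran", "away"],
]

def tf(term, doc):
    # Term frequency: occurrences of the term divided by the document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log of (total documents / documents containing the term)
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("tiger", docs[0], docs))  # 'tiger' occurs in only one document, so it gets a non-zero weight
print(tf_idf("the", docs[0], docs))    # 'the' occurs in every document, so idf = log(1) = 0

Note that real implementations such as Scikit-learn's TfidfVectorizer apply smoothed versions of these formulas, so the exact numbers differ.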

Following is an example of a TF-IDF vectorizer using the Scikit-learn library.

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
corpus = ["Hello how are you","You have called me","How are you"]
data = vectorizer.fit_transform(corpus)
tokens = vectorizer.get_feature_names_out()
print(tokens)
print(data.shape)

Output

['are' 'called' 'have' 'hello' 'how' 'me' 'you']
(3, 7)
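Each of the 3 rows of data corresponds to one document in the corpus and each of the 7 columns to one of the tokens listed above. To inspect the actual TF-IDF weights, the sparse matrix returned by fit_transform can be converted to a dense array:

# Convert the sparse TF-IDF matrix into a dense array of weights
print(data.toarray())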

Frequency Counting

This method is used to count the frequency of each word in a document or text in a corpus.

For example, in the given text

The fox was walking in the jungle. It then saw a tiger coming towards it.The fox was terrified on seeing the tiger.

the word frequencies (for words longer than three characters) are found to be

coming: 1

it.The: 1

jungle: 1

seeing: 1

terrified: 1

then: 1

tiger: 2

towards: 1

walking: 1

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

text_data = "The fox was walking in the jungle. It then saw a tiger coming towards it.The fox was terrified on seeing the tiger."
words = word_tokenize(text_data)
print(words)

# Count how often each token occurs
word_freq = FreqDist(words)

# Keep only the tokens longer than three characters
words_filtered = dict([(i, j) for i, j in word_freq.items() if len(i) > 3])

for k in sorted(words_filtered):
    print("%s: %s" % (k, words_filtered[k]))

Output

['The', 'fox', 'was', 'walking', 'in', 'the', 'jungle', '.', 'It', 'then', 'saw', 'a', 'tiger', 'coming', 'towards', 'it.The', 'fox', 'was', 'terrified', 'on', 'seeing', 'the', 'tiger', '.']
coming: 1
it.The: 1
jungle: 1
seeing: 1
terrified: 1
then: 1
tiger: 2
towards: 1
walking: 1
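FreqDist also provides helper methods such as most_common() to get the highest-frequency tokens directly. For example, reusing the word_freq object from the code above:

# Show the three most frequent tokens (punctuation and stop words included)
print(word_freq.most_common(3))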

Rule Based Tokenization

Rule-based tokenizers break text into tokens using predefined rules. These rules can be regular expression filters or grammar constraints.

For example, a rule can be used to split text by white spaces or commas.

Also, some tokenizers designed for tweets have special rules for splitting words while preserving special characters such as emojis, hashtags, and mentions.
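NLTK's TweetTokenizer is one such rule-based tokenizer. The sketch below (with a made-up tweet) shows how it keeps handles, hashtags, and emoticons together as single tokens:

from nltk.tokenize import TweetTokenizer

tweet = "@Ram the Taj Mahal is amazing!!! #travel :-)"
tweet_tokenizer = TweetTokenizer()
print(tweet_tokenizer.tokenize(tweet))  # the handle, hashtag, and emoticon each stay intact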

Following is a code example of a regex rule-based tokenizer.

from nltk.tokenize import regexp_tokenize
data_text = "Jack and Jill went up the hill."
print(regexp_tokenize(data_text, r"[\w']+"))

Output

['Jack', 'and', 'Jill', 'went', 'up', 'the', 'hill']

Stopwords filter

Stop words are common words that do not add any special meaning to a sentence in the context of NLP and text processing, and they are generally removed from the text before further analysis.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
nltk.download('stopwords')
nltk.download('punkt')

text_data = "Hello how are you.You have called me. How are you"
data = word_tokenize(text_data)

# Remove the tokens that appear in the English stop word list
stopwords_en = set(stopwords.words('english'))
filtered_words = [w for w in data if w not in stopwords_en]
print(filtered_words)

Output

['Hello', '.', 'You', 'called', '.', 'How']
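The punctuation tokens and the capitalized 'You' and 'How' survive the filter because the stop word list contains only lowercase words and no punctuation. A small variation (reusing data and stopwords_en from the code above) lowercases each token and keeps only alphabetic ones:

# Keep only alphabetic tokens whose lowercase form is not a stop word
cleaned = [w for w in data if w.isalpha() and w.lower() not in stopwords_en]
print(cleaned)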

Conclusion

Sentence tokenization and stop word removal are two very common and important NLP text preprocessing steps. For a simple corpus structure, the default sentence tokenizer can be used; however, for text that does not follow the usual paragraph structure, a tokenizer can be trained. Stop words do not contribute to the meaning of a sentence and hence are filtered out during text preprocessing.
