Training a tokenizer and filtering stop words in a sentence
Introduction
In NLP, tokenizing text into sentences is a crucial preprocessing task. Sentence tokenization is the process of breaking a text corpus into individual sentences. The default tokenizer in NLTK does a good job on ordinary text, but it fails when the text contains non-standard punctuation, symbols, or formatting. In such cases, we need to train a tokenizer.
In this article, let us explore how to train a tokenizer and also see how to filter out stop words.
Tokenizing a Sentence in NLP
The default tokenizer in NLTK can be used on the text sample given below.
Ram − Where have gone last Sunday?
Mohan − I went to see the Taj Mahal.
Ram − Where is the Taj Mahal located
Mohan − It is located in Agra. It is considered to be one of the wonders of the world.
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

textual_data = """
Ram : Where have gone last Sunday?
Mohan : I went to see the Taj Mahal.
Ram: Where is the Taj Mahal located


Mohan: It is located in Agra.It is considered to be one of the wonders of the world.
"""

sentences = sent_tokenize(textual_data)
print(sentences[0])
print("\n", sentences)
Output
Ram : Where have gone last Sunday?

['\nRam : Where have gone last Sunday?', 'Mohan : I went to see the Taj Mahal.', 'Ram: Where is the Taj Mahal located \n\n\nMohan: It is located in Agra.It is considered to be one of the wonders of the world.']
The tokenization of the last sentence does not look correct: the tokenizer failed to split the text properly because the text does not follow a normal paragraph structure.
This is a situation where a tokenizer can be trained.
data.txt link : https://drive.google.com/file/d/1bs2eBbSxTSeaAuDlpoDqGB89Ej9HAqPz/view?usp=sharing.
Training a tokenizer
We will use the Punkt sentence tokenizer for this example. It learns sentence boundaries in an unsupervised way from the training text it is given.
import nltk
nltk.download('webtext')
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext

data = webtext.raw('/content/data.txt')
tokenizer_sentence = PunktSentenceTokenizer(data)
sentences = tokenizer_sentence.tokenize(data)
print(sentences[0])
print("\n", sentences)
Output
Ram : Where have gone last Sunday?

['Ram : Where have gone last Sunday?', 'Mohan : I went to see the Taj Mahal.', 'Ram: Where is the Taj Mahal located?', 'Mohan: It is located in Agra.It is considered to be one of the wonders of the world.']
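If the training text is not available as a file, PunktSentenceTokenizer can also be trained directly on an in-memory string. The sketch below reuses the dialogue from the first example as training text; in practice a larger corpus, such as data.txt, gives more reliable sentence boundaries.

from nltk.tokenize import PunktSentenceTokenizer

# Training text held in memory instead of being read from a file
training_text = """
Ram : Where have gone last Sunday?
Mohan : I went to see the Taj Mahal.
Ram: Where is the Taj Mahal located
Mohan: It is located in Agra. It is considered to be one of the wonders of the world.
"""

# Punkt learns sentence boundary statistics from the text it is given
tokenizer = PunktSentenceTokenizer(training_text)
print(tokenizer.tokenize(training_text))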
Filtering Stop words in a sentence
Words that do not add much meaning to a sentence in a text corpus are called stop words. They are generally removed from the text during preprocessing since they are not crucial to the NLP task. The NLTK library ships with collections of stop words for several languages.
Let us see the process of filtering stop words through a code example.
Example sentence: "A new actor is born every generation and is worshipped by many fans"
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords as sw
from nltk.tokenize import word_tokenize

sentence = "A new actor is born every generation and is worshipped by many fans"

stopwords_en = set(sw.words('english'))
word_list = word_tokenize(sentence)
filtered_words = [w for w in word_list if w not in stopwords_en]

print("Words present in the sentence initially : ", word_list)
print("\nWords after stopword removal process : ", filtered_words)
Output
Words present in the sentence initially :  ['A', 'new', 'actor', 'is', 'born', 'every', 'generation', 'and', 'is', 'worshipped', 'by', 'many', 'fans']

Words after stopword removal process :  ['A', 'new', 'actor', 'born', 'every', 'generation', 'worshipped', 'many', 'fans']
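Note that 'A' survives the filter because the NLTK stop word list is all lowercase. A common variation, sketched below, is to lowercase each token before checking it against the stop word list.

from nltk.corpus import stopwords as sw
from nltk.tokenize import word_tokenize

sentence = "A new actor is born every generation and is worshipped by many fans"
stopwords_en = set(sw.words('english'))

# Compare the lowercased form of each token against the stop word list
filtered_words = [w for w in word_tokenize(sentence) if w.lower() not in stopwords_en]
print(filtered_words)  # 'A' is now removed along with the other stop words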
Different types of Tokenization
TF-IDF Tokenization
TF-IDF stands for Term Frequency - Inverse Document Frequency. It is a weighting scheme that uses the frequency of occurrence of words to determine how important a word is to a particular document in a corpus. It has two terms: TF (Term Frequency) and IDF (Inverse Document Frequency).
TF denotes how frequently a term t occurs in a particular document d and is given as

tf(t, d) = (number of times term t occurs in d) / (total number of words in d)

IDF measures how many of the documents in the corpus D contain a particular term t and is given as

idf(t, D) = log(N / number of documents containing term t), where N is the total number of documents in the corpus

TF-IDF is the product of the TF and IDF terms:

tf-idf(t, d, D) = tf(t, d) * idf(t, D)
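These formulas can be followed directly in a few lines of plain Python. The sketch below is only illustrative: the three-document corpus and the chosen term are made up, and the log-based IDF given above is used.

import math

# A tiny made-up corpus of three documents
corpus = [
    "hello how are you",
    "you have called me",
    "how are you",
]
term = "hello"
document = corpus[0].split()

# tf(t, d) = occurrences of t in d / total words in d
tf = document.count(term) / len(document)

# idf(t, D) = log(N / number of documents containing t)
docs_with_term = sum(1 for doc in corpus if term in doc.split())
idf = math.log(len(corpus) / docs_with_term)

# tf-idf(t, d, D) = tf(t, d) * idf(t, D)
print(tf * idf)  # 0.25 * log(3) ≈ 0.2747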
An example of the TF-IDF vectorizer using the Scikit-learn library:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
corpus = ["Hello how are you", "You have called me", "How are you"]
data = vectorizer.fit_transform(corpus)
tokens = vectorizer.get_feature_names_out()

print(tokens)
print(data.shape)
Output
['are' 'called' 'have' 'hello' 'how' 'me' 'you']
(3, 7)
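To inspect the actual weights that TfidfVectorizer assigns, the sparse matrix it returns can be converted to a dense array. This short sketch repeats the example above and prints one row of weights per document (note that scikit-learn's internal IDF formula is smoothed compared to the basic definition given earlier).

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Hello how are you", "You have called me", "How are you"]
vectorizer = TfidfVectorizer()
data = vectorizer.fit_transform(corpus)

# Rows correspond to documents, columns to the tokens below
print(vectorizer.get_feature_names_out())
print(data.toarray().round(2))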
Frequency Counting
This method is used to count the frequency of each word in a document or text in a corpus.
For example, in the given text
The fox was walking in the jungle. It then saw a tiger coming towards it.The fox was terrified on seeing the tiger.
the frequency of each word longer than three characters is found to be
coming: 1
it.The: 1
jungle: 1
seeing: 1
terrified: 1
then: 1
tiger: 2
towards: 1
walking: 1
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

text_data = "The fox was walking in the jungle. It then saw a tiger coming towards it.The fox was terrified on seeing the tiger."

words = word_tokenize(text_data)
print(words)

word_freq = FreqDist(words)

# Keep only words longer than three characters
words_filtered = dict([(i, j) for i, j in word_freq.items() if len(i) > 3])
for k in sorted(words_filtered):
    print("%s: %s" % (k, words_filtered[k]))
Output
['The', 'fox', 'was', 'walking', 'in', 'the', 'jungle', '.', 'It', 'then', 'saw', 'a', 'tiger', 'coming', 'towards', 'it.The', 'fox', 'was', 'terrified', 'on', 'seeing', 'the', 'tiger', '.']
coming: 1
it.The: 1
jungle: 1
seeing: 1
terrified: 1
then: 1
tiger: 2
towards: 1
walking: 1
Rule-Based Tokenization
Rule-based tokenizers break text into tokens using predefined rules. These rules can be regular expression patterns or grammar constraints.
For example, a rule can be used to split text by white spaces or commas.
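As a minimal sketch of such a splitting rule (the sentence used here is made up), regexp_tokenize can be given a pattern that matches the separators themselves by passing gaps=True:

from nltk.tokenize import regexp_tokenize

text = "Jack,and Jill went up the hill"
# gaps=True treats the pattern as a separator: split on commas or whitespace
print(regexp_tokenize(text, r"[,\s]+", gaps=True))
# ['Jack', 'and', 'Jill', 'went', 'up', 'the', 'hill']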
Some tokenizers are also designed for tweets; they use special rules to split words while preserving special characters such as emojis.
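NLTK provides such a tokenizer as TweetTokenizer. The sketch below uses a made-up informal sentence to show this behaviour:

from nltk.tokenize import TweetTokenizer

tweet = "Loved the Taj Mahal 😍 #travel @mohan"
tokenizer = TweetTokenizer()
# The hashtag, the user mention, and the emoji are kept as single tokens
print(tokenizer.tokenize(tweet))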
Following is a code example of a regex rule-based tokenizer.
from nltk.tokenize import regexp_tokenize

data_text = "Jack and Jill went up the hill."
# Match runs of word characters (and apostrophes) as tokens
print(regexp_tokenize(data_text, r"[\w']+"))
Output
['Jack', 'and', 'Jill', 'went', 'up', 'the', 'hill']
Stopwords filter
Stop words are common words that do not add any special meaning to a sentence in the context of NLP and text processing, and they are generally filtered out after tokenization.
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text_data = "Hello how are you.You have called me. How are you"

data = word_tokenize(text_data)
filtered_words = [w for w in data if w not in stopwords.words('english')]
print(filtered_words)
Output
['Hello', '.', 'You', 'called', '.', 'How']
Conclusion
Sentence tokenization and stop word removal are two very common and important NLP text preprocessing steps. For a simple corpus structure, the default sentence tokenizer can be used; for text that does not follow the usual paragraph structure, a custom tokenizer can be trained. Stop words do not contribute to the meaning of a sentence and hence are filtered out during text preprocessing.