
Removing stop words with NLTK in Python
In NLP (Natural Language Processing), stop words are common words such as "is", "and", and "a" that are filtered out before or after processing text data. These words add little meaning to the text and can be removed to improve processing efficiency.
The Natural Language Toolkit (NLTK) is a Python library that provides an easy-to-use interface and tools for text processing, such as tokenization and stop word removal. In this article, we will explore how to remove stop words using NLTK.
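The core idea can be sketched in plain Python with a small, hand-picked stop word set (the set below is illustrative only; NLTK's English list is much larger):

```python
# A tiny, hand-picked stop word set for illustration only;
# NLTK's actual English list contains well over a hundred words.
stop_words = {"is", "and", "a", "the", "to"}

text = "this is a simple and short sentence"
# Keep only the tokens that are not in the stop word set
filtered = [word for word in text.split() if word not in stop_words]
print(filtered)  # ['this', 'simple', 'short', 'sentence']
```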
NLTK Stop Words
Before using the NLTK stop words, make sure that the nltk package is installed. If it is not, install it with the following command -
pip install nltk
After installation, import the necessary modules and download the stop words corpus.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
Let's dive into the examples to get a better idea of removing stop words with NLTK.
Example 1
In this scenario, we use word_tokenize() to split the sentence into words, then use a list comprehension to filter out the stop words (like "is", "a", etc.).
Let's look at the following example, where we perform basic stop word removal.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

str1 = "Welcome to the TutorialsPoint"
stop_words = set(stopwords.words('english'))
words = word_tokenize(str1)
result = [word for word in words if word.lower() not in stop_words]
print(result)
The output of the above program is as follows -
['Welcome', 'TutorialsPoint']
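Note that word.lower() makes the comparison case-insensitive; without it, capitalized stop words such as "The" would survive, because NLTK's stop word list is lowercase. A pure-Python sketch of the difference (the stop word set here is a stand-in for the NLTK list):

```python
tokens = ['The', 'cat', 'sat', 'on', 'the', 'mat']
stop_words = {'the', 'on'}  # stand-in for NLTK's lowercase English list

# Without lowering, 'The' does not match 'the' and slips through
case_sensitive = [t for t in tokens if t not in stop_words]
# Lowering each token before the lookup catches it
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)    # ['The', 'cat', 'sat', 'mat']
print(case_insensitive)  # ['cat', 'sat', 'mat']
```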
Example 2
In this case, we use a string that contains punctuation. The punctuation marks are retained in the output because they are not part of the stop word list.
Consider the following example, where we include punctuation in the string and observe the output.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

str1 = "Hello,! have a nice day..!"
stop_words = set(stopwords.words('english'))
words = word_tokenize(str1)
result = [word for word in words if word.lower() not in stop_words]
print(result)
The output of the above program is as follows -
['Hello', ',', '!', 'nice', 'day', '..', '!']
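If the punctuation tokens are unwanted as well, one common approach is to keep only alphabetic tokens with str.isalpha(). A minimal sketch, applied to a token list like the one produced above (the stop word set is a stand-in for NLTK's English list):

```python
# Token list as word_tokenize() would produce it for "Hello,! have a nice day..!"
tokens = ['Hello', ',', '!', 'have', 'a', 'nice', 'day', '..', '!']
stop_words = {'have', 'a'}  # stand-in for NLTK's English stop word list

# Drop stop words and any token that is not purely alphabetic
result = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]
print(result)  # ['Hello', 'nice', 'day']
```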