 
- Natural Language Toolkit - Home
- Natural Language Toolkit - Introduction
- Natural Language Toolkit - Getting Started
- Natural Language Toolkit - Tokenizing Text
- Training Tokenizer & Filtering Stopwords
- Looking up words in Wordnet
- Stemming & Lemmatization
- Natural Language Toolkit - Word Replacement
- Synonym & Antonym Replacement
- Corpus Readers and Custom Corpora
- Basics of Part-of-Speech (POS) Tagging
- Natural Language Toolkit - Unigram Tagger
- Natural Language Toolkit - Combining Taggers
- Natural Language Toolkit - More NLTK Taggers
- Natural Language Toolkit - Parsing
- Chunking & Information Extraction
- Natural Language Toolkit - Transforming Chunks
- Natural Language Toolkit - Transforming Trees
- Natural Language Toolkit - Text Classification
- Natural Language Toolkit Resources
- Natural Language Toolkit - Quick Guide
- Natural Language Toolkit - Useful Resources
- Natural Language Toolkit - Discussion
Stemming & Lemmatization
What is Stemming?
Stemming is a technique used to extract the base form of the words by removing affixes from them. It is just like cutting down the branches of a tree to its stems. For example, the stem of the words eating, eats, eaten is eat.
Search engines use stemming for indexing the words. Thats why rather than storing all forms of a word, a search engine can store only the stems. In this way, stemming reduces the size of the index and increases retrieval accuracy.
Various Stemming algorithms
In NLTK, stemmerI, which have stem() method, interface has all the stemmers which we are going to cover next. Let us understand it with the following diagram
 
Porter stemming algorithm
It is one of the most common stemming algorithms which is basically designed to remove and replace well-known suffixes of English words.
PorterStemmer class
NLTK has PorterStemmer class with the help of which we can easily implement Porter Stemmer algorithms for the word we want to stem. This class knows several regular word forms and suffixes with the help of which it can transform the input word to a final stem. The resulting stem is often a shorter word having the same root meaning. Let us see an example −
First, we need to import the natural language toolkit(nltk).
import nltk
Now, import the PorterStemmer class to implement the Porter Stemmer algorithm.
from nltk.stem import PorterStemmer
Next, create an instance of Porter Stemmer class as follows −
word_stemmer = PorterStemmer()
Now, input the word you want to stem.
word_stemmer.stem('writing')
Output
'write'
word_stemmer.stem('eating')
Output
'eat'
Complete implementation example
import nltk
from nltk.stem import PorterStemmer
word_stemmer = PorterStemmer()
word_stemmer.stem('writing')
Output
'write'
Lancaster stemming algorithm
It was developed at Lancaster University and it is another very common stemming algorithms.
LancasterStemmer class
NLTK has LancasterStemmer class with the help of which we can easily implement Lancaster Stemmer algorithms for the word we want to stem. Let us see an example −
First, we need to import the natural language toolkit(nltk).
import nltk
Now, import the LancasterStemmer class to implement Lancaster Stemmer algorithm
from nltk.stem import LancasterStemmer
Next, create an instance of LancasterStemmer class as follows −
Lanc_stemmer = LancasterStemmer()
Now, input the word you want to stem.
Lanc_stemmer.stem('eats')
Output
'eat'
Complete implementation example
import nltk
from nltk.stem import LancatserStemmer
Lanc_stemmer = LancasterStemmer()
Lanc_stemmer.stem('eats')
Output
'eat'
Regular Expression stemming algorithm
With the help of this stemming algorithm, we can construct our own stemmer.
RegexpStemmer class
NLTK has RegexpStemmer class with the help of which we can easily implement Regular Expression Stemmer algorithms. It basically takes a single regular expression and removes any prefix or suffix that matches the expression. Let us see an example −
First, we need to import the natural language toolkit(nltk).
import nltk
Now, import the RegexpStemmer class to implement the Regular Expression Stemmer algorithm.
from nltk.stem import RegexpStemmer
Next, create an instance of RegexpStemmer class and provides the suffix or prefix you want to remove from the word as follows −
Reg_stemmer = RegexpStemmer(ing)
Now, input the word you want to stem.
Reg_stemmer.stem('eating')
Output
'eat'
Reg_stemmer.stem('ingeat')
Output
'eat'
Reg_stemmer.stem('eats')
Output
'eat'
Complete implementation example
import nltk
from nltk.stem import RegexpStemmer
Reg_stemmer = RegexpStemmer()
Reg_stemmer.stem('ingeat')
Output
'eat'
Snowball stemming algorithm
It is another very useful stemming algorithm.
SnowballStemmer class
NLTK has SnowballStemmer class with the help of which we can easily implement Snowball Stemmer algorithms. It supports 15 non-English languages. In order to use this steaming class, we need to create an instance with the name of the language we are using and then call the stem() method. Let us see an example −
First, we need to import the natural language toolkit(nltk).
import nltk
Now, import the SnowballStemmer class to implement Snowball Stemmer algorithm
from nltk.stem import SnowballStemmer
Let us see the languages it supports −
SnowballStemmer.languages
Output
( 'arabic', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish' )
Next, create an instance of SnowballStemmer class with the language you want to use. Here, we are creating the stemmer for French language.
French_stemmer = SnowballStemmer(french)
Now, call the stem() method and input the word you want to stem.
French_stemmer.stem (Bonjoura)
Output
'bonjour'
Complete implementation example
import nltk from nltk.stem import SnowballStemmer French_stemmer = SnowballStemmer(french) French_stemmer.stem (Bonjoura)
Output
'bonjour'
What is Lemmatization?
Lemmatization technique is like stemming. The output we will get after lemmatization is called lemma, which is a root word rather than root stem, the output of stemming. After lemmatization, we will be getting a valid word that means the same thing.
NLTK provides WordNetLemmatizer class which is a thin wrapper around the wordnet corpus. This class uses morphy() function to the WordNet CorpusReader class to find a lemma. Let us understand it with an example −
Example
First, we need to import the natural language toolkit(nltk).
import nltk
Now, import the WordNetLemmatizer class to implement the lemmatization technique.
from nltk.stem import WordNetLemmatizer
Next, create an instance of WordNetLemmatizer class.
lemmatizer = WordNetLemmatizer()
Now, call the lemmatize() method and input the word of which you want to find lemma.
lemmatizer.lemmatize('eating')
Output
'eating'
lemmatizer.lemmatize('books')
Output
'book'
Complete implementation example
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('books')
Output
'book'
Difference between Stemming & Lemmatization
Let us understand the difference between Stemming and Lemmatization with the help of the following example −
import nltk
from nltk.stem import PorterStemmer
word_stemmer = PorterStemmer()
word_stemmer.stem('believes')
Output
believ
import nltk
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize(' believes ')
Output
believ
The output of both programs tells the major difference between stemming and lemmatization. PorterStemmer class chops off the es from the word. On the other hand, WordNetLemmatizer class finds a valid word. In simple words, stemming technique only looks at the form of the word whereas lemmatization technique looks at the meaning of the word. It means after applying lemmatization, we will always get a valid word.