
Natural Language Toolkit - Word Replacement
Stemming and lemmatization can be considered as a kind of linguistic compression. In the same sense, word replacement can be thought of as text normalization or error correction.
Why do we need word replacement? Tokenizers have trouble with contractions such as can't and won't; for example, a tokenizer may split won't into the fragments wo and n't. To handle such cases, we can replace contractions with their expanded forms before tokenizing.
Word replacement using regular expression
First, we are going to replace words that match a regular expression. This requires a basic understanding of regular expressions and of Python's re module. In the example below, we replace contractions with their expanded forms (e.g. "can't" becomes "cannot") using regular expressions.
Example
First, import the necessary package re to work with regular expressions.
    import re
Next, define the replacement patterns of your choice as follows −
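    # Each entry is (regular expression, replacement).
    # Specific rules (won't, can't) come before the generic n't rule.
    R_patterns = [
        (r"won't", "will not"),
        (r"can't", "cannot"),
        (r"i'm", "i am"),  # matches the lowercase form only
        (r"(\w+)'ll", r"\g<1> will"),
        (r"(\w+)n't", r"\g<1> not"),
        (r"(\w+)'ve", r"\g<1> have"),
        (r"(\w+)'s", r"\g<1> is"),
        (r"(\w+)'re", r"\g<1> are"),
    ]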
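The generic patterns rely on a capture group and a backreference: (\w+) captures the word in front of the contraction, and \g<1> re-inserts it in the replacement. The following is a minimal sketch of that mechanism using only the re module; the sample sentences are illustrative and not part of the original tutorial.

```python
import re

# (\w+) captures the word before 'll; \g<1> puts it back in the replacement.
pattern = re.compile(r"(\w+)'ll")
expanded = pattern.sub(r"\g<1> will", "they'll say we'll come")
print(expanded)  # they will say we will come

# Order matters: if only the generic n't rule runs, "won't" is expanded badly.
generic_only = re.sub(r"(\w+)n't", r"\g<1> not", "I won't go")
print(generic_only)  # I wo not go
```

This is why the specific rules for won't and can't are listed ahead of the generic n't rule.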
Now, create a class that can be used for replacing words −
    class REReplacer(object):
        def __init__(self, patterns=R_patterns):
            # Compile each pattern once so that replace() can reuse it.
            self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

        def replace(self, text):
            s = text
            # Apply every rule in order to the running result.
            for (pattern, repl) in self.patterns:
                s = pattern.sub(repl, s)
            return s
Save this Python program as replacerRE.py. You can then import the REReplacer class from the Python prompt whenever you want to replace words. Let us see how.

    from replacerRE import REReplacer
    rep_word = REReplacer()
    rep_word.replace("I won't do it")
    Output: 'I will not do it'
    rep_word.replace("I can't do it")
    Output: 'I cannot do it'
Complete implementation example
    import re

    R_patterns = [
        (r"won't", "will not"),
        (r"can't", "cannot"),
        (r"i'm", "i am"),
        (r"(\w+)'ll", r"\g<1> will"),
        (r"(\w+)n't", r"\g<1> not"),
        (r"(\w+)'ve", r"\g<1> have"),
        (r"(\w+)'s", r"\g<1> is"),
        (r"(\w+)'re", r"\g<1> are"),
    ]

    class REReplacer(object):
        def __init__(self, patterns=R_patterns):
            self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

        def replace(self, text):
            s = text
            for (pattern, repl) in self.patterns:
                s = pattern.sub(repl, s)
            return s
Once you have saved the above program as replacerRE.py, you can import the class and use it as follows −

    from replacerRE import REReplacer
    rep_word = REReplacer()
    rep_word.replace("I won't do it")
Output
'I will not do it'
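Text copied from web pages or word processors often contains the typographic apostrophe (’, U+2019) instead of the ASCII apostrophe, and the patterns above match only the ASCII form. The following is a minimal sketch of one way to handle this, assuming it is acceptable to normalize the text first; normalize_apostrophes is a hypothetical helper that is not part of the original tutorial or of NLTK.

```python
import re

def normalize_apostrophes(text):
    """Map the typographic apostrophe (U+2019) to the ASCII apostrophe."""
    return text.replace("\u2019", "'")

# One of the contraction patterns from this section.
cant = re.compile(r"can't")

raw = "I can\u2019t do it"
untouched = cant.sub("cannot", raw)                       # no ASCII apostrophe, so no match
fixed = cant.sub("cannot", normalize_apostrophes(raw))
print(untouched == raw)  # True
print(fixed)             # I cannot do it
```

Running such a normalization step before REReplacer.replace() would let the same pattern list cover both apostrophe styles.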
Replacement before text processing
A common practice in natural language processing (NLP) is to clean up the text before processing it. The REReplacer class created in the previous example can serve as such a preliminary step before tokenization.
Example
    from nltk.tokenize import word_tokenize
    from replacerRE import REReplacer
    rep_word = REReplacer()
    word_tokenize("I won't be able to do this now")
    Output: ['I', 'wo', "n't", 'be', 'able', 'to', 'do', 'this', 'now']
    word_tokenize(rep_word.replace("I won't be able to do this now"))
    Output: ['I', 'will', 'not', 'be', 'able', 'to', 'do', 'this', 'now']

The two outputs show the difference: without replacement, the word tokenizer splits won't into 'wo' and "n't"; with the regular expression replacement applied first, it produces 'will' and 'not'.
Removal of repeating characters
Are we strictly grammatical in our everyday language? No, we are not. For example, we sometimes write 'Hiiiiiiiiiiii Mohan' to emphasize the word 'Hi'. But a computer system does not know that 'Hiiiiiiiiiiii' is a variation of the word 'Hi'. In the example below, we create a class named Rep_word_removal that removes such repeating characters.
Example
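First, import the re package to work with regular expressions, and the wordnet corpus reader to check whether a word exists. The WordNet corpus must be installed, for example with nltk.download('wordnet').

    import re
    from nltk.corpus import wordnet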
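Now, create a class that can be used for removing the repeating characters −

    class Rep_word_removal(object):
        def __init__(self):
            # Matches any character that is immediately repeated.
            self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
            # Keeps only one copy of the repeated character.
            self.repl = r'\1\2\3'

        def replace(self, word):
            # Stop as soon as WordNet recognizes the word.
            if wordnet.synsets(word):
                return word
            repl_word = self.repeat_regexp.sub(self.repl, word)
            # Recurse until no repeated character is left.
            if repl_word != word:
                return self.replace(repl_word)
            else:
                return repl_word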
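To see why the WordNet check is needed, it helps to trace the regular expression on its own: each pass removes one repeated character, and without a stopping condition a valid word such as Hello would lose its double l. The following is a minimal sketch of that behaviour; the known_words set is a hypothetical stand-in for the WordNet lookup so that the sketch runs without NLTK data, and remove_repeats is an illustrative name, not part of the tutorial.

```python
import re

repeat_regexp = re.compile(r"(\w*)(\w)\2(\w*)")
repl = r"\1\2\3"

# Hypothetical stand-in for wordnet.synsets(), so no NLTK data is required.
known_words = {"hi", "hello"}

def remove_repeats(word):
    """Drop one repeated character per pass until the word is known or stops changing."""
    steps = [word]
    while word.lower() not in known_words:
        shorter = repeat_regexp.sub(repl, word)
        if shorter == word:
            break
        word = shorter
        steps.append(word)
    return word, steps

result, steps = remove_repeats("Hiii")
print(steps)                         # ['Hiii', 'Hii', 'Hi']
print(remove_repeats("Hellooo")[0])  # Hello

# Without the vocabulary check, a valid word is damaged:
over_reduced = repeat_regexp.sub(repl, "Hello")
print(over_reduced)                  # Helo
```

The loop here plays the same role as the recursion in the class above; the vocabulary lookup is what stops the reduction at a real word.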
Save this Python program as removalrepeat.py. You can then import the Rep_word_removal class from the Python prompt whenever you want to remove repeating characters. Let us see how.

    from removalrepeat import Rep_word_removal
    rep_word = Rep_word_removal()
    rep_word.replace("Hiiiiiiiiiiiiiiiiiiiii")
    Output: 'Hi'
    rep_word.replace("Hellooooooooooooooo")
    Output: 'Hello'
Complete implementation example
    import re
    from nltk.corpus import wordnet

    class Rep_word_removal(object):
        def __init__(self):
            self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
            self.repl = r'\1\2\3'

        def replace(self, word):
            if wordnet.synsets(word):
                return word
            replace_word = self.repeat_regexp.sub(self.repl, word)
            if replace_word != word:
                return self.replace(replace_word)
            else:
                return replace_word
Once you have saved the above program as removalrepeat.py, you can import the class and use it as follows −

    from removalrepeat import Rep_word_removal
    rep_word = Rep_word_removal()
    rep_word.replace("Hiiiiiiiiiiiiiiiiiiiii")
Output
'Hi'