Natural Language Toolkit - Transforming Chunks

Why transform chunks?

So far we have extracted chunks, or phrases, from sentences, but what are we supposed to do with them? One important task is to transform them. But why? It is to do the following −

• grammatical correction and
• rearranging phrases

Filtering insignificant/useless words

Suppose you want to judge the meaning of a phrase. Many commonly used words, such as ‘the’ and ‘a’, are insignificant for that purpose. For example, see the following phrase −

‘The movie was good’.

Here the most significant words are ‘movie’ and ‘good’. The other words, ‘the’ and ‘was’, are insignificant, because we get the same meaning without them: ‘good movie’.

In the following Python recipe, we will learn how to remove insignificant words and keep the significant ones with the help of POS tags.

Example

First, by looking through the treebank corpus for stopwords, we need to decide which part-of-speech tags are significant and which are not. The following table lists some insignificant words and their tags −

Word    Tag
a       DT
all     PDT
an      DT
and     CC
or      CC
that    WDT
the     DT

From the above table, we can see that, other than CC, all the tags end with DT, which means we can filter out insignificant words by looking at the tag’s suffix.

For this example, we are going to use a function named filter(), which takes a single chunk and returns a new chunk without any insignificant tagged words. It drops every word whose tag ends with DT or CC. Note that this name shadows Python's built-in filter() in the module where it is defined.

Example

def filter(chunk, tag_suffixes=('DT', 'CC')):
    # Keep only the (word, tag) pairs whose tag ends with none of the suffixes.
    significant = []
    for word, tag in chunk:
        ok = True
        for suffix in tag_suffixes:
            if tag.endswith(suffix):
                ok = False
                break
        if ok:
            significant.append((word, tag))
    return significant


Now, let us use filter() in our Python recipe to delete insignificant words. The import below assumes the function was saved in a file named chunk_parse.py −

from chunk_parse import filter
filter([('the', 'DT'),('good', 'JJ'),('movie', 'NN')])


Output

[('good', 'JJ'), ('movie', 'NN')]
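
Applied to the earlier phrase ‘The movie was good’, the default suffixes remove only the determiner, since ‘was’ is tagged as a verb. The following minimal sketch shows one way to drop it as well by passing extra suffixes. The name filter_insignificant and the tags assigned to the phrase are assumptions made for illustration; the different name only avoids shadowing Python's built-in filter().

```python
def filter_insignificant(chunk, tag_suffixes=('DT', 'CC')):
    """Return the (word, tag) pairs whose tag ends with none of tag_suffixes."""
    # str.endswith accepts a tuple, so one call checks every suffix.
    suffixes = tuple(tag_suffixes)
    return [(word, tag) for word, tag in chunk if not tag.endswith(suffixes)]


# Hand-assigned tags (an assumption; a real tagger may produce others).
phrase = [('The', 'DT'), ('movie', 'NN'), ('was', 'VBD'), ('good', 'JJ')]

# Default suffixes: only the determiner is removed.
default_result = filter_insignificant(phrase)

# Extra suffix 'VBD': the past-tense verb is removed too.
strict_result = filter_insignificant(phrase, tag_suffixes=('DT', 'CC', 'VBD'))

print(default_result)  # [('movie', 'NN'), ('was', 'VBD'), ('good', 'JJ')]
print(strict_result)   # [('movie', 'NN'), ('good', 'JJ')]
```

Which tags count as insignificant depends on the task, so the suffix list is best treated as a parameter rather than a fixed rule.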


Verb Correction

In real-world language we often see incorrect verb forms. For example, ‘is you fine?’ is not correct; the sentence should be ‘are you fine?’. We can correct such mistakes by creating verb correction mappings and applying them with a small helper function. Which mapping is used depends on whether the chunk contains a plural or a singular noun.

Example

To implement this Python recipe, we first need to define verb correction mappings. Let us create two mappings as follows −

Singular to Plural mappings (applied when the noun is plural)

plural = {
    ('is', 'VBZ'): ('are', 'VBP'),
    ('was', 'VBD'): ('were', 'VBD')
}


Plural to Singular mappings (applied when the noun is singular)

singular = {
    ('are', 'VBP'): ('is', 'VBZ'),
    ('were', 'VBD'): ('was', 'VBD')
}


As seen above, each mapping has a tagged verb which maps to another tagged verb. The initial mappings in our example cover the basics: is to are, was to were, and vice versa.
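
Because the two mappings are exact inverses of each other, one can be derived from the other instead of being typed by hand. This is a small sketch of that idea, not part of the original recipe; it assumes every key maps to a distinct value, which holds for the two pairs shown.

```python
plural = {
    ('is', 'VBZ'): ('are', 'VBP'),
    ('was', 'VBD'): ('were', 'VBD'),
}

# Invert the mapping: each corrected form becomes the key and
# the form it replaced becomes the value.
singular = {corrected: original for original, corrected in plural.items()}

print(singular)
```

Deriving one dictionary from the other keeps the pair consistent when more verbs are added later.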

Next, we will define a function named verbs(), to which you can pass a chunk with an incorrect verb form and get a corrected chunk back. To get it done, verbs() uses a helper function named index_chunk(), which searches the chunk for the position of the first tagged word that satisfies a given predicate.

Let us see these functions −

def index_chunk(chunk, pred, start=0, step=1):
    # Return the index of the first item for which pred is true, else None.
    l = len(chunk)
    end = l if step > 0 else -1
    for i in range(start, end, step):
        if pred(chunk[i]):
            return i
    return None


def tag_startswith(prefix):
    # Build a predicate that tests the tag of a (word, tag) pair.
    def f(wt):
        return wt[1].startswith(prefix)
    return f


def verbs(chunk):
    vbidx = index_chunk(chunk, tag_startswith('VB'))
    if vbidx is None:
        return chunk
    verb, vbtag = chunk[vbidx]
    nnpred = tag_startswith('NN')
    # Look for a noun after the verb first, then before it.
    nnidx = index_chunk(chunk, nnpred, start=vbidx+1)
    if nnidx is None:
        nnidx = index_chunk(chunk, nnpred, start=vbidx-1, step=-1)
    if nnidx is None:
        return chunk
    noun, nntag = chunk[nnidx]
    # A noun tag ending in 'S' (NNS, NNPS) marks a plural noun.
    if nntag.endswith('S'):
        chunk[vbidx] = plural.get((verb, vbtag), (verb, vbtag))
    else:
        chunk[vbidx] = singular.get((verb, vbtag), (verb, vbtag))
    return chunk


Save the two mappings and these functions together in a Python file in your working directory. Here the file is saved as verbcorrect.py.

Now, let us call the verbs() function on a POS-tagged ‘is you fine’ chunk −

from verbcorrect import verbs
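
As a minimal, self-contained sketch of how such a call behaves, the block below repeats the mappings and functions and applies verbs() to two chunks. The example chunks and their tags are assumptions chosen for illustration, not output taken from this tutorial.

```python
plural = {('is', 'VBZ'): ('are', 'VBP'), ('was', 'VBD'): ('were', 'VBD')}
singular = {('are', 'VBP'): ('is', 'VBZ'), ('were', 'VBD'): ('was', 'VBD')}


def index_chunk(chunk, pred, start=0, step=1):
    """Return the index of the first item satisfying pred, or None."""
    end = len(chunk) if step > 0 else -1
    for i in range(start, end, step):
        if pred(chunk[i]):
            return i
    return None


def tag_startswith(prefix):
    """Build a predicate that checks the tag of a (word, tag) pair."""
    return lambda wt: wt[1].startswith(prefix)


def verbs(chunk):
    """Make the first verb agree with a noun found after it, else before it.

    The chunk is modified in place and also returned.
    """
    vbidx = index_chunk(chunk, tag_startswith('VB'))
    if vbidx is None:
        return chunk
    verb, vbtag = chunk[vbidx]
    nnpred = tag_startswith('NN')
    nnidx = index_chunk(chunk, nnpred, start=vbidx + 1)
    if nnidx is None:
        nnidx = index_chunk(chunk, nnpred, start=vbidx - 1, step=-1)
    if nnidx is None:
        return chunk
    noun, nntag = chunk[nnidx]
    mapping = plural if nntag.endswith('S') else singular
    chunk[vbidx] = mapping.get((verb, vbtag), (verb, vbtag))
    return chunk


# Hypothetical chunk with a plural noun (NNS): 'is' is corrected to 'are'.
with_noun = verbs([('is', 'VBZ'), ('our', 'PRP$'), ('children', 'NNS'),
                   ('learning', 'VBG')])

# Hypothetical tagging of 'is you fine': 'you' is a pronoun (PRP), so no
# NN tag is found and the chunk comes back unchanged.
without_noun = verbs([('is', 'VBZ'), ('you', 'PRP'), ('fine', 'JJ')])

print(with_noun)
print(without_noun)
```

Since a pronoun such as ‘you’ carries no NN tag, this helper leaves the ‘is you fine’ chunk untouched under that tagging; handling pronouns would need an additional rule.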


Eliminating passive voice from phrases

Another useful task is to eliminate passive voice from phrases. This can be done by swapping the words around a verb. For example, ‘the tutorial was great’ can be transformed into ‘great the tutorial’, which reads as ‘great tutorial’ once the insignificant determiner is filtered out.

Example

To achieve this, we define a function named eliminate_passive() that swaps the right-hand side of the chunk with the left-hand side, using the verb as the pivot point. To find the verb to pivot around, it uses the index_chunk() function defined above.

def eliminate_passive(chunk):
    def vbpred(wt):
        # Match any verb tag longer than plain 'VB', except the gerund 'VBG'.
        word, tag = wt
        return tag != 'VBG' and tag.startswith('VB') and len(tag) > 2
    vbidx = index_chunk(chunk, vbpred)
    if vbidx is None:
        return chunk
    # Drop the verb and put the words after it in front of the words before it.
    return chunk[vbidx+1:] + chunk[:vbidx]


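Now, let us call the eliminate_passive() function on the POS-tagged chunk ‘the tutorial was great’. The import assumes the function was saved, together with index_chunk(), in a file named passiveverb.py −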

from passiveverb import eliminate_passive
eliminate_passive(
    [('the', 'DT'), ('tutorial', 'NN'), ('was', 'VBD'), ('great', 'JJ')]
)


Output

[('great', 'JJ'), ('the', 'DT'), ('tutorial', 'NN')]
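
The result still contains the determiner. As a minimal sketch, the swap can be chained with the insignificant-word filter from earlier to reach ‘great tutorial’. The helper name filter_insignificant is an assumption used here to avoid shadowing the built-in filter(); the functions are repeated so that the block runs on its own.

```python
def index_chunk(chunk, pred, start=0, step=1):
    """Return the index of the first item satisfying pred, or None."""
    end = len(chunk) if step > 0 else -1
    for i in range(start, end, step):
        if pred(chunk[i]):
            return i
    return None


def eliminate_passive(chunk):
    """Swap the words around the first matching verb and drop the verb."""
    def vbpred(wt):
        word, tag = wt
        return tag != 'VBG' and tag.startswith('VB') and len(tag) > 2
    vbidx = index_chunk(chunk, vbpred)
    if vbidx is None:
        return chunk
    return chunk[vbidx + 1:] + chunk[:vbidx]


def filter_insignificant(chunk, tag_suffixes=('DT', 'CC')):
    """Drop words whose tag ends with one of tag_suffixes."""
    suffixes = tuple(tag_suffixes)
    return [(w, t) for w, t in chunk if not t.endswith(suffixes)]


chunk = [('the', 'DT'), ('tutorial', 'NN'), ('was', 'VBD'), ('great', 'JJ')]

swapped = eliminate_passive(chunk)       # verb removed, sides swapped
cleaned = filter_insignificant(swapped)  # determiner removed

print(swapped)  # [('great', 'JJ'), ('the', 'DT'), ('tutorial', 'NN')]
print(cleaned)  # [('great', 'JJ'), ('tutorial', 'NN')]
```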


Swapping noun cardinals

As we know, a cardinal number such as 5 is tagged as CD in a chunk. Cardinals often occur before or after a noun, but for normalization purposes it is useful to always put them before the noun. For example, the date ‘January 5’ can be written as ‘5 January’. Let us understand it with the following example.

Example

To achieve this, we define a function named swapping_cardinals() that finds the first cardinal in the chunk and, if it occurs immediately after a noun, swaps it with that noun so that the cardinal ends up immediately before the noun. To compare a word's tag for equality with a given tag, it uses a helper function named tag_eql().

def tag_eql(tag):
    # Build a predicate that tests whether a (word, tag) pair has exactly this tag.
    def f(wt):
        return wt[1] == tag
    return f


Now we can define swapping_cardinals() −

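def swapping_cardinals(chunk):
    cdidx = index_chunk(chunk, tag_eql('CD'))
    # cdidx is None if there is no cardinal and 0 if the cardinal comes first;
    # either way there is no preceding noun to swap with.
    if not cdidx or not chunk[cdidx-1][1].startswith('NN'):
        return chunk
    noun, nntag = chunk[cdidx-1]
    chunk[cdidx-1] = chunk[cdidx]
    chunk[cdidx] = noun, nntag
    return chunk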


Now, let us call the swapping_cardinals() function on the date “January 5”. The import assumes the functions were saved in a file named Cardinals.py −

from Cardinals import swapping_cardinals
swapping_cardinals([('January', 'NNP'), ('5', 'CD')])


Output

[('5', 'CD'), ('January', 'NNP')]
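
To make the behaviour of the guard clause concrete, the sketch below runs swapping_cardinals() on three chunks. The example chunks are assumptions chosen for illustration; the helper functions are repeated so that the block runs on its own.

```python
def index_chunk(chunk, pred, start=0, step=1):
    """Return the index of the first item satisfying pred, or None."""
    end = len(chunk) if step > 0 else -1
    for i in range(start, end, step):
        if pred(chunk[i]):
            return i
    return None


def tag_eql(tag):
    """Build a predicate that checks for an exact tag match."""
    return lambda wt: wt[1] == tag


def swapping_cardinals(chunk):
    """Move the first cardinal in front of the noun directly before it."""
    cdidx = index_chunk(chunk, tag_eql('CD'))
    # None means no cardinal; 0 means the cardinal is already first.
    if not cdidx or not chunk[cdidx - 1][1].startswith('NN'):
        return chunk
    chunk[cdidx - 1], chunk[cdidx] = chunk[cdidx], chunk[cdidx - 1]
    return chunk


# Cardinal directly after a noun: the two are swapped.
after_noun = swapping_cardinals([('Dec', 'NNP'), ('10', 'CD')])

# Cardinal already first: returned unchanged.
already_first = swapping_cardinals([('5', 'CD'), ('January', 'NNP')])

# Cardinal after a verb, not a noun: returned unchanged.
after_verb = swapping_cardinals([('bought', 'VBD'), ('5', 'CD'),
                                 ('books', 'NNS')])

print(after_noun)     # [('10', 'CD'), ('Dec', 'NNP')]
print(already_first)  # [('5', 'CD'), ('January', 'NNP')]
print(after_verb)     # [('bought', 'VBD'), ('5', 'CD'), ('books', 'NNS')]
```

Note that only the first cardinal in a chunk is considered, and that the input list is modified in place as well as returned.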

