
Natural Language Toolkit - Transforming Trees
There are two main reasons to transform trees:
- To modify a deep parse tree
- To flatten a deep parse tree
Converting Tree or Subtree to Sentence
The first recipe converts a Tree or subtree back into a sentence or chunk string. It only requires joining the words stored in the tree's leaves, as the following example shows:
Example
    from nltk.corpus import treebank_chunk

    tree = treebank_chunk.chunked_sents()[2]
    ' '.join([w for w, t in tree.leaves()])
Output
'Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a nonexecutive director of this British industrial conglomerate .'
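The same idea can be tried without downloading the corpus. The sketch below builds a small chunk tree by hand and joins the words of the whole tree and of one subtree. The sentence and the helper name tree_to_text() are illustrative assumptions, not part of the original recipe.

```python
from nltk.tree import Tree

# A hand-built chunk tree: leaves are (word, tag) tuples, as in treebank_chunk.
chunk_tree = Tree('S', [
    Tree('NP', [('the', 'DT'), ('quick', 'JJ'), ('fox', 'NN')]),
    ('jumped', 'VBD'),
    Tree('NP', [('the', 'DT'), ('fence', 'NN')]),
])

def tree_to_text(tree):
    """Join the words of a chunk tree (or subtree), dropping the POS tags."""
    return ' '.join(word for word, tag in tree.leaves())

sentence = tree_to_text(chunk_tree)        # whole tree
first_chunk = tree_to_text(chunk_tree[0])  # first NP subtree only
print(sentence)
print(first_chunk)
```

Because leaves() works on any Tree, the same helper applies to a single chunk as well as to the full sentence.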
Deep tree flattening
Deep trees of nested phrases cannot be used to train a chunker, so we must flatten them first. In the following example, we use the third parsed sentence from the treebank corpus, which is a deep tree of nested phrases.
Example
To achieve this, we define a function named deeptree_flat() that takes a single Tree and returns a new Tree that keeps only the lowest-level trees. Most of the work is done by a helper function named childtree_flat().
    from nltk.tree import Tree

    def childtree_flat(trees):
        children = []
        for t in trees:
            if t.height() < 3:
                # A POS-tag subtree: keep it as a (word, tag) tuple
                children.extend(t.pos())
            elif t.height() == 3:
                # A lowest-level phrase: keep its label and tagged words
                children.append(Tree(t.label(), t.pos()))
            else:
                # A deeper subtree: recurse into its children
                children.extend(childtree_flat([c for c in t]))
        return children

    def deeptree_flat(tree):
        return Tree(tree.label(), childtree_flat([c for c in tree]))
We saved these functions in a file named deeptree.py. Now let us call deeptree_flat() on the third parsed sentence from the treebank corpus:
    from deeptree import deeptree_flat
    from nltk.corpus import treebank

    deeptree_flat(treebank.parsed_sents()[2])
Output
Tree('S', [Tree('NP', [('Rudolph', 'NNP'), ('Agnew', 'NNP')]), (',', ','), Tree('NP', [('55', 'CD'), ('years', 'NNS')]), ('old', 'JJ'), ('and', 'CC'), Tree('NP', [('former', 'JJ'), ('chairman', 'NN')]), ('of', 'IN'), Tree('NP', [('Consolidated', 'NNP'), ('Gold', 'NNP'), ('Fields', 'NNP'), ('PLC', 'NNP')]), (',', ','), ('was', 'VBD'), ('named', 'VBN'), Tree('NP-SBJ', [('*-1', '-NONE-')]), Tree('NP', [('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN')]), ('of', 'IN'), Tree('NP', [('this', 'DT'), ('British', 'JJ'), ('industrial', 'JJ'), ('conglomerate', 'NN')]), ('.', '.')])
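To see the flattening on something small enough to trace by hand, the sketch below repeats the two functions and applies them to a toy parse tree built with Tree.fromstring(). The bracketed sentence is an invented example, not taken from the treebank corpus.

```python
from nltk.tree import Tree

def childtree_flat(trees):
    children = []
    for t in trees:
        if t.height() < 3:
            children.extend(t.pos())                    # POS-tag subtree
        elif t.height() == 3:
            children.append(Tree(t.label(), t.pos()))   # lowest-level phrase
        else:
            children.extend(childtree_flat(list(t)))    # recurse deeper
    return children

def deeptree_flat(tree):
    return Tree(tree.label(), childtree_flat(list(tree)))

# Toy deep tree (assumed example): an NP that nests another NP and a PP.
deep = Tree.fromstring(
    "(S (NP (NP (DT the) (NN cat)) (PP (IN on) (NP (DT the) (NN mat))))"
    " (VP (VBD slept)) (. .))"
)
flat = deeptree_flat(deep)

print(deep.height())  # 6
print(flat.height())  # 3
print(flat)
```

Note that the intermediate NP and PP labels disappear: only the two innermost NP phrases and the VP survive, while 'on' and '.' become plain (word, tag) children of S.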
Building Shallow tree
In the previous section, we flattened a deep tree of nested phrases by keeping only the lowest-level subtrees. In this section, we keep only the highest-level subtrees, that is, we build a shallow tree. The example again uses the third parsed sentence from the treebank corpus, which is a deep tree of nested phrases.
Example
To achieve this, we define a function named tree_shallow() that eliminates all nested subtrees, keeping only the top-level subtree labels.
    from nltk.tree import Tree

    def tree_shallow(tree):
        children = []
        for t in tree:
            if t.height() < 3:
                # A POS-tag subtree: keep it as a (word, tag) tuple
                children.extend(t.pos())
            else:
                # A phrase: keep its label, collapse everything below it
                children.append(Tree(t.label(), t.pos()))
        return Tree(tree.label(), children)
We saved this function in a file named shallowtree.py. Now let us call tree_shallow() on the third parsed sentence from the treebank corpus:
    from shallowtree import tree_shallow
    from nltk.corpus import treebank

    tree_shallow(treebank.parsed_sents()[2])
Output
Tree('S', [Tree('NP-SBJ-1', [('Rudolph', 'NNP'), ('Agnew', 'NNP'), (',', ','), ('55', 'CD'), ('years', 'NNS'), ('old', 'JJ'), ('and', 'CC'), ('former', 'JJ'), ('chairman', 'NN'), ('of', 'IN'), ('Consolidated', 'NNP'), ('Gold', 'NNP'), ('Fields', 'NNP'), ('PLC', 'NNP'), (',', ',')]), Tree('VP', [('was', 'VBD'), ('named', 'VBN'), ('*-1', '-NONE-'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('of', 'IN'), ('this', 'DT'), ('British', 'JJ'), ('industrial', 'JJ'), ('conglomerate', 'NN')]), ('.', '.')])
We can see the difference by comparing the heights of the two trees:
    from shallowtree import tree_shallow
    from nltk.corpus import treebank

    tree_shallow(treebank.parsed_sents()[2]).height()
Output
3
    from nltk.corpus import treebank

    treebank.parsed_sents()[2].height()
Output
9
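The height comparison can also be reproduced without the corpus. The following sketch applies tree_shallow() to a small invented parse tree (the sentence is an assumption made for illustration) and checks that every top-level phrase is collapsed into a single level of tagged words.

```python
from nltk.tree import Tree

def tree_shallow(tree):
    children = []
    for t in tree:
        if t.height() < 3:
            children.extend(t.pos())                    # POS-tag subtree
        else:
            children.append(Tree(t.label(), t.pos()))   # collapse the phrase
    return Tree(tree.label(), children)

# Toy deep tree (assumed example), not taken from the treebank corpus.
deep = Tree.fromstring(
    "(S (NP (NP (DT the) (NN cat)) (PP (IN on) (NP (DT the) (NN mat))))"
    " (VP (VBD slept)) (. .))"
)
shallow = tree_shallow(deep)

print(deep.height())     # 6
print(shallow.height())  # 3
print(shallow)
```

Unlike the deep-flattening recipe, the outer NP label is kept here and the nested NP and PP inside it are dissolved into its word list.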
Tree labels conversion
Parse trees contain a variety of Tree label types that are not present in chunk trees. When using parse trees to train a chunker, we would like to reduce this variety by converting some of the Tree labels to more common label types. For example, there are two alternative NP subtree labels, namely NP-SBJ and NP-TMP. We can convert both of them into NP. Let us see how to do it in the following example.
Example
To achieve this, we define a function named tree_convert() that takes the following two arguments:
- Tree to convert
- A label conversion mapping
This function will return a new Tree with all matching labels replaced based on the values in the mapping.
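    from nltk.tree import Tree

    def tree_convert(tree, mapping):
        children = []
        for t in tree:
            if isinstance(t, Tree):
                # Recurse so that nested labels are converted too
                children.append(tree_convert(t, mapping))
            else:
                # A leaf (word string): keep it unchanged
                children.append(t)
        # Replace the label only if the mapping has an entry for it
        label = mapping.get(tree.label(), tree.label())
        return Tree(label, children)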
We saved this function in a file named converttree.py. Now let us call tree_convert() on the third parsed sentence from the treebank corpus:
    from converttree import tree_convert
    from nltk.corpus import treebank

    mapping = {'NP-SBJ': 'NP', 'NP-TMP': 'NP'}
    tree_convert(treebank.parsed_sents()[2], mapping)
Output
Tree('S', [Tree('NP-SBJ-1', [Tree('NP', [Tree('NNP', ['Rudolph']), Tree('NNP', ['Agnew'])]), Tree(',', [',']), Tree('UCP', [Tree('ADJP', [Tree('NP', [Tree('CD', ['55']), Tree('NNS', ['years'])]), Tree('JJ', ['old'])]), Tree('CC', ['and']), Tree('NP', [Tree('NP', [Tree('JJ', ['former']), Tree('NN', ['chairman'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NNP', ['Consolidated']), Tree('NNP', ['Gold']), Tree('NNP', ['Fields']), Tree('NNP', ['PLC'])])])])]), Tree(',', [','])]), Tree('VP', [Tree('VBD', ['was']), Tree('VP', [Tree('VBN', ['named']), Tree('S', [Tree('NP', [Tree('-NONE-', ['*-1'])]), Tree('NP-PRD', [Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['nonexecutive']), Tree('NN', ['director'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['this']), Tree('JJ', ['British']), Tree('JJ', ['industrial']), Tree('NN', ['conglomerate'])])])])])])]), Tree('.', ['.'])])
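Notice that the label NP-SBJ-1 in the output above is unchanged: the mapping is looked up by exact label, so only labels that appear as keys are replaced. The sketch below shows this behaviour on two small invented trees (the sentences are assumptions for illustration); it repeats tree_convert() so that it runs on its own.

```python
from nltk.tree import Tree

def tree_convert(tree, mapping):
    children = []
    for t in tree:
        if isinstance(t, Tree):
            children.append(tree_convert(t, mapping))  # convert nested labels
        else:
            children.append(t)                         # leaf word, unchanged
    label = mapping.get(tree.label(), tree.label())    # exact-match lookup
    return Tree(label, children)

mapping = {'NP-SBJ': 'NP', 'NP-TMP': 'NP'}

# Toy parse tree (assumed example) with both labels from the mapping.
tree = Tree.fromstring(
    "(S (NP-SBJ (PRP She)) (VP (VBD met) (NP (PRP him)) (NP-TMP (NN yesterday))))"
)
converted = tree_convert(tree, mapping)
labels = [t.label() for t in converted.subtrees()]
print(labels)

# An indexed label such as NP-SBJ-1 is a different key, so it is left alone.
indexed = Tree.fromstring("(S (NP-SBJ-1 (PRP She)) (VP (VBD left)))")
print(tree_convert(indexed, mapping)[0].label())  # NP-SBJ-1
```

To convert indexed labels as well, one option is simply to add them to the mapping, for example 'NP-SBJ-1': 'NP'. The function builds a new Tree, so the input tree keeps its original labels.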