Corpus Readers and Custom Corpora

What is a corpus?

A corpus is a large collection of machine-readable texts, held in a structured format, that were produced in a natural communicative setting. Corpora is the plural of corpus. A corpus can be derived in many ways, as follows −

From the text that was originally electronic
From the transcripts of spoken language
From optical character recognition and so on

Corpus representativeness, corpus balance, sampling and corpus size are the elements that play an important role in designing a corpus. Some of the most popular corpora for NLP tasks are TreeBank, PropBank, VerbNet and WordNet.

How to build a custom corpus?

While downloading NLTK, we also installed the NLTK data package, so it is already present on our computer. On Windows, we'll assume that this data package is installed at C:\nltk_data; on Linux, Unix and Mac OS X, we'll assume that it is installed at /usr/share/nltk_data.

In the following Python recipe, we are going to create custom corpora. They must live within one of the paths defined by NLTK, because otherwise NLTK cannot find them. In order to avoid conflict with the official NLTK data package, let us create a custom nltk_data directory in our home directory.

import os, os.path
# ~/nltk_data is one of the directories NLTK searches by default
path = os.path.expanduser('~/nltk_data')
if not os.path.exists(path):
    os.mkdir(path)
os.path.exists(path)

Output

True

Now, let us check whether the nltk_data directory in our home directory is among the paths that NLTK searches −

import nltk.data
path in nltk.data.path

Output

True

Since the output is True, the nltk_data directory in our home directory is on NLTK's search path.

Now we will make a wordlist file named wordfile.txt, put it in a folder named corpus inside the nltk_data directory (~/nltk_data/corpus/wordfile.txt), and load it by using nltk.data.load −

import nltk.data
nltk.data.load('corpus/wordfile.txt', format = 'raw')

Output

b'tutorialspoint\n'
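
The same steps can be tried without touching the home directory. The following is a minimal sketch, assuming a throwaway temporary directory (the name data_root is only illustrative) that is appended to nltk.data.path so that NLTK can find the file −

```python
import os
import tempfile

import nltk.data

# Hypothetical data root: a temporary directory, not the real ~/nltk_data.
data_root = tempfile.mkdtemp()
os.makedirs(os.path.join(data_root, 'corpus'))

# Write the wordlist file as bytes so the line ending is exactly '\n'.
with open(os.path.join(data_root, 'corpus', 'wordfile.txt'), 'wb') as f:
    f.write(b'tutorialspoint\n')

# NLTK only searches the directories listed in nltk.data.path.
nltk.data.path.append(data_root)

# format='raw' returns the file contents as bytes, without decoding.
raw = nltk.data.load('corpus/wordfile.txt', format='raw')
print(raw)
```

Appending to nltk.data.path affects only the current Python session, so this approach suits quick experiments rather than a permanent corpus location.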

Corpus readers

NLTK provides various CorpusReader classes. We are going to cover them in the following Python recipes.

Creating wordlist corpus

NLTK has a WordListCorpusReader class that provides access to a file containing a list of words. For the following Python recipe, we need to create a wordlist file, which can be a CSV or a plain text file. For example, we have created a file named ‘list’ that contains the following data −

tutorialspoint
Online
Free
Tutorials

Now let us instantiate a WordListCorpusReader class that produces the list of words from our file ‘list’.

from nltk.corpus.reader import WordListCorpusReader
reader_corpus = WordListCorpusReader('.', ['list'])
reader_corpus.words()

Output

['tutorialspoint', 'Online', 'Free', 'Tutorials']
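
For a version that can be run anywhere, the following sketch first writes the wordlist file and then reads it back. It assumes a temporary directory (corpus_dir is an illustrative name) in place of the current directory used above −

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader

# Hypothetical corpus root: a temporary directory holding the 'list' file.
corpus_dir = tempfile.mkdtemp()
with open(os.path.join(corpus_dir, 'list'), 'w', encoding='utf-8', newline='\n') as f:
    f.write('tutorialspoint\nOnline\nFree\nTutorials\n')

# The first argument is the corpus root, the second the file ids to read.
reader_corpus = WordListCorpusReader(corpus_dir, ['list'])

# words() returns one entry per non-empty line of the file.
words = reader_corpus.words()
print(words)
```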

Creating POS tagged word corpus

NLTK has a TaggedCorpusReader class with the help of which we can create a POS tagged word corpus. POS tagging is the process of identifying the part-of-speech tag for a word.

One of the simplest formats for a tagged corpus is the form ‘word/tag’, as in the following excerpt from the brown corpus −

The/at-tl expense/nn and/cc time/nn involved/vbn are/ber
astronomical/jj ./.

In the above excerpt, each word has a tag which denotes its POS. For example, nn refers to a noun, while a tag that starts with vb, such as vbn, refers to a verb.

Now let us instantiate a TaggedCorpusReader class that produces POS tagged words from the file ‘list.pos’, which contains the above excerpt.
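
To see how a single ‘word/tag’ token breaks down, the following small sketch uses NLTK's str2tuple helper from nltk.tag. It splits each token at the last slash and uppercases the tag; the three tokens are taken from the excerpt above −

```python
from nltk.tag import str2tuple

# Three 'word/tag' tokens from the excerpt, separated by whitespace.
tokens = 'The/at-tl expense/nn and/cc'.split()

# str2tuple splits on the last '/' and converts the tag to uppercase.
pairs = [str2tuple(token) for token in tokens]
print(pairs)
```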

from nltk.corpus.reader import TaggedCorpusReader
reader_corpus = TaggedCorpusReader('.', r'.*\.pos')
reader_corpus.tagged_words()

Output

[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ...]
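
The recipe above expects ‘list.pos’ to exist already. A self-contained sketch, assuming a temporary directory as the corpus root (corpus_dir is an illustrative name) and the excerpt written on a single line, could look as follows −

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader

# Hypothetical corpus root containing one tagged file, 'list.pos'.
corpus_dir = tempfile.mkdtemp()
excerpt = ('The/at-tl expense/nn and/cc time/nn involved/vbn are/ber '
           'astronomical/jj ./.\n')
with open(os.path.join(corpus_dir, 'list.pos'), 'w', encoding='utf-8') as f:
    f.write(excerpt)

# The regular expression selects every file ending in '.pos'.
reader_corpus = TaggedCorpusReader(corpus_dir, r'.*\.pos')

# tagged_words() yields (word, TAG) pairs; words() drops the tags.
tagged = list(reader_corpus.tagged_words())
plain = list(reader_corpus.words())
print(tagged)
print(plain)
```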

Creating Chunked phrase corpus

NLTK has a ChunkedCorpusReader class with the help of which we can create a chunked phrase corpus. A chunk is a short phrase in a sentence.

For example, we have the following excerpt from the tagged treebank corpus −

[Earlier/JJR staff-reduction/NN moves/NNS] have/VBP trimmed/VBN about/IN
[300/CD jobs/NNS] ,/, [the/DT spokesman/NN] said/VBD ./.

In the above excerpt, every chunk in brackets is a noun phrase, while the words outside the brackets are part of the sentence tree but not part of any noun phrase subtree.

Now let us instantiate a ChunkedCorpusReader class that produces chunked phrases from the file ‘list.chunk’, which contains the above excerpt.

from nltk.corpus.reader import ChunkedCorpusReader
reader_corpus = ChunkedCorpusReader('.', r'.*\.chunk')
reader_corpus.chunked_words()

Output

[
   Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]),
   ('have', 'VBP'), ...
]
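
As a runnable variant, the following sketch writes the excerpt to ‘list.chunk’ in a temporary directory (corpus_dir is an illustrative name) and then separates the noun phrase subtrees from the plain tagged words. It relies on the reader's default settings, which treat bracketed spans as NP chunks −

```python
import os
import tempfile

from nltk.corpus.reader import ChunkedCorpusReader
from nltk.tree import Tree

# Hypothetical corpus root containing one chunked file, 'list.chunk'.
corpus_dir = tempfile.mkdtemp()
excerpt = ('[Earlier/JJR staff-reduction/NN moves/NNS] have/VBP trimmed/VBN '
           'about/IN [300/CD jobs/NNS] ,/, [the/DT spokesman/NN] said/VBD ./.\n')
with open(os.path.join(corpus_dir, 'list.chunk'), 'w', encoding='utf-8') as f:
    f.write(excerpt)

reader_corpus = ChunkedCorpusReader(corpus_dir, r'.*\.chunk')

# chunked_words() mixes Tree objects (chunks) with (word, tag) tuples.
chunks = list(reader_corpus.chunked_words())

# Keep only the bracketed chunks and flatten each to its tagged words.
noun_phrases = [item.leaves() for item in chunks if isinstance(item, Tree)]
print(noun_phrases)
```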

Creating Categorized text corpus

NLTK has a CategorizedPlaintextCorpusReader class with the help of which we can create a categorized text corpus. It is very useful when we have a large corpus of text and want to divide it into separate sections.

For example, the brown corpus has several different categories. Let us list them with the help of the following Python code −

from nltk.corpus import brown
brown.categories()

Output

[
   'adventure', 'belles_lettres', 'editorial', 'fiction', 'government',
   'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion',
   'reviews', 'romance', 'science_fiction'
]

One of the easiest ways to categorize a corpus is to have one file for every category. For example, let us look at two excerpts from the movie_reviews corpus −

movie_pos.txt

The thin red line is flawed but it provokes.

movie_neg.txt

A big-budget and glossy production cannot make up for a lack of spontaneity that permeates their tv show.

So, from the above two files, we have two categories, namely pos and neg.

Now let us instantiate a CategorizedPlaintextCorpusReader class.

from nltk.corpus.reader import CategorizedPlaintextCorpusReader
reader_corpus = CategorizedPlaintextCorpusReader('.', r'movie_.*\.txt',
    cat_pattern = r'movie_(\w+)\.txt')
reader_corpus.categories()
reader_corpus.fileids(categories = ['neg'])
reader_corpus.fileids(categories = ['pos'])

Output

['neg', 'pos']
['movie_neg.txt']
['movie_pos.txt']
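
To try this end to end, the following sketch creates the two files in a temporary directory (corpus_dir is an illustrative name) before building the reader. The category of each file comes from the group captured by cat_pattern, so movie_pos.txt falls into the category pos −

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical corpus root with one file per category.
corpus_dir = tempfile.mkdtemp()
samples = {
    'movie_pos.txt': 'The thin red line is flawed but it provokes.\n',
    'movie_neg.txt': ('A big-budget and glossy production cannot make up for '
                      'a lack of spontaneity that permeates their tv show.\n'),
}
for name, text in samples.items():
    with open(os.path.join(corpus_dir, name), 'w', encoding='utf-8') as f:
        f.write(text)

# The (\w+) group in cat_pattern becomes the category name of each file.
reader_corpus = CategorizedPlaintextCorpusReader(
    corpus_dir, r'movie_.*\.txt', cat_pattern=r'movie_(\w+)\.txt')

categories = reader_corpus.categories()
neg_files = reader_corpus.fileids(categories=['neg'])

# Text can also be requested by category instead of by file id.
pos_text = reader_corpus.raw(categories=['pos'])
print(categories, neg_files)
```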