
Corpus Readers and Custom Corpora
What is a corpus?
A corpus is a large, structured collection of machine-readable texts that have been produced in a natural communicative setting. The word corpora is the plural of corpus. A corpus can be derived in many ways, as follows −
- From text that was originally electronic
- From the transcripts of spoken language
- From optical character recognition and so on
Corpus representativeness, corpus balance, sampling and corpus size are the elements that play an important role while designing a corpus. Some of the most popular corpora for NLP tasks are TreeBank, PropBank, VerbNet and WordNet.
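These corpora ship with NLTK as data packages. As a quick illustration (a minimal sketch, assuming the treebank and wordnet packages have already been downloaded, e.g. with nltk.download('treebank') and nltk.download('wordnet')), we can peek at two of them through nltk.corpus −
from nltk.corpus import treebank, wordnet
# First few tokens of the Penn Treebank sample shipped with NLTK
# (assumes the 'treebank' data package is downloaded).
treebank.words()[:5]
# WordNet synsets for the word 'corpus'
# (assumes the 'wordnet' data package is downloaded).
wordnet.synsets('corpus')[:2]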
How to build a custom corpus?
When we installed NLTK, we also installed the NLTK data package (for example, via nltk.download()), so this data package is already present on our computer. On Windows, we’ll assume that it is installed at C:\nltk_data, and on Linux, Unix and Mac OS X, we’ll assume that it is installed at /usr/share/nltk_data.
In the following Python recipe, we are going to create a custom corpus, which must live within one of the paths defined in nltk.data.path so that NLTK can find it. In order to avoid conflict with the official NLTK data package, let us use a custom nltk_data directory in our home directory.
import os, os.path
path = os.path.expanduser('~/nltk_data')
if not os.path.exists(path):
   os.mkdir(path)
os.path.exists(path)
Output
True
Now, let us check whether the nltk_data directory in our home directory is one of the paths NLTK searches −
import nltk.data
path in nltk.data.path
Output
True
As we got the output True, the nltk_data directory in our home directory is one of the paths NLTK searches, so anything we put there can be found and loaded by NLTK.
Now we will make a wordlist file named wordfile.txt, put it in a folder named corpus inside the nltk_data directory (~/nltk_data/corpus/wordfile.txt), and load it by using nltk.data.load −
import nltk.data
nltk.data.load('corpus/wordfile.txt', format = 'raw')
Output
b'tutorialspoint\n'
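Note that format = 'raw' returns the file contents as bytes. As a side note, nltk.data.load can also decode the file for us; a minimal sketch against the same ~/nltk_data/corpus/wordfile.txt file −
import nltk.data
# format = 'raw' returns the raw bytes − b'tutorialspoint\n'.
nltk.data.load('corpus/wordfile.txt', format = 'raw')
# format = 'text' decodes them to a plain string − 'tutorialspoint\n'.
nltk.data.load('corpus/wordfile.txt', format = 'text')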
Corpus readers
NLTK provides various CorpusReader classes. We are going to cover them in the following Python recipes.
Creating wordlist corpus
NLTK has a WordListCorpusReader class that provides access to a file containing a list of words. For the following Python recipe, we need to create a wordlist file, which can be a CSV or a normal text file. For example, we have created a file named 'list' that contains the following data, one word per line −
tutorialspoint
Online
Free
Tutorials
Now let us instantiate a WordListCorpusReader class producing the list of words from our created file 'list' −
from nltk.corpus.reader import WordListCorpusReader
reader_corpus = WordListCorpusReader('.', ['list'])
reader_corpus.words()
Output
['tutorialspoint', 'Online', 'Free', 'Tutorials']
Creating POS tagged word corpus
NLTK has a TaggedCorpusReader class with the help of which we can create a POS tagged word corpus. Actually, POS tagging is the process of identifying the part-of-speech tag for a word.
One of the simplest formats for a tagged corpus is of the form 'word/tag', like the following excerpt from the Brown corpus −
The/at-tl expense/nn and/cc time/nn involved/vbn are/ber astronomical/jj ./.
In the above excerpt, each word has a tag which denotes its POS. For example, vb refers to a verb.
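We do not have to split such word/tag strings by hand; a minimal sketch using NLTK's str2tuple helper, which splits at the last '/' and uppercases the tag −
from nltk.tag import str2tuple
# Split a 'word/tag' string into a (word, TAG) tuple.
str2tuple('The/at-tl')    # ('The', 'AT-TL')
str2tuple('expense/nn')   # ('expense', 'NN')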
Now let us instantiate a TaggedCorpusReader class producing POS tagged words from the file 'list.pos', which has the above excerpt −
from nltk.corpus.reader import TaggedCorpusReader
reader_corpus = TaggedCorpusReader('.', r'.*\.pos')
reader_corpus.tagged_words()
Output
[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ...]
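The reader is not limited to tagged words; a short sketch (against the same 'list.pos' file) showing plain tokens and whole tagged sentences −
from nltk.corpus.reader import TaggedCorpusReader
reader_corpus = TaggedCorpusReader('.', r'.*\.pos')
# Tokens without their tags.
reader_corpus.words()
# One list of (word, tag) tuples per sentence.
reader_corpus.tagged_sents()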
Creating Chunked phrase corpus
NLTK has a ChunkedCorpusReader class with the help of which we can create a chunked phrase corpus. Actually, a chunk is a short phrase in a sentence.
For example, we have the following excerpt from the tagged treebank corpus −
[Earlier/JJR staff-reduction/NN moves/NNS] have/VBP trimmed/VBN about/IN [300/CD jobs/NNS] ,/, [the/DT spokesman/NN] said/VBD ./.
In the above excerpt, every chunk is a noun phrase but the words that are not in brackets are part of the sentence tree and not part of any noun phrase subtree.
Now let us instantiate a ChunkedCorpusReader class producing chunked phrases from the file 'list.chunk', which has the above excerpt −
from nltk.corpus.reader import ChunkedCorpusReader
reader_corpus = ChunkedCorpusReader('.', r'.*\.chunk')
reader_corpus.chunked_words()
Output
[ Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]), ('have', 'VBP'), ... ]
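Besides chunked_words(), the reader can also return each sentence as a whole tree; a short sketch against the same 'list.chunk' file −
from nltk.corpus.reader import ChunkedCorpusReader
reader_corpus = ChunkedCorpusReader('.', r'.*\.chunk')
# Each sentence is a Tree whose 'NP' subtrees are the chunks;
# words outside brackets hang directly off the sentence root.
reader_corpus.chunked_sents()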
Creating Categorized text corpus
NLTK has a CategorizedPlaintextCorpusReader class with the help of which we can create a categorized text corpus. It is very useful when we have a large corpus of text and want to categorize it into separate sections.
For example, the Brown corpus has several different categories. Let us find them out with the help of the following Python code −
from nltk.corpus import brown
brown.categories()
Output
[ 'adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction' ]
One of the easiest ways to categorize a corpus is to have one file for every category. For example, let us see two excerpts from the movie_reviews corpus −
movie_pos.txt
The thin red line is flawed but it provokes.
movie_neg.txt
A big-budget and glossy production cannot make up for a lack of spontaneity that permeates their tv show.
So, from the above two files, we have two categories, namely pos and neg.
Now let us instantiate a CategorizedPlaintextCorpusReader class.
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
reader_corpus = CategorizedPlaintextCorpusReader('.', r'movie_.*\.txt', cat_pattern = r'movie_(\w+)\.txt')
reader_corpus.categories()
reader_corpus.fileids(categories = ['neg'])
reader_corpus.fileids(categories = ['pos'])
Output
['neg', 'pos']
['movie_neg.txt']
['movie_pos.txt']
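Once the categories are in place, most reader methods accept a categories argument; a short sketch against the same reader −
from nltk.corpus.reader import CategorizedPlaintextCorpusReader
reader_corpus = CategorizedPlaintextCorpusReader('.', r'movie_.*\.txt', cat_pattern = r'movie_(\w+)\.txt')
# Restrict the word list to a single category.
reader_corpus.words(categories = ['pos'])   # words from movie_pos.txt only
reader_corpus.words(categories = ['neg'])   # words from movie_neg.txt only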