Python - Corpora Access


Advertisements

Corpora is a group presenting multiple collections of text documents. A single collection is called corpus. One such famous corpus is the Gutenberg Corpus which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/. In the below example we access the names of only those files from the corpus which are plain text with filename ending as .txt.

from nltk.corpus import gutenberg
fields = gutenberg.fileids()

print(fields)

When we run the above program, we get the following output −

[austen-emma.txt', austen-persuasion.txt', austen-sense.txt', bible-kjv.txt', 
blake-poems.txt', bryant-stories.txt', burgess-busterbrown.txt',
carroll-alice.txt', chesterton-ball.txt', chesterton-brown.txt', 
chesterton-thursday.txt', edgeworth-parents.txt', melville-moby_dick.txt',
milton-paradise.txt', shakespeare-caesar.txt', shakespeare-hamlet.txt',
shakespeare-macbeth.txt', whitman-leaves.txt']

Accessing Raw Text

We can access the raw text from these files using sent_tokenize function which is also available in nltk. In the below example we retrieve the first two paragraphs of the blake poen text.

from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg

sample = gutenberg.raw("blake-poems.txt")

token = sent_tokenize(sample)

for para in range(2):
    print(token[para])

When we run the above program we get the following output −

[Poems by William Blake 1789]

 
SONGS OF INNOCENCE AND OF EXPERIENCE
and THE BOOK of THEL


 SONGS OF INNOCENCE
 
 
 INTRODUCTION
 
 Piping down the valleys wild,
   Piping songs of pleasant glee,
 On a cloud I saw a child,
   And he laughing said to me:
 
 "Pipe a song about a Lamb!"
So I piped with merry cheer.

Useful Video Courses


Video

Python Online Training

187 Lectures 17.5 hours

Malhar Lathkar

Video

Python Essentials Online Training

55 Lectures 8 hours

Arnab Chakraborty

Video

Learn Python Programming in 100 Easy Steps

136 Lectures 11 hours

In28Minutes Official

Video

Python with Data Science

Best Seller

75 Lectures 13 hours

Eduonix Learning Solutions

Video

Python 3 from scratch to become a developer in demand

Best Seller

70 Lectures 8.5 hours

Lets Kode It

Video

Python Data Science basics with Numpy, Pandas and Matplotlib

Most Popular

63 Lectures 6 hours

Abhilash Nelson

Advertisements