Gensim - Documents & Corpus

Here, we shall learn about the core concepts of Gensim, with the main focus on documents and the corpus.

Core Concepts of Gensim

Following are the core concepts and terms that are needed to understand and use Gensim −

Document − It refers to some text.
Corpus − It refers to a collection of documents.
Vector − It refers to the mathematical representation of a document.
Model − It refers to an algorithm used for transforming vectors from one representation to another.

What is Document?

As discussed, a document refers to some text. More precisely, it is an object of the text sequence type, which is known as 'str' in Python 3. In Gensim, a document can be anything such as −

Short tweet of 140 characters
Single paragraph, e.g. an article or research paper abstract
News article
Book
Novel
Thesis

Text Sequence

The text sequence type is known as 'str' in Python 3. In Python, textual data is handled with strings, or more specifically 'str' objects. Strings are immutable sequences of Unicode code points and can be written in the following ways −

Single quotes − For example, 'Hi! How are you?'. Single quotes also allow us to embed double quotes. For example, 'Hi! "How" are you?'
Double quotes − For example, "Hi! How are you?". Double quotes also allow us to embed single quotes. For example, "Hi! 'How' are you?"
Triple quotes − It can have either three single quotes like '''Hi! How are you?''' or three double quotes like """Hi! 'How' are you?"""

Triple-quoted strings may span multiple lines, and all the whitespace inside them is included in the string literal.
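
The quoting styles listed above can be tried out with a short sketch like the following. The variable names single, double and triple are only illustrative and do not come from Gensim.

```python
# Single quotes can hold double quotes without escaping
single = 'Hi! "How" are you?'

# Double quotes can hold single quotes without escaping
double = "Hi! 'How' are you?"

# Triple quotes may span lines; the newline and the indentation are kept
triple = """Hi!
   How are you?"""

print(single)
print(double)
print(repr(triple))  # repr() makes the embedded newline and spaces visible
```

Printing the triple-quoted string with repr() is a convenient way to check that the line break and the leading spaces really are part of the string.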

Example

Following is an example of a Document in Gensim −

document = "Tutorialspoint.com is the biggest online tutorials library and it's all free also"

What is Corpus?

A corpus may be defined as a large and structured set of machine-readable texts produced in a natural communicative setting. In Gensim, a collection of document objects is called a corpus. The plural of corpus is corpora.

Role of Corpus in Gensim

A corpus in Gensim serves the following two roles −

Serves as Input for Training a Model

The first and most important role a corpus plays in Gensim is as an input for training a model. During training, the model looks for common themes and topics in the training corpus and uses them to initialize its internal parameters. As discussed earlier, Gensim focuses on unsupervised models, hence it doesn't require any kind of human intervention.

Serves as Topic Extractor

Once the model is trained, it can be used to extract topics from new documents. Here, new documents are the ones that were not used in the training phase.

Example

A corpus can include all the tweets by a particular person, all the articles of a newspaper, or all the research papers on a particular topic, etc.

Collecting Corpus

Following is an example of a small corpus which contains 5 documents. Here, every document is a string consisting of a single sentence.

t_corpus = [
   "A survey of user opinion of computer system response time",
   "Relation of user perceived response time to error measurement",
   "The generation of random binary unordered trees",
   "The intersection graph of paths in trees",
   "Graph minors IV Widths of trees and well quasi ordering",
]

Preprocessing the Collected Corpus

Once we collect the corpus, a few preprocessing steps should be taken to keep the corpus simple. We can remove some commonly used English words like 'the'. We can also remove words that occur only once in the corpus.

For example, the following Python script is used to lowercase each document, split it by white space and filter out stop words −

Example

import pprint

t_corpus = [
   "A survey of user opinion of computer system response time",
   "Relation of user perceived response time to error measurement",
   "The generation of random binary unordered trees",
   "The intersection graph of paths in trees",
   "Graph minors IV Widths of trees and well quasi ordering",
]

# Commonly used English words that should be filtered out
stoplist = set('for a of the and to in'.split(' '))

# Lowercase each document, split it by white space and drop the stop words
processed_corpus = [
   [word for word in document.lower().split() if word not in stoplist]
   for document in t_corpus
]

pprint.pprint(processed_corpus)

Output

[['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering']]
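
The script above only filters out stop words. The other step mentioned earlier, removing words that occur only once in the corpus, could be sketched as follows. This is a minimal illustration using Python's collections.Counter; the name frequent_corpus is an assumption made for this example, and the token lists are copied from the output above so that the snippet runs on its own.

```python
from collections import Counter

# Token lists taken from the output of the previous example
processed_corpus = [
    ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
    ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
    ['generation', 'random', 'binary', 'unordered', 'trees'],
    ['intersection', 'graph', 'paths', 'trees'],
    ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
]

# Count how often each token appears across the whole corpus
frequency = Counter(token for text in processed_corpus for token in text)

# Keep only the tokens that appear more than once
frequent_corpus = [
    [token for token in text if frequency[token] > 1]
    for text in processed_corpus
]

print(frequent_corpus)
```

For this corpus, only 'user', 'response', 'time', 'trees' and 'graph' occur more than once, so every document shrinks to a few tokens.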

Effective Preprocessing

Gensim also provides a function for more effective preprocessing of the corpus. With this kind of preprocessing, we can convert a document into a list of lowercase tokens and ignore tokens that are too short or too long. The function is gensim.utils.simple_preprocess(doc, deacc=False, min_len=2, max_len=15).

gensim.utils.simple_preprocess() function

Gensim provides this function to convert a document into a list of lowercase tokens and to ignore tokens that are too short or too long. It has the following parameters −

doc(str)

It refers to the input document on which preprocessing should be applied.

deacc(bool, optional)

When set to True, this parameter removes the accent marks from tokens. It uses deaccent() to do this.

min_len(int, optional)

With the help of this parameter, we can set the minimum length of a token. Tokens shorter than the defined length will be discarded.

max_len(int, optional)

With the help of this parameter, we can set the maximum length of a token. Tokens longer than the defined length will be discarded.

The output of this function is the list of tokens extracted from the input document.