Gensim - Documents & Corpus
Here, we shall learn about the core concepts of Gensim, with the main focus on documents and the corpus.
Core Concepts of Gensim
Following are the core concepts and terms that are needed to understand and use Gensim −
Document − It refers to some text.
Corpus − It refers to a collection of documents.
Vector − A mathematical representation of a document is called a vector.
Model − It refers to an algorithm used for transforming vectors from one representation to another.
What is a Document?
As discussed, a document refers to some text. To go into some detail, it is an object of the text sequence type, known as 'str' in Python 3. For example, in Gensim a document can be anything such as −
- Short tweet of 140 characters
- Single paragraph, i.e. article or research paper abstract
- News article
In Python, textual data is handled with strings, or more specifically 'str' objects. Strings are immutable sequences of Unicode code points and can be written in the following ways −
Single quotes − For example, 'Hi! How are you?'. It also allows us to embed double quotes. For example, 'Hi! "How" are you?'
Double quotes − For example, "Hi! How are you?". It also allows us to embed single quotes. For example, "Hi! 'How' are you?"
Triple quotes − It can have either three single quotes, like '''Hi! How are you?''', or three double quotes, like """Hi! 'How' are you?"""
All whitespace is included in the string literal.
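The three quoting styles above can be tried out directly in Python (the greetings below are just sample text) −

```python
# Single quotes: double quotes can be embedded without escaping
single = 'Hi! "How" are you?'

# Double quotes: single quotes can be embedded without escaping
double = "Hi! 'How' are you?"

# Triple quotes: may span multiple lines; all whitespace is kept
triple = """Hi!
How are you?"""

print(type(single))   # <class 'str'>
print(triple)
```

All three produce ordinary 'str' objects; only the quoting syntax differs.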
Following is an example of a Document in Gensim −
```python
Document = "Tutorialspoint.com is the biggest online tutorials library and it's all free also"
```
What is a Corpus?
A corpus may be defined as a large and structured set of machine-readable texts produced in a natural communicative setting. In Gensim, a collection of document objects is called a corpus. The plural of corpus is corpora.
Role of Corpus in Gensim
A corpus in Gensim serves the following two roles −
Serves as Input for Training a Model
The very first and most important role a corpus plays in Gensim is as an input for training a model. In order to initialize the model's internal parameters, during training the model looks for some common themes and topics in the training corpus. As discussed above, Gensim focuses on unsupervised models, hence it doesn't require any kind of human intervention.
Serves as Topic Extractor
Once the model is trained, it can be used to extract topics from the new documents. Here, the new documents are the ones that are not used in the training phase.
The corpus can include all the tweets by a particular person, a list of all the articles of a newspaper, all the research papers on a particular topic, and so on.
Following is an example of a small corpus which contains 5 documents. Here, every document is a string consisting of a single sentence.
```python
t_corpus = [
   "A survey of user opinion of computer system response time",
   "Relation of user perceived response time to error measurement",
   "The generation of random binary unordered trees",
   "The intersection graph of paths in trees",
   "Graph minors IV Widths of trees and well quasi ordering",
]
```
Preprocessing the Collected Corpus
Once we collect the corpus, a few preprocessing steps should be taken to keep it simple. For example, we can remove some commonly used English words like 'the'. We can also remove words that occur only once in the corpus.
For example, the following Python script is used to lowercase each document, split it by white space and filter out stop words −
```python
import pprint

t_corpus = [
   "A survey of user opinion of computer system response time",
   "Relation of user perceived response time to error measurement",
   "The generation of random binary unordered trees",
   "The intersection graph of paths in trees",
   "Graph minors IV Widths of trees and well quasi ordering",
]

# Common English words to filter out
stoplist = set('for a of the and to in'.split(' '))

# Lowercase each document, split it by white space and filter out stopwords
processed_corpus = [
   [word for word in document.lower().split() if word not in stoplist]
   for document in t_corpus
]
pprint.pprint(processed_corpus)
```
```
[['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering']]
```
Gensim also provides a function for more effective preprocessing of the corpus: gensim.utils.simple_preprocess(doc, deacc=False, min_len=2, max_len=15). It converts a document into a list of lowercase tokens and ignores tokens that are too short or too long. It has the following parameters −
doc − It refers to the input document on which preprocessing should be applied.
deacc − This parameter is used to remove the accent marks from tokens. It uses deaccent() to do this.
min_len − With the help of this parameter, we can set the minimum length of a token. Tokens shorter than the defined length will be discarded.
max_len − With the help of this parameter, we can set the maximum length of a token. Tokens longer than the defined length will be discarded.
The output of this function would be the list of tokens extracted from the input document.