Gensim - Introduction



This chapter will help you understand history and features of Gensim along with its uses and advantages.

What is Gensim?

Gensim = “Generate Similar” is a popular open source natural language processing (NLP) library used for unsupervised topic modeling. It uses top academic models and modern statistical machine learning to perform various complex tasks such as −

  • Building document or word vectors
  • Corpora
  • Performing topic identification
  • Performing document comparison (retrieving semantically similar documents)
  • Analysing plain-text documents for semantic structure

Apart from performing the above complex tasks, Gensim, implemented in Python and Cython, is designed to handle large text collections using data streaming as well as incremental online algorithms. This makes it different from those machine learning software packages that target only in-memory processing.

History

In 2008, Gensim started off as a collection of various Python scripts for the Czech Digital Mathematics. There, it served to generate a short list of the most similar articles to a particular given article. But in 2009, RARE Technologies Ltd. released its initial release. Then, later in July 2019, we got its stable release (3.8.0).

Various Features

Following are some of the features and capabilities offered by Gensim −

Scalability

Gensim can easily process large and web-scale corpora by using its incremental online training algorithms. It is scalable in nature, as there is no need for the whole input corpus to reside fully in Random Access Memory (RAM) at any one time. In other words, all its algorithms are memory-independent with respect to the corpus size.

Robust

Gensim is robust in nature and has been in use in various systems by various people as well as organisations for over 4 years. We can easily plug in our own input corpus or data stream. It is also very easy to extend with other Vector Space Algorithms.

Platform Agnostic

As we know that Python is a very versatile language as being pure Python Gensim runs on all the platforms (like Windows, Mac OS, Linux) that supports Python and Numpy.

Efficient Multicore Implementations

In order to speed up processing and retrieval on machine clusters, Gensim provides efficient multicore implementations of various popular algorithms like Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP).

Open Source and Abundance of Community Support

Gensim is licensed under the OSI-approved GNU LGPL license which allows it to be used for both personal as well as commercial use for free. Any modifications made in Gensim are in turn open-sourced and has abundance of community support too.

Uses of Gensim

Gensim has been used and cited in over thousand commercial and academic applications. It is also cited by various research papers and student theses. It includes streamed parallelised implementations of the following −

fastText

fastText, uses a neural network for word embedding, is a library for learning of word embedding and text classification. It is created by Facebook’s AI Research (FAIR) lab. This model, basically, allows us to create a supervised or unsupervised algorithm for obtaining vector representations for words.

Word2vec

Word2vec, used to produce word embedding, is a group of shallow and two-layer neural network models. The models are basically trained to reconstruct linguistic contexts of words.

LSA (Latent Semantic Analysis)

It is a technique in NLP (Natural Language Processing) that allows us to analyse relationships between a set of documents and their containing terms. It is done by producing a set of concepts related to the documents and terms.

LDA (Latent Dirichlet Allocation)

It is a technique in NLP that allows sets of observations to be explained by unobserved groups. These unobserved groups explain, why some parts of the data are similar. That’s the reason, it is a generative statistical model.

tf-idf (term frequency-inverse document frequency)

tf-idf, a numeric statistic in information retrieval, reflects how important a word is to a document in a corpus. It is often used by search engines to score and rank a document’s relevance given a user query. It can also be used for stop-words filtering in text summarisation and classification.

All of them will be explained in detail in the next sections.

Advantages

Gensim is a NLP package that does topic modeling. The important advantages of Gensim are as follows −

  • We may get the facilities of topic modeling and word embedding in other packages like ‘scikit-learn’ and ‘R’, but the facilities provided by Gensim for building topic models and word embedding is unparalleled. It also provides more convenient facilities for text processing.

  • Another most significant advantage of Gensim is that, it let us handle large text files even without loading the whole file in memory.

  • Gensim doesn’t require costly annotations or hand tagging of documents because it uses unsupervised models.

Advertisements