This chapter deals with topic modeling in Gensim.
Computational linguistic algorithms are among the best methods for annotating our data and understanding sentence structure. With their help we can certainly understand some finer details about our data, but:
Can we know what kind of words appear more often than others in our corpus?
Can we group our data?
Can we find underlying themes in our data?
We can achieve all of these with the help of topic modeling. So let’s dive deep into the concept of topic models.
A topic model may be defined as a probabilistic model containing information about the topics in our text. Here, two important questions arise:
First, what exactly is a topic?
A topic, as the name implies, is an underlying idea or theme represented in our text. For example, a corpus of newspaper articles would have topics related to finance, weather, politics, sports, news from various states, and so on.
Second, what is the importance of topic models in text processing?
As we know, to identify similarity in text we can apply information retrieval and search techniques based on words. With the help of topic models, we can instead search and arrange our text files using topics rather than words.
In this sense, we can say that topics are probability distributions over words. That’s why, by using topic models, we can describe our documents as probability distributions over topics.
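To make the two distributions concrete, here is a minimal sketch in plain Python. The topic names, vocabulary, and every probability are invented for illustration and do not come from any trained model. It shows a topic as a distribution over words, a document as a distribution over topics, and how the two combine into the probability of a word in that document.

```python
# Toy illustration: topics as distributions over words, documents as
# distributions over topics. All numbers below are invented for the example.

# Each topic assigns a probability to every word in a tiny vocabulary.
topics = {
    "finance": {"bank": 0.5, "market": 0.4, "match": 0.1},
    "sports": {"bank": 0.05, "market": 0.05, "match": 0.9},
}

# A document is described by how much of each topic it contains.
doc_topic_mix = {"finance": 0.7, "sports": 0.3}


def word_probability(word, topic_mix, topic_word):
    """Probability of `word` in a document: the sum over topics of
    P(topic | document) * P(word | topic)."""
    return sum(
        weight * topic_word[topic].get(word, 0.0)
        for topic, weight in topic_mix.items()
    )


# Word distribution implied by the document's topic mixture.
doc_word_dist = {
    word: word_probability(word, doc_topic_mix, topics)
    for word in ["bank", "market", "match"]
}
print(doc_word_dist)
```

Because each topic's word probabilities sum to 1 and the topic weights sum to 1, the resulting word probabilities for the document also sum to 1.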
As discussed above, topic modeling focuses on underlying ideas and themes. Its main uses are as follows:
Topic models can be used for text summarisation.
They can be used to organise documents. For example, we can use topic modeling to group news articles into an organised, interconnected section, such as one containing all the news articles related to cricket.
They can improve search results. How? For a search query, we can use topic models to reveal documents that contain a mix of different keywords but are about the same idea.
The concept of recommendations is very useful for marketing. It is used by various online shopping websites, news websites and many more. Topic models help in making recommendations about what to buy, what to read next, and so on. They do so by finding materials that share a common topic.
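The search and recommendation uses both come down to comparing documents by their topic mixtures rather than by their exact words. The following is a minimal sketch of that idea, assuming each document has already been reduced to a topic distribution. The document names, the three topics, and all the numbers are hypothetical, and cosine similarity is only one of several measures that could be used here.

```python
import math

# Hypothetical topic distributions for three documents over the topics
# (cricket, finance, politics). The numbers are made up for illustration.
doc_topics = {
    "match_report": [0.90, 0.05, 0.05],
    "player_interview": [0.80, 0.05, 0.15],
    "budget_analysis": [0.05, 0.80, 0.15],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(name, vectors):
    """Return the other document whose topic mix is closest to `name`."""
    return max(
        (other for other in vectors if other != name),
        key=lambda other: cosine_similarity(vectors[name], vectors[other]),
    )


print(most_similar("match_report", doc_topics))
```

The two cricket-heavy documents are matched even if they share few keywords, which is the behaviour a keyword-only search would miss.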
Gensim is one of the most popular topic modeling toolkits. Its free availability and Python implementation add to its popularity. In this section, we will discuss some of the most popular topic modeling algorithms. We will focus on the ‘what’ rather than the ‘how’, because Gensim abstracts them very well for us.
Latent Dirichlet allocation (LDA) is the most common and popular technique currently in use for topic modeling. It is the one that Facebook researchers used in their research paper published in 2013. It was first proposed by David Blei, Andrew Ng, and Michael Jordan in 2003, in a paper entitled simply Latent Dirichlet Allocation.
Let’s learn more about this technique through its characteristics:
Probabilistic topic modeling technique
LDA is a probabilistic topic modeling technique. As discussed above, in topic modeling we assume that in any collection of interrelated documents (academic papers, newspaper articles, Facebook posts, tweets, e-mails, and so on), each document contains some combination of topics.
The main goal of probabilistic topic modeling is to discover the hidden topic structure of a collection of interrelated documents. A topic structure generally includes the following:
Statistical distribution of topics among the documents
Words across a document comprising the topic
Works in an unsupervised way
LDA works in an unsupervised way: it needs no labeled examples and instead uses conditional probabilities to discover the hidden topic structure. It assumes that the topics are unevenly distributed throughout the collection of interrelated documents.
Very easy to create in Gensim
In Gensim, it is very easy to create an LDA model. We just need to specify the corpus, the dictionary mapping, and the number of topics we would like to use in our model.
from gensim import models
lda_model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)
May face a computationally intractable problem
Calculating the probability of every possible topic structure is a computational challenge for LDA, because it requires computing the probability of every observed word under every possible topic structure. With a large number of topics and words, the problem may become computationally intractable.
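A rough way to see the scale of the problem is to count how many ways topics can be assigned to the words of a single document. The function below is a hypothetical helper written for this illustration. It is not part of Gensim and it is not how LDA inference is carried out. It only shows how quickly the number of candidate assignments grows.

```python
def count_topic_assignments(num_topics, num_words):
    """Number of ways to assign one of `num_topics` topics to each of
    `num_words` word positions in a single document."""
    return num_topics ** num_words


# With 10 topics, the count grows by a factor of 10 for every extra word.
for num_words in (10, 50, 100):
    print(num_words, count_topic_assignments(10, num_words))
```

Enumerating that many assignments is not feasible, which is why practical implementations rely on approximate inference instead of an exact calculation.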
The topic modeling algorithm that was first implemented in Gensim, along with Latent Dirichlet Allocation (LDA), is Latent Semantic Indexing (LSI). It is also called Latent Semantic Analysis (LSA).
It was patented in 1988 by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum, and Lynn Streeter. In this section we are going to set up our LSI model. It can be done in the same way as setting up the LDA model. We need to import the LSI model from gensim.models.
LSI is a technique in NLP, especially in distributional semantics. It analyzes the relationship between a set of documents and the terms these documents contain. In terms of how it works, it constructs a matrix that contains word counts per document from a large piece of text.
Once the matrix is constructed, the LSI model uses a mathematical technique called singular value decomposition (SVD) to reduce the number of rows while preserving the similarity structure among the columns. In the matrix, the rows represent unique words and the columns represent documents. LSI is based on the distributional hypothesis, i.e. it assumes that words that are close in meaning will occur in similar kinds of text.
from gensim import models
lsi_model = models.LsiModel(corpus, id2word=dictionary, num_topics=100)
Topic models such as LDA and LSI help in summarizing and organizing large archives of text that are not possible to analyze by hand. Apart from LDA and LSI, another powerful topic model in Gensim is HDP (Hierarchical Dirichlet Process). It is basically a mixed-membership model for unsupervised analysis of grouped data. Unlike LDA (its finite counterpart), HDP infers the number of topics from the data.