What is Document Clustering Analysis?

Document clustering is an important technique for organizing documents in an unsupervised manner. Once documents are represented as term vectors, clustering methods can be applied. The document space is typically of high dimensionality, ranging from several hundred to thousands of terms.

Due to the curse of dimensionality, it makes sense to first project the documents into a lower-dimensional subspace in which the semantic structure of the document space becomes clear. In this low-dimensional semantic space, traditional clustering algorithms can then be applied.

Several methods of document clustering analysis are as follows −

Spectral clustering − The spectral clustering method first performs spectral embedding (dimensionality reduction) on the original data, and then applies a traditional clustering algorithm (e.g., k-means) in the reduced document space.
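As a minimal sketch of this two-step idea, assuming scikit-learn is available (the toy corpus, cosine affinity, and parameter choices below are illustrative assumptions, not part of the original text):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import SpectralClustering

# Toy corpus: two topics with disjoint vocabularies (illustrative data)
docs = [
    "cat kitten pet",
    "cat pet animal",
    "kitten animal pet",
    "stock market finance",
    "finance market trading",
    "stock trading finance",
]

# Represent the documents as TF-IDF term vectors
X = TfidfVectorizer().fit_transform(docs)

# Spectral embedding followed by k-means on the embedded space,
# using cosine similarity between term vectors as the affinity
model = SpectralClustering(n_clusters=2, affinity="cosine", random_state=0)
labels = model.fit_predict(X)
print(labels)
```

With such cleanly separated topics, the two groups of documents end up in different clusters; on real corpora the affinity choice and number of clusters would need tuning.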

Spectral clustering can handle highly nonlinear data (data spaces with high curvature in every local area). Its strong connections to differential geometry make it capable of discovering the manifold structure of the document space.

A limitation of these spectral clustering algorithms is that the nonlinear embedding (dimensionality reduction) is defined only on the "training" data: all data points must be used to learn the embedding. When the data set is large, learning such an embedding is computationally expensive, which restricts the application of spectral clustering to large data sets.

Mixture model − The mixture model clustering method models the text data with a mixture model, often involving multinomial component models. Clustering then involves two steps −

Estimating the model parameters based on the text data and any additional prior knowledge.

Inferring the clusters based on the estimated model parameters. Depending on how the mixture model is defined, these methods can cluster words and documents at the same time.

Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) are two instances of such approaches. The benefit of these clustering methods is that the clusters can be designed to support comparative analysis of documents.
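The two steps above can be sketched with scikit-learn's LatentDirichletAllocation (the toy corpus and parameter settings are illustrative assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus (illustrative; any raw text collection would do)
docs = [
    "cat kitten pet animal",
    "pet animal cat",
    "stock market finance trading",
    "finance stock market",
]

# Step 1: estimate the mixture-model parameters from term counts
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)  # per-document topic mixture

# Step 2: infer a cluster for each document from the estimated parameters
clusters = doc_topic.argmax(axis=1)
```

Each row of `doc_topic` is a probability distribution over topics, so assigning a document to its highest-probability topic yields a hard clustering while the full mixture remains available for comparative analysis.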

Latent Semantic Indexing (LSI) and Locality Preserving Indexing (LPI) are linear dimensionality reduction methods; each learns a set of transformation vectors (an embedding function). Such embedding functions are defined everywhere; thus, one can use part of the data to learn the embedding function and then embed any new data into the low-dimensional space.
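This out-of-sample property can be illustrated with a small sketch, assuming scikit-learn, where TruncatedSVD on TF-IDF vectors serves as a common realization of LSI (the corpus and dimension choices are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

train_docs = [
    "cat kitten pet",
    "pet animal cat",
    "stock market finance",
    "finance stock trading",
]
new_docs = ["kitten pet", "market trading"]  # unseen documents

vec = TfidfVectorizer().fit(train_docs)
svd = TruncatedSVD(n_components=2, random_state=0)

# Learn the linear embedding function from part of the data only
svd.fit(vec.transform(train_docs))

# The learned transformation is defined everywhere, so it can embed
# documents that were not used to learn it
Z = svd.transform(vec.transform(new_docs))
print(Z.shape)  # (2, 2): two new documents in the 2-dimensional subspace
```

A nonlinear spectral embedding, by contrast, would have to be recomputed to place these new documents.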

The aim of LSI is to find the best subspace approximation to the original document space in the sense of minimizing the global reconstruction error. In other words, LSI seeks the most representative features rather than the most discriminative features for document representation. Therefore, LSI might not be optimal for discriminating documents with different semantics, which is the ultimate goal of clustering.