What is Document Clustering Analysis?
Document clustering is one of the most important techniques for organizing documents in an unsupervised manner. When documents are represented as term vectors, standard clustering methods can be applied. However, the document space is always of very high dimensionality, ranging from several hundred to thousands of dimensions.
Due to the curse of dimensionality, it makes sense to first project the documents into a lower-dimensional subspace in which the semantic structure of the document space becomes clear. In this low-dimensional semantic space, traditional clustering algorithms can then be applied.
Several methods of document clustering analysis are described below.
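To make the idea of term vectors concrete, here is a minimal sketch that turns a few short texts into TF-IDF vectors using only the Python standard library. The function name `tfidf_vectors`, the whitespace tokenizer, and the smoothed IDF formula are assumptions chosen for illustration; many other weighting schemes exist.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF term vectors) for a list of strings."""
    # Naive tokenizer: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF, one common variant among several.
        vec = [tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
               for t in vocab]
        vectors.append(vec)
    return vocab, vectors

docs = [
    "clustering groups similar documents",
    "spectral clustering uses eigenvectors",
    "topic models describe documents",
]
vocab, vectors = tfidf_vectors(docs)
print(len(vocab), "terms;", len(vectors), "vectors")
```

Each document becomes one vector with one entry per vocabulary term, so even this toy corpus shows how quickly the dimensionality grows with the vocabulary.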
Spectral clustering − The spectral clustering method first performs spectral embedding (dimensionality reduction) on the original data, and then applies the traditional clustering algorithm (e.g., k-means) on the reduced document space.
Spectral clustering can handle highly nonlinear data (data whose space has high curvature in every local area). Its strong connections to differential geometry make it capable of discovering the manifold structure of the document space.
A major limitation of these spectral clustering algorithms is that they use a nonlinear embedding (dimensionality reduction) that is defined only on the “training” data. They have to use all of the data points to learn the embedding. When the data set is very large, learning such an embedding is computationally expensive, which restricts the application of spectral clustering to large data sets.
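The two stages described above (spectral embedding, then k-means on the reduced space) can be sketched with NumPy as follows. This is one common variant built on a Gaussian affinity and the symmetric normalized Laplacian, not necessarily the formulation the text has in mind; the function names, the `sigma` parameter, the farthest-point initialization, and the toy data are all assumptions for illustration.

```python
import numpy as np

def kmeans(points, k, n_iter=50):
    """Plain Lloyd's k-means; returns one integer label per row."""
    # Deterministic farthest-point initialization (a simplifying assumption).
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def spectral_clustering(X, k, sigma=1.0):
    """Embed rows of X with Laplacian eigenvectors, then run k-means."""
    # Pairwise Gaussian affinities: needs every pair of points, O(n^2) memory.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    embedding = eigvecs[:, :k]               # k smallest eigenvectors
    norms = np.linalg.norm(embedding, axis=1, keepdims=True)
    embedding = embedding / np.maximum(norms, 1e-12)
    return kmeans(embedding, k)

# Toy stand-in for document vectors: two well-separated groups.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [5, 5], [5, 6], [6, 5], [6, 6]], dtype=float)
labels = spectral_clustering(X, k=2)
print(labels)
```

The affinity matrix `W` and the eigendecomposition both involve every data point, which is where the cost on large data sets comes from, and the embedding gives no direct way to place a document that was not part of `X`.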
Mixture model − The mixture model clustering method models the text data with a mixture model, often involving multinomial component models. Clustering involves two steps as follows −
Estimate the model parameters based on the text data and any additional prior knowledge.
Infer the clusters based on the estimated model parameters. Depending on how the mixture model is defined, these methods can cluster words and documents at the same time.
Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA) are two examples of such approaches. One benefit of these methods is that the clusters can be designed to support comparative analysis of documents.
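The two steps above can be sketched with a simple mixture of multinomials fitted by expectation-maximization (EM). This is a deliberately simpler model than PLSA or LDA and is shown only to illustrate the estimate-then-infer pattern; the function name, the smoothing constant `alpha`, the random initialization, and the toy count matrix are assumptions.

```python
import numpy as np

def multinomial_mixture_em(counts, k, n_iter=100, seed=0, alpha=1e-2):
    """EM for a mixture of multinomials on a documents-by-terms count matrix."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = counts.shape
    # Start from random soft assignments of documents to clusters.
    resp = rng.dirichlet(np.ones(k), size=n_docs)        # shape (n_docs, k)
    for _ in range(n_iter):
        # Step 1 (estimation): mixing weights and per-cluster term
        # distributions, with a small additive smoothing term alpha.
        pi = (resp.sum(axis=0) + alpha) / (n_docs + k * alpha)
        theta = resp.T @ counts + alpha                  # shape (k, n_terms)
        theta /= theta.sum(axis=1, keepdims=True)
        # Posterior cluster probabilities per document, in log space.
        log_post = np.log(pi)[None, :] + counts @ np.log(theta).T
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
    # Step 2 (inference): assign each document to its most probable cluster.
    return resp.argmax(axis=1), pi, theta

# Toy counts: documents 0-2 use terms 0-2, documents 3-5 use terms 3-5.
counts = np.array([
    [5, 3, 2, 0, 0, 0],
    [4, 4, 2, 0, 0, 0],
    [6, 2, 2, 0, 0, 0],
    [0, 0, 0, 3, 5, 2],
    [0, 0, 0, 2, 6, 2],
    [0, 0, 0, 4, 4, 2],
], dtype=float)
labels, pi, theta = multinomial_mixture_em(counts, k=2)
print(labels)
```

Each row of `theta` is a distribution over terms, so the same fitted model describes which words characterize a cluster as well as which documents belong to it.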
Latent Semantic Indexing (LSI) and Locality Preserving Indexing (LPI) are linear dimensionality reduction methods. In both LSI and LPI, the transformation vectors (the embedding function) can be obtained explicitly. Such embedding functions are defined everywhere; thus, part of the data can be used to learn the embedding function, which is then used to embed all of the data into the low-dimensional space.
The aim of LSI is to find the best subspace approximation to the original document space in the sense of minimizing the global reconstruction error. In other words, LSI seeks to uncover the most representative features rather than the most discriminative features for document representation. Therefore, LSI might not be optimal for discriminating between documents with different semantics, which is the ultimate goal of clustering.
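Because a linear embedding is just a matrix, it can be learned on a sample and applied to every document. The sketch below illustrates this for LSI with a truncated SVD in NumPy; the helper names `fit_lsi` and `embed`, the sample size, and the random stand-in for a term-weight matrix are assumptions, and LPI is not shown.

```python
import numpy as np

def fit_lsi(X_sample, k):
    """Learn k LSI directions from a documents-by-terms sample matrix."""
    # Rows of Vt are orthonormal directions in term space, ordered by
    # decreasing singular value; keep the top k.
    _, _, Vt = np.linalg.svd(X_sample, full_matrices=False)
    return Vt[:k].T                      # shape (n_terms, k)

def embed(X, components):
    """Project any documents-by-terms matrix onto the learned directions."""
    return X @ components

rng = np.random.default_rng(0)
X_all = rng.random((1000, 50))           # synthetic stand-in for term vectors
sample = X_all[rng.choice(1000, size=100, replace=False)]

components = fit_lsi(sample, k=5)        # learn on 100 documents
Z = embed(X_all, components)             # embed all 1000 documents
print(Z.shape)
```

The rows of `Z` can then be passed to any traditional clustering algorithm, and new documents can be embedded later with the same `components` matrix.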
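The reconstruction-error criterion can be checked numerically. The following sketch, on a synthetic matrix (an assumption, not real document data), compares the rank-k reconstruction from a truncated SVD with a projection onto a randomly chosen k-dimensional subspace, measuring both with the Frobenius norm.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((30, 12))                 # synthetic documents-by-terms matrix
k = 3

# Rank-k reconstruction from the top k singular triplets (the LSI subspace).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = (U[:, :k] * s[:k]) @ Vt[:k]
lsi_error = np.linalg.norm(X - X_k)      # Frobenius norm

# Reconstruction from a random orthonormal k-dimensional term subspace.
Q, _ = np.linalg.qr(rng.standard_normal((12, k)))
random_error = np.linalg.norm(X - X @ Q @ Q.T)

print(lsi_error <= random_error)
print(np.isclose(lsi_error, np.sqrt((s[k:] ** 2).sum())))
```

The error depends only on the discarded singular values, and nothing in the criterion refers to cluster membership, which is why a subspace that reconstructs well is not guaranteed to separate documents well.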