How can automated document classification be performed?
Automated document classification is an important text mining task. Because a tremendous number of online documents exist, it is tedious yet essential to organize them automatically into classes to support document retrieval and subsequent analysis.
Document classification has been used in automated topic tagging (i.e., assigning labels to documents), topic directory construction, identification of document writing styles, and classification of the purposes of hyperlinks associated with a set of documents.
A general procedure is as follows: First, a set of preclassified documents is taken as the training set. The training set is then analyzed to derive a classification scheme. This scheme often needs to be refined through a testing phase. The resulting classification scheme can then be used to classify other online documents.
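The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample documents, the labels `finance` and `sports`, and the helper names `train`, `classify`, and `accuracy` are all hypothetical, and the "classification scheme" here is a deliberately simple per-class word count rather than any particular published method.

```python
from collections import Counter, defaultdict

def train(labeled_docs):
    """Derive a toy classification scheme: word counts per class label."""
    scheme = defaultdict(Counter)
    for text, label in labeled_docs:
        scheme[label].update(text.lower().split())
    return scheme

def classify(scheme, text):
    """Assign the class whose training words overlap most with the document."""
    words = text.lower().split()
    scores = {label: sum(counts[w] for w in words)
              for label, counts in scheme.items()}
    return max(scores, key=scores.get)

def accuracy(scheme, labeled_docs):
    """Fraction of held-out documents that receive their known label."""
    correct = sum(classify(scheme, text) == label for text, label in labeled_docs)
    return correct / len(labeled_docs)

# Hypothetical preclassified documents, split into training and testing sets.
training_set = [
    ("stock market shares rise", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins the football match", "sports"),
    ("player scores in the final match", "sports"),
]
testing_set = [
    ("market rates fall", "finance"),
    ("football player injured", "sports"),
]

scheme = train(training_set)                   # analyze the training set
test_accuracy = accuracy(scheme, testing_set)  # testing phase
print("accuracy on the testing set:", test_accuracy)
```

If the accuracy on the testing set is too low, the scheme would be revised (for example, with different features or more training documents) before being applied to new documents.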
This process resembles the classification of relational data. Relational data are well structured: each tuple is defined by a set of attribute-value pairs.
For instance, in the tuple {sunny, warm, dry, not windy, play tennis}, the value “sunny” corresponds to the attribute weather outlook, “warm” corresponds to the attribute temperature, and so on.
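As a rough illustration of how structured tuples lend themselves to analysis, the sketch below stores each tuple as attribute-value pairs and scores each attribute by information gain. Information gain is one common measure of how well an attribute separates the classes; it is assumed here for illustration and is not named in the text. The four rows and the attribute names `outlook`, `windy`, and `play` are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting rows on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical relational tuples, each a fixed set of attribute-value pairs.
rows = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no",  "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
best = max(gains, key=gains.get)
print(gains, "most discriminating attribute:", best)
```

In this toy data `outlook` fully determines the outcome while `windy` tells us nothing, so `outlook` receives the higher score.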
The classification analysis determines which set of attribute-value pairs has the greatest discriminating power in deciding whether a person is going to play tennis. Document databases, in contrast, are not structured according to attribute-value pairs.
That is, the set of keywords associated with a set of documents is not organized into a fixed set of attributes or dimensions. If we view each distinct keyword, term, or feature in the documents as a dimension, there may be thousands of dimensions in a set of documents. Thus, commonly used relational data-oriented classification methods, such as decision tree analysis, may not be effective for classifying document databases.
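To see how quickly the number of dimensions grows, the sketch below treats every distinct word as one dimension and turns each document into a term-count vector. The three sample sentences and the helper name `to_vector` are hypothetical, and splitting on whitespace is a simplifying assumption (a real system would normally also normalize, stem, and remove stop words).

```python
from collections import Counter

# Hypothetical miniature document collection.
docs = [
    "the market fell as interest rates rose",
    "the team won the final match",
    "rates and shares moved the market",
]

# Every distinct term becomes one dimension of the vector space.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc):
    """Term-count vector of `doc` over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

vectors = [to_vector(doc) for doc in docs]
print(len(vocabulary), "dimensions for", len(docs), "short documents")
```

Even three short sentences need more dimensions than a typical relational table has attributes, and most entries in each vector are zero. A realistic collection would extend the vocabulary into the thousands.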
According to the vector-space model, two documents are similar if they share similar document vectors. This model motivates the construction of the k-nearest-neighbor classifier, based on the intuition that similar documents are expected to be assigned the same class label.
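One common way to compare document vectors is cosine similarity, used here as an assumption since the text does not name a specific measure. The sketch below represents each document as a sparse term-count vector; the function names `term_vector` and `cosine_similarity` and the sample sentences are illustrative only.

```python
import math
from collections import Counter

def term_vector(text):
    """Sparse term-count vector: maps each word to its frequency."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

d1 = term_vector("interest rates move the stock market")
d2 = term_vector("the stock market reacts to interest rates")
d3 = term_vector("football team wins final")

print(cosine_similarity(d1, d2))  # share most terms, so the score is high
print(cosine_similarity(d1, d3))  # share no terms, so the score is zero
```

Documents that share many terms score close to 1, while documents with no terms in common score 0.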
We can simply index all of the training documents, each associated with its corresponding class label. When a test document is submitted, we can treat it as a query to the information retrieval (IR) system and retrieve from the training set the k documents that are most similar to the query, where k is a tunable constant.
The class label of the test document can then be determined from the class label distribution of its k nearest neighbors. This distribution can also be refined, for example by using weighted counts instead of raw counts, or by setting aside a portion of the labeled documents for validation.
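A minimal sketch of such a k-nearest-neighbor classifier follows. It assumes cosine similarity over raw term counts as the similarity measure and uses a linear scan in place of a real IR index. The training documents, labels, and the names `vec`, `cosine`, and `knn_classify` are hypothetical. Setting `weighted=True` replaces raw vote counts with similarity-weighted counts.

```python
import math
from collections import Counter, defaultdict

def vec(text):
    """Sparse term-count vector for a document."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def knn_classify(training, query, k=3, weighted=False):
    """Label `query` from the label distribution of its k most similar training documents."""
    q = vec(query)
    # Treat the test document as a query: score every training document against it.
    scored = sorted(((cosine(vec(text), q), label) for text, label in training),
                    key=lambda pair: pair[0], reverse=True)[:k]
    votes = defaultdict(float)
    for similarity, label in scored:
        # Raw count: each neighbor casts one vote. Weighted: vote equals similarity.
        votes[label] += similarity if weighted else 1.0
    return max(votes, key=votes.get)

# Hypothetical labeled training documents.
training = [
    ("stock market shares rise", "finance"),
    ("bank raises interest rates", "finance"),
    ("market rates and bank shares", "finance"),
    ("team wins the football match", "sports"),
    ("player scores in the final match", "sports"),
    ("football team signs new player", "sports"),
]

print(knn_classify(training, "bank shares fall on the market", k=3))
print(knn_classify(training, "football match final", k=3, weighted=True))
```

The value of `k` and the choice between raw and weighted votes would normally be tuned on a held-out validation portion of the labeled documents.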
