How can automated document classification be performed?

Automated document classification is an essential text mining task: because a tremendous number of documents exist online, it is tedious yet important to be able to organize them into classes automatically, so as to support document retrieval and subsequent analysis.

Document classification has been used in automated topic tagging (i.e., assigning labels to documents), topic directory construction, identification of document writing styles, and classification of the purposes of hyperlinks associated with a set of documents.

A general procedure is as follows − First, a set of preclassified documents is taken as the training set. The training set is analyzed in order to derive a classification scheme. Such a classification scheme often needs to be refined with a testing phase. The derived classification scheme can then be used to classify other on-line documents.
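The steps above can be sketched in a few lines. The classifier here is a deliberately simple keyword-profile scheme, and the documents and labels are invented for illustration; a real system would use a stronger model, but the train/test/apply workflow is the same.

```python
# Sketch of the general procedure: derive a classification scheme from
# a preclassified training set, then check it against a testing set.
# Documents, labels, and the profile-based scheme are toy assumptions.
from collections import Counter

def train(labeled_docs):
    """Build a per-class keyword-frequency profile from the training set."""
    profiles = {}
    for text, label in labeled_docs:
        profiles.setdefault(label, Counter()).update(text.lower().split())
    return profiles

def classify(profiles, text):
    """Assign the class whose profile best overlaps the document's words."""
    words = set(text.lower().split())
    return max(profiles, key=lambda c: sum(profiles[c][w] for w in words))

training_set = [
    ("stocks fell amid market fears", "finance"),
    ("the market rallied on earnings", "finance"),
    ("the team won the final match", "sports"),
    ("striker scores in the last match", "sports"),
]
testing_set = [
    ("earnings lifted the market", "finance"),
    ("the match ended in a win", "sports"),
]

scheme = train(training_set)                      # derive the scheme
accuracy = sum(classify(scheme, t) == y          # refine/validate it
               for t, y in testing_set) / len(testing_set)
print(accuracy)
```

Once the testing phase confirms the scheme is acceptable, `classify` can be applied to new, unlabeled on-line documents.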

This process appears similar to the classification of relational records. However, relational data are well structured: each tuple is defined by a set of attribute-value pairs.

For instance, in the tuple {sunny, warm, dry, not windy, play tennis}, the value “sunny” corresponds to the attribute weather outlook, “warm” corresponds to the attribute temperature, and so on.

The classification analysis determines which set of attribute-value pairs has the greatest discriminating power in deciding whether a person is going to play tennis. In contrast, document databases are not structured according to attribute-value pairs.
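To make "discriminating power" concrete, the toy snippet below (the four tuples are made-up data in the spirit of the play-tennis example) scores each attribute by how accurately a one-attribute rule − predict the majority class among tuples sharing that attribute's value − reproduces the class label.

```python
# Toy relational tuples (invented data): which attribute best
# discriminates the "play" class label?
from collections import Counter, defaultdict

rows = [
    {"outlook": "sunny", "temp": "warm", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "temp": "warm", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "temp": "cool", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "temp": "warm", "windy": "no",  "play": "no"},
]

def purity(attr):
    """Fraction of tuples classified correctly by a one-attribute rule:
    predict the majority class among tuples sharing the attribute value."""
    by_value = defaultdict(Counter)
    for r in rows:
        by_value[r[attr]][r["play"]] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return correct / len(rows)

best = max(["outlook", "temp", "windy"], key=purity)
print(best, purity(best))
```

On this data, "outlook" alone predicts the label perfectly, so it carries the highest discriminating power; decision tree induction formalizes the same idea with measures such as information gain.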

That is, the set of keywords associated with a set of documents is not organized into a fixed set of attributes or dimensions. If we view each distinct keyword, term, or feature in a document as a dimension, there may be thousands of dimensions in a set of documents. Therefore, commonly used relational data-oriented classification methods, such as decision tree analysis, may not be effective for the classification of document databases.

According to the vector-space model, two documents are similar if they share similar document vectors. This model motivates the construction of the k-nearest-neighbor classifier, based on the intuition that similar documents are expected to be assigned the same class label.

We can simply index all of the training documents, each associated with its corresponding class label. When a test document is submitted, we can treat it as a query to the IR system and retrieve from the training set the k documents that are most similar to the query, where k is a tunable constant.

The class label of the test document can then be determined based on the class label distribution of its k nearest neighbors. Such a class label distribution can also be refined, for example by using similarity-weighted counts instead of raw counts, or by setting aside a portion of the labeled documents for validation.
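The whole k-nearest-neighbor scheme can be sketched compactly. This version uses raw term-frequency vectors with cosine similarity and similarity-weighted voting; the training documents are invented examples, and a real system would typically use TF-IDF weights and an inverted index rather than a linear scan.

```python
# Minimal k-nearest-neighbor document classifier under the vector-space
# model: cosine similarity over term-frequency vectors, with
# similarity-weighted (rather than raw) neighbor counts.
import math
from collections import Counter, defaultdict

def vectorize(text):
    """Represent a document as a term-frequency vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

# Index the training documents, each with its class label (toy data).
training = [
    ("interest rates rose again", "finance"),
    ("banks cut interest rates", "finance"),
    ("the goalkeeper saved the penalty", "sports"),
    ("a late penalty won the game", "sports"),
]
index = [(vectorize(text), label) for text, label in training]

def knn_classify(query, k=3):
    """Treat the test document as a query; vote among its k neighbors."""
    vq = vectorize(query)
    neighbors = sorted(index, key=lambda p: cosine(vq, p[0]),
                       reverse=True)[:k]
    votes = defaultdict(float)
    for vec, label in neighbors:
        votes[label] += cosine(vq, vec)   # weighted counts, not raw counts
    return max(votes, key=votes.get)

print(knn_classify("rates and interest policy"))
```

Weighting each vote by the neighbor's similarity means a distant third neighbor cannot outvote two very close ones, which is exactly the refinement of the raw class label distribution mentioned above.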