Mahout - Classification

Quiz

What is Classification?

Classification is a machine learning technique that uses known data to determine how the new data should be classified into a set of existing categories. For example,

iTunes application uses classification to prepare playlists.
Mail service providers such as Yahoo! and Gmail use this technique to decide whether a new mail should be classified as a spam. The categorization algorithm trains itself by analyzing user habits of marking certain mails as spams. Based on that, the classifier decides whether a future mail should be deposited in your inbox or in the spams folder.

How Classification Works

While classifying a given set of data, the classifier system performs the following actions:

Initially a new data model is prepared using any of the learning algorithms.
Then the prepared data model is tested.
Thereafter, this data model is used to evaluate the new data and to determine its class.

Applications of Classification

Credit card fraud detection - The Classification mechanism is used to predict credit card frauds. Using historical information of previous frauds, the classifier can predict which future transactions may turn into frauds.
Spam e-mails - Depending on the characteristics of previous spam mails, the classifier determines whether a newly encountered e-mail should be sent to the spam folder.

Naive Bayes Classifier

Mahout uses the Naive Bayes classifier algorithm. It uses two implementations:

Distributed Naive Bayes classification
Complementary Naive Bayes classification

Naive Bayes is a simple technique for constructing classifiers. It is not a single algorithm for training such classifiers, but a family of algorithms. A Bayes classifier constructs models to classify problem instances. These classifications are made using the available data.

An advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification.

For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting.

Despite its oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations.

Procedure of Classification

The following steps are to be followed to implement Classification:

Generate example data
Create sequence files from data
Convert sequence files to vectors
Train the vectors
Test the vectors

Step1: Generate Example Data

Generate or download the data to be classified. For example, you can get the 20 newsgroups example data from the following link: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

Create a directory for storing input data. Download the example as shown below.

$ mkdir classification_example
$ cd classification_example
$tar xzvf 20news-bydate.tar.gz
wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

Step 2: Create Sequence Files

Create sequence file from the example using seqdirectory utility. The syntax to generate sequence is given below:

mahout seqdirectory -i <input file path> -o <output directory>

Step 3: Convert Sequence Files to Vectors

Create vector files from sequence files using seq2parse utility. The options of seq2parse utility are given below:

$MAHOUT_HOME/bin/mahout seq2sparse
--analyzerName (-a) analyzerName  The class name of the analyzer
--chunkSize (-chunk) chunkSize    The chunkSize in MegaBytes.
--output (-o) output              The directory pathname for o/p
--input (-i) input                Path to job input directory.

Step 4: Train the Vectors

Train the generated vectors using the trainnb utility. The options to use trainnb utility are given below:

mahout trainnb
 -i ${PATH_TO_TFIDF_VECTORS}
 -el
 -o ${PATH_TO_MODEL}/model
 -li ${PATH_TO_MODEL}/labelindex
 -ow
 -c

Step 5: Test the Vectors

Test the vectors using testnb utility. The options to use testnb utility are given below:

mahout testnb
 -i ${PATH_TO_TFIDF_TEST_VECTORS}
 -m ${PATH_TO_MODEL}/model
 -l ${PATH_TO_MODEL}/labelindex
 -ow
 -o ${PATH_TO_OUTPUT}
 -c
 -seq

Print Page