Big Data Analytics - Online Learning

Online learning is a subfield of machine learning that allows to scale supervised learning models to massive datasets. The basic idea is that we don’t need to read all the data in memory to fit a model, we only need to read each instance at a time.

In this case, we will show how to implement an online learning algorithm using logistic regression. As in most of supervised learning algorithms, there is a cost function that is minimized. In logistic regression, the cost function is defined as −

$$J(\theta) \: = \: \frac{-1}{m} \left [ \sum_{i = 1}^{m}y^{(i)}log(h_{\theta}(x^{(i)})) + (1 - y^{(i)}) log(1 - h_{\theta}(x^{(i)})) \right ]$$

where J(θ) represents the cost function and h_θ(x) represents the hypothesis. In the case of logistic regression it is defined with the following formula −

$$h_\theta(x) = \frac{1}{1 + e^{\theta^T x}}$$

Now that we have defined the cost function we need to find an algorithm to minimize it. The simplest algorithm for achieving this is called stochastic gradient descent. The update rule of the algorithm for the weights of the logistic regression model is defined as −

$$\theta_j : = \theta_j - \alpha(h_\theta(x) - y)x$$

There are several implementations of the following algorithm, but the one implemented in the vowpal wabbit library is by far the most developed one. The library allows training of large scale regression models and uses small amounts of RAM. In the creators own words it is described as: "The Vowpal Wabbit (VW) project is a fast out-of-core learning system sponsored by Microsoft Research and (previously) Yahoo! Research".

We will be working with the titanic dataset from a kaggle competition. The original data can be found in the bda/part3/vw folder. Here, we have two files −

We have training data (train_titanic.csv), and
unlabeled data in order to make new predictions (test_titanic.csv).

In order to convert the csv format to the vowpal wabbit input format use the csv_to_vowpal_wabbit.py python script. You will obviously need to have python installed for this. Navigate to the bda/part3/vw folder, open the terminal and execute the following command −

python csv_to_vowpal_wabbit.py

Note that for this section, if you are using windows you will need to install a Unix command line, enter the cygwin website for that.

Open the terminal and also in the folder bda/part3/vw and execute the following command −

vw train_titanic.vw -f model.vw --binary --passes 20 -c -q ff --sgd --l1 
0.00000001 --l2 0.0000001 --learning_rate 0.5 --loss_function logistic

Let us break down what each argument of the vw call means.

-f model.vw − means that we are saving the model in the model.vw file for making predictions later
--binary − Reports loss as binary classification with -1,1 labels
--passes 20 − The data is used 20 times to learn the weights
-c − create a cache file
-q ff − Use quadratic features in the f namespace
--sgd − use regular/classic/simple stochastic gradient descent update, i.e., nonadaptive, non-normalized, and non-invariant.
--l1 --l2 − L1 and L2 norm regularization
--learning_rate 0.5 − The learning rate αas defined in the update rule formula

The following code shows the results of running the regression model in the command line. In the results, we get the average log-loss and a small report of the algorithm performance.

-loss_function logistic
creating quadratic features for pairs: ff  
using l1 regularization = 1e-08 
using l2 regularization = 1e-07 

final_regressor = model.vw 
Num weight bits = 18 
learning rate = 0.5 
initial_t = 1 
power_t = 0.5 
decay_learning_rate = 1 
using cache_file = train_titanic.vw.cache 
ignoring text input in favor of cache input 
num sources = 1 

average    since         example   example  current  current  current 
loss       last          counter   weight    label   predict  features 
0.000000   0.000000          1      1.0    -1.0000   -1.0000       57 
0.500000   1.000000          2      2.0     1.0000   -1.0000       57 
0.250000   0.000000          4      4.0     1.0000    1.0000       57 
0.375000   0.500000          8      8.0    -1.0000   -1.0000       73 
0.625000   0.875000         16     16.0    -1.0000    1.0000       73 
0.468750   0.312500         32     32.0    -1.0000   -1.0000       57 
0.468750   0.468750         64     64.0    -1.0000    1.0000       43 
0.375000   0.281250        128    128.0     1.0000   -1.0000       43 
0.351562   0.328125        256    256.0     1.0000   -1.0000       43 
0.359375   0.367188        512    512.0    -1.0000    1.0000       57 
0.274336   0.274336       1024   1024.0    -1.0000   -1.0000       57 h 
0.281938   0.289474       2048   2048.0    -1.0000   -1.0000       43 h 
0.246696   0.211454       4096   4096.0    -1.0000   -1.0000       43 h 
0.218922   0.191209       8192   8192.0     1.0000    1.0000       43 h 

finished run 
number of examples per pass = 802 
passes used = 11 
weighted example sum = 8822 
weighted label sum = -2288 
average loss = 0.179775 h 
best constant = -0.530826 
best constant’s loss = 0.659128 
total feature number = 427878

Now we can use the model.vw we trained to generate predictions with new data.

vw -d test_titanic.vw -t -i model.vw -p predictions.txt

The predictions generated in the previous command are not normalized to fit between the [0, 1] range. In order to do this, we use a sigmoid transformation.

# Read the predictions
preds = fread('vw/predictions.txt')  

# Define the sigmoid function 
sigmoid = function(x) { 
   1 / (1 + exp(-x)) 
} 
probs = sigmoid(preds[[1]])  

# Generate class labels 
preds = ifelse(probs > 0.5, 1, 0) 
head(preds) 
# [1] 0 1 0 0 1 0