CNTK - Out-of-Memory Datasets


In this chapter, how to measure performance of out-of-memory datasets will be explained.

In previous sections, we have discussed about various methods to validate the performance of our NN, but the methods we have discussed, are ones that deals with the datasets that fit in the memory.

Here, the question arises what about out-of-memory datasets, because in production scenario, we need a lot of data to train NN. In this section, we are going to discuss how to measure performance when working with minibatch sources and manual minibatch loop.

Minibatch sources

While working with out-of-memory dataset, i.e. minibatch sources, we need slightly different setup for loss, as well as metric, than the setup we used while working with small datasets i.e. in-memory datasets. First, we will see how to set up a way to feed data to the trainer of NN model.

Following are the implementation steps−

Step 1 − First, from cntk.io module import the components for creating the minibatch source as follows−

from cntk.io import StreamDef, StreamDefs, MinibatchSource, CTFDeserializer,
 INFINITY_REPEAT
 

Step 2 − Next, create a new function named say create_datasource. This function will have two parameters namely filename and limit, with a default value of INFINITELY_REPEAT.

def create_datasource(filename, limit =INFINITELY_REPEAT)

Step 3 − Now, within the function, by using StreamDef class crate a stream definition for the labels that reads from the labels field that has three features. We also need to set is_sparse to False as follows−

labels_stream = StreamDef(field=’labels’, shape=3, is_sparse=False)

Step 4 − Next, create to read the features filed from the input file, create another instance of StreamDef as follows.

feature_stream = StreamDef(field=’features’, shape=4, is_sparse=False)

Step 5 − Now, initialise the CTFDeserializer instance class. Specify the filename and streams that we need to deserialize as follows −

deserializer = CTFDeserializer(filename, StreamDefs(labels=
label_stream, features=features_stream)

Step 6 − Next, we need to create instance of minisourceBatch by using deserializer as follows −

Minibatch_source = MinibatchSource(deserializer, randomize=True, max_sweeps=limit)
return minibatch_source

Step 7 − At last, we need to provide training and testing source, which we created in previous sections also. We are using iris flower dataset.

training_source = create_datasource(‘Iris_train.ctf’)
test_source = create_datasource(‘Iris_test.ctf’, limit=1)

Once you create MinibatchSource instance, we need to train it. We can use the same training logic, as used when we worked with small in-memory datasets. Here, we will use MinibatchSource instance, as the input for the train method on loss function as follows −

Following are the implementation steps−

Step 1 − In order to log the output of the training session, first import the ProgressPrinter from cntk.logging module as follows −

from cntk.logging import ProgressPrinter

Step 2 − Next, to set up the training session, import the trainer and training_session from cntk.train module as follows−

from cntk.train import Trainer, training_session

Step 3 − Now, we need to define some set of constants like minibatch_size, samples_per_epoch and num_epochs as follows−

minbatch_size = 16
samples_per_epoch = 150
num_epochs = 30
max_samples = samples_per_epoch * num_epochs

Step 4 − Next, in order to know how to read data during training in CNTK, we need to define a mapping between the input variable for the network and the streams in the minibatch source.

input_map = {
   features: training_source.streams.features,
   labels: training_source.streams.labels
}

Step 5 − Next to log the output of the training process, initialize the progress_printer variable with a new ProgressPrinter instance. Also, initialize the trainer and provide it with the model as follows−

progress_writer = ProgressPrinter(0)
trainer: training_source.streams.labels

Step 6 − At last, to start the training process, we need to invoke the training_session function as follows −

session = training_session(trainer,
   mb_source=training_source,
   mb_size=minibatch_size,
   model_inputs_to_streams=input_map,
   max_samples=max_samples,
   test_config=test_config)
session.train()

Once we trained the model, we can add validation to this setup by using a TestConfig object and assign it to the test_config keyword argument of the train_session function.

Following are the implementation steps−

Step 1 − First, we need to import the TestConfig class from the module cntk.train as follows−

from cntk.train import TestConfig

Step 2 − Now, we need to create a new instance of the TestConfig with the test_source as input−

Test_config = TestConfig(test_source)

Complete Example

from cntk.io import StreamDef, StreamDefs, MinibatchSource, CTFDeserializer, INFINITY_REPEAT
def create_datasource(filename, limit =INFINITELY_REPEAT)
labels_stream = StreamDef(field=’labels’, shape=3, is_sparse=False)
feature_stream = StreamDef(field=’features’, shape=4, is_sparse=False)
deserializer = CTFDeserializer(filename, StreamDefs(labels=label_stream, features=features_stream)
Minibatch_source = MinibatchSource(deserializer, randomize=True, max_sweeps=limit)
return minibatch_source
training_source = create_datasource(‘Iris_train.ctf’)
test_source = create_datasource(‘Iris_test.ctf’, limit=1)
from cntk.logging import ProgressPrinter
from cntk.train import Trainer, training_session
minbatch_size = 16
samples_per_epoch = 150
num_epochs = 30
max_samples = samples_per_epoch * num_epochs
input_map = {
   features:   training_source.streams.features,
   labels: training_source.streams.labels
 }
progress_writer = ProgressPrinter(0)
trainer: training_source.streams.labels
session = training_session(trainer,
   mb_source=training_source,
   mb_size=minibatch_size,
   model_inputs_to_streams=input_map,
   max_samples=max_samples,
   test_config=test_config)
session.train()
from cntk.train import TestConfig
Test_config = TestConfig(test_source)

Output

-------------------------------------------------------------------
average   since   average   since  examples
loss      last    metric    last
------------------------------------------------------
Learning rate per minibatch: 0.1
1.57      1.57     0.214    0.214   16
1.38      1.28     0.264    0.289   48
[………]
Finished Evaluation [1]: Minibatch[1-1]:metric = 69.65*30;

Manual minibatch loop

As we see above, it is easy to measure the performance of our NN model during and after training, by using the metrics when training with regular APIs in CNTK. But, on the other side, things will not be that easy while working with a manual minibatch loop.

Here, we are using the model given below with 4 inputs and 3 outputs from Iris Flower dataset, created in previous sections too−

from cntk import default_options, input_variable
from cntk.layers import Dense, Sequential
from cntk.ops import log_softmax, relu, sigmoid
from cntk.learners import sgd
model = Sequential([
   Dense(4, activation=sigmoid),
   Dense(3, activation=log_softmax)
])
features = input_variable(4)
labels = input_variable(3)
z = model(features)

Next, the loss for the model is defined as the combination of the cross-entropy loss function, and the F-measure metric as used in previous sections. We are going to use the criterion_factory utility, to create this as a CNTK function object as shown below−

import cntk
from cntk.losses import cross_entropy_with_softmax, fmeasure
@cntk.Function
def criterion_factory(outputs, targets):
   loss = cross_entropy_with_softmax(outputs, targets)
   metric = fmeasure(outputs, targets, beta=1)
   return loss, metric
loss = criterion_factory(z, labels)
learner = sgd(z.parameters, 0.1)
label_mapping = {
   'Iris-setosa': 0,
   'Iris-versicolor': 1,
   'Iris-virginica': 2
}

Now, as we have defined the loss function, we will see how we can use it in the trainer, to set up a manual training session.

Following are the implementation steps −

Step 1 − First, we need to import the required packages like numpy and pandas to load and preprocess the data.

import pandas as pd
import numpy as np

Step 2 − Next, in order to log information during training, import the ProgressPrinter class as follows−

from cntk.logging import ProgressPrinter

Step 3 − Then, we need to import the trainer module from cntk.train module as follows −

from cntk.train import Trainer

Step 4 − Next, create a new instance of ProgressPrinter as follows −

progress_writer = ProgressPrinter(0)

Step 5 − Now, we need to initialise trainer with the parameters the loss, the learner and the progress_writer as follows −

trainer = Trainer(z, loss, learner, progress_writer)

Step 6 −Next, in order to train the model, we will create a loop that will iterate over the dataset thirty times. This will be the outer training loop.

for _ in range(0,30):

Step 7 − Now, we need to load the data from disk using pandas. Then, in order to load the dataset in mini-batches, set the chunksize keyword argument to 16.

input_data = pd.read_csv('iris.csv',
names=['sepal_length', 'sepal_width','petal_length','petal_width', 'species'],
index_col=False, chunksize=16)

Step 8 − Now, create an inner training for loop to iterate over each of the mini-batches.

for df_batch in input_data:

Step 9 − Now inside this loop, read the first four columns using the iloc indexer, as the features to train from and convert them to float32 −

feature_values = df_batch.iloc[:,:4].values
feature_values = feature_values.astype(np.float32)

Step 10 − Now, read the last column as the labels to train from, as follows −

label_values = df_batch.iloc[:,-1]

Step 11 − Next, we will use one-hot vectors to convert the label strings to their numeric presentation as follows −

label_values = label_values.map(lambda x: label_mapping[x])

Step 12 − After that, take the numeric presentation of the labels. Next, convert them to a numpy array, so it is easier to work with them as follows −

label_values = label_values.values

Step 13 − Now, we need to create a new numpy array that has the same number of rows as the label values that we have converted.

encoded_labels = np.zeros((label_values.shape[0], 3))

Step 14 − Now, in order to create one-hot encoded labels, select the columns based on the numeric label values.

encoded_labels[np.arange(label_values.shape[0]), label_values] = 1.

Step 15 − At last, we need to invoke the train_minibatch method on the trainer and provide the processed features and labels for the minibatch.

trainer.train_minibatch({features: feature_values, labels: encoded_labels})

Complete Example

from cntk import default_options, input_variable
from cntk.layers import Dense, Sequential
from cntk.ops import log_softmax, relu, sigmoid
from cntk.learners import sgd
model = Sequential([
   Dense(4, activation=sigmoid),
   Dense(3, activation=log_softmax)
])
features = input_variable(4)
labels = input_variable(3)
z = model(features)
import cntk
from cntk.losses import cross_entropy_with_softmax, fmeasure
@cntk.Function
def criterion_factory(outputs, targets):
   loss = cross_entropy_with_softmax(outputs, targets)
   metric = fmeasure(outputs, targets, beta=1)
   return loss, metric
loss = criterion_factory(z, labels)
learner = sgd(z.parameters, 0.1)
label_mapping = {
   'Iris-setosa': 0,
   'Iris-versicolor': 1,
   'Iris-virginica': 2
}
import pandas as pd
import numpy as np
from cntk.logging import ProgressPrinter
from cntk.train import Trainer
progress_writer = ProgressPrinter(0)
trainer = Trainer(z, loss, learner, progress_writer)
for _ in range(0,30):
   input_data = pd.read_csv('iris.csv',
      names=['sepal_length', 'sepal_width','petal_length','petal_width', 'species'],
      index_col=False, chunksize=16)
for df_batch in input_data:
   feature_values = df_batch.iloc[:,:4].values
   feature_values = feature_values.astype(np.float32)
   label_values = df_batch.iloc[:,-1]
label_values = label_values.map(lambda x: label_mapping[x])
label_values = label_values.values
   encoded_labels = np.zeros((label_values.shape[0], 3))
   encoded_labels[np.arange(label_values.shape[0]), 
label_values] = 1.
   trainer.train_minibatch({features: feature_values, labels: encoded_labels})

Output

-------------------------------------------------------------------
average    since    average   since  examples
loss       last      metric   last
------------------------------------------------------
Learning rate per minibatch: 0.1
1.45       1.45     -0.189    -0.189   16
1.24       1.13     -0.0382    0.0371  48
[………]

In the above output, we got both the output for the loss and the metric during training. It is because we combined a metric and loss in a function object and used a progress printer in the trainer configuration.

Now, in order to evaluate the model performance, we need to perform same task as with training the model, but this time, we need to use an Evaluator instance to test the model. It is shown in the following Python code−

from cntk import Evaluator
evaluator = Evaluator(loss.outputs[1], [progress_writer])
input_data = pd.read_csv('iris.csv',
   names=['sepal_length', 'sepal_width','petal_length','petal_width', 'species'],
index_col=False, chunksize=16)
for df_batch in input_data:
   feature_values = df_batch.iloc[:,:4].values
   feature_values = feature_values.astype(np.float32)
   label_values = df_batch.iloc[:,-1]
   label_values = label_values.map(lambda x: label_mapping[x])
   label_values = label_values.values
   encoded_labels = np.zeros((label_values.shape[0], 3))
   encoded_labels[np.arange(label_values.shape[0]), label_values] = 1.
   evaluator.test_minibatch({ features: feature_values, labels:
      encoded_labels})
evaluator.summarize_test_progress()

Now, we will get the output something like the following−

Output

Finished Evaluation [1]: Minibatch[1-11]:metric = 74.62*143;