CNTK - In-Memory and Large Datasets

In this chapter, we will learn how to work with in-memory and large datasets in CNTK.

Training with small in-memory datasets

There are many ways to feed data into a CNTK trainer, and the right one depends on the size of the dataset and the format of the data. A dataset can be small enough to fit in memory, or too large to load at once.

In this section, we are going to work with in-memory datasets. For this, we will use the following two libraries −

• Numpy
• Pandas

Using Numpy arrays

Here, we will work with a randomly generated, NumPy-based dataset in CNTK. In this example, we are going to simulate data for a binary classification problem. Suppose we have a set of observations with 4 features each and want our deep learning model to predict one of two possible labels.

Implementation Example

First, we must generate the labels as one-hot vectors, one vector per observation. It can be done with the help of the following steps −

Step 1 − Import the numpy package and set the number of samples to generate −

```python
import numpy as np
num_samples = 20000
```

Step 2 − Next, generate a label mapping by using the np.eye function. Each row of the resulting 2x2 identity matrix is the one-hot vector for one class −

```python
label_mapping = np.eye(2)
```

Step 3 − Now, draw 20000 random class indices with the np.random.choice function and use them to select rows of the label mapping −

```python
y = label_mapping[np.random.choice(2,num_samples)].astype(np.float32)
```

Step 4 − At last, generate the features as an array of random floating-point values by using the np.random.random function −

```python
x = np.random.random(size=(num_samples, 4)).astype(np.float32)
```
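To see what Steps 1 to 4 produce, the following minimal sketch repeats them and inspects the result. The call to np.random.seed and the class_ids variable are additions made here for illustration only; they are not part of the original steps.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed so repeated runs give the same data
num_samples = 20000

label_mapping = np.eye(2)                        # row 0 is [1, 0], row 1 is [0, 1]
class_ids = np.random.choice(2, num_samples)     # one random class index per sample
y = label_mapping[class_ids].astype(np.float32)  # one-hot labels
x = np.random.random(size=(num_samples, 4)).astype(np.float32)  # four features per sample

print(x.shape, x.dtype)
print(y.shape, y.dtype)
print(y[:3])
```

Every row of y contains exactly one 1, located at the position of the class index drawn for that sample.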

Both arrays are converted to 32-bit floating-point numbers with astype(np.float32), because that is the format CNTK expects. With the data ready, let us follow the steps below to build the network −

Step 5 − Import the Dense and Sequential layer functions from the cntk.layers module as follows −

```python
from cntk.layers import Dense, Sequential
```

Step 6 − Now, import input_variable and default_options, along with the activation function for the layers in the network. Let us use sigmoid as the activation function −

```python
from cntk import input_variable, default_options
from cntk.ops import sigmoid
```

Step 7 − Now, we need to import the loss function to train the network. Let us import binary_cross_entropy as the loss function −

```python
from cntk.losses import binary_cross_entropy
```

Step 8 − Next, we need to define the default options for the network. Here, we provide the sigmoid activation function as the default setting and create the model inside that scope by using the Sequential layer function −

```python
with default_options(activation=sigmoid):
    model = Sequential([Dense(6),Dense(2)])
```

Step 9 − Next, initialise an input_variable with 4 input features to serve as the input for the network −

```python
features = input_variable(4)
```

Step 10 − Now, to complete the network, connect the features variable to the NN −

```python
z = model(features)
```

Now that we have a NN, let us train it on the in-memory dataset with the help of the following steps −

Step 11 − To train this NN, first we need to import a learner from the cntk.learners module. We will import the sgd learner as follows −

```python
from cntk.learners import sgd
```

Step 12 − Along with that, import the ProgressPrinter from the cntk.logging module and create an instance of it −

```python
from cntk.logging import ProgressPrinter
progress_writer = ProgressPrinter(0)
```

Step 13 − Next, define a new input variable for the labels as follows −

```python
labels = input_variable(2)
```

Step 14 − Next, define the loss by passing the model z and the labels variable to the binary_cross_entropy function −

```python
loss = binary_cross_entropy(z, labels)
```

Step 15 − Next, initialize the sgd learner as follows −

```python
learner = sgd(z.parameters, lr=0.1)
```

Step 16 − At last, call the train method on the loss function. Provide it with the input data, the sgd learner and the progress_writer −

```python
training_summary=loss.train((x,y),parameter_learners=[learner],callbacks=[progress_writer])
```

Complete implementation example

```python
import numpy as np
from cntk import input_variable, default_options
from cntk.layers import Dense, Sequential
from cntk.ops import sigmoid
from cntk.losses import binary_cross_entropy
from cntk.learners import sgd
from cntk.logging import ProgressPrinter

# Generate random features and one-hot labels as 32-bit floats
num_samples = 20000
label_mapping = np.eye(2)
y = label_mapping[np.random.choice(2,num_samples)].astype(np.float32)
x = np.random.random(size=(num_samples, 4)).astype(np.float32)

# Build the network with sigmoid as the default activation
with default_options(activation=sigmoid):
    model = Sequential([Dense(6),Dense(2)])
features = input_variable(4)
z = model(features)

# Train the network on the in-memory arrays
progress_writer = ProgressPrinter(0)
labels = input_variable(2)
loss = binary_cross_entropy(z, labels)
learner = sgd(z.parameters, lr=0.1)
training_summary=loss.train((x,y),parameter_learners=[learner],callbacks=[progress_writer])
```

Output

```
Build info:
Built time: *** ** **** 21:40:10
Build type: Release
Build target: CPU-only
With ASGD: yes
Math lib: mkl
Build SHA1:ae9c9c7c5f9e6072cc9c94c254f816dbdc1c5be6 (modified)
MPI distribution: Microsoft MPI
MPI version: 7.0.12437.6
-------------------------------------------------------------------
average   since   average   since examples
loss      last    metric    last
------------------------------------------------------
Learning rate per minibatch: 0.1
1.52      1.52      0         0     32
1.51      1.51      0         0     96
1.48      1.46      0         0    224
1.45      1.42      0         0    480
1.42       1.4      0         0    992
1.41      1.39      0         0   2016
1.4       1.39      0         0   4064
1.39      1.39      0         0   8160
1.39      1.39      0         0  16352
```
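The loss in the output above levels off at about 1.39. Since both the features and the labels are random, there is nothing for the network to learn, so the best it can do is predict 0.5 for each output. The sketch below computes the binary cross-entropy for that case in plain NumPy. It assumes the loss is summed over the two output units, which is an assumption that happens to match the reported numbers, not a statement about CNTK internals. The function name binary_cross_entropy_sum is made up for this illustration.

```python
import numpy as np

def binary_cross_entropy_sum(p, y):
    """Binary cross-entropy of one sample, summed over the output units."""
    p = np.asarray(p, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    return float(-np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# A network that has learned nothing predicts 0.5 for both outputs.
uninformed = binary_cross_entropy_sum([0.5, 0.5], [1.0, 0.0])
print(round(uninformed, 2))  # 1.39, which equals 2 * ln(2)
```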

Using Pandas DataFrames

Numpy arrays are one of the most basic ways of storing data, but they are limited in what they can contain. For example, a single n-dimensional array holds data of a single data type. Many real-world datasets, on the other hand, mix more than one data type, so we need a library that can handle them.

The Python library Pandas makes it easier to work with such datasets. It introduces the concept of a DataFrame (DF) and allows us to load datasets stored on disk in various formats, such as CSV, JSON and Excel, as DFs.
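The difference can be seen in a short sketch. The two rows of data below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Mixing a float and a string in one NumPy array coerces both values to strings.
mixed = np.array([5.1, "Iris-setosa"])
print(mixed.dtype.kind)  # 'U' stands for a unicode string dtype

# A DataFrame keeps a separate data type for every column.
df = pd.DataFrame({
    "sepal_length": [5.1, 7.0],
    "species": ["Iris-setosa", "Iris-versicolor"],
})
print(df.dtypes)
```

The numeric column stays numeric in the DataFrame, so it can later be converted to float32 for CNTK without first parsing strings.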

You can learn about the Python Pandas library in more detail at https://www.tutorialspoint.com/python_pandas/index.htm.

Implementation Example

In this example, we are going to classify iris flowers into three possible species based on four properties. We have created this deep learning model in the previous sections too. The model is as follows −

```python
from cntk.layers import Dense, Sequential
from cntk import input_variable, default_options
from cntk.ops import sigmoid, log_softmax
model = Sequential([
    Dense(4, activation=sigmoid),
    Dense(3, activation=log_softmax)
])
features = input_variable(4)
z = model(features)
```

The above model contains one hidden layer and an output layer with three neurons to match the number of classes we can predict.

Next, we will use the train method and a loss function to train the network. For this, first we must load and preprocess the iris dataset, so that it matches the layout and data format the NN expects. It can be done with the help of the following steps −

Step 1 − Import the numpy and Pandas packages as follows −

```python
import numpy as np
import pandas as pd
```

Step 2 − Next, use the read_csv function to load the dataset into memory −

```python
df_source = pd.read_csv('iris.csv', names = ['sepal_length', 'sepal_width',
    'petal_length', 'petal_width', 'species'], index_col=False)
```

Step 3 − Now, we need to create a dictionary that maps the labels in the dataset to their corresponding numeric representation. The keys must match the species strings in the CSV file exactly, including their case −

```python
label_mapping = {'Iris-Setosa' : 0, 'Iris-Versicolor' : 1, 'Iris-Virginica' : 2}
```

Step 4 − Now, by using the iloc indexer on the DataFrame, select the first four columns as the features −

```python
x = df_source.iloc[:, :4].values
```

Step 5 − Next, we need to select the species column as the labels for the dataset. It can be done as follows −

```python
y = df_source['species'].values
```

Step 6 − Now, map each label to its numeric representation by using label_mapping, and convert the result to a one-hot encoded array with the one_hot function −

```python
y = np.array([one_hot(label_mapping[v], 3) for v in y])
```

Step 7 − Next, to use the features and the mapped labels with CNTK, we need to convert them both to 32-bit floats −

```python
x= x.astype(np.float32)
y= y.astype(np.float32)
```

The labels are stored in the dataset as strings, and CNTK cannot work with strings. That is why it needs one-hot encoded vectors representing the labels. The one_hot function used in Step 6 produces them, so it must be defined before that step is run. It can be defined as follows −

```python
def one_hot(index, length):
    result = np.zeros(length)
    result[index] = 1
    return result
```
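The preprocessing steps can be tried without the real file. The sketch below is a minimal, self-contained version that reads three made-up rows from an in-memory string instead of iris.csv; the csv_text and names variables are assumptions introduced for the illustration, and one_hot writes a 1 at the position of the class.

```python
import io
import numpy as np
import pandas as pd

def one_hot(index, length):
    result = np.zeros(length)
    result[index] = 1  # mark the position of the class
    return result

# Three made-up rows standing in for iris.csv (hypothetical sample data).
csv_text = (
    "5.1,3.5,1.4,0.2,Iris-Setosa\n"
    "7.0,3.2,4.7,1.4,Iris-Versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-Virginica\n"
)
names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
df_source = pd.read_csv(io.StringIO(csv_text), names=names, index_col=False)

label_mapping = {"Iris-Setosa": 0, "Iris-Versicolor": 1, "Iris-Virginica": 2}
x = df_source.iloc[:, :4].values.astype(np.float32)
y = np.array([one_hot(label_mapping[v], 3) for v in df_source["species"].values])
y = y.astype(np.float32)

print(x.shape, y.shape)
print(y)
```

With one row per species, y comes out as the 3x3 identity matrix, which is a quick way to check that the mapping and the encoding agree.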

Now that we have the numpy arrays in the correct format, we can use them to train our model with the help of the following steps −

Step 8 − First, we need to import the loss function to train the network. Let us import cross_entropy_with_softmax as the loss function −

```python
from cntk.losses import cross_entropy_with_softmax
```

Step 9 − To train this NN, we also need to import a learner from the cntk.learners module. We will import the sgd learner as follows −

```python
from cntk.learners import sgd
```

Step 10 − Along with that, import the ProgressPrinter from the cntk.logging module and create an instance of it −

```python
from cntk.logging import ProgressPrinter
progress_writer = ProgressPrinter(0)
```

Step 11 − Next, define a new input variable for the labels as follows −

```python
labels = input_variable(3)
```

Step 12 − Next, define the loss by passing the model z and the labels variable to the cross_entropy_with_softmax function −

```python
loss = cross_entropy_with_softmax(z, labels)
```

Step 13 − Next, initialise the sgd learner as follows −

```python
learner = sgd(z.parameters, 0.1)
```

Step 14 − At last, call the train method on the loss function. Provide it with the input data, the sgd learner and the progress_writer −

```python
training_summary=loss.train((x,y),parameter_learners=[learner],
    callbacks=[progress_writer],minibatch_size=16,max_epochs=5)
```

Complete implementation example

```python
import numpy as np
import pandas as pd
from cntk import input_variable
from cntk.layers import Dense, Sequential
from cntk.ops import sigmoid, log_softmax
from cntk.losses import cross_entropy_with_softmax
from cntk.learners import sgd
from cntk.logging import ProgressPrinter

# Build the network
model = Sequential([
    Dense(4, activation=sigmoid),
    Dense(3, activation=log_softmax)
])
features = input_variable(4)
z = model(features)

# Helper that must exist before the labels are encoded
def one_hot(index, length):
    result = np.zeros(length)
    result[index] = 1
    return result

# Load and preprocess the dataset
df_source = pd.read_csv('iris.csv', names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], index_col=False)
label_mapping = {'Iris-Setosa' : 0, 'Iris-Versicolor' : 1, 'Iris-Virginica' : 2}
x = df_source.iloc[:, :4].values
y = df_source['species'].values
y = np.array([one_hot(label_mapping[v], 3) for v in y])
x= x.astype(np.float32)
y= y.astype(np.float32)

# Train the network
progress_writer = ProgressPrinter(0)
labels = input_variable(3)
loss = cross_entropy_with_softmax(z, labels)
learner = sgd(z.parameters, 0.1)
training_summary=loss.train((x,y),parameter_learners=[learner],callbacks=[progress_writer],minibatch_size=16,max_epochs=5)
```

Output

```
Build info:
Built time: *** ** **** 21:40:10
Build type: Release
Build target: CPU-only
With ASGD: yes
Math lib: mkl
Build SHA1:ae9c9c7c5f9e6072cc9c94c254f816dbdc1c5be6 (modified)
MPI distribution: Microsoft MPI
MPI version: 7.0.12437.6
-------------------------------------------------------------------
average    since    average   since   examples
loss        last     metric   last
------------------------------------------------------
Learning rate per minibatch: 0.1
1.1         1.1        0       0      16
0.835     0.704        0       0      32
1.993      1.11        0       0      48
1.14       1.14        0       0     112
[………]
```

Training with large datasets

In the previous section, we worked with small in-memory datasets using Numpy and Pandas, but not all datasets are so small. Datasets containing images, videos or sound samples are especially large. To work with such datasets, CNTK provides MinibatchSource, a component that loads data in chunks. Some of the features of the MinibatchSource component are as follows −

• MinibatchSource helps to prevent the NN from overfitting by automatically randomizing the samples read from the data source.

• It has a built-in transformation pipeline which can be used to augment the data.

• It loads the data on a background thread separate from the training process.

In the following sections, we are going to explore how to use a minibatch source with out-of-memory data to work with large datasets. We will also explore how to use it to feed data to a NN for training.
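The idea of loading data in chunks and randomizing it can be illustrated with a few lines of plain Python. This is only a conceptual sketch: the iterate_minibatches function and its parameters are made up for the illustration and do not describe how MinibatchSource is implemented.

```python
import itertools
import random

def iterate_minibatches(lines, minibatch_size, chunk_size, seed=0):
    """Yield minibatches while holding at most chunk_size samples in memory."""
    rng = random.Random(seed)
    iterator = iter(lines)
    while True:
        chunk = list(itertools.islice(iterator, chunk_size))  # read the next chunk
        if not chunk:
            return
        rng.shuffle(chunk)  # randomize the order inside the chunk
        for start in range(0, len(chunk), minibatch_size):
            yield chunk[start:start + minibatch_size]

# Any iterable works here, including an open file that is read line by line.
samples = ["sample-{}".format(i) for i in range(10)]
batches = list(iterate_minibatches(samples, minibatch_size=2, chunk_size=4))
print(len(batches))
print(batches)
```

Because only one chunk is held at a time, the memory needed depends on chunk_size and not on the size of the whole dataset.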

Creating MinibatchSource instance

In the previous section, we used the iris flower example and worked with a small in-memory dataset using Pandas DataFrames. Here, we will replace the code that uses data from a pandas DF with MinibatchSource. First, we need to create an instance of MinibatchSource with the help of the following steps −

Implementation Example

Step 1 − First, import the components for the minibatch source from the cntk.io module as follows −

```python
from cntk.io import (StreamDef, StreamDefs, MinibatchSource, CTFDeserializer,
    INFINITELY_REPEAT)
```

Step 2 − Now, by using the StreamDef class, create a stream definition for the labels −

```python
labels_stream = StreamDef(field='labels', shape=3, is_sparse=False)
```

Step 3 − Next, to read the features field from the input file, create another instance of StreamDef as follows −

```python
features_stream = StreamDef(field='features', shape=4, is_sparse=False)
```

Step 4 − Now, we need to provide the iris.ctf file as input and initialise the deserializer as follows −

```python
deserializer = CTFDeserializer('iris.ctf', StreamDefs(labels=
    labels_stream, features=features_stream))
```

Step 5 − At last, we need to create an instance of MinibatchSource by using the deserializer as follows −

```python
minibatch_source = MinibatchSource(deserializer, randomize=True)
```

Creating a MinibatchSource instance - Complete implementation example

```python
from cntk.io import StreamDef, StreamDefs, MinibatchSource, CTFDeserializer, INFINITELY_REPEAT
labels_stream = StreamDef(field='labels', shape=3, is_sparse=False)
features_stream = StreamDef(field='features', shape=4, is_sparse=False)
deserializer = CTFDeserializer('iris.ctf', StreamDefs(labels=labels_stream, features=features_stream))
minibatch_source = MinibatchSource(deserializer, randomize=True)
```

Creating a CTF file

As you have seen above, we are reading the data from the 'iris.ctf' file, which is stored in a file format called CNTK Text Format (CTF). The CTFDeserializer used by the MinibatchSource instance we created above needs such a file. In a CTF file, each line holds one sample, and each field is written as a | character followed by the field name and its values. Let us see how we can create a CTF file.

Implementation Example

Step 1 − First, we need to import the pandas and numpy packages as follows −

```python
import pandas as pd
import numpy as np
```

Step 2 − Next, we need to load our data file, i.e. iris.csv, into memory and store it in the df_source variable −

```python
df_source = pd.read_csv('iris.csv', names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], index_col=False)
```

Step 3 − Now, by using the iloc indexer, take the content of the first four columns as the features. Also, use the data from the species column as the labels −

```python
features = df_source.iloc[: , :4].values
labels = df_source['species'].values
```

Step 4 − Next, we need to create a mapping between the label name and its numeric representation. It can be done by creating label_mapping as follows −

```python
label_mapping = {'Iris-Setosa' : 0, 'Iris-Versicolor' : 1, 'Iris-Virginica' : 2}
```

Step 5 − Now, convert the labels to a set of one-hot encoded vectors as follows −

```python
labels = [one_hot(label_mapping[v], 3) for v in labels]
```

As before, this step relies on a utility function called one_hot to encode the labels, which must be defined before the step is run. It can be done as follows −

```python
def one_hot(index, length):
    result = np.zeros(length)
    result[index] = 1
    return result
```

Now that we have loaded and preprocessed the data, it is time to store it on disk in the CTF file format. We can do it with the help of the following Python code −

```python
with open('iris.ctf', 'w') as output_file:
    for index in range(0, features.shape[0]):
        feature_values = ' '.join([str(x) for x in np.nditer(features[index])])
        label_values = ' '.join([str(x) for x in np.nditer(labels[index])])
        output_file.write('|features {} |labels {}\n'.format(feature_values, label_values))
```
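To make the layout of a CTF line concrete, here is a minimal sketch that formats a single sample. The helper name to_ctf_line is made up for this illustration; the field names features and labels are the ones the stream definitions above expect.

```python
def to_ctf_line(feature_values, label_values):
    """Format one sample as a CTF line: each field is '|name' followed by its values."""
    features_text = " ".join(str(v) for v in feature_values)
    labels_text = " ".join(str(v) for v in label_values)
    return "|features {} |labels {}".format(features_text, labels_text)

line = to_ctf_line([5.1, 3.5, 1.4, 0.2], [1.0, 0.0, 0.0])
print(line)  # |features 5.1 3.5 1.4 0.2 |labels 1.0 0.0 0.0
```

Each field starts with a | character directly followed by its name, and the values are separated by spaces.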

Creating a CTF file - Complete implementation example

```python
import pandas as pd
import numpy as np

# Helper that must exist before the labels are encoded
def one_hot(index, length):
    result = np.zeros(length)
    result[index] = 1
    return result

df_source = pd.read_csv('iris.csv', names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'], index_col=False)
features = df_source.iloc[: , :4].values
labels = df_source['species'].values
label_mapping = {'Iris-Setosa' : 0, 'Iris-Versicolor' : 1, 'Iris-Virginica' : 2}
labels = [one_hot(label_mapping[v], 3) for v in labels]

# Write one sample per line in CTF format
with open('iris.ctf', 'w') as output_file:
    for index in range(0, features.shape[0]):
        feature_values = ' '.join([str(x) for x in np.nditer(features[index])])
        label_values = ' '.join([str(x) for x in np.nditer(labels[index])])
        output_file.write('|features {} |labels {}\n'.format(feature_values, label_values))
```

Feeding the data

Once we have created the MinibatchSource instance, we can use it to train the NN. We can use the same training logic as we used when we worked with small in-memory datasets. Here, we will pass the MinibatchSource instance as the input to the train method on the loss function as follows −

Implementation Example

Step 1 − In order to log the output of the training session, first import the ProgressPrinter from the cntk.logging module as follows −

```python
from cntk.logging import ProgressPrinter
```

Step 2 − Next, to set up the training session, import the Trainer and training_session from the cntk.train module as follows −

```python
from cntk.train import Trainer, training_session
```

Step 3 − Now, we need to define a set of constants, namely minibatch_size, samples_per_epoch and num_epochs, as follows −

```python
minibatch_size = 16
samples_per_epoch = 150
num_epochs = 30
```
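As a rough sanity check on these constants, the short calculation below works out how many minibatches one epoch needs and how many samples the whole run processes. It is plain arithmetic on the three values and assumes that the minibatch size is actually handed to the training call.

```python
import math

minibatch_size = 16
samples_per_epoch = 150
num_epochs = 30

# 150 samples do not divide evenly by 16, so the last minibatch of an epoch is smaller.
minibatches_per_epoch = math.ceil(samples_per_epoch / minibatch_size)
last_minibatch_size = samples_per_epoch - (minibatches_per_epoch - 1) * minibatch_size
total_samples = samples_per_epoch * num_epochs

print(minibatches_per_epoch)  # 10
print(last_minibatch_size)    # 6
print(total_samples)          # 4500
```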

Step 4 − Next, so that CNTK knows how to read data during training, we need to define a mapping between the input variables of the network and the streams in the minibatch source. Here, features and labels are the input variables of the network −

```python
input_map = {
    features: minibatch_source.streams.features,
    labels: minibatch_source.streams.labels
}
```

Step 5 − Next, to log the output of the training process, initialise the progress_writer variable with a new ProgressPrinter instance as follows −

```python
progress_writer = ProgressPrinter(0)
```

Step 6 − At last, we need to invoke the train method on the loss as follows −

```python
train_history = loss.train(minibatch_source,
    parameter_learners=[learner],
    model_inputs_to_streams=input_map,
    callbacks=[progress_writer],
    epoch_size=samples_per_epoch,
    max_epochs=num_epochs)
```

Feeding the data - Complete implementation example

```python
from cntk.logging import ProgressPrinter
from cntk.train import Trainer, training_session
minibatch_size = 16
samples_per_epoch = 150
num_epochs = 30
# features and labels are the input variables of the network
input_map = {
    features: minibatch_source.streams.features,
    labels: minibatch_source.streams.labels
}
progress_writer = ProgressPrinter(0)
train_history = loss.train(minibatch_source,
    parameter_learners=[learner],
    model_inputs_to_streams=input_map,
    callbacks=[progress_writer],
    epoch_size=samples_per_epoch,
    max_epochs=num_epochs)
```

Output

```
-------------------------------------------------------------------
average   since   average   since  examples
loss      last     metric   last
------------------------------------------------------
Learning rate per minibatch: 0.1
1.21      1.21      0        0       32
1.15      0.12      0        0       96
[………]
```