Apache MXNet - Distributed Training


This chapter is about the distributed training in Apache MXNet. Let us start by understanding what are the modes of computation in MXNet.

Modes of Computation

MXNet, a multi-language ML library, offers its users the following two modes of computation −

Imperative mode

This mode of computation exposes an interface like NumPy API. For example, in MXNet, use the following imperative code to construct a tensor of zeros on both CPU as well as GPU −

import mxnet as mx
tensor_cpu = mx.nd.zeros((100,), ctx=mx.cpu())
tensor_gpu= mx.nd.zeros((100,), ctx=mx.gpu(0))

As we see in the above code, MXNets specifies the location where to hold the tensor, either in CPU or GPU device. In above example, it is at location 0. MXNet achieve incredible utilisation of the device, because all the computations happen lazily instead of instantaneously.

Symbolic mode

Although the imperative mode is quite useful, but one of the drawbacks of this mode is its rigidity, i.e. all the computations need to be known beforehand along with pre-defined data structures.

On the other hand, Symbolic mode exposes a computation graph like TensorFlow. It removes the drawback of imperative API by allowing MXNet to work with symbols or variables instead of fixed/pre-defined data structures. Afterwards, the symbols can be interpreted as a set of operations as follows −

import mxnet as mx
x = mx.sym.Variable(“X”)
y = mx.sym.Variable(“Y”)
z = (x+y)
m = z/100

Kinds of Parallelism

Apache MXNet supports distributed training. It enables us to leverage multiple machines for faster as well as effective training.

Following are the two ways in which, we can distribute the workload of training a NN across multiple devices, CPU or GPU device −

Data Parallelism

In this kind of parallelism, each device stores a complete copy of the model and works with a different part of the dataset. Devices also update a shared model collectively. We can locate all the devices on a single machine or across multiple machines.

Model Parallelism

It is another kind of parallelism, which comes handy when models are so large that they do not fit into device memory. In model parallelism, different devices are assigned the task of learning different parts of the model. The important point here to note is that currently Apache MXNet supports model parallelism in a single machine only.

Working of distributed training

The concepts given below are the key to understand the working of distributed training in Apache MXNet −

Types of processes

Processes communicates with each other to accomplish the training of a model. Apache MXNet has the following three processes −


The job of worker node is to perform training on a batch of training samples. The Worker nodes will pull weights from the server before processing every batch. The Worker nodes will send gradients to the server, once the batch is processed.


MXNet can have multiple servers for storing the model’s parameters and to communicate with the worker nodes.


The role of the scheduler is to set up the cluster, which includes waiting for messages that each node has come up and which port the node is listening to. After setting up the cluster, the scheduler lets all the processes know about every other node in the cluster. It is because the processes can communicate with each other. There is only one scheduler.

KV Store

KV stores stands for Key-Value store. It is critical component used for multi-device training. It is important because, the communication of parameters across devices on single as well as across multiple machines is transmitted through one or more servers with a KVStore for the parameters. Let’s understand the working of KVStore with the help of following points −

  • Each value in KVStore is represented by a key and a value.

  • Each parameter array in the network is assigned a key and the weights of that parameter array is referred by value.

  • After that, the worker nodes push gradients after processing a batch. They also pull updated weights before processing a new batch.

The notion of KVStore server exists only during distributed training and the distributed mode of it is enabled by calling mxnet.kvstore.create function with a string argument containing the word dist

kv = mxnet.kvstore.create(‘dist_sync’)

Distribution of Keys

It is not necessary that, all the servers store all the parameters array or keys, but they are distributed across different servers. Such distribution of keys across different servers is handled transparently by the KVStore and the decision of which server stores a specific key is made at random.

KVStore, as discussed above, ensures that whenever the key is pulled, its request is sent to that server, which has the corresponding value. What if the value of some key is large? In that case, it may be shared across different servers.

Split training data

As being the users, we want each machine to be working on different parts of the dataset, especially, when running distributed training in data parallel mode. We know that, to split a batch of samples provided by the data iterator for data parallel training on a single worker we can use mxnet.gluon.utils.split_and_load and then, load each part of the batch on the device which will process it further.

On the other hand, in case of distributed training, at beginning we need to divide the dataset into n different parts so that every worker gets a different part. Once got, each worker can then use split_and_load to again divide that part of the dataset across different devices on a single machine. All this happen through data iterator. mxnet.io.MNISTIterator and mxnet.io.ImageRecordIter are two such iterators in MXNet that support this feature.

Weights updating

For updating the weights, KVStore supports following two modes −

  • First method aggregates the gradients and updates the weights by using those gradients.

  • In the second method the server only aggregates gradients.

If you are using Gluon, there is an option to choose between above stated methods by passing update_on_kvstore variable. Let’s understand it by creating the trainer object as follows −

trainer = gluon.Trainer(net.collect_params(), optimizer='sgd',
   optimizer_params={'learning_rate': opt.lr,
      'wd': opt.wd,
      'momentum': opt.momentum,
      'multi_precision': True},

Modes of Distributed Training

If the KVStore creation string contains the word dist, it means the distributed training is enabled. Following are different modes of distributed training that can be enabled by using different types of KVStore −


As name implies, it denotes synchronous distributed training. In this, all the workers use the same synchronized set of model parameters at the start of every batch.

The drawback of this mode is that, after each batch the server should have to wait to receive gradients from each worker before it updates the model parameters. This means that if a worker crashes, it would halt the progress of all workers.


As name implies, it denotes synchronous distributed training. In this, the server receives gradients from one worker and immediately updates its store. Server uses the updated store to respond to any further pulls.

The advantage, in comparison of dist_sync mode, is that a worker who finishes processing a batch can pull the current parameters from server and start the next batch. The worker can do so, even if the other worker has not yet finished processing the earlier batch. It is also faster than dist_sync mode because, it can take more epochs to converge without any cost of synchronization.


This mode is same as dist_sync mode. The only difference is that, when there are multiple GPUs being used on every node dist_sync_device aggregates gradients and updates weights on GPU whereas, dist_sync aggregates gradients and updates weights on CPU memory.

It reduces expensive communication between GPU and CPU. That is why, it is faster than dist_sync. The drawback is that it increases the memory usage on GPU.


This mode works same as dist_sync_device mode, but in asynchronous mode.