Distributed Training with DeepSpeed



Single-GPU training becomes inefficient or impossible as model and dataset sizes grow. Distributed training lets models scale from a single GPU to multiple GPUs and nodes, and DeepSpeed, developed by Microsoft, is one of the best-equipped frameworks for optimizing it. It makes large models tractable and reduces memory overhead through techniques such as data parallelism, model parallelism, and the Zero Redundancy Optimizer (ZeRO).

Basic Distributed Training

Distributed training splits the work of training a machine learning model across many computing resources, such as GPUs or cluster nodes. By scaling up both data and computation, it makes training large models practical and efficient.

Why Distributed Training?

The following are key reasons to consider distributed training when working on large deep learning models −

  • Scalability − Models with hundreds of millions or even billions of parameters are prohibitively hard to train on a single GPU. Distributed training scales the process across many GPUs.
  • Faster Convergence − Spreading the training workload over several GPUs shortens each epoch, which speeds up model development.
  • Resource Efficiency − Distributed training puts the available hardware to maximum use, saving both time and money.
  • Data Parallelism − The model is replicated on every GPU, and each GPU processes a different batch of the dataset.
  • Model Parallelism − The model itself is split across multiple GPUs; each GPU executes part of the model's operations.
  • Hybrid Parallelism − A mix of data and model parallelism: the data is split across GPUs and the model is further partitioned as well.

Data Parallelism

DeepSpeed facilitates distributed training by offering flexible model and data parallelism. Let's explore these in depth.

With data parallelism, each GPU or worker receives a portion of each batch to process. After the backward pass, the gradients are averaged across workers before the model weights are updated, so every replica stays in sync. This makes it possible to train with larger effective batch sizes without running out of memory.
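
To make the mechanics concrete, the sketch below shows roughly what a data-parallel framework does under the hood with plain torch.distributed: every rank computes gradients on its own shard of data and then averages them with an all_reduce. This is an illustrative sketch only (it assumes a process group has already been initialized, for example by a launcher); DeepSpeed performs the equivalent synchronization for you inside backward() and step().

import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module):
    """Average gradients across all ranks (what data parallelism does implicitly)."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this parameter's gradient over all ranks, then divide by the world size
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

# Hypothetical usage inside a manual training loop (assumes dist.init_process_group
# has already been called and `model`, `loss`, `optimizer` exist on this rank):
#   loss.backward()
#   average_gradients(model)
#   optimizer.step()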

Example of Data Parallelism With DeepSpeed

The following is a simple Python example to show data parallelism with DeepSpeed −

import torch
import deepspeed

# Define a simple neural network model
class SimpleModel(torch.nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = torch.nn.Linear(784, 128)
        self.fc2 = torch.nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

# Initialize DeepSpeed configuration
deepspeed_config = {
    "train_batch_size": 64,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001
        }
    }
}

# Initialize model
model = SimpleModel()

# Initialize DeepSpeed for distributed data parallelism
model_engine, optimizer, _, _ = deepspeed.initialize(
    config=deepspeed_config,
    model=model,
    model_parameters=model.parameters()  # needed so DeepSpeed can build the Adam optimizer from the config
)

# Dummy data, moved to the device assigned to this process
inputs = torch.randn(64, 784).to(model_engine.device)
labels = torch.randint(0, 10, (64,)).to(model_engine.device)

# Forward pass
outputs = model_engine(inputs)
loss = torch.nn.functional.cross_entropy(outputs, labels)

# Backward pass and optimization
model_engine.backward(loss)
model_engine.step()

When launched with the DeepSpeed launcher, the neural network now trains on several GPUs, with each GPU handling a portion of each batch.
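
The configuration does not have to live inside the script. deepspeed.initialize also accepts a path to a JSON file in place of a dictionary, which is convenient when the same settings are shared across scripts. A minimal sketch of that variant, where the file name ds_config.json is just an illustrative choice and SimpleModel is the class defined above:

import json
import torch
import deepspeed

# Write the same settings to a JSON file once...
deepspeed_config = {
    "train_batch_size": 64,
    "optimizer": {"type": "Adam", "params": {"lr": 0.001}}
}
with open("ds_config.json", "w") as f:
    json.dump(deepspeed_config, f, indent=2)

# ...and point DeepSpeed at the file instead of passing the dict
model = SimpleModel()
model_engine, optimizer, _, _ = deepspeed.initialize(
    config="ds_config.json",
    model=model,
    model_parameters=model.parameters()
)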

Model Parallelism

Model parallelism deals with splitting the model across multiple GPUs. This becomes helpful when a single model does not fit into the memory of a single GPU.

Model Parallelism With DeepSpeed

DeepSpeed's pipeline engine splits the model into stages that live on different GPUs, so different parts of the model can execute on different GPUs concurrently.

Example of Model Parallelism With DeepSpeed

The following is a simple Python program that shows how model parallelism works with DeepSpeed's pipeline engine −

import torch
import deepspeed
from deepspeed.pipe import PipelineModule, LayerSpec
from torch.utils.data import TensorDataset

# Define a simple layer used as a pipeline stage
class SimpleLayer(torch.nn.Module):
    def __init__(self, input_size, output_size):
        super(SimpleLayer, self).__init__()
        self.fc = torch.nn.Linear(input_size, output_size)

    def forward(self, x):
        return torch.relu(self.fc(x))

# Two layers arranged as a two-stage pipeline (one stage per GPU)
layers = [
    LayerSpec(SimpleLayer, 784, 128),
    LayerSpec(SimpleLayer, 128, 10)
]

# Create the pipeline model with two stages and a loss function
pipeline_model = PipelineModule(
    layers=layers,
    num_stages=2,
    loss_fn=torch.nn.CrossEntropyLoss()
)

# Dummy dataset of (input, label) pairs for the pipeline engine
dataset = TensorDataset(torch.randn(1024, 784), torch.randint(0, 10, (1024,)))

# Initialize DeepSpeed for model (pipeline) parallelism,
# reusing the deepspeed_config dictionary from the previous example
model_engine, optimizer, _, _ = deepspeed.initialize(
    config=deepspeed_config,
    model=pipeline_model,
    model_parameters=[p for p in pipeline_model.parameters() if p.requires_grad],
    training_data=dataset
)

# A pipeline engine is driven with train_batch(), which schedules the forward
# and backward passes of the micro-batches across both pipeline stages
loss = model_engine.train_batch()

The forward and backward passes now run across the two GPUs in pipelined stages: the first GPU executes the first layer, and the second GPU executes the second layer.

Zero Redundancy Optimizer (ZeRO)

The most salient feature of DeepSpeed is perhaps the Zero Redundancy Optimizer, more commonly called ZeRO, which is designed to solve the memory consumption problem of model training. Instead of every GPU keeping a full copy, ZeRO partitions the optimizer states, gradients, and parameters across GPUs, allowing much more efficient use of memory.

ZeRO includes three stages −

  • Stage 1 − Partitions the optimizer states.
  • Stage 2 − Additionally partitions the gradients.
  • Stage 3 − Additionally partitions the model parameters.

Example of Zero Redundancy Optimizer

The following is a simple example of the Zero Redundancy Optimizer in Python −

import torch
import deepspeed

# Define the DeepSpeed settings with ZeRO optimization enabled
deepspeed_config = {
    "train_batch_size": 64,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001
        }
    },
    "zero_optimization": {
        "stage": 2 # Toggle gradient partitioning using ZeRO Stage 2
    }
}

# Initialize model (reuses the SimpleModel class defined earlier)
model = SimpleModel()

# Initialize DeepSpeed with ZeRO optimization
model_engine, optimizer, _, _ = deepspeed.initialize(
    config=deepspeed_config,
    model=model,
    model_parameters=model.parameters()
)

# Forward pass with dummy data on this process's device
inputs = torch.randn(64, 784).to(model_engine.device)
labels = torch.randint(0, 10, (64,)).to(model_engine.device)
outputs = model_engine(inputs)
loss = torch.nn.functional.cross_entropy(outputs, labels)

# Backward pass and optimization
model_engine.backward(loss)
model_engine.step()

This code runs with ZeRO Stage 2, so the optimizer states and gradients are partitioned across GPUs, reducing memory consumption during training.
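
Moving up to Stage 3, which also partitions the model parameters themselves, only requires changing the configuration; the training code above stays the same. A minimal sketch of the adjusted config, under the same assumptions as the example above:

# Same settings as before, but with ZeRO Stage 3 enabled so that
# optimizer states, gradients, and parameters are all partitioned
deepspeed_config = {
    "train_batch_size": 64,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001
        }
    },
    "zero_optimization": {
        "stage": 3
    }
}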

Scaling Models Across Multiple GPUs and Nodes

DeepSpeed scales models across multiple GPUs and nodes by combining the parallelism strategies above with its communication layer, which handles the synchronization between GPUs and between nodes.

Scaling Example with Multiple Nodes

DeepSpeed uses the NCCL backend for inter-GPU communication when scaling training to multiple GPUs and nodes. To run on multiple GPUs and nodes with the DeepSpeed launcher, you can use the following command (multi-node launches typically also require a hostfile listing the participating machines):

deepspeed --num_nodes 2 --num_gpus 8 train.py

Here --num_gpus is the number of GPUs per node, so this launches training on 2 nodes with 8 GPUs each, 16 GPUs in total.
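
Inside train.py, the launcher passes a --local_rank argument to every process it spawns. A hedged sketch of how such a script might pick it up (the argument handling shown here is a common pattern, not a fixed DeepSpeed requirement):

import argparse
import torch
import deepspeed

parser = argparse.ArgumentParser(description="DeepSpeed training script")
parser.add_argument("--local_rank", type=int, default=-1,
                    help="local rank passed in by the deepspeed launcher")
# Adds standard DeepSpeed arguments such as --deepspeed_config
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()

# Bind this process to its assigned GPU before initializing DeepSpeed
if args.local_rank >= 0:
    torch.cuda.set_device(args.local_rank)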

Example of Training on Multiple GPUs with DeepSpeed

The following example demonstrates training on multiple GPUs with DeepSpeed −

import torch
import deepspeed

# Initialize DeepSpeed (model and deepspeed_config are defined as in the examples above)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config=deepspeed_config,
    model_parameters=model.parameters()
)

# Only print from the first process to avoid duplicate output
if torch.distributed.get_rank() == 0:
    print("Training on multiple GPUs with DeepSpeed")

# Training loop (train_loader is an ordinary PyTorch DataLoader)
for batch in train_loader:
    inputs, labels = batch
    inputs = inputs.to(model_engine.device)
    labels = labels.to(model_engine.device)
    outputs = model_engine(inputs)
    loss = torch.nn.functional.cross_entropy(outputs, labels)
    model_engine.backward(loss)
    model_engine.step()

This code uses DeepSpeed to train the model across multiple GPUs in a memory-efficient way, applying optimizations such as ZeRO when they are enabled in the configuration.

Summing Up

DeepSpeed is a powerful framework for scaling and optimizing distributed training of deep learning models. By combining data parallelism and model parallelism with ZeRO for scaling across multiple GPUs and nodes, it addresses the main challenges of training large models efficiently, while keeping distributed training accessible and performant as models grow.
