DeepSpeed - Optimizer



Optimization and scheduling are the foundations of good performance when training large-scale deep learning models. DeepSpeed is an open-source deep learning optimization library that makes model training more efficient through techniques such as memory optimization, gradient accumulation, and mixed-precision training.

The two key components of DeepSpeed are the DeepSpeed Optimizer and the DeepSpeed Scheduler. They work together to manage system resources efficiently, accelerate training, and reduce the memory footprint, making it possible to train models with billions of parameters on modest hardware.

Let's look at how the DeepSpeed Optimizer works in detail, along with code examples of how it is used. The DeepSpeed Scheduler is covered in the following chapter.

What is DeepSpeed Optimizer?

The DeepSpeed Optimizer manages model optimization by distributing memory efficiently. It integrates natively with popular deep learning frameworks such as PyTorch and takes over the handling of optimizer states, including momentum buffers and accumulated gradients. Its main features include the Zero Redundancy Optimizer (ZeRO), mixed-precision training, and gradient checkpointing.

Key Features of DeepSpeed Optimizer

The following are key features of DeepSpeed Optimizer −

1. Zero Redundancy Optimizer (ZeRO)

ZeRO reduces memory consumption by partitioning optimizer states, gradients, and model parameters across multiple devices.

This makes it possible to train very large models on devices with limited memory; a configuration sketch follows below.
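
As a rough sketch, ZeRO is enabled through the zero_optimization section of the DeepSpeed config. The stage and offload settings below are illustrative example values, not tuned recommendations for any particular model −

# Illustrative ZeRO settings (example values only)
zero_config = {
    "zero_optimization": {
        "stage": 2,                     # 1: partition optimizer states, 2: + gradients, 3: + parameters
        "offload_optimizer": {          # optionally move optimizer states to CPU memory
            "device": "cpu"
        },
        "overlap_comm": True,           # overlap gradient communication with the backward pass
        "contiguous_gradients": True    # reduce memory fragmentation
    }
}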

2. Mixed Precision Training

By combining 16-bit and 32-bit floating-point representations, mixed-precision training reduces memory consumption and speeds up computation while preserving model accuracy, as sketched below.
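
In DeepSpeed, mixed precision is switched on from the config rather than in the training code. The fragment below is a minimal sketch with example values; setting loss_scale to 0 requests dynamic loss scaling, and newer GPUs can use bf16 instead of fp16 −

# Illustrative mixed-precision settings (example values only)
fp16_config = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,               # 0 means dynamic loss scaling
        "initial_scale_power": 16      # dynamic scaling starts at 2**16
    }
    # alternatively, on newer GPUs: "bf16": {"enabled": True}
}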

3. Gradient Checkpointing

Gradient (activation) checkpointing splits the model into segments, stores only a subset of activations during the forward pass, and recomputes the intermediate values during the backward pass, trading extra computation for a much smaller memory footprint; a minimal sketch follows below.
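
The sketch below illustrates the underlying technique using PyTorch's torch.utils.checkpoint; DeepSpeed builds on the same idea and exposes its own activation checkpointing utilities, configured through the activation_checkpointing section of the config −

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    # A small block whose intermediate activations are not stored;
    # they are recomputed during the backward pass instead.
    def __init__(self, dim=128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, x):
        # checkpoint() trades extra compute for lower activation memory
        return checkpoint(self.layers, x, use_reentrant=False)

block = CheckpointedBlock()
x = torch.randn(4, 128, requires_grad=True)
block(x).sum().backward()   # activations are recomputed during this backward pass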

Example of Using DeepSpeed Optimizer

Following is a PyTorch-based example using DeepSpeed Optimizer with ZeRO −

import deepspeed
import torch
import torch.nn as nn

# Sample model definition
class SampleModel(nn.Module):
    def __init__(self):
        super(SampleModel, self).__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

# Initialize the model; the Adam optimizer is defined in the DeepSpeed config below
model = SampleModel()

# DeepSpeed configuration
ds_config = {
    "train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001,
        }
    },
    "zero_optimization": {
        "stage": 1
    }
}

# Initialize DeepSpeed; the engine builds the Adam optimizer from ds_config
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)
print("DeepSpeed is initialized")

# Sample input and forward pass (the batch is moved to the engine's device)
inputs = torch.randn(8, 10).to(model_engine.device)
print("Input tensor:", inputs.shape)
outputs = model_engine(inputs)
loss = outputs.mean()
print("Forward pass completed")
print(f"Loss: {loss.item():.3f}")

# Backward pass and optimization
model_engine.backward(loss)
model_engine.step()
print("Backward pass and optimizer step complete")

Output

When the script is run, for example from the integrated terminal of an IDE such as PyCharm or VS Code (DeepSpeed programs are typically launched with the deepspeed launcher, e.g. deepspeed your_script.py), the output looks similar to the following. DeepSpeed first prints its own initialization logs, and the exact loss value varies between runs −

DeepSpeed is initialized
Input tensor: torch.Size([8, 10])
Forward pass completed
Loss: -0.015
Backward pass and optimizer step complete

The terminal output above confirms that the DeepSpeed engine was initialized, the forward and backward passes completed, and the optimizer step executed successfully.

Working through the examples and outputs in this chapter should make it much easier to apply these tools in your own deep learning workflow.
