
Memory Optimization with DeepSpeed
As deep learning models grow in size and computational complexity, memory optimization becomes crucial during training. DeepSpeed provides memory-saving techniques such as offloading, gradient checkpointing, and ZeRO that let developers train very large models on standard hardware, models that would otherwise be out of reach because of hardware limits.
DeepSpeed has been widely adopted in both research and industry, making it an indispensable tool for deep learning practitioners. With these strategies, you can reduce your model's memory footprint and push well beyond the apparent limits of your hardware.
Why Memory Optimization?
Memory optimization is one of the most critical aspects of training deep learning models. For models with billions of parameters, such as GPT and BERT, effective memory management is essential to train on the hardware that is actually available. DeepSpeed is an open-source deep learning optimization library that offers the ZeRO optimizer, offloading techniques, and gradient checkpointing to keep memory usage under control during training.
Memory Issues in Deep Learning
The size and complexity of deep learning models have grown rapidly in recent years, and such massive models need a vast amount of memory to train. DeepSpeed, a deep learning optimization library developed by Microsoft, provides powerful solutions to these challenges.
As models grow, a number of memory-related issues appear during training. The most common ones include:
- Model Parameters − Large models such as GPT-3 have hundreds of billions of parameters and therefore require a huge amount of memory just to store.
- Gradients − A gradient must be computed and kept in memory for every parameter during training, which roughly doubles that requirement.
- Activation Maps − The intermediate values produced during the forward pass, known as activation maps, must be kept until the backward pass so that gradients can be computed.
- Batch Sizes − Larger batch sizes can speed up convergence but consume more memory.
- Data Parallelism − Splitting the data across several GPUs is a great way to cut training time, but each GPU holds its own copy of the model and optimizer state, so memory use multiplies unless it is kept in check.
Unless these costs are kept in check, training large models becomes impossible, especially on consumer-grade hardware. DeepSpeed tackles them with a set of innovative memory-saving techniques.
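To put rough numbers on these costs, here is a minimal back-of-the-envelope sketch. It counts only parameters, gradients, and Adam optimizer states for a mixed-precision setup (the common "about 16 bytes per parameter" accounting); activations, batch size, and framework overhead come on top of this, and the helper name is purely illustrative.

def estimate_training_memory_gb(num_params):
    # fp16/bf16 weights and gradients: 2 bytes each per parameter
    params_bytes = num_params * 2
    grads_bytes = num_params * 2
    # Mixed-precision Adam keeps fp32 master weights, momentum, and
    # variance: roughly 12 bytes per parameter
    optimizer_bytes = num_params * 12
    return (params_bytes + grads_bytes + optimizer_bytes) / 1024**3

# A 1.5-billion-parameter model (GPT-2 XL scale) already needs roughly
# 22 GB before a single activation is stored
print(f"{estimate_training_memory_gb(1.5e9):.1f} GB")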
DeepSpeed's Memory Optimization Techniques
DeepSpeed offers several ways to optimize memory usage during training. The main ones are the Zero Redundancy Optimizer (ZeRO), gradient checkpointing (activation recomputation), and offloading.
1. Zero Redundancy Optimizer (ZeRO)
ZeRO optimizes memory by removing the redundant copies of optimizer states, gradients, and model parameters that standard data parallelism keeps on every GPU. It comes in three stages:
- Stage 1 − Optimizer states are sharded across GPUs, with each GPU storing only a portion of them.
- Stage 2 − Gradients are also sharded across GPUs, reducing memory further.
- Stage 3 − Model parameters are sharded as well, making it possible to train models with up to a trillion parameters.
Example
import torch
import deepspeed

model = MyModel()  # your deep learning model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# DeepSpeed configuration for ZeRO
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 2,  # adjust the ZeRO stage here
        "allgather_partitions": True,
        "reduce_scatter": True,
        "allgather_bucket_size": 5e8,
        "overlap_comm": True,
        "contiguous_gradients": True
    }
}

# Initialize DeepSpeed
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config_params=ds_config
)

# Training loop
for batch in train_dataloader:
    loss = model_engine(batch)  # assumes the model returns its loss
    model_engine.backward(loss)
    model_engine.step()
With ZeRO enabled, memory usage is noticeably lower, especially for large models. A memory profiler can show exactly where the savings come from.
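If you want to verify the savings yourself, the following minimal sketch wraps a single training step with PyTorch's built-in CUDA memory statistics; it assumes the model_engine and train_dataloader from the example above.

import torch

# Reset the peak-memory counter before the measured step
torch.cuda.reset_peak_memory_stats()

batch = next(iter(train_dataloader))
loss = model_engine(batch)
model_engine.backward(loss)
model_engine.step()

# Peak GPU memory allocated during this step, in GB
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory this step: {peak_gb:.2f} GB")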
2. Gradient Checkpointing
Gradient checkpointing reduces memory by not storing all activations during the forward pass. They are instead recomputed during the backward pass, trading a little extra computation for a significant memory saving.
Example
import torch
from torch.utils.checkpoint import checkpoint

def custom_forward(*inputs):
    return model(*inputs)

# Gradient checkpointing: activations inside custom_forward are recomputed
# during the backward pass instead of being stored
outputs = checkpoint(custom_forward, input_data)
loss = criterion(outputs, labels)
loss.backward()
In this case, memory saved will depend on the size of intermediate activations.
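DeepSpeed also ships its own activation checkpointing that can be driven from the configuration file. The sketch below shows a minimal activation_checkpointing section; it assumes a DeepSpeed version where these flags are available, and the values should be tuned to your model rather than taken as recommendations.

ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 2
    },
    # DeepSpeed's own activation checkpointing options
    "activation_checkpointing": {
        "partition_activations": True,   # shard checkpointed activations across GPUs
        "cpu_checkpointing": False,      # optionally push checkpoints to CPU memory
        "contiguous_memory_optimization": True,
        "synchronize_checkpoint_boundary": False
    }
}

With this configuration, deepspeed.checkpointing.checkpoint can typically be used in place of torch.utils.checkpoint.checkpoint in the snippet above.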
3. Offloading Techniques
DeepSpeed also supports another form of memory saving: offloading. It lets you move parts of the training state, such as optimizer states and gradients, to CPU memory or even NVMe storage, freeing GPU memory for the rest of the computation.
CPU Offloading
DeepSpeed can offload optimizer states and gradients to the CPU, freeing up precious GPU memory. This is especially useful when GPU memory is limited but the host has plenty of RAM.
Example
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 3,  # parameter offloading requires ZeRO stage 3
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        }
    }
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config_params=ds_config
)
Because offloading to the CPU involves transferring data between devices, training runs somewhat slower, but it enables model sizes that otherwise would not fit on memory-constrained GPUs.
NVMe Offloading
When CPU memory is still not enough for very large models, DeepSpeed can also offload optimizer states and parameters to NVMe storage, increasing the scale of models that can be trained even further.
Example
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 3,  # NVMe offloading requires ZeRO stage 3
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme"
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme"
        }
    }
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config_params=ds_config
)
NVMe offloading enables the training of massive models, though throughput depends heavily on the I/O speed of the NVMe drive.
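That throughput can often be tuned through DeepSpeed's asynchronous I/O settings. The sketch below adds an aio section to the configuration above; the values shown are common starting points, not recommendations for any particular drive.

ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"}
    },
    # Asynchronous I/O settings that control NVMe offloading throughput
    "aio": {
        "block_size": 1048576,   # bytes per I/O request
        "queue_depth": 8,        # outstanding requests per submission queue
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True
    }
}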
Memory Optimization Case Studies
Let's discuss some real-life case studies of memory optimization using DeepSpeed −
Case Study 1: Training GPT-2 using ZeRO Optimization
Using DeepSpeed, a research team brought the training of a 1.5-billion-parameter GPT-2 model down to consumer-grade GPUs. With ZeRO stage 3, they trained it on 4 NVIDIA RTX 3090 GPUs with 24 GB of memory each. Without ZeRO, training would have been impossible, since the model would have required more than 50 GB of memory per GPU.
Case Study 2: Offloading with NVMe for a 175B Parameter Model
Microsoft leveraged DeepSpeed's offloading capability to train a 175-billion-parameter model on a cluster of GPUs with limited memory. By offloading optimizer states and parameters, the model was trained with almost no memory bottlenecks, showing how offloading opens the door to super-large models even when GPU resources are limited.