DeepSpeed - Advanced Features

DeepSpeed is a deep learning optimization library from Microsoft for training large-scale models. It focuses on scaling and efficiency techniques that keep resource usage low.

Most users are familiar with its core functionality, but getting the most out of DeepSpeed requires a working knowledge of its advanced features: custom operators, detailed configuration options, and profiling and debugging tools.

This chapter explores these features in more depth and shows how to apply them in your own deep learning projects.

Custom Operators in DeepSpeed

DeepSpeed's custom operators let you optimize specific parts of a model so that they compute efficiently with little overhead. They are useful when the default implementation of an operation is too slow or too memory-hungry for the task at hand.

Example: Custom Operators

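DeepSpeed lets you build and embed custom CUDA and CPU operators. The example below illustrates the pattern with a simple operator written as a PyTorch autograd function and exposed through a small wrapper class; a production operator would typically wrap a compiled CUDA or C++ kernel instead.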

import torch
# Custom add operator built on PyTorch's autograd
class AddOp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        # Addition needs no saved tensors: its gradient does not depend on x or y
        return x + y

    @staticmethod
    def backward(ctx, grad_output):
        # d(x + y)/dx = d(x + y)/dy = 1, so the upstream gradient passes through unchanged
        return grad_output, grad_output

# Wrapper class for the custom operator
class CustomAddOp:
    def build(self):
        return AddOp.apply

# Testing the custom operator
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = torch.tensor([4.0, 5.0, 6.0], requires_grad=True)

custom_add = CustomAddOp().build()
output = custom_add(x, y)
output.backward(torch.ones_like(x))

print(f"Output: {output}")
print(f"x Gradient: {x.grad}")
print(f"y Gradient: {y.grad}")

Output

Output: tensor([5., 7., 9.], grad_fn=<AddOpBackward>)
x Gradient: tensor([1., 1., 1.])
y Gradient: tensor([1., 1., 1.])

Custom operators suit developers who need to optimize a particular layer, and they are most valuable in large models and compute-intensive workloads.
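A hand-written backward pass is easy to get wrong, so it helps to check it numerically. The sketch below is not part of DeepSpeed: it defines a hypothetical `MulOp` (an element-wise product, chosen because its gradient depends on the inputs, unlike the add operator above) and verifies it with PyTorch's `torch.autograd.gradcheck`.

```python
import torch


class MulOp(torch.autograd.Function):
    """Hypothetical example operator: element-wise product with a manual backward."""

    @staticmethod
    def forward(ctx, x, y):
        # The gradient of x * y depends on both inputs, so save them for backward.
        ctx.save_for_backward(x, y)
        return x * y

    @staticmethod
    def backward(ctx, grad_output):
        x, y = ctx.saved_tensors
        # d(x*y)/dx = y and d(x*y)/dy = x, each scaled by the upstream gradient.
        return grad_output * y, grad_output * x


# gradcheck compares the analytic backward against finite differences.
# It expects double-precision inputs that require gradients.
x = torch.randn(4, dtype=torch.double, requires_grad=True)
y = torch.randn(4, dtype=torch.double, requires_grad=True)
gradcheck_passed = torch.autograd.gradcheck(MulOp.apply, (x, y))
print(f"gradcheck passed: {gradcheck_passed}")
```

The same check can be run against any operator before it is wired into a larger model, which keeps gradient bugs from surfacing later as unexplained training divergence.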

More Fine-Tuned Configuration Options

DeepSpeed has a flexible configuration system that gives fine-grained control over how models are trained. Options are specified in a JSON configuration file.

Example Configuration File

Here is a small DeepSpeed config file that uses several advanced options −

{
    "train_batch_size": 32,
    "gradient_accumulation_steps": 2,
    "fp16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 0.001,
            "betas": [0.9, 0.999],
            "eps": 1e-08,
            "weight_decay": 0.01
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0.0,
            "warmup_max_lr": 0.001,
            "warmup_num_steps": 100
        }
    },
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": true,
        "reduce_scatter": true,
        "allgather_partitions": true
    }
}
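In DeepSpeed, `train_batch_size` is the product of the per-GPU micro-batch size, `gradient_accumulation_steps`, and the number of GPUs, so the values in the file must divide evenly. The helper below is a hypothetical sanity check (not a DeepSpeed function) that derives the per-GPU micro-batch size implied by the configuration above for an assumed GPU count.

```python
def micro_batch_per_gpu(train_batch_size, gradient_accumulation_steps, world_size):
    """Return the per-GPU micro-batch size implied by the other three values.

    Relies on: train_batch_size = micro_batch * accumulation_steps * world_size.
    """
    denominator = gradient_accumulation_steps * world_size
    if train_batch_size % denominator != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"gradient_accumulation_steps * world_size = {denominator}"
        )
    return train_batch_size // denominator


# Values from the config above, with the GPU count as an assumption.
print(micro_batch_per_gpu(32, 2, world_size=4))  # 4
print(micro_batch_per_gpu(32, 2, world_size=1))  # 16
```

Running a check like this before launching a job catches mismatched batch settings early, for example when the same config file is reused on a different number of GPUs.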

Advanced Options

fp16 − Enables mixed-precision training, which typically speeds up training and lowers memory use with little loss of accuracy.
zero_optimization − Configures the Zero Redundancy Optimizer (ZeRO), which reduces per-GPU memory by partitioning optimizer states and, from stage 2, gradients across data-parallel processes instead of replicating them on every GPU.
gradient_accumulation_steps − Accumulates gradients over several micro-batches before each optimizer step, so a large effective batch size can be reached without fitting the whole batch in memory at once. This helps on memory-constrained hardware.
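To get a feel for what the ZeRO stage setting changes, the following rough estimate may help. It assumes the commonly quoted accounting for mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states) and ignores activations, buffers, and fragmentation. The function name and the example model size are illustrative only.

```python
def zero_model_state_gb(num_params, world_size, stage):
    """Approximate per-GPU memory (GB) for model states under mixed-precision Adam."""
    params_bytes = 2 * num_params   # fp16 parameters
    grads_bytes = 2 * num_params    # fp16 gradients
    optim_bytes = 12 * num_params   # fp32 master weights, momentum, variance

    if stage >= 1:                  # stage 1 partitions optimizer states
        optim_bytes /= world_size
    if stage >= 2:                  # stage 2 also partitions gradients
        grads_bytes /= world_size
    if stage >= 3:                  # stage 3 also partitions parameters
        params_bytes /= world_size

    return (params_bytes + grads_bytes + optim_bytes) / 1e9


# Assumed scenario: a 1-billion-parameter model on 8 GPUs.
for stage in range(4):
    estimate = zero_model_state_gb(1_000_000_000, world_size=8, stage=stage)
    print(f"stage {stage}: {estimate:.2f} GB per GPU")
```

Under these assumptions, stage 2 (the setting used in the config above) brings the model-state footprint from 16 GB down to under 4 GB per GPU.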

Loading and Using the Configuration

To load and use the configuration in your training script, pass it to deepspeed.initialize −

import deepspeed
import torch

# Assuming a model and dataloader are already defined
# For example: model = YourModelClass() and dataloader = DataLoader(...)

# Load DeepSpeed configuration
# ds_config.json should contain the configuration details for DeepSpeed
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config='ds_config.json'
)

# Training loop
for batch in dataloader:
    # Forward pass
    outputs = model_engine(batch)
    loss = outputs['loss'] if isinstance(outputs, dict) and 'loss' in outputs else outputs

    # Backward pass and optimization step
    model_engine.backward(loss)
    model_engine.step()
# backward() and step() apply loss scaling, gradient accumulation, and ZeRO partitioning as configured.

This setup lets the training loop use DeepSpeed's memory and compute optimizations, which matters most for large models.

Profiling and Debugging DeepSpeed Applications

Profiling and debugging help identify bottlenecks in large models and confirm that your code runs efficiently. DeepSpeed offers several tools for this, including built-in logging and interoperability with mainstream profilers such as NVIDIA Nsight Systems.

Using DeepSpeed Profiler

DeepSpeed can report performance timings while a model trains. You enable this through the configuration passed to your training script.

import deepspeed
import torch

# Define DeepSpeed configuration with profiling enabled
deepspeed_config = {
    "train_batch_size": 32,
    "steps_per_print": 10,
    "gradient_clipping": 1.0,
    # Optimizer added so that model_engine.step() has something to update
    "optimizer": {"type": "AdamW", "params": {"lr": 0.001}},
    "wall_clock_breakdown": True  # Enables detailed timing breakdown for profiling
}

# Assuming `model` and `data_loader` are defined
# Initialize DeepSpeed with profiling enabled
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=deepspeed_config  # `config` accepts a dict or a path to a JSON file
)

# Start training with profiling enabled
for batch in data_loader:
    # Forward pass
    outputs = model_engine(batch)
    loss = outputs['loss'] if isinstance(outputs, dict) and 'loss' in outputs else outputs

    # Backward pass and optimization step
    model_engine.backward(loss)
    model_engine.step()

# With `wall_clock_breakdown` enabled, DeepSpeed logs the time spent in the forward, backward, and step phases.
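To make the idea of a per-phase timing breakdown concrete, here is a small framework-independent sketch. `PhaseTimer` is a hypothetical helper, not DeepSpeed's implementation, and the `time.sleep` calls stand in for the real forward, backward, and step calls. On a GPU the work is asynchronous, so a real measurement would also need to synchronize the device before reading the clock.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Hypothetical helper: accumulates wall-clock time per named phase."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the phase raised an exception.
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        """Return {phase name: mean seconds per call}."""
        return {name: self.totals[name] / self.counts[name] for name in self.totals}


timer = PhaseTimer()
for _ in range(3):
    with timer.phase("forward"):
        time.sleep(0.01)    # stand-in for model_engine(batch)
    with timer.phase("backward"):
        time.sleep(0.02)    # stand-in for model_engine.backward(loss)
    with timer.phase("step"):
        time.sleep(0.005)   # stand-in for model_engine.step()

for name, mean_seconds in timer.report().items():
    print(f"{name}: {mean_seconds * 1000:.1f} ms")
```

A breakdown like this shows at a glance which phase dominates a training step, which is usually the first question to answer before tuning anything.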

Debugging Techniques

DeepSpeed also includes utilities for debugging, such as logging and resource monitoring. They help detect memory leaks, communication overhead, and other sources of inefficiency in gradient updates.

To get more detailed logging, set the top-level steps_per_print key (the same key used in the profiling example above) and enable wall_clock_breakdown so that each report includes timing detail −

{
    "steps_per_print": 50,
    "wall_clock_breakdown": true
}

You can also attach an external debugger such as pdb or gdb to a DeepSpeed process to trace errors as they occur.

Experimental and Cutting-Edge Features

DeepSpeed evolves quickly, and recent releases include a number of experimental features. These can bring large performance gains in specific cases, but they should be tested thoroughly before use.

3D Parallelism

DeepSpeed's 3D parallelism combines tensor, pipeline, and data parallelism so that models with billions of parameters can be scaled across many GPUs.

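Here's an example configuration for 3D parallelism. Treat the key names as illustrative, since how each dimension is configured depends on the DeepSpeed version and on the model code −

{
    "train_batch_size": 64,
    "tensor_parallel": {
        "tp_size": 8
    },
    "pipeline_parallel": {
        "p_size": 4,
        "activation_checkpointing": true
    }
}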

With this configuration, the model is split into tensor-parallel groups of size 8 and the pipeline into 4 stages, which spreads the memory footprint of a massive model across devices.
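The three dimensions multiply: the total number of GPUs equals the tensor-parallel size times the number of pipeline stages times the data-parallel size. The sketch below uses a hypothetical helper and an assumed 64-GPU cluster to work out how many data-parallel replicas remain for the sizes quoted above.

```python
def data_parallel_size(world_size, tp_size, pp_size):
    """Return the number of data-parallel replicas for a 3D-parallel layout.

    Relies on: world_size = tp_size * pp_size * dp_size.
    """
    gpus_per_replica = tp_size * pp_size
    if world_size % gpus_per_replica != 0:
        raise ValueError(
            f"world_size={world_size} is not a multiple of "
            f"tp_size * pp_size = {gpus_per_replica}"
        )
    return world_size // gpus_per_replica


# Tensor-parallel size 8 and 4 pipeline stages, on an assumed 64 GPUs.
dp_size = data_parallel_size(world_size=64, tp_size=8, pp_size=4)
print(f"GPUs per model replica: {8 * 4}")      # 32
print(f"Data-parallel replicas: {dp_size}")    # 2
print(f"Samples per replica per step: {64 // dp_size}")  # 32, for train_batch_size 64
```

Each model replica here occupies 32 GPUs, so a cluster smaller than that cannot host even one copy of the model with these settings.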

Activation Checkpointing

This method reduces memory use during training by storing activations only at selected checkpoints in the forward pass and recomputing the rest during the backward pass, trading extra computation for memory.

import deepspeed
from deepspeed.runtime.activation_checkpointing import checkpointing

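# Configure activation checkpointing from the DeepSpeed config file.
# The first argument is the model-parallel unit (None when model parallelism is not used).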
checkpointing.configure(None, deepspeed_config="ds_config.json")

Activation checkpointing matters most when training very deep models or when GPU memory is limited.
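The recompute-in-backward idea can be tried without a DeepSpeed setup. The sketch below uses PyTorch's `torch.utils.checkpoint` as a stand-in (an assumption made for the sake of a self-contained example; DeepSpeed's own `deepspeed.checkpointing.checkpoint` is meant to be called in a similar way but is not used here). It confirms that checkpointing changes memory behavior only, not the resulting gradients.

```python
import torch
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# A small block whose intermediate activations we would like to avoid storing.
block = torch.nn.Sequential(
    torch.nn.Linear(16, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 16),
)
x = torch.randn(8, 16, requires_grad=True)

# Regular forward/backward: intermediate activations are kept until backward.
loss_plain = block(x).sum()
loss_plain.backward()
grads_plain = [p.grad.clone() for p in block.parameters()]

# Reset parameter gradients before the second run.
for p in block.parameters():
    p.grad = None

# Checkpointed forward: activations inside `block` are dropped after the
# forward pass and recomputed when backward reaches this block.
loss_ckpt = checkpoint(block, x, use_reentrant=False).sum()
loss_ckpt.backward()
grads_ckpt = [p.grad.clone() for p in block.parameters()]

grads_match = all(torch.allclose(a, b) for a, b in zip(grads_plain, grads_ckpt))
print(f"gradients match: {grads_match}")
```

In a real model the checkpointed unit is usually a whole transformer layer or a group of layers, which is where the memory savings become significant.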
