DeepSpeed - Troubleshooting and Common Issues



DeepSpeed is a powerful tool for scaling and optimizing deep learning models, but, like any powerful technology, it can occasionally cause problems. Knowing how to diagnose common errors, resolve performance bottlenecks, and draw on community support makes the experience much smoother.

Its many tuning parameters, mixed precision, and ZeRO stages can deliver enormous performance gains. The guidelines in this chapter will make troubleshooting and optimizing DeepSpeed easier and help you exploit its full capabilities for AI development.

Diagnosing and Fixing Common DeepSpeed Errors

When working with DeepSpeed, errors may arise from hardware configuration, installation problems, or incorrect use of DeepSpeed features. Some common ones include −

1. CUDA Out of Memory

This usually occurs when training models that are larger than the available GPU memory.

Solution

  • Reduce the batch size if you are short on memory, or use mixed-precision training to lower the memory requirement.
  • Enable one of DeepSpeed's ZeRO optimization stages to partition optimizer states, gradients, or parameters across GPUs.

Here is an example of using ZeRO to handle an out-of-memory (OOM) error −

import deepspeed
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Load model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# DeepSpeed config with ZeRO optimization
ds_config = {
    "train_batch_size": 16,
    "fp16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 2  # For memory efficiency, use ZeRO Stage 2.
    }
}

# Define training arguments (DeepSpeed is enabled via the deepspeed argument)
training_args = TrainingArguments(
    output_dir='./results',          # Output directory
    evaluation_strategy="steps",     # Evaluation strategy to adopt during training
    eval_steps=500,                  # Number of steps between evaluations
    save_steps=10_000,               # Save model every 10,000 steps
    per_device_train_batch_size=16,  # Batch size per device during training
    per_device_eval_batch_size=16,   # Batch size for evaluation
    logging_dir='./logs',            # Directory for storing logs
    logging_steps=100,               # Log every 100 steps
    deepspeed=ds_config              # Pass the DeepSpeed config to the Trainer
)

# Initialize Trainer with DeepSpeed
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,     # Ensure train_dataset is defined
    eval_dataset=eval_dataset        # Ensure eval_dataset is defined
)

# Start training
trainer.train()

# Output statement
print("Training was successful, with memory usage reduced by the use of ZeRO Stage 2.")

Output

Training was successful, with memory usage reduced by the use of ZeRO Stage 2.

2. DeepSpeed Version Unsupported

Incompatible versions of PyTorch and DeepSpeed can cause installation or runtime problems.

Solution

Verify that the versions of PyTorch and DeepSpeed you have installed are compatible with each other. You can check compatibility in the DeepSpeed documentation.

Alternatively, install known-compatible versions with pip −

pip install deepspeed==0.5.5 torch==1.9.0

Check whether DeepSpeed and PyTorch are installed successfully −

import torch
import deepspeed

print("PyTorch version:", torch.__version__)
print("DeepSpeed version:", deepspeed.__version__)

Output

PyTorch version: 1.9.0
DeepSpeed version: 0.5.5

3. DeepSpeed Optimizer Initialization Error

This error occurs primarily when the optimizer fails to initialize while using DeepSpeed.

Solution

Make sure that the optimizer is properly initialized and referenced in your DeepSpeed configuration. If you use a custom optimizer, make sure it meets DeepSpeed's compatibility requirements.

import torch
import deepspeed

# Initialize optimizer (model and ds_config are defined as in the earlier example)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Initialize DeepSpeed with the model and optimizer
model, optimizer, _, _ = deepspeed.initialize(config=ds_config, model=model, optimizer=optimizer)

# Output statement
print("Optimizer initialized with DeepSpeed correctly without errors.")

Output

Optimizer initialized with DeepSpeed correctly without errors.

Performance Bottlenecks and How to Resolve Them

Although DeepSpeed can greatly accelerate large-model training, performance bottlenecks can still occur for several reasons. Identifying and fixing them is essential to get the full benefit of DeepSpeed.

1. Data Loading Bottleneck

If your data pipeline cannot keep up with your model during training, your GPUs are likely to sit idle waiting for data.

Solution

Use torch.utils.data.DataLoader with multiple worker processes (via num_workers) so that data is loaded asynchronously alongside GPU computation.

Example

from torch.utils.data import DataLoader
# Optimized DataLoader with multiple workers
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True, num_workers=4)

# Output statement
print("Data loading performance improved using multi-threading.")

Output

Data loading performance improved using multiple DataLoader workers.

2. Communication Bottleneck in Distributed Training

When training on multiple GPUs, communication overhead between them can slow down distributed training.

Solution

Use DeepSpeed's ZeRO Stage 2 or 3, which partitions optimizer states and gradients (and, with Stage 3, parameters) across GPUs to reduce redundant memory use and communication overhead.

Example

ds_config = {
    "zero_optimization": {
        "stage": 3,  # Use ZeRO Stage 3 for the least communication overhead
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"}
    }
}

# Output statement
print("Using ZeRO Stage 3 reduces communication overhead.")

Output

Using ZeRO Stage 3 reduces communication overhead.

3. Inefficient Mixed Precision Training

Mixed precision can provide large speed-ups, but it is not always configured optimally.

Solution

Enable DeepSpeed's built-in FP16 (mixed-precision) support, which handles loss scaling automatically and removes the overhead of manual tuning.

Example

ds_config = {
    "fp16": {
        "enabled": True  # Enable mixed precision
    }
}

# Output statement
print("Fast training with auto-tuned mixed precision.")

Output

Fast training with auto-tuned mixed precision.

Community Support and Resources

DeepSpeed has an incredible, lively community, and there are some really good resources available to help out with issues or performance enhancements.

1. GitHub Issues and Discussions

The Issues page of the DeepSpeed GitHub repository is an excellent place to start debugging. You can browse existing issues or open a new one if you have found a bug or error.

The GitHub Discussions board is also a great place to ask questions or request help from the DeepSpeed community.

2. DeepSpeed Documentation

The official DeepSpeed documentation has detailed guides and FAQs. Hence, it forms the first point of reference if you ever need to install, learn how to use, or seek optimization tips.

3. Community Forums and Stack Overflow

Stack Overflow now has a substantial number of DeepSpeed questions, with answers from AI/ML practitioners covering a wide range of troubleshooting topics.

Best practices for using DeepSpeed with Hugging Face Transformers and other popular frameworks can also be found in community forums such as the Hugging Face and PyTorch forums.

Improving DeepSpeed Performance

Apply the following tips to further improve DeepSpeed performance −

Use the stages of optimization in ZeRO

DeepSpeed's ZeRO offers three optimization stages that trade off memory usage and performance in different ways. Try different stages based on the size of your model and the number of GPUs, as in the sketch below.
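A minimal sketch of how the stages differ in the DeepSpeed config − the stage numbers are real DeepSpeed options, but the choice of stage and the optional CPU offload shown here are illustrative and should be tuned to your model and hardware −

# Illustrative ZeRO configurations (choose one per training run)
zero_stage_1 = {
    "zero_optimization": {"stage": 1}       # Partition optimizer states only
}

zero_stage_2 = {
    "zero_optimization": {"stage": 2}       # Partition optimizer states and gradients
}

zero_stage_3 = {
    "zero_optimization": {
        "stage": 3,                         # Also partition the model parameters
        "offload_param": {"device": "cpu"}  # Optional: offload parameters to CPU memory
    }
}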

Enable Mixed Precision Training

One of the simplest ways to increase training speed without compromising accuracy is to use mixed precision.

Example

ds_config = {
    "fp16": {
        "enabled": True  # Enable AMP for accelerated training
    }
}

Optimize Data Loading

The data loading pipeline should use multiple worker processes and larger batches so that it can keep up with the model. Also optimize data preprocessing so that it doesn't slow down training.
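As a sketch, the DataLoader from the earlier example can be tuned further; the worker count, batch size, and prefetch settings below are illustrative placeholders to adapt to your hardware −

from torch.utils.data import DataLoader

# Tuned DataLoader: more worker processes, pinned memory for faster
# host-to-GPU copies, and a larger batch size to keep the GPUs busy
train_loader = DataLoader(
    train_dataset,        # assumes train_dataset is already defined
    batch_size=64,        # larger batches, if GPU memory allows
    shuffle=True,
    num_workers=8,        # roughly match the number of available CPU cores
    pin_memory=True,      # speeds up CPU-to-GPU transfers
    prefetch_factor=2     # batches preloaded per worker
)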

Profile and Benchmark Often

Profiling your training pipeline helps you find bottlenecks early. Monitor performance metrics with profiling tools such as PyTorch's torch.profiler.

Example

import torch.profiler as profiler

# Profile training
with profiler.profile() as prof:
    trainer.train()

# Export profiling results
prof.export_chrome_trace("./trace.json")
print("Trace profiling saved to trace.json")

Output

Trace profiling saved to trace.json