DeepSpeed - Quick Guide



Getting Started with DeepSpeed

Deep learning models are becoming increasingly complex, and the computational cost of training them keeps rising. DeepSpeed, developed by Microsoft, is designed to train large-scale models efficiently on limited resources. This chapter walks you through the basic steps to get up and running with DeepSpeed, from installation and environment setup to running your first model.

Installing DeepSpeed

Before digging into the details of DeepSpeed, we need to install the library. With pip, this is simple −

pip install deepspeed

While installing, you may see output similar to the following −

Collecting deepspeed
Downloading deepspeed-0.6.0-py3-none-any.whl (696 kB)
Collecting torch
Downloading torch-1.9.1-cp38-cp38-manylinux1_x86_64.whl (804.1 MB)
deepspeed-0.6.0 torch-1.9.1 installed successfully

You can also clone the GitHub repository and install from the source, if you so desire −

git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
pip install .

This gives you the latest features, some of which may not yet be available in the stable release.

Environment Setup

After installing DeepSpeed, set up the environment and make sure that all the required dependencies are present.

Create a virtual environment for managing the dependencies −

python -m venv deepspeed-env
source deepspeed-env/bin/activate  # On Windows, use 'deepspeed-env\Scripts\activate'

Install PyTorch if you haven't already −

pip install torch torchvision torchaudio

Depending on your use case, you may also need CUDA or another form of GPU acceleration. On a machine with NVIDIA GPUs, you can install a CUDA build of PyTorch by running the following in your terminal (the cu113 index below targets CUDA 11.3; choose the one that matches your CUDA version) −

pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

This ensures that DeepSpeed can use your machine's GPU hardware.

Basic Concepts and Terminology

Before running your first model, let's cover some basic concepts and terminology that you will encounter quite frequently in DeepSpeed.

  • Optimizer − DeepSpeed supports multiple optimizers for training large models. The optimizer applies the gradient updates during training.
  • Scheduler − Schedulers adjust the learning rate during training. DeepSpeed works with PyTorch schedulers and also provides its own schedulers designed for large models.
  • Zero Redundancy Optimizer (ZeRO) − A memory optimization technique that reduces the memory footprint of large models by partitioning the model states across multiple GPUs.
  • Accumulate Gradients − Summing gradients over several iterations before updating the model weights makes it possible to use larger effective batch sizes than GPU memory would otherwise allow.
  • Checkpoint Activations − Saves memory at the cost of extra computation by recomputing forward-pass activations during back-propagation instead of storing all of them.

Understanding these concepts gives you enough context to work with most of the advanced features in DeepSpeed and to customize your training pipeline.
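To make gradient accumulation more concrete, here is a minimal sketch in plain Python. It is not DeepSpeed code: the one-parameter model, the data, and the helper name mse_grad are made up for illustration. It shows that summing scaled gradients over several small micro-batches gives the same gradient as one large batch.

```python
# Sketch: gradient accumulation for a 1-D linear model y = w * x with MSE loss.
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

# One large batch of 8 samples.
full_grad = mse_grad(w, xs, ys)

# Four micro-batches of 2 samples each. Every micro-batch gradient is
# scaled by 1/accum_steps and added to the accumulator; the weight update
# would happen only once, after the last micro-batch.
accum_steps = 4
micro = len(xs) // accum_steps
accum_grad = 0.0
for i in range(accum_steps):
    start, end = i * micro, (i + 1) * micro
    accum_grad += mse_grad(w, xs[start:end], ys[start:end]) / accum_steps

print(full_grad, accum_grad)
```

Only one micro-batch has to be held in memory at a time, which is why the technique helps when GPU memory is the limit.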

Running Your First Model With DeepSpeed

Now that your environment is set up and you are familiar with basic terminology, let's run a simple DeepSpeed model. We will first create a basic PyTorch model and then add DeepSpeed to it to see performance gains.

Step 1: Create a Simple PyTorch Model

import torch
import torch.nn as nn
import torch.optim as optim

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(10, 50) # input layer (10) -> hidden layer (50)
        self.fc2 = nn.Linear(50, 1) # hidden layer (50) -> output layer (1)

    def forward(self, x):
        x = torch.relu(self.fc1(x)) # hidden layer activation function
        x = self.fc2(x)
        return x

model = SimpleModel()

Step 2: Implement DeepSpeed

Now, let's refactor the code so it works with DeepSpeed. We will initialize the model with DeepSpeed and some basic configuration.

import deepspeed

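ds_config = {
   "train_batch_size": 32,
   # fp16 requires a CUDA-capable GPU
   "fp16": {
      "enabled": True
   },
   # ZeRO wraps an optimizer, so one is declared here
   "optimizer": {
      "type": "Adam",
      "params": {
         "lr": 0.001
      }
   },
   "zero_optimization": {
      "stage": 1
   }
}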

model_engine, optimizer, _, _ = deepspeed.initialize(
   model=model,
   model_parameters=model.parameters(),
   config=ds_config
)

Output

If all goes well, DeepSpeed will initialize and print out the configuration settings −

[INFO] DeepSpeed info: version=0.6.0, git-hash=unknown, git-branch=unknown
[INFO] Initializing model parallel group with size 1
[INFO] Initializing optimizer with DeepSpeed Zero Optimizer

Step 3: Train Model

At this point, you should now be able to train your model using DeepSpeed. Below is an example training loop.

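loss_fn = nn.MSELoss()

for epoch in range(5):
    # Random stand-in data on the engine's device; fp16 is enabled, so use half precision
    inputs = torch.randn(32, 10).to(model_engine.device).half()
    labels = torch.randn(32, 1).to(model_engine.device).half()

    model_engine.train()
    outputs = model_engine(inputs)
    loss = loss_fn(outputs, labels)

    # DeepSpeed handles loss scaling, gradient reduction, and the optimizer step
    model_engine.backward(loss)
    model_engine.step()
    print(f'Epoch {epoch+1}, Loss: {loss.item()}')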

Output

Each epoch prints a line like the following (exact values will differ, since the data is random) −

Epoch 1, Loss: 0.4857
Epoch 2, Loss: 0.3598
Epoch 3, Loss: 0.2893
Epoch 4, Loss: 0.2194
Epoch 5, Loss: 0.1745

Step 4: Save the Model

Finally, you can save the trained model. The checkpoint is identified by a tag, here named after the epoch −

model_engine.save_checkpoint('./checkpoint', tag='epoch_5')

Output

[INFO] Saving model checkpoint to ./checkpoint

Advanced Capabilities of DeepSpeed

Now that we have a basic view of what DeepSpeed is, let's look at some of its advanced capabilities. These features are designed to handle the complexity of training large models by reducing memory consumption and improving computational efficiency.

  • Mixed Precision Training (FP16) − One reason for fast training in DeepSpeed is its support for mixed precision training, which uses half precision for much of the computation.
  • ZeRO Optimization Stages − ZeRO reduces memory use by partitioning model states across multiple GPUs; higher stages partition more of those states.
  • Gradient Accumulation − DeepSpeed supports gradient accumulation, which simulates larger batch sizes without requiring more GPU memory.
  • Offloading − For very large models, even the optimizations provided by ZeRO Stage 3 may be insufficient. In that case, DeepSpeed can offload optimizer states and parameters from GPU memory to CPU memory or NVMe storage.
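As a rough illustration of offloading, the sketch below builds a configuration dictionary that combines ZeRO Stage 3 with offloading to CPU memory. The batch size and the choice of CPU as the target are assumptions made for this example; treat it as a starting point to adapt, not a tuned configuration.

```python
import json

# Hypothetical example config: ZeRO Stage 3 with optimizer states and
# parameters offloaded to CPU memory. Values are illustrative only.
offload_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
}

# Print it in the JSON form a DeepSpeed config file would use.
print(json.dumps(offload_config, indent=2))
```

Offloading trades GPU memory for extra data movement between the CPU and the GPU, so it is usually worth enabling only when the model does not fit otherwise.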

Summing Up

Getting started with DeepSpeed involves installing the library, setting up your environment, learning a few basic concepts, and running your first model. DeepSpeed makes it possible to train large models more efficiently, with lower memory use and shorter overall training time. With these basics in place, you can move on to the advanced features of DeepSpeed in your deep learning projects.

Model Training with DeepSpeed

Deep learning models have grown large and complicated, which makes training them effectively more difficult. This is where DeepSpeed, the deep learning optimization library from Microsoft, comes in. The library was designed for training big models and offers a collection of features aimed at memory optimization, computational efficiency, and overall training performance. By the end of this chapter, you will have trained a model with DeepSpeed, looked at the configuration files that set up its optimization features, and seen an example of training a popular model with this tool.

Deep Learning Model Training with DeepSpeed

Training deep learning models is compute-intensive, especially with large datasets and complex architectures. DeepSpeed addresses this challenge by combining mixed precision training, ZeRO (Zero Redundancy Optimizer), and gradient accumulation in one framework, so that model training can scale up efficiently without a proportional increase in compute resources.

Now we will start by implementing DeepSpeed into a simple model training pipeline.

Step 1: Model and Dataset

Assume a simple PyTorch model that solves a regression problem:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# A simple regression model
class RegressionModel(nn.Module):
    def __init__(self):
        super(RegressionModel, self).__init__()
        self.fc1 = nn.Linear(10, 50)
        self.fc2 = nn.Linear(50, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Generating synthetic data
inputs = torch.randn(1000, 10)
targets = torch.randn(1000, 1)
dataset = TensorDataset(inputs, targets)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

model = RegressionModel()

Step 2: Add DeepSpeed

The next step is to add a DeepSpeed configuration file to enable training optimizations.

DeepSpeed Configuration Files

DeepSpeed configuration files are JSON files that specify parameters for optimizing model training. An example is as follows:

{
    "train_batch_size": 32,
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 1,
        "allgather_partitions": true,
        "reduce_scatter": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true
    },
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    }
}

Save the preceding text to a file in your project folder called ds_config.json.
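If you prefer to generate the file from Python instead of writing it by hand, a sketch like the following could work. It writes the same settings with the standard json module and reads them back as a sanity check. The temporary directory is only used to keep the example self-contained; in a real project you would write ds_config.json next to your training script.

```python
import json
import os
import tempfile

# The same settings as the JSON example, expressed as a Python dict.
config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 1,
        "allgather_partitions": True,
        "reduce_scatter": True,
        "allgather_bucket_size": 2e8,
        "overlap_comm": True,
    },
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 0.001, "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": 3e-7},
    },
}

# Write the file (to a temporary directory for this demonstration).
config_path = os.path.join(tempfile.mkdtemp(), "ds_config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=4)

# Read it back to confirm the file is valid JSON with the expected content.
with open(config_path) as f:
    loaded = json.load(f)

print(loaded["zero_optimization"]["stage"])
```

Note that Python's True becomes true in the written file, which is the spelling JSON requires.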

Step 3: DeepSpeed Initialization

This is where things get interesting. With a configuration file setup, you're ready to initialize DeepSpeed in your training script as follows:

import deepspeed

# Initialize DeepSpeed
ds_config_path = "ds_config.json"
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config_path
)

Output

Running the above code will initialize DeepSpeed with the config specified above −

[INFO] DeepSpeed info: version=0.6.0, git-hash=unknown, git-branch=unknown
[INFO] Initializing model parallel group with size 1
[INFO] Initialize optimizer with DeepSpeed Zero Optimizer

Optimizing Training with DeepSpeed's Features

DeepSpeed comes with a set of features that can optimize model training. We will discuss some of the key features here.

  • Mixed Precision Training − Trains the model using 16-bit floating-point representation where possible, which requires less memory and speeds up computation.
  • ZeRO Optimization − The Zero Redundancy Optimizer (ZeRO) can substantially reduce the memory footprint of large models by partitioning model states across the available GPUs. You control how much is partitioned with the stage parameter in the zero_optimization section.
  • Gradient Accumulation − Increases the effective batch size without a proportional increase in GPU memory. You can enable gradient accumulation by setting gradient_accumulation_steps in the config file.
  • Activation Checkpointing − Trades computation for memory: only some activations are kept during the forward pass, and the rest are recomputed in the backward pass. This reduces overall memory consumption during training.

These features can be combined in various ways depending on what is optimal for your particular requirements.
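When combining gradient accumulation with several GPUs, it helps to keep the batch-size arithmetic straight. DeepSpeed expects train_batch_size to equal the per-GPU micro-batch size (train_micro_batch_size_per_gpu) times gradient_accumulation_steps times the number of GPUs. The helper functions below are hypothetical and only illustrate that relation.

```python
# Sketch: relation between the batch-size settings in a DeepSpeed config.
def effective_batch_size(micro_batch_per_gpu, grad_accum_steps, num_gpus):
    """Number of samples that contribute to one optimizer step."""
    return micro_batch_per_gpu * grad_accum_steps * num_gpus

def micro_batch_for(train_batch_size, grad_accum_steps, num_gpus):
    """Per-GPU micro-batch size implied by a global batch size."""
    per_step = grad_accum_steps * num_gpus
    if train_batch_size % per_step != 0:
        raise ValueError("train_batch_size must be divisible by grad_accum_steps * num_gpus")
    return train_batch_size // per_step

print(effective_batch_size(4, 2, 4))   # 32
print(micro_batch_for(32, 2, 4))       # 4
```

For example, a global batch of 32 on 4 GPUs with 2 accumulation steps means each GPU only ever holds 4 samples at a time.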

Example of Training BERT Model Using DeepSpeed

To demonstrate the power of DeepSpeed, consider training a well-known model such as BERT (Bidirectional Encoder Representations from Transformers).

Step 1: Prepare and Load the BERT Model

You can load a pre-trained BERT model using the Hugging Face Transformers library easily −

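import torch
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Sample data: one tokenized sentence (already shaped as a batch of 1) and its label
inputs = tokenizer("DeepSpeed makes BERT training efficient!", return_tensors="pt")
labels = torch.tensor([1])

# Dataloader: the sample is already a batch, so the collate function passes it through unchanged
dataloader = DataLoader([(inputs, labels)], batch_size=1, collate_fn=lambda batch: batch[0])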

Step 2: Add DeepSpeed Integration

As before, we add DeepSpeed integration by initializing with your model and config file −

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json"
)

Step 3: Run Model

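Train the model as follows −

for epoch in range(3):
    for batch in dataloader:
        inputs, labels = batch
        # Move the tensors to the device DeepSpeed placed the model on
        inputs = {k: v.to(model_engine.device) for k, v in inputs.items()}
        labels = labels.to(model_engine.device)

        outputs = model_engine(**inputs)
        loss = nn.CrossEntropyLoss()(outputs.logits, labels)

        model_engine.backward(loss)
        model_engine.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")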

Output

Training BERT with DeepSpeed prints the loss for every epoch, which lets you confirm that training is progressing −

Epoch 1, Loss: 0.6785
Epoch 2, Loss: 0.5432
Epoch 3, Loss: 0.4218

Handling Large Datasets with DeepSpeed

Large datasets pose problems that go well beyond model architecture. How you manage memory and computational resources efficiently while processing big volumes of data will save you from bottlenecks. DeepSpeed tackles these very challenges through its advanced features in the domain of data handling.

1. Dynamic Data Loading

DeepSpeed loads data dynamically, bringing into memory only the batches in use at a given time during training. This cuts the memory footprint, allowing you to train on larger datasets without needing more powerful hardware. Keeping memory usage low also limits the time spent on data input/output, which improves overall training speed.

2. Data Parallelism

Another important capability of DeepSpeed is data parallelism. It natively supports distributing data across many GPUs, so different batches can be processed at once. This parallelism speeds up training and keeps GPU resources busy. In practice, adding data parallelism to your training pipeline is straightforward because DeepSpeed works with PyTorch's DataLoader.

3. Memory-Efficient Data Shuffling

Large datasets normally require shuffling so that the model does not overfit to, or learn patterns from, the order of the data. However, shuffling can be very memory-consuming for large datasets. DeepSpeed optimizes this process with memory-efficient algorithms that provide effective shuffling without a large increase in memory use. This helps training on large datasets stay smooth and efficient.

4. Data Augmentation Support

Data augmentation in general includes certain methods that increase the size of a dataset artificially by modifying existing data. DeepSpeed supports on-the-fly data augmentation, which means one doesn't have to store augmented data in memory but can perform data augmentation on the fly during training. This can reduce the memory pressure even further and also provide much more extensive utilization of data augmentation techniques.

5. Batch Size Scaling

DeepSpeed's gradient accumulation and ZeRO optimization allow batch sizes to scale up even when working with enormous datasets. Larger batch sizes can sometimes improve model convergence and training stability. DeepSpeed lets you scale the batch size while keeping GPU memory requirements manageable, so your model can train on big datasets effectively.

Together, these features help you manage large datasets, making it possible to design and train high-performance models with fewer hardware restrictions. Whether you are training on a very large text corpus or processing very high-resolution images, DeepSpeed's data handling keeps your training pipeline optimized and scalable.

Summing Up

DeepSpeed provides an effective training framework for deep learning models, especially as they scale in size and complexity. Features such as mixed precision training, ZeRO optimization, and activation checkpointing each help optimize the training process. This chapter covered model training with DeepSpeed: preparing the environment, configuring DeepSpeed, and running the training process. With these tools and techniques in hand, you can handle large-scale deep learning projects with better performance and lower resource consumption.

DeepSpeed - Optimizer

Optimization and scheduling are the basis for good performance when training large-scale deep learning models. DeepSpeed is an open-source deep learning optimization library that makes model training more efficient through techniques such as memory optimization, gradient accumulation, and mixed-precision training.

Two key components of DeepSpeed are the DeepSpeed Optimizer and the DeepSpeed Scheduler. They work together to manage system resources efficiently, accelerate training, and reduce the memory footprint, so that models with billions of parameters can be trained on modest hardware setups.

Let's look in detail at how the DeepSpeed Optimizer works, with code examples of how it is used. We will look at the DeepSpeed Scheduler in the following chapter.

What is DeepSpeed Optimizer?

The DeepSpeed Optimizer manages model optimization by distributing memory efficiently. It integrates natively with popular deep learning frameworks such as PyTorch, and it handles optimizer states such as momentum as well as gradient accumulation. Its main features include the Zero Redundancy Optimizer (ZeRO), mixed precision training, and gradient checkpointing.

Key Features of DeepSpeed Optimizer

The following are key features of DeepSpeed Optimizer −

1. Zero Redundancy Optimizer (ZeRO)

This reduces memory consumption for the states of optimizers, gradients, and model parameters by partitioning them across multiple devices.

This enables training giant models on capacity-limited devices.

2. Mixed Precision Training

By using both 16-bit and 32-bit floating-point representations, mixed precision training reduces memory consumption while largely preserving model accuracy.
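One practical detail of 16-bit training is that very small gradient values can underflow to zero. Mixed precision training commonly addresses this with loss scaling. The sketch below illustrates the idea with Python's struct module, which can round a number to half precision; the gradient value and the scale of 1024 are assumptions chosen for the demonstration, and this is not DeepSpeed's own implementation.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

tiny_grad = 1e-8       # too small to be represented in half precision
loss_scale = 1024.0    # hypothetical static loss scale

# Without scaling, the gradient is lost when stored in 16 bits.
unscaled = to_fp16(tiny_grad)

# With scaling, the value survives in 16 bits and is divided by the
# scale afterwards, in higher precision, before the weight update.
scaled = to_fp16(tiny_grad * loss_scale)
recovered = scaled / loss_scale

print(unscaled, recovered)
```

The recovered value is close to the original gradient rather than zero, which is the point of keeping a 32-bit copy for the update step.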

3. Gradient Checkpointing

It splits the model into segments and stores only a subset of activations during the forward pass; the remaining intermediate values are recomputed during the backward pass to save memory.
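The idea can be sketched without any deep learning framework. In the toy example below, a "layer" is just a Python function, and the helper names are made up for illustration. A plain forward pass keeps every activation, while the checkpointed version keeps only the input of each segment and rebuilds the rest on demand.

```python
def forward_all(layers, x):
    """Plain forward pass: keep every intermediate activation."""
    acts = [x]
    for layer in layers:
        acts.append(layer(acts[-1]))
    return acts

def forward_checkpointed(layers, x, segment):
    """Forward pass that keeps only the input of each segment."""
    checkpoints = {0: x}
    out = x
    for i, layer in enumerate(layers):
        out = layer(out)
        if (i + 1) % segment == 0 and (i + 1) < len(layers):
            checkpoints[i + 1] = out
    return out, checkpoints

def recompute_segment(layers, checkpoints, start, segment):
    """Rebuild one segment's activations from its checkpoint,
    as the backward pass would do before computing gradients."""
    return forward_all(layers[start:start + segment], checkpoints[start])

# Eight toy layers, each a simple affine function.
layers = [lambda v, k=k: v * 2 + k for k in range(8)]

full = forward_all(layers, 1.0)                        # stores 9 values
out, ckpts = forward_checkpointed(layers, 1.0, segment=4)  # stores 2 values
rebuilt = recompute_segment(layers, ckpts, start=4, segment=4)

print(len(full), len(ckpts))
```

The output is identical in both cases; the checkpointed pass simply pays for the smaller memory footprint with a second forward computation per segment.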

Example of Using DeepSpeed Optimizer

Following is a PyTorch-based example using DeepSpeed Optimizer with ZeRO −

import deepspeed
import torch
import torch.nn as nn
import torch.optim as optim

# Sample model definition
class SampleModel(nn.Module):
    def __init__(self):
        super(SampleModel, self).__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

# Initialize model and optimizer
model = SampleModel()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# DeepSpeed configuration
ds_config = {
    "train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001,
        }
    },
    "zero_optimization": {
        "stage": 1
    }
}

# Initialize DeepSpeed
model_engine, optimizer, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config)

# Sample input (on the same device as the model) and forward pass
inputs = torch.randn(8, 10).to(model_engine.device)
outputs = model_engine(inputs)
loss = outputs.mean()

# Backward pass and optimization
model_engine.backward(loss)
model_engine.step()

Output

When executed from an IDE such as PyCharm or VSCode, the terminal output will look something like this −

Deepspeed is initiated
Input tensor: torch.Size([8, 10])
Forward pass completed
Loss: -0.015
Backward pass and optimizer step complete


Applying these examples and outputs shown in this chapter will make applying these tools to your deep learning workflow much easier.

DeepSpeed - Learning Rate Scheduler

DeepSpeed provides both an optimizer and a learning rate scheduler, which address major challenges in large-scale deep learning training.

The DeepSpeed optimizer reduces memory consumption and improves training efficiency using ZeRO, mixed precision training, and gradient checkpointing. The DeepSpeed scheduler updates the learning rate dynamically during training so that the model converges faster or reaches better performance.

Together, these let developers train models that were previously too large to manage effectively.

What is Learning Rate Scheduler?

The DeepSpeed Scheduler is crucial in model training because it controls the learning rate. By adjusting the learning rate dynamically, the scheduler stabilizes training and helps the model converge quickly. It also supports several common scheduling techniques, such as linear decay, cosine decay, and step decay, for different training settings.
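To show what these decay shapes look like, here is a small framework-independent sketch. The function names and default values are illustrative assumptions, not DeepSpeed's scheduler classes; they only compute the learning rate each technique would produce at a given step.

```python
import math

def linear_decay(step, total_steps, base_lr, end_lr=0.0):
    """Move in a straight line from base_lr to end_lr."""
    frac = min(step, total_steps) / total_steps
    return base_lr + (end_lr - base_lr) * frac

def cosine_decay(step, total_steps, base_lr, end_lr=0.0):
    """Follow half a cosine wave from base_lr down to end_lr."""
    frac = min(step, total_steps) / total_steps
    return end_lr + 0.5 * (base_lr - end_lr) * (1.0 + math.cos(math.pi * frac))

def step_decay(step, base_lr, step_size, gamma=0.1):
    """Multiply the rate by gamma once every step_size steps."""
    return base_lr * gamma ** (step // step_size)

for s in (0, 50, 100):
    print(s, linear_decay(s, 100, 0.01), cosine_decay(s, 100, 0.01), step_decay(s, 0.01, 40))
```

Linear and cosine decay change the rate a little at every step, while step decay keeps it constant and then drops it abruptly.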

Key Features of DeepSpeed Scheduler

The following are the key features of DeepSpeed Scheduler −

1. Dynamic Learning Rate Adjustment

This involves adjusting the learning rate during training to improve convergence and prevent overfitting by following a predefined schedule.

2. Warm-up Schedulers

The library provides warm-up strategies that gradually raise the learning rate from a very low value at the start of training.
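A linear warm-up is the simplest version of this idea. The sketch below assumes a straight-line increase between a minimum and a maximum rate; the function name is hypothetical, and DeepSpeed's own warm-up scheduler may follow a different curve, so read it as an illustration of the concept only.

```python
def linear_warmup_lr(step, min_lr, max_lr, warmup_steps):
    """Learning rate at a given step under a linear warm-up (illustrative)."""
    if step >= warmup_steps:
        # Warm-up is over: stay at the maximum rate.
        return max_lr
    return min_lr + (max_lr - min_lr) * step / warmup_steps

# Same ranges as a config with warmup_min_lr=0.001, warmup_max_lr=0.01,
# and warmup_num_steps=100.
for s in (0, 50, 100, 200):
    print(s, linear_warmup_lr(s, 0.001, 0.01, 100))
```

Starting low gives the optimizer time to settle before large updates are applied, which tends to matter most early in training.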

3. Multi-Phase Schedulers

It is possible to configure multiple phases in your schedule, each defining different learning rate behavior.

Example of Using DeepSpeed Scheduler

The following example shows how to use the DeepSpeed Scheduler −

import torch
import deepspeed
import torch.nn as nn
import torch.optim as optim

# Model definition
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

# Initialize model and optimizer
model = SimpleModel()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# DeepSpeed configuration for optimizer and scheduler
ds_config = {
    "train_batch_size": 8,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.01,
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0.001,
            "warmup_max_lr": 0.01,
            "warmup_num_steps": 100
        }
    }
}

# Initialize DeepSpeed with model and optimizer
model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(model=model, optimizer=optimizer, config_params=ds_config)

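# Sample input (on the same device as the model) and forward pass
inputs = torch.randn(8, 10).to(model_engine.device)
outputs = model_engine(inputs)
loss = outputs.mean()

# Backward pass and step
model_engine.backward(loss)
# step() updates the weights and also advances the scheduler configured above,
# so no separate lr_scheduler.step() call is needed
model_engine.step()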

Output

The following is the result of above Python code −

Learning rate after warm-up: 0.0023
Loss: 0.0214
Training step completed


These examples and outputs shown in this chapter will make applying these tools to your deep learning workflow much easier.

DeepSpeed Optimizer and Scheduler Work Together

The DeepSpeed Optimizer and Scheduler go hand in glove to deliver the best from each other. While the optimizer is designed to fit efficiently in memory and perform high-level gradient-based updates, the scheduler will dynamically adjust the learning rate for better convergence and overall performance during training. DeepSpeed integrates these pieces, making it possible to train large models more quickly with resource-efficient utilization and stability.

Distributed Training with DeepSpeed

As model and dataset sizes grow, single-GPU training becomes inefficient or even impossible. Distributed training lets models scale from a single GPU to multiple GPUs and nodes. DeepSpeed, by Microsoft, is one of the best-equipped frameworks for optimizing this kind of training. It handles large models and reduces memory overhead through techniques such as data parallelism, model parallelism, and the Zero Redundancy Optimizer (ZeRO).

Basic Distributed Training

In distributed training, the work of training a machine learning model is split across many computing resources, such as GPUs or cluster nodes. Scaling up data and computation is a major challenge, and distributing the work is what makes training big models practical and efficient.

Why Distributed Training?

The following are key reasons to consider for distributed training while working on large deep learning models −

  • Scalability − It is prohibitively hard to train very large models with tens of millions, or even billions, of parameters on a single GPU. Distributed training scales this process across many GPUs.
  • Faster Convergence − Spreading the training process over several GPUs shortens the time needed to converge, which speeds up model development.
  • Resource Efficiency − Distributed training puts your available hardware to full use, saving both time and money.
  • Data Parallelism − The model is replicated on each GPU, and each GPU processes different batches of the dataset.
  • Model Parallelism − The model itself is split across multiple GPUs, and each GPU computes part of the model's operations.
  • Hybrid Parallelism − A mix of data and model parallelism: the data is split across GPUs, and the model is split as well.

Data Parallelism

DeepSpeed facilitates distributed training by offering flexible model and data parallelism. Let's explore these in depth.

With data parallelism, each GPU or worker receives a portion of the data to process. After processing, the gradients are averaged across workers to update the model weights. This makes it possible to train with larger batch sizes without running out of memory.
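The mechanics can be sketched in plain Python before bringing in any GPUs. In this toy example, four simulated workers each take a strided slice of the dataset, compute a local gradient for a one-parameter model, and then average their results. The helper names are made up, and the final averaging is a stand-in for the all-reduce step a real distributed backend performs.

```python
def shard_indices(num_samples, rank, world_size):
    """Strided split of sample indices, one shard per worker."""
    return list(range(rank, num_samples, world_size))

def local_grad(w, xs, ys, indices):
    """MSE gradient for y = w * x, computed only on the given samples."""
    return sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in indices) / len(indices)

xs = [float(i) for i in range(1, 9)]
ys = [2.0 * x for x in xs]
w, world_size = 0.5, 4

# Each simulated worker sees a different quarter of the data.
shards = [shard_indices(len(xs), r, world_size) for r in range(world_size)]
grads = [local_grad(w, xs, ys, s) for s in shards]

# Stand-in for an all-reduce that averages gradients across workers.
avg_grad = sum(grads) / world_size

# Reference: the gradient a single worker would get on the whole dataset.
full_grad = local_grad(w, xs, ys, range(len(xs)))

print(shards)
print(avg_grad, full_grad)
```

Because the shards have equal size, the averaged gradient matches the full-batch gradient, so every worker applies the same update and the model replicas stay in sync.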

Example of Data Parallelism With DeepSpeed

The following is a simple Python example to show data parallelism with DeepSpeed −

import torch
import deepspeed

# Define a simple neural network model
class SimpleModel(torch.nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = torch.nn.Linear(784, 128)
        self.fc2 = torch.nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

# Initialize DeepSpeed configuration
deepspeed_config = {
    "train_batch_size": 64,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001
        }
    }
}

# Initialize model
model = SimpleModel()

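# Initialize DeepSpeed for distributed data parallelism
model_engine, optimizer, _, _ = deepspeed.initialize(
    config=deepspeed_config,
    model=model,
    model_parameters=model.parameters()
)

# Dummy data, placed on the device assigned to this process
inputs = torch.randn(64, 784).to(model_engine.device)
labels = torch.randint(0, 10, (64,)).to(model_engine.device)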

# Forward pass
outputs = model_engine(inputs)
loss = torch.nn.functional.cross_entropy(outputs, labels)

# Backward pass and optimization
model_engine.backward(loss)
model_engine.step()

When this script is started with the deepspeed launcher on several GPUs, each GPU trains on its own portion of the data.

Model Parallelism

Model parallelism deals with splitting the model across multiple GPUs. This becomes helpful when a single model does not fit into the memory of a single GPU.

Model Parallelism With DeepSpeed

DeepSpeed splits the model across multiple GPUs, so that different parts of the model can execute on different GPUs concurrently.

Example of Model Parallelism With DeepSpeed

The following is a simple Python program to show the working of model parallelism using DeepSpeed −

import torch
import deepspeed
from deepspeed.pipe import PipelineModule, LayerSpec

# Define a simple pipeline model
class SimpleLayer(torch.nn.Module):
    def __init__(self, input_size, output_size):
        super(SimpleLayer, self).__init__()
        self.fc = torch.nn.Linear(input_size, output_size)

    def forward(self, x):
        return torch.relu(self.fc(x))

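# Pipeline stages need the distributed backend to be initialized first
deepspeed.init_distributed()

# Two layers, to be split across two pipeline stages (one GPU each)
layers = [
    LayerSpec(SimpleLayer, 784, 128),
    LayerSpec(SimpleLayer, 128, 10)
]
# Create a pipeline model with 2 stages; the loss is computed on the last stage
pipeline_model = PipelineModule(
    layers=layers,
    num_stages=2,
    loss_fn=torch.nn.CrossEntropyLoss()
)

# Initialize DeepSpeed for model parallelism
model_engine, optimizer, _, _ = deepspeed.initialize(
    config=deepspeed_config,
    model=pipeline_model,
    model_parameters=[p for p in pipeline_model.parameters() if p.requires_grad]
)

# Dummy inputs and labels
inputs = torch.randn(64, 784)
labels = torch.randint(0, 10, (64,))

# Pipeline engines are driven with train_batch(), which runs forward, backward, and step
loss = model_engine.train_batch(data_iter=iter([(inputs, labels)]))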

This runs each batch through the pipeline in stages: the first GPU executes the first layer, and the second GPU executes the second layer.

Zero Redundancy Optimizer (ZeRO)

Perhaps the most notable feature of DeepSpeed is the Zero Redundancy Optimizer (ZeRO), designed to solve the memory consumption problem of model training. It partitions the optimizer states, gradients, and parameters across GPUs, allowing more efficient use of memory.

ZeRO includes three stages −

  • Stage 1 − Partitioning the state of the optimizer.
  • Stage 2 − Partitioning the gradient state.
  • Stage 3 − Partitioning the parameter state.
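To get a feel for what each stage saves, the sketch below estimates the memory needed for model states per GPU. It assumes mixed-precision training with Adam, where each parameter costs roughly 2 bytes for the 16-bit weight, 2 bytes for the 16-bit gradient, and 12 bytes for 32-bit optimizer states, as in the accounting commonly used to describe ZeRO. Activations and temporary buffers are not included, and the function name is made up for this example.

```python
def zero_model_state_bytes(num_params, num_gpus, stage):
    """Rough per-GPU memory for model states under a given ZeRO stage."""
    params, grads, optim = 2.0, 2.0, 12.0   # bytes per parameter
    if stage >= 1:
        optim /= num_gpus    # Stage 1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus    # Stage 2: also partition gradients
    if stage >= 3:
        params /= num_gpus   # Stage 3: also partition parameters
    return num_params * (params + grads + optim)

# Example: a 7.5-billion-parameter model on 64 GPUs.
for stage in range(4):
    gb = zero_model_state_bytes(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Stage 0 here means no partitioning at all, which gives a baseline for comparison. Each further stage partitions one more kind of state, so the per-GPU footprint keeps shrinking as the stage number increases.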

Example of Zero Redundancy Optimizer

Following is a simple example of zero redundancy optimizer in python −

import torch
import deepspeed

# Use ZeRO optimization to define the model and DeepSpeed settings
deepspeed_config = {
    "train_batch_size": 64,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001
        }
    },
    "zero_optimization": {
        "stage": 2 # Toggle gradient partitioning using ZeRO Stage 2
    }
}

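# Initialize model (SimpleModel is the class defined in the data parallelism example)
model = SimpleModel()

# Initialize DeepSpeed with ZeRO optimization
model_engine, optimizer, _, _ = deepspeed.initialize(
    config=deepspeed_config,
    model=model,
    model_parameters=model.parameters()
)

# Forward pass on dummy data, reduced to a scalar loss
inputs = torch.randn(64, 784).to(model_engine.device)
labels = torch.randint(0, 10, (64,)).to(model_engine.device)
outputs = model_engine(inputs)
loss = torch.nn.functional.cross_entropy(outputs, labels)

# Backward pass and optimization
model_engine.backward(loss)
model_engine.step()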

This code runs with ZeRO Stage 2, which partitions gradients (in addition to optimizer states) across GPUs and reduces memory consumption during training.

Scaling Models Across Multiple GPUs and Nodes

DeepSpeed scales models across multiple GPUs and nodes by leveraging a mixture of parallelism strategies with the advanced communication layer of DeepSpeed to realize the best scaling.

Scaling Example with Multiple Nodes

DeepSpeed uses the NCCL backend for inter-GPU communication when scaling training to multiple GPUs and nodes. To run a training script on multiple GPUs and nodes, you can use the following command:

deepspeed --num_nodes 2 --num_gpus 8 train.py

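Here --num_gpus is the number of GPUs to use on each node, so this command runs training on 2 nodes with 8 GPUs each. Multi-node launches also need a hostfile that lists the participating machines.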
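For multi-node runs, the launcher reads the participating machines from a hostfile, where each line names a host and the number of GPU slots on it. The sketch below parses text in that format to show how the totals add up. The host names are placeholders and parse_hostfile is a hypothetical helper written for this example, not part of DeepSpeed.

```python
def parse_hostfile(text):
    """Parse lines of the form '<hostname> slots=<n>' into a dict."""
    hosts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, slots = line.split()
        hosts[name] = int(slots.split("=")[1])
    return hosts

# Placeholder hostfile contents: two nodes with 8 GPUs each.
hostfile_text = """
worker-1 slots=8
worker-2 slots=8
"""

hosts = parse_hostfile(hostfile_text)
total_gpus = sum(hosts.values())
print(len(hosts), "nodes,", total_gpus, "GPUs in total")
```

The total number of GPUs is the number of processes the job will run, which is also the value that enters the batch-size calculation in the DeepSpeed config.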

Example of Training on Multiple GPUs with DeepSpeed

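The following example demonstrates training on multiple GPUs using DeepSpeed. It assumes that model and deepspeed_config are defined as in the earlier examples and that train_loader is a DataLoader over your training set −

import torch
import deepspeed

# Initialize DeepSpeed with ZeRO optimization for multi-GPU training
# (this also sets up the distributed process group)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=deepspeed_config
)

# Print once, from the first process only
if torch.distributed.get_rank() == 0:
    print("Training on multiple GPUs with DeepSpeed")

# Training loop
for batch in train_loader:
    inputs, labels = batch
    inputs = inputs.to(model_engine.device)
    labels = labels.to(model_engine.device)
    outputs = model_engine(inputs)
    loss = torch.nn.functional.cross_entropy(outputs, labels)
    model_engine.backward(loss)
    model_engine.step()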

This code uses DeepSpeed to train the model memory-efficiently across several GPUs, employing methods such as ZeRO for optimization.

Summing Up

DeepSpeed is a powerful tool for scaling and optimizing distributed training of deep learning models. By combining ZeRO with data parallelism and model parallelism across multiple GPUs and nodes, it addresses the main challenges of training big models efficiently. These features help keep distributed training accessible and performant as models grow.

Memory Optimization with DeepSpeed

Memory optimization is crucial during training, given the growing complexity of deep learning models and large-scale computations. DeepSpeed provides memory-saving techniques such as offloading, gradient checkpointing, and ZeRO. With these techniques, developers can train very large models on standard hardware, including models that were previously out of reach because of hardware limits.

DeepSpeed is used in both research and industry, which makes it a valuable tool for deep learning practitioners. With these strategies, you can make your model consume less memory and push the limits of your hardware.

Why Memory Optimization?

Memory optimization is one of the most critical aspects of training deep learning models. For models with hundreds of millions to billions of parameters, such as BERT and GPT, memory must be managed effectively to train on the available hardware. DeepSpeed is an open-source library for training deep learning models that features the ZeRO optimizer, offloading techniques, and gradient checkpointing to reduce memory utilization during training.

Memory Issues in Deep Learning

The size and complexity of deep learning models have increased sharply in recent years, and such massive models need a vast amount of memory to train. DeepSpeed, developed by Microsoft as a deep learning optimization library, offers powerful solutions to these challenges.

As models grow, new memory-related issues appear. Some of the most common ones include:

  • Model Parameters − Large models, such as GPT-3, have hundreds of billions of parameters and, thus, require much memory to store.
  • Gradients − A gradient has to be computed for every parameter during training and kept in memory, which consumes roughly as much memory again as the parameters.
  • Activation Maps − The intermediate values produced during the forward pass, called activation maps, need to be stored until the backward pass, where they are used for gradient computation.
  • Batch Sizes − Larger batch sizes can speed up convergence but consume more memory.
  • Data Parallelism − Splitting the data among several GPUs is a great strategy to cut down the training time, but every GPU keeps a full copy of the model states, which uses a lot of memory unless it is kept in check.

Unless these issues are addressed, training large models becomes impossible on the available hardware, let alone on consumer-grade hardware. DeepSpeed tackles them with memory-saving techniques.

DeepSpeed's Memory Optimization Techniques

DeepSpeed offers several ways to reduce memory usage during training. These include ZeRO (Zero Redundancy Optimizer), gradient checkpointing (activation recomputation), and offloading.

1. Zero Redundancy Optimizer (ZeRO)

ZeRO optimizes memory by removing redundant copies of the optimizer states, gradients, and model parameters across GPUs. It does so in three stages:

  • Stage 1 − Sharding optimizer states across GPUs, with each GPU storing a portion of the optimizer state.
  • Stage 2 − Further reduction of memory as gradients are sharded across GPUs.
  • Stage 3 − Model parameters are sharded, and now models can be trained up to a trillion parameters.
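The effect of each stage can be estimated with simple arithmetic. The sketch below follows the usual accounting for mixed-precision Adam training (2 bytes of FP16 weights, 2 bytes of FP16 gradients, and 12 bytes of FP32 optimizer state per parameter) and ignores activations and buffers. The function name zero_memory_gb and the 7.5 billion parameter, 64 GPU setting are illustrative assumptions, not measurements.

```python
def zero_memory_gb(num_params, num_gpus, stage):
    """Approximate per-GPU model-state memory (GB) for mixed-precision Adam.

    Per parameter: 2 bytes FP16 weights, 2 bytes FP16 gradients, and
    12 bytes of FP32 optimizer state (master weights, momentum, variance).
    """
    params, grads, optim = 2 * num_params, 2 * num_params, 12 * num_params
    if stage >= 1:
        optim /= num_gpus   # Stage 1: shard optimizer states
    if stage >= 2:
        grads /= num_gpus   # Stage 2: also shard gradients
    if stage >= 3:
        params /= num_gpus  # Stage 3: also shard parameters
    return (params + grads + optim) / 1e9


for stage in range(4):
    print(f"stage {stage}: {zero_memory_gb(7.5e9, 64, stage):.1f} GB")
# stage 0: 120.0 GB
# stage 1: 31.4 GB
# stage 2: 16.6 GB
# stage 3: 1.9 GB
```

Stage 0 here stands for plain data parallelism, where every GPU holds all model states.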

Example

import torch
import deepspeed

model = MyModel()  # your dl model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# DeepSpeed configuration for ZeRO
ds_config = {
 "train_batch_size": 8,
 "zero_optimization": {
    "stage": 2,   # adjust the stage of ZeRO here
    "allgather_partitions": True,
    "reduce_scatter": True,
    "allgather_bucket_size": 5e8,
    "overlap_comm": True,
    "contiguous_gradients": True
 }
}
# Initialize DeepSpeed
model_engine, optimizer, _, _ = deepspeed.initialize(
  model=model,
  optimizer=optimizer,
  config_params=ds_config
)

# Training loop
for batch in train_dataloader:
    loss = model_engine(batch)  # assumes the model's forward returns the loss
    model_engine.backward(loss)
    model_engine.step()

You'll notice that memory usage is much lower, especially for large models. A memory profiler can then highlight where the ZeRO optimization kicks in.

2. Gradient Checkpointing

Gradient checkpointing reduces memory usage by not storing all activations during the forward pass. The discarded activations are instead recomputed during the backward pass, trading some extra computation for lower memory usage.

Example

import torch
from torch.utils.checkpoint import checkpoint

def custom_forward(*inputs):
    return model(*inputs)

# Gradient checkpointing
outputs = checkpoint(custom_forward, input_data)
loss = criterion(outputs, labels)
loss.backward()

In this case, memory saved will depend on the size of intermediate activations.
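To check that recomputation does not change the result, the following self-contained sketch compares gradients with and without checkpointing on a small stand-in network. The layer sizes and the name block are arbitrary choices, and use_reentrant=False assumes a reasonably recent PyTorch release.

```python
import torch
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = torch.nn.Sequential(
    torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 16)
)
x = torch.randn(4, 16, requires_grad=True)

# Regular forward/backward: activations inside `block` are kept in memory
block(x).sum().backward()
grad_plain = x.grad.clone()

# Checkpointed forward/backward: activations are recomputed during backward
x.grad = None
block.zero_grad()
checkpoint(block, x, use_reentrant=False).sum().backward()
grad_ckpt = x.grad.clone()

print(torch.allclose(grad_plain, grad_ckpt))  # True
```

The gradients match because checkpointing only changes when activations are computed, not what they are.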

3. Offloading Techniques

DeepSpeed also provides offloading. It lets you move parts of the training state, such as optimizer states and gradients, to CPU memory or even NVMe storage, freeing GPU memory for other uses.

CPU Offloading

DeepSpeed lets us offload optimizer states and gradients to the CPU, which frees up precious GPU memory. This is especially useful when GPU memory is limited but plenty of CPU memory is available.

Example

ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 3,  # offload_param is only available with ZeRO stage 3
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": True
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        }
    }
}
model_engine, optimizer, _, _ = deepspeed.initialize(
   model=model,
   optimizer=optimizer,
   config_params=ds_config
)

Because offloading to the CPU incurs data-transfer costs between devices, training is somewhat slower, but it makes it possible to train models that otherwise would not fit on memory-constrained GPUs.

NVMe Offloading

For the largest models, even CPU offloading is not enough. DeepSpeed can also offload optimizer states and parameters to NVMe storage, which further increases the size of the models that can be trained.

Example

ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 3,  # NVMe offloading is only available with ZeRO stage 3
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme"
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme"
        }
    }
}
model_engine, optimizer, _, _ = deepspeed.initialize(
   model=model,
   optimizer=optimizer,
   config_params=ds_config
)

Offloading using NVMe will enable the training of massive models, though the rate will largely rely on the I/O speed of the NVMe drive.
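Offload settings interact with the ZeRO stage, so a small check before launching can catch a mismatched config early. The sketch below is a hypothetical helper (check_offload is not part of DeepSpeed) that encodes two constraints as understood from the DeepSpeed configuration documentation: parameter offloading and NVMe offloading are only available with ZeRO stage 3.

```python
def check_offload(ds_config):
    """Return a list of problems found in the zero_optimization offload settings."""
    zero = ds_config.get("zero_optimization", {})
    stage = zero.get("stage", 0)
    problems = []
    if "offload_param" in zero and stage != 3:
        problems.append("offload_param requires ZeRO stage 3")
    for key in ("offload_optimizer", "offload_param"):
        section = zero.get(key, {})
        if section.get("device") == "nvme":
            if stage != 3:
                problems.append(f"{key}: NVMe offload requires ZeRO stage 3")
            if "nvme_path" not in section:
                problems.append(f"{key}: nvme_path is missing")
    return problems


config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme"},
    },
}
print(check_offload(config))
# ['offload_optimizer: NVMe offload requires ZeRO stage 3']
```

An empty list means that none of the checked constraints is violated; it does not validate the rest of the configuration.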

Memory Optimization Case Studies

Let's discuss some real-life case studies of memory optimization using DeepSpeed −

Case Study 1: Training GPT-2 using ZeRO Optimization

Using DeepSpeed, a research team brought the training of the 1.5 billion parameter GPT-2 model down to consumer-grade GPUs. With ZeRO stage 3, it was possible to train the model on 4 NVIDIA RTX 3090 GPUs with 24 GB of memory each. Without ZeRO, training would not have been possible, because the model requires more than 50 GB of memory per GPU.

Case Study 2: Offloading with NVMe for a 175B Parameter Model

Microsoft leveraged DeepSpeed's offloading ability to train a 175 billion parameter model on a cluster of GPUs with limited memory. Offloading optimizer states and parameters during training kept memory bottlenecks to a minimum, which shows how offloading can make way for super-large models even when GPU resources are limited.

DeepSpeed - Mixed Precision Training

Mixed precision training is an approach in deep learning that lets models be trained faster and more efficiently. It performs most arithmetic in 16-bit floating point and falls back to 32-bit floating point where needed, striking a balance between model accuracy and hardware efficiency. The DeepSpeed library from Microsoft builds on this to scale large models while reducing memory usage and computation time.

What is Mixed Precision Training?

Mixed precision training uses lower-precision arithmetic for most computations and reserves higher precision (FP32) for the operations where it is critical. The main goals are lower computational cost, faster training, and reduced memory usage.

Floating-Point Formats

The following are floating-point formats −

  • FP32 (Single Precision) − A 32-bit floating-point format commonly used in deep learning.
  • FP16 (Half Precision) − A 16-bit floating-point format that is computationally much faster than FP32 on supporting hardware.
  • BF16 (BFloat16) − A 16-bit format that keeps the exponent range of FP32 at the cost of fewer mantissa bits than FP16, which makes training more numerically robust.

Training with FP16 or BF16 alongside FP32 can substantially reduce training time. It is typically used for large-scale model training on GPUs and TPUs.
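The difference in range between FP16 and BF16 can be illustrated without a GPU. The sketch below uses Python's struct module to round-trip values through FP16 and emulates BF16 by truncating the FP32 bit pattern to its top 16 bits. Real BF16 conversion usually rounds rather than truncates, so the helpers to_fp16 and to_bf16 are only an approximation for illustration.

```python
import struct


def to_fp16(value):
    """Round-trip a Python float through IEEE half precision (FP16)."""
    try:
        return struct.unpack("=e", struct.pack("=e", value))[0]
    except OverflowError:
        return float("inf")  # value exceeds the FP16 range (max 65504)


def to_bf16(value):
    """Emulate BF16 by keeping only the top 16 bits of the FP32 encoding."""
    bits = struct.unpack("=I", struct.pack("=f", value))[0]
    return struct.unpack("=f", struct.pack("=I", bits & 0xFFFF0000))[0]


print(to_fp16(70000.0))     # inf: too large for FP16
print(to_bf16(70000.0))     # 69632.0: representable, but coarsely
print(to_fp16(1e-8))        # 0.0: underflows in FP16
print(to_bf16(1e-8) > 0.0)  # True: BF16 keeps the FP32 exponent range
```

This is why FP16 training typically relies on loss scaling, while BF16 trades precision for range and usually does not need it.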

DeepSpeed FP16 and BF16

DeepSpeed natively supports both FP16 and BF16 mixed precision training modes. This allows developers to scale deep learning models with little impact on accuracy. Here is how that looks.

DeepSpeed FP16 Mixed Precision Training

All you need to do is add an fp16 section to your config. Here is an example configuration file that enables FP16 mixed precision training:

{
   "train_batch_size": 64,
   "gradient_accumulation_steps": 4,
   "fp16": {
      "enabled": true,
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "hysteresis": 2,
      "min_loss_scale": 1
   }
}

Memory Efficiency − Almost halves the GPU memory footprint.

Training Speed − Acceleration by use of 16-bit precision.

BF16 Mixed Precision Training

BF16, or BFloat16, is handy when FP16 precision makes your model unstable. DeepSpeed supports BF16 on hardware with native BF16 support, such as recent AMD and NVIDIA GPUs. To train with DeepSpeed using BF16, update your configuration as follows:

{
   "train_batch_size": 64,
   "gradient_accumulation_steps": 4,
   "bf16": {
      "enabled": true
   }
}

Python Example (BF16)

The following example demonstrates the use of BF16 mixed precision training −

import deepspeed
def init_engine(model, optimizer, config):
    # Wrap the model and optimizer in a DeepSpeed engine
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        optimizer=optimizer,
        config=config
    )
    return engine, optimizer

# Sample model and optimizer
model = YourModel()  # Use your model
optimizer = YourOptimizer(model.parameters())

# Load DeepSpeed config for BF16
ds_config = "deepspeed_bf16_config.json"  # path to your DeepSpeed config

# Initialize DeepSpeed model with BF16
model_engine, optimizer = init_engine(model, optimizer, ds_config)

# Train your model
for inputs, targets in dataloader:
    outputs = model_engine(inputs)
    loss = criterion(outputs, targets)
    model_engine.backward(loss)
    model_engine.step()

Stable Training − BF16 ensures stability, especially for large models.

Efficient Training − Memory and computation efficiency are close to FP16.

Advantages of Mixed Precision Training

The following are key advantages of mixed precision training −

  • Less Memory Usage − Most calculations use 16-bit precision, which takes half the memory of 32-bit precision. This allows training larger models or using bigger batch sizes without increasing hardware requirements.
  • Speedup − Hardware accelerators, such as GPUs or TPUs, can run low-precision computations several times faster than standard 32-bit floating-point computations. That is a significant speedup, especially for big models.
  • Preserved Accuracy − Mixed precision keeps the computations most sensitive to accuracy, for instance gradient accumulation and weight updates, in 32-bit precision, so the accuracy of the model is largely preserved even though lower precision is used elsewhere.

Challenges of Mixed Precision Training

The following are some challenges in mixed precision training −

  • Numerical Stability − Training at lower precision can lead to loss of numerical stability, especially with FP16. This might lead to gradient underflow or overflow, resulting in poor convergence during optimization.
  • Loss of Precision − Some models lose accuracy when run in mixed precision, so the precision used for individual operations may have to be managed carefully.
  • Hardware Compatibility − Mixed precision training isn't supported by all hardware. Before training with mixed precision, make sure that your hardware supports FP16 or BF16. Examples of supporting hardware include NVIDIA's Tensor Cores and Google's TPUs.

Best Practices of Mixed Precision Training

Here are some best practices to effectively implement mixed precision training −

1. Suitable Hardware

Mixed precision can only be used fully with hardware optimized for FP16 or BF16 computations, such as Nvidia's Tensor Cores or Google's TPUs.

2. Automatic Mixed Precision (AMP)

Automatic Mixed Precision Libraries − DeepSpeed and PyTorch support mixed precision training with minimal code changes. Just enable AMP, and the framework automatically switches between FP16/FP32 or BF16/FP32 on your behalf.

import torch
from torch.cuda import amp

# Initialize amp autocast and GradScaler
autocast = amp.autocast
GradScaler = amp.GradScaler

# Create a GradScaler
scaler = GradScaler()

for data, target in dataloader:
    optimizer.zero_grad()
    with autocast():
        outputs = model(data)
        loss = criterion(outputs, target)

    # Scale the loss and backward pass
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Stable and Efficient Training − AMP chooses a suitable precision for each operation, and the gradient scaler reduces the risk of gradient underflow.

3. Loss Scaling Tracking and Stabilization

DeepSpeed and PyTorch both offer automatic loss scaling, in which the scale is adjusted during training to avoid numerical instability. In the DeepSpeed configuration, setting loss_scale to 0 selects automatic (dynamic) loss scaling.

{
   "fp16": {
      "enabled": true,
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "hysteresis": 2,
      "min_loss_scale": 1
   }
}

More Accurate Model − Loss scaling keeps small gradients from underflowing to zero in FP16, so the model converges in a stable manner.
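The idea behind dynamic loss scaling can be sketched in a few lines. The policy below is a simplified illustration, not DeepSpeed's actual implementation: it ignores the hysteresis setting, and the function name update_loss_scale is hypothetical. The window and min_scale arguments mirror the loss_scale_window and min_loss_scale config keys.

```python
def update_loss_scale(scale, good_steps, overflow, window=1000, min_scale=1.0):
    """One step of a simplified dynamic loss-scaling policy.

    Returns (new_scale, new_good_steps, skip_step).
    """
    if overflow:
        # Gradients contained inf/NaN: skip the update and back off
        return max(scale / 2.0, min_scale), 0, True
    good_steps += 1
    if good_steps == window:
        # A full window without overflow: try a larger scale
        return scale * 2.0, 0, False
    return scale, good_steps, False


scale, good = 65536.0, 0
for overflow in [False, False, True, False, False, False]:
    scale, good, skipped = update_loss_scale(scale, good, overflow, window=3)
    print(scale, skipped)
```

In this run the scale is halved at the overflow on the third step and doubled again after three clean steps, ending back at 65536.0.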

4. Profiling for Memory and Speed

Profile your models to measure the memory savings and speedup gained from mixed precision training. Use tools such as PyTorch's torch.profiler to monitor these metrics −

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
) as profiler:
    for step, (data, target) in enumerate(dataloader):
        outputs = model(data)
        loss = criterion(outputs, target)
        loss.backward()
        profiler.step()  # mark the end of each training step

print(profiler.key_averages().table(sort_by="cuda_time_total"))

Optimized Memory and Speed − Profiling helps confirm that the expected benefits of mixed precision training are actually realized.

Summing Up

Mixed precision training with DeepSpeed accelerates model training and conserves memory while maintaining accuracy. By taking advantage of formats like FP16 or BF16, you can process massive models and datasets at significantly lower computational cost. Although it takes some care, adopting best practices around AMP, proper loss scaling, and hardware compatibility will help you get the full benefit of mixed precision. As models continue to grow, mixed precision training will remain an essential tool for scaling them.

DeepSpeed - PyTorch & Transformers

Through its integration with PyTorch and Hugging Face Transformers, DeepSpeed provides highly efficient training and inference for large models. It ranges from basic configuration to memory-oriented optimization techniques for scaling machine learning models. In this chapter, you will learn how to adapt existing PyTorch codebases and Hugging Face models to DeepSpeed to improve speed and reduce memory usage during training.

Now let's go through these step by step, with code examples and outputs, to help you integrate DeepSpeed into machine learning workflows as seamlessly as possible.

DeepSpeed with PyTorch Models

DeepSpeed improves PyTorch models by reducing memory consumption and improving computational efficiency. Here is an example of getting DeepSpeed integrated into a PyTorch-based model training script; this process involves setting up DeepSpeed configuration files and modifications in the training loop:

Example: DeepSpeed with PyTorch

Following is a complete code example of using DeepSpeed with a PyTorch model −

import torch
import deepspeed

# Define a simple PyTorch model
class SimpleModel(torch.nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = torch.nn.Linear(512, 10)

    def forward(self, x):
        return self.fc(x)

# Initialize model and optimizer
model = SimpleModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Create DeepSpeed configuration file
ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 2,
    "fp16": {
        "enabled": True
    }
}

# Initialize DeepSpeed
model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config
)

# Example training loop (random data stands in for a real dataset)
data = torch.randn(32, 512).to(model.device).half()  # the fp16 engine expects half-precision inputs
target = torch.randint(0, 10, (32,)).to(model.device)

for epoch in range(10):
    model.train()
    outputs = model(data)
    loss = torch.nn.functional.cross_entropy(outputs, target)

    model.backward(loss)
    model.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")

Output

Epoch 1, Loss: 2.302
Epoch 2, Loss: 2.176
...
Epoch 10, Loss: 1.862

We have so far defined a very simple PyTorch model, created a DeepSpeed configuration, and initialized the model using deepspeed.initialize(). The training loop in this example is slightly modified to use model.backward() and model.step() instead of calling loss.backward() and the PyTorch optimizer directly.

Integration with Hugging Face Transformers

Hugging Face's Transformers library brings state-of-the-art models such as BERT, GPT, and T5, which often require heavy computing resources. With DeepSpeed, we can optimize both training and inference of these large transformer models. Now, let's see how to use DeepSpeed with a Hugging Face transformer model.

Example: DeepSpeed with Hugging Face Transformers

Following is a complete example of implementation of DeepSpeed with Hugging Face Transformers −

import json
from transformers import BertForSequenceClassification, Trainer, TrainingArguments
import deepspeed

# DeepSpeed configuration; "auto" values are filled in by the Trainer
# from TrainingArguments so that the two never disagree
ds_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "fp16": {
        "enabled": "auto"
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    }
}

# Save DeepSpeed config file (ds_config.json) before TrainingArguments reads it
with open("./ds_config.json", "w") as f:
    json.dump(ds_config, f)

# To classify sequences, load a pre-trained BERT model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="steps",  # named eval_strategy in recent transformers releases
    save_steps=10,
    logging_dir="./logs",
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=100,
    fp16=True,
    deepspeed="./ds_config.json",  # Provide DeepSpeed configuration file
)

# Initialize the Hugging Face Trainer
# (train_dataset and eval_dataset are tokenized datasets prepared elsewhere)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

# Start training
trainer.train()

Output

Training prints its progress as log lines similar to the following. With this configuration, DeepSpeed applies mixed precision to make training more efficient.

{'loss': 0.67, 'learning_rate': 5e-5, 'epoch': 1.0, 'step': 100}
{'loss': 0.57, 'learning_rate': 4e-5, 'epoch': 2.0, 'step': 200}

In this example, we loaded a pre-trained BERT model, defined a DeepSpeed configuration, and used the Hugging Face Trainer to train the model with DeepSpeed enabled.

Porting PyTorch Codebases to DeepSpeed

For legacy PyTorch codebases, integrating DeepSpeed requires only minor changes. In most cases, you only need to initialize the model with deepspeed.initialize() and make sure your training loop uses the DeepSpeed API.

Step-by-Step Guide to Integrating Legacy PyTorch Codebases with DeepSpeed

Step 1: Install DeepSpeed

The following command may be used to install DeepSpeed.

pip install deepspeed

Step 2: Update Model Initialization

Replace the standard model and optimizer initialization with DeepSpeed's initialization.

model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config
)

Step 3: Modify the Training Loop

Replace the calls to loss.backward() with model.backward(loss) and optimizer.step() with model.step()

Example: Integration with an Existing Codebase

Look at the following example code −

# Standard PyTorch training code
for epoch in range(num_epochs):
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model(batch['input'])
        loss = criterion(outputs, batch['target'])
        loss.backward()
        optimizer.step()

# DeepSpeed training code
for epoch in range(num_epochs):
    for batch in train_loader:
        outputs = model(batch['input'])
        loss = criterion(outputs, batch['target'])
        model.backward(loss)
        model.step()

By replacing backward and step operations with DeepSpeed's versions, you benefit from the optimizations DeepSpeed provides without changing much of the existing logic.

Advanced Integration Tips and Tricks

To make the most of DeepSpeed's capabilities, here are some advanced tips and tricks:

Memory optimization − DeepSpeed's ZeRO reduces per-GPU memory usage by partitioning model states across devices. A large amount of memory can be saved with ZeRO stages 1, 2, or 3.

{
    "zero_optimization": {
        "stage": 2
    }
}

Mixed Precision Training − Mixed precision training is easy to enable through a switch in the DeepSpeed configuration file. It reduces memory usage and speeds up training on modern GPUs.

{
    "fp16": {
        "enabled": true
    }
}

Gradient Accumulation − in the case of limited GPU memory, DeepSpeed has an option for gradient accumulation over several batches before the update.

{
   "gradient_accumulation_steps": 4
}

In addition to the above, DeepSpeed offers many other features that can further optimize the training of large models in resource-constrained environments.
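The three fragments above can be combined into a single configuration. The sketch below builds the config as a Python dictionary and serializes it with the standard json module; the micro batch size of 8 is an arbitrary placeholder, and train_micro_batch_size_per_gpu is used here as the batch setting.

```python
import json

ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # placeholder value
    "gradient_accumulation_steps": 4,
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": True},
}

# json.dumps writes Python's True as JSON true, matching the snippets above
text = json.dumps(ds_config, indent=4)
print(text)
```

The resulting text can be written to a file such as ds_config.json and passed to DeepSpeed by path, or the dictionary itself can be passed as the config argument of deepspeed.initialize().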

DeepSpeed - Inference Optimization

DeepSpeed is a comprehensive framework for optimizing inference with large-scale deep learning models, supporting techniques such as quantization, kernel fusion, pipeline parallelism, and more. You can achieve faster model performance, lower latency, and scalable deployments, whether on cloud servers, an edge device, or a serverless platform, with little or no loss of accuracy.

What is Inference Optimization?

Inference is often the bottleneck in many applications, especially those using large-scale deep learning models. The more complex the model, the higher the latency, the hardware consumption, and the deployment effort. DeepSpeed addresses these problems with advanced inference optimization features that provide faster inference, lower latency, and greater throughput while maintaining model accuracy.

This chapter demonstrates how DeepSpeed optimizes model inference, the techniques it applies for latency reduction, and exactly how you can deploy models with optimized inference into real-world applications.

DeepSpeed and Efficient Inference of Models

DeepSpeed explicitly targets models with billions of parameters so that inference runs efficiently on a modest amount of hardware. It comes with out-of-the-box optimizations for speeding up inference, including quantization and kernel fusion.

Example: Speeding Up Inference

Let's take a simple example of how to use DeepSpeed to optimize inference for a Hugging Face model.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import deepspeed

# Load a pre-trained model and tokenizer from Hugging Face
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example input text
inputs = tokenizer("DeepSpeed makes model inference faster!", return_tensors="pt")

# Enable DeepSpeed inference optimization
model = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.float16,
    replace_method="auto",
)

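# Perform inference (inputs must be on the same device as the model)
inputs = {name: tensor.to("cuda") for name, tensor in inputs.items()}
with torch.no_grad():
    outputs = model(**inputs)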

# Output prediction
print(outputs.logits)

Output

tensor([[ 1.0459, -1.0142]], device='cuda:0', dtype=torch.float16)

DeepSpeed reduces memory usage and can improve inference performance by around 1.5x to 2x, depending on the hardware configuration.

Explanation

  • deepspeed.init_inference − This initializes the model for inference with some optimizations. mp_size indicates how many GPUs to split the model across, and dtype=torch.float16 enables half precision for faster computation.
  • replace_method − This is set to "auto" so that DeepSpeed automatically applies further optimizations, such as kernel fusion.

Inference Latency Reduction Techniques

The most critical concern for real-time applications is inference latency. The most important techniques provided by DeepSpeed are as follows:

1. Quantization

Quantization stores model weights at a lower precision than 32-bit floating point (FP32), for instance 16-bit floating point (FP16) or even 8-bit integers. This yields large savings in both compute and memory footprint, usually with only a small loss of accuracy.

# Quantization in DeepSpeed
model = deepspeed.init_inference(model, mp_size=1, dtype=torch.int8, replace_method="auto")

Here, dtype=torch.int8 selects 8-bit quantization, which substantially reduces the model size and the time taken for inference.
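To make the idea concrete, the sketch below applies symmetric per-tensor int8 quantization to a short list of weights in plain Python. It only illustrates the principle; DeepSpeed's own quantization kernels work differently in detail, and the helper names quantize_int8 and dequantize are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to int8 values plus a scale.

    Assumes at least one weight is non-zero.
    """
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to approximate floats."""
    return [q * scale for q in quantized]


weights = [0.50, -1.27, 0.03, 0.91]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)  # [50, -127, 3, 91]
```

Each weight now needs one byte instead of four, and the restored values differ from the originals by at most half a quantization step.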

2. Kernel Fusion

Kernel fusion is another technique, in which several operations are fused into a single kernel to minimize the number of memory accesses. This optimization reduces the overhead of kernel launches and the memory bandwidth used.

3. Pipeline Parallelism

Pipeline parallelism splits a huge model into stages placed on different GPUs, so that successive batches flow through the stages concurrently. This is helpful for very large models when the memory of a single GPU is not enough.

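Following is an example of splitting a model across four GPUs with DeepSpeed. Note that init_inference partitions the model through mp_size (model parallelism) and does not take a pipeline_parallel argument; pipeline parallelism for training is provided separately through deepspeed.pipe.PipelineModule −

# Model partitioning across 4 GPUs
model = deepspeed.init_inference(model, mp_size=4, dtype=torch.float16)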

4. Tensor Slicing

Tensor slicing helps fit the model onto hardware with limited memory by slicing large tensors into chunks. The load is distributed across the GPUs, which reduces memory consumption per GPU and improves inference speed.
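The principle can be shown with a small matrix in plain Python: the weight matrix is split by columns, each shard is multiplied independently (as if on its own GPU), and the partial results are concatenated. This is a conceptual sketch with hypothetical helper names, not how DeepSpeed implements slicing internally.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, parts):
    """Slice the columns of w into `parts` equal shards, one per device."""
    width = len(w[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# Each "device" multiplies by its own column shard; results are concatenated
shards = split_columns(w, parts=2)
sliced = [value for shard in shards for value in matmul(x, shard)]
print(sliced == matmul(x, w))  # True
print(sliced)                  # [11.0, 14.0, 17.0, 20.0]
```

Each shard holds only half of the weights, which is what lets a layer that is too large for one device be spread across several.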

DeepSpeed Inference Deployment Strategies

Once a model is optimized for inference, there are several strategies to deploy it efficiently. Here are some of the deployment strategies using DeepSpeed:

1. Serverless Inference using DeepSpeed

Serverless architectures like AWS Lambda can be used for deploying inference services in scale. DeepSpeed can be used to optimize the model to fit in serverless function memory limits and time constraints.

Following is an example of deploying DeepSpeed with FastAPI −

from fastapi import FastAPI
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import deepspeed

app = FastAPI()
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model = deepspeed.init_inference(model, mp_size=1, dtype=torch.float16, replace_method="auto")

@app.post("/predict")
def predict(text: str):
    # Tokenize and move the inputs to the GPU that holds the model
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = model(**inputs)
    return {"logits": outputs.logits.tolist()}

We create a RESTful API using FastAPI. The model makes predictions over a simple endpoint.

DeepSpeed's optimizations help maximize inference throughput when batching inputs and handling many API requests.

2. Batching for High Throughput

Batching multiple inputs during inference improves throughput, because the system can serve more requests at one time. DeepSpeed handles this efficiently by splitting the batches across the GPUs and processing them in parallel.
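On the serving side, batching usually starts with grouping pending requests. The helper below is a minimal sketch of that step in plain Python; make_batches and the sample texts are hypothetical, and a real server would also bound how long a request waits for its batch to fill.

```python
def make_batches(requests, max_batch_size):
    """Group pending requests into batches of at most max_batch_size items."""
    return [requests[i:i + max_batch_size] for i in range(0, len(requests), max_batch_size)]


pending = ["text 1", "text 2", "text 3", "text 4", "text 5"]
for batch in make_batches(pending, max_batch_size=2):
    # With a Hugging Face tokenizer, a whole batch could be encoded at once,
    # e.g. tokenizer(batch, padding=True, return_tensors="pt")
    print(len(batch), batch)
```

Each batch can then go through the model in a single forward pass instead of one pass per request.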

3. Edge Deployment with DeepSpeed

DeepSpeed's quantization and low-memory methods enable the deployment of large models in edge devices that have limited computational power for very low-latency inference in applications such as mobile devices and IoT devices.

Examples of Optimized Inference in Real World

Following are some examples of optimized inference in the real world −

1. Microsoft Turing-NLG

Microsoft used DeepSpeed to serve an optimized version of Turing-NLG, one of its largest language models. Techniques such as model parallelism and quantization allowed Microsoft to reduce the inference latency of this huge model by up to 4x.

2. Hugging Face Models

Many Hugging Face models can be served in production with DeepSpeed. For example, one could reach a 2x inference speedup on BERT and GPT-2 models by optimizing them with DeepSpeed features such as quantization and kernel fusion.

3. Nvidia Megatron-LM

Megatron-LM, NVIDIA's large transformer language model, has likewise been optimized for both training and inference using DeepSpeed. This leads to faster model serving and lower memory overhead, and makes large-scale deployment on cloud infrastructure practical.

DeepSpeed - Advanced Features

DeepSpeed is a high-performance deep learning optimization library from Microsoft that has driven innovation in training large-scale AI models by scaling up efficiently while keeping resource usage low.

While users may already be familiar with the core functionality, making full use of DeepSpeed requires in-depth knowledge of advanced features such as custom operators, sophisticated configuration options, and tools for profiling and debugging.

This chapter explores these features in more depth, showing how to get even more out of DeepSpeed in your deep learning projects.

Custom Operators in DeepSpeed

DeepSpeed's custom operators let you optimize specific parts of a model so that they compute efficiently with little overhead. They are useful when the default implementation performs poorly for the task at hand, and they give developers room to tune individual model components to their performance needs.

Example: Custom Operators

DeepSpeed lets you register and embed custom CUDA and CPU operators. The simple example below builds a custom add operator on PyTorch's torch.autograd.Function, so it does not require a compiled kernel −

import torch
# Custom Add Operator using PyTorch's autograd
class AddOp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, y):
        result = x + y
        ctx.save_for_backward(x, y)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        x, y = ctx.saved_tensors
        return grad_output, grad_output

# Wrapper class for the custom operator
class CustomAddOp:
    def build(self):
        return AddOp.apply

# Testing the custom operator
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = torch.tensor([4.0, 5.0, 6.0], requires_grad=True)

custom_add = CustomAddOp().build()
output = custom_add(x, y)
output.backward(torch.ones_like(x))

print(f"Output: {output}")
print(f"x Gradient: {x.grad}")
print(f"y Gradient: {y.grad}")

Output

Output: tensor([5., 7., 9.], grad_fn=<AddOpBackward>)
x Gradient: tensor([1., 1., 1.])
y Gradient: tensor([1., 1., 1.])

Custom operators are for developers who need to tune and optimize a particular model layer. They offer full flexibility for large models and computationally intensive workloads.
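
When you write a backward pass by hand, it helps to compare it against a numerical estimate. The following sketch uses central finite differences in plain Python to confirm that the gradient of sum(x + y) with respect to x is a vector of ones, matching the custom operator above. The helper names are hypothetical; PyTorch also provides torch.autograd.gradcheck for checking real operators.

```python
from typing import Callable, List


def numerical_gradient(
    f: Callable[[List[float]], float], x: List[float], eps: float = 1e-6
) -> List[float]:
    """Central finite-difference estimate of df/dx_i for each component."""
    grads = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += eps
        minus[i] -= eps
        grads.append((f(plus) - f(minus)) / (2 * eps))
    return grads


# Second operand of the add, held fixed while differentiating with respect to x
Y_VALUES = [4.0, 5.0, 6.0]


def loss_wrt_x(x: List[float]) -> float:
    """Scalar loss sum(x + y), equivalent to backward(torch.ones_like(x))."""
    return sum(xi + yi for xi, yi in zip(x, Y_VALUES))


if __name__ == "__main__":
    estimate = numerical_gradient(loss_wrt_x, [1.0, 2.0, 3.0])
    # Each entry should be close to 1.0, the analytic gradient
    print([round(g, 6) for g in estimate])
```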

More Fine-Tuned Configuration Options

DeepSpeed has a powerful configuration system that gives users fine-grained control over how models are trained. These options are specified in JSON configuration files.

Example Configuration File

Here is a minimal DeepSpeed config file with advanced options −

{
    "train_batch_size": 32,
    "gradient_accumulation_steps": 2,
    "fp16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 0.001,
            "betas": [0.9, 0.999],
            "eps": 1e-08,
            "weight_decay": 0.01
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0.0,
            "warmup_max_lr": 0.001,
            "warmup_num_steps": 100
        }
    },
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": true,
        "reduce_scatter": true,
        "allgather_partitions": true
    }
}

Advanced Options

  • fp16 − Enables mixed precision training, which greatly improves performance with negligible loss of accuracy.
  • zero_optimization − Enables the Zero Redundancy Optimizer (ZeRO), which reduces the memory needed for large models by partitioning optimizer states and gradients across GPUs instead of replicating them.
  • gradient_accumulation_steps − Accumulates gradients over several smaller micro-batches before each optimizer step. This gives a larger effective batch size without fitting the whole batch in memory, so training stays efficient on memory-constrained hardware.
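
The batch-size options are linked: DeepSpeed expects train_batch_size to equal the micro-batch size per GPU times gradient_accumulation_steps times the number of GPUs. The sketch below solves this relation for the micro-batch size. The function name and the 4-GPU setup are assumptions for illustration.

```python
def micro_batch_size_per_gpu(train_batch_size: int,
                             gradient_accumulation_steps: int,
                             num_gpus: int) -> int:
    """Solve train_batch_size = micro_batch * accumulation_steps * num_gpus."""
    denominator = gradient_accumulation_steps * num_gpus
    if train_batch_size % denominator != 0:
        raise ValueError(
            "train_batch_size must be divisible by "
            "gradient_accumulation_steps * num_gpus"
        )
    return train_batch_size // denominator


if __name__ == "__main__":
    # Values from the config above, with a hypothetical 4-GPU node
    print(micro_batch_size_per_gpu(32, 2, 4))  # prints 4
```

With the configuration above on four GPUs, each GPU therefore processes 4 samples per forward pass and accumulates gradients over 2 such passes.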

Loading and Using the Configuration

To load and use the config in your training script, do the following −

import deepspeed
import torch

# Assuming a model and dataloader are already defined
# For example: model = YourModelClass() and dataloader = DataLoader(...)

# Load DeepSpeed configuration
# ds_config.json should contain the configuration details for DeepSpeed
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config='ds_config.json'
)

# Training loop
for batch in dataloader:
    # Forward pass
    outputs = model_engine(batch)
    loss = outputs['loss'] if isinstance(outputs, dict) and 'loss' in outputs else outputs

    # Backward pass and optimization step
    model_engine.backward(loss)
    model_engine.step()
# This setup allows large models to leverage DeepSpeed's advanced memory and computational optimizations.

This setup lets training use DeepSpeed's advanced functionality, which is especially beneficial for large models.

Profiling and Debugging DeepSpeed Applications

Profiling and debugging help you identify bottlenecks in large models and make sure your code runs efficiently. DeepSpeed provides several tools for this, including built-in logging and interoperability with mainstream profilers such as NVIDIA Nsight Systems.

Using DeepSpeed Profiler

DeepSpeed provides hooks to add performance profiling when training a model. You can easily add these hooks to your training script.

import deepspeed
import torch

# Define DeepSpeed configuration with profiling enabled
deepspeed_config = {
    "train_batch_size": 32,
    "steps_per_print": 10,
    "gradient_clipping": 1.0,
    "wall_clock_breakdown": True  # Enables detailed timing breakdown for profiling
}

# Assuming `model` and `data_loader` are defined
# Initialize DeepSpeed with profiling enabled
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config_params=deepspeed_config,
    model_parameters=model.parameters()
)

# Start training with profiling enabled
for batch in data_loader:
    # Forward pass
    outputs = model_engine(batch)
    loss = outputs['loss'] if isinstance(outputs, dict) and 'loss' in outputs else outputs

    # Backward pass and optimization step
    model_engine.backward(loss)
    model_engine.step()

# Profiling enabled by `wall_clock_breakdown` provides detailed insights into performance bottlenecks.
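
To show what a per-phase timing breakdown measures, here is a minimal hand-rolled sketch using only the standard library. The PhaseTimer class is hypothetical and is not DeepSpeed's profiler; the sleep calls stand in for the forward, backward, and step calls of the loop above.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase of a training step."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        # Average seconds per call for each phase
        return {name: self.totals[name] / self.counts[name] for name in self.totals}


if __name__ == "__main__":
    timer = PhaseTimer()
    for _ in range(3):
        with timer.phase("forward"):
            time.sleep(0.01)   # stand-in for model_engine(batch)
        with timer.phase("backward"):
            time.sleep(0.02)   # stand-in for model_engine.backward(loss)
        with timer.phase("step"):
            time.sleep(0.005)  # stand-in for model_engine.step()
    print(timer.report())
```

On a GPU, kernels run asynchronously, so a manual timer like this would need a torch.cuda.synchronize() call before reading the clock to give meaningful numbers.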

Debugging Techniques

DeepSpeed also offers debugging utilities such as logging and real-time resource monitoring. They help detect memory leaks, communication overhead, and other sources of inefficiency in gradient updates.

To control how often DeepSpeed logs progress and to include timing details in the logs, set the top-level steps_per_print and wall_clock_breakdown options as follows −

{
    "steps_per_print": 50,
    "wall_clock_breakdown": true
}

You can also attach an external debugger such as pdb or gdb to parts of DeepSpeed to trace errors as they occur.

Experimental and Cutting-Edge Features

DeepSpeed evolves quickly, and recent releases include a number of experimental features. These can bring large performance gains in specific cases, but they should be tested thoroughly before use.

3D Parallelism

DeepSpeed's 3D parallelism combines tensor, pipeline, and data parallelism to scale training to models with billions of parameters.

Here's an illustrative configuration for 3D parallelism. Treat the key names as a sketch, since the exact schema depends on your DeepSpeed version −

{
    "train_batch_size": 64,
    "tensor_parallel": {
        "tp_size": 8
    },
    "pipeline_parallel": {
        "p_size": 4,
        "activation_checkpointing": true
    }
}

With this configuration, the model is split into tensor parallel groups of size 8, and the pipeline is divided into 4 stages, thereby ensuring efficient memory usage while training massive models.
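
The three dimensions multiply: the total number of GPUs equals the tensor-parallel size times the number of pipeline stages times the data-parallel degree. The sketch below derives the data-parallel degree from the other two. The function name and the 64-GPU cluster are assumptions for illustration.

```python
def data_parallel_degree(world_size: int, tensor_parallel: int,
                         pipeline_stages: int) -> int:
    """Number of model replicas when world_size = tp * pp * dp."""
    model_parallel_size = tensor_parallel * pipeline_stages
    if world_size % model_parallel_size != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    return world_size // model_parallel_size


if __name__ == "__main__":
    # tp=8 and pp=4 come from the configuration above; 64 GPUs is hypothetical
    print(data_parallel_degree(64, 8, 4))  # prints 2
```

In this setup one model replica spans 32 GPUs, and two such replicas train on different slices of each batch.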

Activation Checkpointing

This method reduces memory use during training by saving only a subset of activations in the forward pass and recomputing the rest during the backward pass.

import deepspeed
from deepspeed.runtime.activation_checkpointing import checkpointing

# To activate activation checkpointing
checkpointing.configure(None, deepspeed_config="ds_config.json")

Activation checkpointing is especially important when training very deep models or when GPU memory is limited.
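
The memory saving can be estimated with simple arithmetic. The sketch below uses a deliberately simplified model in which every layer's activation has the same size: one checkpoint is kept per segment, plus the activations of the single segment being recomputed. The function name and the 64-layer depth are assumptions for illustration.

```python
import math


def peak_stored_activations(num_layers: int, checkpoint_interval: int) -> int:
    """Rough count of layer activations held at once with checkpointing.

    Keeps one checkpoint per segment, plus the activations of the single
    segment being recomputed during the backward pass.
    """
    num_checkpoints = math.ceil(num_layers / checkpoint_interval)
    return num_checkpoints + checkpoint_interval


if __name__ == "__main__":
    layers = 64  # hypothetical model depth
    print("no checkpointing:", layers)
    for interval in (2, 4, 8, 16):
        print(interval, peak_stored_activations(layers, interval))
```

Under this model the count is smallest when the interval is close to the square root of the number of layers, at the cost of roughly one extra forward pass of computation.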

DeepSpeed - Large Language Models (LLMs)

DeepSpeed simplifies the training and fine-tuning of large-scale language models such as GPT and BERT. It makes training possible on far fewer hardware resources without sacrificing performance.

DeepSpeed addresses the main difficulties of LLM training with optimizations such as ZeRO, mixed-precision training, and gradient accumulation. It also streamlines tuning and deployment, so LLMs can be applied to real-world problems efficiently.

DeepSpeed makes large-scale AI models less intimidating and more achievable for developers and researchers alike.

Training Large-Scale Language Models with DeepSpeed

Large-scale language models, such as GPT and BERT, require a tremendous amount of hardware resources, memory, and time to train. DeepSpeed facilitates the mitigation of these problems through several key features:

1. ZeRO (Zero Redundancy Optimizer)

ZeRO optimization lies at the heart of DeepSpeed and can reduce per-GPU memory usage by an order of magnitude or more when training huge models. It works by partitioning the model states (gradients, optimizer states, and parameters) across multiple GPUs instead of replicating them on each one. This makes it feasible to train models with up to trillions of parameters, a scale that was out of reach for traditional hardware configurations.

import deepspeed
import torch
import torch.nn as nn

# Define the model
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.linear = nn.Linear(512, 512)

    def forward(self, x):
        return self.linear(x)

# Model, optimizer, and DeepSpeed configuration
model = MyModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# DeepSpeed configuration dictionary
ds_config = {
    "train_batch_size": 64,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-4
        }
    },
    "zero_optimization": {
        "stage": 1
    },
    "fp16": {
        "enabled": True
    }
}

# Initialize the model with DeepSpeed
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config_params=ds_config
)

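With this configuration (ZeRO stage 1), DeepSpeed automatically partitions the optimizer states among the available GPUs. Stage 2 also partitions gradients, and stage 3 additionally partitions the model parameters, so large models can be trained in a memory-friendly way.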
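
The effect of each stage can be estimated with the accounting commonly used for ZeRO with Adam and FP16: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for optimizer states. The sketch below applies it per stage. The function name and the 7.5-billion-parameter, 64-GPU example are assumptions for illustration, and activations and temporary buffers are not counted.

```python
def zero_model_state_gb(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU memory (GB) for model states with Adam and FP16.

    Assumes 2 bytes/param for FP16 weights, 2 for FP16 gradients, and
    12 for optimizer states (FP32 weights, momentum, and variance).
    """
    params, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if stage >= 1:
        optim /= num_gpus   # stage 1 partitions optimizer states
    if stage >= 2:
        grads /= num_gpus   # stage 2 also partitions gradients
    if stage >= 3:
        params /= num_gpus  # stage 3 also partitions parameters
    return num_params * (params + grads + optim) / 1e9


if __name__ == "__main__":
    # Hypothetical 7.5B-parameter model trained on 64 GPUs
    for stage in range(4):
        print(f"stage {stage}: {zero_model_state_gb(7.5e9, 64, stage):.1f} GB per GPU")
```

For this example the estimate falls from 120 GB without ZeRO to about 31.4 GB, 16.6 GB, and 1.9 GB for stages 1, 2, and 3.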

2. Mixed Precision Training

DeepSpeed supports mixed precision training with FP16. It reduces memory usage and speeds up training, usually without hurting model accuracy. This is essential for training LLMs under hardware constraints.

import deepspeed

# FP16 mixed precision training configuration in DeepSpeed
ds_config = {
    "train_batch_size": 64,
    "fp16": {
        "enabled": True
    }
}

# Initialize the model with DeepSpeed
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config_params=ds_config
)
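
FP16 has a narrow numeric range, which is why mixed precision training relies on loss scaling. The sketch below uses the standard library's half-precision struct format to show a small gradient underflowing to zero in FP16 and surviving once it is scaled. The gradient value and the loss scale are hypothetical.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


if __name__ == "__main__":
    tiny_grad = 1e-8      # hypothetical gradient value
    loss_scale = 65536.0  # 2**16, a hypothetical loss scale

    print(to_fp16(tiny_grad))                 # underflows to 0.0 in FP16
    scaled = to_fp16(tiny_grad * loss_scale)  # representable after scaling
    print(scaled / loss_scale)                # close to the original 1e-8
```

The scaled value is divided by the same factor in higher precision before the optimizer update, so the update itself is unchanged.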

3. Gradient Accumulation and Checkpointing

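DeepSpeed also supports gradient accumulation, which spreads one optimizer step over several smaller micro-batches so that larger effective batch sizes fit in memory. Activation checkpointing reduces memory further by recomputing parts of the forward pass during the backward pass.

ds_config = {
    "gradient_accumulation_steps": 4,
    # Options for DeepSpeed's activation checkpointing. The model must still
    # wrap its layers with deepspeed.checkpointing.checkpoint(...).
    "activation_checkpointing": {
        "partition_activations": True
    }
}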

Together, gradient accumulation and activation checkpointing make training much more memory-efficient, so larger batch sizes can be handled on considerably weaker hardware.
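
Gradient accumulation does not change the result of a step. The sketch below checks this for a one-parameter linear model with a mean squared error loss: the gradient accumulated over four micro-batches equals the gradient of the full batch. The model, the data, and the function names are hypothetical, and the batch size is assumed to be divisible by the number of steps.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def mean_grad(w: float, samples: List[Sample]) -> float:
    """Gradient of the mean squared error of the model y = w * x w.r.t. w."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)


def accumulated_grad(w: float, samples: List[Sample], steps: int) -> float:
    """Accumulate gradients over `steps` equal micro-batches."""
    micro = len(samples) // steps
    total = 0.0
    for i in range(steps):
        chunk = samples[i * micro:(i + 1) * micro]
        # Dividing by `steps` makes the sum equal the full-batch mean gradient
        total += mean_grad(w, chunk) / steps
    return total


if __name__ == "__main__":
    data = [(float(i), 3.0 * i + 1.0) for i in range(8)]
    print(mean_grad(0.5, data))
    print(accumulated_grad(0.5, data, steps=4))
```

Only one micro-batch of activations is in memory at a time, which is where the saving comes from.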

Case Studies: GPT, BERT, and Beyond

Several large language models already benefit from DeepSpeed's optimizations.

1. Training GPT-3 Using DeepSpeed Scaling

GPT-3, with 175 billion parameters, was one of the largest models at the time of its release. Training a model of that size would normally take thousands of GPUs and months of time. DeepSpeed's ZeRO optimizer partitions model states across devices, so training at a comparable scale can fit on a few hundred GPUs.

Example: GPT-3 Mini Training using DeepSpeed

import deepspeed
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load tokenizer and model (using GPT-2 as a placeholder for a GPT-3 scaled training example)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# DeepSpeed configuration
ds_config = {
    "train_batch_size": 8,
    "fp16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 1
    }
}

# Initialize DeepSpeed
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config_params=ds_config,
    model_parameters=model.parameters()
)

# Encode input text
input_text = "DeepSpeed helps train large models"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

# Move input to the correct device
input_ids = input_ids.to(model_engine.device)

# Forward pass
outputs = model_engine(input_ids=input_ids)
print("Model output:", outputs)

2. BERT Optimization Using DeepSpeed

BERT models are widely used in NLP applications such as question answering and text classification. BERT is computationally expensive because it needs large datasets and substantial compute to train well. DeepSpeed speeds up BERT training and uses resources more efficiently, especially in multi-GPU setups and during fine-tuning.

Example: Fine-tuning BERT with DeepSpeed

import deepspeed
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Load model and tokenizer
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# DeepSpeed configuration for BERT fine-tuning
ds_config = {
    "train_batch_size": 32,
    "zero_optimization": {
        "stage": 2
    },
    "fp16": {
        "enabled": True
    }
}

# Create an optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Initialize DeepSpeed with the optimizer
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,  # Pass the optimizer here
    config_params=ds_config,
    model_parameters=model.parameters()
)

# Continue with encoding inputs and training as before...
input_ids = tokenizer.encode("This is an amazing product!", return_tensors="pt").to(model_engine.device)
labels = torch.tensor([1], dtype=torch.long).unsqueeze(0).to(model_engine.device)

# Forward pass
outputs = model_engine(input_ids=input_ids, labels=labels)
loss = outputs.loss

# Backward pass and optimization step
model_engine.backward(loss)
model_engine.step()

print("Loss:", loss.item())

Output

Loss: 0.4567  # (This is a hypothetical value; actual output will vary based on input and training)

Overcoming the Challenges Encountered during the Training of LLMs

The training of LLMs is definitely not a walk in the park. Some of the most common issues that crop up during the training of LLMs are given below:

1. High Memory Usage

DeepSpeed offers ZeRO stage optimizations, mixed precision training, and activation checkpointing. Together they allow full models to fit on hardware with relatively little memory.

2. Communication Overhead in Multi-GPU Training

Communication overhead is one of the main factors that slow training down as the number of GPUs grows. DeepSpeed provides advanced tensor parallelism and model parallelism strategies that aim to minimize communication bottlenecks.

3. Long Training Times

DeepSpeed also employs pipeline parallelism, which splits the model's layers into stages across devices. This improves throughput and reduces overall training time. Gradient accumulation and ZeRO improve efficiency further.

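# Configuration for managing memory and communication overhead
ds_config = {
    "train_batch_size": 32,
    # Illustrative flag; pipeline stages are defined in code with
    # deepspeed.pipe.PipelineModule
    "pipeline": {
        "enabled": True
    },
    # DeepSpeed's pipeline engine works with ZeRO stage 1, not stages 2 or 3
    "zero_optimization": {
        "stage": 1
    },
    "fp16": {
        "enabled": True
    }
}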
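
Pipeline parallelism keeps devices busy only if each batch is split into enough micro-batches. For a simple fill-and-drain schedule, the idle fraction is (stages - 1) / (micro-batches + stages - 1). The sketch below evaluates this formula. The function name and the 4-stage setup are assumptions for illustration, and real schedules may differ.

```python
def pipeline_bubble_fraction(num_stages: int, num_micro_batches: int) -> float:
    """Idle fraction of a simple fill-and-drain pipeline schedule."""
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)


if __name__ == "__main__":
    stages = 4  # hypothetical number of pipeline stages
    for micro_batches in (1, 4, 16, 64):
        fraction = pipeline_bubble_fraction(stages, micro_batches)
        print(f"{micro_batches:>3} micro-batches: {fraction:.1%} idle")
```

This is one reason pipeline parallelism is usually combined with gradient accumulation: more micro-batches per step mean a smaller idle fraction.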

Fine-tuning and Deploying LLMs with DeepSpeed

Fine-tuning adapts a pre-trained model to the needs of a specific task. DeepSpeed's efficient training strategies make it practical to fine-tune large models on smaller datasets and modest hardware.

Example: Fine-tuning GPT-2 with DeepSpeed

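import deepspeed
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained GPT-2 model
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# DeepSpeed configuration; an optimizer is defined so that step() can update weights
ds_config = {
    "train_batch_size": 16,
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 5e-5}  # illustrative learning rate
    },
    "fp16": {
        "enabled": True
    }
}

# Initialize DeepSpeed
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config_params=ds_config
)

# Fine-tuning with custom data
input_ids = tokenizer.encode("DeepSpeed makes training efficient", return_tensors="pt")
input_ids = input_ids.to(model_engine.device)
loss = model_engine(input_ids=input_ids, labels=input_ids).loss

# Use the engine's backward so loss scaling and gradient handling are applied
model_engine.backward(loss)
model_engine.step()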

Deployment

After fine-tuning, models can be deployed with relative ease using DeepSpeed. It interoperates well with model-serving libraries, which makes efficient inference with very large models straightforward.

DeepSpeed - Troubleshooting and Common Issues

DeepSpeed is a powerful tool for scaling and optimizing deep learning models, but like any powerful technology it can be tricky to work with. To keep the experience smooth, you need to know how to diagnose common errors, resolve performance bottlenecks, and make use of community support.

Tuning options such as batch sizes, mixed precision, and ZeRO stages can bring large performance gains. The guidelines in this chapter make troubleshooting and optimizing DeepSpeed easier, so you can take full advantage of its capabilities for AI development.

Diagnosing and Fixing Common DeepSpeed Errors

When working with DeepSpeed, errors may arise from the hardware configuration, the installation, or incorrect use of DeepSpeed features. A few common ones are described below −

1. CUDA Out of Memory

This error mostly occurs when training a model whose memory needs exceed the available GPU memory.

Solution

  • Reduce the batch size, or use mixed-precision training to lower the memory requirement.
  • Enable a ZeRO stage in the DeepSpeed configuration so that model states are partitioned across GPUs.

Here is an example of using ZeRO to keep memory usage under control and avoid out-of-memory (OOM) errors −

from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Load model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# DeepSpeed config with ZeRO optimization; "auto" lets the Trainer fill in
# values that must match its own arguments
ds_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "fp16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 2  # For memory efficiency, use ZeRO Stage 2.
    }
}

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',          # Output directory
    evaluation_strategy="steps",     # Named eval_strategy in recent transformers releases
    eval_steps=500,                  # Number of steps between evaluations
    save_steps=10_000,               # Save model every 10,000 steps
    per_device_train_batch_size=16,  # Batch size per device during training
    per_device_eval_batch_size=16,   # Batch size for evaluation
    logging_dir='./logs',            # Directory for storing logs
    logging_steps=100,               # Log every 100 steps
    fp16=True,                       # Mixed precision, mirrored into ds_config
    deepspeed=ds_config              # Hand the DeepSpeed config to the Trainer
)

# Initialize Trainer; DeepSpeed is enabled through training_args
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,     # Ensure train_dataset is defined
    eval_dataset=eval_dataset        # Ensure eval_dataset is defined
)

# Start training
trainer.train()

# Output statement
print("Training was successful, with memory usage reduced by the use of ZeRO Stage 2.")

Output

Training was successful, with memory usage reduced by the use of ZeRO Stage 2.
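
Reducing the batch size after an out-of-memory error can also be automated. The sketch below halves the batch size until one training step succeeds. It uses Python's built-in MemoryError and a simulated step so that it runs anywhere; the function names and the memory limit are hypothetical.

```python
from typing import Callable


def find_fitting_batch_size(run_step: Callable[[int], None],
                            start_batch_size: int) -> int:
    """Halve the batch size until run_step completes without MemoryError."""
    batch_size = start_batch_size
    while batch_size >= 1:
        try:
            run_step(batch_size)
            return batch_size
        except MemoryError:
            # A GPU job would catch torch.cuda.OutOfMemoryError here instead
            batch_size //= 2
    raise RuntimeError("even a batch size of 1 does not fit in memory")


if __name__ == "__main__":
    def fake_step(batch_size: int) -> None:
        # Stand-in for one training step; pretends only 12 samples fit
        if batch_size > 12:
            raise MemoryError("simulated out-of-memory")

    print(find_fitting_batch_size(fake_step, 64))  # prints 8
```

Once a fitting size is found, gradient accumulation can restore the original effective batch size.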

2. DeepSpeed Version Unsupported

Incompatible versions of PyTorch and DeepSpeed can cause installation failures or runtime errors.

Solution

Verify that your installed versions of PyTorch and DeepSpeed are compatible with each other. You can check compatibility in the DeepSpeed documentation.

Alternatively, install a pair of versions that are known to work together, for example −

pip install deepspeed==0.5.5 torch==1.9.0

Check whether DeepSpeed and PyTorch are installed successfully −

import torch
import deepspeed

print("PyTorch version:", torch.__version__)
print("DeepSpeed version:", deepspeed.__version__)

Output

PyTorch version: 1.9.0
DeepSpeed version: 0.5.5
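
A version check like this can be turned into an automatic guard at the top of a training script. The sketch below compares version strings numerically using only the standard library. The helper names are hypothetical, and the minimum versions shown are placeholders rather than official requirements.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Extract the leading numeric release, e.g. '1.13.1+cu117' -> (1, 13, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if the installed release is at least the minimum release."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    # The minimums below are placeholders, not official requirements
    print(meets_minimum("1.9.0", "1.8"))         # prints True
    print(meets_minimum("1.13.1+cu117", "2.0"))  # prints False
```

In a real script, the installed strings would come from torch.__version__ and deepspeed.__version__.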

3. DeepSpeed Optimizer Initialization Error

This error occurs mainly when the optimizer fails to initialize while setting up DeepSpeed.

Solution

Make sure that the optimizer is properly initialized in your DeepSpeed configuration. For custom optimizers, make sure they satisfy the appropriate conditions from DeepSpeed.

import deepspeed
import torch
# Initialize optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Initialize DeepSpeed with model and optimizer
model, optimizer, _, _ = deepspeed.initialize(config=ds_config, model=model, optimizer=optimizer)

# Output statement
print("Optimizer initialized with DeepSpeed correctly without errors.")

Output

Optimizer initialized with DeepSpeed correctly without errors.

Performance Bottlenecks and How to Resolve Them

Although DeepSpeed greatly accelerates large-model training, performance bottlenecks can still arise for many reasons. Finding and fixing them is necessary to get the full benefit of DeepSpeed.

1. Data Loading Bottleneck

If your data pipeline cannot keep up with the model during training, GPUs will sit idle while they wait for data.

Solution

Use torch.utils.data.DataLoader with multiple worker processes (num_workers) so that data is loaded asynchronously, in parallel with training.

Example

from torch.utils.data import DataLoader
# Optimized DataLoader with multiple workers
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True, num_workers=4)

# Output statement
print("Data loading performance improved using multi-threading.")

Output

Data loading performance improved using multi-threading.
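
The idea behind asynchronous loading is that the next batches are prepared while the current one is being processed. The sketch below shows this with a background thread and a bounded queue from the standard library. It is a simplified stand-in for what DataLoader does with worker processes, and the prefetch helper is hypothetical.

```python
import queue
import threading
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")
_DONE = object()  # sentinel marking the end of the stream


def prefetch(batches: Iterable[T], buffer_size: int = 2) -> Iterator[T]:
    """Yield batches while a background thread loads the next ones."""
    buffer: "queue.Queue" = queue.Queue(maxsize=buffer_size)

    def producer() -> None:
        for batch in batches:
            buffer.put(batch)  # blocks when the buffer is full
        buffer.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item


if __name__ == "__main__":
    for batch in prefetch([i, i + 1] for i in range(0, 6, 2)):
        print(batch)  # the consumer would run a training step here
```

The bounded buffer keeps memory use predictable: the loader never runs more than buffer_size batches ahead of training.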

2. Communication Bottleneck in Distributed Training

With multiple GPUs, the communication overhead hinders distributed training.

Solution

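ZeRO stages 1 and 2 keep roughly the same communication volume as standard data parallelism, while stage 3 adds extra traffic for gathering parameters. If communication is the bottleneck, prefer stage 2 and overlap communication with computation.

Example

ds_config = {
    "zero_optimization": {
        "stage": 2,                    # Partition optimizer states and gradients
        "overlap_comm": True,          # Overlap gradient reduction with the backward pass
        "reduce_scatter": True,        # Use reduce-scatter instead of all-reduce for gradients
        "contiguous_gradients": True   # Copy gradients into a contiguous buffer as they are produced
    }
}

# Output statement
print("Using ZeRO Stage 2 with overlapped communication reduces communication overhead.")

Output

Using ZeRO Stage 2 with overlapped communication reduces communication overhead.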

3. Inefficient Mixed Precision Training

Mixed precision can bring huge speed-ups but isn't usually configured optimally.

Solution

Enable DeepSpeed's fp16 mode with dynamic loss scaling, which adjusts the loss scale automatically and avoids manual tuning.

Example

ds_config = {
    "fp16": {
        "enabled": True,   # Enable mixed precision
        "loss_scale": 0    # 0 selects dynamic loss scaling
    }
}

# Output statement
print("Fast training with auto-tuned mixed precision.")

Output

Fast training with auto-tuned mixed precision.

Community Support and Resources

DeepSpeed has an active community, and there are good resources available to help with issues or performance tuning.

1. GitHub Issues and Discussions

The Issues section of the DeepSpeed GitHub repository is an excellent place to start when debugging. You can search existing issues or open a new one if you have found a bug or error.

The Discussions section is a good place to ask questions or request help from the DeepSpeed community.

2. DeepSpeed Documentation

The official DeepSpeed documentation has detailed guides and FAQs. It should be your first point of reference for installation, usage, and optimization tips.

3. Community Forums and Stack Overflow

Stack Overflow has a growing number of DeepSpeed questions, with answers from AI/ML practitioners covering many troubleshooting areas.

Best practices for using DeepSpeed with Hugging Face Transformers and similar popular frameworks can be found on the Hugging Face community forums and the PyTorch forums.

Improving DeepSpeed Performance

Apply the following tips to improve DeepSpeed performance −

Use the ZeRO Optimization Stages

ZeRO by DeepSpeed has three stages of optimization that trade off memory usage against performance to different degrees. Try different stages based on the size of your model and the number of GPUs you have.

Use Mixed Precision

One of the simplest ways to increase training speed without compromising accuracy is to use mixed precision.

Example

ds_config = {
    "fp16": {
        "enabled": True  # Enable AMP for accelerated training
    }
}

Optimize Data Loading

The data loading pipeline should load data in parallel and use batches large enough to keep up with the model. Also optimize data preprocessing so that it doesn't slow down training.

Profile and Benchmark Often

Profiling your training pipeline helps you find bottlenecks early. Monitor performance metrics with profiling tools such as PyTorch's torch.profiler.

Example

import torch.profiler as profiler

# Profile training
with profiler.profile() as prof:
    trainer.train()

# Export profiling results
prof.export_chrome_trace("./trace.json")
print("Trace profiling saved to trace.json")

Output

Trace profiling saved to trace.json