Getting Started with DeepSpeed



Deep learning models are becoming increasingly complex, and the computational cost of training them keeps rising. DeepSpeed, developed by Microsoft, makes it possible to train large-scale models efficiently on minimal resources. This chapter takes you through the basic steps to get up and running with DeepSpeed, from installation and environment setup to running your first model.

Installing DeepSpeed

Before digging into the details of DeepSpeed, the first thing we need to do is install the library. With pip, this is simple to accomplish −

pip install deepspeed

During installation, you should see output something like the following −

Collecting deepspeed
  Downloading deepspeed-0.6.0-py3-none-any.whl (696 kB)
     |████████████████████████████████| 696 kB 3.2 MB/s
Collecting torch
  Downloading torch-1.9.1-cp38-cp38-manylinux1_x86_64.whl (804.1 MB)
     |████████████████████████████████|
Successfully installed deepspeed-0.6.0 torch-1.9.1

You can also clone the GitHub repository and install from source, if you prefer −

git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
pip install .

This will give you the latest features, which might not yet be available in the stable release.
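
Whichever method you use, you can verify the installation with the ds_report utility that ships with DeepSpeed. It prints the detected PyTorch and CUDA versions and shows which of DeepSpeed's optional extension ops are compatible with your environment −

ds_report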

Environment Setup

After installing DeepSpeed, you need to set up your environment. First, make sure that all the required dependencies are present.

Create a virtual environment for managing the dependencies −

python -m venv deepspeed-env
source deepspeed-env/bin/activate  # On Windows, use 'deepspeed-env\Scripts\activate'

Install PyTorch if you haven't already −

pip install torch torchvision torchaudio

Depending on your use case, you might also need CUDA or another form of GPU acceleration. If you are on a machine with GPUs, installing the CUDA build of PyTorch is as simple as running the following in your terminal −

pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

This ensures that DeepSpeed can take full advantage of your machine's hardware capabilities.
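
As a quick sanity check (an optional step, not part of DeepSpeed itself), you can confirm from Python that PyTorch actually sees your GPU before moving on −

import torch

# Check whether the CUDA build of PyTorch can see a GPU
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device count:", torch.cuda.device_count())
    print("Device name:", torch.cuda.get_device_name(0))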

Basic Concepts and Terminology

Before running your first model, let's cover some basic concepts and terminology that you will encounter quite frequently in DeepSpeed.

  • Optimizer − DeepSpeed supports multiple optimizers for training large models; the optimizer applies the gradient updates to the model weights during training.
  • Scheduler − Schedulers adjust the learning rate during training. DeepSpeed integrates with the standard PyTorch schedulers and also provides additional schedulers designed for large models, such as WarmupLR.
  • Zero Redundancy Optimizer (ZeRO) − It is a memory optimization technique that reduces the memory footprint of large models by partitioning the model states across many GPUs.
  • Gradient Accumulation − This enables effective batch sizes larger than GPU memory allows by summing gradients over multiple iterations before the model weights are updated (see the configuration sketch after this list).
  • Checkpoint Activations − Also called activation checkpointing, this trades some extra computation for memory by recomputing forward-pass activations during back-propagation.
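
To make these terms concrete, here is a sketch of a DeepSpeed configuration that touches most of them. This is an illustrative example only − the values (batch sizes, learning rate, warmup steps) are placeholders rather than recommendations −

# Illustrative DeepSpeed configuration; all values are placeholders
example_ds_config = {
   "train_batch_size": 64,              # effective batch size per optimizer step
   "gradient_accumulation_steps": 4,    # sum gradients over 4 micro-batches
   "optimizer": {
      "type": "Adam",
      "params": { "lr": 0.001 }
   },
   "scheduler": {
      "type": "WarmupLR",
      "params": { "warmup_num_steps": 100 }
   },
   "fp16": { "enabled": True },         # mixed precision training
   "zero_optimization": { "stage": 1 }  # partition optimizer states across GPUs
}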

Understanding these concepts should give you enough context to work through most of DeepSpeed's advanced features and to customize your training pipeline.

Running Your First Model With DeepSpeed

Now that your environment is set up and you are familiar with basic terminology, let's run a simple DeepSpeed model. We will first create a basic PyTorch model and then add DeepSpeed to it to see performance gains.

Step 1: Create a Simple PyTorch Model

import torch
import torch.nn as nn
import torch.optim as optim

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(10, 50)   # input layer (10) -> hidden layer (50)
        self.fc2 = nn.Linear(50, 1)    # hidden layer (50) -> output layer (1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))    # hidden layer activation function
        x = self.fc2(x)
        return x

model = SimpleModel()
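
Before adding DeepSpeed, you can optionally run a dummy batch through the plain PyTorch model as a quick sanity check (this step is not required) −

# Pass a batch of 4 random samples through the untrained model
sample = torch.randn(4, 10)
print(model(sample).shape)   # expected: torch.Size([4, 1])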

Step 2: Implement DeepSpeed

Now, let's refactor the code so it works with DeepSpeed. We will initialize the model with DeepSpeed and some basic configuration.

import deepspeed

ds_config = {
   "train_batch_size": 32,
   "optimizer": {
      "type": "Adam",
      "params": { "lr": 0.001 }
   },
   "fp16": {
      "enabled": True
   },
   "zero_optimization": {
      "stage": 1   # ZeRO partitions optimizer states, so an optimizer must be defined
   }
}

model_engine, optimizer, _, _ = deepspeed.initialize(
   model=model,
   model_parameters=model.parameters(),
   config=ds_config
)

Output

If all goes well, DeepSpeed will initialize and print out the configuration settings −

[INFO] DeepSpeed info: version=0.6.0, git-hash=unknown, git-branch=unknown
[INFO] Initializing model parallel group with size 1
[INFO] Initializing optimizer with DeepSpeed Zero Optimizer
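
Note that DeepSpeed scripts are usually started through the deepspeed launcher rather than plain python, because the launcher sets up the distributed environment that the engine expects. Assuming the code above is saved in a file named train.py (a file name chosen here purely for illustration), the launch looks something like this −

deepspeed --num_gpus=1 train.py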

Step 3: Train Model

At this point, you can train your model using DeepSpeed. Below is an example training loop.

for epoch in range(5):
    # With fp16 enabled, inputs should live on the engine's device in half precision
    inputs = torch.randn(32, 10).to(model_engine.device).half()
    labels = torch.randn(32, 1).to(model_engine.device).half()

    model_engine.train()
    outputs = model_engine(inputs)
    loss = nn.MSELoss()(outputs, labels)

    model_engine.backward(loss)
    model_engine.step()
    print(f'Epoch {epoch+1}, Loss: {loss.item()}')

Output

Each epoch should print a result something like this −

Epoch 1, Loss: 0.4857
Epoch 2, Loss: 0.3598
Epoch 3, Loss: 0.2893
Epoch 4, Loss: 0.2194
Epoch 5, Loss: 0.1745

Step 4: Save the Model

Finally, you can save the model trained so far −

model_engine.save_checkpoint('./checkpoint', tag='epoch_5')

Output

[INFO] Saving model checkpoint to ./checkpoint
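
To resume training later, the checkpoint can be loaded back into an initialized engine. Below is a minimal sketch, assuming the same model and configuration as above −

# Restore the model (and optimizer) state from the saved checkpoint
load_path, client_state = model_engine.load_checkpoint('./checkpoint', tag='epoch_5')
print("Loaded checkpoint from:", load_path)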

Advanced Capabilities of DeepSpeed

Now that you have a basic view of what DeepSpeed is, let's look at some of its advanced capabilities. These features are designed to handle the complexity of training large models, reduce memory consumption, and improve computational efficiency.

  • Mixed Precision Training (FP16) − One reason DeepSpeed trains models quickly is its support for mixed precision training, which performs much of the computation in half precision.
  • ZeRO Optimization Stages − DeepSpeed's ZeRO technique reduces memory usage by partitioning model states across multiple GPUs; its stages progressively partition optimizer states, gradients, and finally the parameters themselves.
  • Gradient Accumulation − Another strategy that DeepSpeed supports is gradient accumulation, which can simulate larger batch sizes without requiring more GPU memory.
  • Offloading − Even ZeRO Stage 3 may be insufficient for very large models; in that case, ZeRO's offloading options can move optimizer states and parameters from GPU memory to CPU memory, trading some speed for capacity (see the configuration sketch after this list).
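
As an illustration of how these options fit together, here is a configuration sketch that enables FP16, gradient accumulation, and ZeRO Stage 3 with CPU offloading. The values are placeholders, and whether offloading actually helps depends on your hardware −

# Illustrative config: FP16, gradient accumulation, ZeRO Stage 3 with CPU offload
advanced_ds_config = {
   "train_batch_size": 64,
   "gradient_accumulation_steps": 8,
   "fp16": { "enabled": True },
   "zero_optimization": {
      "stage": 3,
      "offload_optimizer": { "device": "cpu" },   # keep optimizer states in CPU RAM
      "offload_param": { "device": "cpu" }        # keep parameters in CPU RAM
   }
}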

Summing Up

The major steps in getting started with DeepSpeed are installing the library, setting up your environment, learning a few basic concepts, and running your first model. DeepSpeed lets you train large models more efficiently, with lower memory usage and shorter overall training time. With these basics in place, you can move on to DeepSpeed's advanced features to drive your deep learning projects.
