
Getting Started with DeepSpeed
Deep learning models are becoming increasingly complex, and the computational cost of training them keeps rising. DeepSpeed, developed by Microsoft, makes it possible to train large-scale models efficiently on limited resources. This chapter walks you through the basic steps to get up and running with DeepSpeed, from installation and environment setup to running your first model.
Installing DeepSpeed
Before digging into the details of DeepSpeed, the first thing we need to do is install the library. Using pip, this is simple to accomplish −
pip install deepspeed
While installing, you may see output similar to the following −
Collecting deepspeed
  Downloading deepspeed-0.6.0-py3-none-any.whl (696 kB)
     || 696 kB 3.2 MB/s
Collecting torch
  Downloading torch-1.9.1-cp38-cp38-manylinux1_x86_64.whl (804.1 MB)
     ||
Successfully installed deepspeed-0.6.0 torch-1.9.1
You can also clone the GitHub repository and install from source if you prefer −
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed
pip install .
This will give you the latest features, which might not yet be available in the stable release.
Environment Setup
After installing DeepSpeed, you need to set up the environment. First, make sure that all the required dependencies are present.
Create a virtual environment for managing the dependencies −
python -m venv deepspeed-env
source deepspeed-env/bin/activate   # On Windows, use 'deepspeed-env\Scripts\activate'
Install PyTorch if you haven't already −
pip install torch torchvision torchaudio
Further, depending on your use case, you may need CUDA or another form of GPU acceleration. If you are on a machine with GPUs, installing the CUDA build of PyTorch is as simple as running the following in your terminal −
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
This ensures that DeepSpeed can take full advantage of your machine's hardware.
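To confirm that the installation works and that DeepSpeed can see your hardware, a quick sanity check like the one below can help. It is a minimal sketch that only assumes deepspeed and torch are importable −

import torch
import deepspeed

# Print the installed versions and whether a CUDA device is visible.
print("DeepSpeed version:", deepspeed.__version__)
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

DeepSpeed also installs a ds_report command-line utility that summarizes your environment and which of its optional extensions can be built on it.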
Basic Concepts and Terminology
Before running your first model, let's cover some basic concepts and terminology that you will encounter quite frequently in DeepSpeed.
- Optimizer − DeepSpeed currently supports multiple optimizers that can be used to optimize the training of large models. The optimizer handles the gradient update while training the model.
- Scheduler − Schedulers update the learning rate during training. DeepSpeed integrates with the standard PyTorch schedulers and additionally provides custom schedulers designed for large models.
- Zero Redundancy Optimizer (ZeRO) − It is a memory optimization technique that reduces the memory footprint of large models by partitioning the model states across many GPUs.
- Gradient Accumulation − This allows effective batch sizes larger than GPU memory would otherwise allow, by summing gradients over multiple iterations before updating the model weights.
- Activation Checkpointing − This saves memory at the cost of additional computation by recomputing forward-pass activations during back-propagation. (These options are illustrated in the configuration sketch after this list.)
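Most of these features are enabled through DeepSpeed's JSON configuration rather than code changes. The snippet below is a minimal sketch of how the concepts above map onto common configuration keys; the values are illustrative, and exact option names can vary between DeepSpeed versions −

ds_config = {
    "train_batch_size": 64,                 # global batch size
    "gradient_accumulation_steps": 4,       # sum gradients over 4 micro-batches
    "optimizer": {
        "type": "Adam",                     # optimizer managed by DeepSpeed
        "params": {"lr": 0.001}
    },
    "scheduler": {
        "type": "WarmupLR",                 # learning rate scheduler
        "params": {"warmup_min_lr": 0, "warmup_max_lr": 0.001, "warmup_num_steps": 100}
    },
    "zero_optimization": {
        "stage": 1                          # ZeRO stage 1 partitions optimizer states
    },
    "fp16": {"enabled": True},              # mixed precision training
    "activation_checkpointing": {
        "partition_activations": True       # also requires checkpointing calls in model code
    }
}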
Understanding these concepts should give you enough context to work through most of DeepSpeed's advanced features and customize your training pipeline.
Running Your First Model With DeepSpeed
Now that your environment is set up and you are familiar with the basic terminology, let's run a simple model with DeepSpeed. We will first create a basic PyTorch model and then add DeepSpeed to it to see the performance gains.
Step 1: Create a Simple PyTorch Model
import torch
import torch.nn as nn
import torch.optim as optim

class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(10, 50)   # input layer (10) -> hidden layer (50)
        self.fc2 = nn.Linear(50, 1)    # hidden layer (50) -> output layer (1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))    # hidden layer activation function
        x = self.fc2(x)
        return x

model = SimpleModel()
Step 2: Implement DeepSpeed
Now, let's refactor the code so it works with DeepSpeed. We will initialize the model with DeepSpeed and some basic configuration.
import deepspeed

ds_config = {
    "train_batch_size": 32,
    "fp16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 1
    }
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)
Output
If all goes well, DeepSpeed will initialize and print out the configuration settings −
[INFO] DeepSpeed info: version=0.6.0, git-hash=unknown, git-branch=unknown
[INFO] Initializing model parallel group with size 1
[INFO] Initializing optimizer with DeepSpeed Zero Optimizer
Step 3: Train the Model
At this point, you can train your model using DeepSpeed. Below is an example training loop.
for epoch in range(5):
    # Dummy data; move it to the engine's device so it matches the model.
    # (With fp16 enabled, the inputs may also need to be cast to half precision.)
    inputs = torch.randn(32, 10).to(model_engine.device)
    labels = torch.randn(32, 1).to(model_engine.device)

    model_engine.train()
    outputs = model_engine(inputs)
    loss = nn.MSELoss()(outputs, labels)

    model_engine.backward(loss)
    model_engine.step()

    print(f'Epoch {epoch+1}, Loss: {loss.item()}')
Output
Each epoch will print a result similar to this:
Epoch 1, Loss: 0.4857
Epoch 2, Loss: 0.3598
Epoch 3, Loss: 0.2893
Epoch 4, Loss: 0.2194
Epoch 5, Loss: 0.1745
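The loop above runs as an ordinary Python script on a single device. For multi-GPU training, DeepSpeed ships a command-line launcher; assuming the code is saved in a file named train.py (a hypothetical name), a typical invocation looks like this −

deepspeed --num_gpus=2 train.py

The launcher starts one process per GPU and sets up the distributed environment for you.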
Step 4: Save the Model
Finally, you can save the model trained so far −
# save_checkpoint takes a checkpoint directory and an optional tag for this checkpoint
model_engine.save_checkpoint('./checkpoint', tag='epoch_5')
Output
[INFO] Saving model checkpoint to ./checkpoint
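To resume training later, the engine exposes a matching load_checkpoint method. Below is a minimal sketch that assumes the same directory and tag used when saving −

# Returns the path that was loaded and any client state stored with the checkpoint.
load_path, client_state = model_engine.load_checkpoint('./checkpoint', tag='epoch_5')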
Advanced Capabilities of DeepSpeed
Now that you have a basic view of what DeepSpeed is, let's look at some of its advanced capabilities. These features are designed to handle the complexity of training large models, reducing memory consumption and improving computational efficiency.
- Mixed Precision Training (FP16) − One of the reasons DeepSpeed trains models quickly is its support for mixed precision training, which performs much of the computation in half precision.
- ZeRO Optimization Stages − DeepSpeed's ZeRO technique reduces memory usage by partitioning model states (optimizer states, gradients, and parameters) across multiple GPUs, with three stages that partition progressively more of this state.
- Gradient Accumulation − Another strategy that DeepSpeed supports is gradient accumulation, which can simulate larger batch sizes without requiring more GPU memory.
- Offloading − For very large models, even the optimizations provided by ZeRO Stage 3 may be insufficient, so DeepSpeed can additionally offload optimizer states and parameters to CPU memory or NVMe storage. (A configuration sketch follows this list.)
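As with the basic settings, these capabilities are switched on through the configuration. The sketch below combines ZeRO Stage 3 with optimizer-state offloading to CPU memory; it is only an illustration, and option names may differ slightly across DeepSpeed versions −

ds_config_advanced = {
    "train_batch_size": 64,
    "gradient_accumulation_steps": 8,           # simulate a larger batch size
    "fp16": {"enabled": True},                  # mixed precision training
    "zero_optimization": {
        "stage": 3,                             # partition optimizer states, gradients, and parameters
        "offload_optimizer": {"device": "cpu"}  # move optimizer states to CPU memory
    }
}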
Summing Up
The major steps in getting started with DeepSpeed are installing the library, setting up your environment, learning a few basic concepts, and running your first model. DeepSpeed allows large models to be trained far more efficiently, with lower memory usage and shorter overall training time. This introductory chapter prepares you to explore DeepSpeed's advanced features and apply them to your deep learning projects.