
DeepSpeed - Optimizer
Optimization and scheduling are central to achieving good performance when training large-scale deep learning models. DeepSpeed is an open-source deep learning optimization library that makes model training more efficient through techniques such as memory optimization, gradient accumulation, and mixed-precision training.
Two of its key components are the DeepSpeed Optimizer and the DeepSpeed Scheduler. Together they manage system resources efficiently, accelerate training, and reduce the memory footprint, making it possible to train models with billions of parameters on modest hardware.
Let's look in detail at how the DeepSpeed Optimizer works, with code examples showing how it is used. The DeepSpeed Scheduler is covered in the following chapter.
What is DeepSpeed Optimizer?
The DeepSpeed Optimizer manages model optimization by distributing memory efficiently. It integrates natively with popular deep learning frameworks such as PyTorch and handles optimizer states such as momentum buffers and accumulated gradients. Its main features include the Zero Redundancy Optimizer (ZeRO), mixed-precision training, and gradient checkpointing.
Key Features of DeepSpeed Optimizer
The following are key features of DeepSpeed Optimizer −
1. Zero Redundancy Optimizer (ZeRO)
ZeRO reduces memory consumption by partitioning optimizer states, gradients, and model parameters across multiple devices. This makes it possible to train very large models on memory-limited hardware.
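As a minimal sketch (not a complete training configuration), the ZeRO stage is selected through the zero_optimization block of the DeepSpeed config. The option names below are standard DeepSpeed config keys; the surrounding values are illustrative only.

# Minimal sketch of a DeepSpeed config dict that selects a ZeRO stage.
# Stage 1 partitions optimizer states, stage 2 also partitions gradients,
# and stage 3 additionally partitions the model parameters across devices.
zero_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 2,                     # 1, 2, or 3 depending on memory needs
        "overlap_comm": True,           # overlap communication with computation
        "contiguous_gradients": True    # reduce memory fragmentation
    }
}

Higher stages save more memory but add communication overhead, so the stage is usually raised only when the model no longer fits at a lower stage.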
2. Mixed Precision Training
Mixed-precision training uses both 16-bit and 32-bit floating-point representations, reducing memory consumption and speeding up computation with little to no loss in model accuracy.
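As an illustrative sketch, mixed precision is enabled through the fp16 block of the DeepSpeed config (or bf16 on hardware that supports it); the batch size here is a placeholder value.

# Minimal sketch: enabling FP16 mixed precision in the DeepSpeed config.
# loss_scale = 0 selects dynamic loss scaling, which helps avoid
# underflow in the 16-bit gradients.
fp16_config = {
    "train_batch_size": 8,
    "fp16": {
        "enabled": True,
        "loss_scale": 0
    }
}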
3. Gradient Checkpointing
Gradient (activation) checkpointing divides the model into segments and stores only a subset of activations during the forward pass; the remaining intermediate values are recomputed during the backward pass, trading extra computation for lower memory use.
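DeepSpeed ships its own activation-checkpointing API, but the idea is easiest to see with PyTorch's built-in torch.utils.checkpoint utility; the model below is a made-up toy example, not part of DeepSpeed.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy example of activation (gradient) checkpointing: only the inputs of the
# checkpointed block are stored, and its intermediate activations are
# recomputed during the backward pass.
class CheckpointedBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 10))

    def forward(self, x):
        # use_reentrant=False is the recommended mode in recent PyTorch releases
        return checkpoint(self.block, x, use_reentrant=False)

x = torch.randn(4, 10, requires_grad=True)
out = CheckpointedBlock()(x)
out.mean().backward()   # activations inside the block are recomputed here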
Example of Using DeepSpeed Optimizer
Following is a PyTorch-based example using DeepSpeed Optimizer with ZeRO −
import deepspeed
import torch
import torch.nn as nn

# Sample model definition
class SampleModel(nn.Module):
    def __init__(self):
        super(SampleModel, self).__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return self.fc(x)

# Initialize the model
model = SampleModel()

# DeepSpeed configuration: Adam optimizer with ZeRO stage 1
ds_config = {
    "train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001
        }
    },
    "zero_optimization": {
        "stage": 1
    }
}

# Initialize DeepSpeed; the optimizer is built from the config,
# so we pass the model parameters instead of a separate optimizer object
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)
print("DeepSpeed is initialized")

# Sample input and forward pass
inputs = torch.randn(8, 10).to(model_engine.device)
print("Input tensor:", inputs.shape)
outputs = model_engine(inputs)
print("Forward pass completed")
loss = outputs.mean()
print(f"Loss: {loss.item():.3f}")

# Backward pass and optimization step
model_engine.backward(loss)
model_engine.step()
print("Backward pass and optimizer step complete")
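In practice, DeepSpeed scripts are usually started with the deepspeed launcher rather than plain python so that the distributed environment is set up automatically, for example deepspeed --num_gpus=1 train.py, where train.py is a placeholder name for a file containing the snippet above.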
Output
When the script is executed, for example from a terminal in an IDE such as PyCharm or VSCode, the output looks similar to the following −
DeepSpeed is initialized
Input tensor: torch.Size([8, 10])
Forward pass completed
Loss: -0.015
Backward pass and optimizer step complete
The terminal output above confirms that DeepSpeed was initialized with the configured optimizer and that the forward pass, backward pass, and optimizer step completed successfully. The exact loss value will vary from run to run because both the model weights and the inputs are randomly initialized.
With these examples as a starting point, applying the DeepSpeed Optimizer in your own deep learning workflow should be much easier.