
Distributed Training with DeepSpeed
As model and dataset sizes grow, single-GPU training becomes inefficient or outright impossible. Distributed training lets models scale from a single GPU to many GPUs and nodes, and DeepSpeed by Microsoft is one of the best-equipped frameworks for optimizing it. It handles very large models and reduces memory overhead through techniques such as data parallelism, model parallelism, and the Zero Redundancy Optimizer (ZeRO).
Basic Distributed Training
Distributed training splits the work of training a machine learning model across many computing resources, such as GPUs or cluster nodes. The central challenge is scaling up data and computation while keeping the training of large models simple and efficient.
Why Distributed Training?
The following are key reasons to consider distributed training when working on large deep learning models −
- Scalability − Training very large models with tens of millions, or even billions, of parameters is prohibitively hard on a single GPU. Distributed training scales the process across many GPUs.
- Faster Convergence − Spreading the training workload across several GPUs shortens the time needed to reach convergence, which means faster model development.
- Resource Efficiency − Distributed training makes the most of the hardware you already have, saving both time and money.
Distributed training generally takes one of the following forms −
- Data Parallelism − The model is replicated on every GPU, and each GPU processes a different batch of the dataset.
- Model Parallelism − The model itself is split across multiple GPUs, with each GPU computing a part of the model's operations.
- Hybrid Parallelism − A mix of data and model parallelism: the data is split across GPUs and the model is also partitioned.
Data Parallelism
DeepSpeed facilitates distributed training by offering flexible data and model parallelism. Let's explore these in depth.
With data parallelism, each GPU or worker receives a portion of each batch of data and computes gradients on it; the gradients are then averaged across workers to update the shared model weights. This allows training with larger effective batch sizes without running out of memory.
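Conceptually, the "averaging" step is an all-reduce over gradients. The snippet below is a minimal PyTorch-only sketch of that idea, not DeepSpeed's actual implementation (which fuses and overlaps this communication); it assumes the distributed process group has already been initialized.
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module):
    """Average gradients across all data-parallel workers (illustrative only)."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum the gradient tensors contributed by every worker...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ...and divide by the number of workers to get the average.
            param.grad /= world_size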
Example of Data Parallelism With DeepSpeed
The following is a simple Python example to show data parallelism with DeepSpeed −
import torch
import deepspeed

# Define a simple neural network model
class SimpleModel(torch.nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = torch.nn.Linear(784, 128)
        self.fc2 = torch.nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

# DeepSpeed configuration
deepspeed_config = {
    "train_batch_size": 64,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001
        }
    }
}

# Initialize model
model = SimpleModel()

# Initialize DeepSpeed for distributed data parallelism
model_engine, optimizer, _, _ = deepspeed.initialize(
    config=deepspeed_config,
    model=model,
    model_parameters=model.parameters()
)

# Dummy data, moved to this worker's device
inputs = torch.randn(64, 784).to(model_engine.device)
labels = torch.randint(0, 10, (64,)).to(model_engine.device)

# Forward pass
outputs = model_engine(inputs)
loss = torch.nn.functional.cross_entropy(outputs, labels)

# Backward pass and optimization step
model_engine.backward(loss)
model_engine.step()
When launched across several GPUs (see the command below), the network trains in data-parallel fashion, with each GPU handling a portion of each batch.
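To actually spread the work across GPUs, the script is started through the deepspeed launcher rather than plain python. Assuming the code above is saved as train.py, a single-node launch on four GPUs might look like this −
deepspeed --num_gpus 4 train.py
The launcher starts one process per GPU and sets up the distributed environment that deepspeed.initialize() picks up.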
Model Parallelism
Model parallelism deals with splitting the model across multiple GPUs. This becomes helpful when a single model does not fit into the memory of a single GPU.
Model Parallelism With DeepSpeed
DeepSpeed supports this through pipeline parallelism: the model is split into stages across multiple GPUs, and different parts of the model execute on different GPUs concurrently as data flows through the pipeline.
Example of Model Parallelism With DeepSpeed
The following is a simple Python program to show the working of model parallelism using DeepSpeed −
import torch
import deepspeed
from deepspeed.pipe import PipelineModule, LayerSpec

# A simple layer that will serve as one pipeline stage
class SimpleLayer(torch.nn.Module):
    def __init__(self, input_size, output_size):
        super(SimpleLayer, self).__init__()
        self.fc = torch.nn.Linear(input_size, output_size)

    def forward(self, x):
        return torch.relu(self.fc(x))

# The process group must exist before the pipeline is built
deepspeed.init_distributed()

# Two layers split into a two-stage pipeline (one stage per GPU)
layers = [
    LayerSpec(SimpleLayer, 784, 128),
    LayerSpec(SimpleLayer, 128, 10)
]

# Create the pipeline model with 2 stages; the loss function lets
# the engine compute gradients inside train_batch()
pipeline_model = PipelineModule(
    layers=layers,
    num_stages=2,
    loss_fn=torch.nn.CrossEntropyLoss()
)

# Batch sizes must satisfy:
# train_batch_size = micro_batch_size * gradient_accumulation_steps * data_parallel_size
deepspeed_config = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 16,
    "gradient_accumulation_steps": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 0.001}}
}

# Initialize DeepSpeed for pipeline (model) parallelism
model_engine, optimizer, _, _ = deepspeed.initialize(
    config=deepspeed_config,
    model=pipeline_model,
    model_parameters=pipeline_model.parameters()
)

# Dummy micro-batches of (inputs, labels)
inputs = torch.randn(16, 784)
labels = torch.randint(0, 10, (16,))
data_iter = iter([(inputs, labels)] * 4)

# One training step; the engine schedules forward, backward, and
# optimizer step across both pipeline stages
loss = model_engine.train_batch(data_iter)
Here model_engine.train_batch() runs the forward and backward passes across the two GPUs in stages: the first GPU executes the 784→128 layer, the second GPU executes the 128→10 layer, and activations (and gradients) flow between them.
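Because the pipeline has two stages, the script must be launched on two GPUs (or a multiple of two). With the code above saved as pipeline_train.py (a placeholder filename), the launch would look like −
deepspeed --num_gpus 2 pipeline_train.py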
Zero Redundancy Optimizer (ZeRO)
Perhaps the most salient feature of DeepSpeed is the Zero Redundancy Optimizer (ZeRO), designed to tackle the memory consumption problem of model training. Instead of replicating the full training state on every GPU, ZeRO partitions the optimizer states, gradients, and parameters across GPUs, allowing much more efficient use of memory.
ZeRO includes three stages, each building on the previous one −
- Stage 1 − Partitions the optimizer states.
- Stage 2 − Additionally partitions the gradients.
- Stage 3 − Additionally partitions the model parameters.
Example of Zero Redundancy Optimizer
Following is a simple example of the Zero Redundancy Optimizer in Python −
import torch
import deepspeed

# DeepSpeed settings with ZeRO optimization enabled
deepspeed_config = {
    "train_batch_size": 64,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001
        }
    },
    "zero_optimization": {
        "stage": 2  # ZeRO Stage 2: partition optimizer states and gradients
    }
}

# Initialize model (SimpleModel from the data parallelism example)
model = SimpleModel()

# Initialize DeepSpeed with ZeRO optimization
model_engine, optimizer, _, _ = deepspeed.initialize(
    config=deepspeed_config,
    model=model,
    model_parameters=model.parameters()
)

# Dummy data on this worker's device
inputs = torch.randn(64, 784).to(model_engine.device)
labels = torch.randint(0, 10, (64,)).to(model_engine.device)

# Forward pass
outputs = model_engine(inputs)
loss = torch.nn.functional.cross_entropy(outputs, labels)

# Backward pass and optimization step
model_engine.backward(loss)
model_engine.step()
This code runs with ZeRO Stage 2, which partitions the optimizer states and gradients across GPUs and reduces memory consumption during training.
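For even larger models, the same config block can be taken further. The sketch below is an illustrative configuration (not tied to a particular model): it enables ZeRO Stage 3, which also partitions the model parameters, and optionally offloads optimizer states and parameters to CPU memory −
deepspeed_config = {
    "train_batch_size": 64,
    "optimizer": {"type": "Adam", "params": {"lr": 0.001}},
    "zero_optimization": {
        "stage": 3,                              # also partition model parameters
        "offload_optimizer": {"device": "cpu"},  # keep optimizer states in CPU RAM
        "offload_param": {"device": "cpu"}       # keep partitioned parameters in CPU RAM
    }
}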
Scaling Models Across Multiple GPUs and Nodes
DeepSpeed scales models across multiple GPUs and nodes by combining these parallelism strategies with its highly optimized communication layer.
Scaling Example with Multiple Nodes
DeepSpeed uses the NCCL backend for inter-GPU communication and scales training to multiple GPUs and nodes. To run on multiple GPUs and nodes, launch the training script with the deepspeed launcher:
deepspeed --num_nodes 2 --num_gpus 8 train.py
This launches training on 2 nodes with 8 GPUs per node, i.e. 16 GPUs in total (--num_gpus specifies the number of GPUs per node).
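For multi-node launches, the deepspeed launcher also needs to know which machines to use. This is typically done with a hostfile that lists each node and its available GPU slots (the hostnames below are placeholders) −
# hostfile
worker-1 slots=8
worker-2 slots=8
deepspeed --hostfile=hostfile --num_nodes 2 --num_gpus 8 train.py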
Example of Training on Multiple GPUs with DeepSpeed
The following example demonstrates how to train on multiple GPUs using DeepSpeed −
import torch
import deepspeed

# model, deepspeed_config, and train_loader are assumed to be defined
# as in the earlier examples
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config=deepspeed_config,
    model_parameters=model.parameters()
)

# Only print from the first process to avoid duplicated output
if torch.distributed.get_rank() == 0:
    print("Training on multiple GPUs with DeepSpeed")

# Training loop
for batch in train_loader:
    inputs, labels = batch
    inputs = inputs.to(model_engine.device)
    labels = labels.to(model_engine.device)

    outputs = model_engine(inputs)
    loss = torch.nn.functional.cross_entropy(outputs, labels)

    model_engine.backward(loss)
    model_engine.step()
This code uses DeepSpeed to train the model across multiple GPUs in a memory-efficient way, employing techniques such as ZeRO for optimization.
Summing Up
DeepSpeed is a powerful framework for scaling and optimizing distributed training of deep learning models. By combining data parallelism and model parallelism with ZeRO, it scales efficiently to multiple GPUs and nodes and addresses the main challenges of training very large models. At the same time, its features keep distributed training accessible and performant as models continue to grow.