
Model Training with DeepSpeed
Deep learning models have grown large and complex, making training increasingly difficult to carry out efficiently. That is where DeepSpeed, Microsoft's deep learning optimization library, comes in. The library was designed for training large models and offers a collection of features aimed at memory optimization, computational efficiency, and overall training performance. By the end of this chapter you will be able to train models with DeepSpeed, understand the configuration files that enable its optimization features, and follow examples of training popular models with this powerful tool.
Deep Learning Model Training with DeepSpeed
Training deep learning models is a compute-intensive task, especially with large datasets and complex architectures. DeepSpeed is built for this challenge: it combines mixed precision training, ZeRO (Zero Redundancy Optimizer), and gradient accumulation in one framework, so you can scale up model training efficiently without scaling compute resources at the same rate.
We will start by adding DeepSpeed to a simple model training pipeline.
Step 1: Model and Dataset
Assume a simple PyTorch model solving a regression problem −
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# A simple regression model
class RegressionModel(nn.Module):
    def __init__(self):
        super(RegressionModel, self).__init__()
        self.fc1 = nn.Linear(10, 50)
        self.fc2 = nn.Linear(50, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Generating synthetic data
inputs = torch.randn(1000, 10)
targets = torch.randn(1000, 1)
dataset = TensorDataset(inputs, targets)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

model = RegressionModel()
Step 2: Add DeepSpeed
The next step is to add a DeepSpeed configuration file that enables the training optimizations.
DeepSpeed Configuration Files
DeepSpeed configuration files are JSON files that specify the parameters used to optimize model training. An example is shown below −
{ "train_batch_size": 32, "fp16": { "enabled": true }, "zero_optimization": { "stage": 1, "allgather_partitions": true, "reduce_scatter": true, "allgather_bucket_size": 2e8, "overlap_comm": true }, "optimizer": { "type": "Adam", "params": { "lr": 0.001, "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": 3e-7 } } }
Save the preceding text to a file in your project folder called ds_config.json.
Step 3: DeepSpeed Initialization
This is where things get interesting. With the configuration file in place, you can initialize DeepSpeed in your training script as follows −
import deepspeed

# Initialize DeepSpeed
ds_config_path = "ds_config.json"
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config_path
)
Output
Running the above code initializes DeepSpeed with the specified configuration and prints output similar to the following −
[INFO] DeepSpeed info: version=0.6.0, git-hash=unknown, git-branch=unknown
[INFO] Initializing model parallel group with size 1
[INFO] Initialize optimizer with DeepSpeed Zero Optimizer
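Once the engine is initialized, training runs through model_engine instead of a plain PyTorch loop. The following is a minimal sketch that reuses the model, dataloader, and ds_config.json defined above; the epoch count is arbitrary, and because the config enables fp16, the batches are cast to half precision on the engine's device.

import torch.nn as nn

criterion = nn.MSELoss()

for epoch in range(5):
    for batch_inputs, batch_targets in dataloader:
        # Move data to the device chosen by DeepSpeed and match fp16 precision
        batch_inputs = batch_inputs.to(model_engine.device).half()
        batch_targets = batch_targets.to(model_engine.device).half()

        outputs = model_engine(batch_inputs)
        loss = criterion(outputs, batch_targets)

        # DeepSpeed handles loss scaling and the optimizer update internally
        model_engine.backward(loss)
        model_engine.step()

    print(f"Epoch {epoch + 1}, Loss: {loss.item()}")

Note that model_engine.backward() and model_engine.step() replace the usual loss.backward() and optimizer.step() calls, since DeepSpeed manages gradient scaling and the optimizer on your behalf.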
Optimizing Training with DeepSpeed's Features
DeepSpeed comes with a set of features that can optimize model training. Some of the key features are discussed here.
- Mixed Precision Training − Trains models with 16-bit floating-point representation, which reduces memory use and speeds up computation.
- ZeRO Optimization − The Zero Redundancy Optimizer (ZeRO) substantially reduces the memory footprint of large models by partitioning model states (optimizer states, gradients, and parameters) across the participating GPUs. The stage parameter in the zero_optimization section controls how aggressively this partitioning is applied.
- Gradient Accumulation − Increases the effective batch size without a proportional increase in GPU memory. Enable it by setting gradient_accumulation_steps in the config file.
- Activation Checkpointing − Trades computation for memory by recomputing some activations in the backward pass instead of storing them, which reduces overall memory consumption during training.
These features can be combined in whatever way best suits your requirements; the configuration sketch below shows one possible combination.
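The following configuration is a minimal sketch combining several of these features for a single GPU, so train_batch_size equals the micro batch size times the accumulation steps. The values are illustrative rather than recommendations, and the activation_checkpointing section only takes effect when the model's forward pass uses DeepSpeed's checkpointing API.

{
  "train_batch_size": 32,
  "train_micro_batch_size_per_gpu": 8,
  "gradient_accumulation_steps": 4,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": false
  }
}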
Example of Training BERT Model Using DeepSpeed
To demonstrate the power of DeepSpeed, let us train a well-known model, BERT (Bidirectional Encoder Representations from Transformers).
Step 1: Prepare and Load the BERT Model
You can easily load a pre-trained BERT model with the Hugging Face Transformers library −
import torch
import torch.nn as nn
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Sample data
inputs = tokenizer("DeepSpeed makes BERT training efficient!", return_tensors="pt")
labels = torch.tensor([1])

# A minimal "dataloader": a list with a single (inputs, labels) pair
dataloader = [(inputs, labels)]
Step 2: Add DeepSpeed Integration
As before, we add DeepSpeed by initializing it with the model and the config file −
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json"
)
Step 3: Run Model
Then train the model as follows −
for epoch in range(3):
    for batch in dataloader:
        inputs, labels = batch
        # Move tensors to the DeepSpeed engine's device
        inputs = {k: v.to(model_engine.device) for k, v in inputs.items()}
        labels = labels.to(model_engine.device)

        outputs = model_engine(**inputs)
        loss = nn.CrossEntropyLoss()(outputs.logits, labels)

        model_engine.backward(loss)
        model_engine.step()

    print(f"Epoch {epoch + 1}, Loss: {loss.item()}")
Output
Training BERT with DeepSpeed prints the loss for every epoch, confirming that the model is training −
Epoch 1, Loss: 0.6785
Epoch 2, Loss: 0.5432
Epoch 3, Loss: 0.4218
Handling Large Datasets with DeepSpeed
Large datasets pose problems that go well beyond model architecture. Managing memory and compute efficiently while processing large volumes of data is what keeps bottlenecks away. DeepSpeed tackles these challenges through its data-handling features.
1. Dynamic Data Loading
DeepSpeed loads data dynamically, keeping in memory only the batches currently being used during training. This cuts the memory footprint and lets you train on larger datasets without needing more powerful hardware. Keeping memory usage low also reduces the time spent on data input/output, which speeds up training overall.
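One convenient way to let DeepSpeed manage batch loading is to hand it the dataset directly: deepspeed.initialize accepts a training_data argument and returns its own data loader, which fetches batches step by step. The sketch below reuses the dataset from the regression example and is an illustration, not the only way to feed data to DeepSpeed.

import deepspeed

# Let DeepSpeed build the data loader from the dataset itself
model_engine, optimizer, train_loader, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    training_data=dataset,          # the TensorDataset from the regression example
    config="ds_config.json"
)

for batch_inputs, batch_targets in train_loader:
    ...  # training step via model_engine, as shown earlier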
2. Data Parallelism
Another important capability of DeepSpeed is data parallelism. It natively distributes data across multiple GPUs so that different batches are processed at the same time, which speeds up training and keeps the GPUs busy. In practice, applying data parallelism with DeepSpeed is painless because it integrates with PyTorch's DataLoader; see the sketch after this paragraph.
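No special code is needed for data parallelism: the same training script is started once per GPU by the deepspeed launcher, and DeepSpeed sets up the distributed backend during initialization. A minimal sketch, assuming the script above is saved as train.py and four local GPUs are available:

import deepspeed
import torch.distributed as dist

# Launched with, e.g., `deepspeed --num_gpus=4 train.py`, this script runs
# once per GPU; deepspeed.initialize sets up torch.distributed if needed.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json"
)

if dist.is_initialized():
    print(f"Rank {dist.get_rank()} of {dist.get_world_size()} processes")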
3. Memory-Efficient Data Shuffling
Large datasets normally require shuffling so that the model does not learn patterns from the order in which the data happens to be stored, but shuffling an entire large dataset can be extremely memory-consuming. DeepSpeed optimizes this process with memory-efficient shuffling that avoids a large memory increase, so training on large datasets stays smooth and efficient.
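In practice, a common memory-efficient pattern used alongside DeepSpeed is PyTorch's DistributedSampler, which shuffles only the indices (not the data itself) and gives each rank its own shard. The sketch below assumes the distributed backend has already been initialized, for example by deepspeed.initialize under the deepspeed launcher.

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Shuffle indices, not the data itself; each rank sees its own shard
sampler = DistributedSampler(dataset, shuffle=True)
dataloader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(5):
    sampler.set_epoch(epoch)   # reshuffle differently each epoch
    for batch_inputs, batch_targets in dataloader:
        ...  # training step via model_engine, as before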
4. Data Augmentation Support
Data augmentation artificially increases the size of a dataset by modifying existing data. DeepSpeed supports on-the-fly augmentation, so augmented samples are generated during training rather than stored in memory. This reduces memory pressure further and lets you apply augmentation techniques more extensively.
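On-the-fly augmentation simply means applying the transform inside the dataset's __getitem__, so only the current batch is ever augmented in memory. The sketch below is a generic PyTorch pattern (the noise-based "augmentation" is purely illustrative) that reuses the inputs and targets tensors from the regression example and works unchanged when the dataset is handed to DeepSpeed.

import torch
from torch.utils.data import Dataset

class AugmentedRegressionDataset(Dataset):
    """Applies a random perturbation to each sample when it is fetched."""
    def __init__(self, inputs, targets, noise_std=0.05):
        self.inputs = inputs
        self.targets = targets
        self.noise_std = noise_std

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        x = self.inputs[idx]
        # Augment on the fly: the augmented sample is never stored
        x = x + torch.randn_like(x) * self.noise_std
        return x, self.targets[idx]

augmented_dataset = AugmentedRegressionDataset(inputs, targets)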
5. Batch Size Scaling
DeepSpeed's gradient accumulation and ZeRO optimization allow batch sizes to scale even when working with enormous datasets. Larger batch sizes can improve model convergence and training stability, and DeepSpeed manages the GPU memory required, so your model can train on big datasets effectively; see the configuration sketch that follows.
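In the DeepSpeed config, the effective batch size follows the relation train_batch_size = train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs. The sketch below assumes 4 GPUs and uses illustrative values (16 × 8 × 4 = 512).

{
  "train_micro_batch_size_per_gpu": 16,
  "gradient_accumulation_steps": 8,
  "train_batch_size": 512,
  "zero_optimization": {
    "stage": 2
  },
  "fp16": {
    "enabled": true
  }
}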
Together, these features let DeepSpeed manage large datasets, making it possible to design and train high-performance models without being held back by hardware limits. Whether you are training on a very large text corpus or processing high-resolution images, DeepSpeed's data handling keeps your training pipeline optimized and scalable.
Summing Up
DeepSpeed provides an effective training framework for deep learning models, especially as they scale in size and complexity. Learning to use its advanced features, such as mixed precision training, ZeRO optimization, and activation checkpointing, is how you get the most out of the training process. This chapter covered model training with DeepSpeed: preparing the environment, writing the configuration, and running the training process. With these tools and techniques in hand, you can now take on large-scale deep-learning projects with better performance and lower resource consumption.