
DeepSpeed - Troubleshooting and Common Issues
DeepSpeed is a powerful tool for scaling and optimizing deep learning models, but like any powerful technology, it can occasionally run into problems. Knowing how to diagnose common errors, resolve performance bottlenecks, and draw on community support makes the experience much smoother.
DeepSpeed offers many tuning parameters, mixed precision, and ZeRO stages that can deliver enormous performance gains. The guidelines in this chapter will make troubleshooting and optimizing DeepSpeed much easier and help you exploit its full capabilities for AI development.
Diagnosing and Fixing Common DeepSpeed Errors
When working with DeepSpeed, errors may arise from hardware configuration, installation issues, or incorrect use of DeepSpeed features. Some of the most common ones are listed below −
1. CUDA Out of Memory
This mostly occurs while training large models whose memory footprint exceeds the available GPU memory.
Solution
- Scale down the batch size or use mixed-precision training to reduce the memory requirement.
- Enable one of DeepSpeed's ZeRO optimization stages to partition optimizer states, gradients, and parameters.
Here is an example of using ZeRO Stage 2 to avoid out-of-memory (OOM) errors −
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Load model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# DeepSpeed config with ZeRO optimization
ds_config = {
    "train_batch_size": 16,
    "fp16": {
        "enabled": True
    },
    "zero_optimization": {
        "stage": 2  # For memory efficiency, use ZeRO Stage 2.
    }
}

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',              # Output directory
    evaluation_strategy="steps",         # Evaluation strategy to adopt during training
    eval_steps=500,                      # Number of steps between evaluations
    save_steps=10_000,                   # Save model every 10,000 steps
    per_device_train_batch_size=16,      # Batch size per device during training
    per_device_eval_batch_size=16,       # Batch size for evaluation
    logging_dir='./logs',                # Directory for storing logs
    logging_steps=100,                   # Log every 100 steps
    deepspeed=ds_config                  # Integrate DeepSpeed through the Trainer
)

# Initialize Trainer with DeepSpeed
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,         # Ensure train_dataset is defined
    eval_dataset=eval_dataset            # Ensure eval_dataset is defined
)

# Start training
trainer.train()

# Output statement
print("Training was successful, with memory usage reduced by the use of ZeRO Stage 2.")
Output
Training was successful, with memory usage reduced by the use of ZeRO Stage 2.
2. DeepSpeed Version Unsupported
For some users, unsupported versions of PyTorch or DeepSpeed may cause problems.
Solution
Verify that the versions of PyTorch and DeepSpeed you have installed are compatible with each other. You can check the compatibility requirements in the DeepSpeed documentation.
Alternatively, use the following code −
pip install deepspeed==0.5.5 torch==1.9.0
Check whether DeepSpeed and PyTorch are installed successfully −
import torch
import deepspeed

print("PyTorch version:", torch.__version__)
print("DeepSpeed version:", deepspeed.__version__)
Output
PyTorch version: 1.9.0
DeepSpeed version: 0.5.5
3. DeepSpeed Optimizer Initialization Error
This error occurs primarily when the optimizer fails to initialize correctly while DeepSpeed is in use.
Solution
Make sure that the optimizer is properly initialized and passed to DeepSpeed. For custom optimizers, make sure they meet DeepSpeed's requirements.
import torch
import deepspeed

# Initialize optimizer (model and ds_config as defined in the earlier example)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Initialize DeepSpeed with model and optimizer
model, optimizer, _, _ = deepspeed.initialize(
    config=ds_config,
    model=model,
    optimizer=optimizer
)

# Output statement
print("Optimizer initialized with DeepSpeed correctly without errors.")
Output
Optimizer initialized with DeepSpeed correctly without errors.
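Alternatively, the optimizer can be declared inside the DeepSpeed config and built by DeepSpeed itself. The snippet below is a minimal sketch of that approach; the learning-rate and weight-decay values are placeholders, and model is assumed to be defined as in the earlier example.
import deepspeed

# Declare the optimizer in the config instead of constructing it manually
ds_config = {
    "train_batch_size": 16,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 5e-5,           # Placeholder learning rate
            "weight_decay": 0.01  # Placeholder weight decay
        }
    }
}

# DeepSpeed builds the optimizer; pass the model parameters explicitly
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)
Defining the optimizer in one place helps avoid mismatches between a client-side optimizer and the rest of the DeepSpeed settings.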
Performance Bottlenecks and How to Resolve Them
Although DeepSpeed greatly accelerates large-model training, performance bottlenecks can still occur for a variety of reasons. Identifying and correcting them is essential to fully reap the benefits of DeepSpeed.
1. Data Loading Bottleneck
If your data pipeline cannot keep up with your model during training, your GPUs are likely to sit idle waiting for data.
Solution
Use torch.utils.data.DataLoader with multiple worker processes (num_workers) so that batches are loaded asynchronously alongside training.
Example
from torch.utils.data import DataLoader

# Optimized DataLoader with multiple workers
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True, num_workers=4)

# Output statement
print("Data loading performance improved using multiple worker processes.")
Output
Data loading performance improved using multiple worker processes.
2. Communication Bottleneck in Distributed Training
When training on multiple GPUs, communication overhead can slow down distributed training.
Solution
Reduce communication between GPUs by using DeepSpeed's ZeRO Stage 2 or 3, which partition optimizer states and gradients (and, in Stage 3, parameters) across devices.
Example
ds_config = {
    "zero_optimization": {
        "stage": 3,  # Use ZeRO Stage 3 for the least communication overhead
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"}
    }
}

# Output statement
print("Using ZeRO Stage 3 reduces communication overhead.")
Output
Using ZeRO Stage 3 reduces communication overhead.
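For context, here is a minimal sketch (not part of the original example) of how such a config is typically consumed: it is extended with a batch size and optimizer section, passed to deepspeed.initialize, and the returned engine drives the training loop. The model and train_loader names are placeholders, and the script would normally be launched with the deepspeed launcher.
import torch
import deepspeed

# Extend the Stage 3 config above so deepspeed.initialize can build everything itself
ds_config.update({
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "AdamW", "params": {"lr": 5e-5}}
})

# model and train_loader are assumed to be defined elsewhere
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)

for inputs, labels in train_loader:
    inputs = inputs.to(model_engine.device)
    labels = labels.to(model_engine.device)

    loss = torch.nn.functional.cross_entropy(model_engine(inputs), labels)
    model_engine.backward(loss)  # DeepSpeed coordinates the partitioned gradients
    model_engine.step()          # Optimizer step on the partitioned optimizer states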
3. Inefficient Mixed Precision Training
Mixed precision can bring huge speed-ups, but it is often not configured optimally.
Solution
Use DeepSpeed's built-in mixed-precision (FP16) support, which handles loss scaling automatically without the overhead of manual tuning.
Example
ds_config = {
    "fp16": {
        "enabled": True  # Enable mixed precision
    }
}

# Output statement
print("Fast training with auto-tuned mixed precision.")
Output
Fast training with auto-tuned mixed precision.
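If you do need finer control, DeepSpeed's fp16 section also accepts dynamic loss-scaling options. The values below are illustrative defaults rather than tuned recommendations −
ds_config = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 enables dynamic loss scaling
        "initial_scale_power": 16,  # Initial scale = 2**16
        "loss_scale_window": 1000,  # Steps between scale adjustments
        "hysteresis": 2,            # Overflows tolerated before lowering the scale
        "min_loss_scale": 1         # Lower bound for the dynamic scale
    }
}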
Community Support and Resources
DeepSpeed has an incredible, lively community, and there are some really good resources available to help out with issues or performance enhancements.
1. GitHub Issues and Discussions
The Issues section of the DeepSpeed GitHub repository is an excellent place to start debugging. You can browse existing issues or open a new one if you have found a bug or error.
The Discussions section is also a great place to ask questions and request help from the DeepSpeed community.
2. DeepSpeed Documentation
The official DeepSpeed documentation has detailed guides and FAQs, making it the first point of reference for installation, usage, and optimization tips.
3. Community Forums and Stack Overflow
Stack Overflow has a growing number of DeepSpeed questions, with answers from AI/ML practitioners covering a wide range of troubleshooting topics.
Best practices for using DeepSpeed with Hugging Face Transformers and other popular frameworks can also be found on community forums such as the Hugging Face and PyTorch forums.
Improving DeepSpeed Performance
Apply the following tips to further improve DeepSpeed performance −
Use ZeRO Optimization Stages
DeepSpeed's ZeRO has three optimization stages that trade off memory usage and performance to different degrees. Experiment with different stages based on your model size and the number of GPUs, as sketched below.
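As a rough illustration (the stage choice and batch size here are placeholders, not recommendations), the stage is selected with a single field in the config −
def make_zero_config(stage):
    # Build a DeepSpeed config for the requested ZeRO stage (1, 2, or 3)
    return {
        "train_micro_batch_size_per_gpu": 8,
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": stage  # 1: optimizer states; 2: + gradients; 3: + parameters
        }
    }

# Try increasing stages until the model fits in GPU memory
ds_config = make_zero_config(stage=2)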
Enable Mixed Precision Training
One of the simplest methods to increase training speed without compromising accuracy is to use mixed precision.
Example
ds_config = {
    "fp16": {
        "enabled": True  # Enable AMP for accelerated training
    }
}
Optimize Data Loading
The data loading pipeline should use multiple workers and sufficiently large batches to keep up with the model. Also optimize data preprocessing so that it doesn't slow down training; a sketch follows below.
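Here is a minimal sketch of an optimized loader; the dataset, batch size, and worker count are placeholders to tune for your hardware −
from torch.utils.data import DataLoader

# train_dataset is assumed to be defined elsewhere
train_loader = DataLoader(
    train_dataset,
    batch_size=32,            # Larger batches help keep the GPUs busy
    shuffle=True,
    num_workers=8,            # Parallel worker processes for loading/preprocessing
    pin_memory=True,          # Faster host-to-GPU transfers
    persistent_workers=True,  # Keep workers alive between epochs
    prefetch_factor=2         # Batches prefetched per worker
)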
Profile and Benchmark Often
Profiling your training pipeline really helps you find bottlenecks early. Monitor performance metrics with profiling tools like PyTorch's torch.profiler.
Example
import torch.profiler as profiler

# Profile training
with profiler.profile() as prof:
    trainer.train()

# Export profiling results
prof.export_chrome_trace("./trace.json")

print("Trace profiling saved to trace.json")
Output
Trace profiling saved to trace.json