
DeepSpeed - Mixed Precision Training
Mixed precision training is an approach that lets deep learning models train faster and more efficiently. It mixes 16-bit floating-point arithmetic with 32-bit floating-point arithmetic to strike a good balance between model accuracy and hardware efficiency. Microsoft's DeepSpeed library makes it easy to scale large models by reducing both memory consumption and computation time.
What is Mixed Precision Training?
Mixed precision training uses lower-precision arithmetic for most computations and reserves higher-precision FP32 for the operations where it is critical. The main goals are reduced computational cost, faster training, and lower memory usage.
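To make the idea concrete, here is a minimal, framework-agnostic sketch (not the DeepSpeed API) of the pattern mixed precision follows: the bulk of the math runs on a 16-bit working copy of the weights, while an FP32 "master" copy receives the updates. The tensor shapes and learning rate are arbitrary illustrations, and the 16-bit dtype is chosen to suit the available device.

```python
import torch

# Pick a device and a 16-bit dtype that works on it (assumption: FP16 on GPU, BF16 on CPU).
device = "cuda" if torch.cuda.is_available() else "cpu"
low = torch.float16 if device == "cuda" else torch.bfloat16

# FP32 "master" weights hold the authoritative values.
master_w = torch.randn(1024, 1024, dtype=torch.float32, device=device)
x = torch.randn(64, 1024, dtype=low, device=device)

for step in range(3):
    w16 = master_w.to(low).requires_grad_()   # low-precision working copy
    out = x @ w16.t()                         # bulk math runs in 16-bit
    loss = out.float().pow(2).mean()          # loss/reduction kept in FP32
    loss.backward()                           # gradient lands on the 16-bit copy
    with torch.no_grad():
        master_w -= 1e-3 * w16.grad.float()   # update applied to the FP32 master
```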
Floating-Point Formats
The following are floating-point formats −
- FP32 Single-Precision − A 32-bit floating point format commonly used in deep learning.
- FP16 Half-Precision − A 16-bit floating point format that computes much faster and uses half the memory of FP32, at the cost of a narrow numeric range.
- BF16 (BFloat16) − A 16-bit format that keeps the same exponent range as FP32 but with fewer mantissa bits, which makes training more numerically reliable than FP16.
Training with FP16/BF16 alongside FP32 drastically reduces training time, which is why the technique is standard for large-scale model training on GPUs and TPUs.
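A quick way to see the trade-off between these formats is to inspect their numeric limits with PyTorch's torch.finfo; the output shows that BF16 shares FP32's range while FP16 has a much smaller maximum value and tighter underflow limit.

```python
import torch

# Compare the numeric limits of the three formats discussed above.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} bits={info.bits:2d} max={info.max:.3e} "
          f"smallest_normal={info.tiny:.3e} eps={info.eps:.3e}")
```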
DeepSpeed FP16 and BF16
DeepSpeed natively supports both FP16 and BF16 mixed precision training modes, letting developers scale their deep learning models without sacrificing performance or accuracy. Here is how each mode is configured.
DeepSpeed FP16 Mixed Precision Training
Enabling FP16 only requires a small change to your DeepSpeed configuration: add an fp16 section. Here is an example configuration file that enables FP16 mixed precision training −
{ "train_batch_size": 64, "gradient_accumulation_steps": 4, "fp16": { "enabled": true, "loss_scale": 0, "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 } }
Memory Efficiency − Roughly halves the GPU memory footprint.
Training Speed − Accelerates training through 16-bit arithmetic.
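The same FP16 settings can also be passed to DeepSpeed directly as a Python dict instead of a JSON file. The sketch below assumes a placeholder YourModel class; deepspeed.initialize accepts the config either way.

```python
import deepspeed

# The same FP16 settings as the JSON above, expressed as a Python dict.
ds_config = {
    "train_batch_size": 64,
    "gradient_accumulation_steps": 4,
    "fp16": {
        "enabled": True,
        "loss_scale": 0,          # 0 = dynamic loss scaling
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
}

model = YourModel()  # placeholder for your torch.nn.Module

# deepspeed.initialize accepts the config as a dict as well as a file path.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```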
BF16 Mixed Precision Training
BF16 (BFloat16) is handy when FP16 precision makes your model numerically unstable. DeepSpeed supports BF16 natively on hardware that implements it, such as recent NVIDIA and AMD GPUs and Google TPUs. To train with DeepSpeed using BF16, update your configuration as follows −
{ "train_batch_size": 64, "gradient_accumulation_steps": 4, "bf16": { "enabled": true } }
Python Example (BF16)
The following example demonstrates BF16 mixed precision training with DeepSpeed −
```python
import deepspeed

def initialize_engine(model, optimizer, config):
    """Wrap the model and optimizer in a DeepSpeed engine."""
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        optimizer=optimizer,
        config=config,
    )
    return model_engine, optimizer

# Sample model and optimizer (placeholders for your own)
model = YourModel()
optimizer = YourOptimizer(model.parameters())

# Path to the DeepSpeed config that enables BF16 (shown above)
ds_config = "deepspeed_bf16_config.json"

# Initialize DeepSpeed with BF16 enabled
model_engine, optimizer = initialize_engine(model, optimizer, ds_config)

# Training loop
for inputs, targets in dataloader:
    inputs = inputs.to(model_engine.device)
    targets = targets.to(model_engine.device)
    outputs = model_engine(inputs)
    loss = criterion(outputs, targets)
    model_engine.backward(loss)   # DeepSpeed manages the BF16 gradients
    model_engine.step()
```
Stable Training − BF16's wider exponent range keeps training stable, especially for large models.
Efficient Training − Memory and computation savings are close to those of FP16.
Advantages of Mixed Precision Training
The following are key advantages of mixed precision training −
- Less Memory Usage − Performing most calculations in 16-bit precision halves the memory needed compared to 32-bit precision, allowing larger models or bigger batch sizes without additional hardware; a rough calculation follows this list.
- Speedup − Hardware accelerators such as GPUs and TPUs execute low-precision computations significantly faster than standard 32-bit floating-point operations, which yields a large speedup for big models.
- No Loss of Accuracy − Mixed precision keeps the computations that are most sensitive to accuracy, for instance gradient accumulation and weight updates, in 32-bit precision, so model accuracy is preserved even though lower precision is used elsewhere.
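The memory saving in the first point is easy to estimate. The back-of-the-envelope sketch below uses a hypothetical 1-billion-parameter model and only counts the parameter tensors themselves; optimizer states and activations add more on top.

```python
# Rough memory needed just for the parameters of a 1-billion-parameter model.
params = 1_000_000_000

fp32_bytes = params * 4   # 4 bytes per FP32 value
fp16_bytes = params * 2   # 2 bytes per FP16/BF16 value

print(f"FP32 parameters:      {fp32_bytes / 1e9:.1f} GB")
print(f"FP16/BF16 parameters: {fp16_bytes / 1e9:.1f} GB")
```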
Challenges of Mixed Precision Training
The following are some challenges in mixed precision training −
- Numerical Stability − Training at lower precision can lose numerical stability, especially with FP16, whose narrow range can cause gradient underflow or overflow and hurt convergence during optimization; the snippet after this list shows the effect.
- Loss of Precision − Some models lose quality when run in mixed precision and need careful choices about which operations stay at higher precision.
- Hardware Compatibility − Not all hardware supports mixed precision training, so before training with this strategy make sure your hardware supports FP16 or BF16, for example NVIDIA GPUs with Tensor Cores or Google TPUs.
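The underflow/overflow issue mentioned above is easy to reproduce directly, since FP16's representable range is so much narrower than BF16's:

```python
import torch

# FP16 has a narrow range: small values underflow to zero and large values
# overflow to infinity. BF16 keeps the FP32 exponent range, so neither happens here.
small, large = 1e-8, 1e5

print(torch.tensor(small, dtype=torch.float16))   # 0.0  -> underflow
print(torch.tensor(large, dtype=torch.float16))   # inf  -> overflow
print(torch.tensor(small, dtype=torch.bfloat16))  # ~1e-8, representable
print(torch.tensor(large, dtype=torch.bfloat16))  # ~1e5, representable
```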
Best Practices of Mixed Precision Training
Here are some best practices to effectively implement mixed precision training −
1. Suitable Hardware
Mixed precision delivers its full benefit only on hardware optimized for FP16 or BF16 computation, such as NVIDIA Tensor Cores or Google TPUs. A quick way to check your GPU is shown below.
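A short PyTorch check like the following can confirm what the local GPU supports before you enable fp16 or bf16 in the DeepSpeed config (the compute-capability threshold of 7 for Tensor Core FP16 corresponds to Volta and newer):

```python
import torch

# Quick capability check before enabling FP16/BF16 in the DeepSpeed config.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"GPU: {torch.cuda.get_device_name(0)} (compute capability {major}.{minor})")
    print("Tensor Core FP16 support:", major >= 7)                 # Volta and newer
    print("Native BF16 support:", torch.cuda.is_bf16_supported())  # Ampere and newer
else:
    print("No CUDA device found; mixed precision benefits will be limited.")
```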
2. Automatic Mixed Precision (AMP)
Automatic Mixed Precision Libraries − Both DeepSpeed and PyTorch support mixed precision training with minimal code changes. Simply enable AMP and the framework automatically switches between FP16/FP32 (or BF16/FP32) on your behalf −
```python
import torch
from torch.cuda.amp import autocast, GradScaler

# GradScaler handles dynamic loss scaling for FP16
scaler = GradScaler()

for inputs, targets in dataloader:
    optimizer.zero_grad()

    # Run the forward pass in mixed precision
    with autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    # Scale the loss, backpropagate, then step through the scaler
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```
Stable and Efficient Training − AMP ensures each operation runs at an appropriate precision, preventing problems such as gradient underflow.
3. Loss Scaling Tracking and Stabilization
DeepSpeed and PyTorch provide automatic loss scaling: the scale factor is adjusted during training to avoid numerical instability. In the DeepSpeed config, setting loss_scale to 0 enables dynamic loss scaling −
{ "fp16": { "enabled": true, "loss_scale": 0, // Automatic loss scaling "loss_scale_window": 1000, "hysteresis": 2, "min_loss_scale": 1 } }
More Accurate Model − Loss scaling prevents gradients from underflowing to zero, so the model converges in a stable manner.
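For intuition, dynamic loss scaling follows a simple pattern, sketched manually below; DeepSpeed and GradScaler do all of this for you, and model, optimizer, criterion, and dataloader are placeholders here. The loss is multiplied by a scale factor before backward so small FP16 gradients do not underflow, the gradients are unscaled before the update, and the scale is reduced whenever an overflow is detected.

```python
import torch

scale = 2.0 ** 16   # initial loss scale; DeepSpeed tunes this dynamically

for inputs, targets in dataloader:          # placeholder dataloader
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    (loss * scale).backward()               # amplify so tiny FP16 grads survive

    grads_finite = all(
        torch.isfinite(p.grad).all()
        for p in model.parameters() if p.grad is not None
    )
    if grads_finite:
        for p in model.parameters():        # unscale before the real update
            if p.grad is not None:
                p.grad.div_(scale)
        optimizer.step()
    else:
        scale /= 2.0                        # overflow detected: lower the scale, skip step
```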
4. Profiling for Memory and Speed
Profile your models to measure the memory savings and speedup gained from mixed precision training. Tools such as PyTorch's torch.profiler can monitor these metrics −
```python
import torch

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    record_shapes=True,
    profile_memory=True,
) as profiler:
    for step, (inputs, targets) in enumerate(dataloader):
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        profiler.step()

print(profiler.key_averages().table(sort_by="cuda_time_total"))
```
Optimized Memory and Speed − Profiling helps confirm that the expected benefits of mixed precision training are actually realized.
Summing Up
Mixed precision training with DeepSpeed accelerates model training, conserves memory, and preserves accuracy. By taking advantage of formats like FP16 and BF16, you can process massive models and datasets at a significantly lower computational cost. Following the best practices around AMP, proper loss scaling, and hardware compatibility will help you tap into the full power of mixed precision. As models continue to grow, mixed precision training will remain an essential tool for scaling them.