DeepSpeed - Mixed Precision Training



Mixed precision training is an approach in deep learning that makes model training faster and more memory efficient. It combines 16-bit floating-point arithmetic for most operations with 32-bit arithmetic where needed, striking a balance between model accuracy and hardware efficiency. The DeepSpeed library from Microsoft makes it easy to scale large models by reducing the memory and compute time required for training.

What is Mixed Precision Training?

Mixed precision training uses lower-precision arithmetic for most computations and reserves higher precision (FP32) for the operations that are numerically critical. The main goals are reduced computational cost, faster training, and lower memory usage.

Floating-Point Formats

The following are floating-point formats −

  • FP32 (Single Precision) − The standard 32-bit floating-point format used in deep learning.
  • FP16 (Half Precision) − A 16-bit floating-point format that is computationally much faster and uses half the memory of FP32, but has a narrower dynamic range.
  • BF16 (BFloat16) − A 16-bit format with the same exponent range as FP32, which makes training more numerically stable than FP16.

Training with FP16/BF16 alongside FP32 drastically reduces training time and memory usage, which matters most for large-scale model training on GPUs and TPUs.
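To see the trade-offs concretely, the small PyTorch sketch below prints the range and precision of each format using torch.finfo:

import torch

# Compare the numeric range and precision of the three formats.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} bits={info.bits:2d}  max={info.max:.3e}  "
          f"smallest normal={info.tiny:.3e}  eps={info.eps:.3e}")

The output shows why BF16 is attractive: it keeps FP32's huge exponent range (large max, tiny smallest normal) while using only 16 bits, at the cost of a coarser eps (less precision per value) than FP16.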

DeepSpeed FP16 and BF16

DeepSpeed natively supports both FP16 and BF16 mixed precision training modes, allowing developers to scale deep learning models without sacrificing performance or accuracy. Here is how each mode is configured.

DeepSpeed FP16 Mixed Precision Training

All you need to do is add an fp16 section to your DeepSpeed configuration. Here is an example configuration file that enables FP16 mixed precision training (setting loss_scale to 0 selects dynamic loss scaling):

{
   "train_batch_size": 64,
   "gradient_accumulation_steps": 4,
   "fp16": {
      "enabled": true,
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "hysteresis": 2,
      "min_loss_scale": 1
   }
}

Memory Efficiency − FP16 roughly halves the GPU memory needed for activations and gradients.

Training Speed − 16-bit arithmetic runs markedly faster on GPUs with Tensor Core support.
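Assuming the JSON above is saved as ds_fp16_config.json (a filename chosen here purely for illustration), passing that path to deepspeed.initialize is enough to enable FP16 training; the rest of your training loop stays unchanged. A minimal sketch:

import deepspeed
import torch

model = YourModel()                                # your model, as in the example below
optimizer = torch.optim.Adam(model.parameters())   # any PyTorch optimizer

# deepspeed.initialize reads the "fp16" section of the config and returns an
# engine whose forward/backward passes run in FP16 with automatic loss scaling.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config="ds_fp16_config.json"   # hypothetical path to the JSON shown above
)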

BF16 Mixed Precision Training

BF16, or BFloat16, is handy when FP16 precision makes your model unstable. DeepSpeed supports BF16 on hardware that implements it natively, such as NVIDIA Ampere-or-newer GPUs, recent AMD GPUs, and Google TPUs. To train with DeepSpeed using BF16, update your configuration as follows:

{
   "train_batch_size": 64,
   "gradient_accumulation_steps": 4,
   "bf16": {
      "enabled": true
   }
}

Python Example (BF16)

The following example demonstrates BF16 mixed precision training −

import deepspeed

def initialize_engine(model, optimizer, config):
    # deepspeed.initialize wraps the model and optimizer according to the config
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        optimizer=optimizer,
        config=config
    )
    return model_engine, optimizer

# Sample model and optimizer
model = YourModel()  # Use your model
optimizer = YourOptimizer(model.parameters())

# Load the DeepSpeed config for BF16
ds_config = "deepspeed_bf16_config.json"  # path to your DeepSpeed config

# Initialize the DeepSpeed engine with BF16 enabled
model_engine, optimizer = initialize_engine(model, optimizer, ds_config)

# Train your model
for inputs, targets in dataloader:
    outputs = model_engine(inputs)
    loss = criterion(outputs, targets)
    model_engine.backward(loss)   # DeepSpeed scales and casts gradients as needed
    model_engine.step()           # optimizer step plus gradient zeroing

Stable Training − BF16 ensures stability, especially for large models.

Efficient Training − Memory and computation efficiency are close to FP16.

Advantages of Mixed Precision Training

The following are key advantages of mixed precision training −

  • Less Memory Usage − Using 16-bit precision for most calculations takes half the memory of 32-bit precision, which allows training larger models or bigger batch sizes without additional hardware.
  • Speedup − Hardware accelerators such as GPUs and TPUs evaluate low-precision computations considerably faster than standard 32-bit floating point, which adds up to a large speedup for big models.
  • No Loss of Accuracy − Mixed precision keeps the computations most sensitive to precision, such as gradient accumulation and weight updates, in 32-bit, so model accuracy is preserved even though 16-bit is used almost everywhere else (see the sketch after this list).
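The last point is what the "FP32 master weights" technique implements. The toy sketch below (simplified and hypothetical; DeepSpeed and AMP handle this for you) shows the idea: compute in FP16, but apply the weight update to an FP32 copy so that tiny updates are not rounded away.

import torch

# Toy illustration of FP32 master weights (not DeepSpeed's actual code).
master_weights = torch.randn(1000, dtype=torch.float32)  # FP32 master copy
lr = 1e-4

for step in range(10):
    weights_fp16 = master_weights.half()          # cast down for the fast FP16 forward/backward
    grads_fp16 = torch.randn_like(weights_fp16)   # stand-in for the FP16 gradients
    # The update itself happens in FP32: updates of size ~1e-4 would be rounded
    # away if added directly to FP16 weights (FP16 eps is about 1e-3).
    master_weights -= lr * grads_fp16.float()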

Challenges of Mixed Precision Training

The following are some challenges in mixed precision training −

  • Numerical Stability − Lower precision, especially FP16, can cause gradients to underflow or overflow, resulting in poor convergence during optimization (see the short illustration after this list).
  • Loss of Precision − Some models lose accuracy when run in mixed precision and require care about which operations remain in higher precision.
  • Hardware Compatibility − Not all hardware supports mixed precision training. Before adopting a mixed precision strategy, make sure your hardware supports FP16 or BF16, for example NVIDIA GPUs with Tensor Cores or Google's TPUs.
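The underflow problem from the first bullet is easy to reproduce: values smaller than FP16's smallest positive number simply become zero, while BF16 and FP32, with their larger exponent range, keep them. A quick PyTorch illustration:

import torch

small = 1e-8  # below FP16's smallest positive value (about 6e-8)

print(torch.tensor(small, dtype=torch.float16))   # underflows to 0 in FP16
print(torch.tensor(small, dtype=torch.bfloat16))  # close to 1e-8 (BF16 shares FP32's exponent range)
print(torch.tensor(small, dtype=torch.float32))   # 1e-8 as expected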

Best Practices of Mixed Precision Training

Here are some best practices to effectively implement mixed precision training −

1. Suitable Hardware

Mixed precision only delivers its full benefit on hardware optimized for FP16 or BF16 computation, such as NVIDIA's Tensor Cores or Google's TPUs.
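If you are unsure what your GPU offers, a short PyTorch probe (a sketch; the exact requirements depend on your GPU generation) can report its compute capability and whether BF16 is supported:

import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print("GPU:", torch.cuda.get_device_name())
    print("Compute capability:", f"{major}.{minor}")
    # FP16 Tensor Cores arrived with compute capability 7.0 (Volta);
    # native BF16 support arrived with 8.0 (Ampere).
    print("FP16 Tensor Cores:", major >= 7)
    print("BF16 supported:", torch.cuda.is_bf16_supported())
else:
    print("No CUDA device available - mixed precision will not help here.")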

2. Automatic Mixed Precision (AMP)

Automatic Mixed Precision Libraries − DeepSpeed and PyTorch support mixed precision training with minimal code changes. Just enable AMP, and the framework automatically switches between FP16/FP32 (or BF16/FP32) precision on your behalf.

import torch
from torch.cuda.amp import autocast, GradScaler

# GradScaler dynamically scales the loss to prevent FP16 gradient underflow
scaler = GradScaler()

for data, target in dataloader:
    optimizer.zero_grad()
    # Run the forward pass in mixed precision (FP16 where it is safe)
    with autocast():
        outputs = model(data)
        loss = criterion(outputs, target)

    # Scale the loss, backpropagate, then unscale and step the optimizer
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Stable and Efficient Training − AMP runs each operation in an appropriate precision (FP16 or FP32), while the GradScaler prevents gradient underflow.
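On BF16-capable hardware, the same loop can be written with torch.autocast and dtype=torch.bfloat16; because BF16 shares FP32's exponent range, the GradScaler can usually be dropped. A sketch (plain PyTorch, not DeepSpeed-specific):

import torch

for data, target in dataloader:
    optimizer.zero_grad()
    # BF16 autocast: no GradScaler needed because BF16 rarely underflows
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        outputs = model(data)
        loss = criterion(outputs, target)
    loss.backward()
    optimizer.step()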

3. Loss Scaling Tracking and Stabilization

DeepSpeed and PyTorch provide automatic (dynamic) loss scaling: the scale factor is adjusted during training to avoid numerical instability. In the DeepSpeed config, setting loss_scale to 0 enables this dynamic mode:

{
   "fp16": {
      "enabled": true,
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "hysteresis": 2,
      "min_loss_scale": 1
   }
}

More Accurate Model − Loss scaling keeps small gradients from underflowing to zero in FP16, so the model converges in a stable manner.
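To build intuition for these parameters, here is a deliberately simplified, hypothetical scaler (not DeepSpeed's actual implementation, and the hysteresis behaviour is omitted): the scale is halved when gradients overflow, doubled after loss_scale_window clean steps, and never falls below min_loss_scale.

class ToyLossScaler:
    """Simplified dynamic loss scaler - for intuition only, not DeepSpeed's code."""
    def __init__(self, init_scale=2**16, scale_window=1000, min_scale=1):
        self.scale = init_scale
        self.scale_window = scale_window
        self.min_scale = min_scale
        self.good_steps = 0

    def update(self, overflow: bool):
        if overflow:
            # Gradients hit inf/NaN: skip the step and back off the scale
            self.scale = max(self.scale / 2, self.min_scale)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps % self.scale_window == 0:
                # A long run of clean steps: it is safe to scale more aggressively
                self.scale *= 2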

4. Profiling for Memory and Speed

Profile your models to measure the memory savings and speedup gained from mixed precision training. Tools such as PyTorch's torch.profiler can monitor per-operator execution time and memory usage −

import torch

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
) as profiler:
    for step, (batch, target) in enumerate(dataloader):
        optimizer.zero_grad()
        outputs = model(batch)
        loss = criterion(outputs, target)
        loss.backward()
        optimizer.step()
        profiler.step()   # mark the end of each profiled training step

# Summarize operator-level CUDA time and memory usage
print(profiler.key_averages().table(sort_by="cuda_time_total"))

Optimized Memory and Speed − Profiling confirms that the expected benefits of mixed precision training are actually realized.

Summing Up

Mixed precision training with DeepSpeed accelerates model training, conserves memory, and preserves accuracy, so you can process massive models and datasets at a significantly lower computational cost by taking advantage of formats like FP16 and BF16. Adopting the best practices around AMP, proper loss scaling, and hardware compatibility will help you tap into the true power of mixed precision. As models continue to grow larger, mixed precision training will remain an essential tool for scaling them.
