Model Training with DeepSpeed



Deep learning models have grown large and complex, making them increasingly difficult to train efficiently. That is where DeepSpeed, Microsoft's deep learning optimization library, comes in. The library was designed for training large models and offers a collection of features aimed at memory optimization, computational efficiency, and overall training performance. By the end of this chapter you will know how to train models with DeepSpeed, understand the configuration files that enable its optimization features, and have worked through examples of training popular models with this powerful tool.

Deep Learning Model Training with DeepSpeed

Training deep learning models is a compute-intensive task, especially when working with large datasets and complex architectures. DeepSpeed is built for this challenge: it combines mixed precision training, ZeRO (Zero Redundancy Optimizer), and gradient accumulation in a single framework, letting you scale up model training efficiently without a proportional explosion in compute resources.

We will start by integrating DeepSpeed into a simple model training pipeline.

Step 1: Model and Dataset

Assume a simple PyTorch model for a regression problem:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# A simple regression model
class RegressionModel(nn.Module):
    def __init__(self):
        super(RegressionModel, self).__init__()
        self.fc1 = nn.Linear(10, 50)
        self.fc2 = nn.Linear(50, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Generating synthetic data
inputs = torch.randn(1000, 10)
targets = torch.randn(1000, 1)
dataset = TensorDataset(inputs, targets)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

model = RegressionModel()

Step 2: Add DeepSpeed

The next step is to create a DeepSpeed configuration file that enables the training optimizations.

DeepSpeed Configuration Files

DeepSpeed configuration files are JSON files that specify the parameters used to optimize model training. An example is as follows:

{
    "train_batch_size": 32,
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 1,
        "allgather_partitions": true,
        "reduce_scatter": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true
    },
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 0.001,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    }
}

Save the preceding JSON to a file named ds_config.json in your project folder.

Step 3: DeepSpeed Initialization

This is where things get interesting. With the configuration file set up, you are ready to initialize DeepSpeed in your training script as follows:

import deepspeed

# Initialize DeepSpeed
ds_config_path = "ds_config.json"
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config_path
)

Output

Running the above code initializes DeepSpeed with the specified config and produces output similar to the following −

[INFO] DeepSpeed info: version=0.6.0, git-hash=unknown, git-branch=unknown
[INFO] Initializing model parallel group with size 1
[INFO] Initialize optimizer with DeepSpeed Zero Optimizer
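
With the engine initialized, a minimal training loop for the regression model might look like the following sketch. It reuses nn, dataloader, and model_engine from the earlier steps, and it assumes the fp16 setting from ds_config.json, so the float inputs are cast to half precision to match the model weights (the device and fp16_enabled attributes of the engine are used here; adjust if your setup differs):

# Minimal training loop using the DeepSpeed engine (sketch)
criterion = nn.MSELoss()

for epoch in range(5):
    for batch_inputs, batch_targets in dataloader:
        # Move data to the device DeepSpeed placed the model on
        batch_inputs = batch_inputs.to(model_engine.device)
        batch_targets = batch_targets.to(model_engine.device)

        # With fp16 enabled, the model weights are half precision, so the
        # float inputs are cast to match (assumption: the engine does not
        # cast inputs automatically)
        if model_engine.fp16_enabled():
            batch_inputs = batch_inputs.half()
            batch_targets = batch_targets.half()

        outputs = model_engine(batch_inputs)
        loss = criterion(outputs, batch_targets)

        model_engine.backward(loss)   # DeepSpeed applies loss scaling
        model_engine.step()           # optimizer step and gradient zeroing

    print(f"Epoch {epoch + 1}, Loss: {loss.item():.4f}")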

Optimizing Training with DeepSpeed's Features

DeepSpeed comes with a set of features that can significantly optimize model training. We discuss some of the key features here.

  • Mixed Precision Training − Trains the model in 16-bit floating-point representation, which reduces memory usage and speeds up computation.
  • ZeRO Optimization − The Zero Redundancy Optimizer (ZeRO) substantially reduces the memory footprint of large models by partitioning model states across GPUs. You control how aggressive the optimization is with the stage parameter in the zero_optimization section.
  • Gradient Accumulation − Increases the effective batch size without a proportional increase in GPU memory. You enable it by setting gradient_accumulation_steps in the config file (see the sketch after this list).
  • Activation Checkpointing − Trades computation for memory: some activations are recomputed during the backward pass instead of being stored, which reduces overall memory consumption during training.

These features can be combined in various ways depending on what is optimal for your particular requirements.
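
As an illustration, the sketch below extends the earlier ds_config.json to enable gradient accumulation and activation checkpointing alongside fp16 and ZeRO. The values are only examples; note that train_batch_size must equal train_micro_batch_size_per_gpu × gradient_accumulation_steps × the number of GPUs, and that the activation_checkpointing section only configures DeepSpeed's checkpointing API, which your model code must call explicitly:

{
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 8,
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 2
    },
    "activation_checkpointing": {
        "partition_activations": true,
        "contiguous_memory_optimization": true
    }
}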

Example of Training BERT Model Using DeepSpeed

To demonstrate the power of DeepSpeed, consider training a well-known model such as BERT (Bidirectional Encoder Representations from Transformers).

Step 1: Prepare and Load the BERT Model

You can easily load a pre-trained BERT model using the Hugging Face Transformers library −

from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# Sample data
inputs = tokenizer("DeepSpeed makes BERT training efficient!", return_tensors="pt")
labels = torch.tensor([1])  # class index for the single example

# A single-example "dataloader" for demonstration
dataloader = [(inputs, labels)]

Step 2: Add DeepSpeed Integration

As before, we add DeepSpeed by initializing it with the model and the config file −

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json"
)

Step 3: Run Model

Train the model as follows −

for epoch in range(3):
    for batch in dataloader:
        inputs, labels = batch
        # Move tensors to the device DeepSpeed placed the model on
        inputs = {k: v.to(model_engine.device) for k, v in inputs.items()}
        labels = labels.to(model_engine.device)

        outputs = model_engine(**inputs)
        loss = nn.CrossEntropyLoss()(outputs.logits, labels)

        model_engine.backward(loss)
        model_engine.step()

    print(f"Epoch {epoch+1}, Loss: {loss.item()}")

Output

Training BERT with DeepSpeed prints the loss for every epoch, confirming that the model is training efficiently −

Epoch 1, Loss: 0.6785
Epoch 2, Loss: 0.5432
Epoch 3, Loss: 0.4218

Handling Large Datasets with DeepSpeed

Large datasets pose problems that go well beyond model architecture. Managing memory and computational resources efficiently while processing large volumes of data is what saves you from bottlenecks. DeepSpeed tackles these challenges through its advanced data-handling features.

1. Dynamic Data Loading

DeepSpeed loads data dynamically, keeping in memory only the batches currently being used during training. This cuts down the memory footprint, allowing you to train on larger datasets without needing more powerful hardware. Keeping memory usage low also reduces the time spent on data input/output, which improves overall training speed.
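
The underlying idea is the same lazy loading that a PyTorch Dataset enables: samples stay on disk and only the rows needed for the current batch are read into memory. The sketch below illustrates this with a hypothetical memory-mapped features.npy file; DeepSpeed's data loader (or a standard DataLoader) then materializes one batch at a time:

import numpy as np
import torch
from torch.utils.data import Dataset

# Illustrative lazy-loading dataset: the file stays on disk (memory-mapped)
# and only the rows requested for the current batch are read into memory.
class LazyRegressionDataset(Dataset):
    def __init__(self, path="features.npy"):   # hypothetical file name
        self.data = np.load(path, mmap_mode="r")

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = torch.from_numpy(np.asarray(self.data[idx], dtype=np.float32))
        return row[:-1], row[-1:]   # features, target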

2. Data Parallelism

Another important capability of DeepSpeed is data parallelism. It natively supports distributing data across many GPUs, so different batches are processed simultaneously. This parallelism speeds up training and uses GPU resources efficiently. In practice, applying data parallelism with DeepSpeed is painless because it integrates with PyTorch's DataLoader, as shown in the sketch below.
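
One convenient route is to hand the dataset to deepspeed.initialize via its training_data argument, in which case DeepSpeed builds a data loader that shards batches across the participating GPUs. A minimal sketch, reusing the model and dataset from Step 1:

import deepspeed

# Let DeepSpeed build a distributed data loader from the dataset; each GPU
# (process) then receives its own shard of every epoch's batches.
model_engine, optimizer, train_loader, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    training_data=dataset,
    config="ds_config.json"
)

for batch_inputs, batch_targets in train_loader:
    ...  # same training step as before, one shard per process

The script (assumed here to be called train.py) is typically started with the DeepSpeed launcher, for example deepspeed --num_gpus=2 train.py, which spawns one process per GPU.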

3. Memory-Efficient Data Shuffling

Large datasets normally require shuffling so that the model does not overfit to patterns in how the data happens to be ordered. For large datasets, however, shuffling can be extremely memory-consuming. DeepSpeed optimizes this process with memory-efficient algorithms that shuffle effectively without a large increase in memory, keeping training on large datasets smooth and efficient.

4. Data Augmentation Support

Data augmentation covers techniques that artificially enlarge a dataset by modifying existing data. DeepSpeed supports on-the-fly data augmentation, meaning augmented samples are generated during training rather than stored in memory. This further reduces memory pressure and lets you apply data augmentation techniques more extensively, as sketched below.
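
To illustrate the on-the-fly idea (this is plain PyTorch, not a DeepSpeed-specific API), a dataset wrapper can perturb each sample when it is fetched, so augmented copies never occupy memory between steps:

import torch
from torch.utils.data import Dataset

# Illustrative on-the-fly augmentation: each sample is perturbed when fetched,
# so the augmented variants are never stored.
class AugmentedDataset(Dataset):
    def __init__(self, base_dataset, noise_std=0.01):
        self.base = base_dataset
        self.noise_std = noise_std

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        x, y = self.base[idx]
        x = x + torch.randn_like(x) * self.noise_std   # augmentation at load time
        return x, y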

5. Batch Size Scaling

DeepSpeed's gradient accumulation and ZeRO optimization allow batch sizes to be scaled up even when working with enormous datasets. Larger batch sizes can improve model convergence and training stability. Because DeepSpeed manages the GPU memory requirements, you can scale the batch size and still train your model on large datasets effectively.

Together, these DeepSpeed features make it possible to manage large datasets and to design and train high-performance models without being held back by hardware restrictions. Whether you are training on a very large text corpus or processing high-resolution images, DeepSpeed's data handling keeps your training pipeline optimized and scalable.

Summing Up

DeepSpeed provides an effective training framework for deep learning models, especially as they grow in size and complexity. Learning to use its advanced features, such as mixed precision training, ZeRO optimization, and activation checkpointing, is how you get the most out of the training process. This chapter covered model training with DeepSpeed: preparing the environment, writing the DeepSpeed configuration, and running the training process. With these tools and techniques in hand, you can now handle large-scale deep learning projects with better performance and lower resource consumption.
