DeepSpeed - PyTorch & Transformers



DeepSpeed integrates with both PyTorch and Hugging Face Transformers to provide highly efficient training and inference for large models. It covers everything from basic configuration to memory-oriented optimization techniques for scaling machine learning models. In this chapter, you will learn how to adapt existing PyTorch codebases and Hugging Face models to DeepSpeed, improving training speed and reducing memory usage.

Now let's work through this step by step, with code examples and sample outputs, to help you integrate DeepSpeed into your machine learning workflows as seamlessly as possible.

DeepSpeed with PyTorch Models

DeepSpeed improves PyTorch models by reducing memory consumption and improving computational efficiency. The following example integrates DeepSpeed into a PyTorch-based training script; the process involves setting up a DeepSpeed configuration (a dictionary or JSON file) and making small modifications to the training loop:

Example: DeepSpeed with PyTorch

Following is a complete example of using DeepSpeed with a PyTorch model in Python −

import torch
import deepspeed

# Define a simple PyTorch model
class SimpleModel(torch.nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc = torch.nn.Linear(512, 10)

    def forward(self, x):
        return self.fc(x)

# Initialize model and optimizer
model = SimpleModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# DeepSpeed configuration (defined inline as a dictionary; fp16 requires a CUDA GPU)
ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 2,
    "fp16": {
        "enabled": True
    }
}

# Initialize DeepSpeed
model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config
)

# Dummy data for the example; move it to the engine's device and dtype
# (fp16 is enabled in the config, so inputs must be half precision)
data = torch.randn(32, 512).to(model.device).half()
target = torch.randint(0, 10, (32,)).to(model.device)

# Example training loop
for epoch in range(10):
    model.train()
    outputs = model(data)
    loss = torch.nn.functional.cross_entropy(outputs, target)

    # DeepSpeed replaces loss.backward() and optimizer.step()
    model.backward(loss)
    model.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")

Output

Epoch 1, Loss: 2.302
Epoch 2, Loss: 2.176
...
Epoch 10, Loss: 1.862

So far we have defined a simple PyTorch model, created a DeepSpeed configuration, and initialized the model with deepspeed.initialize(). The training loop is only slightly modified: it uses model.backward() and model.step() instead of calling the PyTorch optimizer directly.
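
The DeepSpeed engine also provides checkpointing helpers that save the model, optimizer, and DeepSpeed state together. The snippet below is a minimal sketch using the engine returned by deepspeed.initialize() above; the directory name and tag are placeholders −

# Save the model, optimizer, and DeepSpeed state (directory and tag are placeholders)
model.save_checkpoint("./checkpoints", tag="epoch_10")

# Later, restore training state from the same directory
load_path, client_state = model.load_checkpoint("./checkpoints", tag="epoch_10")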

Integration with Hugging Face Transformers

Hugging Face's Transformers library provides state-of-the-art models such as BERT, GPT, and T5, which often require substantial compute resources. With DeepSpeed, we can optimize both the training and the inference of these large transformer models. Now, let's see how to use DeepSpeed with a Hugging Face transformer model.

Example: DeepSpeed with Hugging Face Transformers

Following is a complete example of using DeepSpeed with Hugging Face Transformers −

from transformers import BertForSequenceClassification, Trainer, TrainingArguments
import deepspeed

# Load a pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# DeepSpeed configuration (saved to ds_config.json before it is referenced below)
ds_config = {
    "fp16": {
        "enabled": True
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 5e-5,
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 5e-5,
            "warmup_num_steps": 100
        }
    }
}

# Save the DeepSpeed config file
import json
with open("./ds_config.json", "w") as f:
    json.dump(ds_config, f)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    evaluation_strategy="steps",
    save_steps=10,
    logging_dir="./logs",
    deepspeed="./ds_config.json",  # Provide the DeepSpeed configuration file
)

# Initialize the Hugging Face Trainer
# (train_dataset and eval_dataset are assumed to be prepared beforehand;
#  a minimal sketch is shown after this example)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

# Start training
trainer.train()

Output

The training output prints the training and evaluation progress. With DeepSpeed enabled, mixed precision (and, if configured, ZeRO optimizer state partitioning) reduces memory usage and speeds up training.

{'loss': 0.67, 'learning_rate': 5e-5, 'epoch': 1.0, 'step': 100}
{'loss': 0.57, 'learning_rate': 4e-5, 'epoch': 2.0, 'step': 200}

Here we loaded a pre-trained BERT model, defined a DeepSpeed configuration, and used the Hugging Face Trainer to train the model with DeepSpeed enabled. The example assumes that train_dataset and eval_dataset already exist; a minimal sketch of preparing them follows.
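
The following sketch shows one way to build small tokenized datasets for the Trainer above. The sentences and labels are placeholders invented for illustration, not part of the original example −

import torch
from transformers import BertTokenizer

# Hypothetical toy data - replace with your own texts and labels
texts = ["DeepSpeed makes training faster.", "This example is only a sketch."]
labels = [1, 0]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encodings = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

class ToyDataset(torch.utils.data.Dataset):
    # Wraps tokenized encodings and labels in the format expected by Trainer
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

train_dataset = ToyDataset(encodings, labels)
eval_dataset = ToyDataset(encodings, labels)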

Porting PyTorch Codebases to DeepSpeed

For legacy PyTorch codebases, integrating DeepSpeed requires only minor changes. In most cases, you add DeepSpeed by initializing the model with deepspeed.initialize() and making sure your training loop follows the DeepSpeed API.

Step-by-Step Guide to Integrating Legacy PyTorch Codebases with DeepSpeed

Step 1: Install DeepSpeed

The following command may be used to install DeepSpeed.

pip install deepspeed
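
After installation, you can optionally run DeepSpeed's bundled ds_report utility to check how it was built for your environment (the exact output varies by system) −

ds_report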

Step 2: Update Model Initialization

Replace the standard model and optimizer initialization with DeepSpeed's initialization.

model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config
)
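
The ds_config referenced above is the same kind of configuration dictionary (or JSON file) used in the earlier example. A minimal sketch, with illustrative values only, might look like this −

# Minimal illustrative DeepSpeed configuration for the ported codebase
ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True}
}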

Step 3: Modify the Training Loop

Replace calls to loss.backward() with model.backward(loss), and calls to optimizer.step() with model.step().

Example: Integration with an Existing Codebase

The following shows a typical PyTorch training loop and its DeepSpeed equivalent −

# Original PyTorch training code
for epoch in range(num_epochs):
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model(batch['input'])
        loss = criterion(outputs, batch['target'])
        loss.backward()
        optimizer.step()

# DeepSpeed training code
for epoch in range(num_epochs):
    for batch in train_loader:
        outputs = model(batch['input'])
        loss = criterion(outputs, batch['target'])
        model.backward(loss)
        model.step()

By replacing backward and step operations with DeepSpeed's versions, you benefit from the optimizations DeepSpeed provides without changing much of the existing logic.

Advanced Integration Tips and Tricks

To make the most of DeepSpeed's capabilities, here are some advanced tips and tricks:

Memory optimization − DeepSpeed's ZeRO (Zero Redundancy Optimizer) reduces per-device memory usage by partitioning model states (optimizer states, gradients, and parameters) across devices. Using ZeRO stage 1, 2, or 3 can save a substantial amount of memory.

{
    "zero_optimization": {
        "stage": 2
    }
}
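
Stage 2 partitions optimizer states and gradients, while stage 3 additionally partitions the parameters themselves and can offload them to CPU memory. The following is an illustrative configuration only; whether offloading helps depends on your hardware −

{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu"
        },
        "offload_param": {
            "device": "cpu"
        }
    }
}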

Mixed Precision Training − mixed precision (FP16) training is enabled with a single switch in the DeepSpeed configuration file. It reduces memory usage and speeds up training on modern GPUs.

{
    "fp16": {
        "enabled": true
    }
}
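
On GPUs that support bfloat16, the analogous bf16 switch can be used instead of fp16; this alternative is shown here for illustration and is not part of the original example −

{
    "bf16": {
        "enabled": true
    }
}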

Gradient Accumulation − when GPU memory is limited, DeepSpeed can accumulate gradients over several micro-batches before applying an update, as configured below.

{
    "gradient_accumulation_steps": 4
}
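
The accumulation steps interact with the batch-size settings: DeepSpeed expects train_batch_size to equal train_micro_batch_size_per_gpu × gradient_accumulation_steps × the number of GPUs. The values below are an illustrative single-GPU configuration −

{
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4
}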

In addition to the above, DeepSpeed offers many other advanced features that further optimize the training of large models in resource-constrained environments.
