DeepSpeed - Inference Optimization



DeepSpeed is a deep learning optimization framework that accelerates inference for large-scale models, supporting techniques such as quantization, kernel fusion, pipeline parallelism, and more. You can achieve faster model performance, lower latency, and scalable deployments, whether on cloud servers, edge devices, or serverless platforms, with little to no loss of accuracy.

What is Inference Optimization?

Inference often proves to be the bottleneck in applications that use large-scale deep learning models: the more complex the model, the higher its latency and hardware consumption, and the harder it becomes to deploy. DeepSpeed addresses these problems by offering advanced inference optimization features that deliver faster inference, lower latency, and greater throughput while maintaining model accuracy.

This chapter demonstrates how DeepSpeed optimizes model inference, the techniques it applies to reduce latency, and how you can deploy optimized models in real-world applications.

DeepSpeed and Efficient Inference of Models

DeepSpeed is targeted explicitly at models with billions of parameters, letting inference run efficiently on a modest amount of hardware. It ships with out-of-the-box optimizations for speeding up inference, including quantization and kernel fusion.

Example: Speeding Up Inference

Let's take a simple example of using DeepSpeed to optimize inference for a Hugging Face model.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import deepspeed

# Load a pre-trained model and tokenizer from Hugging Face
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example input text
inputs = tokenizer("DeepSpeed makes model inference faster!", return_tensors="pt")

# Enable DeepSpeed inference optimization
model = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.float16,
    replace_method="auto",
)

# Move the inputs to the GPU where DeepSpeed placed the model, then run inference
inputs = {k: v.to("cuda") for k, v in inputs.items()}
with torch.no_grad():
    outputs = model(**inputs)

# Output prediction
print(outputs.logits)

Output

tensor([[ 1.0459, -1.0142]], device='cuda:0', dtype=torch.float16)

Depending on the hardware configuration, DeepSpeed's optimizations can reduce memory usage and improve inference performance by roughly 1.5x to 2x.

Explanation

  • deepspeed.init_inference − This initializes the model for inference with DeepSpeed's optimizations. mp_size indicates how many GPUs to use, and dtype=torch.float16 enables half-precision for faster computation.
  • replace_method − Setting this to "auto" lets DeepSpeed automatically apply further optimizations as well, such as kernel fusion.

Inference Latency Reduction Techniques

Inference latency is the most critical concern for real-time applications. The most important latency-reduction techniques provided by DeepSpeed are as follows:

1. Quantization

Quantization represents model weights at a lower precision than 32-bit floating point (FP32), for instance 16-bit floating point (FP16) or even 8-bit integers. This yields large savings in both compute and memory footprint with little to no loss of accuracy.

# Quantization in DeepSpeed
model = deepspeed.init_inference(model, mp_size=1, dtype=torch.int8, replace_method="auto")

Here, dtype=torch.int8 enables 8-bit quantization, which substantially reduces both the model's memory footprint and the time taken for inference.

2. Kernel Fusion

Kernel fusion is another technique, in which multiple operations are fused into a single kernel to minimize the number of memory accesses. This optimization reduces the overhead from kernel launches and memory bandwidth usage.
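
In DeepSpeed, fused kernels are requested through kernel injection. The following is a minimal sketch, assuming the Hugging Face model from the first example; replace_with_kernel_inject=True asks DeepSpeed to swap supported modules for its fused, optimized inference kernels.

# Kernel fusion via kernel injection (a sketch; assumes the
# bert-base-uncased model from the first example)
import torch
import deepspeed

model = deepspeed.init_inference(
    model,
    mp_size=1,
    dtype=torch.float16,
    replace_with_kernel_inject=True,  # use DeepSpeed's fused inference kernels
)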

3. Pipeline Parallelism

Pipeline parallelism allows you to split huge models across multiple GPUs so that data flows through the model stages in parallel and results are returned quickly during inference. This is helpful for very large models, where the memory of a single GPU would not be enough.

Following is an example of pipeline parallelism with DeepSpeed −

# Model partitioning for pipeline parallelism
model = deepspeed.init_inference(model, mp_size=4, dtype=torch.float16, pipeline_parallel=True)

4. Tensor Slicing

Tensor slicing helps fit a model onto hardware with limited memory by slicing large tensors into chunks. The load is distributed across GPUs, reducing per-GPU memory consumption and improving inference speed.
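
In DeepSpeed inference, tensor slicing is driven by the mp_size argument: values greater than 1 shard the large weight tensors across that many GPUs. Below is a minimal sketch, assuming two GPUs and the model from the first example; with mp_size > 1 the script is normally launched with the deepspeed launcher (e.g. deepspeed --num_gpus 2 script.py) so that each GPU gets its own process.

# Tensor slicing across 2 GPUs (a sketch; assumes 2 GPUs are available
# and the bert-base-uncased model from the first example)
import torch
import deepspeed

model = deepspeed.init_inference(
    model,
    mp_size=2,               # shard weight tensors across 2 GPUs
    dtype=torch.float16,
    replace_method="auto",
)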

DeepSpeed Inference Deployment Strategies

Once a model has been optimized for inference, there are several strategies for deploying it efficiently. Here are some deployment strategies using DeepSpeed:

1. Serverless Inference using DeepSpeed

Serverless architectures such as AWS Lambda can be used to deploy inference services at scale. DeepSpeed can be used to optimize the model so that it fits within serverless function memory limits and time constraints.

Following is an example of deploying DeepSpeed with FastAPI −

from fastapi import FastAPI
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import deepspeed

app = FastAPI()

# Load and optimize the model once at startup
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model = deepspeed.init_inference(model, mp_size=1, dtype=torch.float16, replace_method="auto")

@app.post("/predict")
def predict(text: str):
    inputs = tokenizer(text, return_tensors="pt")
    inputs = {k: v.to("cuda") for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    return {"logits": outputs.logits.tolist()}

We create a RESTful API using FastAPI, and the model serves predictions through a simple endpoint.

DeepSpeed's optimizations help sustain high inference throughput when batching inputs and handling many concurrent API requests.
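
As a quick usage sketch, assuming the app above is served locally (for example with uvicorn on port 8000), the /predict endpoint can be called with the requests library; because text is a plain string parameter, FastAPI exposes it as a query parameter.

# Client-side call to the /predict endpoint (a sketch; assumes the
# FastAPI app above is running locally on port 8000)
import requests

response = requests.post(
    "http://localhost:8000/predict",
    params={"text": "DeepSpeed makes model inference faster!"},
)
print(response.json())  # e.g. {"logits": [[...]]}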

2. Batching for High Throughput

Batching multiple inputs during inference improves throughput by letting the system serve more requests at once. DeepSpeed handles this effectively by splitting batches across GPUs and processing them in parallel to speed things up, as in the sketch below.
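
The following is a minimal sketch of batched inference, reusing the optimized model and tokenizer from the first example; several texts are padded into one batch and run through a single forward pass.

# Batched inference (a sketch; reuses the optimized model and
# tokenizer from the first example)
import torch

texts = [
    "DeepSpeed makes model inference faster!",
    "Batching several requests improves throughput.",
]

# Tokenize all inputs into one padded batch and move it to the GPU
batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
batch = {k: v.to("cuda") for k, v in batch.items()}

with torch.no_grad():
    outputs = model(**batch)

print(outputs.logits.shape)  # one row of logits per input text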

3. Edge Deployment with DeepSpeed

DeepSpeed's quantization and low-memory techniques make it possible to deploy large models on edge devices with limited computational power, enabling low-latency inference in applications such as mobile and IoT devices.
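
As a rough, back-of-the-envelope sketch of why quantization matters for edge targets (assuming the bert-base-uncased model from earlier), the weight storage drops from about 4 bytes per parameter in FP32 to about 1 byte in INT8:

# Rough memory estimate for edge deployment (a sketch; numbers are
# back-of-the-envelope, assuming the bert-base-uncased model)
import torch
import deepspeed
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
num_params = sum(p.numel() for p in model.parameters())

print(f"FP32 weights: ~{num_params * 4 / 1e6:.0f} MB")  # 4 bytes per weight
print(f"INT8 weights: ~{num_params * 1 / 1e6:.0f} MB")  # 1 byte per weight

# Quantized inference engine, as in the quantization example above
model = deepspeed.init_inference(model, mp_size=1, dtype=torch.int8, replace_method="auto")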

Examples of Optimized Inference in Real World

Following are some examples of optimized inference in the real world −

1. Microsoft Turing-NLG

Microsoft used DeepSpeed to optimize inference for Turing-NLG, one of the largest language models at the time of its release. Techniques such as model parallelism and quantization allowed Microsoft to reduce the inference latency of this huge model by up to 4x.

2. Hugging Face Models

Many Hugging Face models are already served in production with DeepSpeed. For example, BERT and GPT-2 models can reach around a 2x inference speedup when optimized with DeepSpeed features such as quantization and kernel fusion.

3. Nvidia Megatron-LM

NVIDIA's Megatron-LM was likewise optimized for both training and inference using DeepSpeed. This leads to faster model serving and lower memory overhead, and makes large-scale deployment on cloud infrastructure practical.
