
DeepSpeed - Advanced Features
DeepSpeed is a deep learning optimization library from Microsoft for training large-scale models efficiently, scaling across many devices while keeping resource usage low.
Most users are familiar with its core functionality, but making full use of DeepSpeed requires a working knowledge of its advanced features: custom operators, fine-grained configuration options, and profiling and debugging tools.
This chapter explores these features in more depth and shows how to apply them in your deep learning projects.
Custom Operators in DeepSpeed
Custom operators let you replace specific parts of a model with implementations tuned for efficient computation and low overhead. They are useful when the default implementation of an operation is too slow or too memory-hungry for the task at hand, and they give you direct control over the forward and backward computation of that part of the model.
Example: Custom Operators
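DeepSpeed ships its own custom CUDA and CPU operators and lets you plug additional operators into a model. The example below defines a simple custom operator with PyTorch's torch.autograd.Function. It uses no DeepSpeed-specific API, and the resulting callable can be used inside any model that DeepSpeed trains.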
    import torch

    # Custom Add Operator using PyTorch's autograd
    class AddOp(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, y):
            result = x + y
            ctx.save_for_backward(x, y)
            return result

        @staticmethod
        def backward(ctx, grad_output):
            x, y = ctx.saved_tensors
            return grad_output, grad_output

    # Wrapper class for the custom operator
    class CustomAddOp:
        def build(self):
            return AddOp.apply

    # Testing the custom operator
    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    y = torch.tensor([4.0, 5.0, 6.0], requires_grad=True)

    custom_add = CustomAddOp().build()
    output = custom_add(x, y)
    output.backward(torch.ones_like(x))

    print(f"Output: {output}")
    print(f"x Gradient: {x.grad}")
    print(f"y Gradient: {y.grad}")
Output
    Output: tensor([5., 7., 9.], grad_fn=<AddOpBackward>)
    x Gradient: tensor([1., 1., 1.])
    y Gradient: tensor([1., 1., 1.])
Custom operators are aimed at developers who need to optimize a particular layer or operation by hand, which matters most for large models and computationally intensive workloads.
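The backward method in the example returns grad_output for both inputs because the derivative of x + y with respect to either input is 1. When you write a backward pass by hand, it helps to compare it against a numerical estimate. The following is a minimal sketch of such a check in plain Python; the helper names add and numerical_grad are hypothetical and are not part of PyTorch or DeepSpeed.

```python
def add(x, y):
    """Element-wise addition on plain Python lists (stand-in for the custom op)."""
    return [a + b for a, b in zip(x, y)]


def numerical_grad(fn, x, y, wrt, eps=1e-6):
    """Central finite-difference gradient of sum(fn(x, y)) with respect to x or y."""
    target = x if wrt == "x" else y
    grads = []
    for i in range(len(target)):
        plus, minus = list(target), list(target)
        plus[i] += eps
        minus[i] -= eps
        if wrt == "x":
            upper, lower = sum(fn(plus, y)), sum(fn(minus, y))
        else:
            upper, lower = sum(fn(x, plus)), sum(fn(x, minus))
        grads.append((upper - lower) / (2 * eps))
    return grads


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]

# Both estimates should be close to 1 for every element, matching the
# analytic gradient that the custom backward passes through unchanged.
grad_x = numerical_grad(add, x, y, wrt="x")
grad_y = numerical_grad(add, x, y, wrt="y")
print([round(g, 6) for g in grad_x])
print([round(g, 6) for g in grad_y])
```

For real tensors, PyTorch provides torch.autograd.gradcheck, which performs the same kind of comparison on double-precision inputs.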
Advanced Configuration Options
DeepSpeed has a flexible configuration system that gives fine-grained control over how models are trained. Options are specified in a JSON configuration file.
Example Configuration File
Here is a minimal DeepSpeed config file with advanced options −
    {
      "train_batch_size": 32,
      "gradient_accumulation_steps": 2,
      "fp16": {
        "enabled": true
      },
      "optimizer": {
        "type": "AdamW",
        "params": {
          "lr": 0.001,
          "betas": [0.9, 0.999],
          "eps": 1e-08,
          "weight_decay": 0.01
        }
      },
      "scheduler": {
        "type": "WarmupLR",
        "params": {
          "warmup_min_lr": 0.0,
          "warmup_max_lr": 0.001,
          "warmup_num_steps": 100
        }
      },
      "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": true,
        "reduce_scatter": true,
        "allgather_partitions": true
      }
    }
Advanced Options
- fp16 − Enables mixed precision training, which typically speeds up training and reduces memory use with little loss of accuracy.
- zero_optimization − Configures the Zero Redundancy Optimizer (ZeRO), which reduces per-GPU memory by partitioning optimizer states and, at stage 2, gradients across data-parallel processes instead of replicating them on every GPU.
- gradient_accumulation_steps − Accumulates gradients over several smaller micro-batches before each optimizer step, so a large effective batch size can be reached without fitting the whole batch in memory at once. This makes training feasible on memory-constrained hardware.
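The batch-related settings are linked: DeepSpeed expects train_batch_size to equal the per-GPU micro-batch size multiplied by gradient_accumulation_steps and by the number of data-parallel GPUs. The sketch below works through that arithmetic for the configuration above. The helper name micro_batch_per_gpu and the GPU counts are assumptions made for illustration; they are not part of DeepSpeed's API.

```python
def micro_batch_per_gpu(train_batch_size: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Per-GPU micro-batch size implied by the global batch settings.

    Uses the relationship:
        train_batch_size = micro_batch * grad_accum_steps * num_gpus
    """
    denominator = grad_accum_steps * num_gpus
    if train_batch_size % denominator != 0:
        raise ValueError(
            f"train_batch_size={train_batch_size} is not divisible by "
            f"grad_accum_steps * num_gpus = {denominator}"
        )
    return train_batch_size // denominator


# With the example config (train_batch_size=32, gradient_accumulation_steps=2):
print(micro_batch_per_gpu(32, 2, num_gpus=1))  # 16 samples per forward pass
print(micro_batch_per_gpu(32, 2, num_gpus=4))  # 4 samples per forward pass
```

If the three values cannot be reconciled for the number of GPUs you launch on, adjust train_batch_size or gradient_accumulation_steps before starting the run.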
Loading and Using the Configuration
To load and use the configuration in your training script −
    import deepspeed
    import torch

    # Assuming a model and dataloader are already defined
    # For example: model = YourModelClass() and dataloader = DataLoader(...)

    # Load DeepSpeed configuration
    # ds_config.json should contain the configuration details for DeepSpeed
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config='ds_config.json'
    )

    # Training loop
    for batch in dataloader:
        # Forward pass
        outputs = model_engine(batch)
        loss = outputs['loss'] if isinstance(outputs, dict) and 'loss' in outputs else outputs

        # Backward pass and optimization step
        model_engine.backward(loss)
        model_engine.step()

    # This setup allows large models to leverage DeepSpeed's advanced memory and computational optimizations.
With this setup, the training loop picks up the mixed precision, optimizer, scheduler, and ZeRO settings from the configuration file, which pays off most when training large models.
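If you would rather build the configuration in Python than edit JSON by hand, one option is to assemble a dictionary and write it to disk before training starts. The sketch below uses only the standard library. The helper name write_ds_config and its lr and zero_stage parameters are hypothetical, and the keys simply mirror a subset of the example configuration above.

```python
import json
import tempfile
from pathlib import Path


def write_ds_config(path, *, lr=1e-3, zero_stage=2):
    """Write a DeepSpeed-style JSON config to `path` and return the dict."""
    config = {
        "train_batch_size": 32,
        "gradient_accumulation_steps": 2,
        "fp16": {"enabled": True},
        "optimizer": {
            "type": "AdamW",
            "params": {"lr": lr, "betas": [0.9, 0.999], "eps": 1e-8, "weight_decay": 0.01},
        },
        "zero_optimization": {"stage": zero_stage},
    }
    Path(path).write_text(json.dumps(config, indent=2))
    return config


# Write the file to a temporary directory and read it back to confirm that
# it is valid JSON with the expected values.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "ds_config.json"
    written = write_ds_config(config_path, zero_stage=2)
    loaded = json.loads(config_path.read_text())

print(loaded["zero_optimization"]["stage"])  # 2
```

The config argument of deepspeed.initialize also accepts a dictionary, so the same dict can be passed directly instead of going through a file.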
Profiling and Debugging DeepSpeed Applications
Profiling and debugging are how you find bottlenecks in large-model training and confirm that your code runs efficiently. DeepSpeed offers built-in logging and timing for this purpose and works alongside mainstream profilers such as NVIDIA Nsight Systems.
Using DeepSpeed Profiler
DeepSpeed can report timing information while a model trains. Setting wall_clock_breakdown in the configuration enables a timing breakdown of the forward, backward, and update phases of each training step.
    import deepspeed
    import torch

    # Define DeepSpeed configuration with profiling enabled
    deepspeed_config = {
        "train_batch_size": 32,
        "steps_per_print": 10,
        "gradient_clipping": 1.0,
        # step() needs an optimizer, so one is configured here
        "optimizer": {"type": "AdamW", "params": {"lr": 0.001}},
        "wall_clock_breakdown": True  # Enables detailed timing breakdown for profiling
    }

    # Assuming `model` and `data_loader` are defined
    # Initialize DeepSpeed with profiling enabled
    model_engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        config=deepspeed_config,
        model_parameters=model.parameters()
    )

    # Start training with profiling enabled
    for batch in data_loader:
        # Forward pass
        outputs = model_engine(batch)
        loss = outputs['loss'] if isinstance(outputs, dict) and 'loss' in outputs else outputs

        # Backward pass and optimization step
        model_engine.backward(loss)
        model_engine.step()

    # Profiling enabled by `wall_clock_breakdown` provides detailed insights into performance bottlenecks.
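To see what a per-phase timing breakdown measures, it can help to build a tiny version by hand. The sketch below is a hand-rolled illustration using only the standard library, not DeepSpeed's own timers. The PhaseTimer class and the dummy workloads are hypothetical stand-ins for the forward, backward, and step calls in the loop above.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        """Average seconds per call for each phase."""
        return {name: self.totals[name] / self.counts[name] for name in self.totals}


def dummy_work(n):
    """CPU-bound placeholder for a real training phase."""
    return sum(i * i for i in range(n))


timer = PhaseTimer()
for _ in range(3):
    with timer.phase("forward"):
        dummy_work(20_000)   # stand-in for model_engine(batch)
    with timer.phase("backward"):
        dummy_work(40_000)   # stand-in for model_engine.backward(loss)
    with timer.phase("step"):
        dummy_work(5_000)    # stand-in for model_engine.step()

for name, avg in timer.report().items():
    print(f"{name}: {avg * 1e3:.3f} ms per call")
```

On a GPU, kernels run asynchronously, so a timer like this would need a call such as torch.cuda.synchronize() before each timestamp to give meaningful numbers.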
Debugging Techniques
DeepSpeed also includes utilities that help with debugging, such as logging and resource monitoring. They can help you detect memory leaks, communication overhead, and other sources of inefficiency in gradient updates.
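To get more detailed progress output, adjust the configuration as follows. Note that steps_per_print is read at the top level of the configuration, as in the profiling example above, and that the log level itself is controlled from Python through the logger exposed as deepspeed.utils.logger rather than from the JSON file −
    {
      "steps_per_print": 50,
      "wall_clock_breakdown": true
    }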
You can also attach an external debugger, such as pdb for Python code or gdb for native code, at specific points in a DeepSpeed run to trace errors as they occur.
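A common place to stop in the debugger is the first step at which the loss becomes NaN or infinite. The following sketch shows one way to guard a training loop. The function name check_loss, its debug flag, and the sample loss values are hypothetical; only the standard library is used.

```python
import math


def check_loss(loss_value: float, step: int, debug: bool = False) -> float:
    """Return the loss unchanged, or fail loudly if it is NaN or infinite.

    With debug=True the built-in breakpoint() is called first, which opens
    pdb by default so the offending step can be inspected interactively.
    """
    if not math.isfinite(loss_value):
        if debug:
            breakpoint()
        raise FloatingPointError(f"non-finite loss {loss_value!r} at step {step}")
    return loss_value


# Simulated loss values; in a real loop this would be loss.item() per step.
losses = [0.92, 0.71, float("nan")]
try:
    for step, value in enumerate(losses):
        check_loss(value, step)
except FloatingPointError as err:
    print(err)  # non-finite loss nan at step 2
```

In a multi-process run, an interactive breakpoint is easiest to use when only one process is launched, so it is worth reproducing the problem on a single GPU first.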
Experimental and Cutting-Edge Features
DeepSpeed evolves quickly, and recent releases include several experimental features. They can bring large performance gains in specific cases, but they should be tested thoroughly before you rely on them.
3D Parallelism
DeepSpeed's 3D parallelism combines tensor (model) parallelism, pipeline parallelism, and data parallelism, which lets training scale to models with billions of parameters.
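Here's an illustrative configuration for 3D parallelism. The key names are schematic and vary between DeepSpeed versions and integrations; pipeline stages, for instance, are commonly set in code through deepspeed.pipe.PipelineModule rather than in the JSON file −
    {
      "train_batch_size": 64,
      "tensor_parallel": {
        "tp_size": 8
      },
      "pipeline_parallel": {
        "p_size": 4
      },
      "activation_checkpointing": true
    }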
With this configuration, the model is split across tensor-parallel groups of size 8 and the pipeline is divided into 4 stages, so each GPU holds only a fraction of the model's parameters and activations.
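The three dimensions multiply: the total number of GPUs equals the tensor-parallel size times the number of pipeline stages times the number of data-parallel replicas. The sketch below derives the data-parallel size from the other two. The function name data_parallel_size and the 64-GPU cluster are assumptions chosen for illustration.

```python
def data_parallel_size(world_size: int, tp_size: int, pp_size: int) -> int:
    """Number of data-parallel replicas left after tensor and pipeline splitting.

    Uses the relationship: world_size = tp_size * pp_size * dp_size.
    """
    gpus_per_replica = tp_size * pp_size
    if world_size % gpus_per_replica != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tp_size * pp_size = {gpus_per_replica}"
        )
    return world_size // gpus_per_replica


# One model replica spans 8 * 4 = 32 GPUs, so a hypothetical 64-GPU
# cluster holds 2 data-parallel replicas.
print(data_parallel_size(world_size=64, tp_size=8, pp_size=4))  # 2
```

A cluster with fewer than 32 GPUs could not host even one replica under these settings, which is why the tensor and pipeline sizes are usually chosen first and data parallelism fills the remaining GPUs.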
Activation Checkpointing
Activation checkpointing reduces memory use during training by storing activations only at selected checkpoints in the forward pass and recomputing the rest during the backward pass, trading extra computation for memory.
    import deepspeed
    from deepspeed.runtime.activation_checkpointing import checkpointing

    # To activate activation checkpointing
    checkpointing.configure(None, deepspeed_config="ds_config.json")
Activation checkpointing matters most when training very deep models or when GPU memory is limited.
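The memory saving can be estimated with a simplified model. Assume a network of equally sized layers in which one checkpoint is kept per segment, and the activations inside a single segment are recomputed and held only while that segment's backward pass runs. The sketch below counts stored activations under those assumptions; the function name peak_stored_activations and the 64-layer example are hypothetical, and real memory use also depends on layer sizes and implementation details.

```python
import math


def peak_stored_activations(num_layers: int, segment_size: int) -> int:
    """Rough count of layer activations held at once with checkpointing.

    One checkpoint is stored per segment, plus the activations of the single
    segment currently being recomputed during the backward pass.
    """
    num_checkpoints = math.ceil(num_layers / segment_size)
    return num_checkpoints + segment_size


num_layers = 64
print(f"no checkpointing: {num_layers} activations stored")
for segment_size in (4, 8, 16):
    stored = peak_stored_activations(num_layers, segment_size)
    print(f"segment size {segment_size}: about {stored} activations stored")
```

Under this model the count is smallest when the segment size is close to the square root of the number of layers, which is 8 for a 64-layer network.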