
Serving Large Models - vLLM, llama.cpp Server, and SGLang
Both Large Language Models (LLMs) and Vision-Language Models (VLMs) have exploded in popularity over the last two years. Powered by recent advancements in GPU tech, these models have been pre-trained on trillions of tokens and allow developers to easily leverage state-of-the-art AI, either by fine-tuning them or just using them outright.
But how would one go about hosting these models? In this article, we'll compare three of the most popular solutions: vLLM, llama.cpp, and SGLang.
vLLM
Released in June 2023 by researchers from UC Berkeley, vLLM is a high-performance LLM serving backend built around a technique called PagedAttention. PagedAttention optimizes how the attention key-value cache is managed in GPU memory, which allows vLLM to run inference efficiently at large scale. Since its release, vLLM has gained traction as both a library and a server solution, supporting a variety of LLM architectures.
vLLM includes an OpenAI-compatible API server, which lets developers switch away from proprietary LLM services without changing their client code. It also offers dynamic LoRA (Low-Rank Adaptation) loading at very high speeds, which is a crucial feature when you need to load and unload lightweight adapters on the fly. Additionally, vLLM supports multi-GPU and multi-node serving, making it highly scalable, both vertically and horizontally.
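To give a sense of how the OpenAI-compatible server is used in practice, here is a minimal sketch that queries a locally running vLLM instance with the official openai Python client. The base URL (vLLM listens on port 8000 by default) and the model name are assumptions that depend on how the server was actually launched.

```python
# Minimal sketch: querying a vLLM OpenAI-compatible server.
# Assumptions: the server runs locally on port 8000 and was started with the
# model named below; adjust both to match your deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM does not require a real API key by default
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed served model
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```

Because the request shape is the standard OpenAI one, existing client code written against a proprietary API can usually be pointed at vLLM by changing only the base URL.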
vLLM also maintains a Docker image for simple deployment on popular cloud services. However, it's important to note that the Docker image requires a very recent version of CUDA (12.4) to run, which can be a hurdle for older environments. vLLM was the fastest method for serving open-source LLMs when it was originally released.
llama.cpp
llama.cpp is a low-level C/C++ implementation originally designed for LLaMA-based models and later expanded to support a variety of other LLM architectures.
While not as fast as vLLM, llama.cpp supports inference on both GPU and CPU nodes, and even Metal on macOS, making it the most flexible choice. vLLM, on the other hand, can only run on CUDA nodes.
llama.cpp can also be used as a library through bindings for popular programming languages such as Python, Go, and Node.js, and it includes a Docker image for easy deployment.
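As a quick illustration of the library route, here is a minimal sketch using the llama-cpp-python bindings to load a quantized GGUF model and run a completion. The model path, quantization level, and generation parameters are illustrative assumptions rather than required values.

```python
# Minimal sketch: running llama.cpp as a library via llama-cpp-python.
# The model path below is hypothetical; point it at any GGUF file you have.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # assumed local quantized model
    n_ctx=4096,       # context window size
    n_gpu_layers=-1,  # offload all layers to the GPU if available, else run on CPU
)

output = llm(
    "Q: What is llama.cpp? A:",
    max_tokens=128,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```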
Another advantage of llama.cpp is its support for variable-bit quantization, which allows models to be quantized to, for example, 3-bit or 5-bit precision to squeeze the maximum performance out of resource-constrained hardware.
SGLang
SGLang is the newest player in the world of LLM serving solutions. Developed by LMSYS, the creators of the popular Chatbot Arena platform, SGLang was built to handle high-traffic serving workloads. According to a blog post by LMSYS, SGLang is currently the fastest LLM backend available.
SGLang comes with an OpenAI-compatible API, making it easy to integrate with existing software. It also supports multi-GPU and multi-node setups. Like vLLM, it offers a pre-built Docker image for easy deployment.
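Because SGLang exposes the same OpenAI-compatible interface, the client code from the vLLM example carries over almost unchanged; in the sketch below only the base URL differs. The port (SGLang's default of 30000) and the model name are assumptions that depend on how the server was launched.

```python
# Minimal sketch: the same openai client code, pointed at an SGLang server.
# Assumptions: SGLang runs locally on its default port 30000 and serves the
# model named below.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed served model
    messages=[{"role": "user", "content": "What makes SGLang fast?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```

This interchangeability is what makes it practical to benchmark several backends against each other before committing to one.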
However, SGLang has some limitations. It currently supports only a limited number of LLM architectures. It also does not support dynamic LoRA loading, which makes it less flexible if you need to switch between multiple lightweight adapters quickly.
Which One Should You Choose?
Choosing the right solution for serving large models depends on your specific needs:
- If you plan to host models locally, especially if you're working with both GPUs and CPUs or need flexibility in programming-language support, llama.cpp is your best option. It trades some raw speed against its competitors for versatility and is well suited to multi-platform environments.
- For production environments where speed is critical, SGLang is the optimal choice (if your model architecture is supported, that is). It's the fastest solution available, making it perfect for high-demand applications.
- If dynamic LoRA loading is crucial in your scenario, or your model architecture is not supported by SGLang, vLLM remains a top choice. Its balance of flexibility and speed still makes it an excellent option.
Although we have covered the three most popular LLM backends at the time of writing, it's important to keep your finger on the pulse, since a new solution seems to come out every few months. Fortunately, switching between these solutions is not a hard task, since most of them are interoperable via an OpenAI-compatible API.