# GPUStack

GPUStack is an open-source GPU cluster manager for running large language models (LLMs).
## Key Features
- Supports a Wide Variety of Hardware: Run on GPUs from different brands in Apple MacBooks, Windows PCs, and Linux servers.
- Scales with Your GPU Inventory: Easily add more GPUs or nodes to scale up your operations.
- Distributed Inference: Supports both single-node multi-GPU and multi-node inference and serving.
- Multiple Inference Backends: Supports llama-box (llama.cpp) and vLLM as inference backends.
- Lightweight Python Package: Minimal dependencies and operational overhead.
- OpenAI-compatible APIs: Serve APIs that are compatible with OpenAI standards.
- User and API key management: Simplified management of users and API keys.
- GPU metrics monitoring: Monitor GPU performance and utilization in real-time.
- Token usage and rate metrics: Track token usage and manage rate limits effectively.
## Supported Platforms

- macOS
- Windows
- Linux
The following Linux distributions are verified to work with GPUStack:
| Distributions | Versions       |
| ------------- | -------------- |
| Ubuntu        | >= 20.04       |
| Debian        | >= 11          |
| RHEL          | >= 8           |
| Rocky         | >= 8           |
| Fedora        | >= 36          |
| OpenSUSE      | >= 15.3 (leap) |
| OpenEuler     | >= 22.03       |
> **Note**
>
> Installing the GPUStack worker on a Linux system requires GLIBC 2.29 or higher.
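If you are not sure which GLIBC version your system ships, the Python standard library can report it. A minimal sketch, assuming a glibc-based Linux system:

```python
# Minimal sketch: check that the system GLIBC meets the 2.29 requirement.
# platform.libc_ver() is in the standard library; on non-glibc systems
# (e.g. musl-based distributions) it may return an empty result.
import platform

libc, version = platform.libc_ver()
if libc == "glibc" and version:
    major, minor = (int(part) for part in version.split(".")[:2])
    if (major, minor) >= (2, 29):
        print(f"glibc {version}: OK for the GPUStack worker")
    else:
        print(f"glibc {version}: below the required 2.29")
else:
    print("Could not detect glibc; check with your distribution's tooling.")
```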
## Supported Architectures
GPUStack supports both AMD64 and ARM64 architectures, with the following notes:
- On macOS and Linux, if you use a Python version below 3.12, make sure to install the Python distribution that matches your architecture (a quick check is shown after this list).
- On Windows, please use the AMD64 distribution of Python, as wheel packages for certain dependencies are unavailable for ARM64. If you use a tool like `conda`, this is handled automatically, since conda installs the AMD64 distribution by default.
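To verify which distribution you are actually running, a minimal sketch using only the standard library (note that on Windows the reported machine type reflects what the interpreter's build sees, e.g. `AMD64`):

```python
# Minimal sketch: report the platform and the architecture the running
# Python interpreter sees, to confirm it matches the guidance above.
import platform

print(platform.system())          # e.g. "Darwin", "Linux", "Windows"
print(platform.machine())         # e.g. "arm64", "x86_64", "AMD64"
print(platform.python_version())  # relevant for the "below 3.12" caveat
```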
## Supported Accelerators
- Apple Metal
- NVIDIA CUDA (Compute Capability 6.0 and above; see the check after this list)
- Ascend CANN
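For NVIDIA GPUs, you can verify a card's compute capability against the 6.0 minimum. A minimal sketch, assuming PyTorch with CUDA support happens to be installed in your environment (PyTorch is not a GPUStack requirement; NVIDIA's documentation also lists capabilities per GPU model):

```python
# Minimal sketch: read each visible GPU's compute capability via PyTorch
# and compare it against GPUStack's 6.0 minimum for NVIDIA CUDA.
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        name = torch.cuda.get_device_name(i)
        status = "supported" if (major, minor) >= (6, 0) else "below the 6.0 minimum"
        print(f"{name}: compute capability {major}.{minor} ({status})")
else:
    print("No CUDA device visible to PyTorch.")
```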
We plan to support the following accelerators in future releases:
- AMD ROCm
- Intel oneAPI
- MTHREADS MUSA
- Qualcomm AI Engine
## Supported Models

GPUStack uses llama-box (llama.cpp) and vLLM as its backends and supports a wide range of language and multimodal models from multiple model sources. For the full list of supported models, please refer to the Supported Models section in the inference backends documentation.
## OpenAI-Compatible APIs

GPUStack serves OpenAI-compatible APIs. For details, please refer to the OpenAI Compatible APIs documentation.
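Because the APIs follow the OpenAI standard, existing OpenAI client libraries work against a GPUStack server. A minimal sketch using the official `openai` Python package; the server URL, API key, and model name below are placeholders for values from your own deployment:

```python
# Minimal sketch: call a GPUStack deployment through its OpenAI-compatible API.
# The base_url, api_key, and model values are placeholders, not fixed values.
from openai import OpenAI

client = OpenAI(
    base_url="http://your-gpustack-server/v1-openai",  # placeholder: your server's endpoint
    api_key="your-api-key",                            # placeholder: a key created in GPUStack
)

response = client.chat.completions.create(
    model="llama3.2",  # placeholder: any model deployed on your GPUStack
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```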