GPUStack

GPUStack is an open-source GPU cluster manager for running AI models.

Key Features

  • Broad Hardware Compatibility: Works with different brands of GPUs in Apple MacBooks, Windows PCs, and Linux servers.
  • Broad Model Support: From LLMs to diffusion models, audio, embedding, and reranker models.
  • Scales with Your GPU Inventory: Easily add more GPUs or nodes to scale up your operations.
  • Distributed Inference: Supports both single-node multi-GPU and multi-node inference and serving.
  • Multiple Inference Backends: Supports llama-box (llama.cpp and stable-diffusion.cpp), vox-box, and vLLM as inference backends.
  • Lightweight Python Package: Minimal dependencies and operational overhead.
  • OpenAI-compatible APIs: Serve APIs that are compatible with OpenAI standards.
  • User and API Key Management: Simplified management of users and API keys.
  • GPU Metrics Monitoring: Monitor GPU performance and utilization in real time.
  • Token Usage and Rate Metrics: Track token usage and manage rate limits effectively.

Supported Platforms

  • macOS
  • Windows
  • Linux

The following operating systems are verified to work with GPUStack:

OS          Versions
Windows     10, 11
Ubuntu      >= 20.04
Debian      >= 11
RHEL        >= 8
Rocky       >= 8
Fedora      >= 36
OpenSUSE    >= 15.3 (Leap)
OpenEuler   >= 22.03

Note

Installing the GPUStack worker on a Linux system requires GLIBC 2.29 or higher.
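
If you are unsure which GLIBC your system ships, one quick check uses Python's standard library (a best-effort sketch, nothing GPUStack-specific):

```python
import platform

# Best-effort detection of the C library the Python binary is linked against;
# on glibc systems this typically prints something like ('glibc', '2.31').
print(platform.libc_ver())
```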

Supported Architectures

GPUStack supports both AMD64 and ARM64 architectures, with the following notes:

  • On Linux and macOS, if using Python versions below 3.12, ensure you install the Python distribution matching your architecture (a quick way to check is shown after this list).
  • On Windows, please use the AMD64 distribution of Python, as wheel packages for certain dependencies are unavailable for ARM64. If you use tools like conda, this will be handled automatically, as conda installs the AMD64 distribution by default.
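
To confirm which architecture a given Python build targets, a quick check with the standard library (nothing GPUStack-specific) is:

```python
import platform

# Reports the architecture of the running Python process, e.g.
# "arm64"/"aarch64" for ARM64 builds or "x86_64"/"AMD64" for AMD64 builds.
print(platform.machine())
```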

Supported Accelerators

  • Apple Metal (M-series chips)
  • NVIDIA CUDA (Compute Capability 6.0 and above)
  • Ascend CANN
  • Moore Threads MUSA

We plan to support the following accelerators in future releases:

  • AMD ROCm
  • Intel oneAPI
  • Qualcomm AI Engine

Supported Models

GPUStack uses llama-box (a bundled llama.cpp and stable-diffusion.cpp server), vLLM, and vox-box as inference backends and supports a wide range of models. Models from the following sources are supported:

  1. Hugging Face
  2. ModelScope
  3. Ollama Library
  4. Local File Path

Example Models:

Category                        Models
Large Language Models (LLMs)    Qwen, LLaMA, Mistral, DeepSeek, Phi, Yi
Vision Language Models (VLMs)   Llama3.2-Vision, Pixtral, Qwen2-VL, LLaVA, InternVL2
Diffusion Models                Stable Diffusion, FLUX
Rerankers                       GTE, BCE, BGE, Jina
Audio Models                    Whisper (speech-to-text), CosyVoice (text-to-speech)

For a full list of supported models, please refer to the supported models section in the inference backends documentation.

OpenAI-Compatible APIs

GPUStack serves OpenAI-compatible APIs. For details, please refer to the OpenAI Compatible APIs documentation.
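
As a minimal sketch, any OpenAI client library can talk to a GPUStack server. The host, API path, API key, and model name below are placeholders for your own deployment, not fixed defaults; verify the exact endpoint path in your GPUStack server's documentation.

```python
from openai import OpenAI  # pip install openai

# All values below are deployment-specific placeholders.
client = OpenAI(
    base_url="http://your-gpustack-server/v1-openai",  # assumed OpenAI-compatible path; verify for your install
    api_key="your-gpustack-api-key",                   # create one via GPUStack's API key management
)

completion = client.chat.completions.create(
    model="llama3.2",  # hypothetical: the name of a model deployed on your cluster
    messages=[{"role": "user", "content": "Hello, GPUStack!"}],
)
print(completion.choices[0].message.content)
```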