
GPUStack


GPUStack is an open-source GPU cluster manager for running AI models.

Key Features

  • Broad Hardware Compatibility: Runs on GPUs from different vendors in Apple MacBooks, Windows PCs, and Linux servers.
  • Broad Model Support: From LLMs to diffusion models, audio, embedding, and reranker models.
  • Scales with Your GPU Inventory: Easily add more GPUs or nodes to scale up your operations.
  • Distributed Inference: Supports both single-node multi-GPU and multi-node inference and serving.
  • Multiple Inference Backends: Supports llama-box (llama.cpp & stable-diffusion.cpp), vox-box, and vLLM as inference backends.
  • Lightweight Python Package: Minimal dependencies and operational overhead.
  • OpenAI-compatible APIs: Serve APIs that are compatible with OpenAI standards.
  • User and API key management: Simplified management of users and API keys.
  • GPU metrics monitoring: Monitor GPU performance and utilization in real-time.
  • Token usage and rate metrics: Track token usage and manage rate limits effectively.

Supported Platforms

  • macOS
  • Windows
  • Linux

Supported Accelerators

  • Apple Metal (M-series chips)
  • NVIDIA CUDA (Compute Capability 6.0 and above)
  • Ascend CANN
  • Moore Threads MUSA

We plan to support the following accelerators in future releases.

  • AMD ROCm
  • Intel oneAPI
  • Qualcomm AI Engine

Supported Models

GPUStack uses llama-box (bundled llama.cpp and stable-diffusion.cpp server), vLLM and vox-box as the backends and supports a wide range of models. Models from the following sources are supported:

  1. Hugging Face

  2. ModelScope

  3. Ollama Library

  4. Local File Path

Example Models:

Category                       Models
Large Language Models (LLMs)   Qwen, LLaMA, Mistral, DeepSeek, Phi, Yi
Vision Language Models (VLMs)  Llama3.2-Vision, Pixtral, Qwen2-VL, LLaVA, InternVL2
Diffusion Models               Stable Diffusion, FLUX
Rerankers                      GTE, BCE, BGE, Jina
Audio Models                   Whisper (speech-to-text), CosyVoice (text-to-speech)

For the full list of supported models, please refer to the supported models section in the inference backends documentation.

OpenAI-Compatible APIs

GPUStack serves OpenAI-compatible APIs. For details, please refer to the OpenAI Compatible APIs documentation.
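Because the APIs follow the OpenAI standard, any OpenAI-style client can talk to a GPUStack deployment. The sketch below builds a standard chat-completions request body; the base URL, API key, and model name are placeholder assumptions for illustration, not values taken from this document — substitute the address of your own server, an API key created in GPUStack, and the name of a model you have deployed.

```python
import json

# Placeholder values -- replace with your GPUStack server address,
# an API key from your GPUStack deployment, and a deployed model name.
BASE_URL = "http://localhost/v1"   # assumed endpoint; check your deployment
API_KEY = "your-gpustack-api-key"
MODEL = "qwen2.5"                  # hypothetical model name

# Standard OpenAI chat-completions request body; any OpenAI-compatible
# client can POST this to BASE_URL + "/chat/completions".
payload = {
    "model": MODEL,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "stream": False,
}

# Bearer-token authorization header, as in the OpenAI API convention.
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

print(json.dumps(payload, indent=2))
```

In practice you would send this payload with any HTTP client, or point the official `openai` Python SDK at the server by setting its `base_url` and `api_key` to the values above.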