GPUStack

GPUStack is an open-source GPU cluster manager for running AI models.

Key Features

  • Broad GPU Compatibility: Seamlessly supports GPUs from various vendors across Apple Macs, Windows PCs, and Linux servers.
  • Extensive Model Support: Supports a wide range of models including LLMs, VLMs, image models, audio models, embedding models, and rerank models.
  • Flexible Inference Backends: Integrates with llama-box (llama.cpp & stable-diffusion.cpp), vox-box, vLLM, and Ascend MindIE.
  • Multi-Version Backend Support: Run multiple versions of inference backends concurrently to meet the diverse runtime requirements of different models.
  • Distributed Inference: Supports single-node and multi-node multi-GPU inference, including heterogeneous GPUs across vendors and runtime environments.
  • Scalable GPU Architecture: Easily scale up by adding more GPUs or nodes to your infrastructure.
  • Robust Model Stability: Ensures high availability with automatic failure recovery, multi-instance redundancy, and load balancing for inference requests.
  • Intelligent Deployment Evaluation: Automatically assess model resource requirements, backend and architecture compatibility, OS compatibility, and other deployment-related factors.
  • Automated Scheduling: Dynamically allocate models based on available resources.
  • Lightweight Python Package: Minimal dependencies and low operational overhead.
  • OpenAI-Compatible APIs: Fully compatible with OpenAI’s API specifications for seamless integration.
  • User & API Key Management: Simplified management of users and API keys.
  • Real-Time GPU Monitoring: Track GPU performance and utilization in real time.
  • Token and Rate Metrics: Monitor token usage and API request rates.

Supported Platforms

  • macOS
  • Windows
  • Linux

Supported Accelerators

  • NVIDIA CUDA (Compute Capability 6.0 and above)
  • Apple Metal (M-series chips)
  • AMD ROCm
  • Ascend CANN
  • Hygon DTK
  • Moore Threads MUSA

We plan to support the following accelerators in future releases:

  • Intel oneAPI
  • Qualcomm AI Engine

Supported Models

GPUStack uses llama-box (a bundled llama.cpp and stable-diffusion.cpp server), vLLM, Ascend MindIE, and vox-box as its inference backends and supports a wide range of models. Models from the following sources are supported:

  1. Hugging Face

  2. ModelScope

  3. Local File Path

Example Models:

  Category                        Models
  Large Language Models (LLMs)    Qwen, LLaMA, Mistral, DeepSeek, Phi, Gemma
  Vision Language Models (VLMs)   Llama3.2-Vision, Pixtral, Qwen2.5-VL, LLaVA, InternVL2.5
  Diffusion Models                Stable Diffusion, FLUX
  Embedding Models                BGE, BCE, Jina
  Reranker Models                 BGE, BCE, Jina
  Audio Models                    Whisper (Speech-to-Text), CosyVoice (Text-to-Speech)

For the full list of supported models, refer to the supported models section in the inference backends documentation.

OpenAI-Compatible APIs

GPUStack serves OpenAI-compatible APIs. For details, please refer to the OpenAI Compatible APIs documentation.
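As a minimal sketch of what "OpenAI-compatible" means in practice, the snippet below builds a standard chat-completions request that could be sent to a GPUStack server. The base URL, API key, and model name are assumptions for illustration only; replace them with the values from your own deployment.

```python
import json
import urllib.request

# Assumptions (not from the GPUStack docs): a server reachable at localhost,
# an API key you created in GPUStack, and a deployed model named "qwen2.5".
BASE_URL = "http://localhost/v1"
API_KEY = "your-gpustack-api-key"

# Standard OpenAI chat-completions payload; any OpenAI SDK or client that
# lets you override the base URL can produce the same request.
payload = {
    "model": "qwen2.5",
    "messages": [{"role": "user", "content": "Hello, GPUStack!"}],
}

request = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
    method="POST",
)

# Sending it requires a running GPUStack server, e.g.:
#   with urllib.request.urlopen(request) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
print(request.full_url)
```

Because the request shape matches OpenAI's specification, existing OpenAI client libraries can be pointed at a GPUStack server simply by changing their base URL and API key.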