GPUStack

GPUStack is an open-source GPU cluster manager for running AI models.

Key Features

Broad GPU Compatibility: Seamlessly supports GPUs from various vendors across Apple Macs, Windows PCs, and Linux servers.
Extensive Model Support: Supports a wide range of models including LLMs, VLMs, image models, audio models, embedding models, and rerank models.
Flexible Inference Backends: Flexibly integrates with multiple inference backends including llama-box (llama.cpp & stable-diffusion.cpp), vox-box, vLLM and Ascend MindIE.
Multi-Version Backend Support: Run multiple versions of inference backends concurrently to meet the diverse runtime requirements of different models.
Distributed Inference: Supports single-node and multi-node multi-GPU inference, including heterogeneous GPUs across vendors and runtime environments.
Scalable GPU Architecture: Easily scale up by adding more GPUs or nodes to your infrastructure.
Robust Model Stability: Ensures high availability with automatic failure recovery, multi-instance redundancy, and load balancing for inference requests.
Intelligent Deployment Evaluation: Automatically assess model resource requirements, backend and architecture compatibility, OS compatibility, and other deployment-related factors.
Automated Scheduling: Dynamically allocate models based on available resources.
Lightweight Python Package: Minimal dependencies and low operational overhead.
OpenAI-Compatible APIs: Fully compatible with OpenAI’s API specifications for seamless integration.
User & API Key Management: Simplified management of users and API keys.
Real-Time GPU Monitoring: Track GPU performance and utilization in real time.
Token and Rate Metrics: Monitor token usage and API request rates.

Supported Platforms

macOS
Windows
Linux

Supported Accelerators

We plan to support the following accelerators in future releases.

Intel oneAPI
Qualcomm AI Engine

Supported Models

GPUStack uses llama-box (bundled llama.cpp and stable-diffusion.cpp server), vLLM, Ascend MindIE and vox-box as the backends and supports a wide range of models. Models from the following sources are supported:

Hugging Face
ModelScope
Local File Path

Example Models:

Category	Models
Large Language Models(LLMs)	Qwen, LLaMA, Mistral, DeepSeek, Phi, Gemma
Vision Language Models(VLMs)	Llama3.2-Vision, Pixtral , Qwen2.5-VL, LLaVA, InternVL2.5
Diffusion Models	Stable Diffusion, FLUX
Embedding Models	BGE, BCE, Jina
Reranker Models	BGE, BCE, Jina
Audio Models	Whisper (Speech-to-Text), CosyVoice (Text-to-Speech)

For full list of supported models, please refer to the supported models section in the inference backends documentation.

OpenAI-Compatible APIs

GPUStack serves OpenAI compatible APIs. For details, please refer to OpenAI Compatible APIs