# GPUStack

GPUStack is an open-source GPU cluster manager for running AI models.

## Key Features
- Broad GPU Compatibility: Seamlessly supports GPUs from various vendors across Apple Macs, Windows PCs, and Linux servers.
- Extensive Model Support: Supports a wide range of models, including LLMs, VLMs, image models, audio models, embedding models, and rerank models.
- Flexible Inference Backends: Integrates with llama-box (llama.cpp & stable-diffusion.cpp), vox-box, vLLM, and Ascend MindIE.
- Multi-Version Backend Support: Runs multiple versions of inference backends concurrently to meet the diverse runtime requirements of different models.
- Distributed Inference: Supports single-node and multi-node multi-GPU inference, including heterogeneous GPUs across vendors and runtime environments.
- Scalable GPU Architecture: Scales easily by adding more GPUs or nodes to your infrastructure.
- Robust Model Stability: Ensures high availability with automatic failure recovery, multi-instance redundancy, and load balancing for inference requests.
- Intelligent Deployment Evaluation: Automatically assesses model resource requirements, backend and architecture compatibility, OS compatibility, and other deployment-related factors.
- Automated Scheduling: Dynamically allocates models based on available resources.
- Lightweight Python Package: Minimal dependencies and low operational overhead.
- OpenAI-Compatible APIs: Fully compatible with OpenAI’s API specifications for seamless integration.
- User & API Key Management: Simplified management of users and API keys.
- Real-Time GPU Monitoring: Tracks GPU performance and utilization in real time.
- Token and Rate Metrics: Monitors token usage and API request rates.
## Supported Platforms
- macOS
- Windows
- Linux
## Supported Accelerators
- NVIDIA CUDA (Compute Capability 6.0 and above)
- Apple Metal (M-series chips)
- AMD ROCm
- Ascend CANN
- Hygon DTK
- Moore Threads MUSA
We plan to support the following accelerators in future releases:
- Intel oneAPI
- Qualcomm AI Engine
## Supported Models
GPUStack uses llama-box (a bundled llama.cpp and stable-diffusion.cpp server), vLLM, Ascend MindIE, and vox-box as inference backends and supports a wide range of models. Models from the following sources are supported:
- Local File Path
Example Models:

| Category | Models |
|---|---|
| Large Language Models (LLMs) | Qwen, LLaMA, Mistral, DeepSeek, Phi, Gemma |
| Vision Language Models (VLMs) | Llama3.2-Vision, Pixtral, Qwen2.5-VL, LLaVA, InternVL2.5 |
| Diffusion Models | Stable Diffusion, FLUX |
| Embedding Models | BGE, BCE, Jina |
| Reranker Models | BGE, BCE, Jina |
| Audio Models | Whisper (speech-to-text), CosyVoice (text-to-speech) |
For the full list of supported models, please refer to the supported models section in the inference backends documentation.
## OpenAI-Compatible APIs

GPUStack serves OpenAI-compatible APIs. For details, please refer to the OpenAI Compatible APIs documentation.
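As a minimal sketch of what "OpenAI-compatible" means in practice, the snippet below builds a standard OpenAI chat-completion request payload and shows where it would be sent. The model name `qwen2.5`, the server address, and the API key are hypothetical placeholders — substitute a model actually deployed on your GPUStack server and a key created in its UI.

```python
import json

# An OpenAI-compatible chat completion payload. Any client that speaks
# the OpenAI API (including the official openai SDK pointed at your
# server's base URL) can use this same shape.
payload = {
    "model": "qwen2.5",  # hypothetical; use a model deployed on your server
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "stream": False,
}

body = json.dumps(payload).encode("utf-8")

# To send it, POST `body` to the server's OpenAI-compatible endpoint,
# for example (placeholders, not real values):
#   POST http://your-gpustack-server/v1/chat/completions
#   Authorization: Bearer <your-api-key>
#   Content-Type: application/json
```

The same payload works with the official `openai` Python client by setting its `base_url` to your GPUStack server's OpenAI-compatible endpoint and `api_key` to a key generated in GPUStack.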