
GPUStack


GPUStack is an open-source GPU cluster manager for running AI models.

Key Features

  • Broad Hardware Compatibility: Runs on GPUs from different vendors in Apple MacBooks, Windows PCs, and Linux servers.
  • Broad Model Support: From LLMs to diffusion models, audio, embedding, and reranker models.
  • Scales with Your GPU Inventory: Easily add more GPUs or nodes to scale up your operations.
  • Distributed Inference: Supports both single-node multi-GPU and multi-node inference and serving.
  • Multiple Inference Backends: Supports llama-box (llama.cpp & stable-diffusion.cpp), vox-box, and vLLM as inference backends.
  • Lightweight Python Package: Minimal dependencies and operational overhead.
  • OpenAI-compatible APIs: Serve APIs that are compatible with OpenAI standards.
  • User and API key management: Simplified management of users and API keys.
  • GPU metrics monitoring: Monitor GPU performance and utilization in real-time.
  • Token usage and rate metrics: Track token usage and manage rate limits effectively.

Supported Platforms

  • macOS
  • Windows
  • Linux

Supported Accelerators

  • Apple Metal (M-series chips)
  • NVIDIA CUDA (Compute Capability 6.0 and above)
  • Ascend CANN
  • Moore Threads MUSA

We plan to support the following accelerators in future releases.

  • AMD ROCm
  • Intel oneAPI
  • Qualcomm AI Engine

Supported Models

GPUStack uses llama-box (bundled llama.cpp and stable-diffusion.cpp server), vLLM and vox-box as the backends and supports a wide range of models. Models from the following sources are supported:

  1. Hugging Face

  2. ModelScope

  3. Ollama Library

  4. Local File Path

Example Models:

Category                       Models
Large Language Models (LLMs)   Qwen, LLaMA, Mistral, DeepSeek, Phi, Yi
Vision Language Models (VLMs)  Llama3.2-Vision, Pixtral, Qwen2-VL, LLaVA, InternVL2
Diffusion Models               Stable Diffusion, FLUX
Rerankers                      GTE, BCE, BGE, Jina
Audio Models                   Whisper (speech-to-text), CosyVoice (text-to-speech)

For the full list of supported models, please refer to the supported models section in the inference backends documentation.

OpenAI-Compatible APIs

GPUStack serves OpenAI-compatible APIs. For details, please refer to the OpenAI Compatible APIs documentation.
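Because the APIs follow the OpenAI standard, any OpenAI-style client can talk to a GPUStack deployment. The sketch below builds a standard chat-completions request body; the base URL, API key, and model name are placeholder assumptions for illustration, not values taken from this document — substitute the address of your own server, an API key created in GPUStack, and the name of a model you have deployed.

```python
import json

# Placeholder values -- replace with your GPUStack server address,
# an API key from your GPUStack deployment, and a deployed model name.
BASE_URL = "http://localhost/v1"   # assumed endpoint; check your deployment
API_KEY = "your-gpustack-api-key"
MODEL = "qwen2.5"                  # hypothetical model name

# Standard OpenAI chat-completions request body; any OpenAI-compatible
# client can POST this to BASE_URL + "/chat/completions".
payload = {
    "model": MODEL,
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "stream": False,
}

# Bearer-token authorization header, as in the OpenAI API convention.
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

print(json.dumps(payload, indent=2))
```

In practice you would send this payload with any HTTP client, or point the official `openai` Python SDK at the server by setting its `base_url` and `api_key` to the values above.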