# GPUStack
GPUStack is an open-source GPU cluster manager for running large language models (LLMs).
## Key Features
- Supports a Wide Variety of Hardware: Run on different brands of GPUs in Apple MacBooks, Windows PCs, and Linux servers.
- Scales with Your GPU Inventory: Easily add more GPUs or nodes to scale up your operations.
- Distributed Inference: Supports both single-node multi-GPU and multi-node inference and serving.
- Lightweight Python Package: Minimal dependencies and operational overhead.
- OpenAI-Compatible APIs: Serve APIs that follow OpenAI standards.
- User and API Key Management: Simplified management of users and API keys.
- GPU Metrics Monitoring: Monitor GPU performance and utilization in real time.
- Token Usage and Rate Metrics: Track token usage and manage rate limits effectively.
## Supported Platforms
- macOS
- Linux
- Windows
## Supported Accelerators
- Apple Metal
- NVIDIA CUDA (Compute Capability 6.0 and above)
We plan to support the following accelerators in future releases:
- AMD ROCm
- Intel oneAPI
- MTHREADS MUSA
- Qualcomm AI Engine
## Supported Models
GPUStack uses llama.cpp as the backend and supports large language models in GGUF format. Here are some example models (see the sketch after this list for how to enumerate the models a running server exposes):
- LLaMA
- Mistral 7B
- Mixtral MoE
- DBRX
- Falcon
- Baichuan
- Aquila
- Yi
- StableLM
- Deepseek
- Qwen
- Phi
- Gemma
- Mamba
- Grok-1
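Since the server's APIs follow the OpenAI standard, the models it currently serves can be listed through the usual models endpoint. Below is a minimal sketch using the official `openai` Python client; the server address, API key, and the `/v1` base path are placeholder assumptions, so substitute the values from your own deployment.

```python
from openai import OpenAI

# Placeholders (assumptions): replace with your GPUStack server address and a
# real API key; the "/v1" base path may differ in your deployment.
client = OpenAI(
    base_url="http://your-gpustack-server/v1",
    api_key="your-api-key",
)

# List the models the server currently serves via the standard models endpoint.
for model in client.models.list():
    print(model.id)
```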
## OpenAI-Compatible APIs
GPUStack serves OpenAI-compatible APIs. For details, please refer to the OpenAI-Compatible APIs documentation.
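As a minimal illustration, the sketch below sends a chat completion request with the official `openai` Python client. The server address, API key, and the model name `llama3` are placeholder assumptions; use a model you have actually deployed.

```python
from openai import OpenAI

# Placeholders (assumptions): replace with your GPUStack server address and API key.
client = OpenAI(
    base_url="http://your-gpustack-server/v1",
    api_key="your-api-key",
)

completion = client.chat.completions.create(
    model="llama3",  # hypothetical name; use a model deployed on your server
    messages=[{"role": "user", "content": "Tell me a joke."}],
)
print(completion.choices[0].message.content)
```

Because the request shape matches OpenAI's, existing OpenAI client code can usually be pointed at a GPUStack server by changing only the base URL and API key.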