# GPUStack
GPUStack is an open-source GPU cluster manager for running large language models (LLMs).
## Key Features
- Supports a Wide Variety of Hardware: Run on different brands of GPUs in Apple MacBooks, Windows PCs, and Linux servers.
- Scales with Your GPU Inventory: Easily add more GPUs or nodes to scale up your operations.
- Distributed Inference: Supports both single-node multi-GPU and multi-node inference and serving.
- Lightweight Python Package: Minimal dependencies and operational overhead.
- OpenAI-Compatible APIs: Serve APIs that follow OpenAI standards.
- User and API Key Management: Simplified management of users and API keys.
- GPU Metrics Monitoring: Monitor GPU performance and utilization in real time.
- Token Usage and Rate Metrics: Track token usage and manage rate limits effectively.
## Supported Platforms
- macOS
- Linux
- Windows
## Supported Accelerators
- Apple Metal
- NVIDIA CUDA (Compute Capability 6.0 and above)
We plan to support the following accelerators in future releases:
- AMD ROCm
- Intel oneAPI
- MTHREADS MUSA
- Qualcomm AI Engine
## Supported Models
GPUStack uses llama.cpp as the backend and supports large language models in GGUF format. Here are some example models (see the sketch after this list for how to enumerate the models a running server exposes):
- LLaMA
- Mistral 7B
- Mixtral MoE
- DBRX
- Falcon
- Baichuan
- Aquila
- Yi
- StableLM
- Deepseek
- Qwen
- Phi
- Gemma
- Mamba
- Grok-1
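Since the server's APIs follow the OpenAI standard, the models it currently serves can be listed through the usual models endpoint. Below is a minimal sketch using the official `openai` Python client; the server address, API key, and the `/v1` base path are placeholder assumptions, so substitute the values from your own deployment.

```python
from openai import OpenAI

# Placeholders (assumptions): replace with your GPUStack server address and a
# real API key; the "/v1" base path may differ in your deployment.
client = OpenAI(
    base_url="http://your-gpustack-server/v1",
    api_key="your-api-key",
)

# List the models the server currently serves via the standard models endpoint.
for model in client.models.list():
    print(model.id)
```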
## OpenAI-Compatible APIs
GPUStack serves OpenAI-compatible APIs. For details, please refer to the OpenAI-Compatible APIs documentation.
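As a minimal illustration, the sketch below sends a chat completion request with the official `openai` Python client. The server address, API key, and the model name `llama3` are placeholder assumptions; use a model you have actually deployed.

```python
from openai import OpenAI

# Placeholders (assumptions): replace with your GPUStack server address and API key.
client = OpenAI(
    base_url="http://your-gpustack-server/v1",
    api_key="your-api-key",
)

completion = client.chat.completions.create(
    model="llama3",  # hypothetical name; use a model deployed on your server
    messages=[{"role": "user", "content": "Tell me a joke."}],
)
print(completion.choices[0].message.content)
```

Because the request shape matches OpenAI's, existing OpenAI client code can usually be pointed at a GPUStack server by changing only the base URL and API key.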