Observability
This document describes how to monitor GPUStack Server/Worker/LLM serving runtime metrics using Prometheus and Grafana.
Overview
GPUStack provides a comprehensive set of metrics for model serving and GPU resource management. By integrating Prometheus and Grafana, users can collect, store, and visualize these metrics in real time, enabling efficient monitoring and troubleshooting.
Metrics Exposed by GPUStack
The following metrics are exposed by GPUStack and can be scraped by Prometheus. Each metric includes hierarchical labels for cluster, worker, model, and instance identification.
LLM Serving Runtime Metrics
| Metric Name | Type | Description |
|---|---|---|
| gpustack:num_requests_running | Gauge | Number of requests currently being processed. |
| gpustack:num_requests_waiting | Gauge | Number of requests waiting in the queue. |
| gpustack:num_requests_swapped | Gauge | Number of requests swapped out to CPU. |
| gpustack:prefix_cache_hit_rate | Gauge | Prefix cache hit rate. |
| gpustack:kv_cache_usage_ratio | Gauge | KV-cache usage ratio. 1.0 means fully used. |
| gpustack:prefix_cache_queries | Counter | Number of prefix cache queries (total tokens). |
| gpustack:prefix_cache_hits | Counter | Number of prefix cache hits (total tokens). |
| gpustack:prompt_tokens | Counter | Total number of prefill tokens processed. |
| gpustack:generation_tokens | Counter | Total number of generated tokens. |
| gpustack:request_prompt_tokens | Histogram | Number of prefill tokens processed per request. |
| gpustack:request_generation_tokens | Histogram | Number of generation tokens processed per request. |
| gpustack:time_to_first_token_seconds | Histogram | Time to generate first token. |
| gpustack:inter_token_latency_seconds | Histogram | Time to generate the next token after the previous token has been produced. |
| gpustack:time_per_output_token_seconds | Histogram | Time per generated token. |
| gpustack:e2e_request_latency_seconds | Histogram | End-to-end request latency. |
| gpustack:request_success | Counter | Total number of successful requests. |
These metrics are mapped from various runtime engines (vLLM, SGLang, MindIE) as defined in metrics_config.yaml.
Worker Metrics
| Metric Name | Type | Description |
|---|---|---|
| gpustack:worker_status | Gauge | Worker status (with state label). |
| gpustack:worker_node_os | Info | Operating system information of the worker node. |
| gpustack:worker_node_kernel | Info | Kernel information of the worker node. |
| gpustack:worker_node_uptime_seconds | Gauge | Uptime in seconds of the worker node. |
| gpustack:worker_node_cpu_cores | Gauge | Total CPU cores of the worker node. |
| gpustack:worker_node_cpu_utilization_rate | Gauge | CPU utilization rate of the worker node. |
| gpustack:worker_node_memory_total_bytes | Gauge | Total memory in bytes of the worker node. |
| gpustack:worker_node_memory_used_bytes | Gauge | Memory used in bytes of the worker node. |
| gpustack:worker_node_memory_utilization_rate | Gauge | Memory utilization rate of the worker node. |
| gpustack:worker_node_gpu | Info | GPU information of the worker node. |
| gpustack:worker_node_gpu_cores | Gauge | Total GPU cores of the worker node. |
| gpustack:worker_node_gpu_utilization_rate | Gauge | GPU utilization rate of the worker node. |
| gpustack:worker_node_gpu_temperature_celsius | Gauge | GPU temperature in Celsius. |
| gpustack:worker_node_gram_total_bytes | Gauge | Total GPU RAM in bytes. |
| gpustack:worker_node_gram_allocated_bytes | Gauge | Allocated GPU RAM in bytes. |
| gpustack:worker_node_gram_used_bytes | Gauge | Used GPU RAM in bytes. |
| gpustack:worker_node_gram_utilization_rate | Gauge | GPU RAM utilization rate. |
| gpustack:worker_node_filesystem_total_bytes | Gauge | Total filesystem size in bytes. |
| gpustack:worker_node_filesystem_used_bytes | Gauge | Used filesystem size in bytes. |
| gpustack:worker_node_filesystem_utilization_rate | Gauge | Filesystem utilization rate. |
Server Metrics
| Metric Name | Type | Description |
|---|---|---|
| gpustack:cluster | Info | Cluster information (ID, name, provider). |
| gpustack:cluster_status | Gauge | Cluster status (with state label). |
| gpustack:model | Info | Model information (ID, name, runtime, source). |
| gpustack:model_desired_instances | Gauge | Desired number of model instances. |
| gpustack:model_running_instances | Gauge | Number of running model instances. |
| gpustack:model_instance_status | Gauge | Status of each model instance (with state label). |
Note: All metrics are labeled with relevant identifiers (cluster, worker, model, instance, user) for fine-grained monitoring and filtering.
Deploy Observability Stack
The observability stack consists of two components:
- Prometheus: Scrapes metrics from GPUStack and stores them.
- Grafana: Visualizes metrics from Prometheus.
All components can be deployed together via Docker Compose for easy management.
Deploying alongside GPUStack Server
You can deploy GPUStack together with the observability stack using the provided docker-compose.yaml for a one-step setup. For details, refer to the Installation via Docker Compose.
Deploying Separately
If you started GPUStack using docker run (not Compose), you can deploy the observability components separately and connect them to your existing GPUStack server as follows:
Steps
-
Docker Compose Configuration
Compose file location: docker-compose.observability.yaml (GitHub)
services: prometheus: image: prom/prometheus container_name: gpustack-prometheus restart: unless-stopped network_mode: host command: - "--config.file=/etc/prometheus/prometheus.yml" - "--web.enable-remote-write-receiver" volumes: - ./prometheus:/etc/prometheus - prom_data:/prometheus grafana: image: grafana/grafana container_name: gpustack-grafana restart: unless-stopped network_mode: host environment: - GF_SERVER_HTTP_PORT=3000 - GF_SECURITY_ADMIN_USER=admin - GF_SECURITY_ADMIN_PASSWORD=grafana - GF_FEATURE_TOGGLES_ENABLE=flameGraph traceqlSearch traceQLStreaming correlations metricsSummary traceqlEditor traceToMetrics traceToProfiles volumes: - ./grafana/grafana_provisioning:/etc/grafana/provisioning:ro - ./grafana/grafana_dashboards:/etc/dashboards:ro volumes: prom_data: {} -
Edit the Prometheus configuration file
prometheus.yml.Configure Prometheus to scrape metrics from GPUStack by editing the
prometheus.yml, file location: prometheus.yaml (GitHub)scrape_configs: - job_name: gpustack-worker-discovery scrape_interval: 5s http_sd_configs: - url: "http://<gpustack_server_host>:10161/metrics/targets" refresh_interval: 1m - job_name: gpustack-server scrape_interval: 10s static_configs: - targets: ["<gpustack_server_host>:10161"]
Accessing Metrics
- GPUStack Metrics Endpoint:
Access metrics athttp://<gpustack_server_host>:10161/metrics - GPUStack Worker Metrics Targets:
Access metrics athttp://<gpustack_server_host>:10161/metrics/targets - Prometheus UI:
Access Prometheus athttp://<host>:9090 - Grafana UI:
Access Grafana athttp://<host>:3000(default user:admin, password:grafana)
Customizing Metrics Mapping
GPUStack supports dynamic customization of metrics mapping through its configuration API. This allows you to update how runtime engine metrics are mapped to GPUStack metrics without restarting the service. The configuration is managed centrally on the server and can be accessed or modified via HTTP API.
API Endpoints
-
Get Current Metrics Config
-
GET
http://<gpustack_server_host>:<gpustack_server_port>/v1/metrics/config -
Returns the current metrics mapping configuration in JSON format.
-
Update Metrics Config
-
POST
http://<gpustack_server_host>:<gpustack_server_port>/v1/metrics/config -
Accepts a JSON payload to update the metrics mapping configuration. Changes take effect immediately for all workers.
-
Get Default Metrics Config
- GET
http://<gpustack_server_host>:<gpustack_server_port>/v1/metrics/default-config - Returns the default metrics mapping configuration in JSON format, useful for reference or resetting.
Example Usage
Get current config:
curl http://<gpustack_server_host>:<gpustack_server_port>/v1/metrics/config
Update config:
curl -X POST http://<gpustack_server_host>:<gpustack_server_port>/v1/metrics/config \
-H "Content-Type: application/json" \
-d @custom_metrics_config.json
(where custom_metrics_config.json is your new config file)
Get default config:
curl -X POST http://<gpustack_server_host>:<gpustack_server_port>/v1/metrics/default-config
Note: The configuration should be provided in valid JSON format.