
Using Custom Inference Backends

This guide shows how to add a custom inference backend that is not built into GPUStack, using TensorRT-LLM as an example. Configuration examples for common inference backends are provided at the end of the article. For a description of each parameter, see the User Guide.

Core Steps

  1. Prepare the Docker image for the required inference backend.
  2. Understand the image's ENTRYPOINT or CMD to determine the inference backend startup command (a docker inspect sketch follows this list).
  3. Add configuration on the Inference Backend page.
  4. Deploy models on the Deployments page and select the newly added backend.
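
For step 2, a quick way to check how a container starts is to inspect its ENTRYPOINT and CMD. A minimal sketch, using the TensorRT-LLM image from the example below:

    # Pull the image, then print its ENTRYPOINT and CMD to learn the default startup command.
    docker pull nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc0
    docker inspect --format '{{json .Config.Entrypoint}} {{json .Config.Cmd}}' \
      nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc0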

Example

The following uses TensorRT-LLM as an example to illustrate how to add and use an inference backend.

These examples are functional demonstrations, not performance-optimized configurations. For better performance, consult each backend’s official documentation for tuning.

  1. Find the required image from the release page linked from the TensorRT-LLM documentation.
  2. TensorRT-LLM images must launch the inference service using trtllm-serve; otherwise, they start an interactive shell session. The run_command supports placeholders such as {{model_path}} and {{port}} (and optionally {{model_name}} and {{worker_ip}}), which are automatically replaced with the actual values when the deployment is scheduled to a worker (a rendered example follows this list).
  3. Add configuration on the Inference Backend page; YAML import is supported. Example:
    backend_name: TensorRT-LLM-custom
    default_version: 1.2.0rc0
    version_configs:
      1.2.0rc0:
        image_name: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc0
        run_command: 'trtllm-serve {{model_path}} --host 0.0.0.0 --port {{port}}'
        custom_framework: cuda
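
For illustration, once a deployment is scheduled to a worker, the run_command above renders to something like the following (the model path and port here are made-up values; GPUStack substitutes the real ones):

    # Hypothetical substitution of {{model_path}} and {{port}}:
    trtllm-serve /data/models/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 40000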
    

Note

Some inference backends (e.g., vLLM, MindIE) are labeled Built-in on the Inference Backend page; these ship with GPUStack. When you use a built-in backend, GPUStack automatically pulls a container image that matches the worker's runtime environment. You can also add custom versions to a built-in backend and specify the image names you need.

  4. On the Deployments page, select the newly added backend and deploy the model.

Result

After the inference backend service starts, the model instance status changes to RUNNING, and you can chat with the model in the Playground.
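
You can also verify the deployment from the command line. A minimal sketch, assuming GPUStack's OpenAI-compatible API under /v1-openai; the server URL, API key, and deployment name below are hypothetical:

    # Hypothetical server URL, API key, and model (deployment) name.
    curl http://your-gpustack-server/v1-openai/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $GPUSTACK_API_KEY" \
      -d '{
        "model": "my-tensorrt-llm-model",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'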

Typical Examples

Deploy GGUF Models with llama.cpp

  1. Find the image name in the documentation: ghcr.io/ggml-org/llama.cpp:server (ensure you select the variant that matches your worker platform).
  2. Add the following backend configuration on the Inference Backend page:
    backend_name: llama.cpp-custom
    default_run_command: '-m {{model_path}} --host 0.0.0.0 --port {{port}}'
    version_configs:
      v1-cuda:
        image_name: ghcr.io/ggml-org/llama.cpp:server-cuda
        custom_framework: cuda
      v1-cpu:
        image_name: ghcr.io/ggml-org/llama.cpp:server
        custom_framework: cpu
    default_version: v1-cuda
    
  3. On the Deployments page, locate a GGUF-format model, select the llama.cpp-custom backend, and deploy.

For more information, refer to the llama.cpp GitHub repository.
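
If the server fails to start, a quick way to debug is to run the image directly outside GPUStack with the same arguments; the image's entrypoint is the llama.cpp server binary, so the arguments mirror the run command. A sketch with hypothetical host paths and model file:

    # /data/models and model.gguf are made-up examples; mount your own model directory.
    docker run --rm -p 8080:8080 -v /data/models:/models \
      ghcr.io/ggml-org/llama.cpp:server \
      -m /models/model.gguf --host 0.0.0.0 --port 8080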


Use Kokoro-FastAPI

  1. Find the image name in the documentation, and choose the variant that matches your worker platform:
     - ghcr.io/remsky/kokoro-fastapi-cpu:latest
     - ghcr.io/remsky/kokoro-fastapi-gpu:latest

Warning

This image includes a built-in model, so the model you select on the Deployments page may be ignored. To avoid unexpected errors, choose a model consistent with the one bundled in the image. The kokoro-fastapi image uses the Kokoro-82M model.

  2. Add the following backend configuration on the Inference Backend page:
    backend_name: kokoro-custom
    version_configs:
      v1:
        image_name: ghcr.io/remsky/kokoro-fastapi-gpu:latest
        custom_framework: cuda
    default_run_command: 'python -m uvicorn api.src.main:app --host 0.0.0.0 --port {{port}} --log-level debug'
    
  3. On the Deployments page, select the Kokoro-82M model, choose the kokoro-custom backend, and set Name to one of the supported keys (e.g., kokoro).

Known Limitations for Name

In kokoro-fastapi, the model_name is restricted to the keys below; other values will result in an "unsupported" error.

"models": {
    "tts-1": "kokoro-v1_0",
    "tts-1-hd": "kokoro-v1_0",
    "kokoro": "kokoro-v1_0"
}

Therefore, restrict the Name during deployment to one of these supported keys.
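
Once the deployment is running, you can exercise the OpenAI-compatible speech endpoint that kokoro-fastapi exposes. A minimal sketch, assuming GPUStack proxies the backend under /v1-openai; the server URL and API key are hypothetical, and af_bella is one of Kokoro-82M's bundled voice names:

    # "kokoro" must be one of the supported model keys listed above.
    curl http://your-gpustack-server/v1-openai/audio/speech \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $GPUSTACK_API_KEY" \
      -d '{"model": "kokoro", "input": "Hello from GPUStack!", "voice": "af_bella"}' \
      -o speech.mp3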
