Using Custom Inference Backends
This guide shows how to add a custom inference backend that is not built into GPUStack, using TensorRT-LLM as an example. Configuration examples for common inference backends are provided at the end of the article. For a description of each parameter, see the User Guide.
Core Steps
- Prepare the Docker image for the required inference backend.
- Understand the image's ENTRYPOINT or CMD to determine the inference backend startup command.
- Add configuration on the Inference Backend page.
- Deploy models on the Deployments page and select the newly added backend.
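A minimal configuration skeleton corresponding to these steps might look like the sketch below; the backend name, image, and startup command are placeholders for illustration, not a real backend.

```yaml
# Skeleton only -- the backend name, image, and command are placeholders.
backend_name: my-backend-custom               # the name you select at deployment time
default_version: v1
version_configs:
  v1:
    image_name: example.com/my-backend:v1     # image prepared in step 1
    run_command: 'serve {{model_path}} --host 0.0.0.0 --port {{port}}'  # derived from the image's ENTRYPOINT/CMD
    custom_framework: cuda
```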
Example
The following uses TensorRT-LLM as an example to illustrate how to add and use an inference backend.
These examples are functional demonstrations, not performance-optimized configurations. For better performance, consult each backend’s official documentation for tuning.
- Find the required image from the release page linked from the TensorRT-LLM documentation.
- TensorRT-LLM images must launch the inference service using `trtllm-serve`; otherwise, they start an interactive shell session. The `run_command` supports placeholders such as `{{model_path}}` and `{{port}}` (and optionally `{{model_name}}` and `{{worker_ip}}`), which are automatically replaced with the actual values when the deployment is scheduled to a worker (a sketch of this substitution follows the example below).
- Add configuration on the Inference Backend page; YAML import is supported. Example:
```yaml
backend_name: TensorRT-LLM-custom
default_version: 1.2.0rc0
version_configs:
  1.2.0rc0:
    image_name: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc0
    run_command: 'trtllm-serve {{model_path}} --host 0.0.0.0 --port {{port}}'
    custom_framework: cuda
```
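To make the placeholder substitution concrete, the sketch below uses hypothetical values; the actual model path and port are assigned by GPUStack when the deployment is scheduled to a worker.

```yaml
# Hypothetical substitution -- the path and port below are made up for illustration;
# GPUStack fills in the real values when the deployment is scheduled.
#
#   {{model_path}} -> /data/models/my-model   (local model directory on the worker)
#   {{port}}       -> 40000                   (port assigned by the scheduler)
#
# The container then effectively starts with:
#   trtllm-serve /data/models/my-model --host 0.0.0.0 --port 40000
run_command: 'trtllm-serve {{model_path}} --host 0.0.0.0 --port {{port}}'
```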
Note
Some inference backends on the Inference Backend page are labeled Built-in (e.g., vLLM, MindIE). These are GPUStack's built-in inference backends: when you use one, GPUStack automatically pulls a container image that matches the worker's environment and runtime. You can also add custom versions to a built-in backend and specify the image names you need.
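Extending a built-in backend with a pinned image version might look like the sketch below; the vLLM image tag is a hypothetical example, and the exact fields accepted for built-in backends may differ from the custom-backend examples in this guide.

```yaml
# Sketch only: registers an additional, explicitly pinned image version for the
# built-in vLLM backend. The image tag is a hypothetical example -- choose one
# that matches your worker platform.
backend_name: vLLM
version_configs:
  v0.8.5-custom:
    image_name: vllm/vllm-openai:v0.8.5
    custom_framework: cuda
```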
Result
After the inference backend service starts, the model instance status changes to RUNNING.
You can engage in conversations in the Playground.

Typical Examples
Deploy GGUF Models with llama.cpp
- Find the image name in the documentation: `ghcr.io/ggml-org/llama.cpp:server` (ensure you select the variant that matches your worker platform).
- Add the following backend configuration on the Inference Backend page:
```yaml
backend_name: llama.cpp-custom
default_run_command: '-m {{model_path}} --host 0.0.0.0 --port {{port}}'
version_configs:
  v1-cuda:
    image_name: ghcr.io/ggml-org/llama.cpp:server-cuda
    custom_framework: cuda
  v1-cpu:
    image_name: ghcr.io/ggml-org/llama.cpp:server
    custom_framework: cpu
default_version: v1-cuda
```
- On the Deployments page, locate a GGUF-format model, select the newly added `llama.cpp-custom` backend, and deploy.
For more information, refer to the llama.cpp GitHub repository.
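If you need more control over the server, additional llama.cpp server flags can be appended to the run command. The variant below is a sketch: `-ngl` (GPU layer offload) and `-c` (context size) are standard llama.cpp server options, but verify them against the llama.cpp version baked into your image.

```yaml
# Sketch: same backend as above, with extra llama.cpp server flags appended.
# -ngl 99 offloads as many layers as possible to the GPU; -c 4096 sets the context size.
default_run_command: '-m {{model_path}} --host 0.0.0.0 --port {{port}} -ngl 99 -c 4096'
```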
Use Kokoro-FastAPI
- Find the image name in the documentation, and choose the variant that matches your worker platform:
    - `ghcr.io/remsky/kokoro-fastapi-cpu:latest`
    - `ghcr.io/remsky/kokoro-fastapi-gpu:latest`
Warning
This image includes a built-in model, so the model you select on the Deployments page may be ignored. To avoid unexpected errors, choose a model consistent with the one bundled in the image. The kokoro-fastapi image uses the Kokoro-82M model.
- Add the following backend configuration on the Inference Backend page:
```yaml
backend_name: kokoro-custom
version_configs:
  v1:
    image_name: ghcr.io/remsky/kokoro-fastapi-gpu:latest
    custom_framework: cuda
default_run_command: python -m uvicorn api.src.main:app --host 0.0.0.0 --port {{port}} --log-level debug
```
- On the Deployments page, select the Kokoro-82M model, choose the newly added `kokoro-custom` backend, and set Name to one of the supported keys (e.g., `kokoro`).
Known Limitations for Name
In kokoro-fastapi, the model_name is restricted to the keys below; other values will result in an "unsupported" error.
"models": {
"tts-1": "kokoro-v1_0",
"tts-1-hd": "kokoro-v1_0",
"kokoro": "kokoro-v1_0"
}
Therefore, restrict the Name during deployment to one of these supported keys.