
FAQ

Support Matrix

Hybrid Cluster Support

GPUStack supports a mix of Linux, Windows, and macOS nodes, as well as x86_64 and arm64 architectures. It also supports a variety of GPUs, including NVIDIA, Apple Metal, AMD, Ascend, Hygon, and Moore Threads.

Distributed Inference Support

Single-Node Multi-GPU

  • llama-box (Image Generation models are not supported)
  • vLLM
  • MindIE
  • vox-box

Multi-Node Multi-GPU

  • llama-box
  • vLLM
  • MindIE

Heterogeneous-Node Multi-GPU

  • llama-box

Tip

Related documentation:

vLLM: Distributed Inference and Serving

llama-box: Distributed LLM inference with llama.cpp

Installation

How can I change the default GPUStack port?

By default, the GPUStack server uses port 80. You can change it using one of the following methods:

Script Installation

  • Linux
sudo vim /etc/systemd/system/gpustack.service

Add the --port parameter:

ExecStart=/root/.local/bin/gpustack start --port 9090

Save and restart GPUStack:

sudo systemctl daemon-reload && sudo systemctl restart gpustack
  • macOS
sudo launchctl bootout system /Library/LaunchDaemons/ai.gpustack.plist
sudo vim /Library/LaunchDaemons/ai.gpustack.plist

Add the --port parameter:

  <array>
    <string>/Users/gpustack/.local/bin/gpustack</string>
    <string>start</string>
    <string>--port</string>
    <string>9090</string>
  </array>

Save and start GPUStack:

sudo launchctl bootstrap system /Library/LaunchDaemons/ai.gpustack.plist
  • Windows
nssm edit GPUStack

Add the --port parameter after start:

start --port 9090

Save and restart GPUStack:

Restart-Service -Name "GPUStack"

Docker Installation

Add the --port parameter at the end of the docker run command, as shown below:

docker run -d --name gpustack \
    --restart=unless-stopped \
    --gpus all \
    --network=host \
    --ipc=host \
    -v gpustack-data:/var/lib/gpustack \
    gpustack/gpustack \
    --port 9090

If the host network is not used, only the mapped host port needs to be modified:

docker run -d --name gpustack \
    --restart=unless-stopped \
    --gpus all \
    -p 9090:80 \
    -p 10150:10150 \
    -p 40064-40131:40064-40131 \
    --ipc=host \
    -v gpustack-data:/var/lib/gpustack \
    gpustack/gpustack \
    --worker-ip your_host_ip

pip Installation

Add the --port parameter at the end of the gpustack start command:

gpustack start --port 9090

How can I change the registered worker name?

You can set it to a custom name using the --worker-name parameter when running GPUStack:

Script Installation

  • Linux
sudo vim /etc/systemd/system/gpustack.service

Add the --worker-name parameter:

ExecStart=/root/.local/bin/gpustack start --worker-name New-Name

Save and restart GPUStack:

sudo systemctl daemon-reload && sudo systemctl restart gpustack
  • macOS
sudo launchctl bootout system /Library/LaunchDaemons/ai.gpustack.plist
sudo vim /Library/LaunchDaemons/ai.gpustack.plist

Add the --worker-name parameter:

  <array>
    <string>/Users/gpustack/.local/bin/gpustack</string>
    <string>start</string>
    <string>--worker-name</string>
    <string>New-Name</string>
  </array>

Save and start GPUStack:

sudo launchctl bootstrap system /Library/LaunchDaemons/ai.gpustack.plist
  • Windows
nssm edit GPUStack

Add the --worker-name parameter after start:

start --worker-name New-Name

Save and restart GPUStack:

Restart-Service -Name "GPUStack"

Docker Installation

Add the --worker-name parameter at the end of the docker run command, as shown below:

docker run -d --name gpustack \
    --restart=unless-stopped \
    --gpus all \
    --network=host \
    --ipc=host \
    -v gpustack-data:/var/lib/gpustack \
    gpustack/gpustack \
    --worker-name New-Name

pip Installation

Add the --worker-name parameter at the end of the gpustack start command:

gpustack start --worker-name New-Name

How can I change the registered worker IP?

You can set it to a custom IP using the --worker-ip parameter when running GPUStack:

Script Installation

  • Linux
sudo vim /etc/systemd/system/gpustack.service

Add the --worker-ip parameter:

ExecStart=/root/.local/bin/gpustack start --worker-ip xx.xx.xx.xx

Save and restart GPUStack:

sudo systemctl daemon-reload && sudo systemctl restart gpustack
  • macOS
sudo launchctl bootout system /Library/LaunchDaemons/ai.gpustack.plist
sudo vim /Library/LaunchDaemons/ai.gpustack.plist

Add the --worker-ip parameter:

  <array>
    <string>/Users/gpustack/.local/bin/gpustack</string>
    <string>start</string>
    <string>--worker-ip</string>
    <string>xx.xx.xx.xx</string>
  </array>

Save and start GPUStack:

sudo launchctl bootstrap system /Library/LaunchDaemons/ai.gpustack.plist
  • Windows
nssm edit GPUStack

Add the --worker-ip parameter after start:

start --worker-ip xx.xx.xx.xx

Save and restart GPUStack:

Restart-Service -Name "GPUStack"

Docker Installation

Add the --worker-ip parameter at the end of the docker run command, as shown below:

docker run -d --name gpustack \
    --restart=unless-stopped \
    --gpus all \
    --network=host \
    --ipc=host \
    -v gpustack-data:/var/lib/gpustack \
    gpustack/gpustack \
    --worker-ip xx.xx.xx.xx

pip Installation

Add the --worker-ip parameter at the end of the gpustack start command:

gpustack start --worker-ip xx.xx.xx.xx

Where are GPUStack's data stored?

Script Installation

  • Linux

The default path is as follows:

/var/lib/gpustack

You can set it to a custom path using the --data-dir parameter when running GPUStack:

sudo vim /etc/systemd/system/gpustack.service

Add the --data-dir parameter:

ExecStart=/root/.local/bin/gpustack start --data-dir /data/gpustack-data

Save and restart GPUStack:

sudo systemctl daemon-reload && sudo systemctl restart gpustack
  • macOS

The default path is as follows:

/var/lib/gpustack

You can set it to a custom path using the --data-dir parameter when running GPUStack:

sudo launchctl bootout system /Library/LaunchDaemons/ai.gpustack.plist
sudo vim /Library/LaunchDaemons/ai.gpustack.plist
  <array>
    <string>/Users/gpustack/.local/bin/gpustack</string>
    <string>start</string>
    <string>--data-dir</string>
    <string>/Users/gpustack/data/gpustack-data</string>
  </array>

Save and start GPUStack:

sudo launchctl bootstrap system /Library/LaunchDaemons/ai.gpustack.plist
  • Windows

The default path is as follows:

"$env:APPDATA\gpustack"

You can set it to a custom path using the --data-dir parameter when running GPUStack:

nssm edit GPUStack

Add the --data-dir parameter after start:

start --data-dir D:\gpustack-data

Save and restart GPUStack:

Restart-Service -Name "GPUStack"

Docker Installation

When running the GPUStack container, the Docker volume is mounted using the -v parameter. The default data path is under the Docker data directory, in the volumes subdirectory; the default path is:

/var/lib/docker/volumes/gpustack-data/_data

You can check it with the following commands:

docker volume ls
docker volume inspect gpustack-data

If you need to change it to a custom path, modify the mount configuration when running the container. For example, to mount the host directory /data/gpustack:

docker run -d --name gpustack \
    --restart=unless-stopped \
    --gpus all \
    --network=host \
    --ipc=host \
    -v /data/gpustack:/var/lib/gpustack  \
    gpustack/gpustack

pip Installation

Add the --data-dir parameter at the end of the gpustack start command:

gpustack start --data-dir /data/gpustack-data

Where are model files stored?

Script Installation

  • Linux

The default path is as follows:

/var/lib/gpustack/cache

You can set it to a custom path using the --cache-dir parameter when running GPUStack:

sudo vim /etc/systemd/system/gpustack.service

Add the --cache-dir parameter:

ExecStart=/root/.local/bin/gpustack start --cache-dir /data/model-cache

Save and restart GPUStack:

sudo systemctl daemon-reload && sudo systemctl restart gpustack
  • macOS

The default path is as follows:

/var/lib/gpustack/cache

You can set it to a custom path using the --cache-dir parameter when running GPUStack:

sudo launchctl bootout system /Library/LaunchDaemons/ai.gpustack.plist
sudo vim /Library/LaunchDaemons/ai.gpustack.plist
  <array>
    <string>/Users/gpustack/.local/bin/gpustack</string>
    <string>start</string>
    <string>--cache-dir</string>
    <string>/Users/gpustack/data/model-cache</string>
  </array>

Save and start GPUStack:

sudo launchctl bootstrap system /Library/LaunchDaemons/ai.gpustack.plist
  • Windows

The default path is as follows:

"$env:APPDATA\gpustack\cache"

You can set it to a custom path using the --cache-dir parameter when running GPUStack:

nssm edit GPUStack

Add the --cache-dir parameter after start:

start --cache-dir D:\model-cache

Save and restart GPUStack:

Restart-Service -Name "GPUStack"

Docker Installation

When running the GPUStack container, the Docker volume is mounted using the -v parameter. The default cache path is under the Docker data directory, in the volumes subdirectory; the default path is:

/var/lib/docker/volumes/gpustack-data/_data/cache

You can check it with the following commands:

docker volume ls
docker volume inspect gpustack-data

If you need to change it to a custom path, modify the mount configuration when running the container.

Note: If the data directory is already mounted, the cache directory should not be mounted inside the data directory. You need to specify a different path using the --cache-dir parameter.

For example, to mount the host directory /data/model-cache:

docker run -d --name gpustack \
    --restart=unless-stopped \
    --gpus all \
    --network=host \
    --ipc=host \
    -v /data/gpustack:/var/lib/gpustack  \
    -v /data/model-cache:/data/model-cache \
    gpustack/gpustack \
    --cache-dir /data/model-cache

pip Installation

Add the --cache-dir parameter at the end of the gpustack start command:

gpustack start --cache-dir /data/model-cache

What parameters can I set when starting GPUStack?

Please refer to: gpustack start

Upgrade

How can I upgrade the built-in vLLM?

GPUStack supports multiple versions of inference backends. When deploying a model, you can specify the backend version in Edit Model → Advanced → Backend Version to use a newly released vLLM version. GPUStack will automatically create a virtual environment using pipx to install it:

[Screenshot: pin-vllm-backend-version]

If you still need to upgrade the built-in vLLM, you can upgrade vLLM on all worker nodes using the following method:

Script Installation

pipx runpip gpustack list | grep vllm
pipx runpip gpustack install -U vllm

Docker Installation

docker exec -it gpustack bash
pip list | grep vllm
pip install -U vllm

pip Installation

pip list | grep vllm
pip install -U vllm

How can I upgrade the built-in Transformers?

Script Installation

pipx runpip gpustack list | grep transformers
pipx runpip gpustack install -U transformers

Docker Installation

docker exec -it gpustack bash
pip list | grep transformers
pip install -U transformers

pip Installation

pip list | grep transformers
pip install -U transformers

How can I upgrade the built-in llama-box?

GPUStack supports multiple versions of inference backends. When deploying a model, you can specify the backend version in Edit Model → Advanced → Backend Version to use a newly released llama-box version. GPUStack will automatically download and configure it:

[Screenshot: pin-llama-box-backend-version]

If you are using distributed inference, you should upgrade llama-box on all worker nodes using the following method:

Download a newly released llama-box binary from llama-box releases.

Stop GPUStack first, then replace the binary, and finally restart GPUStack. You can find the binary's location by checking the running process, for example:

Script & pip Installation

ps -ef | grep llama-box

Docker Installation

docker exec -it gpustack bash
ps -ef | grep llama-box

View Logs

How can I view the GPUStack logs?

The GPUStack logs provide information on the startup status, calculated model resource requirements, and more. Refer to Troubleshooting for how to view the GPUStack logs.
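
For example, with a Docker installation (assuming the container is named gpustack, as in the examples in this FAQ), you can follow the container logs directly:

docker logs -f --tail 200 gpustack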

How can I enable debug mode in GPUStack?

You can temporarily enable debug mode without interrupting the GPUStack service. Refer to Troubleshooting for guidance.

To enable debug mode persistently, add the --debug parameter to both the server and the workers when running GPUStack:

Script Installation

  • Linux
sudo vim /etc/systemd/system/gpustack.service
ExecStart=/root/.local/bin/gpustack start --debug

Save and restart GPUStack:

sudo systemctl daemon-reload && sudo systemctl restart gpustack
  • macOS
sudo launchctl bootout system /Library/LaunchDaemons/ai.gpustack.plist
sudo vim /Library/LaunchDaemons/ai.gpustack.plist
  <array>
    <string>/Users/gpustack/.local/bin/gpustack</string>
    <string>start</string>
    <string>--debug</string>
  </array>
sudo launchctl bootstrap system /Library/LaunchDaemons/ai.gpustack.plist
  • Windows
nssm edit GPUStack

Add the --debug parameter after start:

start --debug

Save and restart GPUStack:

Restart-Service -Name "GPUStack"

Docker Installation

Add the --debug parameter at the end of the docker run command, as shown below:

docker run -d --name gpustack \
    --restart=unless-stopped \
    --gpus all \
    --network=host \
    --ipc=host \
    -v gpustack-data:/var/lib/gpustack \
    gpustack/gpustack \
    --debug

pip Installation

Add the --debug parameter at the end of the gpustack start command:

gpustack start --debug

How can I view the RPC server logs?

The RPC server is used for distributed inference of GGUF models. If a model starts abnormally or there are issues with distributed inference, you can check the RPC server logs on the corresponding node:

Script Installation

  • Linux & macOS

The default path is as follows. If the --data-dir or --log-dir parameters are set, please modify it to the actual path you have configured:

tail -200f /var/lib/gpustack/log/rpc_server/gpu-0.log

Each GPU corresponds to an RPC Server. For other GPU indices, modify it to the actual index:

tail -200f /var/lib/gpustack/log/rpc_server/gpu-n.log
  • Windows

The default path is as follows. If the --data-dir or --log-dir parameters are set, please modify it to the actual path you have configured:

Get-Content "$env:APPDATA\gpustack\log\rpc_server\gpu-0.log" -Tail 200 -Wait

Each GPU corresponds to an RPC Server. For other GPU indices, modify it to the actual index:

Get-Content "$env:APPDATA\gpustack\log\rpc_server\gpu-n.log" -Tail 200 -Wait

Docker Installation

The default path is as follows. If the --data-dir or --log-dir parameters are set, please modify it to the actual path you have configured:

docker exec -it gpustack tail -200f /var/lib/gpustack/log/rpc_server/gpu-0.log

Each GPU corresponds to an RPC Server. For other GPU indices, modify it to the actual index:

docker exec -it gpustack tail -200f /var/lib/gpustack/log/rpc_server/gpu-n.log

Where are the model logs stored?

The model instance logs are stored in the /var/lib/gpustack/log/serve/ directory of the corresponding worker node or worker container, with the log file named id.log, where id is the model instance ID. If the --data-dir or --log-dir parameter is set, the logs will be stored in the actual path specified by the parameter.
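
For example, to follow the logs of the model instance with ID 1 on a Linux worker (replace 1 with the actual model instance ID, and adjust the path if --data-dir or --log-dir is set):

tail -200f /var/lib/gpustack/log/serve/1.log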

How can I enable the backend debug mode?

llama-box backend (GGUF models)

Add the --verbose parameter in Edit Model → Advanced → Backend Parameters and recreate the model instance:

[Screenshot: enable-llama-box-debug-mode]

vLLM backends (Safetensors models)

Add the VLLM_LOGGING_LEVEL=DEBUG environment variable in Edit Model → Advanced → Environment Variables and recreate the model instance:

[Screenshot: enable-vllm-debug-mode]

Managing Workers

What should I do if the worker is stuck in Unreachable state?

Try accessing the URL shown in the error from the server. If the server is running in a container, enter the server container to execute the command:

curl http://10.10.10.1:10150/healthz

What should I do if the worker is stuck in NotReady state?

Check the GPUStack logs on the corresponding worker here. If there are no abnormalities in the logs, verify that the time zones and system clocks are consistent across all nodes.
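
For example, on Linux nodes with systemd, you can compare the time zone and clock synchronization status reported by timedatectl on each node:

timedatectl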

Detect GPUs

Why did it fail to detect the Ascend NPUs?

Check if npu-smi can be executed in the container:

docker exec -it gpustack bash
npu-smi info

When the following error occurs, it indicates that other containers are also mounting the NPU device, and sharing is not supported:

dcmi model initialized failed, because the device is used. ret is -8020

Check if any containers on the host have mounted NPU devices:

if [ $(docker ps | wc -l) -gt 1 ]; then docker ps | grep -v CONT | awk '{print $1}' | xargs docker inspect --format='{{printf "%.5s" .ID}} {{range .HostConfig.Devices}}{{.PathOnHost}} {{end}}' | sort -k2; fi; echo ok

Only mount NPUs that are not mounted by other containers, and specify them using the --device parameter:

docker run -d --name gpustack \
    --restart=unless-stopped \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    --network=host \
    --ipc=host \
    -v gpustack-data:/var/lib/gpustack \
    gpustack/gpustack:latest-npu

Managing Models

How can I deploy the model?

How can I deploy the model from Hugging Face?

To deploy models from Hugging Face, the server node and the worker nodes where the model instances are scheduled must have access to Hugging Face, or you can use a mirror.

For example, configure the hf-mirror.com mirror:

Script Installation

  • Linux

Create or edit /etc/default/gpustack on all nodes, and add the HF_ENDPOINT environment variable to use https://hf-mirror.com as the Hugging Face mirror:

vim /etc/default/gpustack
HF_ENDPOINT=https://hf-mirror.com

Save and restart GPUStack:

systemctl restart gpustack

Docker Installation

Add the HF_ENDPOINT environment variable when running the container, as shown below:

docker run -d --name gpustack \
    --restart=unless-stopped \
    --gpus all \
    -e HF_ENDPOINT=https://hf-mirror.com \
    --network=host \
    --ipc=host \
    -v gpustack-data:/var/lib/gpustack \
    gpustack/gpustack

pip Installation

HF_ENDPOINT=https://hf-mirror.com gpustack start

How can I deploy the model from Local Path?

When deploying models from Local Path, it is recommended to upload the model files to each node and keep the same absolute path. Alternatively, schedule the model instance to the nodes that have the model files via manual scheduling or label selection, or mount shared storage across all nodes.

When deploying GGUF models from Local Path, the path must point to the absolute path of the .gguf file. For sharded model files, use the absolute path of the first .gguf file (00001). If using a container installation, the model files must be mounted into the container, and the path should point to the container's path, not the host's path (see the Docker example below).

When deploying Safetensors models from Local Path, the path must point to the absolute path of the model directory, which contains *.safetensors, config.json, and other files. If using a container installation, the model files must be mounted into the container, and the path should point to the container's path, not the host's path.

[Screenshot: deploy-model-from-local-path]
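
For example, with a Docker installation, a host directory containing the model files (here assumed to be /data/models) can be mounted into the container, and the Local Path then points to the path inside the container:

docker run -d --name gpustack \
    --restart=unless-stopped \
    --gpus all \
    --network=host \
    --ipc=host \
    -v gpustack-data:/var/lib/gpustack \
    -v /data/models:/data/models \
    gpustack/gpustack

When deploying, set the Local Path to, for example, /data/models/model-name.gguf for a GGUF model or /data/models/model-directory for a Safetensors model.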

How can I deploy a locally downloaded Ollama model?

Use the following command to find the full path of the model file and deploy it via the Local Path. The example below uses deepseek-r1:14b-qwen-distill-q4_K_M, be sure to replace it with your actual Ollama model name:

ollama show deepseek-r1:14b-qwen-distill-q4_K_M --modelfile | grep FROM | grep blobs | sed 's/^FROM[[:space:]]*//'

[Screenshot: deploy-downloaded-ollama-model]

What should I do if the model is stuck in Pending state?

Pending means that there are currently no workers meeting the model's requirements. Hover over the Pending status to view the reason.

First, check the Resources → Workers section to ensure that the worker status is Ready.

Then, for different backends:

  • llama-box

llama-box uses the GGUF Parser to calculate the model’s memory requirements. You need to ensure that the allocatable memory is greater than the calculated memory requirements of the model. Note that even if other models are in an Error or Downloading state, the GPU memory has already been allocated. If you are unsure how much GPU memory the model requires, you can use the GGUF Parser to calculate it.

The context size for the model also affects the required GPU memory. You can adjust the --ctx-size parameter to set a smaller context. In GPUStack, if this parameter is not set, its default value is 8192. If it is specified in the backend parameters, the actual setting will take effect.

You can adjust it to a smaller context in Edit Model → Advanced → Backend Parameters as needed, for example, --ctx-size=2048. However, keep in mind that the max tokens for each inference request is influenced by both the --ctx-size and --parallel parameters: max tokens = context size / parallel

The default value of --parallel is 4, so in this case, the max tokens would be 512. If the token count exceeds the max tokens, the inference output will be truncated.

On the other hand, the --parallel parameter represents the number of parallel sequences to decode, which can roughly be considered as a setting for the model’s concurrent request handling.

Therefore, it is important to appropriately set the --ctx-size and --parallel parameters, ensuring that the max tokens for a single request is within the limits and that the available GPU memory can support the specified context size.

If you need to align with Ollama’s configuration, you can refer to the following examples:

Set the following parameters in Edit Model → Advanced → Backend Parameters:

--ctx-size=8192
--parallel=4

If your GPU memory is insufficient, try launching with a lower configuration:

--ctx-size=2048
--parallel=1
  • vLLM

vLLM by default requires more than 90% of each GPU's memory to be available (controlled by the --gpu-memory-utilization parameter). Ensure that the allocatable GPU memory exceeds 90%. Note that even if other models are in an Error or Downloading state, their GPU memory has already been allocated.

If all GPUs have more than 90% of their memory available but the model still shows Pending, the GPU memory is simply insufficient for the model. For safetensors models in BF16 format, the required GPU memory (GB) can be estimated as:

GPU Memory (GB) = Number of Parameters (B) * 2 * 1.2 + 2
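
For example, by this estimate a 14B-parameter BF16 model requires roughly 14 * 2 * 1.2 + 2 ≈ 35.6 GB of GPU memory.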

If the allocatable GPU memory is less than 90%, but you are sure the model can run with a lower allocation, you can adjust the --gpu-memory-utilization parameter. For example, add --gpu-memory-utilization=0.5 in Edit Model → Advanced → Backend Parameters to allocate 50% of the GPU memory.

Note: If the model encounters an error after running and the logs show CUDA: out of memory, it means the allocated GPU memory is insufficient. You will need to further adjust --gpu-memory-utilization, add more resources, or deploy a smaller model.

The context size for the model also affects the required GPU memory. You can adjust the --max-model-len parameter to set a smaller context. In GPUStack, if this parameter is not set, its default value is 8192. If it is specified in the backend parameters, the actual setting will take effect.

You can adjust it to a smaller context as needed, for example, --max-model-len=2048. However, keep in mind that the max tokens for each inference request cannot exceed the value of --max-model-len. Therefore, setting a very small context may cause inference truncation.

The --enforce-eager parameter also helps reduce GPU memory usage. It forces vLLM to run the model in eager execution mode, meaning that operations are executed immediately as they are called rather than being deferred and optimized in graph-based execution. This makes execution easier to debug, but it can also reduce performance due to the lack of optimizations provided by graph execution.

What should I do if the model is stuck in Scheduled state?

Try restarting the GPUStack service on the worker where the model instance is scheduled. If the issue persists, check the worker logs here to analyze the cause.

What should I do if the model is stuck in Error state?

Hover over the Error status to view the reason. If there is a View More button, click it to check the error messages in the model logs and analyze the cause of the error.

How can I resolve the error *.so: cannot open shared object file: No such file or directory?

If an error occurs during model startup indicating that a .so file cannot be opened, for example:

llama-box: error while loading shared libraries: libcudart.so.12: cannot open shared object file: No such file or directory

The cause is that GPUStack does not pick up the LD_LIBRARY_PATH environment variable, which may be because the variable is missing or the toolkit (such as CUDA or CANN) was not configured when GPUStack was installed.

To check if the environment variable is set:

echo $LD_LIBRARY_PATH

If not configured, here’s an example configuration for CUDA.

Ensure that nvidia-smi is executable and that the NVIDIA driver version is 550 or later:

nvidia-smi

Configure the CUDA environment variables. If not installed, install CUDA 12.4 or later:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/targets/x86_64-linux/lib
export PATH=$PATH:/usr/local/cuda/bin
echo $LD_LIBRARY_PATH
echo $PATH

Create or edit /etc/default/gpustack and add the PATH and LD_LIBRARY_PATH environment variables:

vim /etc/default/gpustack
LD_LIBRARY_PATH=......
PATH=......

Save and restart GPUStack:

systemctl restart gpustack

Why did it fail to load the model when using the local path?

When deploying a model using Local Path and encountering a failed to load model error, check whether the model files exist on the node that the model instance is scheduled to and whether the absolute path is correct.

For GGUF models, you need to specify the absolute path to the .gguf file. For sharded models, use the absolute path to the first .gguf file (typically 00001).

If using Docker installation, the model files must be mounted into the container. Make sure the path you provide is the one inside the container, not the host path.

[Screenshot: deploy-model-from-local-path]

Why doesn’t deleting a model free up disk space?

This is to avoid re-downloading the model when redeploying. You need to clean up the model files manually in Resources → Model Files.

Why does each GPU have a llama-box process by default?

This process is the RPC server used for llama-box's distributed inference. If you are sure that you do not need distributed inference with llama-box, you can disable the RPC servers by adding the --disable-rpc-servers parameter when running GPUStack, as shown below.
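
For example, with a Docker installation, append the parameter at the end of the docker run command, following the same pattern as the other examples in this FAQ:

docker run -d --name gpustack \
    --restart=unless-stopped \
    --gpus all \
    --network=host \
    --ipc=host \
    -v gpustack-data:/var/lib/gpustack \
    gpustack/gpustack \
    --disable-rpc-servers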

Backend Parameters

How can I know the purpose of the backend parameters?

How can I set the model’s context length?

llama-box backend (GGUF models)

GPUStack sets the default context length for models to 8K. You can customize the context length using the --ctx-size parameter, but it cannot exceed the model’s maximum context length:

[Screenshot: set-the-model-context-length-for-llama-box]
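
For example, to set a 16K context (assuming the model supports it), add --ctx-size=16384 in Edit Model → Advanced → Backend Parameters.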

If editing an existing deployment, save the change and then recreate the model instance for it to take effect.

vLLM backend (Safetensors models)

GPUStack sets the default context length for models to 8K. You can customize the context length using the --max-model-len parameter, but it cannot exceed the model’s maximum context length:

[Screenshot: set-the-model-context-length-for-vllm]
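
For example, to set a 16K context (assuming the model supports it), add --max-model-len=16384 in Edit Model → Advanced → Backend Parameters.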

MindIE backend (Safetensors models)

GPUStack sets the default context length for models to 8K. You can customize the context length using the --max-seq-len parameter, but it cannot exceed the model’s maximum context length:

[Screenshot: set-the-model-context-length-for-mindie]
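
For example, to set a 16K context (assuming the model supports it), add --max-seq-len=16384 in Edit Model → Advanced → Backend Parameters.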

If editing an existing deployment, save the change and then recreate the model instance for it to take effect.

Using Models

Using Vision Language Models

How can I resolve the error At most 1 image(s) may be provided in one request?

This is a limitation of vLLM. You can adjust the --limit-mm-per-prompt parameter in Edit Model → Advanced → Backend Parameters as needed. For example, --limit-mm-per-prompt=image=4 means that it supports up to 4 images per inference request, see the details here.

Managing GPUStack

How can I manage the GPUStack service?

Script Installation

  • Linux

Stop GPUStack:

sudo systemctl stop gpustack

Start GPUStack:

sudo systemctl start gpustack

Restart GPUStack:

sudo systemctl restart gpustack
  • macOS

Stop GPUStack:

sudo launchctl bootout system /Library/LaunchDaemons/ai.gpustack.plist

Start GPUStack:

sudo launchctl bootstrap system /Library/LaunchDaemons/ai.gpustack.plist

Restart GPUStack:

sudo launchctl bootout system /Library/LaunchDaemons/ai.gpustack.plist
sudo launchctl bootstrap system /Library/LaunchDaemons/ai.gpustack.plist
  • Windows

Run PowerShell as administrator (avoid using PowerShell ISE).

Stop GPUStack:

Stop-Service -Name "GPUStack"

Start GPUStack:

Start-Service -Name "GPUStack"

Restart GPUStack:

Restart-Service -Name "GPUStack"

Docker Installation

Restart GPUStack container:

docker restart gpustack

How do I use GPUStack behind a proxy?

Script Installation

  • Linux & macOS

Create or edit /etc/default/gpustack and add the proxy configuration:

vim /etc/default/gpustack
http_proxy="http://username:password@proxy-server:port"
https_proxy="http://username:password@proxy-server:port"
all_proxy="socks5://username:password@proxy-server:port"
no_proxy="localhost,127.0.0.1,192.168.0.0/24,172.16.0.0/16,10.0.0.0/8"

Save and restart GPUStack:

systemctl restart gpustack

Docker Installation

Pass environment variables when running GPUStack:

docker run -e http_proxy="http://username:password@proxy-server:port" \
           -e https_proxy="http://username:password@proxy-server:port" \
           -e all_proxy="socks5://username:password@proxy-server:port" \
           -e no_proxy="localhost,127.0.0.1,192.168.0.0/24,172.16.0.0/16,10.0.0.0/8" \
           ……