FAQ
Support Matrix
Hybrid Cluster Support
GPUStack supports a mix of Linux, Windows, and macOS nodes, as well as x86_64 and arm64 architectures. It also supports various GPUs, including NVIDIA, Apple Metal, AMD, Ascend, Hygon, and Moore Threads.
Distributed Inference Support
Single-Node Multi-GPU
- llama-box (Image Generation models are not supported)
- vLLM
- MindIE
- vox-box
Multi-Node Multi-GPU
- llama-box
- vLLM
- MindIE
Heterogeneous-Node Multi-GPU
- llama-box
Tip
Related documentation:
vLLM: Distributed Inference and Serving
llama-box: Distributed LLM inference with llama.cpp
Installation
How can I change the default GPUStack port?
By default, the GPUStack server uses port 80. You can change it using the following method:
Script Installation
- Linux
sudo vim /etc/systemd/system/gpustack.service
Add the --port parameter:
ExecStart=/root/.local/bin/gpustack start --port 9090
Save and restart GPUStack:
sudo systemctl daemon-reload && sudo systemctl restart gpustack
- macOS
sudo launchctl bootout system /Library/LaunchDaemons/ai.gpustack.plist
sudo vim /Library/LaunchDaemons/ai.gpustack.plist
Add the --port parameter:
<array>
<string>/Users/gpustack/.local/bin/gpustack</string>
<string>start</string>
<string>--port</string>
<string>9090</string>
</array>
Save and start GPUStack:
sudo launchctl bootstrap system /Library/LaunchDaemons/ai.gpustack.plist
- Windows
nssm edit GPUStack
Add the parameter after start:
start --port 9090
Save and restart GPUStack:
Restart-Service -Name "GPUStack"
Docker Installation
Add the --port parameter at the end of the docker run command, as shown below:
docker run -d --name gpustack \
--restart=unless-stopped \
--gpus all \
--network=host \
--ipc=host \
-v gpustack-data:/var/lib/gpustack \
gpustack/gpustack \
--port 9090
If the host network is not used, only the mapped host port needs to be modified:
docker run -d --name gpustack \
--restart=unless-stopped \
--gpus all \
-p 9090:80 \
-p 10150:10150 \
-p 40064-40131:40064-40131 \
--ipc=host \
-v gpustack-data:/var/lib/gpustack \
gpustack/gpustack \
--worker-ip your_host_ip
pip Installation
Add the --port parameter at the end of the gpustack start command:
gpustack start --port 9090
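After changing the port, you can verify that the server responds on it, for example (a quick check from the server host, assuming the new port is 9090):
curl http://localhost:9090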
How can I change the registered worker name?
You can set it to a custom name using the --worker-name parameter when running GPUStack:
Script Installation
- Linux
sudo vim /etc/systemd/system/gpustack.service
Add the --worker-name parameter:
ExecStart=/root/.local/bin/gpustack start --worker-name New-Name
Save and restart GPUStack:
sudo systemctl daemon-reload && sudo systemctl restart gpustack
- macOS
sudo launchctl bootout system /Library/LaunchDaemons/ai.gpustack.plist
sudo vim /Library/LaunchDaemons/ai.gpustack.plist
Add the --worker-name parameter:
<array>
<string>/Users/gpustack/.local/bin/gpustack</string>
<string>start</string>
<string>--worker-name</string>
<string>New-Name</string>
</array>
Save and start GPUStack:
sudo launchctl bootstrap system /Library/LaunchDaemons/ai.gpustack.plist
- Windows
nssm edit GPUStack
Add the parameter after start:
start --worker-name New-Name
Save and restart GPUStack:
Restart-Service -Name "GPUStack"
Docker Installation
Add the --worker-name parameter at the end of the docker run command, as shown below:
docker run -d --name gpustack \
--restart=unless-stopped \
--gpus all \
--network=host \
--ipc=host \
-v gpustack-data:/var/lib/gpustack \
gpustack/gpustack \
--worker-name New-Name
pip Installation
Add the --worker-name parameter at the end of the gpustack start command:
gpustack start --worker-name New-Name
How can I change the registered worker IP?
You can set it to a custom IP using the --worker-ip parameter when running GPUStack:
Script Installation
- Linux
sudo vim /etc/systemd/system/gpustack.service
Add the --worker-ip parameter:
ExecStart=/root/.local/bin/gpustack start --worker-ip xx.xx.xx.xx
Save and restart GPUStack:
sudo systemctl daemon-reload && sudo systemctl restart gpustack
- macOS
sudo launchctl bootout system /Library/LaunchDaemons/ai.gpustack.plist
sudo vim /Library/LaunchDaemons/ai.gpustack.plist
Add the --worker-ip parameter:
<array>
<string>/Users/gpustack/.local/bin/gpustack</string>
<string>start</string>
<string>--worker-ip</string>
<string>xx.xx.xx.xx</string>
</array>
Save and start GPUStack:
sudo launchctl bootstrap system /Library/LaunchDaemons/ai.gpustack.plist
- Windows
nssm edit GPUStack
Add the parameter after start:
start --worker-ip xx.xx.xx.xx
Save and restart GPUStack:
Restart-Service -Name "GPUStack"
Docker Installation
Add the --worker-ip parameter at the end of the docker run command, as shown below:
docker run -d --name gpustack \
--restart=unless-stopped \
--gpus all \
--network=host \
--ipc=host \
-v gpustack-data:/var/lib/gpustack \
gpustack/gpustack \
--worker-ip xx.xx.xx.xx
pip Installation
Add the --worker-ip parameter at the end of the gpustack start command:
gpustack start --worker-ip xx.xx.xx.xx
Where are GPUStack's data stored?
Script Installation
- Linux
The default path is as follows:
/var/lib/gpustack
You can set it to a custom path using the --data-dir parameter when running GPUStack:
sudo vim /etc/systemd/system/gpustack.service
Add the --data-dir parameter:
ExecStart=/root/.local/bin/gpustack start --data-dir /data/gpustack-data
Save and restart GPUStack:
sudo systemctl daemon-reload && sudo systemctl restart gpustack
- macOS
The default path is as follows:
/var/lib/gpustack
You can set it to a custom path using the --data-dir parameter when running GPUStack:
sudo launchctl bootout system /Library/LaunchDaemons/ai.gpustack.plist
sudo vim /Library/LaunchDaemons/ai.gpustack.plist
<array>
<string>/Users/gpustack/.local/bin/gpustack</string>
<string>start</string>
<string>--data-dir</string>
<string>/Users/gpustack/data/gpustack-data</string>
</array>
Save and start GPUStack:
sudo launchctl bootstrap system /Library/LaunchDaemons/ai.gpustack.plist
- Windows
The default path is as follows:
"$env:APPDATA\gpustack"
You can set it to a custom path using the --data-dir parameter when running GPUStack:
nssm edit GPUStack
Add the parameter after start:
start --data-dir D:\gpustack-data
Save and restart GPUStack:
Restart-Service -Name "GPUStack"
Docker Installation
When running the GPUStack container, a Docker volume is mounted using the -v parameter. The default data path is under the Docker data directory, in the volumes subdirectory, which by default is:
/var/lib/docker/volumes/gpustack-data/_data
You can check it by the following method:
docker volume ls
docker volume inspect gpustack-data
If you need to change it to a custom path, modify the mount configuration when running the container. For example, to mount the host directory /data/gpustack:
docker run -d --name gpustack \
--restart=unless-stopped \
--gpus all \
--network=host \
--ipc=host \
-v /data/gpustack:/var/lib/gpustack \
gpustack/gpustack
pip Installation
Add the --data-dir parameter at the end of the gpustack start command:
gpustack start --data-dir /data/gpustack-data
Where are model files stored?
Script Installation
- Linux
The default path is as follows:
/var/lib/gpustack/cache
You can set it to a custom path using the --cache-dir parameter when running GPUStack:
sudo vim /etc/systemd/system/gpustack.service
Add the --cache-dir parameter:
ExecStart=/root/.local/bin/gpustack start --cache-dir /data/model-cache
Save and restart GPUStack:
sudo systemctl daemon-reload && sudo systemctl restart gpustack
- macOS
The default path is as follows:
/var/lib/gpustack/cache
You can set it to a custom path using the --cache-dir parameter when running GPUStack:
sudo launchctl bootout system /Library/LaunchDaemons/ai.gpustack.plist
sudo vim /Library/LaunchDaemons/ai.gpustack.plist
<array>
<string>/Users/gpustack/.local/bin/gpustack</string>
<string>start</string>
<string>--cache-dir</string>
<string>/Users/gpustack/data/model-cache</string>
</array>
Save and start GPUStack:
sudo launchctl bootstrap system /Library/LaunchDaemons/ai.gpustack.plist
- Windows
The default path is as follows:
"$env:APPDATA\gpustack\cache"
You can set it to a custom path using the --cache-dir parameter when running GPUStack:
nssm edit GPUStack
Add the parameter after start:
start --cache-dir D:\model-cache
Save and restart GPUStack:
Restart-Service -Name "GPUStack"
Docker Installation
When running the GPUStack container, a Docker volume is mounted using the -v parameter. The default cache path is under the Docker data directory, in the volumes subdirectory, which by default is:
/var/lib/docker/volumes/gpustack-data/_data/cache
You can check it by the following method:
docker volume ls
docker volume inspect gpustack-data
If you need to change it to a custom path, modify the mount configuration when running the container.
Note: If the data directory is already mounted, the cache directory should not be mounted inside the data directory. You need to specify a different path using the --cache-dir parameter.
For example, to mount the host directory /data/model-cache:
docker run -d --name gpustack \
--restart=unless-stopped \
--gpus all \
--network=host \
--ipc=host \
-v /data/gpustack:/var/lib/gpustack \
-v /data/model-cache:/data/model-cache \
gpustack/gpustack \
--cache-dir /data/model-cache
pip Installation
Add the --cache-dir parameter at the end of the gpustack start command:
gpustack start --cache-dir /data/model-cache
What parameters can I set when starting GPUStack?
Please refer to: gpustack start
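You can also list the available startup flags directly from the command line, for example (assuming the gpustack CLI is on your PATH and supports the standard --help flag):
gpustack start --help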
Upgrade
How can I upgrade the built-in vLLM?
GPUStack supports multiple versions of inference backends. When deploying a model, you can specify the backend version in Edit Model → Advanced → Backend Version to use a newly released vLLM version. GPUStack will automatically create a virtual environment using pipx to install it.
If you still need to upgrade the built-in vLLM, you can upgrade vLLM on all worker nodes using the following method:
Script Installation
pipx runpip gpustack list | grep vllm
pipx runpip gpustack install -U vllm
Docker Installation
docker exec -it gpustack bash
pip list | grep vllm
pip install -U vllm
pip Installation
pip list | grep vllm
pip install -U vllm
How can I upgrade the built-in Transformers?
Script Installation
pipx runpip gpustack list | grep transformers
pipx runpip gpustack install -U transformers
Docker Installation
docker exec -it gpustack bash
pip list | grep transformers
pip install -U transformers
pip Installation
pip list | grep transformers
pip install -U transformers
How can I upgrade the built-in llama-box?
GPUStack supports multiple versions of inference backends. When deploying a model, you can specify the backend version in Edit Model → Advanced → Backend Version to use a newly released llama-box version. GPUStack will automatically download and configure it.
If you are using distributed inference, you should upgrade llama-box on all worker nodes using the following method:
Download a newly released llama-box binary from llama-box releases.
Stop GPUStack first, then replace the binary, and finally restart GPUStack. You can find the location of the binary in use with the commands below; a replacement sketch for a Linux script installation follows them:
Script & pip Installation
ps -ef | grep llama-box
Docker Installation
docker exec -it gpustack bash
ps -ef | grep llama-box
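For a Linux script installation, the replacement might look like the following sketch. The binary path here is only an assumption; use the actual path reported by the ps command above, and adapt the service commands for macOS or Windows:
sudo systemctl stop gpustack
sudo cp ./llama-box /var/lib/gpustack/bin/llama-box  # assumed destination; use the path from the ps output
sudo chmod +x /var/lib/gpustack/bin/llama-box
sudo systemctl start gpustack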
View Logs
How can I view the GPUStack logs?
The GPUStack logs provide information on the startup status, calculated model resource requirements, and more. Refer to the Troubleshooting guide for how to view the GPUStack logs.
How can I enable debug mode in GPUStack?
You can temporarily enable debug mode without interrupting the GPUStack service. Refer to the Troubleshooting guide for guidance.
If you want to enable debug mode persistently, add the --debug parameter on both the server and workers when running GPUStack:
Script Installation
- Linux
sudo vim /etc/systemd/system/gpustack.service
ExecStart=/root/.local/bin/gpustack start --debug
Save and restart GPUStack:
sudo systemctl daemon-reload && sudo systemctl restart gpustack
- macOS
sudo launchctl bootout system /Library/LaunchDaemons/ai.gpustack.plist
sudo vim /Library/LaunchDaemons/ai.gpustack.plist
<array>
<string>/Users/gpustack/.local/bin/gpustack</string>
<string>start</string>
<string>--debug</string>
</array>
sudo launchctl bootstrap system /Library/LaunchDaemons/ai.gpustack.plist
- Windows
nssm edit GPUStack
Add the parameter after start:
start --debug
Save and restart GPUStack:
Restart-Service -Name "GPUStack"
Docker Installation
Add the --debug parameter at the end of the docker run command, as shown below:
docker run -d --name gpustack \
--restart=unless-stopped \
--gpus all \
--network=host \
--ipc=host \
-v gpustack-data:/var/lib/gpustack \
gpustack/gpustack \
--debug
pip Installation
Add the --debug parameter at the end of the gpustack start command:
gpustack start --debug
How can I view the RPC server logs?
The RPC server is used for distributed inference of GGUF models. If a model fails to start or there are issues with distributed inference, check the RPC server logs on the corresponding node:
Script Installation
- Linux & macOS
The default path is as follows. If the --data-dir or --log-dir parameter is set, adjust the path to the one you configured:
tail -200f /var/lib/gpustack/log/rpc_server/gpu-0.log
Each GPU has its own RPC server. For other GPU indices, replace the index in the file name accordingly:
tail -200f /var/lib/gpustack/log/rpc_server/gpu-n.log
- Windows
The default path is as follows. If the --data-dir or --log-dir parameter is set, adjust the path to the one you configured:
Get-Content "$env:APPDATA\gpustack\log\rpc_server\gpu-0.log" -Tail 200 -Wait
Each GPU has its own RPC server. For other GPU indices, replace the index in the file name accordingly:
Get-Content "$env:APPDATA\gpustack\log\rpc_server\gpu-n.log" -Tail 200 -Wait
Docker Installation
The default path is as follows. If the --data-dir or --log-dir parameter is set, adjust the path to the one you configured:
docker exec -it gpustack tail -200f /var/lib/gpustack/log/rpc_server/gpu-0.log
Each GPU has its own RPC server. For other GPU indices, replace the index in the file name accordingly:
docker exec -it gpustack tail -200f /var/lib/gpustack/log/rpc_server/gpu-n.log
Where are the model logs stored?
The model instance logs are stored in the /var/lib/gpustack/log/serve/ directory on the corresponding worker node or worker container, with the log file named id.log, where id is the model instance ID. If the --data-dir or --log-dir parameter is set, the logs are stored under the path specified by that parameter.
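For example, to follow the logs of a model instance (a sketch assuming the default data directory and a model instance ID of 1; on Windows use Get-Content as in the RPC server example above):
tail -200f /var/lib/gpustack/log/serve/1.log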
How can I enable the backend debug mode?
llama-box backend (GGUF models)
Add the --verbose parameter in Edit Model → Advanced → Backend Parameters and recreate the model instance.
vLLM backends (Safetensors models)
Add the VLLM_LOGGING_LEVEL=DEBUG environment variable in Edit Model → Advanced → Environment Variables and recreate the model instance.
Managing Workers
What should I do if the worker is stuck in the Unreachable state?
Try accessing the URL shown in the error from the server. If the server is running in a container, enter the server container and run the command:
curl http://10.10.10.1:10150/healthz
What should I do if the worker is stuck in the NotReady state?
Check the GPUStack logs on the corresponding worker here. If there are no abnormalities in the logs, verify that the time zones and system clocks are consistent across all nodes, as shown below.
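For example, a quick way to compare clocks and time zones across nodes (a sketch; timedatectl is available on systemd-based Linux distributions):
date
timedatectl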
Detect GPUs
Why did it fail to detect the Ascend NPUs?
Check whether npu-smi can be executed in the container:
docker exec -it gpustack bash
npu-smi info
If the following error occurs, it indicates that another container has also mounted the NPU device; sharing is not supported:
dcmi model initialized failed, because the device is used. ret is -8020
Check if any containers on the host have mounted NPU devices:
if [ $(docker ps | wc -l) -gt 1 ]; then docker ps | grep -v CONT | awk '{print $1}' | xargs docker inspect --format='{{printf "%.5s" .ID}} {{range .HostConfig.Devices}}{{.PathOnHost}} {{end}}' | sort -k2; fi; echo ok
Only mount NPUs that are not mounted by other containers, specifying them using the --device parameter:
docker run -d --name gpustack \
--restart=unless-stopped \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
--network=host \
--ipc=host \
-v gpustack-data:/var/lib/gpustack \
gpustack/gpustack:latest-npu
Managing Models
How can I deploy the model?
How can I deploy the model from Hugging Face?
To deploy models from Hugging Face, the server node and the worker nodes where the model instances are scheduled must have access to Hugging Face, or you can use a mirror.
For example, configure the hf-mirror.com mirror:
Script Installation
- Linux
Create or edit /etc/default/gpustack on all nodes, and add the HF_ENDPOINT environment variable to use https://hf-mirror.com as the Hugging Face mirror:
vim /etc/default/gpustack
HF_ENDPOINT=https://hf-mirror.com
Save and restart GPUStack:
systemctl restart gpustack
Docker Installation
Add the HF_ENDPOINT environment variable when running the container, as shown below:
docker run -d --name gpustack \
--restart=unless-stopped \
--gpus all \
-e HF_ENDPOINT=https://hf-mirror.com \
--network=host \
--ipc=host \
-v gpustack-data:/var/lib/gpustack \
gpustack/gpustack
pip Installation
HF_ENDPOINT=https://hf-mirror.com gpustack start
How can I deploy the model from Local Path?
When deploying models from Local Path, it is recommended to upload the model files to each node and keep the same absolute path. Alternatively, schedule the model instance to the nodes that have the model files via manual scheduling or label selection, or mount shared storage across the nodes.
When deploying GGUF models from Local Path, the path must point to the absolute path of the .gguf file. For sharded model files, use the absolute path of the first .gguf file (00001). If using a container installation, the model files must be mounted into the container, and the path should point to the container's path, not the host's path.
When deploying Safetensors models from Local Path, the path must point to the absolute path of the model directory containing the *.safetensors, config.json, and other files. If using a container installation, the model files must be mounted into the container, and the path should point to the container's path, not the host's path.
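For example, a docker run sketch that mounts a host model directory into the container at the same path (the /data/models directory is an assumption; substitute your actual model location):
docker run -d --name gpustack \
    --restart=unless-stopped \
    --gpus all \
    --network=host \
    --ipc=host \
    -v gpustack-data:/var/lib/gpustack \
    -v /data/models:/data/models \
    gpustack/gpustack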
How can I deploy a locally downloaded Ollama model?
Use the following command to find the full path of the model file and deploy it via Local Path. The example below uses deepseek-r1:14b-qwen-distill-q4_K_M; be sure to replace it with your actual Ollama model name:
ollama show deepseek-r1:14b-qwen-distill-q4_K_M --modelfile | grep FROM | grep blobs | sed 's/^FROM[[:space:]]*//'
What should I do if the model is stuck in the Pending state?
Pending means that no worker currently meets the model's requirements. Hover over the Pending status to view the reason.
First, check the Resources → Workers section to ensure that the worker status is Ready.
Then, for different backends:
- llama-box
llama-box uses the GGUF Parser to calculate the model's memory requirements. You need to ensure that the allocatable memory is greater than the model's calculated memory requirements. Note that even if other models are in an Error or Downloading state, their GPU memory has already been allocated. If you are unsure how much GPU memory the model requires, you can use the GGUF Parser to calculate it.
The context size of the model also affects the required GPU memory. You can adjust the --ctx-size parameter to set a smaller context. In GPUStack, if this parameter is not set, it defaults to 8192. If it is specified in the backend parameters, that setting takes effect.
You can set a smaller context in Edit Model → Advanced → Backend Parameters as needed, for example --ctx-size=2048. However, keep in mind that the max tokens for each inference request is influenced by both the --ctx-size and --parallel parameters:
max tokens = context size / parallel
The default value of --parallel is 4, so in this example the max tokens would be 2048 / 4 = 512. If the token count exceeds the max tokens, the inference output will be truncated.
On the other hand, the --parallel parameter represents the number of parallel sequences to decode, which can roughly be considered the model's concurrent request handling setting.
Therefore, it is important to set the --ctx-size and --parallel parameters appropriately, ensuring that the max tokens for a single request stays within the limit and that the available GPU memory can support the specified context size.
If you need to align with Ollama’s configuration, you can refer to the following examples:
Set the following parameters in Edit Model → Advanced → Backend Parameters:
--ctx-size=8192
--parallel=4
If your GPU memory is insufficient, try launching with a lower configuration:
--ctx-size=2048
--parallel=1
- vLLM
vLLM requires that all GPUs have more than 90% of their memory available by default (controlled by the --gpu-memory-utilization parameter). Ensure that the allocatable GPU memory exceeds 90%. Note that even if other models are in an Error or Downloading state, their GPU memory has already been allocated.
If all GPUs have more than 90% of their memory available but the model still shows Pending, the GPU memory is insufficient. For safetensors models in BF16 format, the required GPU memory (GB) can be estimated as:
GPU Memory (GB) = Number of Parameters (B) * 2 * 1.2 + 2
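For example, under this estimate a 7B-parameter BF16 model needs roughly 7 * 2 * 1.2 + 2 ≈ 18.8 GB of GPU memory.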
If the allocatable GPU memory is less than 90%, but you are sure the model can run with a lower allocation, you can adjust the --gpu-memory-utilization parameter. For example, add --gpu-memory-utilization=0.5 in Edit Model → Advanced → Backend Parameters to allocate 50% of the GPU memory.
Note: If the model encounters an error after running and the logs show CUDA: out of memory, the allocated GPU memory is insufficient. You will need to further adjust --gpu-memory-utilization, add more resources, or deploy a smaller model.
The context size of the model also affects the required GPU memory. You can adjust the --max-model-len parameter to set a smaller context. In GPUStack, if this parameter is not set, it defaults to 8192. If it is specified in the backend parameters, that setting takes effect.
You can set a smaller context as needed, for example --max-model-len=2048. However, keep in mind that the max tokens for each inference request cannot exceed the value of --max-model-len, so setting a very small context may cause inference output to be truncated.
The --enforce-eager parameter also helps reduce GPU memory usage. It forces vLLM to execute the model in eager mode, meaning operations run immediately as they are called rather than being deferred and optimized in graph-based execution. This makes execution easier to debug, but it can reduce performance due to the lack of graph optimizations.
What should I do if the model is stuck in the Scheduled state?
Try restarting the GPUStack service where the model is scheduled. If the issue persists, check the worker logs here to analyze the cause.
What should I do if the model is stuck in the Error state?
Hover over the Error status to view the reason. If there is a View More button, click it to check the error messages in the model logs and analyze the cause.
How can I resolve the error *.so: cannot open shared object file: No such file or directory?
If an error occurs during model startup indicating that a .so file cannot be opened, for example:
llama-box: error while loading shared libraries: libcudart.so.12: cannot open shared object file: No such file or directory
The cause is that GPUStack does not recognize the LD_LIBRARY_PATH environment variable, which may be due to a missing environment variable or toolkits (such as CUDA, CANN, etc.) not being configured when GPUStack was installed.
To check if the environment variable is set:
echo $LD_LIBRARY_PATH
If not configured, here’s an example configuration for CUDA.
Ensure that nvidia-smi is executable and the NVIDIA driver version is 550 or later:
nvidia-smi
Configure the CUDA environment variables. If CUDA is not installed, install CUDA 12.4 or later:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/targets/x86_64-linux/lib
export PATH=$PATH:/usr/local/cuda/bin
echo $LD_LIBRARY_PATH
echo $PATH
Create or edit /etc/default/gpustack and add the PATH and LD_LIBRARY_PATH environment variables:
vim /etc/default/gpustack
LD_LIBRARY_PATH=......
PATH=......
Save and restart GPUStack:
systemctl restart gpustack
Why did it fail to load the model when using the local path?
When deploying a model using Local Path and encountering a failed to load model error, check whether the model files exist on the node where the model instance is scheduled and whether the absolute path is correct.
For GGUF models, you need to specify the absolute path to the .gguf file. For sharded models, use the absolute path to the first .gguf file (typically 00001).
If using Docker installation, the model files must be mounted into the container. Make sure the path you provide is the one inside the container, not the host path.
Why doesn’t deleting a model free up disk space?
This is to avoid re-downloading the model when redeploying. You need to clean it up manually in Resources → Model Files.
Why does each GPU have a llama-box process by default?
This process is the RPC server used for llama-box's distributed inference. If you are sure you do not need distributed inference with llama-box, you can disable the RPC server by adding the --disable-rpc-servers parameter when running GPUStack, as shown below.
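For example, for a Linux script installation, add the parameter the same way as the other flags above (a sketch; follow the Docker or pip patterns shown earlier for other installation methods):
sudo vim /etc/systemd/system/gpustack.service
ExecStart=/root/.local/bin/gpustack start --disable-rpc-servers
sudo systemctl daemon-reload && sudo systemctl restart gpustack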
Backend Parameters
How can I know the purpose of the backend parameters?
How can I set the model’s context length?
llama-box backend (GGUF models)
GPUStack sets the default context length for models to 8K. You can customize the context length using the --ctx-size parameter, but it cannot exceed the model's maximum context length.
If editing, save the change and then recreate the model instance to take effect.
vLLM backend (Safetensors models)
GPUStack sets the default context length for models to 8K. You can customize the context length using the --max-model-len parameter, but it cannot exceed the model's maximum context length.
MindIE backend (Safetensors models)
GPUStack sets the default context length for models to 8K. You can customize the context length using the --max-seq-len parameter, but it cannot exceed the model's maximum context length.
If editing, save the change and then recreate the model instance to take effect.
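For example, to raise the context length to 16K (an illustrative value; it must not exceed the model's maximum context length), set the parameter matching your backend in Edit Model → Advanced → Backend Parameters:
llama-box: --ctx-size=16384
vLLM: --max-model-len=16384
MindIE: --max-seq-len=16384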
Using Models
Using Vision Language Models
How can I resolve the error At most 1 image(s) may be provided in one request?
This is a limitation of vLLM. You can adjust the --limit-mm-per-prompt parameter in Edit Model → Advanced → Backend Parameters as needed. For example, --limit-mm-per-prompt=image=4 means that up to 4 images are supported per inference request; see the details here.
Managing GPUStack
How can I manage the GPUStack service?
Script Installation
- Linux
Stop GPUStack:
sudo systemctl stop gpustack
Start GPUStack:
sudo systemctl start gpustack
Restart GPUStack:
sudo systemctl restart gpustack
- macOS
Stop GPUStack:
sudo launchctl bootout system /Library/LaunchDaemons/ai.gpustack.plist
Start GPUStack:
sudo launchctl bootstrap system /Library/LaunchDaemons/ai.gpustack.plist
Restart GPUStack:
sudo launchctl bootout system /Library/LaunchDaemons/ai.gpustack.plist
sudo launchctl bootstrap system /Library/LaunchDaemons/ai.gpustack.plist
- Windows
Run PowerShell as administrator (avoid using PowerShell ISE).
Stop GPUStack:
Stop-Service -Name "GPUStack"
Start GPUStack:
Start-Service -Name "GPUStack"
Restart GPUStack:
Restart-Service -Name "GPUStack"
Docker Installation
Restart GPUStack container:
docker restart gpustack
How do I use GPUStack behind a proxy?
Script Installation
- Linux & macOS
Create or edit /etc/default/gpustack and add the proxy configuration:
vim /etc/default/gpustack
http_proxy="http://username:password@proxy-server:port"
https_proxy="http://username:password@proxy-server:port"
all_proxy="socks5://username:password@proxy-server:port"
no_proxy="localhost,127.0.0.1,192.168.0.0/24,172.16.0.0/16,10.0.0.0/8"
Save and restart GPUStack:
systemctl restart gpustack
Docker Installation
Pass environment variables when running GPUStack:
docker run -e http_proxy="http://username:password@proxy-server:port" \
-e https_proxy="http://username:password@proxy-server:port" \
-e all_proxy="socks5://username:password@proxy-server:port" \
-e no_proxy="localhost,127.0.0.1,192.168.0.0/24,172.16.0.0/16,10.0.0.0/8" \
……