Performing Distributed Inference Across Workers
This tutorial will guide you through the process of configuring and running distributed inference across multiple workers using GPUStack. Distributed inference allows you to handle larger language models by distributing the computational workload among multiple workers. This is particularly useful when individual workers do not have sufficient resources, such as VRAM, to run the entire model independently.
Prerequisites
Before proceeding, ensure the following:
- GPUStack is installed and running. Refer to the Setting Up a Multi-node GPUStack Cluster tutorial if needed (a minimal worker-join sketch follows this list).
- Access to Hugging Face for downloading the model files.
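For reference, registering an additional node as a worker against an existing GPUStack server typically looks like the sketch below. The token path and flags here follow the GPUStack quickstart at the time of writing and may differ in your version, so treat the cluster setup tutorial above as the source of truth.

# On the server: read the authentication token that workers use to register.
$ cat /var/lib/gpustack/token
# On each worker node: start GPUStack in worker mode, pointing at the server.
$ gpustack start --server-url http://your-gpustack-server --token YOUR_TOKEN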
In this tutorial, we’ll assume a cluster with two nodes, each equipped with an NVIDIA P40 GPU (22 GB of VRAM).
We aim to run a large language model that requires more VRAM than a single worker can provide. For this tutorial, we’ll use the Qwen/Qwen2.5-72B-Instruct model in the q2_k quantization format. The resources required to run this model can be estimated using the gguf-parser tool:
$ gguf-parser --hf-repo Qwen/Qwen2.5-72B-Instruct-GGUF --hf-file qwen2.5-72b-instruct-q2_k-00001-of-00007.gguf --ctx-size=8192 --in-short --skip-architecture --skip-metadata --skip-tokenizer
+--------------------------------------------------------------------------------------+
|                                        ESTIMATE                                       |
+----------------------------------------------+---------------------------------------+
|                      RAM                     |                 VRAM 0                |
+--------------------+------------+------------+----------------+----------+-----------+
| LAYERS (I + T + O) |    UMA     |   NONUMA   | LAYERS (T + O) |   UMA    |  NONUMA   |
+--------------------+------------+------------+----------------+----------+-----------+
|     1 + 0 + 0      | 243.89 MiB | 393.89 MiB |     80 + 1     | 2.50 GiB | 28.92 GiB |
+--------------------+------------+------------+----------------+----------+-----------+
From the output, we can see that the estimated VRAM requirement for a full GPU offload (28.92 GiB in the NONUMA column) exceeds the 22 GB of VRAM available on each worker node. We therefore need to distribute the inference across multiple workers to run the model.
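Before deploying, you can optionally sanity-check how the model might split across the two GPUs. gguf-parser accepts a llama.cpp-style tensor-split option for multi-device estimates; the --tensor-split=1,1 flag below is used here as an assumption about an even two-way split, and the resulting estimate is an approximation rather than exactly what GPUStack will schedule.

$ gguf-parser --hf-repo Qwen/Qwen2.5-72B-Instruct-GGUF \
    --hf-file qwen2.5-72b-instruct-q2_k-00001-of-00007.gguf \
    --ctx-size=8192 --tensor-split=1,1 \
    --in-short --skip-architecture --skip-metadata --skip-tokenizer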
Step 1: Deploy the Model
Follow these steps to deploy the model from Hugging Face, enabling distributed inference:
- Navigate to the Models page in the GPUStack UI.
- Click the Deploy Model button.
- In the dropdown, select Hugging Face as the source for your model.
- Enable the GGUF checkbox to filter models by GGUF format.
- Use the search bar in the top left to search for the model name Qwen/Qwen2.5-72B-Instruct-GGUF.
- In the Available Files section, select the q2_k quantization format.
- Expand the Advanced section and scroll down. Disable the Allow CPU Offloading option and verify that the Allow Distributed Inference Across Workers option is enabled (it is enabled by default). GPUStack will evaluate the available resources in the cluster and run the model in a distributed manner if required.
- Click the Save button to deploy the model.
Step 2: Verify the Model Deployment
Once the model is deployed, verify the deployment on the Models page, where you can view details about how the model is running across multiple workers.
You can also check worker and GPU resource usage by navigating to the Resources page.
Finally, go to the Playground page to interact with the model and verify that everything is functioning correctly.
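Beyond the Playground, you can also query the deployment programmatically through GPUStack's OpenAI-compatible API. The snippet below is only a sketch: the base path (/v1-openai), the API key, and the model name qwen2.5-72b-instruct are assumptions that depend on your GPUStack version and on the name you gave the deployment, so adjust them to your setup.

# Replace the host, API key, and model name with values from your own installation.
$ curl http://your-gpustack-server/v1-openai/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer YOUR_GPUSTACK_API_KEY" \
    -d '{
          "model": "qwen2.5-72b-instruct",
          "messages": [{"role": "user", "content": "Say hello from a distributed deployment."}]
        }'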
Conclusion
Congratulations! You have successfully configured and run distributed inference across multiple workers using GPUStack.