Using Vision Language Models

Vision Language Models can process both visual (image) and language (text) data simultaneously, making them versatile tools for applications such as image captioning and visual question answering. In this guide, you will learn how to deploy and interact with Vision Language Models (VLMs) in GPUStack.

The procedure for deploying and interacting with these models in GPUStack is largely the same; the main difference is the backend parameters you set when deploying each model. For more information on the parameters you can set, please refer to Backend Parameters.

In this guide, we will cover the deployment of the following models:

  • Qwen3-VL
  • Llama3.2-Vision
  • Pixtral
  • Phi3.5-Vision

Prerequisites

Before you begin, ensure that you have the following:

  • A Linux machine with one or more GPUs and at least 30 GB of VRAM in total. This guide uses the vLLM backend, which only supports Linux.
  • Access to Hugging Face and a Hugging Face API key for downloading the model files.
  • Access to the models listed above on Hugging Face. Llama3.2-Vision and Pixtral are gated models, so you need to request access to them before deploying (a quick way to verify access is sketched after this list).
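
If you want to confirm that your Hugging Face token can access the gated repositories before deploying, the following is a minimal sketch using the huggingface_hub Python client. It assumes huggingface_hub is installed and your token is available in the HF_TOKEN environment variable; any access error (for example a gated-repo error) means you still need to request access on Hugging Face.

import os

from huggingface_hub import HfApi

# Repositories used in this guide; Llama3.2-Vision and Pixtral are gated.
repos = [
    "Qwen/Qwen3-VL-4B-Instruct",
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    "mistralai/Pixtral-12B-2409",
    "microsoft/Phi-3.5-vision-instruct",
]

api = HfApi(token=os.environ.get("HF_TOKEN"))
for repo in repos:
    try:
        api.model_info(repo)  # raises if the token cannot read the repository
        print(f"{repo}: access OK")
    except Exception as exc:  # e.g. GatedRepoError when access has not been granted
        print(f"{repo}: no access ({type(exc).__name__})")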

Note

An Ubuntu node equipped with one H100 (80 GB) GPU is used throughout this guide.

Step 1: Install GPUStack

Please follow the Installation Documentation to install GPUStack.

Step 2: Log in to GPUStack UI

After the server starts, run the following command to get the default admin password:

docker exec gpustack cat /var/lib/gpustack/initial_admin_password

Open your browser and navigate to http://your_host_ip to access the GPUStack UI. Use the default username admin and the password you retrieved above to log in.

Step 3: Deploy Vision Language Models with vLLM

Deploy Qwen3-VL

  1. Navigate to the Deployments page in the GPUStack UI.
  2. Click on the Deploy Model button, then select Hugging Face in the dropdown.
  3. Search for Qwen/Qwen3-VL-4B-Instruct in the search bar.
  4. Click the Save button. The default configurations should work as long as you have enough GPU resources.
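
Once the deployment shows as running, you can confirm that it is reachable through GPUStack's OpenAI-compatible API. The following is a minimal sketch, assuming an API key created on the GPUStack API Keys page and stored in the GPUSTACK_API_KEY environment variable, and the /v1 base path (adjust the path, for example to /v1-openai, to match your GPUStack version).

import os

from openai import OpenAI

client = OpenAI(
    base_url="http://your_host_ip/v1",  # adjust the path to your GPUStack version
    api_key=os.environ["GPUSTACK_API_KEY"],
)

# List the models served by GPUStack; the Qwen3-VL deployment should appear
# under its deployment name.
for model in client.models.list():
    print(model.id)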

Deploy Llama3.2-Vision

  1. Navigate to the Deployments page in the GPUStack UI.
  2. Click on the Deploy Model button, then select Hugging Face in the dropdown.
  3. Search for meta-llama/Llama-3.2-11B-Vision-Instruct in the search bar.
  4. Expand the Advanced section in configurations and scroll down to the Backend Parameters section.
  5. Click on the Add Parameter button multiple times and add the following parameters:
  • --enforce-eager
  • --max-num-seqs=16
  • --max-model-len=8192
  6. Click the Save button.

Deploy Pixtral

  1. Navigate to the Deployments page in the GPUStack UI.
  2. Click on the Deploy Model button, then select Hugging Face in the dropdown.
  3. Search for mistralai/Pixtral-12B-2409 in the search bar.
  4. Expand the Advanced section in configurations and scroll down to the Backend Parameters section.
  5. Click on the Add Parameter button multiple times and add the following parameters:
  • --tokenizer-mode=mistral
  • --limit-mm-per-prompt=image=4
  6. Click the Save button.
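
The --limit-mm-per-prompt=image=4 parameter caps the number of images vLLM will accept in a single prompt at four. The following is a minimal sketch that sends two images to the Pixtral deployment through GPUStack's OpenAI-compatible API; the base path, API key environment variable, deployment name, and image URLs are assumptions to adapt to your setup.

import os

from openai import OpenAI

client = OpenAI(
    base_url="http://your_host_ip/v1",  # adjust the path to your GPUStack version
    api_key=os.environ["GPUSTACK_API_KEY"],
)

# Placeholder image URLs; replace with images reachable from the server.
image_urls = [
    "https://example.com/cat.jpg",
    "https://example.com/dog.jpg",
]

response = client.chat.completions.create(
    model="pixtral-12b-2409",  # use the deployment name shown in the GPUStack UI
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two images."},
                # With --limit-mm-per-prompt=image=4, at most four image parts
                # may be included in a single prompt.
                *[
                    {"type": "image_url", "image_url": {"url": url}}
                    for url in image_urls
                ],
            ],
        }
    ],
)
print(response.choices[0].message.content)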

Deploy Phi3.5-Vision

  1. Navigate to the Deployments page in the GPUStack UI.
  2. Click on the Deploy Model button, then select Hugging Face in the dropdown.
  3. Search for microsoft/Phi-3.5-vision-instruct in the search bar.
  4. Expand the Advanced section in configurations and scroll down to the Backend Parameters section.
  5. Click on the Add Parameter button and add the following parameter:
  • --trust-remote-code
  6. Click the Save button.

Step 4: Interact with Vision Language Models

  1. Navigate to the Chat page in the GPUStack UI.
  2. Select the deployed model from the top-right dropdown.
  3. Click on the Upload Image button above the input text area and upload an image.
  4. Enter a prompt in the input text area. For example, "Describe the image."
  5. Click the Submit button to generate the output.

(Screenshot: interacting with a vision language model in the GPUStack playground)
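
You can also interact with the deployed models programmatically through GPUStack's OpenAI-compatible API. The following is a minimal sketch that sends a local image to the Qwen3-VL deployment; the base path, API key environment variable, deployment name (qwen3-vl), and image file name are assumptions to adapt to your setup.

import base64
import os

from openai import OpenAI

client = OpenAI(
    base_url="http://your_host_ip/v1",  # adjust the path to your GPUStack version
    api_key=os.environ["GPUSTACK_API_KEY"],
)

# Embed a local image as a base64 data URL so it can be sent in the request
# without being publicly reachable.
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen3-vl",  # use the deployment name shown in the GPUStack UI
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the image."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)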