Using Vision Language Models
Vision Language Models (VLMs) can process both visual (image) and textual (language) data, making them suitable for applications such as image captioning and visual question answering. In this tutorial, you will learn how to deploy and interact with VLMs in GPUStack.
The deployment and interaction procedure is similar for all of these models; the main difference is the backend parameters you set when deploying each one. For details on the available parameters, refer to Backend Parameters.
In this tutorial, we will cover the deployment of the following models:
- Llama3.2-Vision
- Qwen2-VL
- Pixtral
- Phi3.5-Vision
Prerequisites
Before you begin, ensure that you have the following:
- A Linux machine with one or more GPUs that have at least 30 GB of VRAM in total. We will use the vLLM backend, which only supports Linux.
- Access to Hugging Face and a Hugging Face API key for downloading the model files.
- Access granted to the above models on Hugging Face. Llama3.2-Vision and Pixtral are gated models, and you need to request access to them.
Note
An Ubuntu node equipped with one H100 (80GB) GPU is used throughout this tutorial.
Step 1: Install GPUStack
Run the following command to install GPUStack:
curl -sfL https://get.gpustack.ai | sh -s - --huggingface-token <Hugging Face API Key>
Replace <Hugging Face API Key> with your Hugging Face API key. GPUStack will use this key to download the model files.
Step 2: Log in to GPUStack UI
Run the following command to get the default password:
cat /var/lib/gpustack/initial_admin_password
Open your browser and navigate to http://<your-server-ip>. Replace <your-server-ip> with the IP address of your server. Log in using the username admin and the password you obtained in the previous step.
Step 3: Deploy Vision Language Models
Deploy from Catalog
Vision language models in the catalog are marked with the vision capability. When you select a vision language model from the catalog, the default configurations should work as long as you have enough GPU resources and the backend is compatible with your setup (e.g., the vLLM backend requires an amd64 Linux worker).
Example of Custom Deployment Using llama-box
When deploying GGUF VLM models with llama-box, GPUStack automatically handles the multi-modal projector file, so the model should work out of the box (a rough sketch of what this automates follows the steps below).
- Navigate to the Models page in the GPUStack UI and click the Deploy Model button. In the dropdown, select Hugging Face as the source for your model.
- Enable the GGUF checkbox to filter models by GGUF format.
- Use the search bar to find the bartowski/Qwen2-VL-2B-Instruct-GGUF model.
- Click the Save button to deploy the model.
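For context, here is a rough sketch of what GPUStack automates in this case. Serving a GGUF vision model by hand with a llama.cpp-style server normally requires downloading the multi-modal projector (mmproj) file separately and passing it explicitly. The file names below are hypothetical and the exact llama-box flags may differ from this sketch; it is for illustration only, not the command GPUStack actually runs.
# Illustration only: manual serving of a GGUF VLM with llama.cpp-style flags.
# GPUStack resolves and attaches the mmproj file for you, so this step is not
# needed when deploying through the UI. File names are hypothetical.
llama-box --host 0.0.0.0 --port 8080 \
  -m Qwen2-VL-2B-Instruct-Q4_K_M.gguf \
  --mmproj mmproj-Qwen2-VL-2B-Instruct-f16.gguf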
Example of Custom Deployment Using vLLM
Deploy Llama3.2-Vision
- Navigate to the Models page in the GPUStack UI.
- Click on the Deploy Model button, then select Hugging Face in the dropdown.
- Search for meta-llama/Llama-3.2-11B-Vision-Instruct in the search bar.
- Expand the Advanced section in configurations and scroll down to the Backend Parameters section.
- Click on the Add Parameter button multiple times and add the following parameters:
  --enforce-eager
  --max-num-seqs=16
  --max-model-len=8192
- Click the Save button.
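Backend parameters added here are passed through to the vLLM backend. For reference, a roughly equivalent standalone vLLM invocation is sketched below; GPUStack builds the actual command for you, so the exact invocation it uses may differ.
# --enforce-eager  : run in eager mode instead of capturing CUDA graphs, reducing memory overhead
# --max-num-seqs   : limit the number of sequences processed concurrently to 16
# --max-model-len  : cap the context length at 8192 tokens
vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct \
  --enforce-eager --max-num-seqs 16 --max-model-len 8192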
Deploy Qwen2-VL
- Navigate to the Models page in the GPUStack UI.
- Click on the Deploy Model button, then select Hugging Face in the dropdown.
- Search for Qwen/Qwen2-VL-7B-Instruct in the search bar.
- Click the Save button. The default configurations should work as long as you have enough GPU resources.
Deploy Pixtral
- Navigate to the Models page in the GPUStack UI.
- Click on the Deploy Model button, then select Hugging Face in the dropdown.
- Search for mistralai/Pixtral-12B-2409 in the search bar.
- Expand the Advanced section in configurations and scroll down to the Backend Parameters section.
- Click on the Add Parameter button multiple times and add the following parameters:
  --tokenizer-mode=mistral
  --limit-mm-per-prompt=image=4
- Click the Save button.
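These parameters are likewise forwarded to vLLM: --tokenizer-mode=mistral tells vLLM to use Mistral's native tokenizer format, and --limit-mm-per-prompt=image=4 caps each request at four images. A rough standalone equivalent is sketched below; the value syntax for --limit-mm-per-prompt can vary between vLLM versions, and GPUStack constructs the real command for you.
vllm serve mistralai/Pixtral-12B-2409 \
  --tokenizer-mode mistral \
  --limit-mm-per-prompt image=4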
Deploy Phi3.5-Vision
- Navigate to the Models page in the GPUStack UI.
- Click on the Deploy Model button, then select Hugging Face in the dropdown.
- Search for microsoft/Phi-3.5-vision-instruct in the search bar.
- Expand the Advanced section in configurations and scroll down to the Backend Parameters section.
- Click on the Add Parameter button and add the following parameter:
  --trust-remote-code
  This flag lets vLLM execute the custom model code shipped in the Hugging Face repository, which Phi-3.5-vision requires to load.
- Click the Save button.
Step 4: Interact with Vision Language Models
- Navigate to the Playground page in the GPUStack UI.
- Select the deployed model from the top-right dropdown.
- Click on the Upload Image button above the input text area and upload an image.
- Enter a prompt in the input text area. For example, "Describe the image."
- Click the Submit button to generate the output.
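You can also call the deployed models programmatically. The sketch below assumes GPUStack exposes an OpenAI-compatible chat completions endpoint (the /v1-openai base path used here is an assumption; check the GPUStack API documentation for the exact path) and that you have created an API key in the GPUStack UI. Images are sent as base64-encoded data URLs in the OpenAI-style message format, and the model name must match the name of your deployment.
# Hypothetical server address, API key, and deployment name; adjust to your setup.
GPUSTACK_SERVER=http://<your-server-ip>
GPUSTACK_API_KEY=<your-api-key>

# Base64-encode a local image and ask the model to describe it.
IMAGE_B64=$(base64 -w 0 example.jpg)

curl -s "$GPUSTACK_SERVER/v1-openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GPUSTACK_API_KEY" \
  -d '{
    "model": "llama-3.2-11b-vision-instruct",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe the image."},
        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'"$IMAGE_B64"'"}}
      ]
    }]
  }'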
Conclusion
In this tutorial, you learned how to deploy and interact with Vision Language Models in GPUStack. You can use the same approach to deploy other Vision Language Models not covered in this tutorial. If you have any questions or need further assistance, feel free to reach out to us.



