Inference with Tool Calling
Tool calling allows you to connect models to external tools and systems. This is useful for many things such as empowering AI assistants with capabilities, or building deep integrations between your applications and the models.
In this tutorial, you’ll learn how to set up and use tool calling within GPUStack to extend your AI’s capabilities.
Note
Prerequisites
Before proceeding, ensure the following:
- GPUStack is installed and running.
- A Linux worker node with a GPU is available. We'll use Qwen2.5-7B-Instruct as the model for this tutorial. The model requires a GPU with at least 18GB VRAM.
- Access to Hugging Face for downloading the model files.
Step 1: Deploy the Model
Deploy from Catalog
LLMs that support tool calling are marked with the tools capability in the catalog. When you select such a model from the catalog, tool calling is enabled by default.
Example of Custom Deployment Using llama-box
When you deploy GGUF models using llama-box, tool calling is enabled by default for models that support it.
- Navigate to the
Deploymentspage in the GPUStack UI and click theDeploy Modelbutton. In the dropdown, selectHugging Faceas the source for your model. - Enable the
GGUFcheckbox to filter models by GGUF format. - Use the search bar to find the
Qwen/Qwen2.5-7B-Instruct-GGUFmodel. - Click the
Savebutton to deploy the model.
Example of Custom Deployment Using vLLM
When you deploy models using vLLM, you need to enable tool calling with additional parameters.
- Navigate to the
Deploymentspage in the GPUStack UI and click theDeploy Modelbutton. In the dropdown, selectHugging Faceas the source for your model. - Use the search bar to find the
Qwen/Qwen2.5-7B-Instructmodel. - Expand the
Advancedsection in configurations and scroll down to theBackend Parameterssection. - Click on the
Add Parameterbutton and add the following parameters:
--enable-auto-tool-choice--tool-call-parser=hermes
- Click the
Savebutton to deploy the model.
After deployment, you can monitor the model's status on the Deployments page.
Step 2: Generate an API Key
We will use the GPUStack API to interact with the model. To do this, you need to generate an API key:
- Hover over the user avatar and navigate to the
API Keyspage. - Click the
New API Keybutton. - Enter a name for the API key and click the
Savebutton. - Copy the generated API key for later use.
Step 3: Do Inference
With the model deployed and an API key, you can call the model via the GPUStack API. Here is an example script using curl (replace <your-server-url> with your GPUStack server URL and <your-api-key> with the API key generated in the previous step):
export GPUSTACK_SERVER_URL=<your-server-url>
export GPUSTACK_API_KEY=<your-api-key>
curl $GPUSTACK_SERVER_URL/v1-openai/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $GPUSTACK_API_KEY" \
-d '{
"model": "qwen2.5-7b-instruct",
"messages": [
{
"role": "user",
"content": "What'\''s the weather like in Boston today?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"]
}
},
"required": ["location"]
}
}
}
],
"tool_choice": "auto"
}'
Example response:
{
"model": "qwen2.5-7b-instruct",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": null,
"tool_calls": [
{
"id": "chatcmpl-tool-b99d32848b324eaea4bac5a5830d00b8",
"type": "function",
"function": {
"name": "get_current_weather",
"arguments": "{\"location\": \"Boston, MA\", \"unit\": \"fahrenheit\"}"
}
}
]
},
"finish_reason": "tool_calls"
}
],
"usage": {
"prompt_tokens": 212,
"total_tokens": 242,
"completion_tokens": 30
}
}

