# Inference APIs

## OpenAI-Compatible APIs

GPUStack provides OpenAI-compatible APIs under the /v1 path.
You can integrate models deployed on GPUStack with any application or framework that supports the OpenAI API, simply by pointing it at GPUStack's OpenAI-compatible endpoint.
### Supported Endpoints
The following API endpoints are supported:
- List Models
- Create Completion
- Create Chat Completion
- Create Response
- Create Embeddings
- Create Image
- Create Image Edit
- Create Speech
- Create Transcription
### Usage

#### cURL Example
```bash
export GPUSTACK_API_KEY=your_api_key
curl http://your_gpustack_server_url/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GPUSTACK_API_KEY" \
  -d '{
    "model": "qwen3",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "stream": true
  }'
```
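The same request can be sketched in Python using only the standard library. The server URL, API key, and model name below are the placeholders from the cURL example above; replace them with your own values.

```python
import json
import urllib.request

# Placeholders from the cURL example above: replace with your actual
# GPUStack server URL and API key.
API_BASE = "http://your_gpustack_server_url"
API_KEY = "your_api_key"

def chat_request(model, messages, stream=False):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps({"model": model, "messages": messages, "stream": stream})
    return urllib.request.Request(
        f"{API_BASE}/v1/chat/completions",
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )

req = chat_request(
    "qwen3",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    stream=True,
)
# With a live server, urllib.request.urlopen(req) sends the request; when
# "stream" is true, the response body is a server-sent-event stream.
```

Any OpenAI-compatible client SDK should also work by setting its base URL to http://your_gpustack_server_url/v1 and its API key to your GPUStack API key.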
## Anthropic-Compatible APIs

GPUStack provides the Anthropic-compatible /v1/messages API.

### Usage
```bash
export GPUSTACK_API_KEY=your_api_key
curl http://your_gpustack_server_url/v1/messages \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GPUSTACK_API_KEY" \
  -d '{
    "model": "qwen3",
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ],
    "max_tokens": 1024
  }'
```
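A minimal Python sketch of the same call, with the same placeholders as above. One difference worth noting: unlike the OpenAI chat endpoint, the Anthropic Messages API requires the max_tokens field.

```python
import json
import urllib.request

API_BASE = "http://your_gpustack_server_url"  # placeholder: your server URL
API_KEY = "your_api_key"                      # placeholder: your API key

def messages_request(model, messages, max_tokens):
    """Build an Anthropic-style /v1/messages request. Unlike the OpenAI
    chat endpoint, the Anthropic Messages API requires max_tokens."""
    body = json.dumps({"model": model, "messages": messages, "max_tokens": max_tokens})
    return urllib.request.Request(
        f"{API_BASE}/v1/messages",
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )

req = messages_request("qwen3", [{"role": "user", "content": "Hello!"}], max_tokens=1024)
# urllib.request.urlopen(req) would send it, given a live GPUStack server.
```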
## Jina-Compatible Rerank API
In the context of Retrieval-Augmented Generation (RAG), reranking refers to the process of selecting the most relevant information from retrieved documents or knowledge sources before presenting them to the user or utilizing them for answer generation.
Note that the OpenAI-compatible APIs do not provide a rerank endpoint. To fill this gap, GPUStack provides a Jina-compatible Rerank API at the /v1/rerank path.
### Usage

#### cURL Example
```bash
export GPUSTACK_API_KEY=your_api_key
curl http://your_gpustack_server_url/v1/rerank \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GPUSTACK_API_KEY" \
  -d '{
    "model": "bge-reranker-v2-m3",
    "query": "What is a panda?",
    "top_n": 3,
    "documents": [
      "hi",
      "it is a bear",
      "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China."
    ]
  }' | jq
```
Example output:

```json
{
  "model": "bge-reranker-v2-m3",
  "object": "list",
  "results": [
    {
      "document": {
        "text": "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China."
      },
      "index": 2,
      "relevance_score": 1.951932668685913
    },
    {
      "document": {
        "text": "it is a bear"
      },
      "index": 1,
      "relevance_score": -3.7347371578216553
    },
    {
      "document": {
        "text": "hi"
      },
      "index": 0,
      "relevance_score": -6.157620906829834
    }
  ],
  "usage": {
    "prompt_tokens": 69,
    "total_tokens": 69
  }
}
```
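As a Python sketch (standard library only, same placeholders as the cURL example): one helper builds the rerank request, and another picks the highest-scoring document out of a response shaped like the example output above.

```python
import json
import urllib.request

API_BASE = "http://your_gpustack_server_url"  # placeholder: your server URL
API_KEY = "your_api_key"                      # placeholder: your API key

def rerank_request(model, query, documents, top_n):
    """Build a Jina-style /v1/rerank request."""
    body = json.dumps(
        {"model": model, "query": query, "documents": documents, "top_n": top_n}
    )
    return urllib.request.Request(
        f"{API_BASE}/v1/rerank",
        data=body.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )

def best_document(response):
    """Return the text of the highest-scoring document from a rerank
    response shaped like the example output above."""
    top = max(response["results"], key=lambda r: r["relevance_score"])
    return top["document"]["text"]
```

In a RAG pipeline, best_document (or the top_n results in score order) is what you would pass on to the generation step.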
## Other APIs
For other API types, GPUStack allows you to enable the Generic Proxy feature when deploying a model.
When the Generic Proxy is enabled, GPUStack determines which model to forward the request to by checking either:
- the `model` field in the JSON body, or
- the `X-GPUStack-Model` header.
Once enabled, you can forward API requests to the target model via the /model/proxy endpoint. For example:
```bash
export GPUSTACK_API_KEY=your_api_key
curl http://your_gpustack_server_url/model/proxy/embed \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GPUSTACK_API_KEY" \
  -H "X-GPUStack-Model: bge-m3" \
  -d '{"inputs": ["What is Deep Learning?", "Deep Learning is not..."]}'
```
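The same proxied call can be sketched in Python (standard library only, same placeholders as before); the key point is that routing is driven by the X-GPUStack-Model header rather than the URL.

```python
import json
import urllib.request

API_BASE = "http://your_gpustack_server_url"  # placeholder: your server URL
API_KEY = "your_api_key"                      # placeholder: your API key

def proxy_request(path, model, payload):
    """Build a request to GPUStack's generic proxy. GPUStack routes it to
    the target model based on the X-GPUStack-Model header."""
    return urllib.request.Request(
        f"{API_BASE}/model/proxy/{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
            "X-GPUStack-Model": model,
        },
        method="POST",
    )

req = proxy_request("embed", "bge-m3", {"inputs": ["What is Deep Learning?"]})
# urllib.request.urlopen(req) would forward the body unchanged to the
# backend serving bge-m3, given a live server with Generic Proxy enabled.
```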
For more details, see Enable Generic Proxy.