Using Audio Models
GPUStack supports running both Speech-to-Text and Text-to-Speech models. Speech-to-Text models convert audio inputs in various languages into written text, while Text-to-Speech models transform written text into natural and expressive speech.
In this guide, we will walk you through deploying and using Speech-to-Text and Text-to-Speech models in GPUStack.
Prerequisites
Before you begin, ensure that you have the following:
- GPUStack is installed and running. If not, refer to the Quickstart Guide.
- Access to Hugging Face or ModelScope for downloading the model files.
Running Speech-to-Text Model
Step 1: Deploy Speech-to-Text Model
Follow these steps to deploy the model from the Model Catalog:
- Navigate to the `Model Catalog` page in the GPUStack UI.
- Select `Speech-to-Text` in the category filter, then select the `Whisper-Large-V3-Turbo` model.
- Leave everything as default and click the `Save` button to deploy the model.
After deployment, you can monitor the model deployment's status on the Deployments page. Once the deployment is successful, click the ellipsis icon of the deployment and select Open in Playground to start using the model in the Playground.
Step 2: Interact with Speech-to-Text Model
In the Speech to Text playground:

- Click the `Upload` button to upload an audio file, or click the `Microphone` button to record audio.
- Click the `Generate Text Content` button to generate the transcription.
Step 3: Streaming Output via API
You can also use the API to get streaming transcriptions. Here's an example using curl:
```bash
# Replace ${SERVER_URL} with your GPUStack server URL and ${YOUR_GPUSTACK_API_KEY} with your API key.
curl ${SERVER_URL}/v1/audio/transcriptions \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer ${YOUR_GPUSTACK_API_KEY}" \
  -F model="whisper-large-v3-turbo" \
  -F file="@/path/to/audio-file;type=audio/mpeg" \
  -F language="en" \
  -F stream="true"
```
This will return streaming transcription results as they become available.
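If you consume the stream programmatically, you need to reassemble the incremental events. As a minimal sketch, the helper below collects text from SSE-style `data:` lines, assuming each event is a JSON object with a `text` field and the stream ends with a `[DONE]` sentinel — the field name and sentinel follow common OpenAI-style streaming conventions and are assumptions here, so adjust them to the actual payloads your server returns.

```python
import json


def extract_stream_text(raw_lines):
    """Collect transcription text from SSE-style 'data:' lines.

    Assumes each event is a JSON object carrying a 'text' field and that
    the stream terminates with a '[DONE]' sentinel (an assumption based
    on OpenAI-style streaming responses; adjust to the real payload).
    """
    pieces = []
    for line in raw_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        if "text" in event:
            pieces.append(event["text"])
    return "".join(pieces)
```

You would feed this the decoded lines of the HTTP response body (for example, `response.iter_lines()` from an HTTP client) and get back the accumulated transcription.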
Running Text-to-Speech Model
Step 1: Deploy Text-to-Speech Model
Follow these steps to deploy the model from the Model Catalog:
- Navigate to the `Model Catalog` page in the GPUStack UI.
- Select `Text-to-Speech` in the category filter, then select the `Qwen3-TTS-12Hz-1.7B-CustomVoice` model.
- Leave everything as default and click the `Save` button to deploy the model.
After deployment, you can monitor the model deployment's status on the Deployments page. Once the deployment is successful, click the ellipsis icon of the deployment and select Open in Playground to start using the model in the Playground.
Step 2: Interact with Text-to-Speech Model
In the Text to Speech playground:

- Select the desired voice from the `Voice` dropdown.
- (Optional) Provide `Instructions` to guide the model toward the desired style of speech.
- Enter the text you want to convert to speech.
- Click the `Submit` button to generate the audio.
Step 3: Streaming Output via API
You can also use the API to get streaming audio output. Here's an example using curl:
```bash
# Replace ${SERVER_URL} with your GPUStack server URL and ${YOUR_GPUSTACK_API_KEY} with your API key.
curl ${SERVER_URL}/v1/audio/speech \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${YOUR_GPUSTACK_API_KEY}" \
  -d '{
    "model": "qwen3-tts-12hz-1.7b-customvoice",
    "voice": "Vivian",
    "task_type": "CustomVoice",
    "language": "Auto",
    "input": "Good one. Okay, fine, I'\''m just gonna leave this sock monkey here. Goodbye.",
    "stream": true,
    "response_format": "pcm"
  }' --no-buffer | play -t raw -r 24000 -e signed -b 16 -c 1 -
```

This streams the audio output directly and pipes it to the `play` command (part of SoX). The audio is raw PCM: 24 kHz sample rate, 16-bit signed samples, mono.
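If you don't have SoX installed, you can save the raw stream to a file instead (for example with `curl ... -o speech.pcm`) and wrap it in a WAV container so ordinary players can open it. The sketch below uses only Python's standard-library `wave` module; the default parameters match the stream format above.

```python
import wave


def pcm_to_wav(pcm_bytes, path, rate=24000, width=2, channels=1):
    """Wrap raw PCM audio in a WAV container.

    Defaults match the stream above: 24 kHz, 16-bit signed (2-byte
    samples), mono.
    """
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(width)  # 2 bytes per sample = 16-bit
        wf.setframerate(rate)
        wf.writeframes(pcm_bytes)


# Example: wrap a captured stream so any media player can open it.
# with open("speech.pcm", "rb") as f:
#     pcm_to_wav(f.read(), "speech.wav")
```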
Voice Cloning Using Qwen3-TTS
GPUStack also supports voice cloning with Text-to-Speech models. Here's how to use it:
Step 1: Deploy Voice Cloning Model
- Navigate to the `Model Catalog` page in the GPUStack UI.
- Select `Text-to-Speech` in the category filter, then select the `Qwen3-TTS-12Hz-1.7B-Base` model.
- Leave everything as default and click the `Save` button to deploy the model.
Step 2: Use Voice Cloning in Playground
Once the deployment is successful, click the ellipsis icon of the deployment and select Open in Playground to start using the model in the Playground. Then follow these steps:
- In the `Reference Audio` field, upload an audio file or enter an audio URL to provide the reference voice for cloning. For example, you can use the URL `https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-0115/APS-en_33.wav`.
- Check the `Use Speaker Embedding Only (no ICL)` option.
- Enter the text to be synthesized, for example: `Good one. Okay, fine, I'm just gonna leave this sock monkey here. Goodbye.`
- Click the `Submit` button to generate the speech with the cloned voice.
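Cloning quality depends heavily on the reference clip, so it can help to sanity-check a local file before uploading it. The helper below is a minimal, standard-library-only preflight check that the file starts with a RIFF/WAVE header; it is purely a local convenience and not part of the GPUStack API.

```python
def looks_like_wav(header: bytes) -> bool:
    """Return True if the bytes begin with a RIFF/WAVE container header.

    A WAV file starts with b'RIFF', a 4-byte chunk size, then b'WAVE'.
    This checks only the container magic, not the audio encoding inside.
    """
    return len(header) >= 12 and header[:4] == b"RIFF" and header[8:12] == b"WAVE"


# Example preflight before uploading a reference clip:
# with open("reference.wav", "rb") as f:
#     assert looks_like_wav(f.read(12)), "not a WAV file"
```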