# Running Inference With Moore Threads GPUs
GPUStack supports running inference on Moore Threads GPUs. This tutorial walks through configuring the container runtime and deploying GPUStack on Moore Threads hardware.
## System and Hardware Support
| OS    | Architecture | Status    | Verified           |
|-------|--------------|-----------|--------------------|
| Linux | x86_64       | Supported | Ubuntu 20.04/22.04 |
| Device    | Status    | Verified |
|-----------|-----------|----------|
| MTT S80   | Supported | Yes      |
| MTT S3000 | Supported | Yes      |
| MTT S4000 | Supported | Yes      |
## Prerequisites
The following instructions apply to `Ubuntu 20.04/22.04` systems with the `x86_64` architecture.
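To confirm your host matches these prerequisites, you can check the OS release and CPU architecture with standard Linux commands:

```bash
# Print the Ubuntu release; expect 20.04 or 22.04
lsb_release -ds

# Print the CPU architecture; expect x86_64
uname -m
```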
## Configure the Container Runtime
Follow these links to install and configure the container runtime:
- Install Docker: Docker Installation Guide
- Install the latest drivers for MTT S80/S3000/S4000 (currently rc3.1.0): MUSA SDK Download
- Install the MT Container Toolkits (currently v1.9.0): MT CloudNative Toolkits Download
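After installing the components above, a quick sanity check helps before moving on. The commands below are standard; the `mthreads-gmi` check assumes the driver package installs that utility on the host:

```bash
# Confirm Docker is installed and the daemon is reachable
docker --version
sudo docker info --format '{{.ServerVersion}}'

# Confirm the GPU driver is working on the host
# (assumes the driver package installs the mthreads-gmi utility)
mthreads-gmi
```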
### Verify Container Runtime Configuration
Run the setup script to register the runtime, then ensure the output of `docker info` shows the default runtime as `mthreads`:

```bash
$ (cd /usr/bin/musa && sudo ./docker setup $PWD)
$ docker info | grep mthreads
 Runtimes: mthreads mthreads-experimental runc
 Default Runtime: mthreads
```
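Optionally, you can smoke-test GPU access from inside a container. This sketch assumes the `mthreads` runtime, like NVIDIA's container runtime, injects the `mthreads-gmi` utility and GPU devices into containers and honors an `MTHREADS_VISIBLE_DEVICES` environment variable; consult the MT Container Toolkits documentation for the exact mechanism in your version:

```bash
# Optional smoke test: list GPUs from inside a plain Ubuntu container.
# Assumes the default mthreads runtime mounts mthreads-gmi and the GPU
# devices, and that MTHREADS_VISIBLE_DEVICES selects visible GPUs.
sudo docker run --rm \
  -e MTHREADS_VISIBLE_DEVICES=all \
  ubuntu:22.04 mthreads-gmi
```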
## Installing GPUStack
To set up an isolated environment for GPUStack, we recommend using Docker.
```bash
docker run -d --name gpustack-musa -p 9009:80 --ipc=host -v gpustack-data:/var/lib/gpustack \
    gpustack/gpustack:main-musa
```
This command will:

- Start a container with the GPUStack image.
- Expose the GPUStack web interface on port `9009`.
- Mount the `gpustack-data` volume to store the GPUStack data.
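If port 9009 is already in use, or you want the container to come back up after a reboot, the usual Docker flags apply. The host port and restart policy below are illustrative choices, not GPUStack requirements:

```bash
# Same deployment with an alternate host port and a restart policy
# (8080 and unless-stopped are illustrative, not required by GPUStack)
docker run -d --name gpustack-musa \
  -p 8080:80 --ipc=host \
  --restart unless-stopped \
  -v gpustack-data:/var/lib/gpustack \
  gpustack/gpustack:main-musa
```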
To check the logs of the running container, use the following command:
```bash
docker logs -f gpustack-musa
```
If the following message appears, the GPUStack container is running successfully:
```
2024-11-15T23:37:46+00:00 - gpustack.server.server - INFO - Serving on 0.0.0.0:80.
2024-11-15T23:37:46+00:00 - gpustack.worker.worker - INFO - Starting GPUStack worker.
```
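If the logs look healthy, a standard Docker status check also confirms the container is up and shows its port mapping:

```bash
# Confirm the container is running and see its port mapping
docker ps --filter name=gpustack-musa
```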
Once the container is running, access the GPUStack web interface by navigating to http://localhost:9009 in your browser.
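The UI prompts for the default `admin` account on first login. In recent GPUStack releases the initial password is written to a file inside the container; if your version does the same, you can retrieve it with:

```bash
# Print the auto-generated password for the default admin account
# (file path as documented for recent GPUStack releases; may vary by version)
docker exec -it gpustack-musa cat /var/lib/gpustack/initial_admin_password
```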
After the initial setup, you should see the GPUStack Dashboard, Workers, and GPUs pages in the web interface.
## Running Inference
After installation, you can deploy models and run inference. Refer to the model management documentation for detailed usage instructions.
Moore Threads GPUs support inference through the llama-box (llama.cpp) backend. Most recent models are supported (e.g., `llama3.2:1b`, `llama3.2-vision:11b`, `qwen2.5:7b`).
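Once a model is deployed, you can call it through GPUStack's OpenAI-compatible API. The sketch below assumes a model named `llama3.2:1b` is deployed, that the API is served under the `/v1-openai` path, and that you have created an API key in the UI; the path and authentication details may differ across GPUStack versions:

```bash
# Hypothetical chat-completion request against the local GPUStack server.
# GPUSTACK_API_KEY is an API key created in the GPUStack UI.
curl http://localhost:9009/v1-openai/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GPUSTACK_API_KEY" \
  -d '{
    "model": "llama3.2:1b",
    "messages": [
      {"role": "user", "content": "Say hello from a Moore Threads GPU."}
    ]
  }'
```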
Use `mthreads-gmi` to verify that the model is offloaded to the GPU.
```
root@a414c45864ee:/# mthreads-gmi
Sat Nov 16 12:00:16 2024
---------------------------------------------------------------
mthreads-gmi:1.14.0          Driver Version:2.7.0
---------------------------------------------------------------
ID   Name           |PCIe                 |%GPU  Mem
     Device Type    |Pcie Lane Width      |Temp  MPC Capable
                    |                     |      ECC Mode
+-------------------------------------------------------------+
0    MTT S80        |00000000:01:00.0     |98%   1339MiB(16384MiB)
     Physical       |16x(16x)             |56C   YES
                    |                     |      N/A
---------------------------------------------------------------

---------------------------------------------------------------
Processes:
ID   PID       Process name                          GPU Memory
                                                     Usage
+-------------------------------------------------------------+
0    120       ...ird_party/bin/llama-box/llama-box  2MiB
0    2022      ...ird_party/bin/llama-box/llama-box  1333MiB
---------------------------------------------------------------
```
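Since the GPUStack workload runs inside the container, you can run the same check from the host (using the container name from the install step) or watch it update while requests are in flight:

```bash
# Run mthreads-gmi inside the GPUStack container from the host
docker exec -it gpustack-musa mthreads-gmi

# Refresh every second to watch GPU memory and utilization during inference
watch -n 1 mthreads-gmi
```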