Model Catalog

The Model Catalog is an index of popular models to help you quickly find and deploy models.

Browse Models

You can browse the Model Catalog by navigating to the Catalog page. You can filter models by name and category. The following screenshot shows the Model Catalog page:

Deploy a Model from the Catalog

You can deploy a model from the Model Catalog by clicking the model card. A model deployment configuration page will appear. You can review and customize the deployment configuration and click the Save button to deploy the model.

Customize Model Catalog

You can customize the Model Catalog by providing a YAML file via GPUStack server configuration using the --model-catalog-file flag. It accepts either a local file path or a URL. You can refer to the built-in model catalog file here for the schema. It contains a list of model sets, each with model metadata and templates for deployment configurations.

The following is an example model set in the model catalog file:

- name: Llama3.2
  description: The Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks.
  home: https://www.llama.com/
  icon: /static/catalog_icons/meta.png
  categories:
    - llm
  capabilities:
    - context/128k
    - tools
  sizes:
    - 1
    - 3
  licenses:
    - llama3.2
  release_date: "2024-09-25"
  order: 2
  templates:
    - quantizations:
        - Q3_K_L
        - Q4_K_M
        - Q5_K_M
        - Q6_K_L
        - Q8_0
        - f16
      source: huggingface
      huggingface_repo_id: bartowski/Llama-3.2-{size}B-Instruct-GGUF
      huggingface_filename: "*-{quantization}*.gguf"
      replicas: 1
      backend: llama-box
      cpu_offloading: true
      distributed_inference_across_workers: true
    - quantizations: ["BF16"]
      source: huggingface
      huggingface_repo_id: unsloth/Llama-3.2-{size}B-Instruct
      replicas: 1
      backend: vllm
      backend_parameters:
        - --enable-auto-tool-choice
        - --tool-call-parser=llama3_json
        - --chat-template={data_dir}/chat_templates/tool_chat_template_llama3.2_json.jinja

Using Model Catalog in Air-Gapped Environments

The built-in model catalog sources models from either Hugging Face or ModelScope. If you are using GPUStack in an air-gapped environment without internet access, you can customize the model catalog to use a local-path model source. Here is an example:

- name: Llama3.2
  description: The Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks.
  home: https://www.llama.com/
  icon: /static/catalog_icons/meta.png
  categories:
    - llm
  capabilities:
    - context/128k
    - tools
  sizes:
    - 1
    - 3
  licenses:
    - llama3.2
  release_date: "2024-09-25"
  order: 2
  templates:
    - quantizations:
        - Q3_K_L
        - Q4_K_M
        - Q5_K_M
        - Q6_K_L
        - Q8_0
        - f16
      source: local_path
      # assuming you have all the GGUF model files in /path/to/the/model/directory
      local_path: /path/to/the/model/directory/Llama-3.2-{size}B-Instruct-{quantization}.gguf
      replicas: 1
      backend: llama-box
      cpu_offloading: true
      distributed_inference_across_workers: true
    - quantizations: ["BF16"]
      source: local_path
      # assuming you have both /path/to/Llama-3.2-1B-Instruct and /path/to/Llama-3.2-3B-Instruct directories
      local_path: /path/to/Llama-3.2-{size}B-Instruct
      replicas: 1
      backend: vllm
      backend_parameters:
        - --enable-auto-tool-choice
        - --tool-call-parser=llama3_json
        - --chat-template={data_dir}/chat_templates/tool_chat_template_llama3.2_json.jinja

Template Variables

The following template variables are available for the deployment configuration:

{size}: Model size in billion parameters.
{quantization}: Quantization method of the model.
{data_dir}: GPUStack data directory path.