HAMi Configuration Guide: GPU Resource Pool Management

2025-04-29



HAMi is an intelligent platform designed for heterogeneous GPU resource pooling and scheduling. To support flexible configuration across different environments and requirements, HAMi's scheduler and device plugins offer comprehensive startup parameters. This guide provides a detailed overview of these parameters and their default behaviors to help you quickly get started and optimize your deployment.

Note: This documentation is based on the current master branch; parameters may change in future releases.

Scheduler Parameters

HAMi's scheduler component supports the following configuration options:

--http_bind

Specifies the HTTP server binding address. Default: 127.0.0.1:8080.

--cert_file

Path to the TLS certificate file for HTTPS communication.

--key_file

Path to the TLS private key file, used in conjunction with cert_file.

--scheduler-name

Defines the scheduler name written to pod.spec.schedulerName. If empty, the default Kubernetes scheduler name is used.

--default-mem

Default GPU memory allocation for pods when not explicitly specified.

--default-cores

Default GPU core utilization percentage for pods when not explicitly specified.

--default-gpu

Default number of GPUs allocated per pod when not specified. Default: 1.
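These three defaults only apply when a pod does not request the resources explicitly. As a sketch, a pod can request vGPU resources itself; the resource names below (nvidia.com/gpu, nvidia.com/gpumem, nvidia.com/gpucores) are the ones commonly shown in HAMi examples, but verify them against your deployment's --resource-name settings:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-demo
spec:
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.0-base-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1        # number of vGPUs (--default-gpu if omitted)
          nvidia.com/gpumem: 4096  # GPU memory in MiB (--default-mem if omitted)
          nvidia.com/gpucores: 50  # % of GPU cores (--default-cores if omitted)
```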

--node-scheduler-policy

Node scheduling policy. Default: "binpack", which packs workloads onto as few nodes as possible to consolidate resource allocation.

--gpu-scheduler-policy

GPU scheduling policy. Default: "spread", which distributes workloads across GPUs to balance load.

--metrics-bind-address

Prometheus metrics endpoint binding address. Default: :9395.

--node-label-selector

Restricts the scheduler to nodes matching the given labels; multiple key-value pairs are separated by commas.

--kube-qps

Queries per second (QPS) limit for kube-apiserver communication. Default: 5.0.

--kube-burst

Maximum burst request limit. Default: 10.

--kube-timeout

Timeout, in seconds, for communicating with the kube-apiserver. Default: 30.

--profiling

Enables pprof performance profiling via HTTP server. Default: false.
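Taken together, these flags are typically set on the scheduler container in its Deployment manifest. A minimal sketch follows; the image name and all flag values are illustrative assumptions, not recommendations:

```yaml
# Illustrative excerpt of a scheduler Deployment's container spec
containers:
  - name: hami-scheduler
    image: projecthami/hami:latest   # image name is an assumption
    args:
      - --scheduler-name=hami-scheduler
      - --node-scheduler-policy=binpack
      - --gpu-scheduler-policy=spread
      - --metrics-bind-address=:9395
      - --node-label-selector=gpu=on
      - --kube-qps=10.0
      - --kube-burst=20
```

Raising --kube-qps and --kube-burst in step, as shown, is a common pattern for larger clusters where the defaults (5.0 and 10) can throttle scheduling throughput.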

Device Plugin Parameters

HAMi's NVIDIA device plugin supports the following configuration options:

Basic Device Parameters

--node-name

Current node name, automatically read from environment variables by default.

--device-split-count

Number of virtual devices to create from a single GPU. Default: 2.

--device-memory-scaling

GPU memory scaling factor. Default: 1.0 (no scaling).

--device-cores-scaling

GPU core scaling factor. Default: 1.0.

--disable-core-limit

Disables GPU core utilization limits when set. Default: false.

--resource-name

Resource field name for GPU requests in containers. Default: "nvidia.com/gpu".
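The interaction of split count and memory scaling can be sketched with simple arithmetic. The formulas below are illustrative assumptions about how these flags combine, not HAMi's actual implementation:

```python
# Hypothetical sketch: how --device-split-count and --device-memory-scaling
# could determine the capacity a node advertises. Illustrative only.

def virtual_devices(physical_gpus: int, split_count: int) -> int:
    """Each physical GPU is advertised as split_count schedulable slots."""
    return physical_gpus * split_count

def virtual_memory_mib(physical_mib: int, scaling: float) -> int:
    """Memory scaling oversubscribes (>1.0) or reserves (<1.0) physical memory."""
    return int(physical_mib * scaling)

# A node with 2 GPUs of 24 GiB (24576 MiB) each, started with
# --device-split-count=2 and --device-memory-scaling=1.5:
print(virtual_devices(2, 2))           # 4 schedulable slots
print(virtual_memory_mib(24576, 1.5))  # 36864 MiB advertised per GPU
```

Note that a scaling factor above 1.0 oversubscribes physical memory, which relies on workloads not all reaching their limits simultaneously.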

Advanced Features

--mig-strategy

Resource exposure strategy for MIG-capable (Multi-Instance GPU) devices. Options:

  • none (default): MIG is ignored; GPUs are exposed as whole devices
  • single: MIG instances are exposed as ordinary GPU resources (all GPUs on the node share one MIG profile)
  • mixed: each MIG profile is exposed as its own resource type

--fail-on-init-error

Terminates plugin execution on initialization errors. Default: true (strict mode).

--nvidia-driver-root

Root path for NVIDIA driver installation. Default: /.

--pass-device-specs

Controls whether DeviceSpecs list is passed to kubelet during Allocate(). Default: false.

--device-list-strategy

Method for passing device lists to runtime. Options:

  • envvar (default): Via environment variables
  • volume-mounts: Via mounted volumes
  • cdi-annotations: Via Container Device Interface annotations

--device-id-strategy

Device ID passing method. Options:

  • uuid (default): Using unique identifiers
  • index: Using index numbers

--gds-enabled

Ensures GPUDirect Storage (GDS) is enabled in containers at launch.

--mofed-enabled

Ensures Mellanox OpenFabrics Enterprise Distribution (MOFED) support is enabled in containers at launch.

--config-file

Path to configuration file for overriding command-line args or environment variables.

--cdi-annotation-prefix

Prefix for CDI annotation keys. Uses preset value by default.

--nvidia-ctk-path

Path to nvidia-ctk tool for CDI spec generation.

--container-driver-root

NVIDIA driver directory path mounted inside containers for CDI specs.

--v

Logging verbosity level. Default: 0; higher values produce more detailed output.
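As with the scheduler, these flags are typically set on the device plugin container in its DaemonSet manifest. A minimal sketch follows; the image name and flag values are illustrative assumptions:

```yaml
# Illustrative excerpt of a device plugin DaemonSet's container spec
containers:
  - name: hami-device-plugin
    image: projecthami/hami:latest   # image name is an assumption
    args:
      - --device-split-count=4
      - --device-memory-scaling=1.0
      - --resource-name=nvidia.com/gpu
      - --mig-strategy=none
      - --device-list-strategy=envvar
      - --v=2
```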

Summary

HAMi's startup parameters are designed with both flexibility and security in mind, enabling fine-grained tuning based on:

  • Cluster scale
  • Hardware capabilities
  • Scheduling requirements
  • Monitoring strategies

Whether you're testing on a single node or deploying in a large-scale production environment, properly configuring these parameters helps maximize GPU resource utilization while enhancing scheduling efficiency and system stability.

To learn more about RiseUnion's GPU pooling, virtualization and computing power management solutions, please contact us: contact@riseunion.io