2024-11-29
HAMi, formerly known as 'k8s-vGPU-scheduler', is a heterogeneous device management middleware for Kubernetes. It can manage different types of heterogeneous devices (such as GPUs and NPUs), share them among pods, and make better scheduling decisions based on device topology and scheduling policies.
HAMi aims to bridge the gap between different heterogeneous devices and provide a unified interface for managing them, with no changes required to applications. As of June 2024, HAMi is widely used around the world across industries such as Internet services, cloud computing, finance, and manufacturing. More than 40 companies and institutions are not only end users but also active contributors.
HAMi is a Cloud Native Computing Foundation (CNCF) sandbox and landscape project, as well as a CNAI Landscape project.
HAMi provides device virtualization for several heterogeneous devices, including GPUs, supporting both device sharing and device resource isolation. For the list of devices that support virtualization, please refer to supported devices.
HAMi supports hard isolation of device resources. Here is a simple demonstration using an NVIDIA GPU as an example. Submit a task defined as follows:
```yaml
resources:
  limits:
    nvidia.com/gpu: 1 # requesting 1 vGPU
    nvidia.com/gpumem: 3000 # each vGPU is allocated 3000m device memory
```
Only 3000m (roughly 3 GB) of device memory will be visible inside the container.
HAMi consists of several components: a unified mutating webhook, a unified scheduler, and device plugins with in-container control components for the various heterogeneous compute devices. The overall architecture is shown in the diagram above.
```bash
# Label your GPU nodes so the HAMi scheduler can use them
kubectl label nodes {nodeid} gpu=on

# Add the HAMi Helm repository and install the chart into kube-system
helm repo add hami-charts https://project-hami.github.io/HAMi/
helm install hami hami-charts/hami -n kube-system
```
You can customize your installation by adjusting the chart's configs.
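For example, a minimal sketch of a customized install; `scheduler.kubeScheduler.imageTag` should match your cluster's Kubernetes version, and the chosen values here are illustrative, so verify the keys against the chart's values.yaml:

```bash
# Illustrative customized install; the image tag v1.28.0 is an assumed example
helm install hami hami-charts/hami -n kube-system \
  --set scheduler.kubeScheduler.imageTag=v1.28.0 \
  --set devicePlugin.service.httpPort=31993
```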
Verify the installation with:

```bash
kubectl get pods -n kube-system
```

If both the vgpu-device-plugin and vgpu-scheduler pods are in the Running state, your installation was successful.
HAMi-WebUI is available starting with HAMi v2.4.0; refer to its deployment instructions.
Containers can now request NVIDIA vGPUs using the resource type nvidia.com/gpu:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: ubuntu-container
      image: ubuntu:18.04
      command: ["bash", "-c", "sleep 86400"]
      resources:
        limits:
          nvidia.com/gpu: "2" # requesting 2 vGPUs
          nvidia.com/gpumem: "3000" # each vGPU is allocated 3000m device memory (optional, integer)
          nvidia.com/gpucores: "30" # each vGPU uses 30% of the actual GPU's computing power (optional, integer)
```
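To try the example, save the manifest (the file name gpu-pod.yaml is chosen here for illustration) and apply it:

```bash
kubectl apply -f gpu-pod.yaml
kubectl get pod gpu-pod   # should reach Running once a node with enough GPU resources is found
```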
If your task cannot run on any node (for example, because its nvidia.com/gpu request exceeds the actual GPU count of every GPU node in the cluster), it will remain in the Pending state.
You can now execute the nvidia-smi command in the container to compare the vGPU memory size with the actual GPU memory size.
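A quick check, assuming the pod and container names from the example above:

```bash
# nvidia-smi inside the container should report the nvidia.com/gpumem limit
# (3000 MiB here), not the physical GPU's full memory
kubectl exec -it gpu-pod -c ubuntu-container -- nvidia-smi
```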
Note:
- If you use the privileged field, the task will not be scheduled, because a privileged container sees all GPUs and would affect other tasks.
- Do not set the nodeName field; use a nodeSelector instead for similar requirements, as sketched below.
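A minimal sketch of steering a pod onto labeled GPU nodes with a nodeSelector, reusing the gpu=on label applied during installation:

```yaml
# Partial pod spec: constrain scheduling with a nodeSelector instead of nodeName
spec:
  nodeSelector:
    gpu: "on" # matches nodes labeled with `kubectl label nodes {nodeid} gpu=on`
```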
For more examples, see: Examples.
Monitoring is automatically enabled after installation. Obtain a cluster overview by visiting:

http://{scheduler ip}:{monitorPort}/metrics

The default monitorPort is 31993; a different value can be set with --set devicePlugin.service.httpPort during installation.
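For example, the metrics endpoint can be scraped with a plain HTTP GET (substitute your scheduler node's IP; 31993 is the default port):

```bash
curl http://{scheduler ip}:31993/metrics
```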
Grafana dashboard example
Note: a node's vGPU status is only collected after the node has run a vGPU task.
RiseUnion, one of the core contributors to the HAMi open source community, continues to promote the community's joint development and growth.
If you want to become a contributor to HAMi, please refer to: Contributor Guide.
For more details, please refer to: HAMi GitHub.
Rise VAST is the HAMi Enterprise Edition launched by RiseUnion in collaboration with 4Paradigm, building on the open source version. It adds numerous enterprise-grade capabilities, including:
- stacking of computing power and memory
- computing power expansion and preemption
- computing power specification definitions
- NVLink topology awareness
- differentiated scheduling strategies
- enterprise-grade isolation
- resource quota control
- multi-cluster management
- audit logs
- high-availability guarantees
- fine-grained operational analysis

By providing unified management, shared allocation, on-demand distribution, and rapid scheduling of computing clusters, it fully unleashes the potential of heterogeneous computing power, accelerating the modernization and intelligent transformation of AI infrastructure. Click to view the related report on the HAMi Enterprise Edition agreement signed by RiseUnion and 4Paradigm.