2025-12-12
In our previous article, "HAMi-core CUDA Compatibility Mechanisms: Cross-Version Stability Explained," we examined how HAMi-core addresses the fundamental challenge of getting workloads to run at all.
However, the concerns most frequently raised in community and user discussions revolve around resource contention.
Among various GPU virtualization technologies, memory isolation is typically straightforward (a matter of presence or absence), but compute (SM) partitioning presents a more challenging problem.
In this article, we dive into the HAMi-core source code to explore how it implements compute partitioning through time-slicing and dynamic feedback mechanisms.
HAMi-core does not modify the GPU hardware scheduler. Instead, it implements a time-slicing mechanism at the software layer by intercepting CUDA APIs. The core logic resides in the rate_limiter() function (src/multiprocess/multiprocess_utilization_watcher.c).
Think of it as a dynamic traffic control system:
The system maintains a global CUDA core token pool (g_cur_cuda_cores).
To determine when to "release" kernels, a background daemon thread, utilization_watcher(), continuously monitors the system.
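The token-pool mechanics described above can be sketched as follows. The global variable is named after the article's g_cur_cuda_cores; the acquire/refill arithmetic is an illustrative assumption of this sketch, not HAMi-core's exact code:

```c
#include <stdint.h>

/* Global CUDA core token pool (name taken from the article). */
static int64_t g_cur_cuda_cores = 0;

/* Called before a kernel launch: returns 1 when enough tokens exist
   (and consumes them), 0 when the launch should wait and retry. */
int try_acquire_tokens(int64_t cost) {
    if (g_cur_cuda_cores >= cost) {
        g_cur_cuda_cores -= cost;   /* spend tokens for this launch */
        return 1;
    }
    return 0;                       /* pool exhausted: throttle */
}

/* Called periodically by the watcher thread to refill the pool. */
void inject_tokens(int64_t amount) {
    g_cur_cuda_cores += amount;
}
```

A blocked launch simply retries (typically after a short sleep) until the watcher has injected enough tokens.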
This is the "brain" of the entire mechanism. The system dynamically adjusts the token injection rate by comparing target utilization with actual utilization:
The delta() function calculates the difference (Target - Current), which determines how many tokens to inject in the next cycle.
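A minimal sketch of that correction step, built around the (Target - Current) difference the article attributes to delta(). The per-percent scaling factor is an assumption of this sketch, not a HAMi-core constant:

```c
#include <stdint.h>

/* Proportional correction: positive when utilization is under target
   (inject more tokens), negative when over target (shrink the pool). */
int64_t delta_tokens(int target_util, int current_util,
                     int64_t tokens_per_percent) {
    return (int64_t)(target_util - current_util) * tokens_per_percent;
}
```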
Core Logic Summary:
HAMi-core essentially implements a soft rate limiting mechanism at the application layer. It does not directly control GPU hardware execution units, but rather controls the "rate" at which kernels are submitted to the GPU.
In HAMi, configuring compute limits is straightforward and intuitive. Users simply set it via an environment variable:
CUDA_DEVICE_SM_LIMIT accepts an integer from 0 to 100:

```shell
export CUDA_DEVICE_SM_LIMIT=50
```
This means processes within the container are limited to using no more than 50% of GPU compute capacity.
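Inside the library, a limiter would read this variable at startup. The helper below is hypothetical; the clamping policy and the "unset means unlimited" convention are assumptions of this sketch:

```c
#include <stdlib.h>

/* Hypothetical helper: read CUDA_DEVICE_SM_LIMIT and sanitize it
   into the documented 0-100 range. */
int read_sm_limit(void) {
    const char *value = getenv("CUDA_DEVICE_SM_LIMIT");
    if (value == NULL) return 100;   /* unset: treat as unlimited */
    long v = strtol(value, NULL, 10);
    if (v < 0) v = 0;                /* clamp into [0, 100] */
    if (v > 100) v = 100;
    return (int)v;
}
```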
HAMi-core employs a software-defined rate limiting approach. This design achieves broad GPU compatibility without relying on specific hardware features (such as MIG). To understand how it behaves in practice, consider the following technical characteristics:
The system uses a "sample-feedback" closed-loop mechanism: utilization_watcher collects data based on time windows, which means the system's response to burst traffic has inherent latency. Under scenarios with dramatic load fluctuations, utilization curves may exhibit brief dynamic oscillations around the target value before converging to a stable state.
The token bucket checkpoint is located before kernel launch: this mechanism primarily controls task submission frequency, rather than interrupting executing compute units. For the rare case of extremely long-running kernels, instantaneous compute usage may temporarily exceed the limit, but from a long-term statistical perspective, overall average compute usage remains controlled.
The token consumption model is based on Grid/Block dimension estimation: while this is not equivalent to hardware-level instruction cycles, for the vast majority of deep learning and compute workloads, this estimation model achieves a good balance between performance overhead and control effectiveness.
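A dimension-based cost estimate of the kind described above might look like this; the "total threads as cost" weighting is an assumption of the sketch:

```c
/* Estimate a kernel's token cost from its launch geometry:
   (grid volume) x (block volume) = total threads launched. */
long estimate_tokens(int gx, int gy, int gz, int bx, int by, int bz) {
    long grid_size  = (long)gx * gy * gz;   /* number of blocks */
    long block_size = (long)bx * by * bz;   /* threads per block */
    return grid_size * block_size;          /* total threads as cost */
}
```

A launch of 16 blocks of 256 threads would thus cost 4096 tokens, regardless of what the threads actually do, which is exactly the trade-off between overhead and precision the article notes.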
Through the source code analysis above, we can see that HAMi-core has chosen a lightweight, highly compatible technical path for compute partitioning.
Compared to hardware partitioning (such as MIG) with strict physical segmentation, HAMi's "soft isolation" mechanism is more flexible. It implements a general-purpose GPU compute scheduler at the software layer, allowing a degree of resource sharing when resources are idle, while applying rate limiting when resources are constrained. This approach is well-suited for improving overall cluster utilization and multi-workload co-location scenarios, providing an efficient resource management solution for cloud-native AI workloads.
RiseUnion's Rise VAST (Enterprise Edition) builds upon HAMi-core with additional enhancements and optimizations, particularly in timing control precision, workload management, and lock-free performance under high concurrency, to meet the stringent SLA requirements of enterprise customers in production environments.
To learn more about RiseUnion's vGPU resource pooling, virtualization, and AI compute management solutions, please contact us at contact@riseunion.io.