
Rise CAMP: AI Compute Scheduling Platform

Fine-grained Slicing · Precise Scheduling: 4-way intelligent scheduling, making every byte of VRAM count

Platform Overview

GPUs are expensive, but most sit underutilized. Rise CAMP fixes that. On top of Rise VAST managed resources, it uses vGPU slicing to turn one card into many, and 4-way scheduling to match every task to the right GPU. It also provides ready-to-use dev environments, multi-cluster management, and distributed training, and serves as the core scheduling engine for Rise MAX appliances.
30% → 70%

GPU cluster utilization boost

4-way

Intelligent scheduling strategies

10+

Domestic chip vendors supported

6,000+

GPUs under management

Fine-grained Slicing

vGPU Fine-grained Slicing

Fine-grained compute and VRAM partitioning lets multiple tasks share a single physical GPU. Compute and VRAM overcommit supports 200+ small models loaded on demand, boosting GPU utilization from 30% to 70%+.

Domestic Chip Dynamic Partitioning

Breaks through vendor limitations such as Ascend's fixed 1/2 and 1/4 card splits and KunlunXin's fixed 24/48/96GB specs. Partitions are allocated dynamically and intelligently without a restart, turning complex manual configuration into one-click deployment.

VRAM Isolation & Alignment

Strict VRAM boundary checks prevent the out-of-bounds access that causes performance degradation or crashes. Requests are auto-aligned to valid specs, containers are isolated from one another, and VRAM usage is monitored in real time.
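
As an illustration, aligning a request up to the nearest valid spec might look like this (a hypothetical function, not CAMP's actual implementation; the 24/48/96GB specs are the KunlunXin sizes mentioned above):

```python
# Illustrative sketch: round a VRAM request up to the smallest valid
# vendor spec that fits, as auto-alignment does for chips with fixed
# partition sizes.

def align_vram(requested_gb: float, valid_specs_gb: list[float]) -> float:
    """Return the smallest valid spec that can hold the request."""
    for spec in sorted(valid_specs_gb):
        if spec >= requested_gb:
            return spec
    raise ValueError(f"request {requested_gb}GB exceeds the largest spec")

# Example: KunlunXin-style fixed specs of 24/48/96 GB.
print(align_vram(30, [24, 48, 96]))  # -> 48: a 30GB request lands on a 48GB slice
```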

Multi-Model Co-location

Agent-era multi-model deployment: precisely partition a 7B router, a 14B summarizer, and an 8B embedding model onto one 80G GPU (20G + 30G + 30G) with hard isolation.
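
The arithmetic is easy to check; a toy validator (hypothetical, for illustration only):

```python
# Illustrative sketch of the co-location arithmetic above: verify that a
# set of hard-isolated vGPU slices fits within one physical card's VRAM.

def plan_fits(card_vram_gb: int, slices_gb: list[int]) -> bool:
    """True if the requested slices fit within the card's VRAM."""
    return sum(slices_gb) <= card_vram_gb

# 7B router (20G) + 14B summarizer (30G) + 8B embedding model (30G)
# on one 80G GPU: 20 + 30 + 30 = 80, the card is exactly full.
assert plan_fits(80, [20, 30, 30])
assert not plan_fits(80, [20, 30, 40])  # 90G would overflow the card
```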

Speculative Decoding Foundation

Deploy a 7B draft model and a 72B target model on specific vGPU slices of the same physical node, leveraging shared memory for ultra-fast data exchange instead of dedicating a full GPU to each model.

K8s Standardization

Exposes GPUs as standard, countable K8s resources like CPU and memory, enabling Volcano and other advanced schedulers to run complex bin-packing where every byte of VRAM is precisely utilized.
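
To illustrate the idea, here is a hypothetical bin-packing scorer in the spirit of a scheduler scoring plugin (the function and data are invented for this sketch, not CAMP's or Volcano's actual API):

```python
# Illustrative bin-packing score: prefer the GPU that would be fullest
# after placement, so free VRAM stays consolidated rather than fragmented.

def binpack_score(free_gb: float, total_gb: float, request_gb: float) -> float:
    """Score a GPU for a request; higher = fuller after placement."""
    if request_gb > free_gb:
        return -1.0  # request does not fit on this GPU
    used_after = total_gb - free_gb + request_gb
    return used_after / total_gb

gpus = {"gpu-a": (50.0, 80.0), "gpu-b": (22.0, 80.0)}  # (free, total) in GB
best = max(gpus, key=lambda g: binpack_score(*gpus[g], request_gb=20.0))
print(best)  # -> gpu-b: packing the nearly-full card keeps gpu-a's 50G contiguous
```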

Developer Productivity

Ready-to-use Dev Environment

Pre-installed PyTorch/TensorFlow/Paddle environments on Jupyter and VS Code, with SSH access and native TensorBoard integration. Environments start instantly, with no tedious configuration.

Distributed Training

One-click multi-node multi-GPU distributed training with PyTorch, TensorFlow, MPI, and DeepSpeed. Built-in TensorBoard for visual training progress tracking.

Multi-tenant Resource Isolation

Four-tier RBAC (platform admin, tenant owner, project admin, project member) with team and project-based resource quotas. Flexible shared and dedicated pool combinations.

Checkpointing & Auto-recovery

Automatic checkpoint saving for training jobs. Faulty nodes auto-isolated and workloads rescheduled, minimizing training time lost to hardware failures.
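
The pattern CAMP automates looks roughly like this in user code; a minimal sketch assuming a PyTorch job with its checkpoint on shared storage (the path and helper names are hypothetical):

```python
# Sketch of the checkpoint-and-resume pattern that lets a rescheduled job
# continue from the last saved step instead of restarting from scratch.
import os
import torch

CKPT = "checkpoint.pt"  # on shared storage, so any node can resume

def save_ckpt(model, optimizer, step):
    torch.save({"model": model.state_dict(),
                "optim": optimizer.state_dict(),
                "step": step}, CKPT)

def load_ckpt(model, optimizer):
    if not os.path.exists(CKPT):
        return 0  # fresh start: no checkpoint yet
    state = torch.load(CKPT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["step"] + 1  # resume after the last completed step
```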

Image Registry & Storage

Built-in image registry with base images, custom images, and external registry support. Public and custom storage configuration with data persistence guarantees.

Multi-cluster Management

Unified management of multiple K8s clusters across geographies and architectures (x86/ARM). LAN-based inter-cluster coordination with vGPU support on edge nodes.

Use Cases

Heterogeneous GPU Resource Pool

Unified management of NVIDIA H20, Ascend 910B, KunlunXin P800 multi-architecture clusters. A state-owned bank built a heterogeneous pool with CAMP, managing 600+ servers with 50%+ utilization improvement.

Inference & Agent Co-location

Mix online inference and offline training on the same cluster via vGPU slicing and priority scheduling. A telecom provider runs 500+ model services across 100+ servers at 70%+ GPU utilization.

Multi-tenant AI Dev Platform

Unified development environments and compute resources for multiple R&D teams. A financial institution deploys risk, marketing, and customer service AI applications supporting hundreds of stable model services.

Cross-region Multi-cluster Scheduling

Unified management across multiple data centers (e.g., Beijing, Inner Mongolia) with 100G interconnect. A manufacturing enterprise achieved 60%+ utilization improvement through unified local and remote GPU management.

Frequently Asked Questions

01 How is Rise CAMP different from Run:ai, Volcano, or Kueue?
All three target AI workload scheduling, but with different focus: Volcano / Kueue are K8s-native batch schedulers — strong on queue management and gang scheduling, but lacking deep support for GPU virtualization and domestic chips. Run:ai is a complete AI scheduling solution but is closed-source, NVIDIA-locked, and effectively unusable in domestic-stack scenarios. Rise CAMP differentiates on: (1) deep integration with Rise VAST, with scheduling decisions reaching vGPU granularity; (2) native topology-aware scheduling for Ascend, Cambricon, and other domestic accelerators; (3) four-dimensional awareness — topology, priority, load, and resource; (4) fully private deployable across the domestic stack.
02 Does CAMP require VAST?
CAMP runs as a Kubernetes-native scheduler on any standard GPU cluster, delivering topology awareness, priority preemption, and fair queuing. But CAMP + VAST is the best-practice combo — VAST provides vGPU virtualization, enabling CAMP to push scheduling decisions to vGPU granularity for true fine-grained sharing and compute oversubscription.
03 What scheduling policies does CAMP offer? How do I choose?
CAMP provides four-dimensional scheduling: topology-aware (prefers NVLink / PCIe / NUMA-local placement to avoid cross-topology bottlenecks, critical for multi-GPU training), priority-aware (high-priority workloads can preempt low-priority ones, with multi-level queues), load-aware (dynamic allocation based on real-time GPU utilization to avoid hotspots), and resource-aware (multi-dimensional matching across memory, compute, and network bandwidth). Also supports gang scheduling, binpack / spread, and custom scoring plugins. Different teams can run different policy combinations.
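
For intuition, here is a hypothetical sketch of how several awareness dimensions could combine into one placement score (the weights, names, and interface are invented for illustration, not CAMP's plugin API; priority mainly orders the queue rather than scoring nodes, so it is omitted here):

```python
# Illustrative multi-dimensional placement score: topology locality,
# current load, and resource fit each contribute a normalized 0..1 term.
from dataclasses import dataclass

@dataclass
class NodeScores:
    topology: float  # NVLink/NUMA locality, 0..1
    load: float      # 1 - current utilization, 0..1
    resource: float  # fit across VRAM/compute/bandwidth, 0..1

WEIGHTS = {"topology": 0.5, "load": 0.3, "resource": 0.2}

def placement_score(s: NodeScores) -> float:
    return (WEIGHTS["topology"] * s.topology
            + WEIGHTS["load"] * s.load
            + WEIGHTS["resource"] * s.resource)

# A training team might weight topology heavily, as here; an inference
# team could run a different WEIGHTS table ("different policy combinations").
print(placement_score(NodeScores(topology=1.0, load=0.4, resource=0.7)))  # 0.76
```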
04 How does CAMP schedule distributed multi-node training? Does it support PyTorch DDP / DeepSpeed?
Native support for PyTorch DDP, DeepSpeed, Megatron, and Horovod. CAMP's topology-aware scheduling co-locates workers from the same training job on NVLink-connected GPUs within a node, and selects high-bandwidth IB / RoCE topologies for cross-node placement. Combined with gang scheduling (all-or-nothing Pod startup), this prevents distributed training jobs from wasting resources when partial Pods get stuck.
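
A toy model of that all-or-nothing admission step (hypothetical and greatly simplified; real gang scheduling also weighs CPU, network, and affinity):

```python
# Illustrative gang admission: a distributed job's workers are placed only
# if every one of them fits; otherwise nothing is admitted, so no partial
# set of pods sits idle holding GPUs.

def gang_admit(worker_vram_gb: float, workers: int,
               free_gb_per_gpu: list[float]) -> list[int] | None:
    """Return a GPU index per worker, or None if any worker can't fit."""
    chosen, free = [], free_gb_per_gpu.copy()
    for _ in range(workers):
        fits = [i for i, f in enumerate(free) if f >= worker_vram_gb]
        if not fits:
            return None  # one worker can't place -> admit nothing
        i = fits[0]
        free[i] -= worker_vram_gb
        chosen.append(i)
    return chosen

print(gang_admit(40, 4, [80, 80]))  # -> [0, 0, 1, 1]: all four workers fit
print(gang_admit(40, 4, [80, 40]))  # -> None: only 3 would fit, so none start
```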
05 If the cluster runs out of resources, do new jobs get rejected or queued? Does CAMP support preemption?
CAMP provides full queue management: resource-available jobs schedule immediately, others enter queues with priority sorting, quota limits, and max-wait policies. Priority preemption is supported — high-priority jobs can evict running low-priority ones (e.g., daytime inference services preempting nightly training), with preempted jobs auto-requeued. All preemption decisions are logged, and you can configure non-preemptible allowlists to protect critical workloads.
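
A simplified model of that decision (hypothetical names and logic; CAMP's real implementation runs inside the scheduler and logs every preemption):

```python
# Illustrative schedule-or-preempt decision: place if resources allow,
# otherwise evict the lowest-priority running job that frees enough VRAM,
# respecting a non-preemptible allowlist. Queued/evicted jobs requeue.

def try_schedule(job_prio, job_gb, free_gb, running, non_preemptible=()):
    """running: list of (priority, name, gb). Returns (placed, victim_or_None)."""
    if job_gb <= free_gb:
        return True, None  # resources available: schedule immediately
    victims = [r for r in running
               if r[0] < job_prio            # strictly lower priority
               and r[1] not in non_preemptible
               and free_gb + r[2] >= job_gb]  # eviction frees enough room
    if not victims:
        return False, None  # no victim: job waits in the priority queue
    victim = min(victims)   # evict the lowest-priority candidate
    return True, victim     # caller requeues the victim and logs the event

# Daytime inference (prio 100) preempting a nightly training job (prio 10):
placed, victim = try_schedule(100, 40, 0, [(10, "nightly-train", 40)])
print(placed, victim)  # True (10, 'nightly-train', 40): training is requeued
```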
06 How do I prevent a single team or project from monopolizing the cluster? How does multi-tenant quota work?
CAMP provides multi-level quota management: cluster → tenant → project, each with limits on GPU count, memory, CPU, and RAM. Combined with fair queuing, simultaneous job submissions from multiple teams get resources proportional to their quota — no single team can monopolize. Real-time quota usage is fully visualized.
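
For intuition, a sketch of a quota-proportional split when demand exceeds supply (illustrative arithmetic only, not CAMP's fair-queuing implementation):

```python
# Illustrative fair share: when several teams ask for more than the
# cluster has, each receives GPUs proportional to its quota.

def fair_share(total_gpus: int, quotas: dict[str, int]) -> dict[str, int]:
    """Split total_gpus across teams in proportion to their quotas."""
    quota_sum = sum(quotas.values())
    share = {t: total_gpus * q // quota_sum for t, q in quotas.items()}
    # Hand any rounding leftovers to the largest quotas first.
    leftover = total_gpus - sum(share.values())
    for t in sorted(quotas, key=quotas.get, reverse=True)[:leftover]:
        share[t] += 1
    return share

print(fair_share(10, {"risk": 60, "marketing": 30, "search": 10}))
# -> {'risk': 6, 'marketing': 3, 'search': 1}: proportional, no monopoly
```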
07 Can inference services and training jobs share the same cluster without interfering with each other?
This is one of CAMP's core scenarios: training and inference colocation. With VAST's vGPU virtualization, inference services (high-priority, low-latency) and training jobs (low-priority, preemptible) can run on the same physical GPU. CAMP's priority preemption + compute quotas guarantee inference SLAs are unaffected by training, while training jobs automatically reclaim capacity when inference load drops. Overall GPU utilization climbs from 30% to 70%+.
08 How does CAMP integrate with our existing IAM / LDAP / SSO? What's the permission model?
CAMP supports standard OIDC, LDAP, SAML, WeCom, and DingTalk authentication, integrating with enterprise IAM systems like Okta, Auth0, Keycloak, or in-house IdPs. The permission model extends K8s RBAC with three-level access control — tenant / project / role, supporting fine-grained resource permissions (which user can use which GPU resources under which project). All operations are audit-logged for MLPS compliance.