Distributed Training
Multi-framework, multi-GPU, multi-node — seamless scaling from single-machine experiments to large-scale distributed training
Product Overview
Core Capabilities
Support for Five Frameworks
Native support for PyTorch, TensorFlow, DeepSpeed, MPI, and LlamaFactory. Each framework auto-configures its task roles (master/worker/ps/launcher); no manual distributed-topology setup is required.
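The platform generates this topology automatically from the framework choice. Purely for illustration, the hand-written Volcano Job equivalent of a one-master, two-worker PyTorch run might look like the sketch below (the job name, image, and command are assumptions, not platform defaults):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: pytorch-demo            # illustrative name
spec:
  schedulerName: volcano
  minAvailable: 3               # master + 2 workers must start together
  tasks:
    - name: master
      replicas: 1
      template:
        spec:
          containers:
            - name: trainer
              image: pytorch/pytorch:latest   # assumed image
              command: ["python", "train.py"] # assumed entrypoint
    - name: worker
      replicas: 2
      template:
        spec:
          containers:
            - name: trainer
              image: pytorch/pytorch:latest
              command: ["python", "train.py"]
```

Submitting through the platform API replaces all of this with a single framework + replica-count selection.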
Volcano Distributed Scheduling
Volcano Gang Scheduling ensures coordinated multi-Pod launches with queue priority (high/medium/low), minimum-available guarantees, and automatic timeout reclamation for deterministic resource allocation at scale.
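As a sketch of how these knobs surface in Volcano's own API (the queue name, weight, and timeout value below are illustrative assumptions, not platform defaults):

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: high-priority            # one queue per priority tier
spec:
  weight: 100                    # relative share vs. medium/low queues
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-demo
spec:
  schedulerName: volcano
  queue: high-priority
  minAvailable: 4                # gang guarantee: all-or-nothing start
  ttlSecondsAfterFinished: 600   # reclaim resources after the job ends
```

With `minAvailable` set, no Pod of the job starts until the scheduler can place all four, which prevents partial allocations from deadlocking the cluster.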
Unified Heterogeneous Hardware
Unified resource mapping across NVIDIA GPU (nvidia.com/gpu), vGPU (volcano.sh/vgpu-number + vgpu-memory), and Ascend NPU (volcano.sh/Ascend910A/B). A single API submission auto-adapts to the underlying hardware.
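In Pod terms, the mapping amounts to swapping the resource keys in the container spec. The fragments below are container-level sketches; the quantities are illustrative:

```yaml
# NVIDIA GPU: one full device
resources:
  limits:
    nvidia.com/gpu: 1
---
# vGPU: one virtual slice with a memory cap (value illustrative)
resources:
  limits:
    volcano.sh/vgpu-number: 1
    volcano.sh/vgpu-memory: 8192
---
# Ascend NPU (resource key as exposed by the platform)
resources:
  limits:
    volcano.sh/Ascend910A: 1
```

The platform selects the appropriate key set from the hardware type in the API submission, so job definitions stay hardware-agnostic.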
Real-Time Progress Tracking
Sub-second collection of training metrics (step, loss, gradNorm, learningRate, epoch), rendered as live training curves. Supports tqdm log parsing for LlamaFactory with automatic training-phase detection.
TensorBoard Integration
One-click TensorBoard launch with an auto-generated Ingress URL. Supports multi-experiment comparison. Training jobs create TensorBoard log directories automatically; no extra configuration is needed.
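The platform creates this route automatically; a generated route is conceptually equivalent to a standard Kubernetes Ingress like the one below (the host, path, and Service name are assumptions for illustration):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tensorboard-job-123          # illustrative name
spec:
  rules:
    - host: tb.example.com           # assumed domain
      http:
        paths:
          - path: /jobs/123/tensorboard
            pathType: Prefix
            backend:
              service:
                name: tensorboard-job-123   # Service fronting TensorBoard
                port:
                  number: 6006              # TensorBoard default port
```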
Model Comparison & Export
Launch a temporary inference service after training for A/B comparison. One-click model export with LoRA-merge and quantization options (INT8/INT4).
Framework Support Matrix
| Framework | Single GPU | Multi-GPU | Multi-Node Distributed | Ascend NPU |
|---|---|---|---|---|
| PyTorch | ✓ | ✓ | ✓ | ✓ |
| TensorFlow | ✓ | ✓ | ✓ | — |
| DeepSpeed | ✓ | ✓ | ✓ | — |
| MPI | ✓ | ✓ | ✓ | — |
| LlamaFactory | ✓ | ✓ | ✓ | ✓ |
Training Workflow
Select Framework
Choose a training framework and resource spec; the platform auto-recommends a task topology
Configure Resources
Specify the GPU type, queue priority, data storage, and shared memory
Submit Job
The Volcano scheduler co-allocates resources via Gang Scheduling for a synchronized startup
Monitor Training
Live loss curves, per-Pod GPU/memory metrics, TensorBoard visualization
Export Model
A/B-compare results, then export or merge the model with one click