Distributed Training

Multi-framework, multi-GPU, multi-node — seamless scaling from single-machine experiments to large-scale distributed training

Product Overview

Enterprise-grade distributed training platform built on the Volcano scheduler. Supports five major frameworks — PyTorch, TensorFlow, DeepSpeed, MPI, and LlamaFactory — across heterogeneous hardware including NVIDIA GPU, vGPU, and Ascend NPU. Provides a 12-state lifecycle manager, real-time training progress tracking, and TensorBoard visualization.

Core Capabilities

Five Framework Support

Native support for PyTorch, TensorFlow, DeepSpeed, MPI, and LlamaFactory. Each framework auto-configures its task roles (master/worker/ps/launcher) — no manual distributed topology setup required.
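As an illustration of the auto-configured roles, a Volcano-style Job body for a one-master, N-worker PyTorch run could be assembled as below. The field layout follows the public Volcano Job CRD; the function name, image, and resource counts are illustrative assumptions, not this platform's actual API:

```python
# Sketch of a Volcano Job body with auto-configured task roles
# (master/worker). Field names follow the public Volcano Job CRD;
# the image and resource numbers are illustrative assumptions.
def pytorch_job(name, workers=2, gpus_per_pod=1, image="pytorch/pytorch:latest"):
    def task(role, replicas):
        return {
            "name": role,
            "replicas": replicas,
            "template": {
                "spec": {
                    "containers": [{
                        "name": role,
                        "image": image,
                        "resources": {"limits": {"nvidia.com/gpu": gpus_per_pod}},
                    }],
                    "restartPolicy": "OnFailure",
                }
            },
        }

    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "schedulerName": "volcano",
            "minAvailable": 1 + workers,  # gang: master plus every worker
            "tasks": [task("master", 1), task("worker", workers)],
        },
    }
```

A user would pick only the framework and replica count; a spec of this shape is what "no manual distributed topology setup" amounts to.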

Volcano Distributed Scheduling

Volcano Gang Scheduling ensures coordinated multi-Pod launches with queue priority (high/medium/low), minimum-available guarantees, and automatic timeout reclamation for deterministic resource allocation at scale.
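The all-or-nothing semantics of Gang Scheduling can be shown with a toy admission check, a conceptual sketch rather than Volcano's actual algorithm: a job's pods are placed only when free capacity covers its entire minimum-available gang, never partially.

```python
# Toy illustration of gang (all-or-nothing) admission: a job starts
# only if free GPUs cover all of its minAvailable pods. Conceptual
# sketch only; not Volcano's real scheduling algorithm.
def gang_admit(free_gpus, jobs):
    """jobs: list of (name, min_available, gpus_per_pod) tuples,
    already ordered by queue priority. Returns admitted job names."""
    admitted = []
    for name, min_available, gpus_per_pod in jobs:
        need = min_available * gpus_per_pod
        if need <= free_gpus:      # place the whole gang...
            free_gpus -= need
            admitted.append(name)
        # ...or place nothing, so no half-started job deadlocks peers
    return admitted
```

Partial placement is the failure mode gang scheduling exists to prevent: two jobs each holding half their pods can block each other forever.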

Unified Heterogeneous Hardware

Unified resource mapping across NVIDIA GPU (nvidia.com/gpu), vGPU (volcano.sh/vgpu-number + vgpu-memory), and Ascend NPU (volcano.sh/Ascend910A/B). A single API submission auto-adapts to the underlying hardware.
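A minimal sketch of that unified mapping, using the resource keys named above; the function shape and hardware labels are illustrative assumptions, not the platform's API:

```python
# Map a hardware choice to Kubernetes resource-limit keys. The keys
# are the ones named in the text; the function itself is illustrative.
def resource_limits(hardware, count, vgpu_memory_mb=None):
    if hardware == "nvidia-gpu":
        return {"nvidia.com/gpu": count}
    if hardware == "vgpu":
        limits = {"volcano.sh/vgpu-number": count}
        if vgpu_memory_mb is not None:
            limits["volcano.sh/vgpu-memory"] = vgpu_memory_mb
        return limits
    if hardware in ("ascend-910a", "ascend-910b"):
        suffix = "Ascend910A" if hardware == "ascend-910a" else "Ascend910B"
        return {"volcano.sh/" + suffix: count}
    raise ValueError("unknown hardware type: " + hardware)
```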

Real-Time Progress Tracking

Sub-second metric collection — step, loss, gradNorm, learningRate, epoch — rendered as live training curves. Supports tqdm log parsing for LlamaFactory with automatic training phase detection.
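Extracting such metrics from trainer logs can be sketched with a regular expression. The sample line below is an invented tqdm-style format, not LlamaFactory's exact output; a real parser would need one pattern per framework:

```python
import re

# Pull step/loss/lr metrics from a trainer log line. The expected
# format is an invented tqdm-style example, not any framework's
# exact output.
METRIC_RE = re.compile(
    r"step[=\s]+(?P<step>\d+).*?"
    r"loss[=\s]+(?P<loss>[\d.]+).*?"
    r"lr[=\s]+(?P<lr>[\d.eE+-]+)"
)

def parse_metrics(line):
    m = METRIC_RE.search(line)
    if m is None:
        return None  # not a metrics line; skip it
    return {
        "step": int(m["step"]),
        "loss": float(m["loss"]),
        "learningRate": float(m["lr"]),
    }
```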

TensorBoard Integration

One-click TensorBoard launch with auto-generated Ingress URL. Supports multi-experiment comparison. Training jobs automatically create TensorBoard log directories — zero extra configuration.

Model Comparison & Export

Launch temporary inference services post-training for A/B comparison. One-click model export with LoRA merge and quantization options (INT8/INT4).
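The INT8 option stores each weight as an 8-bit integer plus a shared scale. A minimal sketch of symmetric per-tensor INT8 quantization, purely to illustrate the idea; the platform's actual exporter is not shown here:

```python
# Minimal symmetric per-tensor INT8 quantization: store 8-bit
# integers plus one float scale; dequantize by multiplying back.
# Illustrative only.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]
```

INT4 follows the same scheme with a 15-step range instead of 127, trading more reconstruction error for half the storage.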

Framework Support Matrix

Framework | Single GPU | Multi-GPU | Multi-Node Distributed | Ascend NPU
PyTorch
TensorFlow
DeepSpeed
MPI
LlamaFactory

Training Workflow

1. Select Framework: Choose the training framework and resource spec; the platform auto-recommends a task topology.

2. Configure Resources: Specify GPU type, queue priority, data storage, and shared memory.

3. Submit Job: The Volcano scheduler co-allocates resources with Gang Scheduling for a synchronized startup.

4. Monitor Training: Watch live loss curves, per-Pod GPU/memory metrics, and TensorBoard visualizations.

5. Export Model: A/B-compare results, then export or merge the model with one click.
