Dynamic context parallelism cuts waste in variable-length training
NVIDIA introduced Dynamic Context Parallelism (Dynamic-CP) in Megatron Core, a per-microbatch scheduling approach that adapts context-parallel (CP) sharding to variable-length sequences, reducing idle time and communication overhead when sequence lengths fluctuate.
- Dynamic-CP targets LLM post-training and diffusion transformer (DiT) pre-training, where real datasets show long-tail sequence lengths that skew compute and memory.
- NVIDIA reported up to 1.48× training speedup on real-world datasets by selecting a CP size that better matches each packed microbatch.
- Even with sample-level packing, attention’s quadratic compute cost means packs with equal token counts can still carry very different amounts of work, creating data-parallel imbalance that leaves some GPU ranks waiting at gradient synchronization (see the cost sketch after this list).
- Static CP sizing based on the longest sequence can force short sequences to shard unnecessarily, adding attention communication cost; the overhead is most visible when CP spans InfiniBand domains and the per-rank compute is too small to hide it.
- Megatron Core’s Dynamic-CP approach relies on a solver that chooses packing and CP size per microbatch without exceeding GPU memory limits, while avoiding the heavyweight reconfiguration that changing tensor- or pipeline-parallel sizes would require (a simplified selection sketch also follows the list).
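
Why equal-token packs can still be unbalanced: attention cost grows roughly with the square of each sample's length, so a pack built from one long-tail sample costs far more than a pack of many short samples with the same total token count. The snippet below is an illustrative cost model, not Megatron Core code; the `attention_cost` helper is hypothetical.

```python
# Illustrative only: compare the attention cost of two packs with equal token counts.
# Cost model: with per-sample attention masking, self-attention work scales with the
# square of each sample's length, so cost(pack) ~ sum(l_i ** 2) over packed samples.

def attention_cost(pack: list[int]) -> int:
    """Relative attention cost of a packed microbatch (sum of squared lengths)."""
    return sum(length * length for length in pack)

# Two packs, both 8192 tokens total ("equal length" after packing).
pack_a = [8192]          # one long-tail sample
pack_b = [1024] * 8      # eight short samples

print(attention_cost(pack_a))  # 67_108_864
print(attention_cost(pack_b))  # 8_388_608  -> ~8x less attention work
```

The rank holding `pack_a` does roughly eight times the attention work, so the other data-parallel ranks idle at gradient synchronization.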
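And a minimal sketch of what per-microbatch CP-size selection could look like under a simple per-rank memory model. The `pick_cp_size` helper, the candidate CP sizes, and the token budget are assumptions for illustration only, not Megatron Core's actual solver.

```python
# A minimal sketch, assuming activation memory is roughly proportional to the
# number of tokens each CP rank holds. Names and numbers here are hypothetical.

CANDIDATE_CP_SIZES = [1, 2, 4, 8]      # candidate CP group sizes (assumed)
MEMORY_BUDGET_TOKENS = 16_384          # assumed per-rank activation budget, in tokens

def pick_cp_size(pack: list[int]) -> int:
    """Pick the smallest CP size that fits the packed microbatch in memory.

    Small CP sizes avoid unnecessary attention communication for short packs;
    larger sizes are used only when the pack would exceed the per-rank budget.
    """
    total_tokens = sum(pack)
    for cp in CANDIDATE_CP_SIZES:
        if total_tokens / cp <= MEMORY_BUDGET_TOKENS:
            return cp
    return CANDIDATE_CP_SIZES[-1]      # fall back to the maximum CP size

print(pick_cp_size([1024] * 8))        # 1 -> short pack, no CP sharding needed
print(pick_cp_size([65_536]))          # 4 -> long sequence shards across 4 ranks
```

Choosing CP per microbatch this way keeps short packs on a single rank while still sharding long-tail sequences, which is the intuition behind the reported speedup.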