Dynamic context parallelism cuts waste in variable-length training

NVIDIA described Dynamic Context Parallelism (Dynamic-CP) in Megatron Core, a per-microbatch scheduling approach that adapts context-parallel sharding to variable-length sequences to reduce idle time and communication overhead.

The technique lets Megatron Core vary context-parallel (CP) sharding per microbatch when sequence lengths fluctuate, instead of fixing a single CP size for an entire training run.

  • Dynamic-CP targets LLM post-training and diffusion transformer (DiT) pre-training, where real datasets show long-tail sequence lengths that skew compute and memory.
  • NVIDIA reported up to 1.48× training speedup on real-world datasets by selecting a CP size that better matches each packed microbatch.
  • Even with sample-level packing, attention’s quadratic compute cost means “equal-length” packs can still create data-parallel imbalance, leaving some GPU ranks waiting at gradient synchronization.
  • Static CP sizing based on the longest sequence can force short sequences to shard unnecessarily, increasing attention communication cost; this overhead can surface when CP groups span InfiniBand domains and there is too little compute to overlap the communication.
  • Megatron Core’s Dynamic-CP approach relies on a solver that chooses packing and CP size without exceeding GPU memory limits, while avoiding the heavyweight reconfiguration that changing tensor- or pipeline-parallel sizes would require (the second sketch after this list illustrates a simplified version of this selection).
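
To make the packing imbalance concrete, here is a minimal Python sketch. It is not NVIDIA’s code, and the cost proxy and token counts are illustrative assumptions; it only shows that two packs with the same total token count can need very different amounts of attention compute, because per-sample attention cost grows roughly with the square of sequence length.

def attention_cost(pack: list[int]) -> int:
    """Proxy for attention compute: sum of squared sequence lengths in a pack."""
    return sum(length * length for length in pack)

# Two packs with the same total token count (8,192 tokens each).
pack_a = [8192]        # one long sequence
pack_b = [1024] * 8    # eight short sequences

print(attention_cost(pack_a))  # 67,108,864
print(attention_cost(pack_b))  # 8,388,608 -> roughly 8x less attention work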
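
The selection logic below is a hypothetical, simplified stand-in for the solver described above; the memory model, byte counts, and candidate CP sizes are assumptions, not Megatron Core’s actual algorithm. It picks the smallest CP size whose per-rank token share fits under a memory budget, so a short pack avoids needless sharding while a long pack still fits in memory.

def pick_cp_size(total_tokens: int,
                 bytes_per_token: int,
                 per_rank_budget_bytes: int,
                 candidate_cp_sizes=(1, 2, 4, 8)) -> int:
    """Return the smallest candidate CP size whose per-rank share fits the budget."""
    for cp in candidate_cp_sizes:
        tokens_per_rank = -(-total_tokens // cp)  # ceiling division
        if tokens_per_rank * bytes_per_token <= per_rank_budget_bytes:
            return cp
    return candidate_cp_sizes[-1]  # fall back to the largest CP size

# Illustrative numbers: 2 KiB of activation memory per token, 16 MiB budget per rank.
print(pick_cp_size(4_096, 2_048, 16 * 2**20))   # -> 1: a short pack stays unsharded
print(pick_cp_size(65_536, 2_048, 16 * 2**20))  # -> 8: a long pack shards eight ways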