
NVIDIA outlines Hybrid-EP to push MoE all-to-all closer to hardware limits

NVIDIA introduced Hybrid-EP, a MoE expert-parallel communication approach aimed at reducing all-to-all overhead by streaming token dispatch/combine across NVLink and RDMA networks with low SM usage.

NVIDIA described Hybrid-EP, a hybrid expert-parallel communication path designed to reduce all-to-all bottlenecks when training hyperscale mixture-of-experts (MoE) models.

  • The post frames expert parallelism (EP) as an all-to-all communication pattern made harder by sparse routing (each token is sent to its top-k experts) and notes that, in DeepSeek-V3-style MoE training, EP communication can exceed 50% of step time without targeted optimization; a minimal routing sketch follows this list.
  • Hybrid-EP uses hierarchical transport (intra-node NVLink plus inter-node RDMA) and a streaming pipeline that separates token “dispatch” and “combine” work into different warp groups to mask latency; the second sketch after this list schematizes that chunked overlap.
  • It advertises native support for FP8 and BF16 data paths and aims to overlap communication with computation rather than running them as separate phases.
  • NVIDIA reports validation via Megatron Core and benchmarks across DeepSeek-V3, Megatron-FSDP, and Qwen 3 235B, including a claimed 514% throughput uplift over prior approaches.
  • In the same results, NVIDIA states Hybrid-EP can saturate network bandwidth using 416 streaming multiprocessors (SMs), leaving more GPU capacity for model compute.
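
To make the first bullet concrete, here is a minimal sketch (PyTorch assumed, not NVIDIA's code) of why top-k routing turns expert parallelism into an all-to-all exchange: each token picks k experts that may live on other EP ranks, so per-rank dispatch volume is just a histogram of (token, expert) pairs by destination rank. All shapes, the tiny router, and the two-rank layout are illustrative assumptions.

```python
import torch

num_tokens, hidden, num_experts, top_k = 8, 16, 4, 2
ep_world_size = 2                            # assume experts sharded across 2 EP ranks
experts_per_rank = num_experts // ep_world_size

x = torch.randn(num_tokens, hidden)
router = torch.nn.Linear(hidden, num_experts, bias=False)

# Each token picks its top-k experts; those experts may live on other ranks,
# so every rank must exchange token activations with every other rank.
scores = router(x).softmax(dim=-1)
topk_scores, topk_experts = scores.topk(top_k, dim=-1)   # [num_tokens, top_k]
# (topk_scores would later weight the combine step; unused in this sketch)

# Dispatch volume toward each EP rank = histogram of (token, expert) pairs
# whose chosen expert is hosted on that rank.
dest_rank = topk_experts // experts_per_rank
send_counts = torch.bincount(dest_rank.flatten(), minlength=ep_world_size)
print(send_counts)   # num_tokens * top_k routed pairs, split across the EP ranks
```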
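The second and third bullets describe a streaming dispatch/compute/combine pipeline with reduced-precision payloads. The sketch below is only a schematic of that idea, not the Hybrid-EP kernels: the helper names (dispatch, expert_compute, combine), the chunk count, and the bf16 cast are all hypothetical, and the sequential loop merely stands in for the overlap that dedicated warp groups or CUDA streams would provide.

```python
import torch
import torch.nn.functional as F

def dispatch(chunk):
    # stand-in for the NVLink/RDMA all-to-all send; cast to bf16 only to
    # illustrate a reduced-precision communication path
    return chunk.to(torch.bfloat16)

def expert_compute(chunk):
    # stand-in for the expert MLP applied to the tokens this rank received
    return F.gelu(chunk.float())

def combine(chunk):
    # stand-in for the all-to-all that returns expert outputs to token owners
    return chunk

tokens = torch.randn(32, 16)
chunks = tokens.chunk(4)                 # stream the batch through in 4 chunks
in_flight, outputs = None, []
for chunk in chunks:
    sent = dispatch(chunk)               # kick off this chunk's communication
    if in_flight is not None:
        # ...while the previous chunk, whose data has already "arrived",
        # goes through expert compute and combine
        outputs.append(combine(expert_compute(in_flight)))
    in_flight = sent
outputs.append(combine(expert_compute(in_flight)))   # drain the final chunk
result = torch.cat(outputs)
print(result.shape)                      # torch.Size([32, 16])
```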