Meet mKernel: A Multi-GPU, Multi-Node Fused Kernel Library for GPU-Driven Communication

May 29, 2026, 08:43

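Key Takeaways

GPU communication overhead is a clear bottleneck for AI workloads in production. According to data cited by the mKernel project, communication accounts for 43.6% of the forward pass and 32% of end-to-end training time. In popular Mixture-of-Experts (MoE) models, inter-device communication can account for up to 47% of total execution time. Researchers from UC Berkeley's UCCL project have released mKernel, a library of persistent CUDA kernels that fuses intra-node NVLink communication, inter-node RDMA, and compute into a single kernel. The problem it targets is host-driven communication: in the current standard model for multi-GPU communication, the CPU runs the control path and calls into a library such as NCCL or NVSHMEM, which then issues collective operations such as AllReduce and AllGather.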

AI-Compiled Article

GPU communication overhead is a measurable bottleneck in production AI workloads. According to data cited by the mKernel project, communication can consume 43.6% of the forward pass and 32% of end-to-end training time. Across popular Mixture-of-Experts (MoE) models, inter-device communication can account for up to 47% of total execution time. Researchers from UC Berkeley’s UCCL project have released mKernel, a library of persistent CUDA kernels that fuse intra-node NVLink communication, inter-node RDMA, and compute into a single kernel.

The Problem: Host-Driven Communication

The standard model for multi-GPU communication is host-driven: the CPU runs the control path and calls into a library like NCCL or NVSHMEM. The library issues the collective operation (an AllReduce, an AllGather, etc.) across GPUs. Compute and communication run on separate CUDA streams and overlap at kernel boundaries.

The research team identifies two problems with this approach:

(1) CPUs are not scaling with GPU compute. A GB300 NVL72 rack integrates 72 Blackwell Ultra GPUs and 36 Grace CPUs, delivering 720 PFLOP/s FP8/FP6, 1.44 EFLOP/s FP4 Tensor Core performance, and 130 TB/s of all-to-all intra-rack NVLink bandwidth. At those speeds, microsecond-scale host orchestration overhead (a cudaLaunchKernel call, a CPU-side “all writes done” check, an inter-stream event) shows up directly as pipeline bubbles.

(2) Host-driven systems overlap compute and communication at coarse kernel boundaries. Finer-grained overlap at the tile or chunk level is not possible from the host side.

The alternative is GPU-driven communication: the GPU itself triggers transfers, with communication fused into the same kernel as the compute. Most existing fused kernel libraries operate within a single node, or a single GPU. mKernel targets the multi-node case.

What mKernel Does

mKernel is a library of persistent CUDA kernels.
Each kernel fuses intra-node NVLink communication, inter-node RDMA, and dense compute into a single kernel.

Multi-GPU + multi-node, in one kernel: Both intra-node NVLink and inter-node RDMA live inside the same persistent kernel.

Fine-grained intra-kernel overlap: Compute and communication overlap at tile/chunk granularity, covering both intra-node and inter-node GPU communication.

Persistent kernel with SM specialization: CTAs self-assign roles (compute, intra-comm, inter-send, inter-reduce). The number of SMs dedicated to each role is tunable per shape.

GPU-driven networking built on libibverbs: mKernel uses GPU-initiated RDMA writes without depending on NCCL or NVSHMEM. The communication backend is written from scratch to maximize performance and support heterogeneous networking devices.

The Five Fused Kernels

| Kernel | What it fuses | Description |
| AllGather + GEMM | AllGather → GEMM | Each rank holds a shard of A. While ranks gather peers’ shards over NVLink/RDMA, the local GEMM consumes tiles as soon as they arrive. |
| GEMM + AllReduce | GEMM → AllReduce | Computes C = A @ B and reduces partial outputs across all ranks in one launch. Output tiles are pushed into the reduction tree the instant they’re produced. |
| MoE Dispatch + GEMM | All-to-All dispatch → grouped GEMM | Routes MoE tokens to their expert ranks (intra-node NVLink + inter-node all-to-all) and runs the per-expert grouped GEMM in the same kernel. Tokens are processed as soon as they land, with no staging buffer round-trip. |
| Ring Attention | Ring KV exchange → FlashAttention | Sequence-parallel attention across ranks. Each step rotates a KV chunk around the ring while the local FlashAttention consumes the previously-received chunk. Compute and the ring send/recv run concurrently inside a single persistent kernel. |
| GEMM + ReduceScatter | GEMM → ReduceScatter | Computes C = A @ B and reduce-scatters the output. Each output tile is reduced and forwarded to its owning rank as soon as it is produced. |
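The SM specialization idea, in which each CTA picks one of the four roles and the per-role count is tuned per shape, can be modeled on the host in a few lines of Python. This is only a sketch under stated assumptions: the contiguous index-range scheme, the `assign_role` helper, and the 100/16/8/8 budget are hypothetical illustrations, not mKernel's actual code or tuning values. Only the four role names come from the article.

```python
from collections import Counter

# Role names as listed in the article. The partitioning scheme below is an
# illustrative assumption, not mKernel's actual implementation.
ROLES = ("compute", "intra-comm", "inter-send", "inter-reduce")


def assign_role(cta_id: int, budget: dict[str, int]) -> str:
    """Map a CTA index to a role using contiguous index ranges."""
    if cta_id < 0:
        raise ValueError("cta_id must be non-negative")
    upper = 0
    for role in ROLES:
        upper += budget[role]
        if cta_id < upper:
            return role
    raise ValueError(f"cta_id {cta_id} exceeds the {upper} budgeted CTAs")


# Hypothetical per-shape budget: 132 CTAs in total (one CTA per SM is assumed).
budget = {"compute": 100, "intra-comm": 16, "inter-send": 8, "inter-reduce": 8}
total_ctas = sum(budget.values())

# Every CTA derives its own role from its index; no host coordination needed.
roles = [assign_role(cta, budget) for cta in range(total_ctas)]
counts = Counter(roles)
for role in ROLES:
    print(f"{role:>12}: {counts[role]} CTAs")
```

In a real persistent kernel the same decision would presumably be made on the device from the block index, so that retuning for a new problem shape only means changing the budget rather than the launch structure.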
Evaluation Setup

The research team evaluated mKernel on two 2-node × 8-H200 clusters that differ only in their inter-node fabric:

| Testbed | Nodes × GPUs | Intra-node | Inter-node transport | NIC |
| AWS EFA | 2 × 8 H200 | NVLink | AWS EFA / SRD | 16 × 200 Gb/s EFA per node |
| ConnectX-7 | 2 × 8 H200 | NVLink | InfiniBand | 8 × 400 Gb/s NVIDIA ConnectX-7 per node |

mKernel was benchmarked against NCCL, Triton-distributed, Flux, Mercury, MagiAttention, Transformer-Engine, and ring-flash-attention. The team notes that further benchmarking at larger scale is still in progress.

Backends and Requirements

mKernel supports two networking backends:

| Backend | Macro | Transport | Where it runs |
| CX7 | -DINTERNODE_BACKEND_IBVERBS | libibverbs RC | ConnectX-7 / InfiniBand / RoCE |
| EFA | -DINTERNODE_BACKEND_EFA | libibverbs + efadv (SRD) | AWS p5/p5e (H200, EFA) |

Both backends share the same host-side API and the same on-GPU kernel. Only the proxy/session implementation differs (session.h for CX7, session_efa.h for EFA).

Requirements: NVIDIA Hopper GPUs (default build targets sm_90a), CUDA 12.9, Python with PyTorch. The CX7 backend requires libibverbs development headers and libraries. The EFA backend requires AWS EFA installation with libfabric, libibverbs, efadv, and EFA headers under EFA_HOME=/opt/amazon/efa by default.
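The NIC figures in the testbed table can be turned into a quick back-of-the-envelope comparison. The sketch below multiplies NIC count by per-NIC line rate to get nominal inter-node bandwidth per node and per GPU. The helper name `node_line_rate_gbps` is hypothetical, and the results are nominal line rates derived from the table, not measured throughput.

```python
# NIC counts and per-NIC line rates copied from the testbed table.
TESTBEDS = {
    "AWS EFA": {"nics_per_node": 16, "gbps_per_nic": 200},
    "ConnectX-7": {"nics_per_node": 8, "gbps_per_nic": 400},
}
GPUS_PER_NODE = 8  # both testbeds use 8 H200 GPUs per node


def node_line_rate_gbps(nics_per_node: int, gbps_per_nic: int) -> int:
    """Nominal aggregate inter-node line rate of one node, in Gb/s."""
    return nics_per_node * gbps_per_nic


summary = {}
for name, cfg in TESTBEDS.items():
    per_node = node_line_rate_gbps(**cfg)
    summary[name] = {
        "per_node_gbps": per_node,
        "per_node_gbytes": per_node / 8,  # 8 bits per byte
        "per_gpu_gbps": per_node / GPUS_PER_NODE,
    }
    print(
        f"{name}: {per_node} Gb/s per node "
        f"({per_node / 8:.0f} GB/s), "
        f"{per_node / GPUS_PER_NODE:.0f} Gb/s per GPU"
    )
```

Both configurations work out to the same nominal 3200 Gb/s per node, which is consistent with the statement that the two clusters differ only in their inter-node fabric rather than in raw NIC capacity.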

Original source: MarkTechPost AI
