MoonMath AI 開源 AMD MI300X 的 HIP 注意力核心，在每種形狀與捨入模式下皆勝過 AITER v3

2026年6月22日 07:13

重點摘要

MoonMath AI 團隊發布了一個針對 AMD MI300X GPU 的 bf16 前向注意力核心。該核心以 HIP 語言編寫，而非手寫組合語言。程式碼以 MIT 授權開源。MoonMath.ai 團隊報告稱，它在所有測試形狀上都超越了 AMD 自家優化的核心 AITER v3。裸機存取由 AMD 雲端供應商 HotAisle 提供。注意力機制是每個 transformer 內部的融合 softmax(QKᵀ/√d)·V 運算。MI300X 是 AMD 的 CDNA3 資料中心 GPU，ISA 目標為 gfx942。此核心僅在該硬體上運行。簡而言之，MoonMath.ai 開源了一個用於 AMD MI300X 的 bf16 前向注意力核心，以 HIP 而非組合語言編寫（MIT 授權），在每種形狀與捨入模式下都勝過 AMD 的 AITER v3——幾何平均提升 1.18×/1.15×/1.08×，最高達 1.26×。核心技巧：單指令的 asm 包裝器讓你能夠...

站內 AI 整理稿

MoonMath AI team has released a bf16 forward attention kernel for AMD’s MI300X GPU. It is written in HIP, not hand-written assembly. The code is open-source under the MIT license. The MoonMath.ai team reports it beats AITER v3, AMD’s own optimized kernel, on every tested shape. Bare-metal access came from HotAisle, an AMD cloud provider. Attention is the fused softmax(QKᵀ/√d)·V operation inside every transformer. The MI300X is AMD’s CDNA3 data-center GPU, with the ISA target (gfx942). This kernel runs on that hardware only. TL;DR MoonMath.ai open-sources a bf16 forward attention kernel for AMD MI300X, written in HIP, not assembly (MIT). It beats AMD’s AITER v3 on every shape and rounding mode — geomean 1.18×/1.15×/1.08×, up to 1.26×. The core trick: one-instruction asm wrappers let you pick the opcode while the compiler allocates registers. Most of the speedup is memory placement — K in LDS, V hot in L1, Q and accumulators in registers. A real SGLang PR used it to speed up Wan2.1 video diffusion by 1.23×, with no quality regression. Understanding Kernel A kernel is a small program that runs directly on the GPU’s many cores to perform one specific computation—here, the attention math—as fast as the hardware allows. The kernel computes forward attention in bf16 on MI300X only. It takes inputs in either BSHD or BHSD layout, with no transpose. Head dimension is fixed at 128. It supports any sequence length, including cross-attention. There are real limits. There is no causal mask, no GQA, and no varlen batching. Outputs are bf16, and it runs on gfx942 hardware exclusively. Numerics are tightly controlled. All three rounding modes match AITER’s per-mode rounding rule. Every finite output sits within 1 bf16 ULP of AITER. NaN and Inf handling is bit-identical, and results are deterministic. The Core Trick: One-Instruction asm Wrappers The core technique avoids a familiar dilemma. Compiler intrinsics keep code tidy but let the compiler reorder or rename operands. Raw inline assembly gives control but forces manual register and address management. MoonMath wraps exactly one instruction in a __device__ __forceinline__ function. Extended asm constraints describe the operands. The research team picks the opcode. The compiler still allocates registers and tracks data flow. Copy CodeCopiedUse a different Browser// in/out tied to the SAME VGPR → no accumulator rename, no v_mov copy. __device__ __forceinline__ void asm_mfma(bf16x4_t a, bf16x4_t b, fp32x4_t& c) { asm volatile("v_mfma_f32_16x16x16_bf16 %0, %1, %2, %0" : "+v"(c) : "v"(a), "v"(b)); } The "+v"(c) constraint ties the accumulator input and output to the same VGPR. No copy instruction is emitted. This keeps the kernel close to ordinary HIP. It still steers the machine one instruction at a time. The Architecture: Eight Waves, Two Groups, Two Barriers A CDNA3 compute unit has four SIMD units. The textbook block is four waves. MoonMath instead runs eight waves per block, in two groups of four. The two groups run the same Q*K, softmax, O += P*V sequence. They are offset by a phase. While one group saturates the matrix core, the other runs softmax and issues loads. Then they swap, so the matrix core never idles. There are two s_barriers per iteration. One sits at the phase handoff. One sits at the iteration boundary. Per-counter waits handle the rest of the synchronization. This echoes FlashAttention-3’s matmul and softmax alternation. It does not copy FA3’s producer and consumer warp split. On CDNA3, every memory move is already asynchronous, so a dedicated producer wave is unnecessary. Where Data Lives, and Why 16×16×16 Most of the speedup comes from memory placement. K streams from HBM into LDS, double-buffered, shared by all eight waves. V stays hot in L1, read on every PV matmul. Q and accumulators live in registers. The research team picked the 16×16×16 MFMA over 32×32×8. Both shapes have identical throughput. The smaller tile accumulates into 4 fp32 elements per lane, against 16. Lower accumulator pressure leaves room for deeper prefetch and a third Q tile. DecisionChoiceReasonWaves per block8 (two groups of 4)Plan the pipeline directly; share one K copyMFMA shape16×16×16 bf16Same throughput, lower VGPR pressure, better power efficiencyK placementLDS, double-buffered, 32 KiBShared by all 8 waves, swapped per iterationV placementL1, resident, prefetchedReread across PV, kept hot deliberatelyQ + accumulatorsVGPRsRead every iteration, never reloaded Two later wins close the gap. A third Q tile (3Q) raises data reuse per loaded K and V tile. A Flash-Decoding-style tail KV split rescues the stranded fractional round across MI300X’s 304 CUs. These wins cascade. Moving V to L1 freed the LDS that the third Q tile then fills. Benchmark Tests ran on MI300X in bf16, head dimension 128. Each shape was measured at three rounding modes. RTNE rounds to nearest even. RTNA rounds to nearest, ties away from zero. RTZ truncates toward zero. Shape (B, H, S, D)RoundOurs (ms)AITER v3 (ms)vs AITERvs MAX(2, 24, 8192, 128)RTNE3.0833.7921.23×1.37×(2, 24, 16384, 128)RTNE11.67014.6911.26×1.54×(4, 16, 16384, 128)RTZ15.05516.1831.07×1.47×(2, 24, 32768, 128)RTNA44.44052.3631.18×1.57×(1, 16, 131072, 128)RTNE232.517269.2781.16×1.46× Geomeans across the sweep favor MoonMath. Versus AITER, it scores 1.18× (RTNE), 1.15× (RTNA), and 1.08× (RTZ). Versus Modular MAX, geomeans run 1.44× to 1.49×, and per-shape speedups reach 1.59×. RTZ is AITER’s own fastest mode and the tightest race. The (4, 16, 16384) RTZ shape moved from 0.95× to 1.07×. The tail KV split is what closed that final gap. Interactive Explainer (function(){ window.addEventListener("message",function(e){ if(e&&e.data&&e.data.type==="mm-cdna3-height"){ var f=document.getElementById("mm-cdna3-frame"); if(f&&e.data.height){f.style.height=e.data.height+"px";} } }); })(); Use Cases The kernel installs with pip and exposes a small API. It launches on the caller’s stream, so it overlaps inside larger pipelines. Copy CodeCopiedUse a different Browserimport torch import moonmath_attention as ma # PyTorch's ROCm build uses the "cuda" device string on AMD GPUs q = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device="cuda") k = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device="cuda") v = torch.randn(2, 8192, 24, 128, dtype=torch.bfloat16, device="cuda") out = ma.forward(q, k, v, layout="bshd") out_rtz = ma.forward(q, k, v, layout="bshd", round_mode="rtz") One concrete use case is video diffusion. The team added LiteAttention support and sent a PR to SGLang diffusion. On Wan2.1-T2V-1.3B-Diffusers, they switched attention from AITER to liteattention_rocm. End-to-end generation improved by 1.23× on MI300X, with no visible quality regression. The BSHD layout suits diffusion tensors directly. Cross-attention works with any KV length and no padding. Key Takeaways The kernel is bf16 forward attention for MI300X, written in HIP under MIT. It beats AITER v3 on every shape and rounding mode, geomean 1.18×/1.15×/1.08×. One-instruction asm wrappers give opcode control while the compiler allocates registers. Memory placement drove most of the gain: K in LDS, V hot in L1, Q in registers. A real SGLang PR sped up Wan2.1 video diffusion by 1.23× with no quality regression. Check out the Technical details. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well. Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us The post MoonMath AI Open-Sources a HIP Attention Kernel for AMD MI300X That Beats AITER v3 on Every Shape and Rounding Mode appeared first on MarkTechPost.

原始來源：MarkTechPost AI ↗

查看原始來源

36氪產業與商業

騰訊51億捧出“中國AMD”，要上A股挑戰寒武紀了

這篇消息聚焦「騰訊51億捧出“中國AMD”，要上A股挑戰寒武紀了」。原始導語提到：下一個寒武紀？從 AI 情報角度來看，這類內容值得關注其背後的技術進展、產品落地、產業競爭與後續市場影響。

剛剛閱讀分析

全天候科技產業與商業

又一戶外品牌將在港交所上市

這篇消息聚焦「又一戶外品牌將在港交所上市」。原始導語提到：坦博爾通過聆訊。從 AI 情報角度來看，這類內容值得關注其背後的技術進展、產品落地、產業競爭與後續市場影響。

剛剛閱讀分析

鈦媒體產業與商業

豆包的下一步是拆分嗎？

{"id":"f916c4ed-2792-4d7e-8f34-63ef0f2d1b93","object":"response","model":"deepseek-v4-flash","output":[],"stop_reason":"max_output_tokens","usage":{"input_tokens":111,"output_tokens":200,"total_tokens":311}}

剛剛閱讀分析

36氪產業與商業

智譜到底值不值萬億市值

{"id":"83dda544-7a2e-498b-ab42-5cd4dd741d96","object":"response","model":"deepseek-v4-flash","output":[],"stop_reason":"max_output_tokens","usage":{"input_tokens":113,"output_tokens":200,"total_tokens":313}}

剛剛閱讀分析