MarkTechPost · AI Research and Frontiers

Introducing Flash-KMeans: An IO-Aware Exact K-Means That Runs Over 200× Faster Than FAISS on GPUs

June 15, 2026, 09:16

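Key Takeaways

For decades, k-means has been an offline tool: run it once to preprocess data, then move on. A research team from UC Berkeley and UT Austin has released Flash-KMeans, a new open-source library aimed at a different use case. Modern AI pipelines are starting to call k-means inside training and inference loops, and at that frequency the latency of each call matters more than the theoretical FLOP count. Flash-KMeans is an IO-aware implementation of standard Lloyd's k-means: it does not change the math or approximate, and only reorganizes how the algorithm moves data on the GPU. On an NVIDIA H200, the team reports end-to-end speedups of up to 17.9× over the best baseline, 33× over NVIDIA cuML, and more than 200× over FAISS.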

Full Article (AI-compiled on site)

k-means has been an offline tool for decades. You run it once to preprocess data, then move on. A team of researchers from UC Berkeley and UT Austin released Flash-KMeans, a new open-source library that targets a different setting. Modern AI pipelines now call k-means inside training and inference loops. At that frequency, latency per call matters more than theoretical FLOPs.

Flash-KMeans is an IO-aware implementation of standard Lloyd’s k-means. It does not change the math, and it does not approximate. It only restructures how the algorithm moves data on a GPU. On an NVIDIA H200, the research team reported up to 17.9× end-to-end speedup over the best baseline. Against NVIDIA cuML they report 33×. Against FAISS they report over 200×.

What is Flash-KMeans

Flash-KMeans is a batched k-means library written in Triton GPU kernels. It ships under Apache 2.0 and installs with pip install flash-kmeans. The output is mathematically identical to standard Lloyd’s k-means. The speedup comes from kernel-level dataflow, not from skipping work. That separates it from algorithmic methods like triangle-inequality pruning or coreset sampling.

A standard Lloyd iteration has two stages. The assignment stage computes each point’s distance to every centroid, then picks the nearest. The update stage averages the points in each cluster to form new centroids. Both stages are simple arithmetic. On GPUs, both are bottlenecked by memory, not compute.

The Two Bottlenecks It Attacks

The first bottleneck is the assignment stage. Standard code builds a full distance matrix D of shape N×K in High Bandwidth Memory (HBM). It writes the matrix, then reads it back to run argmin. For N=65536 points, K=1024 centroids, dimension d=128, and batch size B=32, the distance math takes 2.6ms. Writing and consuming D takes about 23ms. The matrix is the cost, not the arithmetic.

Flash-KMeans replaces this with FlashAssign. The design borrows from FlashAttention. FlashAssign streams tiles of points and centroids from HBM into on-chip SRAM.
It fuses distance computation with an online argmin, so the full N×K matrix is never materialized. This cuts the dominant IO complexity from O(NK) to O(Nd + Kd). At the kernel level, FlashAssign reaches up to 21.2× speedup. In one case it cut assignment from 122.5ms to 5.8ms.

The second bottleneck is the centroid update stage. Standard code uses scatter-style atomic adds. Each thread adds its point into a shared sum buffer keyed by cluster id. Many threads hit the same ‘hot’ cluster at once. That causes atomic contention and hardware serialization. The research team measured only 50 GB/s effective bandwidth here on an H200.

Flash-KMeans replaces this with Sort-Inverse Update. It sorts the 1D assignment vector by cluster id using argsort. Identical cluster ids then form contiguous segments. Each thread block reduces a segment on-chip, then issues one atomic add per segment. The heavy point matrix is never physically permuted. Atomic operations drop from O(Nd), one per point per dimension, to O((K + N/B_N)d). The kernel reaches up to 6.3× speedup.

Benchmark

The research team tested it on an H200 with CUDA 12.8, FP16 data, and d=128. They swept N, K, and batch size B, and compared against four optimized baselines: fast_pytorch_kmeans, fastkmeans, cuML, and FAISS.

| Comparison | Reported speedup | Workload context |
| --- | --- | --- |
| End-to-end vs best baseline | up to 17.9× | N=8M, K=1024 (large N, small K) |
| vs NVIDIA cuML | 33× | industry library |
| vs FAISS | over 200× | industry library |
| FlashAssign kernel | up to 21.2× | N=1M, K=8192 (assignment) |
| Sort-Inverse Update kernel | up to 6.3× | N=33M, K=4096 (update) |
| Out-of-core, large scale | up to 10.5× | N=400M, K=16384 vs fastkmeans |

One failure mode matters for context. Standard PyTorch implementations run out of memory in large-K regimes because they cannot materialize the N×K matrix. FAISS is the industry-standard library under many production vector-search systems.

Flash-KMeans also runs out-of-core. On one billion points (K=32768, d=128), it finishes an iteration in 41.4s, against 261.8s for the baseline.
It uses chunked stream overlap to hide PCIe transfer behind compute. A cache-aware compile heuristic also cuts tuning overhead by up to 175×, within 0.3% of tuned speed.
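
The FlashAssign idea described in the article (stream tiles, fuse the distance computation with an online argmin, never store the N×K matrix) can be illustrated with a small CPU-only sketch. This is not the library's Triton kernel: the function names `assign_materialized` and `assign_tiled` and the tile sizes are hypothetical, chosen only to show that a running (best distance, best index) pair per point gives the same assignments as the materialized matrix.

```python
import random

def sq_dist(x, c):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def assign_materialized(points, centroids):
    """Baseline: build the full N x K distance matrix, then argmin each row."""
    D = [[sq_dist(x, c) for c in centroids] for x in points]
    return [min(range(len(row)), key=row.__getitem__) for row in D]

def assign_tiled(points, centroids, tile_n=4, tile_k=3):
    """Tiled variant: only O(N) running state is kept, never an N x K matrix."""
    n, k = len(points), len(centroids)
    best_dist = [float("inf")] * n
    best_idx = [-1] * n
    for i0 in range(0, n, tile_n):              # tile of points
        for j0 in range(0, k, tile_k):          # tile of centroids
            for i in range(i0, min(i0 + tile_n, n)):
                for j in range(j0, min(j0 + tile_k, k)):
                    d = sq_dist(points[i], centroids[j])
                    if d < best_dist[i]:        # online argmin update
                        best_dist[i] = d
                        best_idx[i] = j
    return best_idx

# Hypothetical toy data: 50 points, 7 centroids, 8 dimensions.
rng = random.Random(0)
points = [[rng.random() for _ in range(8)] for _ in range(50)]
centroids = [[rng.random() for _ in range(8)] for _ in range(7)]
assert assign_tiled(points, centroids) == assign_materialized(points, centroids)
```

On a GPU the inner two loops would correspond to work done on a tile held in on-chip memory; the sketch only mirrors the dataflow, not the performance.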
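
The Sort-Inverse Update described in the article can be sketched the same way on the CPU. The names `update_scatter` and `update_sort_inverse`, the `block` parameter, and the `merges` counter are hypothetical; treating `block` as the number of sorted ids handled by one thread block (the article's B_N) is an assumption, and the real kernel may partition work differently. The counter stands in for the number of atomic adds a GPU version would issue.

```python
import random

def update_scatter(points, assign, k):
    """Baseline: every point accumulates directly into the shared buffer."""
    d = len(points[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for x, c in zip(points, assign):
        counts[c] += 1
        for t in range(d):
            sums[c][t] += x[t]
    return sums, counts

def update_sort_inverse(points, assign, k, block=8):
    """Sort ids, reduce each contiguous segment locally, merge once per segment."""
    d = len(points[0])
    # argsort of the 1D assignment vector; the point matrix itself is not permuted
    order = sorted(range(len(assign)), key=assign.__getitem__)
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    merges = 0                                   # proxy for atomic adds issued
    for b0 in range(0, len(order), block):       # one chunk per "thread block"
        chunk = order[b0:b0 + block]
        s = 0
        while s < len(chunk):
            c = assign[chunk[s]]
            local = [0.0] * d
            e = s
            while e < len(chunk) and assign[chunk[e]] == c:
                x = points[chunk[e]]             # gather through the sorted index
                for t in range(d):
                    local[t] += x[t]
                e += 1
            for t in range(d):
                sums[c][t] += local[t]           # single merge for the segment
            counts[c] += e - s
            merges += 1
            s = e
    return sums, counts, merges

# Hypothetical toy data; integer-valued coordinates keep the float sums exact.
rng = random.Random(1)
points = [[float(rng.randint(0, 9)) for _ in range(3)] for _ in range(40)]
assign = [rng.randrange(5) for _ in range(40)]
ref_sums, ref_counts = update_scatter(points, assign, 5)
sums, counts, merges = update_sort_inverse(points, assign, 5, block=8)
assert sums == ref_sums and counts == ref_counts
assert merges <= 5 + 5    # at most K + ceil(N / block) segment merges
```

The baseline performs one accumulation per point, while the sorted variant performs at most K + ceil(N / block) merges, which matches the shape of the O((K + N/B_N)d) bound quoted in the article.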
