Parallax: A Parameterized Local Linear Attention Mechanism That Keeps Softmax and Adds a Learned Covariance Correction Branch

The Transformer’s attention mechanism has barely changed since 2017. Most efficiency work has tried to replace softmax attention outright. A new paper takes a different route: it keeps softmax attention and bolts on a correction branch. A team of researchers from Northwestern University, Tilde Research, and the University of Washington introduces a parameterized Local Linear Attention called ‘Parallax’ that scales to LLM pretraining and is codesigned with Muon. Parallax does not chase efficiency by cutting compute. It adds compute deliberately, then makes that compute cheaper to run on modern GPUs.

What is Parallax
Parallax builds on Local Linear Attention (LLA). LLA comes from the test-time regression framework, which reads attention as a regression solver over key-value pairs. In this view, keys are training data points, values are labels, and the query is the test point. Softmax attention is a nonparametric estimator called Nadaraya-Watson, which fits a local constant function for each query. LLA upgrades that local constant estimate to a local linear estimate. The research team proves this yields strictly smaller integrated mean squared error. The benefit is a better bias-variance tradeoff for associative memory.

But LLA has a problem at scale. Its exact forward pass requires solving a linear system for every query, which LLA does with a parallel conjugate gradient (CG) solver. The CG solver creates three issues: intensive I/O, a hard regularization-expressiveness tradeoff, and low-precision incompatibility.

Parallax removes the solver. Instead, it learns an extra projection matrix. The research team writes this as ρ_i = W_R x_i, where W_R is a learnable matrix that probes the KV covariance directly from the layer input. So Parallax keeps the local linear principle and replaces the per-query solve with a learned, query-like projector. That makes it simpler, more efficient, and easier to implement.
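To make the regression view concrete, here is a minimal 1-D sketch, not the paper's implementation. It assumes scalar keys, a Gaussian kernel as a stand-in for softmax weights, and hypothetical helper names (`kernel_weights`, `nadaraya_watson`, `local_linear`, `nw_minus_correction`). It compares a local constant (Nadaraya-Watson) estimate with a local linear one at the edge of the data. It also checks, using standard weighted least squares algebra, that in this toy setting the local linear estimate equals the local constant estimate minus a weighted key-value covariance times a probe. Here the probe is computed in closed form; in higher dimensions that step would become a linear solve, which is the part Parallax replaces with a learned projection.

```python
import math

def kernel_weights(query, keys, bandwidth):
    """Normalized Gaussian kernel weights (assumed stand-in for softmax weights)."""
    raw = [math.exp(-((k - query) ** 2) / (2 * bandwidth ** 2)) for k in keys]
    total = sum(raw)
    return [r / total for r in raw]

def nadaraya_watson(query, keys, values, bandwidth):
    """Local constant estimate: a kernel-weighted average of the values."""
    w = kernel_weights(query, keys, bandwidth)
    return sum(wi * v for wi, v in zip(w, values))

def local_linear(query, keys, values, bandwidth):
    """Local linear estimate: intercept of a kernel-weighted line fit at the query."""
    w = kernel_weights(query, keys, bandwidth)
    d = [k - query for k in keys]
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    t0 = sum(wi * v for wi, v in zip(w, values))
    t1 = sum(wi * di * v for wi, di, v in zip(w, d, values))
    # Weights sum to one, so the 2x2 normal equations have determinant s2 - s1^2.
    return (s2 * t0 - s1 * t1) / (s2 - s1 * s1)

def nw_minus_correction(query, keys, values, bandwidth):
    """Same estimate, written as local constant output minus covariance * probe."""
    w = kernel_weights(query, keys, bandwidth)
    d = [k - query for k in keys]
    mean_d = sum(wi * di for wi, di in zip(w, d))
    mean_v = sum(wi * v for wi, v in zip(w, values))
    cov_vd = sum(wi * (v - mean_v) * (di - mean_d) for wi, v, di in zip(w, values, d))
    var_d = sum(wi * (di - mean_d) ** 2 for wi, di in zip(w, d))
    probe = mean_d / var_d  # closed form in 1-D; a linear solve in general
    return mean_v - cov_vd * probe

keys = [i / 10 for i in range(11)]      # training inputs on [0, 1]
values = [2 * k + 1 for k in keys]      # labels from an exactly linear function
query, bandwidth = 0.0, 0.2             # query sits on the boundary of the data

nw = nadaraya_watson(query, keys, values, bandwidth)
ll = local_linear(query, keys, values, bandwidth)
corrected = nw_minus_correction(query, keys, values, bandwidth)
print(nw, ll, corrected)
```

On this exactly linear data the local linear estimate recovers the true value at the boundary, while the local constant estimate is pulled toward the interior of the data.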
How the Mechanism Works
Parallax reformulates LLA as softmax attention plus an additive correction. The output equals the softmax attention output minus a projected covariance term. In the research paper’s notation, that term is the KV covariance multiplied by the learned probe ρ_i.

The research team also sets one piece of LLA, the boundary amplification factor, to zero, which effectively drops it. This is necessary for stability. Once the probe is parametric, the original geometric interpretation breaks, and leaving the factor in could cause the scaling to diverge or flip sign.

Parallax sits inside a family of attention mechanisms. The paper organizes them along three axes: the bandwidth, the probe construction, and the affine structure. At one extreme, Parallax degenerates exactly to softmax attention when the probe norm goes to zero. Setting W_R = 0 makes a Parallax layer behave identically to softmax attention. So a pretrained Transformer checkpoint can be converted by adding W_R and fine-tuning.

The Hardware Argument
Parallax inherits the streaming structure of FlashAttention. It adds one covariance branch that reuses the same key-value stream. The research team expands the forward pass into two parallel scoring branches. Both branches share the online maximum, the rescaling factor, and the K and V tiles. So Parallax needs no extra I/O per iteration.

The key property is higher arithmetic intensity (AI), the ratio of floating point operations to high-bandwidth memory traffic. In the regime where KV work dominates, Parallax roughly doubles the arithmetic intensity: it adds compute while reusing the same memory stream. This shifts attention toward a more compute-bound regime, which is exactly the regime where kernel optimization helps on modern hardware.

The research team prototyped a decode kernel in CuTeDSL on NVIDIA Hopper GPUs. Hopper’s tensor core matmul instructions operate on tiles of at least 64 rows. A decode step supplies only one query row.
So the QK and RK products can be computed jointly, within instructions that standard attention already issues. The team profiled the kernel against FlashAttention 2 and 3 on H200 GPUs at BF16 precision, sweeping batch sizes from 1 to 2,048 and context lengths from 128 to 32,768. The prototype kernel matches or outperforms FlashAttention across all configurations. A figure in the paper (https://arxiv.org/pdf/2605.29157) annotates speedups of 1.54× in the compute-matched setting and 1.14× in the I/O-matched setting.

What the Experiments Show
The research team validated Parallax on synthetic tasks and on LLM pretraining at 0.6B and 1.7B scales. Models used the Qwen-3 architecture in the torchtitan repository. They were trained on the Ultra-FineWeb dataset with a 4096 context length. Baselines included softmax attention (Transformer), Mamba, Gated DeltaNet, MesaNet, and Kimi DeltaAttention.

On the MAD-Benchmark, Parallax attained the highest overall accuracy, with an average of 0.716. It consistently improved recall-oriented tasks like In-Context-Recall and Selective-Copying, and it stayed competitive on compression and memorization tasks.

On language modeling, Parallax with Muon achieved the best perplexity at both scales. It also posted the highest average downstream accuracy. At 1.7B, Parallax scored 62.45 on average against the Transformer’s 61.43.

Two controls test where the gain comes from. A parameter-matched Transformer closed only a small fraction of the gap. A compute-matched Parallax still beat both baselines. The paper argues this points to the mechanism itself, not extra parameters or compute.

The Optimizer Twist
A core finding is an optimizer-architecture interaction. Parallax shows a large advantage under Muon. Under AdamW, the advantage shrinks markedly or even disappears.

Muon is a recent optimizer for matrix parameters in hidden layers. It uses the polar factor of the momentum buffer, so updates have condition number exactly one. Prior work shows this produces better-conditioned weight matrices.
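The claim that a polar-factor update has condition number one can be checked on a small example. The sketch below is illustrative only: it uses the textbook cubic Newton-Schulz iteration on a tiny, hypothetical 2×2 "momentum" matrix in pure Python, and it is not Muon's actual code or its tuned iteration coefficients. The helper names (`matmul`, `transpose`, `polar_factor`) are assumptions made for this example. If the result U satisfies UᵀU = I, all of its singular values equal one, so its condition number is one.

```python
import math

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def polar_factor(m, steps=30):
    """Approximate the orthogonal polar factor of a full-rank square matrix.

    Uses the cubic Newton-Schulz update X <- 1.5 X - 0.5 X X^T X, which pushes
    every singular value toward 1 once they have been scaled into (0, sqrt(3)).
    """
    fro = math.sqrt(sum(v * v for row in m for v in row))
    x = [[v / fro for v in row] for row in m]  # singular values now at most 1
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(row_x, row_c)]
             for row_x, row_c in zip(x, xxtx)]
    return x

# Hypothetical, badly conditioned momentum buffer (singular values ~3.3 and ~0.12).
momentum = [[3.0, 1.0], [1.0, 0.2]]
update = polar_factor(momentum)

# For an orthogonal update, the Gram matrix U^T U is the identity.
gram = matmul(transpose(update), update)
print(gram)
```

The input matrix has very unequal singular values, but the resulting update treats every direction with the same scale, which is the property the article attributes to Muon's updates.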
The paper traces the gap to the correction branch. The authors define a correction-to-output ratio (COR). Under Muon, COR exceeds 8 in the deepest layers. Under AdamW, it stays below 4. The W_R projection is disproportionately affected: its stable rank collapses under AdamW but stays high under Muon. A gating experiment confirms the pattern. Under AdamW, the model learns to suppress the correction branch rather than use it.

The research team calls this the first empirical demonstration of strong architecture-optimizer codesign for attention mechanisms. They do not claim that Muon with a WSD (warmup-stable-decay) schedule is the optimal recipe. An appendix ablation shows the advantage shrinks during the decay phase.

How the Scores Differ
Parallax also produces different score distributions from softmax attention. Its per-token weights can take negative values and exceed one in magnitude, which standard softmax weights cannot do. The research team reports three effects. Parallax can actively subtract value components from irrelevant tokens. It substantially reduces the attention sink on the first token. Its base softmax entropy stays higher, giving more diffuse attention weights.

Strengths, Weaknesses, and Open Questions
Strengths
- Keeps softmax attention intact, so a pretrained Transformer can be converted by adding W_R and fine-tuning.
- Adds no extra I/O per iteration by reusing the FlashAttention key-value stream.
- Roughly doubles arithmetic intensity, with a prototype kernel matching or beating FlashAttention 2/3 in decode.
- Shows consistent perplexity and downstream gains under parameter-matched and compute-matched controls.

Weaknesses and Open Questions
- Gains depend heavily on Muon; under AdamW the advantage largely disappears.
- The precise cause of the optimizer dependence remains an open question.
- Results stop at 1.7B scale, without MoE, longer context, or larger runs.
- The advantage erodes during the WSD decay phase, only partially fixed by weight decay annealing.
Key Takeaways
- Parallax keeps softmax attention and adds a learned covariance correction branch, replacing LLA’s per-query conjugate gradient solver
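The takeaway above can be sketched for a single query as follows. This is a toy under stated assumptions, not the paper's kernel or exact formulation: the covariance is taken to be the softmax-weighted, mean-centered KV covariance, there is no masking or multi-head structure, and the names `parallax_output`, `W_Q`, `W_R`, and all numeric values are hypothetical. It illustrates two properties described in the article: a zero W_R reproduces softmax attention, and a nonzero probe can yield negative effective per-token weights.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def parallax_output(q, rho, keys, values):
    """Softmax output minus (softmax-weighted KV covariance) applied to the probe."""
    scale = 1.0 / math.sqrt(len(q))
    w = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
    dk, dv = len(keys[0]), len(values[0])
    k_bar = [sum(wj * k[c] for wj, k in zip(w, keys)) for c in range(dk)]
    v_bar = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(dv)]
    # Projection of each centered key onto the probe: (k_j - k_bar) . rho
    proj = [sum((k[c] - k_bar[c]) * rho[c] for c in range(dk)) for k in keys]
    # Cov(V, K) rho = sum_j w_j (v_j - v_bar) proj_j, and sum_j w_j proj_j = 0,
    # so the output is a reweighting of the values with weights w_j (1 - proj_j).
    eff = [wj * (1.0 - pj) for wj, pj in zip(w, proj)]
    out = [sum(e * v[c] for e, v in zip(eff, values)) for c in range(dv)]
    return out, v_bar, eff

keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
x = [0.5, -1.0]                      # layer input for the current token
W_Q = [[1.0, 0.0], [0.0, 1.0]]       # hypothetical query projection
W_R = [[2.0, 0.0], [0.0, 2.0]]       # hypothetical learned probe projection
W_ZERO = [[0.0, 0.0], [0.0, 0.0]]

q = matvec(W_Q, x)
out, v_bar, eff = parallax_output(q, matvec(W_R, x), keys, values)
out0, v_bar0, eff0 = parallax_output(q, matvec(W_ZERO, x), keys, values)
print(out, eff)
print(out0, v_bar0)
```

With W_R set to zero the output coincides with the plain softmax average `v_bar0`. With the nonzero probe, the effective weights still sum to one but one of them is negative, so the layer subtracts part of that token's value.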