MarkTechPost AI · Generative AI

Perplexity AI Open-Sources a Unigram Tokenizer with 5x Lower p50 Latency than the Hugging Face tokenizers Crate

May 28, 2026, 09:08

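Key Points

Perplexity AI's research team reimplemented its Unigram tokenizer from scratch in Rust and open-sourced it in pplx-garden, its inference technology repository. At production input lengths, the new encoder's p50 latency is roughly 5x lower than the Hugging Face tokenizers crate, roughly 2x lower than SentencePiece (C++), and roughly 1.5x lower than IREE's tokenizer (C), with zero steady-state heap allocations. In production, it reduced CPU utilization in Perplexity's inference stack by 5-6x and cut reranker latency by double-digit milliseconds. Why did tokenization become a bottleneck? LLM inference cost is usually discussed in terms of GPU work (KV caches, attention kernels, expert routing). Smaller models, such as embedding models, classifiers, and rerankers, are a different case. These models are two…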

On-site AI-compiled summary

Perplexity AI’s research team reimplemented their Unigram tokenizer from scratch in Rust and open-sourced the code in pplx-garden, their inference technology repository. At production input lengths, the new encoder cuts p50 latency by roughly 5x versus the Hugging Face tokenizers crate, ~2x versus SentencePiece (C++), and ~1.5x versus IREE’s tokenizer (C), with zero steady-state heap allocations. In production, it reduced CPU utilization in Perplexity’s inference stack by 5-6x and shaved double-digit milliseconds off reranker latency.

Why Tokenization Became a Bottleneck

LLM inference cost is typically framed around GPU work: KV caches, attention kernels, expert routing. But smaller models, such as embedding models, classifiers, and rerankers, tell a different story. These models are two to three orders of magnitude smaller than frontier transformers. A reranker scoring hundreds of candidate documents per request is a clear example. With a small model, GPU compute often finishes in single-digit milliseconds. Every input still passes through CPU-side tokenization first. When batch sizes are large, tokenization becomes a meaningful fraction of total request latency.

Perplexity’s work targets XLM-RoBERTa, a model with a 250K-token Unigram vocabulary trained with SentencePiece. Fine-tuned RoBERTa-family encoders are a common production choice for ranking, retrieval, and similarity tasks.

What Is Unigram Tokenization?

Unigram tokenization was introduced by Kudo in 2018 and is implemented in SentencePiece. It frames segmentation as a most-probable-path problem. Each vocabulary token has a learned log-probability. The tokenizer picks the segmentation whose token scores sum to the highest value. The algorithm used to find that best path is the Viterbi algorithm, a dynamic programming technique from 1967. Byte positions form graph layers and vocabulary tokens are edges spanning a contiguous byte range.
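
The most-probable-path framing can be written out explicitly. The notation here is an illustrative assumption and does not come from the source: x is the input of n bytes, V is the vocabulary, s(t) is the learned log-probability of token t, and B(j) is the best total score over all segmentations of the first j bytes.

```latex
% Objective: the segmentation of x into vocabulary tokens with the highest total score
\hat{t}_{1:k} \;=\; \arg\max_{\substack{t_1 t_2 \cdots t_k = x \\ t_i \in V}} \; \sum_{i=1}^{k} s(t_i)

% Viterbi recurrence over byte positions
B(0) = 0, \qquad
B(j) \;=\; \max_{\substack{0 \le i < j \\ x[i:j] \in V}} \bigl( B(i) + s(x[i:j]) \bigr), \quad 1 \le j \le n
```

Storing the maximizing i (and the matching token) at each j lets the best segmentation be recovered by walking back from position n to position 0.
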
The DP recurrence iterates over byte positions and updates the best-scoring path at each position. The outer loop runs in linear time relative to input length. The inner loop walks a vocabulary trie (a prefix tree structure) at each byte position. On a 16K-token input, this inner walk executes hundreds of thousands of trie transitions. It is the hot path.

What Was Slow in the Hugging Face Implementation

The Hugging Face tokenizers crate is the default Rust tokenizer most teams reach for. Perplexity used it as the benchmark reference. At 514 tokens (512 + BOS/EOS injection), the reference implementation had three costly patterns:

| Bottleneck | Mechanism | Measured impact |
| --- | --- | --- |
| Allocation per match | String::from_utf8 + AHashMap lookup per trie match | 7,295 allocations at 514 tokens; 299,171 at 16K |
| Pointer chase per byte | AHashMap at every trie node; 4 dependent loads per byte step | Dependent-load latency dominates the hot path |
| L2 thrashing on long inputs | DP table and output buffers freshly allocated each call | L2 miss rate climbs from 8% at 128 tokens to 50% at 16K |

Per-token allocation is constant: roughly 2 KB and ~18 allocations per token, regardless of input size. The latency problem becomes severe at longer inputs when cumulative allocations overflow the per-core L2 cache.

Establishing a Baseline Before Changing the Trie

Before switching the trie structure, Perplexity first isolated how much cost came from unnecessary work alone. They made a zero-allocation port of the reference: same HashMap trie, but with a caller-owned scratch struct reused across calls and token IDs stored directly in trie nodes (removing the per-match string allocation and secondary hash-map lookup). This baseline already cut p50 latency to 155 µs at 514 tokens, down from 326 µs in the reference. Instructions retired dropped 2.4x. The remaining cost was the HashMap pointer chase itself, which the next step addressed.
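
A minimal sketch of what such a baseline could look like is shown here. It is not Perplexity's code. The names `Trie`, `Node`, `Scratch`, and `encode` are hypothetical, the standard-library `HashMap` stands in for `AHashMap`, and there is no unknown-token fallback (the function simply returns `false` when no segmentation exists). It illustrates the Viterbi loop over a HashMap trie, with token IDs stored in the nodes and buffers owned by the caller.

```rust
use std::collections::HashMap;

/// Trie node keyed by byte. The token ID and score live in the node itself,
/// so a match needs no string allocation and no second map lookup.
#[derive(Default)]
pub struct Node {
    children: HashMap<u8, usize>,
    token: Option<(u32, f32)>, // (token ID, log-probability) if a token ends here
}

pub struct Trie {
    nodes: Vec<Node>, // nodes[0] is the root
}

impl Trie {
    /// Builds the trie; a piece's token ID is its index in `vocab` (assumption).
    pub fn new(vocab: &[(&str, f32)]) -> Self {
        let mut nodes = vec![Node::default()];
        for (id, (piece, score)) in vocab.iter().enumerate() {
            let mut cur = 0;
            for &b in piece.as_bytes() {
                let existing = nodes[cur].children.get(&b).copied();
                cur = match existing {
                    Some(next) => next,
                    None => {
                        nodes.push(Node::default());
                        let next = nodes.len() - 1;
                        nodes[cur].children.insert(b, next);
                        next
                    }
                };
            }
            nodes[cur].token = Some((id as u32, *score));
        }
        Trie { nodes }
    }
}

/// Caller-owned buffers reused across calls. Once their capacity has grown to
/// fit the longest input seen, encoding performs no further heap allocation.
#[derive(Default)]
pub struct Scratch {
    best: Vec<f32>,          // best[j]: best score for the first j bytes
    back: Vec<(usize, u32)>, // back[j]: (start of the last token, its ID)
    pub ids: Vec<u32>,       // output token IDs
}

/// Viterbi over byte positions. Returns false if the input cannot be segmented.
pub fn encode(trie: &Trie, input: &[u8], s: &mut Scratch) -> bool {
    let n = input.len();
    s.best.clear();
    s.best.resize(n + 1, f32::NEG_INFINITY);
    s.back.clear();
    s.back.resize(n + 1, (0, 0));
    s.ids.clear();
    s.best[0] = 0.0;

    for i in 0..n {
        if s.best[i] == f32::NEG_INFINITY {
            continue; // no segmentation reaches position i
        }
        let mut node = 0;
        for j in i..n {
            // Inner loop: walk the trie one byte at a time (the hot path).
            match trie.nodes[node].children.get(&input[j]) {
                Some(&next) => node = next,
                None => break,
            }
            if let Some((id, score)) = trie.nodes[node].token {
                let cand = s.best[i] + score;
                if cand > s.best[j + 1] {
                    s.best[j + 1] = cand;
                    s.back[j + 1] = (i, id);
                }
            }
        }
    }
    if s.best[n] == f32::NEG_INFINITY {
        return false;
    }
    // Backtrack from the end, then reverse into left-to-right order.
    let mut j = n;
    while j > 0 {
        let (i, id) = s.back[j];
        s.ids.push(id);
        j = i;
    }
    s.ids.reverse();
    true
}
```

With the toy vocabulary `[("a", -1.0), ("b", -1.0), ("ab", -1.5), ("c", -1.0)]`, encoding `b"abc"` selects `ab` + `c` (total score -2.5) over `a` + `b` + `c` (total score -3.0). A single `Scratch` value can be passed to every call, which is the property the baseline relies on.
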

The Three Optimizations

Optimization 1: Double-Array Trie

The Hugging Face trie stores children in a HashMap at every node. Each byte step requires a hash computation, two pointer dereferences, and a heap access. Perplexity replaced this with a double-array trie, the same structure used by SentencePiece and IREE, originally introduced by Aoe in 1989.

A double-array trie encodes the entire trie in two flat integer arrays, base and check. A child lookup is: next = base[node] + byte, then verify check[next] == node. That is two array reads, one integer add, and one comparison, with no hashing and no pointer chasing. For XLM-RoBERTa’s 250K vocab, the whole trie fits in ~9 MB of contiguous memory. The hot working set per encode is on the order of 100 KB, which fits in L2 cache.

Unlike SentencePiece and IREE, which are general-purpose libraries with lattice bookkeeping and multi-stage pipelines, Perplexity inlined the trie directly in the Viterbi loop and dropped that overhead entirely.

Result at 514 tokens: p50 dropped from 155 µs (zero-allocation baseline) to 68 µs. Wall-clock fell 4.8x from the original reference.

Optimization 2: Bitmap and Inline Packing

The double-array trie still requires two dependent array loads per byte step: first the parent’s base offset, then the check array to confirm the transition is valid. Perplexity replaced the check array with a per-node bitmap (four 64-bit words, 32 bytes) that records which of the 256 possible bytes have valid child transitions. A bitmap lookup compiles to a single bit test against one 64-bit word. The check array is used only during trie construction and dropped from the runtime layout entirely.

They also packed all four per-node fields (bitmap, base, token ID, and score) into a single 64-byte cache line, matching CPU cache line width exactly. One trie step now loads a single cache line covering the bitmap for the next-byte check, the base offset for the child slot, and the token ID and score at terminal nodes.
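
The two lookup schemes described above can be sketched side by side. This is an illustration under stated assumptions, not the pplx-garden layout. The names `da_child`, `PackedNode`, and `NO_TOKEN` are hypothetical, the field order is a guess, the child slot is assumed to stay at `base + byte` as in the classic double array, and `u32::MAX` is an assumed sentinel for "no token ends here".

```rust
/// Classic double-array child lookup: two array reads, one add, one compare.
/// Unused `check` slots are assumed to hold a value that is never a node index.
pub fn da_child(base: &[u32], check: &[u32], node: u32, byte: u8) -> Option<u32> {
    let next = base[node as usize] + byte as u32;
    match check.get(next as usize) {
        Some(&parent) if parent == node => Some(next),
        _ => None,
    }
}

/// Assumed sentinel marking a non-terminal node.
pub const NO_TOKEN: u32 = u32::MAX;

/// Bitmap variant: one node per 64-byte cache line.
/// 32 bytes of bitmap + 12 bytes of fields, padded to 64 by the alignment.
#[repr(C, align(64))]
#[derive(Clone, Copy)]
pub struct PackedNode {
    pub bitmap: [u64; 4], // bit b set => a child exists for byte b
    pub base: u32,        // child for byte b is assumed to sit at base + b
    pub token_id: u32,    // NO_TOKEN unless a vocabulary token ends here
    pub score: f32,       // log-probability, meaningful only at terminal nodes
}

impl PackedNode {
    pub const EMPTY: PackedNode = PackedNode {
        bitmap: [0; 4],
        base: 0,
        token_id: NO_TOKEN,
        score: 0.0,
    };

    /// Construction-time helper: record that a child exists for `byte`.
    pub fn set_child(&mut self, byte: u8) {
        self.bitmap[(byte >> 6) as usize] |= 1u64 << (byte & 63);
    }

    /// Runtime lookup: a single bit test against one 64-bit word replaces the
    /// read of the `check` array, and `base` comes from the same cache line.
    pub fn child(&self, byte: u8) -> Option<u32> {
        let word = self.bitmap[(byte >> 6) as usize];
        if (word >> (byte & 63)) & 1 == 1 {
            Some(self.base + byte as u32)
        } else {
            None
        }
    }
}
```

In this sketch `std::mem::size_of::<PackedNode>()` is 64, so a `Vec<PackedNode>` whose backing allocation is 64-byte aligned places each node on exactly one cache line. The memory cost is that every node pays for the full 64 bytes whether or not it has children.
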

Trade-off: trie size grows from ~9 MB to ~50 MB (780K nodes × 64 bytes). The hot working set per encode remains ~100 KB.

Result at 514 tokens: an additional 4.5% wall-clock reduction. L2 accesses dropped from 4.6K to 1.8K per encode.

Optimization 3: Huge Pages for the Trie

At 50 MB, the trie spans roughly 12,000 virtual pages on a default Linux system using 4 KB pages. The first-level data TLB on Intel Sapphire Rapids holds 96 entries. Each Viterbi step touches a different trie node, so TLB misses accumulate. Over a 512-token encode, Perplexity estimated roughly 9,000 cycles spent in page-table walks, about 3% of per-encode budget.

Perplexity backed the trie with 2 MB huge pages via mmap with the MAP_HUGETLB flag. The same 50 MB now spans 25 pages, well within the TLB. This requires vm.nr_hugepages configured at boot. In production, 10,561 huge pages are reserved; the trie uses 24.

Result: 3-12% wall-clock reduction depending on input length. The largest gain is at 4,098 tokens (-12.0%), where page-table traffic was actively competing with trie data for L2 bandwidth. Beyond 4K tokens the gain shrinks because L3 misses dominate.

Final Benchmark Results

All measurements are single-threaded, pinned to one core on an Intel Xeon Platinum 8488C, with 10,000 iterations after 1,000 warmup rounds. At 514 tokens:

| Engine | p50 Latency | Instructions | Allocations |
| --- | --- | --- | --- |
| Hugging Face (tokenizers crate) | 349 µs | 3.60M | 7,295 |
| SentencePiece (C++) | 128 µs | 1.83M | 1,559 |
| IREE tokenizer (C) | 112 µs | 2.28M | 1 |
| Perplexity (final, all 3 optimizations) | ~63 µs | 1.04M | 0 |

Across the full optimization sequence, instructions per encode fell from 3.66M to 1.04M, a 3.5x reduction. Wall-clock matches that ratio at short inputs and widens at long inputs where the reference’s per-token allocations overflow L2 and L3.

One additional finding: off-the-shelf Rust wrapper crates around SentencePiece and IREE add 1.6-1.9x latency overhead compared to the native C/C++ binaries.
The sentencepiece crate allocates a fresh list of token pieces on each call. The overhead is measurable but amortizes at long inputs.

The final Perplexity encoder produces token-exact output against the reference. In production, it uses rayon to paralleli
