Trajectory Releases a Concurrent Multi-LoRA Training Stack for Continual Learning, Reporting a 2.81× Experiment-Throughput Gain
Trajectory’s concurrent multi-LoRA stack reports a 2.81× experiment-throughput gain over single-tenant RL, with all code in the NovaSky-AI/SkyRL GitHub repository.

Most language models improve in discontinuous jumps. A team collects data, trains, and ships a new version. That cycle takes months, and each release can change behavior for users in remarkable or catastrophic ways. Trajectory wants to replace it with continual learning.

The Trajectory team published a field report describing how it built a concurrent, multi-LoRA training platform for continuously learning workloads. The work was done with UC Berkeley Sky Lab and Anyscale, and all training code is open-sourced in the NovaSky-AI/SkyRL repository. The result is a 2.81× end-to-end experiment-throughput improvement over a single-tenant training framework. Trajectory reports no regression on any training rewards.

What Multi-LoRA Training Actually Is

Continual learning requires models to update from live feedback and production interactions. A coding agent could learn engineering patterns as developers correct its work. A support agent could learn to resolve hard tickets as operators intervene on difficult cases.

Most training infrastructure still assumes a linear lifecycle. Teams allocate GPUs, initialize the model, run a job, then spin down. Continual learning revises that relationship. When production interactions become training inputs, training becomes part of a live system.

Modern RL training reduces to three core primitives. The Sampler generates trajectories from the current policy model. The Trainer computes gradients and updates the policy weights. Parameter synchronization broadcasts updated weights back to inference workers.

Trajectory calls its approach Continuous Multi-LoRA Training, or C-LoRA. Each experiment maps to a dedicated LoRA adapter on a warm, multi-tenant engine.
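The three primitives described above (Sampler, Trainer, parameter synchronization) and the one-adapter-per-experiment mapping can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not SkyRL's API: the class names `Adapter`, `Sampler`, and `Trainer`, the function `run`, and the scalar "policy" with a quadratic reward are all hypothetical stand-ins chosen so the example runs without a GPU.

```python
import random
from dataclasses import dataclass

TARGET = 1.0  # toy optimum; stands in for "the behavior the reward prefers"


@dataclass
class Adapter:
    """Hypothetical stand-in for one experiment's LoRA adapter (a single scalar)."""
    weight: float = 0.0
    version: int = 0


class Sampler:
    """Toy inference engine that keeps every tenant's adapter loaded at once."""

    def __init__(self):
        self.loaded = {}  # experiment id -> adapter weight currently served

    def sync(self, exp_id, adapter):
        # Parameter synchronization: push the latest weights to the sampler.
        self.loaded[exp_id] = adapter.weight

    def rollout(self, exp_id, rng, n=16):
        # Generate n (action, reward) pairs from the tenant's current policy.
        w = self.loaded[exp_id]
        actions = [w + rng.gauss(0.0, 0.1) for _ in range(n)]
        return [(a, -(a - TARGET) ** 2) for a in actions]


class Trainer:
    """Toy trainer: one gradient-ascent step on the mean reward."""

    def __init__(self, lr=0.2):
        self.lr = lr

    def step(self, adapter, trajectories):
        # d/dw of -(a - TARGET)^2 with a = w + noise is -2 * (a - TARGET).
        grad = sum(-2.0 * (a - TARGET) for a, _ in trajectories) / len(trajectories)
        adapter.weight += self.lr * grad
        adapter.version += 1


def run(num_experiments=3, steps=20, seed=0):
    rng = random.Random(seed)
    sampler, trainer = Sampler(), Trainer()
    # One dedicated adapter per experiment, all sharing the same sampler.
    adapters = {f"exp-{i}": Adapter() for i in range(num_experiments)}
    for exp_id, adapter in adapters.items():
        sampler.sync(exp_id, adapter)
    for _ in range(steps):
        for exp_id, adapter in adapters.items():
            trajectories = sampler.rollout(exp_id, rng)  # Sampler
            trainer.step(adapter, trajectories)          # Trainer
            sampler.sync(exp_id, adapter)                # parameter sync
    return adapters


if __name__ == "__main__":
    for exp_id, adapter in run().items():
        print(exp_id, round(adapter.weight, 3), adapter.version)
```

The sampler and trainer are created once and reused across every experiment and step, which is the property the article attributes to a warm, multi-tenant engine; a real system would replace the scalar with LoRA weight matrices and the loop body with distributed rollout and gradient computation.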
The Problems It Targets

The Trajectory team identifies four inefficiencies in traditional stacks:

(1) Cold starts are slow: Every serial job reloads checkpoints, initializes the distributed runtime, and warms inference engines. For large models, this step alone can exceed 30 minutes per run.
(2) RL is memory intensive: Frontier models often exceed 100B parameters. Qwen3.5-397B can require up to eight H200 nodes to fit into memory. LoRA cuts memory usage by an order of magnitude by freezing the base model and training only small adapter weights.
(3) Traditional stacks are single-tenant: They run one experiment at a time. Multi-LoRA maps each experiment to one adapter, multiplexing throughput by a factor of N.
(4) Job utilization is low: Trainers and inference engines stall while waiting for each other. Multi-LoRA load balances across jobs to fill idle capacity.

Inside the Architecture

Most throughput wins come from inference. In vLLM, all adapters are hot-loaded in GPU memory. Decode steps can then mix tokens from different adapters in the same batch. The key enabler is the SGMV decode kernel, which fuses per-adapter matrix-vector work into one GPU launch per decode step. After each optimization step, updated LoRA weights load in-place into the inference engine. The scheduler does not freeze, so other tenants keep decoding.

Training works differently. One active LoRA adapter trains on the GPU while the rest sit in pinned CPU memory. Each tenant’s state lives in an AdapterStore, which holds LoRA parameters, FP32 master weights, optimizer moments, and gradient buffers. The engine swaps one tenant’s state onto the GPU, runs a single forward_backward pass, then swaps it back. This training path is still single-adapter, so the inference concurrency gains do not yet apply to training.

The Numbers

Trajectory tested on a single H200 node with Qwen3-4B-Instruct-2507. It ran sync RL on GSM8K in an agentic setting. The Trajectory team reframed GSM8K as a tool use learning task.
The model decides when to call a Calculator and a Final Answer tool. Reward is 1.0 only when Final Answer is called with the correct answer. The policy starts near 40% accuracy at step 0. With the right learning algorithm, it climbs past 90% by step 9.

The Trajectory team scaled to eight concurrent multi-LoRA runs. Final Experiment Time hit 5433s at N=8, a 2.81× speedup. Eight concurrent experiments finished before three serial runs back-to-back. Mean Experiment Time also improved, peaking at N=4 with a 1.88× speedup. Every concurrency level reached reward_accuracy above 90% by step 9.

The Tradeoffs

Higher throughput costs per-step latency. As N grows, First Experiment Time and Step Time degrade. At N=8, the first serial experiment finishes 1.97× faster than the first concurrent one. Mean step time rises from 191s to 500s, only 2.62× slower despite eight times the concurrent load. Most of that increase is rollout time: rollout grows from 162s to 401s, roughly 77% of the increase. At N=2, doubling the load adds only 15% rollout time. That is the ideal case for multi-LoRA.

The pattern held on a harder workload. On τ-bench retail with the NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 MoE model, N=2 finished 10 steps 1.28× faster. Per-tenant step time rose 1.57×.

Strengths and Weaknesses

Strengths:
- 2.81× end-to-end experiment-throughput gain at eight concurrent runs
- No accuracy regression; runs tracked the serial baseline within ±1σ in the final steps
- LoRA cuts memory by an order of magnitude versus full fine-tuning
- Fully open-sourced in NovaSky-AI/SkyRL for the community to build on

Weaknesses:
- Per-step latency and First Experiment Time degrade as N grows
- Training remains serialized across tenants; only inference is multiplexed
- Tested mainly on mid-sized models, not frontier-scale parameters
- Setup requires an 8× H100/H200 node and a Megatron build

Key Takeaways

- Trajectory built a concurrent, multi-LoRA RL training stack for continual learning, open-sourced in NovaSky-AI/SkyRL.
- The stack reports a 2.81× end-to-end experiment-throughput gain over a single-tenant baseline, with no reward regression.
- Each experiment maps to a dedicated LoRA adapter on an always-hot engine, multiplexing throughput by N.
- Most gains come from vLLM multi-LoRA inference via the SGMV decode kernel; training stays single-adapter.
- The tradeoff is per-step latency: at N=8, step time rises from 191s to 500s.
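The throughput and latency figures quoted in the article can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the one assumption, flagged in the code, is that the 2.81× gain is defined as the time for N serial runs back-to-back divided by the time for all N concurrent runs to finish. All variable names are illustrative.

```python
# Figures quoted in the article for the GSM8K runs.
N = 8                        # concurrent multi-LoRA experiments
concurrent_total_s = 5433.0  # Final Experiment Time at N=8
throughput_gain = 2.81       # reported end-to-end speedup at N=8
step_serial_s, step_n8_s = 191.0, 500.0        # mean step time, serial vs N=8
rollout_serial_s, rollout_n8_s = 162.0, 401.0  # rollout time, serial vs N=8

# ASSUMPTION: gain = (N serial runs back-to-back) / (all N concurrent runs).
# Under that definition, the implied duration of one serial run is:
implied_serial_run_s = throughput_gain * concurrent_total_s / N

# How many serial runs fit in the time the eight concurrent runs took.
serial_runs_equivalent = concurrent_total_s / implied_serial_run_s

# Per-step slowdown at N=8 and the share of it attributable to rollout.
step_slowdown = step_n8_s / step_serial_s
rollout_share = (rollout_n8_s - rollout_serial_s) / (step_n8_s - step_serial_s)

if __name__ == "__main__":
    print(f"implied serial run time: {implied_serial_run_s:.0f} s")
    print(f"serial runs in the same wall time: {serial_runs_equivalent:.2f}")
    print(f"step slowdown at N=8: {step_slowdown:.2f}x")
    print(f"rollout share of added step time: {rollout_share:.0%}")
```

Under that assumption, the eight concurrent runs take about as long as 2.85 serial runs, which agrees with the statement that they finished before three serial runs back-to-back. The step slowdown comes out to 2.62× and the rollout share to about 77%, matching the article.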