Hugging Face Blog · AI Agent

ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration

June 30, 2026, 18:32

Key Points

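An IBM research team has introduced ScarfBench, a benchmark for evaluating AI agents on enterprise Java framework migration. It focuses on migration tasks among three major frameworks (Spring, Jakarta EE, and Quarkus) and requires agents not only to generate code but also to ensure that the application builds, deploys, and passes behavioral validation. ScarfBench comprises 34 applications, 204 migration tasks, and more than 1,300 expert-written tests, giving an assessment of modernization quality that is closer to real-world conditions.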

AI-compiled digest (generated by this site)

Enterprise Article, published June 30, 2026

Authors (ibm-research): Raju Pavuluri (rpavuluri), Rahul Krishna (rkrsn), Srikanth Govindaraj Tamilselvam (stamilse), Bridget M (brmcg), Ashita Saxena (ashitasaxenaIBM), George Safta (george-safta), Advait Pavuluri (apavuluri), Michele Merler (mimerler)

Modernizing enterprise applications is one of the largest and most expensive software engineering activities organizations undertake. Teams migrate applications across frameworks to improve maintainability, cloud readiness, developer productivity, and access to modern capabilities.

Recent advances in coding agents have sparked excitement around AI-assisted modernization. But an important question remains: can AI agents reliably modernize real-world enterprise applications?

Existing software engineering benchmarks have demonstrated impressive progress in bug fixing and code generation, but framework migration presents a fundamentally different challenge. Success requires not only translating code, but also preserving behavior, adapting build systems, and navigating runtime dependencies.

To address this gap, we introduce ScarfBench (Self-Contained Application Refactoring Benchmark), an open benchmark for evaluating AI agents on cross-framework migration tasks in Enterprise Java.

ScarfBench focuses on migrations across three major Java ecosystems:

- Spring
- Jakarta EE
- Quarkus

Unlike traditional benchmarks that compare generated code against reference implementations, ScarfBench evaluates whether migrated applications actually build, deploy, and preserve behavior.

Why Migration Is Hard

Framework migration is much more than replacing annotations.
A simple repository migration can require changes across dependency injection, persistence configuration, queries, and framework descriptors. Small mistakes in any of these pieces can prevent successful deployment.

Figure: Spring → Jakarta Migration Example
Framework migration requires translating framework semantics, not just source code.

Introducing ScarfBench

ScarfBench provides a systematic way to evaluate AI agents on enterprise Java framework migration tasks. Applications are required to:

- Build successfully.
- Deploy correctly.
- Pass behavioral validation.

This provides a much more realistic measure of modernization quality.

Benchmark at a Glance

Metric                       Value
Applications                 34
Framework implementations    102
Migration tasks              204
Lines of code                ~151K
Source and test files        ~2,000
Expert-written tests         1,331

ScarfBench includes both focused migration tasks and whole-application migrations.

Figure: ScarfBench Construction Pipeline
Starting from a JSR-based enterprise Java taxonomy, expert migrations create verified implementations across Spring, Jakarta EE, and Quarkus.

How Do Frontier Agents Perform?

We evaluated several state-of-the-art coding agents on ScarfBench. Despite strong performance on traditional software engineering benchmarks, framework migration remains difficult. Success rates vary considerably across framework pairs, and whole-application migrations remain particularly challenging.

Figure: Current Leaderboard
Source: scarfbench.info/leaderboard

Even the strongest current agents achieve less than 10% behavioral success, illustrating the gap between generating compilable code and preserving application behavior.

Figure: Compile → Deploy → Test Progression
Compile success consistently exceeds deploy success, which in turn exceeds behavioral success. Build success alone significantly overestimates migration quality.
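
The compile → deploy → test progression can be read as a gated pipeline: an application only counts toward a later stage if it cleared every earlier one, which is why the three rates can only shrink from left to right. The sketch below illustrates that reading in Java. It is not ScarfBench's evaluation harness; the `Stage` enum, the `Outcome` record, and the sample applications are hypothetical and chosen only for illustration.

```java
import java.util.List;

public class StagedEvaluation {

    /** Furthest stage a migrated application reached (hypothetical naming). */
    public enum Stage { NONE, COMPILED, DEPLOYED, TESTS_PASSED }

    /** One migration attempt and the furthest stage it reached. */
    public record Outcome(String app, Stage reached) {}

    /** Fraction of outcomes that reached at least the given stage. */
    public static double rateAtLeast(List<Outcome> outcomes, Stage stage) {
        if (outcomes.isEmpty()) {
            return 0.0;
        }
        long count = outcomes.stream()
                // Enum order encodes the gating: TESTS_PASSED implies DEPLOYED implies COMPILED.
                .filter(o -> o.reached().compareTo(stage) >= 0)
                .count();
        return (double) count / outcomes.size();
    }

    public static void main(String[] args) {
        // Made-up sample data, not benchmark results.
        List<Outcome> sample = List.of(
                new Outcome("app-a", Stage.TESTS_PASSED),
                new Outcome("app-b", Stage.DEPLOYED),
                new Outcome("app-c", Stage.COMPILED),
                new Outcome("app-d", Stage.COMPILED),
                new Outcome("app-e", Stage.NONE));

        System.out.println("compile rate: " + rateAtLeast(sample, Stage.COMPILED));
        System.out.println("deploy rate:  " + rateAtLeast(sample, Stage.DEPLOYED));
        System.out.println("test rate:    " + rateAtLeast(sample, Stage.TESTS_PASSED));
    }
}
```

Reporting only the first of these rates would credit four of the five sample applications, while only one preserves behavior, which is the gap the figure describes.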
Figure: Migration Outcomes by Target Framework
Migration difficulty depends strongly on the target framework, with Jakarta EE proving particularly challenging.

What We Learned About AI Agents for Java Modernization

Beyond measuring success rates, ScarfBench helps us understand how agents behave during modernization.

Can Agents Reliably Tell When a Migration Is Complete?

A migrated application is only useful if it actually builds and runs. We therefore compared agent-reported outcomes against independent build verification.

Finding: Agents Are Overconfident

Claude Code reported successful builds for 29 out of 30 whole applications. Only 22 of those applications actually built successfully. Meanwhile, the single application classified as failed by the agent ultimately built correctly.

This suggests that agent self-assessment should not be treated as a reliable signal of migration completion. Independent build and test validation remains essential.

How Do Agents Navigate Application Dependencies?

Framework migrations rarely affect a single file or layer. Changes in configuration, services, databases, and web components often cascade across the application.

Finding: Migration Is Iterative Rather Than Linear

The most frequently visited layers were:

- Configuration
- Web
- Database
- Service

Common transitions included:

- Configuration ↔ Web
- Service ↔ Database

This suggests that migration is an iterative dependency-resolution process rather than a simple source-to-source transformation.

Where Do Agents Spend Most of Their Effort?

We used layer revisit frequency as a proxy for migration effort. Layers that required repeated visits typically involved debugging, dependency resolution, or framework adaptation.

Finding: Configuration Dominates Migration Effort

Rather than proceeding linearly, agents repeatedly returned to configuration-related artifacts while resolving framework differences and dependency issues.

What Challenges Are Not About Code Transformation?
Not every migration issue originates from source code.

Finding: Environment and Tooling Matter

Agents frequently struggled with environmental issues, including:

- Docker cache inconsistencies
- Port connectivity problems
- Maven wrapper and build tooling issues

These operational concerns often delayed validation even when the source-code migration itself was largely complete.

Figure: Failure Mode Distribution
Modernization failures span build systems, deployment environments, dependency injection, databases, endpoints, assertions, and infrastructure.

Key Takeaway

The biggest challenge in framework modernization is not translating Java code. It is managing the web of dependencies across configuration, infrastructure, and runtime environments.

While frontier agents can automate substantial portions of the migration process, reliable validation and architectural reasoning remain critical for achieving successful outcomes.

ScarfBench helps expose these challenges and provides a standardized way to measure progress toward truly autonomous application modernization.

Explore ScarfBench

ScarfBench is designed as an open resource for researchers and practitioners. Resources include:

- Benchmark dataset
- Evaluation infrastructure
- Public leaderboard
- Documentation
- Open-source code

Researchers can compare agent architectures and techniques. Practitioners can use ScarfBench to evaluate modernization solutions before deploying them in production environments.

Website: https://scarfbench.info
Dataset: https://huggingface.co/datasets/ibm-research/ScarfBench
Space: https://huggingface.co/spaces/ibm-research/ScarfBench
GitHub Repository: https://github.com/scarfbench/scarfbench
Leaderboard: https://scarfbench.info/leaderboard
Paper: https://arxiv.org/abs/2605.06754

Framework migration remains one of the largest unsolved problems in AI-assisted software engineering. We hope ScarfBench helps the community measure progress and accelerate the next generation of AI-assisted application modernization.
We invite researchers, practitioners, and framework communities to evaluate their agents, contribute new migration scenarios, and help advance the state of the art.
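
As a worked check of the overconfidence finding reported earlier in the article, the stated counts can be arranged into a small table of claimed versus verified build outcomes. The sketch below assumes that "22 of those applications" refers to the 29 applications the agent claimed as successful, which leaves 7 unconfirmed claims. The `Confusion` record and its method names are hypothetical, and the derived ratios are arithmetic on the article's counts, not figures the article reports.

```java
public class SelfReportCheck {

    /** Agent claims cross-tabulated against independent build verification. */
    public record Confusion(int claimedOkAndBuilt,
                            int claimedOkButFailed,
                            int claimedFailButBuilt,
                            int claimedFailAndFailed) {

        /** Total number of applications in the comparison. */
        public int total() {
            return claimedOkAndBuilt + claimedOkButFailed
                    + claimedFailButBuilt + claimedFailAndFailed;
        }

        /** Applications that built, regardless of what the agent claimed. */
        public int actuallyBuilt() {
            return claimedOkAndBuilt + claimedFailButBuilt;
        }

        /** Share of "success" claims that verification confirmed. */
        public double claimPrecision() {
            return (double) claimedOkAndBuilt / (claimedOkAndBuilt + claimedOkButFailed);
        }

        /** Share of applications where the claim matched the verified result. */
        public double agreement() {
            return (double) (claimedOkAndBuilt + claimedFailAndFailed) / total();
        }
    }

    public static void main(String[] args) {
        // From the article: 29 of 30 claimed success, 22 of those built,
        // and the single claimed failure actually built.
        Confusion c = new Confusion(22, 29 - 22, 1, 0);

        System.out.println("applications:       " + c.total());
        System.out.println("actually built:     " + c.actuallyBuilt());
        System.out.println("confirmed claims:   " + c.claimPrecision());
        System.out.println("claim/build match:  " + c.agreement());
    }
}
```

Under that reading, 23 of the 30 applications built, roughly three quarters of the success claims were confirmed, and the agent's verdict matched the verified outcome for 22 of the 30 applications.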
