Hugging Face Blog | AI Agent

EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios

June 4, 2026, 12:24


Published June 4, 2026
Authors (ServiceNow-AI): Tara Bogavelli (tarabogavelli), Gabrielle Gauthier Melancon (gabegma), Katrina Stankiewicz (kstankiewicz), Nifemi Bamgbose (onifemibam), Fanny Riols (FannyRiols), Hoang Nguyen (hnguy7), Raghav Mehndiratta (rmehndir), Lindsay Brin (lindsaybrin), Hari Subramani (Hari-sub), Anil Madamala (anilmadamala)

Introduction

Voice agent failures are often highly domain-specific. A system that flawlessly processes alphanumeric confirmation codes in flight re-booking transactions might stumble when handling complex policies in HR systems. Different domains test an agent's ability to adapt to different vocabulary, workflow complexities, and user expectations. So with this release, EVA-Bench expands from one enterprise domain to three: Airline Customer Service Management (CSM), Enterprise IT Service Management (ITSM), and Healthcare HR Service Delivery (HRSD). Together they span 213 evaluation scenarios across 121 tools, a roughly 4x increase in scenario coverage from our original release. Every scenario was validated for solvability against three frontier models (OpenAI GPT-5.4, Google Gemini 3.1 Pro, and Anthropic Claude Opus 4.6), ensuring the benchmark is both challenging and fair.
All three datasets are open-source and available for download:

    from datasets import load_dataset

    # Airline Customer Service Management (CSM): 50 scenarios
    airline = load_dataset("ServiceNow-AI/eva-bench", "airline", split="test")

    # Enterprise IT Service Management (ITSM): 80 scenarios
    itsm = load_dataset("ServiceNow-AI/eva-bench", "itsm", split="test")

    # Healthcare HR Service Delivery (HRSD): 83 scenarios
    hrsd = load_dataset("ServiceNow-AI/eva-bench", "medical", split="test")

EVA-Bench is built for multiple audiences. If you're evaluating a voice agent, you can run it against a diverse set of realistic enterprise scenarios spanning 35+ distinct workflows. If you're building your own evaluation dataset, this post describes our end-to-end generation and validation process in enough detail to serve as a practical reference. We walk through how each domain was designed and generated, and take a deep dive into the two new additions. We also preview our upcoming multilingual extension, which widens the benchmark's reach beyond English-only enterprise deployments.

Data Design Principles

Five principles guided the design of the EVA-Bench datasets across all three domains.

Voice-first scope. Not every enterprise workflow belongs in a voice benchmark. We started by identifying which tasks within each domain are handled over the phone in practice, then selected the most common flows from that subset. This kept the scenarios grounded in realistic call patterns.

Realism. Tool schemas were modeled after the kinds of APIs a production platform uses. Scenario policies were drawn from actual enterprise constraints. For the Healthcare HRSD domain, this meant grounding scenarios in actual US healthcare policy and administration systems, including NPI numbers, FMLA, and insurance coverage, so that the benchmark reflects the domain as practitioners encounter it.

Variety. Scaling a dataset by simply repeating identical tasks offers limited evaluation signal. To avoid this, we defined specific workflows for each domain and sampled across three scenario types: single-intent calls, multi-intent calls with up to four intents in a single conversation, and adversarial calls where callers attempt to bypass troubleshooting steps, misclassify urgency, or access records they are not authorized to view. Within single-intent and multi-intent scenarios, we also included cases where the user's goal is not satisfiable, because real call volume is not all happy-path, and in our experience models tend to struggle more with unsatisfiable goals than with successful interactions.

Authentication. Prior work (EVA-Bench and τ-Voice) has identified authentication as one of the most consistent failure points for voice agents. Every domain in EVA-Bench includes authentication flows, and the specific mechanisms are calibrated to the task. For example, OTP-based elevation appears where a production system would actually require it, not uniformly across all scenarios.

Reproducibility. Without reproducible scenarios, it is difficult to know whether a score difference reflects a genuine capability gap or an artifact of how the scenario played out. We designed the dataset so that every scenario has exactly one correct resolution path. User goal construction ensures the simulator always has the information and instructions it needs to behave consistently, and scenario generation explicitly checks for and eliminates any cases where multiple valid action sequences could achieve the same outcome.

Scenario Generation

Joint generation. Scenarios are generated using SyGra, a graph-based synthetic data generation pipeline, with GPT-5.4 as the backbone. Each scenario requires three jointly consistent components, which are generated together to prevent the inconsistencies that arise when components are produced independently:

User goal. Reproducibility requires that the user simulator behaves the same way every time a scenario is run. A vague statement of intent does not achieve this: the simulator will make different judgment calls across runs, producing inconsistent evaluation signals. To eliminate this, the user goal is structured as a decision tree that covers every situation the simulator is likely to encounter. The user goal specifies exactly what the user should ask for, along with a negotiation sequence that specifies exactly when to push back, when to ask for alternatives, and when to accept. Common edge cases, such as whether to accept a standby flight or an alternate airport, are handled with explicit instructions rather than left to the simulator to interpret. The resolution condition requires evidence of a completed action, such as a confirmation number or case ID, rather than a verbal commitment, so the simulator stays on the call until the action is actually confirmed. The result is a user that behaves like a consistent, realistic caller rather than one that improvises.

Initial scenario database. This is the backend state that the agent's tools query and modify during the scenario. It is generated jointly with the user goal to ensure that every entity referenced in the user goal, such as booking IDs, account details, and authentication credentials, exists and is consistent in the database.

Expected final database state (ground truth). We derive the expected outcome by running the generation LLM on the agent instructions, user goal, and initial scenario database, producing a full action trace. As the LLM executes write tool calls, the database is updated incrementally, and the resulting terminal state becomes the ground truth that verifiers check against during evaluation.

Joint generation is essential because these three components are deeply interdependent. Independent generation would introduce silent inconsistencies, such as a case ID referenced in the user goal that does not exist in the scenario database, which would corrupt the evaluation signal entirely.
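
To make the ground-truth derivation concrete, here is a minimal sketch of replaying an action trace against an initial database and comparing the result with an agent's final state. Everything in it is an assumption for illustration: the tool names (`rebook_flight`, `get_booking`), the dictionary layout, and the exact-equality comparison are hypothetical and are not taken from the EVA-Bench codebase.

```python
import copy

# Hypothetical write tool: mutates the scenario database in place.
def rebook_flight(db, booking_id, new_flight):
    db["bookings"][booking_id]["flight"] = new_flight

# Only write tools change state; anything not listed here is treated as read-only.
WRITE_TOOLS = {"rebook_flight": rebook_flight}

def replay_trace(initial_db, trace):
    """Apply each write call in order to a copy of the initial database."""
    db = copy.deepcopy(initial_db)  # never mutate the initial scenario database
    for call in trace:
        tool = WRITE_TOOLS.get(call["tool"])
        if tool is None:
            continue  # read-only call (for example a lookup): state is unchanged
        tool(db, **call["args"])
    return db

def matches_ground_truth(agent_final_db, expected_final_db):
    """Assumed verifier: the agent passes only if its terminal state is identical."""
    return agent_final_db == expected_final_db

# Illustrative data, invented for this sketch.
initial_db = {"bookings": {"ABC123": {"flight": "SN100", "passenger": "J. Doe"}}}
reference_trace = [
    {"tool": "get_booking", "args": {"booking_id": "ABC123"}},
    {"tool": "rebook_flight", "args": {"booking_id": "ABC123", "new_flight": "SN200"}},
]

# The terminal state after the reference trace serves as the expected final state.
expected_final_db = replay_trace(initial_db, reference_trace)
```

Comparing terminal states rather than transcripts is what lets a check like this stay independent of how the conversation was phrased; it relies on the single-resolution-path property described under Reproducibility.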
To enforce consistency, we run a multi-stage validation loop after each generation attempt and feed any failures back to the generation step, which retries until all checks pass. Validation proceeds in three steps. A structural check validates the scenario database against a Pydantic schema, catching type errors and missing fields. An LLM-based validator then checks consistency across the scenario more holistically: whether user-facing details in the goal match the database records, whether cross-references are internally valid, and whether authentication data is correctly configured. LL
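
The generate, validate, and retry loop described above can be sketched roughly as follows. This is a standard-library stand-in under stated assumptions, not the authors' pipeline: plain Python checks replace the Pydantic schema, a deterministic cross-reference function replaces the LLM-based validator, and only the two checks named in the text are covered. All function names, field names, and the toy generator are hypothetical.

```python
def structural_check(scenario):
    """Stand-in for schema validation: return a list of error messages."""
    db = scenario.get("database")
    if not isinstance(db, dict):
        return ["database must be a mapping"]
    errors = []
    for booking_id, record in db.get("bookings", {}).items():
        if not isinstance(record.get("flight"), str):
            errors.append(f"booking {booking_id}: 'flight' is missing or not a string")
    return errors

def cross_reference_check(scenario):
    """Stand-in for the holistic validator: goal entities must exist in the database."""
    known = scenario["database"].get("bookings", {})
    goal = scenario.get("user_goal", {})
    return [
        f"user goal references unknown booking {booking_id}"
        for booking_id in goal.get("booking_ids", [])
        if booking_id not in known
    ]

def generate_until_valid(generate, max_attempts=5):
    """Call generate(feedback) until every check passes; return (scenario, attempts)."""
    feedback = []
    for attempt in range(1, max_attempts + 1):
        scenario = generate(feedback)
        feedback = structural_check(scenario)
        if not feedback:  # run the consistency check only on well-formed data
            feedback = cross_reference_check(scenario)
        if not feedback:
            return scenario, attempt
    raise RuntimeError(f"no valid scenario after {max_attempts} attempts: {feedback}")

def toy_generate(feedback):
    """Toy generator: makes a cross-reference mistake until it receives feedback."""
    booking_id = "ABC123" if feedback else "ZZZ999"
    return {
        "user_goal": {"booking_ids": ["ABC123"]},
        "database": {"bookings": {booking_id: {"flight": "SN100"}}},
    }

scenario, attempts = generate_until_valid(toy_generate)
```

In this toy run the first attempt fails the cross-reference check and the second passes once the error list is fed back. Capping the number of attempts is a choice made for this sketch; the text itself only says that generation retries until all checks pass.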

Original source: Hugging Face Blog


TechWeb | AI Agent

NetEase Youdao Pivots Fully to AI, Showcases Its Full-Scenario Agent Lineup at the Book Fair (圖博會)


Hugging Face Blog | AI Agent

MosaicLeaks: Can your research agent keep a secret?

Published June 18, 2026
Authors (ServiceNow): Alexander Gurung (agurung), Rafael Pardinas (rafapi-snow)

TL;DR

Deep research agents increasingly combine private local documents with external tools like web retrieval, creating a privacy risk: an agent's external queries may leak sensitive information. MosaicLeaks proposes a new deep-research task with multi-hop questions that interleave public and private information. Across the models we tested, agents frequently leaked private information, and training only for task performance made it worse. We propose a mosaic-leakage-aware RL training method, Privacy-Aware Deep Research (PA-DR), which raises strict chain success (the share of chains
