Key Takeaways
- Insurers trust our voice AI agents with real customer conversations, and that trust has to be re-earned on every release. Testing is where it gets earned, and for most teams it's the slowest, most manual part of the lifecycle.
- We built Agent Arena, a simulation harness that runs synthetic customer conversations against our P&C insurance voice agents. The agent under test runs exactly as it does in production and can't tell it's being tested.
- Backend tool calls are resolved from scenario data, so teams test agent behavior without standing up a single carrier sandbox.
- Transcripts are scored two ways: deterministic unit tests (tool arguments, call order, required fields) and LLM-as-judge evaluations (hallucination, instruction-following, use-case quality).
- Hundreds of conversations that used to take days of manual dialing now run in minutes, and regressions surface before they reach a caller.
- Where we're headed: a self-serve platform where any team submits a dataset, gets scored results, compares against every prior run, and uses that evidence to gate release decisions.
The Testing Bottleneck Nobody Escapes
Voice AI agents are increasingly deployed in high-stakes domains, including P&C insurance, where they handle first notice of loss (FNOL), claim status inquiries, quote generation, policy servicing, and endorsements. Each workflow demands that the agent gather specific information, follow policy constraints, and invoke backend APIs through tool calls. A missed field, a hallucinated policy detail, or a malformed tool argument does not stay confined to the conversation. It propagates into downstream systems and creates indemnity and compliance risk. Validating that agents behave correctly is not a nice-to-have; it is a prerequisite for shipping and operating them safely.
Yet for virtually every team building voice AI agents, rigorous testing remains one of the most time-consuming parts of the lifecycle, both during development and long after the agent ships.
In Development: Death by a Thousand Test Calls

Building a voice agent is iterative. Every prompt revision, tool schema change, or routing adjustment can alter conversation behavior in ways that are hard to predict. Validating those changes typically requires live phone calls: dial in, walk through a scenario, record what happened, compare against expectations, and repeat. A single prompt tweak can consume hours before anyone can say whether the change helped or hurt.
The cost is concrete. Producing a test set large enough to draw conclusions from, on the order of 50 to 100 transcripts, takes at least a full day of manual calling before any analysis can begin. That set is what teams mine for signal: where the agent dropped a required field, mis-structured a tool argument, or took a wrong branch. When a single day of effort stands between a code change and the evidence to judge it, iteration slows to a crawl and the temptation to ship on intuition grows.
The work compounds as agents grow more capable. Multi-agent architectures, where a coordinator routes between specialized sub-agents, multiply the conversation paths that need exercising. Tool integrations add another dimension: the agent must not only speak correctly but call the right APIs with correctly structured arguments. Testing one happy path is manageable; testing the combinatorial space of use cases, edge cases, and failure modes is not, at least not through manual calling alone.
In Production: The Regressions Never Stop Coming
Shipping an agent does not end the testing burden. Production introduces a continuous need for regression validation: every subsequent change must be checked against the flows that already work, because a fix for one failure mode can silently break three others. Without a repeatable, scalable harness, teams fall back on ad-hoc spot checks (a few manual calls, a sampling of production transcripts, a round of third-party testing) and hope nothing critical slips through.
The consequences are asymmetric. A regression in a customer-facing voice agent does not produce a clean error message in a log file. It produces a failed conversation, a frustrated caller at midnight, and potentially incorrect data written to backend systems. In regulated domains like insurance, those failures carry real downstream cost.
Why the Usual Playbook Falls Short
The challenge runs deeper than logistics. Static benchmarks and single-turn evaluations do not capture what makes production voice agents hard. Real interactions are multi-turn, policy-bound, and tool-dependent: the agent must navigate a conversation while calling APIs with structured arguments, switching between sub-agents in multi-step workflows, and adhering to domain rules that vary by use case [1, 4]. Traditional evaluation methods that rely on small, manually curated test sets fail to scale and miss the intricate dynamics of these interactions [1].
Most teams also lack a durable system of record for quality. Test results scatter across screenshots, spreadsheets, chat threads, and engineers' memories. When the question arises ("did this change help?"), there is rarely a quantitative, comparable answer. Without persistent run history and scored outcomes, release decisions default to intuition rather than evidence.
Standing up backend test environments adds another layer of friction. Voice agents that interact with carrier systems through tool calls need coherent mock data or sandbox APIs to test against. Building and maintaining those environments for every integration is fragile, slow, and often impractical, especially when multiple teams iterate on agent logic in parallel.
The Real Risk Isn't the Voice: It's the Reasoning
The highest-risk failures are often not in the speech layer itself. Most production voice agents are built as a cascaded architecture (speech recognition feeds an LLM layer, which in turn drives speech synthesis), and it is that middle LLM layer, the reasoning layer, where data integrity breaks: does the agent collect every required field? Does it call the right tool with the right arguments? Does it route to the correct sub-agent? These questions can be exercised directly at the reasoning layer, without standing up an audio pipeline, provided the layer is fed input that reflects the imperfect transcripts it sees in production. That structural insight, isolating the reasoning layer and testing it at scale against realistic input, motivates the approach we took to solving the problem.
Test the Text, Factor In the Voice
We built Agent Arena, our internal name for the harness, to address the structural gaps in how voice agents are tested: not by replacing human judgment, but by making rigorous, repeatable validation feasible at the speed agent development demands.
The core design decision was to test the text layer and factor in the effects of the voice layer. In the cascaded architecture, the three stages run as a pipeline: speech recognition (ASR) converts audio to text, the LLM layer reasons over that text and invokes tools, and speech synthesis (TTS) converts the response back to audio. That middle stage, the reasoning layer, is where agent integrity and data integrity live: it holds the prompts, tool schemas, routing logic, and conversation state. By simulating conversations over text, we exercise that layer at a fraction of the cost and cycle time of live phone calls, while preserving the turn-taking structure that defines how the agent behaves in production.
Testing the text layer does not mean ignoring the voice layer. In production, the LLM never receives pristine text; it reads transcripts shaped by everything upstream in the cascade, from misrecognized words to interruptions. So Agent Arena feeds the agent the same kind of imperfect input, injecting realistic transcription errors into the simulated caller's turns. The voice layer is never rendered as audio, but its downstream effects on the text are reproduced, so a passing simulation reflects how the agent will hold up on a real call. We detail this in Closing the Sim2Real Gap.
The second design decision followed a basic engineering principle: a test is only meaningful if its conditions match production. If the agent behaves differently because it can tell it is being tested, the results say nothing about a real call, so the agent under test must run exactly as it does in production, unable to tell it is in a simulation at all. In production, our voice agents join a real-time communication call, find a peer participant publishing chat data, and run their normal turn loop. Whether that peer is a human caller bridged through telephony or a simulated user is an implementation detail the agent never sees. Agent Arena replicates this contract on the same substrate: the agent joins a call, exchanges messages on a chat channel, and runs its configured backend (single-agent or multi-agent with coordinator handoffs) exactly as it would on a live call.
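To illustrate the contract, here is a minimal sketch of a turn loop that depends only on a peer's chat interface. The names `ChatPeer`, `ScriptedPeer`, and `run_turn_loop` are hypothetical and not Agent Arena's actual API; the point is only that the loop has no way to tell a telephony bridge from a simulated user.

```python
from typing import Callable, List, Optional, Protocol


class ChatPeer(Protocol):
    """Hypothetical peer contract: accept agent text, yield caller text."""

    def send_to_peer(self, text: str) -> None: ...
    def next_turn(self) -> Optional[str]: ...


class ScriptedPeer:
    """Stand-in for a simulated caller. A telephony bridge would expose
    the same two methods, so the agent loop could not tell them apart."""

    def __init__(self, turns: List[str]) -> None:
        self._turns = list(turns)
        self.heard: List[str] = []

    def send_to_peer(self, text: str) -> None:
        self.heard.append(text)

    def next_turn(self) -> Optional[str]:
        return self._turns.pop(0) if self._turns else None


def run_turn_loop(peer: ChatPeer, respond: Callable[[str], str], greeting: str) -> int:
    """Agent-side loop: greet, then answer each peer turn until the peer stops."""
    peer.send_to_peer(greeting)
    turns = 0
    while (text := peer.next_turn()) is not None:
        peer.send_to_peer(respond(text))
        turns += 1
    return turns
```

Because `run_turn_loop` is written against the interface rather than a concrete peer, swapping the simulated caller for a real one requires no change on the agent side.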
The third decision was to make the simulated user an LLM-driven role-player, not a rigid script. Research on persona-driven user simulation shows that LLMs can emulate nuanced customer roles when grounded in structured objectives, facts, and behavioral scenarios, and that simulated conversations can be evaluated at scale using both deterministic checks and LLM-as-judge assessments [3]. Grounding simulated users in real conversation data improves fidelity further: generative agents built from in-depth interview transcripts reproduce real individuals' survey responses nearly as accurately as those individuals reproduce their own answers two weeks later [2]. This is the principle behind our replay mode, where a source transcript anchors the simulated user while the agent under test is free to take a different path. We also took seriously a caveat from interactive-agent benchmarks [4]: LLM-based simulators tend to be unrealistically cooperative. This is the Sim2Real gap we tackle later in this post [5]. We built Agent Arena with that gap in mind, encoding difficult personas and tracking frustration signals today, with richer persona modeling on the roadmap.
Finally, we addressed a practical blocker: backend-independent testing. Standing up test databases and API sandboxes for every carrier integration is fragile, slow, and often impossible. Agent Arena's synthetic tool resolver emulates backend responses from scenario data (deterministic mocks, transcript replay, or LLM-generated responses grounded in authoritative backend state), so teams test agent behavior without depending on live systems.
Under the Hood
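Agent Arena is built around two logical participants that interact in an isolated real-time call for each test case: the user-sim, which plays the simulated caller, and the agent-sim, which runs the agent under test.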
Both participants talk on the voice-backbone, the real-time communication substrate that hosts each conversation. It provides isolated calls, where participants join and exchange messages over named channels in real time. This is the same kind of transport that connects a live caller to an agent in production, which is precisely why Agent Arena reuses it: the agent under test joins a call and runs exactly as it would on a real call, unaware that its conversation partner is simulated.
The voice-backbone-proxy is the control plane that orchestrates a run. It receives a dataset, creates one call per case on the voice-backbone, dispatches both participants into each call, waits for their partial results, and merges them into a single simulation report. Cases run in parallel, so a dataset of 50 scenarios does not require 50 sequential phone calls.
flowchart TB
proxy["voice-backbone-proxy (control plane)"]
proxy -->|"dispatch case 1"| call1
proxy -->|"dispatch case N"| callN
subgraph call1 ["Call: case 1"]
us1["user-sim"] <-->|"chat channel"| as1["agent-sim"]
as1 -->|"trace channel"| tr1["tool-call trace"]
end
subgraph callN ["Call: case N"]
usN["user-sim"] <-->|"chat channel"| asN["agent-sim"]
end
call1 -->|"partial results"| proxy
callN -->|"partial results"| proxy
proxy -->|"merged report"| results["batch results"]
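The fan-out and merge performed by the control plane can be sketched with `asyncio`. This is an illustration under stated assumptions, not the voice-backbone-proxy's implementation: `run_user_sim`, `run_agent_sim`, `run_case`, `run_dataset`, and the `max_parallel` parameter are hypothetical names, and the two participant functions are placeholders that return empty partial results.

```python
import asyncio
from typing import Any, Dict, List


async def run_user_sim(case: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder for the simulated caller; returns its partial result.
    await asyncio.sleep(0)
    return {"checkpoints_hit": len(case.get("checkpoints", []))}


async def run_agent_sim(case: Dict[str, Any]) -> Dict[str, Any]:
    # Placeholder for the agent under test; returns transcript and tool trace.
    await asyncio.sleep(0)
    return {"transcript": [], "tool_trace": []}


async def run_case(case: Dict[str, Any]) -> Dict[str, Any]:
    # One isolated "call": both participants run concurrently, then merge.
    user_part, agent_part = await asyncio.gather(
        run_user_sim(case), run_agent_sim(case)
    )
    return {"case_id": case["id"], "user": user_part, "agent": agent_part}


async def run_dataset(
    cases: List[Dict[str, Any]], max_parallel: int = 10
) -> List[Dict[str, Any]]:
    # Bound concurrency so a large dataset does not open every call at once.
    gate = asyncio.Semaphore(max_parallel)

    async def guarded(case: Dict[str, Any]) -> Dict[str, Any]:
        async with gate:
            return await run_case(case)

    # gather preserves input order, so the report lines up with the dataset.
    return list(await asyncio.gather(*(guarded(c) for c in cases)))
```

A caller would run the batch with `asyncio.run(run_dataset(cases))` and receive one merged result per case, in dataset order.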
Anatomy of a Test Case
A dataset is the portable unit a team submits. It is intentionally simple: a shared agent configuration (the prompts and tools that define the agent being tested) bundled with one or more cases, where each case describes a single isolated conversation scenario. A case brings together a few high-level pieces of information:
- Caller profile: who the simulated caller is and what they are trying to do.
- Backend state: the domain facts the agent's tools should "see" when queried.
- Tool responses: optional canned answers for specific tool calls.
- Source conversation: an optional recorded conversation, used only when replaying a real call.
Most of the authoring effort goes into the caller profile, which grounds the simulated user. It is built from three plain-language parts.
Objective: what the caller wants and what "done" looks like. For example:
Goal: Report a new auto claim for an accident that happened last week. Success looks like: The agent collects the policy number, the date of the incident, and a description of what happened, then completes the claim intake.
Primary facts: the things the caller knows and will share when asked. These are written as everyday facts, not data fields:
- The policy number is the one on their insurance card.
- The accident happened on a specific recent date.
- It was a minor rear-end collision with no injuries.
- The vehicle is a black sedan that is still drivable.
Checkpoints: the milestones the conversation should hit, in plain language, so the simulator can tell whether the caller stayed on track:
- Caller states they want to report a new claim.
- Caller provides their policy number.
- Caller describes when and how the incident happened.
- Caller confirms there were no injuries.
The backend state plays a complementary role. It represents what the carrier's systems would return if the agent actually queried them: for instance, that a given policy is active, which perils it covers, and whether a claim has already been filed. Keeping this alongside the case lets the agent's tool calls return answers that stay consistent with the scenario, without connecting to any real system.
This structure maps directly to the persona/goal/scenario compliance framework established in conversational agent evaluation research, where a clear objective, grounding facts, and explicit checkpoints enable systematic assessment of whether a simulated user stayed on mission and whether the agent met the scenario's requirements [3].
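As a rough illustration, the pieces of a case described above could be modeled as plain data classes. The class names, field names, and the `mode` property below are assumptions made for this sketch; the actual dataset schema is not shown in this post.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class CallerProfile:
    objective: str            # what the caller wants and what "done" looks like
    primary_facts: List[str]  # plain-language facts the caller shares when asked
    checkpoints: List[str]    # milestones the conversation should hit


@dataclass
class Case:
    case_id: str
    caller_profile: CallerProfile
    backend_state: Dict[str, Any] = field(default_factory=dict)
    tool_responses: Dict[str, Any] = field(default_factory=dict)  # optional canned answers
    source_conversation: Optional[List[Dict[str, str]]] = None    # replay only

    @property
    def mode(self) -> str:
        # Assumption: a case that carries a source conversation is a replay case.
        return "replay" if self.source_conversation else "freeform"


fnol_case = Case(
    case_id="fnol-rear-end-001",  # hypothetical identifier
    caller_profile=CallerProfile(
        objective="Report a new auto claim for an accident that happened last week.",
        primary_facts=[
            "The policy number is the one on their insurance card.",
            "It was a minor rear-end collision with no injuries.",
        ],
        checkpoints=[
            "Caller states they want to report a new claim.",
            "Caller provides their policy number.",
            "Caller describes when and how the incident happened.",
            "Caller confirms there were no injuries.",
        ],
    ),
    backend_state={"policy_status": "active", "covers": ["collision"], "open_claims": []},
)
```

Keeping the caller profile in plain language and the backend state as structured data mirrors the split in the text: the first grounds the simulated user, the second grounds the tool responses.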
Two Ways to Simulate a Caller
Agent Arena supports two ways to ground a simulated caller. They are complementary: replay mode anchors testing to what really happened on live calls, while freeform mode lets teams probe scenarios that have not happened yet. Most teams use both, and the balance between them shifts across the agent lifecycle.
Replay: Re-Run Reality
Replay mode starts from a recorded conversation, typically a real customer call. The simulator derives the caller profile from that source conversation (its objective, the facts the caller revealed, and the checkpoints the conversation should hit), grounds the simulated user on the original interaction, and can reuse the tool responses captured during the original call. The simulated caller then re-enacts the same intent, but the agent under test is free to take a different path, so the rerun is not a transcript playback; it is a fresh conversation anchored to a real one. That grounding in real interaction data is a direct application of the generative-agent finding cited earlier [2], which is why replay is the higher-fidelity mode for reproducing how an actual caller behaved.
Replay shines when the goal is to reproduce, diagnose, and protect against specific real-world behavior:
- Reproducing production failures. When a live call goes wrong (a missed field, a wrong tool argument, a bad handoff), that call becomes a replay case. The team can reproduce the failure deterministically instead of trying to re-trigger it by hand.
- Verifying fixes. After a prompt or tool change, rerunning the failing case confirms whether the change actually moves the agent off the recorded failure path toward a successful outcome.
- Branching from history. A real call can be replayed up to a point and then allowed to diverge, letting the team explore "what would the agent have done if the caller had said something different here?"
- Regression protection. Calls the agent already handles well become regression cases. Replaying them after every change guards against breaking flows that were already working.
In the lifecycle, replay is the backbone of in-production testing and agent health assessment. Once an agent is live, its real call traffic becomes a growing supply of grounded cases. Converting representative production calls into replay cases turns day-to-day traffic into a continuously expanding regression suite. Re-running that suite is how a team answers "is the deployed agent still healthy, and did the last change help or hurt?" with evidence drawn from real interactions rather than hand-written scenarios.
Freeform: Invent the Calls You Haven't Gotten Yet
Freeform mode does not require a source conversation. The team authors the caller profile and backend state directly, and the simulated user generates its turns from the stated objective and facts. This makes freeform the mode for testing situations that do not yet exist in production history: there is no recorded call to anchor to, so the scenario is constructed from intent.
Freeform shines when the goal is to explore, stress, and expand coverage:
- New workflows and cold starts. When an agent or use case is brand new, there is no call history. Freeform lets a team generate a set of conversations (varying complexity, paths, and intents) from the first working version of the agent, so testing can begin at iteration zero.
- Edge cases and rare intents. Scenarios that almost never occur in production but carry high risk (unusual claim types, conflicting information, ineligible policies) can be authored deliberately rather than waited for.
- Stress testing caller behavior. Freeform caller profiles can encode difficult personas (impatient, confused, withholding, or providing too much information at once) to probe how the agent holds up under behavior that clean, cooperative simulators would never produce.
- Coverage by design. Because cases are authored from objectives, a team can systematically cover a use case's required fields, branches, and policy constraints instead of hoping production happened to exercise them.
In the lifecycle, freeform is the backbone of development and pre-production testing. It is how engineers iterate on a change before any real traffic exists, how they bootstrap a suite for a new workflow, and how they deliberately construct the hard cases that real call samples are unlikely to contain. As the agent reaches production, freeform cases are increasingly complemented by replay cases, together giving a health picture that spans both authored worst cases and grounded real-world behavior.
What Happens Inside a Run
Each case follows the same lifecycle:
sequenceDiagram
participant Proxy as voice-backbone-proxy
participant UserSim as user-sim
participant AgentSim as agent-sim
participant Resolver as tool resolver
Proxy->>UserSim: dispatch case
Proxy->>AgentSim: dispatch case
AgentSim->>UserSim: greeting (chat channel)
loop until objective complete or timeout
UserSim->>AgentSim: user turn (chat channel)
AgentSim->>AgentSim: LLM reasoning
opt domain tool call
AgentSim->>Resolver: resolve tool
Resolver-->>AgentSim: mock/synthetic result
AgentSim->>AgentSim: trace event (trace channel)
end
AgentSim->>UserSim: assistant response (chat channel)
end
UserSim->>Proxy: user-side report
AgentSim->>Proxy: conversation + tool trace
Proxy->>Proxy: merge results
User and agent messages flow on the chat channel. Tool calls, tool results, agent switches, and routing decisions are recorded on a separate trace channel for inspection. The agent under test sees only the chat channel, the same contract it would see on a live call. Messages use a simple JSON envelope:
{"type": "chat", "payload": {"role": "user", "content": "I need to file a claim, my car got rear-ended."}}
{"type": "tool_call", "payload": {"tool": "lookup_policy", "args": {"policy_number": "POL-123456"}}}
{"type": "tool_result", "payload": {"tool": "lookup_policy", "result": {"status": "active"}, "source": "mock"}}
Faking the Backend, Faithfully
When the agent calls a domain tool, the resolver produces a response without hitting real backend systems. Resolution follows a priority chain:
- Source-conversation replay: return the tool result recorded in the source conversation (replay mode only).
- Mock responses: return a mock tool response authored for that tool.
- Synthetic generation: an LLM generates a response grounded in the backend state, the tool's expected inputs, the conversation so far, and any per-tool output guidance.
Every resolution is recorded in a tool-resolution trace with the tool name, arguments, result, which layer produced it, the active agent, and elapsed time. This trace is what makes tool-call validation possible in the evaluation layer.
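The priority chain and trace record can be sketched as a single function. This is a simplified illustration, not the resolver's actual code: `resolve_tool` and its parameters are hypothetical, recorded and mock responses are matched by tool name only (a real resolver might also match on arguments), and the LLM-backed synthetic layer is represented by a caller-supplied `synthesize` function.

```python
import time
from typing import Any, Callable, Dict, List, Optional

ToolResult = Dict[str, Any]


def resolve_tool(
    tool: str,
    args: Dict[str, Any],
    *,
    recorded: Optional[Dict[str, ToolResult]],  # replay mode only, else None
    mocks: Dict[str, ToolResult],
    synthesize: Callable[[str, Dict[str, Any]], ToolResult],
    trace: List[Dict[str, Any]],
    active_agent: str = "fnol_agent",  # hypothetical agent name
) -> ToolResult:
    started = time.perf_counter()
    # Priority chain: source-conversation replay, then mocks, then synthetic.
    if recorded and tool in recorded:
        result, source = recorded[tool], "replay"
    elif tool in mocks:
        result, source = mocks[tool], "mock"
    else:
        result, source = synthesize(tool, args), "synthetic"
    # Record everything the evaluation layer needs to validate the call.
    trace.append({
        "tool": tool,
        "args": args,
        "result": result,
        "source": source,
        "active_agent": active_agent,
        "elapsed_s": time.perf_counter() - started,
    })
    return result
```

Recording the producing layer in each trace entry makes it possible to tell, after the fact, whether an agent's behavior was exercised against a recorded, authored, or generated response.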
Three Layers of Testing
Agent Arena separates testing into three layers with clear ownership:
Unit tests are built on programmatic evaluators. A team authoring a dataset can declare assertions like:
ToolArgsContain(
    tool_name="file_fnol_claim",
    required={
        "policy_number": "POL-123456",
        "date_of_loss": "2026-04-18",
        "loss_type": "collision",
    },
)
Other evaluator types cover tool-call ordering, call counts, maximum turn limits, agent routing sequences in multi-agent workflows, and semantic message matching. These run automatically after every simulation batch, producing a pass/fail summary per case that answers the question every agent engineer asks before shipping: is this prompt change safe?
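To show how programmatic evaluators of this kind might work over a tool-call trace, here is a minimal sketch. The `check_*` functions are hypothetical stand-ins, not the implementations behind `ToolArgsContain` or the other evaluators, and the ordering check assumes "order" means an in-order subsequence (other calls may occur in between).

```python
from typing import Any, Dict, List

Trace = List[Dict[str, Any]]  # each entry: {"tool": str, "args": dict}


def check_tool_args_contain(trace: Trace, tool_name: str, required: Dict[str, Any]) -> bool:
    """Pass if some call to tool_name carries every required key/value pair."""
    return any(
        call["tool"] == tool_name
        and all(call["args"].get(key) == value for key, value in required.items())
        for call in trace
    )


def check_tool_call_order(trace: Trace, expected_order: List[str]) -> bool:
    """Pass if expected_order appears as an in-order subsequence of the calls."""
    called = iter(call["tool"] for call in trace)
    # `name in called` advances the iterator, so matches must occur in order.
    return all(name in called for name in expected_order)


def check_call_count(trace: Trace, tool_name: str, expected: int) -> bool:
    """Pass if tool_name was called exactly `expected` times."""
    return sum(call["tool"] == tool_name for call in trace) == expected


sample_trace: Trace = [
    {"tool": "lookup_policy", "args": {"policy_number": "POL-123456"}},
    {"tool": "file_fnol_claim",
     "args": {"policy_number": "POL-123456", "loss_type": "collision"}},
]
```

Because each check is a pure function of the trace, the same assertions can be rerun on any stored run without replaying the conversation.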
What You Get Back
Each simulation yields a merged result containing:
- Conversation transcript: the full message history between user-sim and agent-sim.
- User-side report: checkpoint status, frustration signals, and turn records.
- Tool-resolution trace: every domain tool call, its arguments, resolved response, and source layer.
- Completion metadata: whether the conversation ended normally, why it ended, and status from each participant.
One Case, Start to Verdict
To make this concrete, here is a single case followed from start to finish: a caller reporting a new auto claim for a minor rear-end collision. An engineer is about to change the FNOL agent's prompt and wants to confirm the change is safe before shipping it.
Step 1: Author the case. Because this is a development iteration, the engineer works in freeform mode, writing the caller profile (objective, primary facts, checkpoints) and the backend state directly:
Objective: Report a new auto claim for a rear-end collision that happened last week; success is a completed claim intake with policy number, date of loss, and a description captured. Primary facts: policy number from the insurance card; a recent date of loss; a minor rear-end collision with no injuries; a black sedan, still drivable. Checkpoints: (1) states intent to file a claim, (2) provides policy number, (3) describes the incident, (4) confirms no injuries. Backend state: the policy is active and covers collision; no claim has been filed yet.
Step 2: Declare what "correct" means. Alongside the case, the engineer attaches the unit tests that should hold for any healthy run of this scenario: for example, that the agent calls file_fnol_claim exactly once, with the policy number and loss date the caller provided, after first looking up the policy:
ToolCallOrder(expected_order=["lookup_policy", "file_fnol_claim"])
ToolArgsContain(
    tool_name="file_fnol_claim",
    required={"policy_number": "POL-123456", "loss_type": "collision"},
)
Step 3: Dispatch and run. The engineer submits the dataset. The voice-backbone-proxy spins up a call, drops in the user-sim and the agent-sim, and the conversation begins. To keep the test honest, the simulator injects realistic transcription noise into the caller's turns. The exchange might look like this on the chat channel (note the misheard policy number and a partial turn):
agent-sim: Thanks for calling. How can I help you today?
user-sim: i need to file a claim my car got rear ended last tuesday
agent-sim: I'm sorry to hear that. Can I get your policy number?
user-sim: yeah it's pee oh el one two three four five... six
agent-sim: Let me confirm, is that POL-123456?
user-sim: yes that's right
[agent-sim -> tool resolver: lookup_policy(policy_number="POL-123456")]
[resolver -> agent-sim: { status: "active", covers: ["collision"] }]
agent-sim: Got it. Was anyone injured in the collision?
user-sim: no everyone's fine
[agent-sim -> tool resolver: file_fnol_claim(policy_number="POL-123456", loss_type="collision", ...)]
[resolver -> agent-sim: { claim_id: "CLM-001", status: "filed" }]
agent-sim: Your claim is filed. Your reference number is CLM-001.
The agent never sees the tool calls or the bracketed trace; those flow on the trace channel. It only sees the noisy caller turns, recovers the garbled policy number through a confirmation, and proceeds.
Step 4: Resolve tools without a backend. Neither lookup_policy nor file_fnol_claim touches a real system. The resolver answers each call from the case's backend state and tool responses, recording every resolution (arguments, result, and which layer produced it) in the tool-resolution trace.
Step 5: Score the run. When the conversation ends, the user-sim and agent-sim return their partial results, the proxy merges them, and the testing layers run over the transcript:
- All four checkpoints are marked hit.
- ToolCallOrder passes: lookup_policy preceded file_fnol_claim.
- ToolArgsContain passes: the filed claim carried the correct policy number and loss type, even though the caller's spoken policy number arrived garbled.
Step 6: Inspect and decide. The engineer sees a green run and, more importantly, the same suite still passing on every other FNOL case in the dataset. Now suppose the prompt change had a side effect, such as the agent beginning to skip the injury question. The injury checkpoint would flip to failed and the regression would surface immediately, before the change reached a single real caller. Either way, the verdict is evidence-based: a per-case, per-assertion pass/fail summary that answers "is this change safe to ship?" without a single manual phone call.
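One way to turn a per-case, per-assertion summary into a regression check is to diff it against the summary from a prior run. The sketch below assumes a summary shaped as a nested mapping from case ID to assertion name to pass/fail; that shape, the `find_regressions` name, and the example case and assertion labels are assumptions for illustration, not Agent Arena's report format.

```python
from typing import Dict, List, Tuple

# case_id -> assertion name -> passed?
Summary = Dict[str, Dict[str, bool]]


def find_regressions(baseline: Summary, candidate: Summary) -> List[Tuple[str, str]]:
    """Return (case_id, assertion) pairs that passed before and do not pass now.

    Assumption: an assertion missing from the candidate run counts as a
    regression, since the evidence it used to provide is gone.
    """
    regressions = []
    for case_id, assertions in baseline.items():
        for name, passed in assertions.items():
            if passed and not candidate.get(case_id, {}).get(name, False):
                regressions.append((case_id, name))
    return sorted(regressions)


before: Summary = {
    "fnol-rear-end": {"ToolCallOrder": True, "injury_checkpoint": True},
    "fnol-theft": {"ToolCallOrder": True},
}
after: Summary = {
    "fnol-rear-end": {"ToolCallOrder": True, "injury_checkpoint": False},
    "fnol-theft": {"ToolCallOrder": True},
}
```

With these two summaries, `find_regressions(before, after)` flags only the injury checkpoint on the rear-end case, which is the skipped-question side effect described above.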
Closing the Sim2Real Gap
The hardest part of simulation is making it faithful to reality. Research on the Sim2Real gap in user simulation shows that LLM-based user simulators tend to be more cooperative, more uniform in communication style, and less likely to express frustration or ambiguity than real callers, creating an "easy mode" that can inflate agent success rates above what real callers actually produce [5]. A simulated caller that always speaks in clean, complete, perfectly transcribed sentences is not the caller the agent encounters in production.
That gap is not only behavioral; it is also in the input itself. Agent Arena tests the LLM layer, and in production that layer never receives pristine text. It receives transcripts, and those transcripts carry the full noise of a live phone call: misrecognized words from regional and non-native accents, cross-talk when the caller and agent speak over one another, garbled or dropped phrases from background noise, partial responses when a caller trails off, and mid-sentence interruptions. An agent that collects every required field from a clean transcript may stumble when a policy number is misheard, a date is half-spoken, or the caller interrupts before a question finishes.
To close this gap, Agent Arena injects realistic transcription errors into the simulated caller's turns before they reach the agent, mirroring what speech recognition produces under real conditions:
- Accent-driven misrecognitions: substitutions and phonetic confusions typical of regional and non-native speech.
- Cross-talk: overlapping speech that bleeds fragments of one turn into another.
- Background noise: garbled, dropped, or low-confidence words from a noisy environment.
- Partial responses: turns that trail off, omit expected information, or end abruptly.
- Interruptions: the caller cutting in before the agent completes its turn, leaving the agent with incomplete conversational state.
The point is not to test speech recognition itself; it is to stop testing the LLM layer on an unrealistically clean version of its input. Omitting these artifacts produces overly optimistic agent health estimates: the agent looks more robust in simulation than it is in production, and regressions in how it handles imperfect input go undetected. By grounding simulated conversations in the same imperfections present in real call data, Agent Arena narrows the distance between a passing simulation and a successful production call.
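A toy version of this kind of noise injection is sketched below. It is deliberately simple and is not Agent Arena's injector: the `CONFUSIONS` table, the `inject_asr_noise` function, and its probability parameters are all hypothetical, and it covers only three of the artifact types above (misrecognitions, dropped words, and trailing-off turns). It also ignores punctuation, so a word followed by a comma will not match the confusion table.

```python
import random

# Hypothetical confusion table; a real injector would draw on observed ASR errors.
CONFUSIONS = {"claim": "clam", "policy": "police he", "collision": "collusion"}


def inject_asr_noise(
    text: str,
    rng: random.Random,
    substitute_p: float = 0.3,
    drop_p: float = 0.1,
    truncate_p: float = 0.1,
) -> str:
    """Return a lowercased, noised copy of a simulated caller turn."""
    words = []
    for word in text.lower().split():
        if rng.random() < drop_p:
            continue  # background noise: the word is lost entirely
        if word in CONFUSIONS and rng.random() < substitute_p:
            word = CONFUSIONS[word]  # phonetic misrecognition
        words.append(word)
    if len(words) > 2 and rng.random() < truncate_p:
        # Partial response: the caller trails off before finishing.
        words = words[: rng.randint(1, len(words) - 1)]
    return " ".join(words)
```

Passing a seeded `random.Random` keeps a noised run reproducible, which matters when the same case has to be rerun after a fix.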
This is also where the work connects back to the person on the other end of the line. A misheard policy number or a skipped injury question is not just a failing assertion; it is the policyholder filing a claim after an accident, the CSR who inherits a broken handoff, the adjuster working from data captured wrong. Agent Arena exists so those moments hold up. The realism is reinforced on the behavioral side too: Agent Arena tracks frustration and checkpoint deviation during conversations and supports persona guidance so simulated callers are not uniformly cooperative. We also keep human validation in the loop for end-to-end cases and release gates. Simulation accelerates iteration; it does not replace human judgment for go-live decisions.
From Days to Minutes: The Payoff
What We're Seeing So Far
Agent Arena is in active use across Liberate's agent engineering teams. The gains show up in both speed and coverage.
A full test set in minutes, not a day. Producing 50 to 100 transcripts manually takes at least a full day of calling before analysis begins. Agent Arena runs tens of simulations in parallel, collapsing that batch from a day to minutes. That speed changes what iteration looks like: an engineer can make a change, regenerate the full batch, and read scored results inside a single working session instead of waiting a day for enough data to judge the change.
Faster diagnosis and faster fixes. The compressed loop pays off twice: once in development, where the batch can be regenerated on every change, and again when mitigating a diagnosed problem. In replay mode, the exact conversations that exposed a failure are rerun immediately after a fix, so the gap between "we found a problem" and "we have evidence the fix works" closes within a single sitting rather than over several rounds of manual calling.
Standardized, reusable case authoring. Setup was never the single biggest bottleneck, but it was repetitive and non-portable. Agent Arena standardizes case authoring from client-provided transcripts and conversation milestones, so the resulting cases are reusable across agents rather than rebuilt from scratch each time.
Tests and evals that travel to production. The deterministic tests and evaluations authored in Agent Arena are ported directly into production monitoring. Teams do not build one suite for pre-launch validation and a separate one for production; the same assertions that gate a change continue to monitor the live agent. This cuts eval-development time and, more importantly, moves teams off generic, one-size-fits-all rubrics and onto agent-specific evals aligned with each client's business requirements, a far higher-ROI signal than a generic quality score.
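The idea of one assertion serving both stages can be sketched as a single check function called from a release gate and from a production monitor. The tool name `create_claim`, the required fields, and the function names are all hypothetical examples chosen for illustration.

```python
REQUIRED_FIELDS = ("policy_number", "loss_date", "injuries_reported")


def check_fnol_tool_call(tool_call: dict) -> list[str]:
    """Return the failures for one hypothetical `create_claim` tool call."""
    failures = []
    if tool_call.get("name") != "create_claim":
        failures.append("unexpected tool name")
    arguments = tool_call.get("arguments", {})
    for field in REQUIRED_FIELDS:
        if field not in arguments:
            failures.append(f"missing field: {field}")
    return failures


def gate_release(simulated_calls: list[dict]) -> bool:
    """Pre-launch: the change passes only if every simulated call is clean."""
    return all(not check_fnol_tool_call(call) for call in simulated_calls)


def flag_for_review(production_calls: list[dict]) -> list[dict]:
    """Production: the same check selects live calls that need attention."""
    return [call for call in production_calls if check_fnol_tool_call(call)]
```

Keeping the check in one place means a tightened requirement applies to the gate and the monitor at the same time.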
Broader coverage of user profiles and scenarios. Manual testing is bounded by human effort twice over: a person has to design each caller profile, and then a person has to place each call to exercise it. That ceiling keeps manual test plans small and skewed toward a handful of cooperative, happy-path personas. Because Agent Arena generates both the caller and the call programmatically, the breadth of profiles and scenarios under test is limited by authoring intent rather than calling capacity. Teams can vary persona, emotional tone, disclosure style, and conversation path across a far wider matrix, including the difficult and rare profiles manual testing rarely reaches, and run them all in parallel.
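As a rough illustration of how quickly the matrix grows, the sketch below takes the Cartesian product of the four dimensions named above. The specific values are invented for the example and are not Agent Arena's persona catalog.

```python
from itertools import product

# Illustrative values only; real suites would author these per agent.
personas = ["first-time claimant", "commercial fleet manager"]
tones = ["calm", "frustrated", "distressed"]
disclosure_styles = ["volunteers details", "answers only when asked"]
paths = ["happy path", "mid-call correction", "requests a human"]

matrix = [
    {"persona": p, "tone": t, "disclosure": d, "path": c}
    for p, t, d, c in product(personas, tones, disclosure_styles, paths)
]
print(len(matrix))  # 36
```

Two to three values per dimension already yield 36 distinct caller profiles, more than a manual test plan typically covers.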
Backend-independent validation. Teams test agent behavior against scenario data without standing up carrier test environments. The synthetic tool resolver lets engineers iterate on prompts and tool schemas while the backend integration is still in progress.
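A minimal sketch of resolving tool calls from scenario data might look like the following. The class name, the `tool_responses` key, and the lookup-by-tool-name strategy are assumptions made for illustration; the real resolver may match on arguments or backend state as well.

```python
class SyntheticToolResolver:
    """Answers tool calls from scenario data instead of a live backend."""

    def __init__(self, scenario: dict):
        self.responses = scenario.get("tool_responses", {})
        self.calls: list[dict] = []  # recorded for deterministic assertions later

    def resolve(self, name: str, arguments: dict) -> dict:
        self.calls.append({"name": name, "arguments": arguments})
        if name not in self.responses:
            # Surface the gap to the agent rather than raising mid-conversation
            return {"error": f"no scenario response defined for tool {name!r}"}
        return self.responses[name]


scenario = {"tool_responses": {"lookup_policy": {"status": "active"}}}
resolver = SyntheticToolResolver(scenario)
print(resolver.resolve("lookup_policy", {"policy_number": "P-1"}))
```

Recording every call alongside its arguments is what lets the deterministic tests check call order and argument values after the conversation ends.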
Not Just Insurance, Not Just Voice
Although the examples here are drawn from P&C insurance voice agents, Agent Arena is not tied to insurance, to voice, or to any single conversation pattern. What it tests is the reasoning layer (prompts, tools, routing, and conversation state), and that layer is common to every conversational agent regardless of domain, channel, or call direction. Agent Arena is portable along three independent dimensions:
- Domain-independent. The case format makes no assumptions about insurance. The objective, primary facts, checkpoints, backend state, and tool responses are generic containers; swapping insurance FNOL for healthcare intake, banking servicing, or retail support is a matter of authoring different case content, not changing the harness. Any agent that must gather information, follow policy, and call tools correctly is a candidate.
- Channel-independent. Because the harness operates on text turns and tool calls rather than audio, it tests the same reasoning layer whether the production agent speaks over voice, exchanges SMS, or runs in a chat widget. In practice we already use Agent Arena to benchmark SMS and chat agents alongside voice, with the same case format, resolver, and testing layers.
- Direction-independent. Agent Arena handles inbound and outbound agents equally. For an inbound agent, the agent greets first and the simulated caller responds; for an outbound agent (collections, renewals, proactive outreach), the agent opens with its objective and the simulated user reacts as the called party. The turn loop is symmetric, so initiating versus receiving the conversation is just a property of the case, not a different testing path.
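The three dimensions above can be pictured as plain properties on a case record. The sketch below uses the containers named in the text; the field names, types, and defaults are hypothetical, not the actual case schema.

```python
from dataclasses import dataclass, field
from typing import Literal


@dataclass
class Case:
    objective: str
    primary_facts: dict[str, str]
    checkpoints: list[str]
    backend_state: dict = field(default_factory=dict)
    tool_responses: dict = field(default_factory=dict)
    channel: Literal["voice", "sms", "chat"] = "voice"
    direction: Literal["inbound", "outbound"] = "inbound"


def simulated_user_role(case: Case) -> str:
    """Direction only changes who the simulated user is, not the turn loop."""
    return "caller" if case.direction == "inbound" else "called party"


renewal = Case(
    objective="Confirm renewal of an auto policy",
    primary_facts={"policy_number": "P-1"},
    checkpoints=["identity verified", "renewal terms read back"],
    channel="sms",
    direction="outbound",
)
print(simulated_user_role(renewal))  # called party
```

Nothing in the record is specific to insurance: a healthcare intake case would fill the same fields with different content.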
That breadth makes Agent Arena a single platform rather than a point tool, and a core piece of Liberate's testing infrastructure rather than an insurance-only script runner. Teams author one suite tailored to a specific agent and reuse it across domains, channels, call directions, and the full lifecycle, from first build through production monitoring. It covers the same surface area as published industrial agent-testing frameworks (synthetic scenario generation, simulated user-agent interaction, and tool-aware evaluation [1, 4]), and it consolidates case authoring, simulation, tool resolution, and evaluation behind one workflow instead of stitching together separate tools for each.
The Flywheel: Every Run Makes the Next One Better
The single most important long-term effect of Agent Arena is compounding: every run makes the next one more valuable. Each simulation adds annotated traces (per-turn reasoning, checkpoint hits, frustration signals, tool calls, and branching metadata) to a growing library of conversations. Once an agent is in production, its real calls feed back in as new replay cases. That library only ever grows, and it grows with exactly the conversations that matter most: the ones the agent has actually faced.
flowchart LR
author["Author cases<br/>(freeform + replay)"] --> run["Run simulations<br/>in parallel"]
run --> score["Score with<br/>tests + evals"]
score --> corpus["Annotated<br/>conversation corpus"]
corpus --> ship["Ship with<br/>evidence"]
ship --> prod["Production calls"]
prod -->|"convert to replay cases"| corpus
corpus -->|"regression suite + synthetic data"| author
That library compounds value along three axes:
- Regression guard. When a change fixes a specific failure, the history verifies that unrelated flows, which were already passing, still pass. This guards against overfitting: a prompt tweak that solves one problem must not break three others. Because production calls continuously become new cases, the regression suite expands to cover precisely the situations the agent encounters in the wild.
- Portable, agent-specific evaluation. The tests and evals attached to these cases travel with them from development into production monitoring, and they are tailored to the agent rather than borrowed from a generic rubric. The more an agent is exercised, the sharper and more business-aligned its evaluation suite becomes.
- Synthetic data for downstream use. The collected conversations can seed fine-tuning datasets, few-shot prompt libraries, and evaluation benchmarks as the agent matures, turning a testing byproduct into a reusable asset.
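The regression guard in particular reduces to a comparison between two runs over the same cases. Here is a minimal sketch, assuming each run is stored as a mapping from case ID to pass/fail; the function name and result shape are hypothetical.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Classify cases by how their pass/fail status changed between two runs."""
    shared = baseline.keys() & candidate.keys()
    return {
        "fixed": sorted(c for c in shared if not baseline[c] and candidate[c]),
        "regressed": sorted(c for c in shared if baseline[c] and not candidate[c]),
        "new": sorted(candidate.keys() - baseline.keys()),
    }


baseline = {"fnol-01": True, "fnol-02": False, "status-01": True}
candidate = {"fnol-01": True, "fnol-02": True, "status-01": False, "replay-17": True}
print(compare_runs(baseline, candidate))
# {'fixed': ['fnol-02'], 'regressed': ['status-01'], 'new': ['replay-17']}
```

A non-empty `regressed` list is the overfitting signal described above: the change fixed `fnol-02` but broke a flow that was already passing.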
The result is a moat that strengthens over time: every agent Liberate runs through Agent Arena, across a fleet of agents and channels, adds to a tailored, grounded, ever-expanding body of test cases and agent-specific evals that no generic testing tool can replicate from scratch. It is a meaningful part of how Liberate earns and keeps its place as the most trusted AI in insurance.
Where We're Headed
We are working toward a platform where any team can self-serve simulation: submit a dataset, get scored results, compare against prior runs, and use those results to inform release decisions. Concrete targets include:
- Multiple agent engineering teams running weekly simulation datasets.
- Release decisions regularly informed by regression output from Agent Arena.
- A growing library of test suites across insurance use cases.
- Persistent run history so prompt changes can be evaluated against every prior run over time.
If you are an engineer who wants to make agent evaluation rigorous, fast, and grounded in real conversations, we're hiring.
Contributors
Liberate Data Science & Liberate Engineering
References
[1] Levi, E., & Kadar, I. (2025). IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems. arXiv:2501.11067. https://arxiv.org/abs/2501.11067
[2] Park, J. S., Zou, C. Q., Shaw, A., Hill, B. M., Cai, C., Morris, M. R., Willer, R., Liang, P., & Bernstein, M. S. (2024). Generative Agent Simulations of 1,000 People. arXiv:2411.10109. https://arxiv.org/abs/2411.10109
[3] Gromada, J., Kasicka, A., Komkowska, E., Krajewski, Ł., Krawczyk, N., Przybył, B., Veyret, M., Rojas-Barahona, L., & Szczerbak, M. K. (2025). Evaluating Conversational Agents with Persona-driven User Simulations based on Large Language Models: A Sales Bot Case Study. Proceedings of EMNLP 2025 Industry Track, pages 230–245. https://aclanthology.org/2025.emnlp-industry.16/
[4] Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv:2406.12045. https://arxiv.org/abs/2406.12045
[5] Zhou, X., Sun, W., Ma, Q., Xie, Y., Liu, J., Du, W., Welleck, S., Yang, Y., Neubig, G., Wu, S., & Sap, M. (2026). Mind the Sim2Real Gap in User Simulation for Agentic Tasks. arXiv:2603.11245. https://arxiv.org/abs/2603.11245


