
AI Voice Agent Accuracy Testing: How to Measure Accuracy at Every Layer

Written by Tarush Agarwal, Co-founder & CEO, Cekura
Jun 15, 2026 | 10 min read

Has stress-tested 5M+ voice agent minutes at Cekura.

Why Trust Cekura on Voice AI Evals

  • Built by engineers from Google, Apple, Microsoft. Backed by Y Combinator.
  • 60K+ voice AI calls evaluated daily.
  • Native integration for every major voice AI stack: LiveKit, Pipecat, Vapi, Retell, ElevenLabs.

TL;DR

  • AI voice agent accuracy testing measures how correctly a voice agent hears, understands, and acts on what a caller says, across three layers: speech-to-text transcription, intent and entity recognition, and end-to-end task completion.
  • Because errors compound across these layers, you cannot judge accuracy from a single number.
  • Cekura tests all three by simulating thousands of realistic calls, scoring transcription accuracy, intent and entity correctness, and task success, then monitoring the same metrics on live production calls.

What is AI voice agent accuracy testing?

AI voice agent accuracy testing verifies that a voice agent correctly transcribes speech, identifies the caller's intent, extracts the right entities (dates, names, IDs, amounts), and completes the requested task. It spans the full cascaded pipeline (speech-to-text, an LLM/NLU layer, tool calls, text-to-speech) and treats each stage as measurable, not an end-to-end "it sounded fine" judgment.

Accuracy is not one metric: an agent can transcribe perfectly and still pick the wrong intent, or pick the right intent and extract the wrong order number. Cekura separates these failure modes so you know which layer to fix instead of re-listening to every flagged call. The reason this matters is arithmetic: per-layer accuracies multiply rather than add, so a small transcription error compounds into a meaningfully lower end-to-end accuracy. Voice agents therefore show higher error rates than text chatbots running the identical NLU model, because text never pays the transcription tax.
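To make the arithmetic concrete, here is a minimal sketch of how per-layer accuracy compounds. The 95% figures are hypothetical values chosen only for illustration, and the multiplication assumes the layers fail independently, which real pipelines only approximate.

```python
from math import prod

# Hypothetical per-layer accuracies, chosen only to illustrate the arithmetic.
layer_accuracy = {
    "speech_to_text": 0.95,
    "intent_and_entity": 0.95,
    "task_execution": 0.95,
}

def end_to_end_accuracy(per_layer):
    """Multiply per-layer accuracies, assuming the layers fail independently."""
    return prod(per_layer.values())

voice = end_to_end_accuracy(layer_accuracy)

# A text chatbot runs the same downstream layers but skips speech-to-text.
text_only = {k: v for k, v in layer_accuracy.items() if k != "speech_to_text"}
text = end_to_end_accuracy(text_only)

print(f"voice end-to-end: {voice:.4f}")
print(f"text end-to-end:  {text:.4f}")
```

Under these assumptions, three layers that each look strong in isolation yield roughly 86% end to end, while the text path with one fewer layer stays near 90%.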

Why voice agent accuracy is harder to measure than text accuracy

Voice agent accuracy is harder to measure than text accuracy because audio adds variables text never has: accents, background noise, interruptions, code-switching, and the clean-vs-production gap. For external context, a model around 1.5% WER on clean studio audio can exceed 10% on noisy call-center recordings, and vendor "95% accuracy" claims routinely fall to the 70-80% range in the field (AssemblyAI, 2026); treat those as third-party figures, not Cekura benchmarks.

The point holds either way: accuracy testing that only uses clean audio measures the wrong thing. This is why Cekura's persona engine drives every test call with controllable conditions, including background noise, interruption patterns, speaking pace, emotion, and 30+ languages with code-switching, and replays the same evaluator across accents and noise profiles so you see where accuracy actually breaks, not just the best-case number.

Layer 1: Speech-to-text accuracy testing for voice agents

Speech-to-text accuracy testing measures how correctly the agent's STT layer converts spoken audio into text, usually with word error rate (WER).

  • Cekura includes a Transcription Accuracy metric that scores the agent's transcript against what was actually said in a simulated or live call.
  • WER is the industry-standard formula: (substitutions + deletions + insertions) / total words, expressed as a percentage. Lower is better.
  • For reference, leading STT vendors publish single-digit clean-audio WER on their own English benchmarks (AssemblyAI, Deepgram; third-party figures, cited for context only). Your production WER will be higher, which is why you test on audio that matches real traffic.
  • Raw WER has a blind spot: it penalizes harmless differences (punctuation, contractions, fillers) and undercounts errors that change meaning. The field is moving to semantic WER, which scores only the errors that change what a downstream LLM understands (Pipecat stt-benchmark, external reference).
  • Cekura complements raw WER with speech-quality metrics (pronunciation checks, letter-level errors, gibberish detection, clarity), so a "low WER, still wrong" call does not slip through.
  • Deeper dive: accent and dialect testing for speech-to-text systems.
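The WER formula above can be sketched as a word-level edit distance. This is a generic illustration, not Cekura's Transcription Accuracy implementation; the `normalize` helper and the sample sentences are assumptions. The normalization step only removes harmless casing and punctuation differences. It is not semantic WER, which needs a model to judge whether meaning changed.

```python
import re

def normalize(text):
    """Lowercase and strip punctuation so formatting differences are not counted."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()

def word_error_rate(reference, hypothesis, tokenizer=str.split):
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref, hyp = tokenizer(reference), tokenizer(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, computed one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

spoken = "Please cancel my appointment on the fifteenth."
transcript = "please cancel my appointment on the fiftieth"

raw_wer = word_error_rate(spoken, transcript)
normalized_wer = word_error_rate(spoken, transcript, tokenizer=normalize)
print(f"raw WER: {raw_wer:.3f}, normalized WER: {normalized_wer:.3f}")
```

Raw WER counts two errors here (the capitalized "Please" and "fifteenth." versus "fiftieth"), while the normalized score counts only the one that changes the date. Neither number tells you that the remaining error is the word that matters, which is the gap semantic scoring tries to close.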

Layer 2: Intent and entity accuracy testing for voice AI

Intent and entity accuracy testing verifies that the agent correctly classifies what the caller wants (intent) and extracts the specific values it needs (entities), under real speech conditions.

  • Intent accuracy answers "did the agent understand the goal" (book vs cancel vs reschedule).
  • Entity accuracy answers "did it capture the right values" (date, phone number, order ID, amount, name).
  • Both can fail independently, and both are corrupted by upstream transcription errors: an STT slip from "fifteenth" to "fiftieth" produces a fluent, plausible transcript containing the wrong word, which quietly breaks entity extraction.
  • Cekura tests these the way real callers behave:
    • Phrasing variation, paraphrases, typos, slang, and multi-intent utterances for intent coverage.
    • Entity checks across names, dates, IDs, locations, order numbers, and policy values.
    • Mid-conversation intent shifts and ambiguous intents, scored across turns.
    • ASR-aware evaluation, so a misheard word is attributed to the right layer.
  • Dedicated treatment: intent and entity accuracy testing for production-ready voice agents.
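As a rough sketch of how intent and entity scores can be kept separate, the snippet below computes intent accuracy and entity precision/recall over a small labeled set. The record fields (`expected_intent`, `predicted_entities`, and so on) and the sample data are hypothetical, not a Cekura schema.

```python
def intent_accuracy(examples):
    """Fraction of examples where the predicted intent matches the expected one."""
    correct = sum(
        1 for ex in examples if ex["predicted_intent"] == ex["expected_intent"]
    )
    return correct / len(examples)

def entity_precision_recall(examples):
    """Micro-averaged precision and recall over (entity name, value) pairs."""
    tp = fp = fn = 0
    for ex in examples:
        expected = set(ex["expected_entities"].items())
        predicted = set(ex["predicted_entities"].items())
        tp += len(expected & predicted)
        fp += len(predicted - expected)
        fn += len(expected - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labeled calls.
examples = [
    {   # right intent, wrong entity ("fifteenth" heard as "fiftieth")
        "expected_intent": "reschedule", "predicted_intent": "reschedule",
        "expected_entities": {"day": "15"}, "predicted_entities": {"day": "50"},
    },
    {   # right intent, right entity
        "expected_intent": "cancel", "predicted_intent": "cancel",
        "expected_entities": {"order_id": "A1234"},
        "predicted_entities": {"order_id": "A1234"},
    },
    {   # wrong intent, right entity
        "expected_intent": "book", "predicted_intent": "reschedule",
        "expected_entities": {"day": "3"}, "predicted_entities": {"day": "3"},
    },
]

accuracy = intent_accuracy(examples)
precision, recall = entity_precision_recall(examples)
print(f"intent accuracy: {accuracy:.2f}")
print(f"entity precision: {precision:.2f}, recall: {recall:.2f}")
```

Scoring the two separately shows that the first and third calls fail for different reasons, which a single combined score would hide.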

Layer 3: End-to-end task accuracy and tool-call correctness

End-to-end task accuracy measures whether the agent actually completed what the caller asked, which is the metric that maps to business outcomes.

  • Cekura scores it with an Expected Outcome per evaluator: a plain-English success condition judged met or not (for example, "agent cancels the appointment and provides a confirmation number").
  • This is where transcription and intent errors finally surface, and where agents cheat: saying "I have booked your appointment" without ever calling the booking tool is a failure even though the transcript reads as success.
  • Cekura's tool-call testing captures tool name, arguments, results, and latency, then asserts the agent invoked the right tool with the right arguments. Saying it is not the same as doing it.
  • Layered scoring lets the pipeline act as a CI gate: a golden set built from real failures, graders calibrated to humans, and a pass threshold that blocks regressions before they reach callers.
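The "saying versus doing" check and the CI gate can be sketched as follows. This is an illustrative outline under assumed data shapes: the recorded-call format, the `book_appointment` tool name, and the 0.9 threshold are all hypothetical, and this is not Cekura's API.

```python
def tool_call_made(recorded_calls, name, expected_args):
    """True if a recorded call has the given tool name and contains the expected arguments."""
    return any(
        call["name"] == name
        and all(call["arguments"].get(k) == v for k, v in expected_args.items())
        for call in recorded_calls
    )

def passes_gate(results, threshold):
    """Return (passed, pass_rate); the gate fails when the pass rate is below threshold."""
    if not results:
        raise ValueError("golden set must contain at least one result")
    pass_rate = sum(results) / len(results)
    return pass_rate >= threshold, pass_rate

# Call 1: the agent said "I have booked your appointment" but recorded no tool call.
claimed_only = []
# Call 2: the agent actually invoked the (hypothetical) booking tool.
acted = [
    {"name": "book_appointment", "arguments": {"date": "2026-06-15", "time": "10:00"}},
]

expected_args = {"date": "2026-06-15"}
results = [
    tool_call_made(claimed_only, "book_appointment", expected_args),
    tool_call_made(acted, "book_appointment", expected_args),
]

gate_passed, pass_rate = passes_gate(results, threshold=0.9)
print(f"pass rate: {pass_rate:.2f}, release allowed: {gate_passed}")
```

The first call fails regardless of how confident the transcript sounds, because the assertion looks at what the agent did rather than what it said.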

The three accuracy layers at a glance

| Layer | What it measures | Primary metric | How callers break it | Cekura coverage |
|---|---|---|---|---|
| Speech-to-text | Audio transcribed to correct text | Word error rate (WER), semantic WER | Accents, noise, code-switching, phone-quality audio | Transcription Accuracy + speech-quality metrics, persona-driven noise/accent simulation |
| Intent + entity | Right goal and right values understood | Intent accuracy, entity precision/recall | Paraphrases, slang, ambiguous or multi-intent utterances, mishears | Scenario-based intent/entity scoring, ASR-aware, multi-turn |
| End-to-end task | Task actually completed correctly | Task success / Expected Outcome, tool-call success | Compounded upstream errors, agent claiming success without acting | Expected-outcome verification + tool-call assertions |

Which layer failed? Accuracy attribution

A single accuracy score tells you the agent failed; it does not tell you why. The highest-leverage move is attributing each failure to the layer that caused it, because the fix differs at every layer. Cekura runs an ASR-aware attribution pass on every failed call:

  1. Did the transcript match what was said? If not, the failure is speech-to-text, and a missed word like "cancel" or "not" can sink a task even at otherwise high accuracy. Even 99 percent word accuracy (a 1 percent WER) is not enough if the missed word is the one that matters.
  2. If the transcript was right, did the agent pick the correct intent and entities? If not, the failure is the NLU layer, not the audio.
  3. If intent and entities were right, did the agent take the right action (call the right tool with the right arguments)? If not, the failure is task execution, not understanding.
  • Without attribution, teams "fix" the prompt when the real problem was a mis-heard digit, or retrain intent when the agent never called the tool.
  • Errors compound rather than add, and severity is not uniform: the cost of a wrong word depends entirely on which word it was.
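The three questions above amount to checking layers in pipeline order and stopping at the first mismatch. Here is a minimal sketch of that ordering, assuming each failed call has already been annotated with what was said and what was expected. The field names are hypothetical, and the exact-match transcript comparison is a simplification; a real attribution pass would tolerate harmless transcript differences.

```python
def attribute_failure(call):
    """Return the first layer that failed, checked in pipeline order, or None."""
    if call["transcript"] != call["spoken"]:
        return "speech_to_text"
    if (call["predicted_intent"] != call["expected_intent"]
            or call["predicted_entities"] != call["expected_entities"]):
        return "intent_entity"
    if call["tool_calls"] != call["expected_tool_calls"]:
        return "task_execution"
    return None

# A correct call used as the baseline for the hypothetical failures below.
base = {
    "spoken": "cancel order 1234",
    "transcript": "cancel order 1234",
    "expected_intent": "cancel", "predicted_intent": "cancel",
    "expected_entities": {"order_id": "1234"},
    "predicted_entities": {"order_id": "1234"},
    "expected_tool_calls": [("cancel_order", "1234")],
    "tool_calls": [("cancel_order", "1234")],
}

misheard_digit = {**base, "transcript": "cancel order 1224",
                  "predicted_entities": {"order_id": "1224"},
                  "tool_calls": [("cancel_order", "1224")]}
wrong_intent = {**base, "predicted_intent": "track", "tool_calls": []}
never_acted = {**base, "tool_calls": []}

for name, call in [("misheard digit", misheard_digit),
                   ("wrong intent", wrong_intent),
                   ("never acted", never_acted)]:
    print(f"{name}: {attribute_failure(call)}")
```

In the misheard-digit case every downstream layer is also wrong, but the check stops at speech-to-text, which is the layer that needs fixing.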

How Cekura tests voice agent accuracy end to end

Cekura runs accuracy testing as a single workflow across all three layers, then carries the same metrics into production monitoring.

  • Handles voice synthesis, transcript generation, persona simulation, and conversation management itself, so no external API keys are needed.
  • Integrates natively with Vapi, Retell, LiveKit, Pipecat, and ElevenLabs, plus raw websocket/CHIRP, SIP, and custom self-hosted agents.
  • Cekura is YC-backed, founded by engineers from Google, Apple, and Microsoft, and evaluates 60K+ voice AI calls daily with 5M+ agent minutes stress-tested (eval-metrics guide).

"At the end of every call is a real person who just wants to be understood. Whether they're mixing languages or giving a quick one-word answer, our AI needs to listen with a human-like intuition. Partnering with Cekura allowed us to move beyond vibe based tests and stress-test our system against the messy, beautiful reality of human conversation ensuring that no matter how someone speaks, they feel heard."

- Anuj Modi, Nurix

A typical accuracy pass:

  1. Define the agent and enable accuracy metrics (Transcription Accuracy, Expected Outcome, Tool Call Success, Hallucination, Relevancy, plus custom LLM-judge or Python metrics).
  2. Generate evaluators (start with about 10 diverse scenarios) across common requests, edge cases, accents, noise, and varied personas.
  3. Run them over the relevant transport; Cekura scores each layer and attributes failures to STT, intent/entity, or task.
  4. Review failures, fix the prompt or config, and lock the refined scenarios into a regression suite.
  5. Monitor production, where the same metrics auto-evaluate every live call and cluster failures into root-cause themes.
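For step 2, one simple way to get about 10 diverse scenarios is to sample from a grid of requests, accents, and noise conditions. The sketch below is a generic illustration with hypothetical category values; it does not represent how Cekura generates evaluators.

```python
import itertools
import random

# Hypothetical dimensions; replace with the conditions seen in real traffic.
caller_requests = ["book appointment", "cancel appointment", "reschedule appointment"]
accents = ["US English", "Indian English", "Scottish English"]
noise_profiles = ["quiet room", "street traffic", "call-center chatter"]

def build_scenarios(count, seed=0):
    """Sample distinct (request, accent, noise) combinations, reproducibly."""
    grid = list(itertools.product(caller_requests, accents, noise_profiles))
    rng = random.Random(seed)  # fixed seed so the suite is stable across runs
    return [
        {"request": request, "accent": accent, "noise": noise}
        for request, accent, noise in rng.sample(grid, count)
    ]

scenarios = build_scenarios(10)
for scenario in scenarios:
    print(scenario)
```

Fixing the seed keeps the sampled set identical between runs, which matters once the scenarios become a regression suite.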

Cekura tunes each LLM judge against historical calls until it tracks human reviewers closely, which is what makes automated accuracy scoring trustworthy enough to gate releases.
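One common way to quantify how closely a judge tracks human reviewers is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a generic illustration with made-up labels; the source does not say which agreement measure Cekura uses.

```python
def cohen_kappa(human, judge):
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    labels = set(human) | set(judge)
    expected = sum((human.count(label) / n) * (judge.count(label) / n)
                   for label in labels)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so report full agreement by convention in this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical pass/fail verdicts on the same ten historical calls.
human_labels = [True, True, False, True, False, True, True, False, True, True]
judge_labels = [True, True, False, True, True, True, True, False, True, False]

raw_agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)
kappa = cohen_kappa(human_labels, judge_labels)
print(f"raw agreement: {raw_agreement:.2f}, kappa: {kappa:.2f}")
```

Raw agreement looks high here because most calls pass; kappa is noticeably lower, which is why a chance-corrected measure is a safer basis for deciding when a judge is ready to gate releases.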

FAQ

What is AI voice agent accuracy testing?

Verifying that a voice agent correctly transcribes speech, identifies intent, extracts the right entities, and completes the task. It measures accuracy at each layer of the cascaded voice pipeline rather than as a single end-to-end score. Cekura tests all three layers through large-scale call simulation, then monitors them in production.

How do you test speech-to-text accuracy for a voice agent?

Compare the agent's transcript against the actual spoken words, usually with word error rate (WER): (substitutions + deletions + insertions) / total words. Test on production-representative audio (accents, noise, phone quality), because a model around 1.5% WER on clean audio can exceed 10% on noisy calls per third-party testing (AssemblyAI). Cekura scores transcription accuracy under controllable noise and accent conditions and adds semantic and speech-quality checks.

How do you test intent and entity accuracy in voice AI?

Run scenarios with paraphrases, slang, multi-intent utterances, and mid-conversation intent shifts, then check that the agent selects the right intent and extracts the correct entities. Evaluation should be ASR-aware so a misheard word is attributed to transcription, not the NLU layer. Cekura scores both across multi-turn conversations.

Why are voice agents less accurate than text chatbots?

Because errors compound across the speech-to-text, NLU, and task layers: a small transcription error feeds the intent layer, which feeds the task layer, so accuracy multiplies down rather than holding steady. The same NLU model that performs well on typed text inherits every transcription error in the voice path. Accuracy testing has to isolate each layer to find the real cause.

What is a good accuracy target for a voice agent?

Set a clear task-accuracy bar at launch and raise it as you refine, with WER kept low on production-representative audio and intent accuracy benchmarked per use case. Set targets per layer and per scenario, not as one global number, because a strong end-to-end score can hide a weak entity-extraction step. Cekura supports per-metric thresholds tracked across releases.

Ship voice agents you can trust

Accuracy is not a single number you check once before launch. It is three layers that drift independently and compound when they fail. To see exactly where your agent breaks, across transcription, intent, entities, and task completion, book a Cekura demo or start testing at dashboard.cekura.ai.
