
Instruction Following Evaluation for Voice Bots: How to Measure Whether Your Agent Actually Obeys Its Prompt

Written by Rishabh Sanjay · JUN 15, 2026 · 8 MIN READ
Founding AI Engineer, Cekura · MS CS, Purdue · Ex-Oracle · Expert verified

Has stress-tested 5M+ voice agent minutes at Cekura.

Why Trust Cekura on Voice AI Evals

  • Built by engineers from Google, Apple, Microsoft. Backed by Y Combinator.
  • 60K+ voice AI calls evaluated daily.
  • Native integration for every major voice AI stack: LiveKit, Pipecat, Vapi, Retell, ElevenLabs.

TL;DR:

  • Instruction-following evaluation for voice bots measures whether an agent obeys every rule in its system prompt across an entire spoken conversation, not just one turn.
  • Cekura runs it as an automated LLM-judge metric that scores each call for adherence and flags the exact turn where the agent drifted.
  • It clusters those failures into root causes so teams can fix the prompt, and it is one of the six baseline metrics every production voice agent should track.

What is instruction-following evaluation for voice bots?

Instruction-following evaluation for voice bots is the process of checking whether an agent obeys the directives in its system prompt (verify a caller before booking, never quote a price it cannot confirm, stay on a single topic), measured turn by turn across a complete call. Cekura treats it as a dedicated metric that auto-detects and categorizes where an agent deviated, instead of asking a human to re-read every transcript.

The hard part for voice is that drift compounds: an agent can follow its rules for the first eight turns and break one on the ninth, and a single-turn check misses it. It is also distinct from accuracy or hallucination, because an agent can give a factually correct answer while still violating an instruction, such as answering a billing question it was told to deflect to a human.
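
To make the difference between a single-turn check and a whole-call check concrete, here is a minimal sketch. It is not Cekura's implementation: the `Turn` type, the `no_billing_answers` rule, and the substring test standing in for an LLM judge are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Turn:
    speaker: str  # "agent" or "caller"
    text: str


# A rule sees the transcript up to the current turn and returns True while it is obeyed.
Rule = Callable[[list[Turn]], bool]


def no_billing_answers(history: list[Turn]) -> bool:
    """Hypothetical rule: billing questions must be deflected to a human."""
    last = history[-1]
    if last.speaker != "agent":
        return True
    # A real judge would be an LLM; a substring test stands in for it here.
    return "your balance is" not in last.text.lower()


def first_violation(
    transcript: list[Turn], rules: dict[str, Rule]
) -> Optional[tuple[int, str]]:
    """Return (1-based turn number, rule name) for the first broken rule, else None."""
    for i in range(len(transcript)):
        history = transcript[: i + 1]
        for name, rule in rules.items():
            if not rule(history):
                return i + 1, name
    return None


transcript = [
    Turn("caller", "Hi, I need to reschedule."),
    Turn("agent", "Sure, what day works?"),
    Turn("caller", "Friday. Also, what do I owe?"),
    Turn("agent", "Your balance is $40."),
]
RULES = {"deflect_billing": no_billing_answers}

print(first_violation(transcript, RULES))      # (4, 'deflect_billing')
print(first_violation(transcript[:2], RULES))  # None: the early turns look compliant
```

Checking only the opening exchange reports no problem, while walking the full transcript surfaces the turn where the rule broke. Note that the agent's answer may be factually correct and still count as a violation.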

Why instruction following is harder for voice than for text

Instruction following degrades faster in voice because conversations are longer, less scripted, and shaped by interruptions, accents, and speech-to-text errors that text agents never face.

  • You cannot judge a voice agent's obedience on a clean single prompt; you have to replay realistic multi-turn calls with persona variation and score the whole transcript.
  • Benchmarks confirm the multi-turn point: Daily's aiwf_medium_context test found that even the best public models still made a significant error in at least one turn of a long conversation (Daily, 2026).
  • Cekura's own runs show it is not an edge case: more than two-thirds of the highest-volume flagged deviation categories are instruction-following failures, and more than 20 percent of runs flag a workflow-adherence gap (Cekura, 2026).
  • Voice adds non-reasoning failure modes too: a speech-to-text error can make the agent answer a question the caller never asked, or an accent can push it off script.
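
One way to picture "persona variation" and transcription noise is a scenario grid plus a crude noise function. The sketch below is illustrative only: the persona, condition, and intent lists are invented, and dropping words at random is a rough stand-in for speech-to-text errors, not a model of how any real recognizer fails.

```python
import itertools
import random

# Hypothetical test dimensions; replace with ones drawn from your own call logs.
PERSONAS = ["impatient caller", "non-native speaker", "elderly caller"]
CONDITIONS = ["clean audio", "background noise", "frequent interruptions"]
INTENTS = ["book appointment", "billing question", "cancel service"]


def build_scenarios():
    """Cross every persona with every condition and intent."""
    return [
        {"persona": p, "condition": c, "intent": i}
        for p, c, i in itertools.product(PERSONAS, CONDITIONS, INTENTS)
    ]


def corrupt_transcript(text, error_rate, rng):
    """Crude stand-in for speech-to-text errors: drop each word with probability error_rate."""
    words = text.split()
    kept = [w for w in words if rng.random() >= error_rate]
    # Fall back to the original text if every word was dropped.
    return " ".join(kept) if kept else text


scenarios = build_scenarios()
print(len(scenarios))  # 27

rng = random.Random(0)  # seeded so a failing run can be replayed
print(corrupt_transcript("I do not want to cancel my appointment", 0.3, rng))
```

Dropping a single word such as "not" reverses the caller's meaning, which is why a failure caused by transcription needs to be told apart from a failure of reasoning.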

What a good instruction-following evaluation actually measures

A strong instruction-following evaluation scores partial drift, not just hard pass or fail, because most voice failures are a skipped step or a forbidden disclosure rather than a total breakdown.

  • Cekura scores each call with an LLM-judge metric whose success criteria are plain English, so you can encode rules like "never call transfer_to_human before verifying identity."
  • The dimensions mirror what the eval field has settled on for multi-turn systems (Confident AI, 2026): role adherence, task completion, and consistency.
  • Cekura layers these on voice-specific signals, so an instruction failure caused by a transcription error is distinguishable from a genuine reasoning gap.

| Dimension | Question it answers | How Cekura measures it |
| --- | --- | --- |
| Instruction adherence | Did the agent obey every rule in its prompt? | Instruction Following metric (auto-detects and categorizes deviations) |
| Policy / verification | Did it follow gating rules before acting? | Tool-call assertions and LLM-judge metrics on Expected Outcomes |
| Task completion | Did the conversation reach the instructed outcome? | Expected Outcome verification per evaluator |
| Consistency | Same compliant answer across phrasings and personas? | Response Consistency metric across persona-varied runs |
| Drift over turns | Did it stop following rules mid-call? | Multi-turn transcript scoring and conversation replay |
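
As a rough illustration of a gating rule and a partial-drift score, the sketch below checks tool-call order and then reports the fraction of rules obeyed. It assumes a simple log of tool calls with turn numbers; `verify_identity` is a hypothetical tool name, and none of this reflects Cekura's actual assertion API.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    turn: int
    name: str


def called_before(calls, required, gated):
    """True if every `gated` call is preceded by at least one `required` call."""
    seen_required = False
    for call in sorted(calls, key=lambda c: c.turn):
        if call.name == required:
            seen_required = True
        elif call.name == gated and not seen_required:
            return False
    return True


def adherence_score(results):
    """Fraction of rules obeyed, so partial drift shows up as a score below 1.0."""
    if not results:
        return 1.0
    return sum(results.values()) / len(results)


# The agent transferred the caller on turn 3 and verified identity only on turn 5.
calls = [ToolCall(3, "transfer_to_human"), ToolCall(5, "verify_identity")]

results = {
    "verify_before_transfer": called_before(calls, "verify_identity", "transfer_to_human"),
    "stayed_on_topic": True,        # assumed judge outputs for the other rules
    "no_unconfirmed_price": True,
    "closed_with_summary": True,
}

print(results["verify_before_transfer"])  # False
print(adherence_score(results))           # 0.75
```

A hard pass/fail would mark this call as a failure and stop there; the per-rule breakdown shows that one skipped step caused it.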

How benchmarks like IFEval relate to production voice testing

Academic benchmarks like IFEval measure obedience to verifiable constraints, which helps with model selection but does not test a deployed voice bot.

  • IFEval uses prompts with verifiable instructions (a minimum word count, mentioning a keyword a set number of times), scored by a program, not a human (arXiv, 2311.07911).
  • That objectivity is its strength and its limit: it tells you how a base model follows clean single-shot constraints, not how your agent behaves deep into a noisy support call.
  • Cekura brings the same rigor to your real instructions: it replays scenarios built from your prompt and production logs, varies persona and conditions, and scores adherence on the transcripts.
  • The benchmark tells you which model to start from. The evaluation tells you whether your deployed agent still obeys the prompt you wrote.
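
For a sense of what "scored by a program" means, here are two checks in the style of the verifiable instructions described above. They are simplified sketches, not IFEval's own code; treating the keyword rule as "at least N times" is an assumption.

```python
import re


def has_min_words(response, minimum):
    """Verifiable constraint: the response contains at least `minimum` words."""
    return len(response.split()) >= minimum


def mentions_keyword(response, keyword, times):
    """Verifiable constraint: `keyword` appears at least `times` times as a whole word."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= times


response = "Your refund was approved. The refund arrives in five days."

print(has_min_words(response, 8))               # True
print(mentions_keyword(response, "refund", 2))  # True
print(mentions_keyword(response, "refund", 3))  # False
```

Checks like these are unambiguous and cheap, but they say nothing about whether an agent skipped a verification step eight turns into a call, which is the gap a transcript-level evaluation covers.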

How Cekura evaluates instruction following for voice bots

Cekura runs instruction-following evaluation as one connected loop: simulate, score, cluster, and feed failures back into a prompt fix.

  1. Define the agent and its rules. Connect over its native stack (Vapi, Retell, LiveKit, Pipecat, ElevenLabs) or a custom webhook, and describe what it should and should not do.
  2. Enable the Instruction Following metric. It auto-detects and categorizes violations from the agent description, so you skip hand-building hundreds of one-off checks.
  3. Generate and run scenarios. Start with about ten diverse evaluators across common requests, edge cases, and adversarial turns, each driven by a persona. No external API keys needed.
  4. Score and locate the drift. Each call is graded turn by turn; Cekura reports the exact turn where the agent stopped following instructions.
  5. Cluster failures into root causes. Failure-Mode Insights groups failing calls into a small set of themes with linked call IDs, so you fix the prompt once.
  6. Optimize and lock a regression suite. The self-improving loop diagnoses gaps, applies edits, and re-validates with an overfitting gate, then re-runs on every change via cron or CI/CD.
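
Steps 5 and 6 can be sketched in a few lines. This is a simplified stand-in, not Cekura's Failure-Mode Insights: it assumes each failing call already carries a category label (in practice a judge would assign it), and the call IDs, pass rates, and `tolerance` parameter are hypothetical.

```python
from collections import defaultdict


def cluster_failures(failures):
    """Group failing calls by category, largest cluster first, keeping linked call IDs."""
    clusters = defaultdict(list)
    for failure in failures:
        clusters[failure["category"]].append(failure["call_id"])
    return dict(sorted(clusters.items(), key=lambda kv: len(kv[1]), reverse=True))


def regression_gate(baseline_pass_rate, new_pass_rate, tolerance=0.02):
    """Pass only if the new pass rate has not dropped more than `tolerance` below baseline."""
    return new_pass_rate >= baseline_pass_rate - tolerance


failures = [
    {"call_id": "call-014", "category": "skipped identity verification", "turn": 9},
    {"call_id": "call-027", "category": "quoted unconfirmed price", "turn": 4},
    {"call_id": "call-031", "category": "skipped identity verification", "turn": 6},
]

clusters = cluster_failures(failures)
print(clusters)
# {'skipped identity verification': ['call-014', 'call-031'], 'quoted unconfirmed price': ['call-027']}

print(regression_gate(0.92, 0.88))  # False: the prompt change regressed adherence
```

In a CI job, a `False` from the gate would typically translate into a non-zero exit code so the prompt change cannot merge until the largest cluster is addressed.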

Teams run this loop to ship prompt and feature changes without quietly regressing reliability:

"Cekura has been essential in helping us build and test our voice agents with confidence. We can now ensure that every new feature and prompt change makes our agents smarter without compromising on reliability. It's transformed our quality assurance process." — Vanessa Cornelius, Prompt Engineer, Quo

Cekura is YC-backed, was founded by engineers from Google, Apple, and Microsoft, has raised $2.4M, and evaluates more than 60K voice AI calls a day. It built this loop around exactly this multi-turn, instruction-adherence case.

Best platforms for evaluating instruction-following in AI chat agents

The best platform for evaluating instruction-following in AI chat agents scores adherence across full multi-turn conversations, supports custom LLM-judge criteria, and runs in CI/CD against a regression baseline.

When comparing options, four questions matter:

  • Does it test multi-turn conversations, not just single prompts?
  • Can you write your own pass criteria in plain English?
  • Does it locate the exact turn where the agent drifted?
  • Does it close the loop back into a prompt fix?

A platform that only grades isolated input-output pairs misses the compounding drift that breaks agents in production. Cekura was built around the multi-turn case for both chat and voice agents, scoring instruction following on the whole transcript rather than a turn at a time.

FAQ

What is instruction following evaluation for a voice bot?

It is the measurement of whether a voice agent obeys every rule in its system prompt across a complete spoken conversation. Cekura runs this as an automated Instruction Following metric that detects and categorizes deviations and flags the exact turn where the agent drifted.

How do you evaluate instruction adherence in conversational AI?

You replay realistic multi-turn conversations against the agent, then score each transcript with an LLM judge against plain-English criteria for what the agent should and should not do. Cekura varies personas and conditions so the score reflects noisy, real calls.

What are the best platforms for evaluating instruction-following in AI chat agents?

Look for platforms that test full multi-turn conversations, let you define custom LLM-judge criteria, pinpoint the failing turn, and run in CI/CD. Cekura provides all of this for both chat and voice agents.

Is IFEval enough to test my voice agent?

No. IFEval measures a base model's obedience to clean, single-shot constraints, which helps with model selection but does not test your agent over a long, interrupted call. Cekura evaluates your deployed agent against its own instructions on realistic replayed conversations.

How is instruction following different from accuracy or hallucination?

Accuracy and hallucination measure whether an answer is correct; instruction following measures whether the agent did what it was told, which can fail even when the answer is right. Cekura tracks it as its own dimension alongside accuracy, hallucination, latency, CSAT, and interruption handling.


Want to see where your voice agent stops following its prompt? Cekura scores instruction adherence on every call and points you to the turn where it drifted.
