
Custom KPIs for Voice Agent Monitoring: How to Define and Track Metrics That Map to Business Outcomes

Written by Janhvi Nandwani · Jun 15, 2026 · 10 min read
Founding Member, Cekura · IIT Bombay (B.Tech, Mech) · Expert verified

Has stress-tested 5M+ voice agent minutes at Cekura.

Why Trust Cekura on Voice AI Evals

  • Built by engineers from Google, Apple, Microsoft. Backed by Y Combinator.
  • 60K+ voice AI calls evaluated daily.
  • Native integration for every major voice AI stack: LiveKit, Pipecat, Vapi, Retell, ElevenLabs.

TL;DR:

  • Custom KPIs for voice agent monitoring are team-defined metrics that score every production call against your own business rules, not just a vendor's preset list.
  • In Cekura you write them as Boolean, Rating, or Enum judges (LLM-as-judge or Python), scoped to a specific conversation node or the full call.
  • They run automatically on live traffic with dashboards and alerts, so a passing score means the agent actually did the job your business cares about.

What are custom KPIs for voice agent monitoring?

Custom KPIs for voice agent monitoring are metrics you define yourself to measure whether a production voice agent met a specific business outcome on each call, not just generic signals like latency or sentiment. In Cekura, a custom KPI is a user-defined metric with a Boolean (pass/fail), Rating (a score), or Enum (a category) output, judged by an LLM reading the transcript and optionally the audio, or by a Python function.

Generic metrics tell you the call sounded fine; custom KPIs tell you whether the agent booked the appointment, quoted the right price, read the disclosure, or escalated. A voice agent can return a clean technical signal and still fail the caller, because a low latency reading and a positive sentiment score say nothing about whether the task actually got done.

Why preset metrics are not enough for production voice agents

Preset metrics give you a reliable baseline but cannot know your business logic, so the most useful KPIs are almost always the ones you define.

  • Cekura ships predefined metrics across four families (Accuracy, Conversation Quality, Customer Experience, Speech Quality): Hallucination, CSAT, Latency, Talk Ratio, Tool Call Success, and more.
  • Those answer "is the agent healthy?" They do not answer "did the agent quote the correct copay?" or "did it read the RESPA disclosure before discussing rates?"
  • Pacing shows the gap: per Cekura's voice AI evaluation metrics guide, most production-ready agents pace around 200 words per minute, more than half run above 190 WPM, and agents at or above the 0.80 Talk Ratio threshold are talking over callers.
  • 2026 guidance agrees that outcome metrics outrank execution metrics: Google Cloud's framework says to audit the agent's reasoning and tool-selection trace, not just the final answer, and to pair every cost metric with a success rate.

The difference is who defines the metric and what it knows:

| | Preset / library metric | Custom KPI (team-defined) |
|---|---|---|
| Who defines it | The vendor | Your team |
| What it knows | Generic agent health (latency, CSAT, WER) | Your business rule (copay quoted correctly, RESPA disclosure read) |
| Example | "Was latency under target?" | "Did the agent verify identity before discussing the account?" |
| When to use | Baseline monitoring across every agent | Outcomes and compliance specific to your domain |
| How it is scored | Built-in | LLM-judge (plain-English criteria) or Python, scoped to a node or the full call |

Presets tell you the agent is healthy. Custom KPIs tell you it did the job.

How to define a custom KPI for a voice agent (step by step)

You define a custom KPI in Cekura by describing the success condition in plain English, choosing an output type, scoping it, and validating it against historical calls before it goes live. A non-engineer can author one; an engineer can extend it in Python.

  1. Write the success condition in plain English. For an LLM-judge metric the description is the prompt. Example: "Return true only if the agent confirmed the caller's date of birth before discussing any account details."
  2. Pick the output type. Boolean for pass/fail compliance, Rating for graded quality (1 to 5), Enum for categorical outcomes (resolved / escalated / abandoned).
  3. Scope it. Attach the metric to the whole call or to a specific node, so a compliance rule is judged only where it applies.
  4. Add an evaluation trigger. Always-on, or conditional so it fires only when relevant (for example, only on calls that reached the payment step).
  5. Use dynamic variables for scenario-aware judging. Reference call context such as {{metadata.instructions}} so one metric adapts to different call types.
  6. Turn on audio analysis if it is a speech KPI. Set audio=True for pacing, tone, or pronunciation that the transcript alone cannot capture.
  7. Validate against historical call IDs. Test on real past calls and tune the judge prompt until its scores line up with human reviewers.
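Step 7 can be made concrete with a small agreement check. The sketch below is illustrative only and does not use Cekura's API: the `LabeledCall` record, the call IDs, and the `agreement_report` helper are hypothetical names, and it assumes you have already collected one human verdict and one judge verdict per historical call for a Boolean KPI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledCall:
    call_id: str      # hypothetical historical call ID
    human_pass: bool  # verdict from a human reviewer
    judge_pass: bool  # verdict from the judge being validated


def agreement_report(calls: list[LabeledCall]) -> dict:
    """Summarize how often the judge matches human reviewers."""
    if not calls:
        raise ValueError("need at least one labeled call")
    agree = sum(c.human_pass == c.judge_pass for c in calls)
    # Judge passed a call that humans failed: the risky direction for compliance.
    false_pass = [c.call_id for c in calls if c.judge_pass and not c.human_pass]
    # Judge failed a call that humans passed: the source of noisy alerts.
    false_fail = [c.call_id for c in calls if not c.judge_pass and c.human_pass]
    return {
        "agreement": agree / len(calls),
        "false_pass_ids": false_pass,
        "false_fail_ids": false_fail,
    }


history = [
    LabeledCall("call-001", human_pass=True, judge_pass=True),
    LabeledCall("call-002", human_pass=False, judge_pass=True),
    LabeledCall("call-003", human_pass=False, judge_pass=False),
    LabeledCall("call-004", human_pass=True, judge_pass=True),
]
report = agreement_report(history)
print(report)
```

Listing the disagreeing call IDs, rather than reporting only the agreement rate, points you at the exact transcripts to reread while tuning the judge prompt.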

Once validated, the same KPI runs in pre-production simulations and automatically post-call on live production traffic.

Custom KPI types and what they are good for

Three output types cover the kinds of voice agent KPIs you will define.

| Output type | Best for | Voice agent KPI examples | How it is judged |
|---|---|---|---|
| Boolean (pass/fail) | Compliance and hard requirements | Identity verified before account access; required disclosure read; correct tool actually called | LLM-judge or Python, true/false |
| Rating (graded score) | Quality and degree-of-success | Answer completeness (1 to 5); empathy in a cancellation flow; clarity of the quoted price | LLM-judge, numeric scale |
| Enum (category) | Outcome classification | Call outcome (resolved / escalated / abandoned); intent category; dropoff node | LLM-judge, fixed label set |

A practical pattern combines all three: a Boolean for the non-negotiable compliance rule, a Rating for how well the agent handled the request, and an Enum to bucket the outcome for reporting.
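One way to picture that combined pattern is a per-call record holding all three outputs, rolled up for reporting. This is a minimal sketch under stated assumptions, not Cekura's data model: `CallScore`, `Outcome`, and `summarize` are hypothetical names, and the sample scores are made up.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    RESOLVED = "resolved"
    ESCALATED = "escalated"
    ABANDONED = "abandoned"


@dataclass(frozen=True)
class CallScore:
    disclosure_read: bool  # Boolean: the non-negotiable compliance rule
    handling_quality: int  # Rating: 1 (poor) to 5 (excellent)
    outcome: Outcome       # Enum: bucket used for reporting

    def __post_init__(self):
        if not 1 <= self.handling_quality <= 5:
            raise ValueError("handling_quality must be between 1 and 5")


def summarize(scores: list[CallScore]) -> dict:
    """Roll per-call scores up into the three numbers a report would show."""
    if not scores:
        raise ValueError("need at least one scored call")
    total = len(scores)
    by_outcome = {o.value: 0 for o in Outcome}
    for s in scores:
        by_outcome[s.outcome.value] += 1
    return {
        "compliance_pass_rate": sum(s.disclosure_read for s in scores) / total,
        "mean_quality": sum(s.handling_quality for s in scores) / total,
        "outcomes": by_outcome,
    }


scores = [
    CallScore(True, 4, Outcome.RESOLVED),
    CallScore(True, 2, Outcome.ESCALATED),
    CallScore(False, 3, Outcome.RESOLVED),
    CallScore(True, 5, Outcome.ABANDONED),
]
summary = summarize(scores)
print(summary)
```

Keeping the three outputs as separate fields means a strong Rating cannot mask a failed Boolean: the compliance pass rate is reported on its own.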

How custom KPIs flow into monitoring reports and dashboards

Custom KPIs surface in Cekura's Observe dashboards, where every production call is auto-evaluated and the scores roll up into live and historical reports.

  • The metric becomes a column in the Runs dashboard, a trend line over time, and a drill-down filter.
  • Each row clicks through to the timestamped transcript, audio, and tool-call data behind the score.

Four monitoring behaviors make custom KPIs operational rather than decorative:

  • Failure-Mode Insights. A daily Cekura agent clusters failing LLM-judge calls from the previous day into a handful of root-cause themes with linked call IDs, for custom metrics as well as predefined ones.
  • Smart alerting. Set thresholds on any custom KPI and route alerts to Slack, email, or a webhook. For latency-style KPIs, alert on tail percentiles (P90, P95, P99), not averages, because an acceptable average can hide individual calls whose latency spikes to several seconds.
  • Remediate, do not just detect. Cekura ties each KPI to an action: cluster the failing calls, jump to the offending transcript and tool-call trace, patch the prompt or knowledge base, and re-validate before the fix ships.
  • Close the loop. A production call that fails a custom KPI can be turned into a regression test, so the failure becomes a permanent check.
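The point about tail percentiles is easy to check with arithmetic. The sketch below is illustrative and is not Cekura's alerting logic: the `percentile` helper uses the nearest-rank definition, and the latency samples and the 1.5 second threshold are invented for the example.

```python
import math
from statistics import fmean


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical per-call latency in seconds: 18 fast calls and 2 slow ones.
latencies = [0.8] * 18 + [4.0] * 2
THRESHOLD_S = 1.5  # assumed alert threshold, not a recommended value

mean_latency = fmean(latencies)
p95_latency = percentile(latencies, 95)

mean_alert = mean_latency > THRESHOLD_S
tail_alert = p95_latency > THRESHOLD_S

print(f"mean={mean_latency:.2f}s alert={mean_alert}")
print(f"p95={p95_latency:.2f}s alert={tail_alert}")
```

With these samples the mean stays under the threshold while P95 lands on one of the 4-second calls, so only the percentile-based rule fires.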

Why this matters, with real numbers from Cekura's voice AI evaluation metrics guide:

  • More than 20 percent of runs flag a workflow-adherence gap, exactly the "did the agent follow our process?" signal custom KPIs track.
  • Kastle, running on Cekura, cut cost per call by 70 percent while reaching 90 percent CSAT, the kind of outcomes a monitoring layer exists to protect.

We set up key metrics on Cekura and could easily compare it between our old and new stack. This gave us the confidence to deploy the new stack.

— Vichar Shroff, Co-founder & CPO, Confido Health

A worked example: a compliance KPI for a lending voice agent

Node-scoped custom KPIs beat full-call averages for regulated agents.

  • A lending agent must read specific disclosures, but a single full-call "compliance" score averaged across a long conversation is nearly meaningless because the obligation lives at one node.
  • Scoping a Boolean KPI to the rate-discussion node makes the pass/fail signal sharp: it fires only where the rule applies, so a failure is unambiguous and actionable.
  • The same logic applies to a healthcare agent verifying identity before disclosing records, or a support agent that must offer a callback before a queue exceeds a wait threshold.
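A node-scoped Boolean check can be sketched in a few lines. Everything here is an assumption made for illustration: the `Turn` record, the node labels, and the disclosure phrase are hypothetical, and the substring match stands in for whatever LLM-judge or Python logic the real metric would use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Turn:
    node: str     # hypothetical conversation-node label
    speaker: str  # "agent" or "caller"
    text: str


# Placeholder wording, not real regulatory disclosure text.
REQUIRED_PHRASE = "this is not a commitment to lend"


def disclosure_read_at_node(turns: list[Turn], node: str = "rate_discussion") -> Optional[bool]:
    """Return None when the call never reached the node, so the KPI does not apply."""
    scoped = [t for t in turns if t.node == node]
    if not scoped:
        return None
    # Only agent turns inside the scoped node count toward a pass.
    return any(
        t.speaker == "agent" and REQUIRED_PHRASE in t.text.lower()
        for t in scoped
    )


compliant = [
    Turn("greeting", "agent", "Thanks for calling."),
    Turn("rate_discussion", "agent", "Before rates: this is not a commitment to lend."),
    Turn("rate_discussion", "agent", "Today's rate is shown on your quote."),
]
wrong_node = [
    Turn("greeting", "agent", "Note that this is not a commitment to lend."),
    Turn("rate_discussion", "agent", "Today's rate is shown on your quote."),
]
never_reached = [Turn("greeting", "agent", "Thanks for calling.")]

print(disclosure_read_at_node(compliant))
print(disclosure_read_at_node(wrong_node))
print(disclosure_read_at_node(never_reached))
```

Returning `None` for calls that never reached the node keeps them out of the pass rate entirely, which is what stops unrelated calls from diluting the compliance signal.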

How Cekura runs custom KPIs end to end

Cekura is a testing, evaluation, and observability platform for voice and chat agents, so a custom KPI you define is reused across the full lifecycle rather than siloed in monitoring.

  • Integrates natively with Vapi, Retell, LiveKit, Pipecat, and ElevenLabs, plus raw websocket/CHIRP, SIP, and custom webhook agents.
  • Owns voice synthesis and conversation management, so no external API keys are needed to run the simulations that validate a new KPI.

The lifecycle for a single custom KPI:

  1. Author and validate the metric against historical calls.
  2. Attach it to simulation scenarios to test pre-production at scale, including edge cases and adversarial personas.
  3. Enable it so it runs automatically on every live call in Observe.
  4. Watch it in dashboards, get alerted on regressions, and read clustered Failure-Mode Insights.
  5. Convert failing production calls into regression tests, and optionally feed them into the Optimise Prompt loop.

Cekura is YC-backed, founded by engineers from Google, Apple, and Microsoft, has raised $2.4M, evaluates 60K+ voice AI calls daily, and has stress-tested 5M+ voice agent minutes, so a custom KPI you author runs on infrastructure proven at scale (voice AI evaluation metrics guide).

FAQ

What are custom KPIs for voice agent monitoring?

Custom KPIs for voice agent monitoring are metrics you define yourself to score live calls against your own business rules. In Cekura they are user-defined Boolean, Rating, or Enum metrics, judged by an LLM or a Python function, that run automatically on production traffic alongside presets like CSAT and latency.

What are custom metrics for a voice agent, and how are they different from preset metrics?

Custom metrics encode domain-specific success conditions (for example, "verified identity before account access") that no vendor preset can know. Presets cover generic health signals; custom metrics cover whether the agent did the specific job your business cares about, scoped to the node where the rule applies.

How do I build a voice AI monitoring reports dashboard with custom KPIs?

In Cekura, every custom KPI you author is auto-evaluated on production calls and rolls up into the Observe dashboards as live scores, trend lines, and drill-down filters, with click-through to transcript, audio, and tool-call data. You can set Slack, email, or webhook alerts on any custom KPI.

What KPIs should a production voice agent track?

Track a baseline of execution metrics (latency, transcription accuracy, tool-call success) plus outcome KPIs you define, such as task completion, containment or resolution, and any domain-specific compliance checks. 2026 guidance stresses pairing every cost metric with a success rate and auditing the trace, not just the final answer.

Can custom voice agent KPIs run before production, not just in monitoring?

Yes. The same custom metric runs during pre-production simulations and automatically post-call in production, so the definition of success is identical in testing and live monitoring.

See your custom KPIs running on real calls

Know the one metric that decides whether a call succeeded? Author it as a custom KPI in Cekura, validate it against your historical calls, and watch it score live traffic in Observe. The Cekura docs on custom metrics and the LLM-judge metric guide walk through setup.
