Testing AI chatbots and voice agents is challenging. Conversations are non-linear, agents respond differently based on context, and real users take unpredictable paths through your dialogue flows. Traditional scripted testing falls short when you need to test branching conversations, IVR systems, or error handling.
Conditional Actions in Cekura solves this problem by enabling dynamic, rule-based testing that adapts to the main agent's responses in real time.
The Challenge: Linear Scripts vs Dynamic Conversations
Traditional testing approaches use linear scripts:
1. Say "Hello"
2. Say "I want to book an appointment"
3. Say "Tomorrow at 3pm"
4. Say "Thank you"
This works for happy paths, but fails when:
- The agent asks for information in a different order
- The agent offers multiple options based on availability
- The agent handles errors or edge cases
- The conversation branches based on user input
- You need to test complex IVR menu navigation
The LLM Instruction-Following Problem
Beyond the limitations of linear scripts, there's a deeper problem when using LLMs for testing: instruction-based prompts are unreliable.
When you give a testing LLM instructions like:
"You are a customer calling to book an appointment.
When asked for your name, provide it.
When asked for a time preference, say you prefer mornings.
Don't provide information unless asked."
You encounter several critical issues:
LLMs Hallucinate on Instructions
Testing agents frequently:
- Provide information before being asked: Even when told to wait, LLMs often volunteer information prematurely
- Forget earlier instructions: In long conversations, LLMs lose track of what they were told to do
- Misinterpret conditional logic: Complex "if X then Y, unless Z" instructions get confused or ignored
Real Example:
Instruction: "When asked for payment method, say credit card. Don't mention
the card number unless explicitly asked."
What the LLM does: "I'd like to pay by credit card, it's 4532-1234-5678-9012"
❌ Revealed sensitive data without being asked
LLMs Get Stuck in Loops
Without explicit condition-action mappings, testing agents:
- Repeat the same response: When the agent doesn't understand, the LLM keeps rephrasing the same thing
- Can't break conversation deadlocks: Gets stuck asking and re-asking when there's a misunderstanding
- Lose conversation context: Forgets what was already discussed and repeats earlier topics
Real Example:
Agent: "I didn't catch that. Can you repeat your account number?"
LLM: "My account number is 12345"
Agent: "Sorry, I still didn't get that. Your account number?"
LLM: "Yes, my account number is 12345"
Agent: "Can you please provide your account number?"
LLM: "I already said it's 12345"
[Loop continues...]
LLMs Don't Follow Instructions Precisely
Even with simple instructions, LLMs:
- Ignore sequencing: Perform step 3 before step 1
- Add creative flourishes: Embellish responses when you need exact phrases
- Skip required steps: Miss critical actions in multi-step workflows
- Fail at precise timing: Can't reliably handle "wait 5 seconds, then respond"
Real Example:
Instruction: "First verify the appointment date, then ask to reschedule"
What the LLM does: "I'd like to reschedule my appointment for next Tuesday
instead"
❌ Skipped verification, combined two steps, added specific date not in
instructions
Why This Breaks Testing
These LLM behaviors create:
- Flaky tests: Same test produces different results on different runs
- False positives: Tests pass when they should fail because the LLM adapts incorrectly
- False negatives: Tests fail because the LLM doesn't follow the test plan
- Unreproducible issues: Can't consistently trigger specific scenarios
- Wasted debugging time: Unclear if failures are due to the main agent or the testing agent
What Are Conditional Actions?
Conditional Actions is a specialized evaluator type that creates dynamic, rule-based test scenarios. Instead of following a fixed script, the test agent adapts its behavior based on what the main agent (your chatbot or voice agent) says during the conversation.
Think of it as giving the test agent intelligence: "When the main agent asks for X, do Y. When the main agent offers Z, respond with W."
Key Benefits
- Eliminates LLM Hallucination: Rule-based conditions ensure the testing agent only responds when specific conditions are met, not prematurely.
- Prevents Instruction Drift: Explicit condition-action mappings mean the LLM can't "forget" or misinterpret complex instructions.
- Reproducible Tests: Same conditions produce same actions every time, eliminating test flakiness.
- Breaks Conversation Loops: Defined exit conditions and error handling prevent the testing agent from getting stuck.
- Adaptive Testing: Test scenarios adjust to agent responses dynamically, just like real users.
- Branch Coverage: Test multiple conversation paths from a single evaluator configuration.
- Complex Workflows: Handle IVR menus, multi-step forms, and decision trees with ease.
- Maintainable Tests: Change agent logic without rewriting entire test scripts.
How Conditional Actions Work
Conditional Actions evaluators consist of two components:
1. Role Definition
Define who your testing agent is pretending to be:
```json
{
  "role": "You are a patient calling to cancel an appointment"
}
```
The role sets the context for the entire conversation and helps the testing agent maintain consistent behavior.
2. Conditions Array
Define rules that specify when a situation occurs and what the testing agent should do:
```json
{
  "conditions": [
    {
      "id": 0,
      "condition": "",
      "action": "Hi, I need to cancel my appointment on Tuesday",
      "fixed_message": true
    },
    {
      "id": 1,
      "condition": "The agent asks for your name",
      "action": "Provide your name as John Smith"
    },
    {
      "id": 2,
      "condition": "The agent asks for verification",
      "action": "Provide your date of birth"
    },
    {
      "id": 3,
      "condition": "The agent confirms cancellation",
      "action": "Thank them and end the call"
    }
  ]
}
```
Instruction-Based Prompts vs Conditional Actions
Let's compare how the same test scenario performs with traditional instruction-based prompts versus Conditional Actions:
Scenario: Testing an Appointment Booking Flow
Traditional Instruction-Based Approach:
System Prompt:
"You are a patient calling to book an appointment.
First, greet the agent.
When asked for your name, provide 'John Smith'.
When asked for date of birth, provide '01/15/1985'.
When asked for preferred date, say 'next Tuesday'.
When asked for preferred time, say 'morning'.
Only provide information when explicitly asked.
Don't volunteer additional information."
What Actually Happens:
Run 1:
Testing Agent: "Hi, I'm John Smith and I'd like to book an appointment for
next Tuesday morning"
❌ Volunteered all information immediately, ignoring "only when asked"
instruction
Run 2:
Agent: "What's your preferred date?"
Testing Agent: "I'd prefer next Tuesday around 10am if possible"
❌ Added time preference when only date was asked
Run 3:
Agent: "I didn't catch your date of birth, can you repeat?"
Testing Agent: "Sure, my date of birth is 01/15/1985"
Agent: "Sorry, I still didn't get that. Your birth date?"
Testing Agent: "Yes, it's 01/15/1985"
Agent: "Can you say that one more time?"
Testing Agent: "I already told you, it's January 15th, 1985"
[Stuck in loop - agent mishears, testing agent can't adapt]
Conditional Actions Approach:
```json
{
  "role": "You are a patient calling to book an appointment",
  "conditions": [
    {
      "id": 0,
      "condition": "",
      "action": "Hi, I'd like to book an appointment",
      "fixed_message": true
    },
    {
      "id": 1,
      "condition": "The agent asks for your name",
      "action": "John Smith",
      "fixed_message": true
    },
    {
      "id": 2,
      "condition": "The agent asks for your date of birth",
      "action": "01/15/1985",
      "fixed_message": true
    },
    {
      "id": 3,
      "condition": "The agent says they didn't understand the date of birth",
      "action": "<spell>01</spell> 15 <spell>1985</spell>",
      "fixed_message": true
    },
    {
      "id": 4,
      "condition": "The agent asks for preferred date",
      "action": "Next Tuesday",
      "fixed_message": true
    },
    {
      "id": 5,
      "condition": "The agent asks for preferred time",
      "action": "Morning, preferably around 10am",
      "fixed_message": true
    }
  ]
}
```
What Happens:
Every Run:
Testing Agent: "Hi, I'd like to book an appointment"
[Waits for agent to ask for specific information]
Agent: "What's your name?"
Testing Agent: "John Smith"
Agent: "Date of birth?"
Testing Agent: "01/15/1985"
Agent: "Sorry, I didn't catch that"
Testing Agent: "0—1 15 1—9—8—5"
[Switches to spelled format based on condition]
Agent: "Got it. What date works for you?"
Testing Agent: "Next Tuesday"
✅ Provides exactly what's asked, when it's asked
✅ Adapts when agent doesn't understand
✅ Never volunteers unrequested information
✅ Reproducible across runs
Key Differences
| Aspect | Instruction-Based | Conditional Actions |
|---|---|---|
| Premature Information | LLM often volunteers information early | Only responds when conditions are met |
| Reproducibility | Different behavior on each run | Consistent, deterministic responses |
| Error Recovery | Gets stuck in loops | Specific conditions for error scenarios |
| Instruction Compliance | LLM "forgets" or ignores complex rules | Each condition is evaluated independently |
| Test Debugging | Hard to know if agent or test failed | Clear mapping of condition → action |
| Maintenance | Must rewrite entire prompt | Add/modify specific conditions |
Advanced Testing with Special Tags
Conditional Actions supports powerful control tags that enable sophisticated test scenarios when used with fixed_message: true. These tags allow you to simulate real-world conditions and test edge cases that would be difficult or impossible to achieve with traditional testing approaches.
Communication Control Tags
IVR Messages simulate non-interruptible voice prompts, allowing you to test how your main agent responds when the testing agent plays automated menu options that cannot be interrupted.
Voicemail tags play voicemail greetings with a beep sound, enabling you to test how your main agent handles voicemail scenarios and whether it properly detects and responds after the beep.
End Call tags terminate the conversation immediately, useful for testing proper call cleanup, session management, and graceful conversation endings.
Speech Control Tags
Silence/Pauses add realistic pauses in the testing agent's speech, allowing you to test how your main agent handles natural conversational pauses and whether it incorrectly interprets silence as the end of user input.
Hold tags create delays between messages, enabling you to test scenarios where users need time to look up information or perform actions between responses.
Spell tags spell out text letter by letter (e.g., "API" becomes "A—P—I"), perfect for testing how your main agent handles spelled-out information like reference codes, confirmation numbers, or account identifiers.
Speed Control tags adjust speech speed, allowing you to test whether your main agent can handle fast-talking users or slow speakers with different speech rates.
Volume Control tags adjust audio volume, enabling you to test scenarios where users need to speak louder or softer.
Interaction Testing Tags
DTMF Tones simulate phone keypad button presses, essential for testing IVR navigation, PIN entry, account number input, and any scenario where users might use their phone keypad instead of voice.
SMS Triggers test SMS-based workflows, allowing you to simulate scenarios where users receive or send text messages during voice conversations, such as receiving verification codes or confirmation messages.
Interruptions test barge-in and interruption handling by having the testing agent interrupt the main agent after a specified time, revealing whether your agent properly detects and handles mid-speech interruptions.
Environmental Testing Tags
Background Noise adds realistic background sounds during specific portions of speech, enabling you to test how your main agent performs in noisy environments like offices, coffee shops, or outdoor settings with varying audio quality.
Network Simulation simulates network conditions like packet loss, jitter, and latency, allowing you to test how your main agent handles poor network quality, choppy audio, dropped packets, and the real-world connectivity issues users experience.
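Actions that use these tags are ordinary strings containing the markup, as the examples later in this article show. If you generate many such actions, small builder helpers keep the markup consistent. The tag names below (`<spell>`, `<dtmf />`, `<volume />`) are taken from the examples in this article; the helper functions themselves are a hypothetical convenience, not part of Cekura.

```python
# Hypothetical helpers for composing the tag markup used in actions.
# Tag names follow the examples in this article; the builders are not
# a Cekura API.

def spell(text: str) -> str:
    """Wrap text so it is spoken letter by letter."""
    return f"<spell>{text}</spell>"

def dtmf(digits: str) -> str:
    """Emit keypad tones for the given digits."""
    return f'<dtmf digits="{digits}" />'

def volume(ratio: float) -> str:
    """Adjust speaking volume for the rest of the message."""
    return f'<volume ratio="{ratio}" />'

# Compose an action string: speak louder, spell out the reference code.
action = volume(1.3) + "My reference code is " + spell("ABC") + " 123"
print(action)
```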
Real-World Examples
Example 1: Testing IVR Navigation
```json
{
  "role": "You are a customer calling support",
  "conditions": [
    {
      "id": 0,
      "condition": "",
      "action": "",
      "fixed_message": true
    },
    {
      "id": 1,
      "condition": "The IVR plays the main menu",
      "action": "<dtmf digits=\"2\" />",
      "fixed_message": true
    },
    {
      "id": 2,
      "condition": "The IVR asks for account number",
      "action": "<dtmf digits=\"123456#\" />",
      "fixed_message": true
    },
    {
      "id": 3,
      "condition": "The agent asks how they can help",
      "action": "Explain your billing issue"
    }
  ]
}
```
This evaluator navigates an IVR menu using DTMF tones, then switches to natural conversation.
Example 2: Testing Branching Conversations
```json
{
  "role": "You are a customer requesting a refund",
  "conditions": [
    {
      "id": 0,
      "condition": "",
      "action": "I would like to request a refund for order #12345",
      "fixed_message": true
    },
    {
      "id": 1,
      "condition": "The agent approves the refund",
      "action": "Thank them and confirm the refund amount"
    },
    {
      "id": 2,
      "condition": "The agent denies the refund",
      "action": "Ask to speak with a supervisor"
    },
    {
      "id": 3,
      "condition": "The agent offers store credit instead",
      "action": "Accept the store credit option"
    }
  ]
}
```
This evaluator handles three different conversation branches depending on how the agent responds.
Example 3: Testing Multi-Step Forms
```json
{
  "role": "You are a new patient registering for an appointment",
  "conditions": [
    {
      "id": 0,
      "condition": "",
      "action": "Hi, I'd like to schedule my first appointment",
      "fixed_message": true
    },
    {
      "id": 1,
      "condition": "The agent asks for your name",
      "action": "Provide your name"
    },
    {
      "id": 2,
      "condition": "The agent asks for your date of birth",
      "action": "Provide your date of birth"
    },
    {
      "id": 3,
      "condition": "The agent asks for your phone number",
      "action": "Provide your phone number"
    },
    {
      "id": 4,
      "condition": "The agent asks for your insurance information",
      "action": "Say you'll provide it at the appointment"
    },
    {
      "id": 5,
      "condition": "The agent offers available times",
      "action": "Select the first available morning slot"
    }
  ]
}
```
Example 4: Testing Complex Scenarios with Environmental Factors
```json
{
  "role": "You are a customer in a noisy environment with poor reception",
  "conditions": [
    {
      "id": 0,
      "condition": "",
      "action": "<network_simulation packet_loss=\"3\" latency=\"100\" /><background_noise sound=\"office\" volume=\"0.05\">Hello, can you hear me?</background_noise>",
      "fixed_message": true
    },
    {
      "id": 1,
      "condition": "The agent asks you to repeat",
      "action": "<volume ratio=\"1.3\" />I said I need help with my order, number <spell>ABC</spell> 123",
      "fixed_message": true
    },
    {
      "id": 2,
      "condition": "The agent confirms they understood",
      "action": "Thank them and explain the issue"
    }
  ]
}
```
This evaluator tests how agents handle difficult audio conditions with background noise, network issues, and customers who need to speak louder or spell information.
Common Patterns
Progressive Information Gathering
```json
{
  "conditions": [
    {"id": 0, "action": "Initial request"},
    {"id": 1, "condition": "Asks for field 1", "action": "Provide field 1"},
    {"id": 2, "condition": "Asks for field 2", "action": "Provide field 2"},
    {"id": 3, "condition": "Asks for field 3", "action": "Provide field 3"},
    {"id": 4, "condition": "Confirmation", "action": "Confirm details"}
  ]
}
```
Conditional Branching
```json
{
  "conditions": [
    {"id": 0, "action": "Make request"},
    {"id": 1, "condition": "Request approved", "action": "Thank and confirm"},
    {"id": 2, "condition": "Request denied", "action": "Ask for alternative"},
    {"id": 3, "condition": "Needs more info", "action": "Provide additional details"}
  ]
}
```
Multi-Part Responses
```json
{
  "conditions": [
    {"id": 5, "condition": "Agent asks question", "action": "Answer first part"},
    {"id": 6, "type": "action_followup", "condition": 5, "action": "Add second part"},
    {"id": 7, "type": "action_followup", "condition": 6, "action": "Clarify final detail"}
  ]
}
```
Integration with Cekura's Testing Framework
Conditional Actions integrates seamlessly with Cekura's other testing features:
- Test Profiles: Add identity information (name, DOB, account numbers) that the testing agent can use when responding
- Personalities: Configure language patterns, speaking styles, and behavioral characteristics
- Metrics: Measure success criteria like task completion, instruction following, and conversation quality
- CI/CD Integration: Run conditional action tests automatically in your deployment pipeline
- Production Monitoring: Apply conditional actions to real user conversations for ongoing quality assurance
Why Conditional Actions Matter
Traditional testing approaches suffer from two critical problems:
- LLM unreliability: Instruction-based prompts lead to hallucination, instruction drift, and non-deterministic behavior
- Limited coverage: Separate test scripts needed for every possible conversation path
With Conditional Actions, these problems disappear:
Solving the LLM Reliability Problem
- No more hallucination: Testing agents can't volunteer information prematurely because actions only trigger when conditions are explicitly met
- No more loops: Specific conditions handle error scenarios ("agent didn't understand", "agent asks to repeat") so tests don't get stuck
- No more instruction drift: Each condition is evaluated independently, so the LLM can't "forget" what it's supposed to do 15 turns into a conversation
- Deterministic behavior: Same input produces same output every time, making tests reliable and debuggable
- Precise control: Use fixed_message: true for exact phrases when testing specific keywords or compliance requirements
Solving the Coverage Problem
- One evaluator handles multiple paths: Test branching conversations without writing separate scripts
- Adapt to agent changes: When the main agent's logic changes, the same evaluator continues working
- Cover edge cases efficiently: Test error handling and unexpected scenarios systematically with dedicated conditions
- Match real user behavior: Simulate how actual users navigate conversations dynamically
- Scale testing efforts: Write fewer tests that cover more ground
Real-World Impact
Teams using Conditional Actions report:
- 90% reduction in test flakiness: Tests that used to fail randomly now pass consistently
- 3x faster test development: No more rewriting prompts to fix LLM behavior issues
- Better bug detection: Reproducible tests actually catch agent problems instead of hiding them behind testing agent variance
- Easier debugging: When a test fails, you know exactly which condition triggered and what action was expected
Getting Started with Conditional Actions
- Define your role: Who is the testing agent pretending to be?
- Map conversation paths: What are the possible flows through your dialogue?
- Write conditions: For each agent response, what should the testing agent do?
- Add edge cases: Include error conditions and unexpected scenarios
- Use appropriate tags: Leverage IVR, DTMF, silence, and other tags as needed
- Test and iterate: Run your evaluator and refine conditions based on results
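Before running an evaluator, a quick structural sanity check on the config can catch mistakes early. The validator below is a hypothetical pre-flight check: the rules it enforces (a non-empty role, unique condition ids, exactly one empty-condition opener) are conventions inferred from the examples in this article, not an official Cekura schema.

```python
# Hypothetical pre-flight check for an evaluator config. The rules are
# inferred from this article's examples, not an official schema.

def validate_evaluator(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means the config looks sane."""
    problems = []
    if not config.get("role"):
        problems.append("missing role")
    conditions = config.get("conditions", [])
    ids = [c.get("id") for c in conditions]
    if len(ids) != len(set(ids)):
        problems.append("duplicate condition ids")
    openers = [c for c in conditions if c.get("condition") == ""]
    if len(openers) != 1:
        problems.append("expected exactly one opening (empty) condition")
    return problems

good = {"role": "You are a patient",
        "conditions": [{"id": 0, "condition": "", "action": "Hi"},
                       {"id": 1, "condition": "Asks name", "action": "John"}]}
print(validate_evaluator(good))  # []
```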
Conclusion
Conditional Actions represents a fundamental shift in how we test conversational AI. Instead of unreliable instruction-based prompts that hallucinate and get stuck in loops, you create rule-based test agents with deterministic, reproducible behavior.
The problems that plague traditional LLM-based testing (premature information volunteering, instruction drift, conversation loops, and test flakiness) are eliminated through explicit condition-action mappings. Testing agents only respond when specific conditions are met, handle error scenarios gracefully, and produce consistent results across runs.
Whether you're testing simple customer support flows or complex IVR systems with multiple branches, Conditional Actions provides the reliability and robustness needed for production-grade conversational AI testing. No more debugging flaky tests. No more wondering if the main agent failed or the testing agent misbehaved. Just clear, reproducible, maintainable test scenarios that actually catch bugs.
Start testing with Conditional Actions in Cekura and build chatbots and voice agents that handle real-world conversations with confidence.
Start free trial: dashboard.cekura.ai/overview
Book demo: cekura.ai/expert

