Cekura partners with Moss to bring real-time semantic search testing to AI agent QA

Why trust Cekura on voice AI evals

5M+ voice agent minutes tested in production.
Built by engineers from Google, Apple, and Microsoft. Backed by Y Combinator.
60K+ voice AI calls evaluated daily.
Native integration for every major voice AI stack: LiveKit, Pipecat, Vapi, Retell, ElevenLabs.

About Moss

Moss is a real-time semantic search runtime for conversational AI, voice agents, and copilots. It was founded by Sri Raghu Malireddi and Harsha Nalluru, engineers who spent years building large-scale agentic systems at Microsoft and Grammarly. Built in Rust and WebAssembly, Moss delivers sub-10ms semantic search with zero network hops, running wherever your agent runs: in-browser, on-device, or in the cloud.

The problem Moss solves is deceptively simple: retrieval lag breaks the illusion of intelligence. Every query that hops across networks and cloud databases adds hundreds of milliseconds. As usage scales, those lags compound into lost users, rising egress costs, and frustrated teams. Moss collapses the multi-hop retrieval stack into a single local-first runtime — connect your data once, and Moss handles indexing, packaging, distribution, and updates automatically.

Moss is already powering production deployments across voice AI and developer platforms, achieving sub-10ms retrieval and 70-90% token savings compared to traditional pipelines.

The retrieval layer is invisible — until it breaks

Most voice agent failures trace back to what the agent knew, or failed to know, at the moment it needed to respond. When retrieval is slow, agents hedge. When context is missing, agents hallucinate. When the knowledge base drifts out of sync, agents confidently answer with stale data.

These aren't LLM problems. They're retrieval infrastructure problems — and they're nearly impossible to catch without testing the full stack, including the retrieval layer, under realistic conditions.

Teams building on Moss gain a significant speed advantage. But speed also means failures happen faster. A misconfigured index, an embedding model swap, or a knowledge base update can silently degrade agent behavior in ways that only surface during live calls. You need a testing layer that keeps pace.

What you can do with Cekura × Moss

Test retrieval-augmented agents end-to-end: Run multi-turn test scenarios that exercise your agent's full stack — not just LLM responses, but whether the right context was retrieved at the right moment. Catch failures at the retrieval layer before they reach users.
Validate knowledge base changes before deployment: When you update indexes, swap embedding models, or add new documents to Moss, run regression suites automatically. Ensure that knowledge base changes improve — or at minimum don't degrade — agent behavior across your critical scenarios.
Simulate context-dependent conversations: Test agents that rely on dynamic, per-user, or session-specific context. Cekura's testing personas can exercise agents across diverse user profiles, ensuring retrieval personalisation works correctly at scale.
Benchmark retrieval configurations: A/B test different embedding models, index configurations, or chunking strategies using Cekura as the consistent evaluation layer. Make data-driven decisions about retrieval design with real conversation quality metrics — not just latency numbers.
Catch stale or missing context errors: Specifically test the failure modes that matter most for RAG agents: scenarios where context is missing, outdated, or mismatched. Surface hallucinations and incorrect answers that result from retrieval gaps before your users encounter them.
Continuous QA for always-on agents: Set up monitoring suites that continuously test your Moss-powered agent as your knowledge base evolves. Get alerted when retrieval changes cause regressions in response quality or task completion.
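
To make the regression idea above concrete, here is a minimal, self-contained sketch of a retrieval regression check. It does not use the Moss or Cekura APIs: the bag-of-words "embedding", the sample documents, the test cases, and every function name are hypothetical stand-ins chosen for illustration. The only point is the shape of the check: score the same query set against the old and new knowledge base, and flag the update if recall drops.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real setup would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def top_k(index, query, k=1):
    """Return the ids of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda doc_id: cosine(q, embed(index[doc_id])), reverse=True)
    return ranked[:k]

def recall_at_k(index, cases, k=1):
    """Fraction of test cases whose expected document appears in the top k."""
    hits = sum(1 for query, expected in cases if expected in top_k(index, query, k))
    return hits / len(cases)

# Hypothetical knowledge base before an update.
index_v1 = {
    "refund": "refund policy refunds are issued within 14 days of purchase",
    "hours": "support hours are 9am to 5pm on weekdays",
    "shipping": "shipping takes 3 to 5 business days",
}

# Simulated bad update: the refund document was dropped by mistake.
index_v2 = {doc_id: text for doc_id, text in index_v1.items() if doc_id != "refund"}

# Each case pairs a user query with the document that should be retrieved.
cases = [
    ("how long do refunds take", "refund"),
    ("when is support open", "hours"),
    ("how many days does shipping take", "shipping"),
]

baseline = recall_at_k(index_v1, cases)
candidate = recall_at_k(index_v2, cases)
regressed = candidate < baseline  # gate the deployment on this flag

print(f"baseline={baseline:.2f} candidate={candidate:.2f} regressed={regressed}")
```

A retrieval-only check like this is cheap to run on every knowledge base change, but it says nothing about what the agent does with the retrieved context; that still takes end-to-end conversation tests.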

Ready to test the full stack?

Book a demo: https://www.cekura.ai/expert

Email us: sidhant@cekura.ai

Learn more about Moss: https://www.moss.dev