Introduction
If you’re building voice agents, you already know the enemy: latency.
I’ve spent the last few months experimenting with different approaches to reducing latency across the STT → LLM → TTS stack. This post is a compact, practical checklist for shaving 300–600 ms off your loop: no long primers, just the knobs that move the needle.
It’s becoming common wisdom that the end-to-end latency in a voice agent should be under one second (preferably < 800 ms). Here are the things usually recommended to reduce latency in a voice agent:
- Overlap tactics: streaming everywhere and tightening turn-detection parameters
- Model and infra choices: smaller/faster models, careful model/cluster placement, and self-hosting models when it makes sense
- A few tricks: filler words, early acknowledgment, and semantic caching, which we’ll dig into below
In this post, I’ll show you how to reduce your LLM latency by up to ~60% using semantic caching, with steps that you can reproduce.
A brief note on semantic caching
I’ve read about using semantic caching for voice agents in a couple of blog posts, but here I’ll provide solid implementation details for a POC that you can put together in a day or two.
But first, credit to Canonical AI’s Semantic Caching FAQ (link); it’s the most descriptive read I’ve found online about the topic.
In simple terms, a semantic cache stores previous Q→A pairs and, on a new request, looks up semantically similar questions (via embeddings) to return the saved answer from a local (or remote, though I didn’t try that) cache instead of hitting an LLM.
Real-life example (clinic scheduling), with a code sketch after the list:
- Caller asks “What are your hours?” or “Are you open on Saturdays?”.
- Scope the cache to the current clinic (tenant).
- Search for semantically similar past questions (embedding search).
- On a hit, return the cached hours immediately (~50–200 ms); skip the LLM.
- On a miss, call the LLM, respond, and write the Q→A back to the cache for next time.
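In Python, that flow is only a handful of lines. Here, embed(), vector_search(), call_llm(), and cache_store() are hypothetical placeholders for your embedding client, vector DB query, LLM call, and cache write:

```python
# Minimal sketch of the cache-first flow above. embed(), vector_search(),
# call_llm(), and cache_store() are hypothetical placeholders.
SIMILARITY_THRESHOLD = 0.85

async def answer(question: str, tenant_id: str) -> str:
    query_vec = await embed(question)                # embed the caller's question
    hit = await vector_search(tenant_id, query_vec)  # scoped to the current clinic
    if hit and hit.similarity >= SIMILARITY_THRESHOLD:
        return hit.answer                            # cache hit: ~50–200 ms, no LLM call
    reply = await call_llm(question)                 # cache miss: fall back to the LLM
    await cache_store(tenant_id, question, query_vec, reply)  # write back for next time
    return reply
```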
How to Implement Semantic Caching in Your Voice Agent
Here’s a practical breakdown of how you can wire up a semantic cache. The architecture has two main parts: a wrapper on the agent side and a separate middleware service that handles the embedding and caching requests.
1. The Agent-Side Cache Wrapper
This is a small component that sits in front of your main LLM client. Its job is simple (see the sketch after this list):
- Before calling the LLM, it first asks the cache if a similar answer already exists.
- If it gets a cache hit, it streams the response back immediately. Your time-to-first-byte will be extremely low, since you’re just sending back stored data.
- If it’s a cache miss, it calls the real LLM, streams the response to the user, and once the full response is complete, it tells the cache middleware to store it for next time.
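Here’s a framework-agnostic sketch of that wrapper. CacheClient and its lookup()/store() methods are hypothetical stand-ins for the middleware described in the next section, and llm.stream() stands in for whatever streaming interface your LLM client exposes:

```python
from typing import AsyncIterator

class CachedLLM:
    """Wraps a streaming LLM client with a cache-first lookup."""

    def __init__(self, llm, cache_client, tenant_id: str):
        self.llm = llm
        self.cache = cache_client
        self.tenant_id = tenant_id

    async def generate(self, question: str, history: list[str]) -> AsyncIterator[str]:
        cached = await self.cache.lookup(self.tenant_id, question, history)
        if cached is not None:
            # Cache hit: stream the stored answer straight back (near-zero TTFT).
            yield cached
            return
        # Cache miss: stream from the real LLM, then store the full answer.
        chunks: list[str] = []
        async for chunk in self.llm.stream(question, history):
            chunks.append(chunk)
            yield chunk
        await self.cache.store(self.tenant_id, question, history, "".join(chunks))
```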
Thankfully, since both Pipecat and LiveKit Agents are open-source, you should be able to implement this cache wrapper without any issues.
It’s also smart to build in a circuit breaker. If the cache service is down or repeatedly failing, the wrapper should temporarily stop trying to contact it and just fall back to the LLM directly. This prevents a degraded cache from impacting the quality of agent responses.
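A minimal circuit breaker for this purpose only needs a failure counter and a cooldown; the thresholds below are just example values:

```python
import time

class CacheCircuitBreaker:
    """Stops calling the cache after repeated failures, then retries after a cooldown."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            # Cooldown elapsed: half-open, let one request probe the cache again.
            self.opened_at = None
            self.failures = 0
            return True
        return False  # breaker is open: skip the cache, go straight to the LLM

    def record_success(self) -> None:
        self.failures = 0

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()  # open the breaker
```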
2. The Middleware Service
This is a simple API service (you can build it easily with something like FastAPI) that handles the core caching logic. It manages three key tasks (a sketch follows the list):
- PII (Personally Identifiable Information) Guard: Before checking or storing anything, it scans the text for sensitive data like emails, phone numbers, or credit cards. If any PII is found, the cache is skipped entirely for that turn. This is non-negotiable for safety and compliance 😊
- Embedding and Vector Search: On a lookup request, it takes the recent conversation history, creates a context-aware text string, and generates a vector embedding. It then queries your vector database to find semantically similar entries.
- Storage: On a store request, it saves the new question-and-answer pair along with its embedding and a TTL (Time To Live) to ensure the data stays fresh.
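Here’s a rough FastAPI sketch of the service. The endpoint names and request models are illustrative assumptions rather than a fixed API, the PII patterns are deliberately simple examples, and embed(), vector_db_search(), and vector_db_store() are placeholders for your embedding client and vector database:

```python
import re

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # email addresses
    re.compile(r"\+?\d[\d\s().-]{7,}\d"),    # phone-number-like digit runs
    re.compile(r"\b(?:\d[ -]*?){13,16}\b"),  # credit-card-like digit runs
]

def contains_pii(text: str) -> bool:
    return any(p.search(text) for p in PII_PATTERNS)

class LookupRequest(BaseModel):
    tenant_id: str
    question: str
    history: list[str] = []

class StoreRequest(LookupRequest):
    answer: str
    ttl_s: int = 3600

@app.post("/lookup")
async def lookup(req: LookupRequest):
    if contains_pii(req.question):
        return {"hit": False}  # PII guard: skip the cache entirely for this turn
    context_text = " ".join(req.history[-2:] + [req.question])  # context-aware key
    vec = await embed(context_text)                     # hypothetical embedding helper
    match = await vector_db_search(req.tenant_id, vec)  # hypothetical vector DB query
    if match and match["similarity"] >= 0.85:
        return {"hit": True, "answer": match["answer"]}
    return {"hit": False}

@app.post("/store")
async def store(req: StoreRequest):
    if contains_pii(req.question) or contains_pii(req.answer):
        return {"stored": False}
    context_text = " ".join(req.history[-2:] + [req.question])
    vec = await embed(context_text)
    await vector_db_store(req.tenant_id, context_text, req.answer, vec, ttl_s=req.ttl_s)
    return {"stored": True}
```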
3. The Vector Database
You’ll need a vector database to store and search the embeddings. A great, easy-to-start option is Redis Stack, which includes the RediSearch module for high-performance vector search (HNSW index). A critical point: most vector databases (including Redis) calculate similarity using distance metrics like COSINE. This means a smaller number is a better match. You’ll need to convert this distance to a more intuitive similarity score: similarity = 1 - distance.
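Here’s roughly what that looks like with redis-py against Redis Stack. Import paths and index options can vary slightly between redis-py versions, so treat this as a starting point; the 1536 dimensions assume text-embedding-3-small:

```python
import numpy as np
import redis
from redis.commands.search.field import TagField, TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

r = redis.Redis(host="localhost", port=6379)

# HNSW index over cached answers, keyed by tenant, using COSINE distance.
r.ft("cache_idx").create_index(
    fields=[
        TagField("tenant_id"),
        TextField("answer"),
        VectorField("embedding", "HNSW", {
            "TYPE": "FLOAT32",
            "DIM": 1536,
            "DISTANCE_METRIC": "COSINE",
        }),
    ],
    definition=IndexDefinition(prefix=["cache:"], index_type=IndexType.HASH),
)

def lookup(query_vec: np.ndarray, tenant_id: str, threshold: float = 0.85):
    # KNN search scoped to one tenant; RediSearch returns a *distance* score.
    q = (
        Query(f"(@tenant_id:{{{tenant_id}}})=>[KNN 1 @embedding $vec AS dist]")
        .return_fields("answer", "dist")
        .dialect(2)
    )
    res = r.ft("cache_idx").search(q, {"vec": query_vec.astype(np.float32).tobytes()})
    if not res.docs:
        return None
    similarity = 1 - float(res.docs[0].dist)  # smaller distance = better match
    return res.docs[0].answer if similarity >= threshold else None
```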
4. Tenancy and Scoping
To prevent answers from leaking between different agent functions (e.g., a “billing agent” using responses from a “scheduler agent”), you must implement some sort of tenancy mechanism that associates a separate cache with each voice agent that handles a different task.
I haven’t actually tested this yet, but my idea is to assign a unique identifier to each agent persona or use case using job metadata. When your application dispatches an agent to a room, you can pass custom metadata that defines its function. For example, in LiveKit Agents:
- Dispatch Request for Clinic A Scheduler: metadata='{"use_case": "clinic_a_scheduler"}'
- Dispatch Request for Clinic A Billing: metadata='{"use_case": "clinic_a_billing"}'
Inside the agent, you can access this identifier from the JobContext (ctx) via ctx.job.metadata. This metadata string becomes your reliable tenant_id. Your agent then forwards this tenant_id with every request to the cache middleware. The middleware uses it to scope all lookups and storage, creating a completely isolated cache for each agent persona.
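As a rough sketch of that inside a LiveKit Agents entrypoint (the use_case key simply mirrors the dispatch metadata above; treat it as an example convention, not a requirement of the SDK):

```python
import json

from livekit.agents import JobContext

async def entrypoint(ctx: JobContext):
    # The dispatch metadata arrives as a plain string; here it carries JSON.
    meta = json.loads(ctx.job.metadata or "{}")
    tenant_id = meta.get("use_case", "default")

    # Forward tenant_id with every /lookup and /store call to the cache
    # middleware so each persona gets its own fully isolated cache.
    ...
```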
Defaults and Tuning
As for parameters, I’ve found that a similarity threshold of 0.85 is a solid starting point. It’s good at finding meaningful matches without being too aggressive and causing false positives. For embeddings, OpenAI’s text-embedding-3-small model offers a great balance of performance and cost, or you can check the Hugging Face MTEB leaderboard, which ranks embedding models, for the latest and best open-source options. I’ve also set a default TTL of one hour (3600s) on cached items to ensure the information doesn’t get too stale, but you can adjust this based on how quickly the correct information might change in your use case.
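If it helps, these defaults fit naturally into a small config object (the names here are illustrative, not from any particular library):

```python
from dataclasses import dataclass

@dataclass
class SemanticCacheConfig:
    similarity_threshold: float = 0.85   # raise for stricter matching, lower for more hits
    embedding_model: str = "text-embedding-3-small"
    ttl_s: int = 3600                    # one hour; shorten if answers change often
```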
Safety and Reliability
A couple of final points on keeping this production-ready. First, a PII guard is essential for preventing sensitive data from ever being stored. Second, the circuit breaker pattern is a lifesaver: it ensures that if your cache service has issues, your voice agent continues to function smoothly by simply bypassing the cache and relying on the LLM until the service recovers.
Finally, it’s crucial to prevent cache responses from leaking between different agents or use cases. You don’t want an agent for Clinic A accidentally sharing information from Clinic B. This is a form of multi-tenancy, where you create isolated cache “clusters” for each agent persona.
The End Result
Instead of your normal LLM TTFT looking like the following (this is GPT-4.1 nano):
![Baseline LLM TTFT without caching (GPT-4.1 nano)]()
With semantic caching, the response for a similar user intent looks like this:
![LLM TTFT with a semantic cache hit]()
That’s a ~60% reduction in Time-To-First-Token (TTFT), dropping from 690ms down to just 272ms. The total response duration is also cut by over 75%. This is the kind of tangible improvement that makes conversations feel fluid and natural, directly addressing the latency problem at its source.
So there you go: a quick way to cut your LLM latency by more than 50% (~400 ms in the example above).
Next, we’ll see how this can be replicated in other stages of the pipeline, such as TTS and STT. I don’t want to spend too much time optimizing the 3-tier architecture, though, as I firmly believe things are moving towards E2E Speech Models…