Our LLM API bill was growing at 30% monthly. Traffic was increasing, but not as fast. When I analyzed our query logs, I discovered the real problem: users were asking the same questions in different ways.
"What is your return policy?" "How do I return something?"and "Can I get a refund?" all hit our LLM individually, generating nearly identical responses, each incurring full API costs.
Exact match caching, the obvious first solution, caught only 18% of these redundant calls. The same semantic question, phrased differently, bypasses the cache entirely.
So, I implemented semantic caching based on what the queries mean, not how they are phrased. Since its implementation, our cache hit rate has increased to 67%, reducing LLM API costs by 73%. But getting there requires solving problems that naïve implementations miss.
Traditional caching uses the query text as the cache key. This works when the requests are identical:
# Exact match caching
cache_key = hash(query_text)
if cache_key in cache:
    return cache[cache_key]
But users do not phrase their questions the same way. My analysis of 100,000 production requests found:
Only 18% were exact duplicates of previous requests
47% were semantically similar to previous requests (same intent, different wording)
35% were genuinely new queries
That 47% represented huge cost savings we were missing. Every semantically similar query triggered a full LLM call, generating a response nearly identical to one we had already computed.
Semantic caching replaces text-based keys with embedding-based similarity searches:
from datetime import datetime
from typing import Optional

class SemanticCache:
    def __init__(self, embedding_model, similarity_threshold=0.92):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.vector_store = VectorStore()      # FAISS, Pinecone, etc.
        self.response_store = ResponseStore()  # Redis, DynamoDB, etc.

    def get(self, query: str) -> Optional[str]:
        """Return a cached response if a semantically similar query exists."""
        query_embedding = self.embedding_model.encode(query)
        # Find the most similar cached query
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= self.threshold:
            cache_id = matches[0].id
            return self.response_store.get(cache_id)
        return None

    def set(self, query: str, response: str):
        """Cache a query-response pair."""
        query_embedding = self.embedding_model.encode(query)
        cache_id = generate_id()
        self.vector_store.add(cache_id, query_embedding)
        self.response_store.set(cache_id, {
            'query': query,
            'response': response,
            'timestamp': datetime.utcnow()
        })
The key insight: Instead of hashing query text, I embed queries in a vector space and find cached queries within a similarity threshold.
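To make that concrete, here is a minimal sketch of how the cache wraps an LLM call, assuming the SemanticCache class above; call_llm and my_embedding_model are placeholders for your own API client and embedding model, not part of the original implementation.

# Illustrative wiring only: `call_llm` and `my_embedding_model` are placeholders.
def call_llm(query: str) -> str:
    ...  # your LLM API call goes here

cache = SemanticCache(embedding_model=my_embedding_model, similarity_threshold=0.92)

def answer(query: str) -> str:
    cached = cache.get(query)    # embed + vector search
    if cached is not None:
        return cached            # cache hit: no LLM call, no API cost
    response = call_llm(query)   # cache miss: full LLM call
    cache.set(query, response)   # store for future similar queries
    return response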
The similarity threshold is the critical parameter. Set it too high and you’ll miss valid cache hits. Set it too low and you’ll return wrong answers.
Our initial threshold of 0.85 seemed reasonable; 85% similar should mean "same question," right?
Wrong. At 0.85, we got cache hits like:
Query: "How do I cancel my subscription?"
Cached: "How do I cancel my order?"
Similarity: 0.87
These are different questions with different answers. Returning the cached response would be incorrect.
I’ve found that the optimal thresholds vary by query type:
|
Request type |
Optimal threshold |
Justification |
|
FAQ style questions |
0.94 |
High accuracy is required; wrong answers damage credibility |
|
Search for products |
0.88 |
More tolerance for close matches |
|
Support requests |
0.92 |
Balance between coverage and accuracy |
|
Transaction Inquiries |
0.97 |
Very low fault tolerance |
I applied query-type-specific thresholds:
class AdaptiveSemanticCache(SemanticCache):
    def __init__(self, embedding_model):
        super().__init__(embedding_model)
        self.thresholds = {
            'faq': 0.94,
            'search': 0.88,
            'support': 0.92,
            'transactional': 0.97,
            'default': 0.92
        }
        self.query_classifier = QueryClassifier()

    def get_threshold(self, query: str) -> float:
        query_type = self.query_classifier.classify(query)
        return self.thresholds.get(query_type, self.thresholds['default'])

    def get(self, query: str) -> Optional[str]:
        threshold = self.get_threshold(query)
        query_embedding = self.embedding_model.encode(query)
        matches = self.vector_store.search(query_embedding, top_k=1)
        if matches and matches[0].similarity >= threshold:
            return self.response_store.get(matches[0].id)
        return None
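The QueryClassifier above is whatever intent classifier you already run; as a rough illustration only, a keyword-based stand-in might look like this (the categories and keywords are assumptions, not the production classifier):

class QueryClassifier:
    """Toy keyword-based classifier. A production system would more
    likely use a trained intent model or an LLM-based router."""

    KEYWORDS = {
        'transactional': ['order', 'payment', 'charge', 'refund status'],
        'support': ['not working', 'error', 'broken', 'help with'],
        'search': ['find', 'show me', 'looking for'],
        'faq': ['policy', 'how do i', 'what is'],
    }

    def classify(self, query: str) -> str:
        q = query.lower()
        for query_type, keywords in self.KEYWORDS.items():
            if any(keyword in q for keyword in keywords):
                return query_type
        return 'default'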
I couldn't set thresholds blindly. I needed ground truth on which query pairs actually had the same intent.
Our methodology:
Step 1: Sample query pairs. I sampled 5000 query pairs at various similarity levels (0.80-0.99).
Step 2: Human labeling. Annotators labeled each pair as "same intent" or "different intent." I used three annotators per pair and resolved disagreements by majority vote.
Step 3: Calculate precision/recall curves. For each threshold we calculated:
Precision: What fraction of cache hits had the same intent?
Recall: of the pairs with the same intent, what fraction did we cache?
def compute_precision_recall(pairs, labels, threshold):
    """Compute precision and recall for a given similarity threshold."""
    predictions = [1 if pair.similarity >= threshold else 0 for pair in pairs]
    true_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 1)
    false_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 0)
    false_negatives = sum(1 for p, l in zip(predictions, labels) if p == 0 and l == 1)
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    return precision, recall
Step 4: Choose a threshold based on the cost of errors. For FAQ queries, where wrong answers damage trust, I optimized for precision (a threshold of 0.94 gives 98% precision). For search queries, where missing a cache hit simply costs money, I optimized for recall (threshold of 0.88).
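Putting steps 3 and 4 together, threshold selection can be automated as a sweep that keeps the lowest threshold still meeting a precision target. The sketch below reuses compute_precision_recall from above; the target value and the candidate grid are illustrative assumptions, not the production settings.

def pick_threshold(pairs, labels, candidates, min_precision=0.98):
    """Return the lowest threshold whose precision stays above the target,
    i.e. maximize recall subject to a precision constraint."""
    best = None
    for threshold in sorted(candidates, reverse=True):
        precision, _recall = compute_precision_recall(pairs, labels, threshold)
        if precision >= min_precision:
            best = threshold   # precision still holds; keep lowering
        else:
            break              # precision dropped below target; stop
    return best

# Example: sweep thresholds from 0.80 to 0.99 for FAQ-labeled pairs
# faq_threshold = pick_threshold(faq_pairs, faq_labels,
#                                [0.80 + 0.01 * i for i in range(20)])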
Semantic caching adds latency: you have to embed the query and do a vector store lookup before knowing whether to call the LLM.
Our measurements:
Operation | Latency (p50) | Latency (p99)
Embed query | 12 ms | 28 ms
Vector search | 8 ms | 19 ms
Total cache lookup | 20 ms | 47 ms
LLM API call | 850 ms | 2,400 ms
The 20ms overhead is negligible compared to the 850ms LLM call we avoid on cache hits. Even on p99 the 47ms overhead is acceptable.
However, cache misses now take 20ms longer than before (embed + lookup + LLM call). At our 67% hit rate, the math works out favorably:
Before: 100% of requests × 850 ms = 850 ms average
After: (33% × 870ms) + (67% × 20ms) = 287ms + 13ms = 300ms average
Net latency improvement of 65% along with cost reduction.
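The same arithmetic generalizes to any hit rate; a small helper (using the measurements above) makes it easy to see where the numbers land as the hit rate changes:

def expected_latency_ms(hit_rate: float, lookup_ms: float = 20,
                        llm_ms: float = 850) -> float:
    """Average latency with semantic caching: hits pay only the cache
    lookup, misses pay the lookup plus the LLM call."""
    return hit_rate * lookup_ms + (1 - hit_rate) * (lookup_ms + llm_ms)

print(expected_latency_ms(0.67))  # ~300 ms at our 67% hit rate
print(expected_latency_ms(0.0))   # 870 ms: pure overhead if nothing ever hits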
Cached responses get stale. Product information changes, policies update, and yesterday's correct answer becomes today's wrong answer.
I implemented three invalidation strategies.
The first is a simple TTL based on content type:
from datetime import timedelta

TTL_BY_CONTENT_TYPE = {
    'pricing': timedelta(hours=4),       # Changes often
    'policy': timedelta(days=7),         # Changes rarely
    'product_info': timedelta(days=1),   # Daily refresh
    'general_faq': timedelta(days=14),   # Very stable
}
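A minimal sketch of how those TTLs might be enforced at read time, assuming each cache entry stores a content_type alongside the timestamp written by set() (the content_type field is an assumption, not shown in the earlier code):

from datetime import datetime, timedelta

def is_expired(entry: dict, ttl_by_type: dict = TTL_BY_CONTENT_TYPE) -> bool:
    """True if a cached entry has outlived the TTL for its content type."""
    ttl = ttl_by_type.get(entry.get('content_type'), timedelta(days=1))
    return datetime.utcnow() - entry['timestamp'] > ttl

# On a hit, treat expired entries as misses: regenerate and re-cache.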
The second is event-based invalidation: when source data changes, invalidate the associated cache entries:
class CacheInvalidator:
    def on_content_update(self, content_id: str, content_type: str):
        """Invalidate cache entries related to updated content."""
        # Find cached queries that reference this content
        affected_queries = self.find_queries_referencing(content_id)
        for query_id in affected_queries:
            self.cache.invalidate(query_id)
        self.log_invalidation(content_id, len(affected_queries))
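find_queries_referencing is not shown above; one plausible implementation, sketched here as an assumption, keeps a reverse index from content IDs to the cache entries generated from them:

from collections import defaultdict

class ContentIndex:
    """Reverse index: content_id -> cache IDs that depend on it.
    Populated at cache-write time, consulted on content updates."""

    def __init__(self):
        self._index = defaultdict(set)

    def register(self, cache_id: str, content_ids: list) -> None:
        for content_id in content_ids:
            self._index[content_id].add(cache_id)

    def find_queries_referencing(self, content_id: str) -> set:
        return self._index.get(content_id, set())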
The third is staleness detection: for answers that might become outdated without an explicit update event, I implemented periodic freshness checks:
def check_freshness(self, cached_response: dict) -> bool:
    """Check whether a cached response is still valid."""
    # Rerun the query against the current data
    fresh_response = self.generate_response(cached_response['query'])
    # Compare the semantic similarity of the two answers
    cached_embedding = self.embed(cached_response['response'])
    fresh_embedding = self.embed(fresh_response)
    similarity = cosine_similarity(cached_embedding, fresh_embedding)
    # If the answers differ significantly, invalidate
    if similarity < 0.90:
        self.cache.invalidate(cached_response['id'])
        return False
    return True
We perform freshness checks on a sample of cached records daily, catching obsolescence that TTL and event-based invalidation miss.
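A sketch of that daily sweep, assuming a check_freshness-style callable as defined above; the sample size is an illustrative choice, not the production value:

import random

def run_freshness_sweep(cached_entries: list, check_freshness, sample_size: int = 500) -> float:
    """Spot-check a random sample of cached entries and return the stale fraction."""
    sample = random.sample(cached_entries, min(sample_size, len(cached_entries)))
    stale = sum(1 for entry in sample if not check_freshness(entry))
    return stale / max(len(sample), 1)   # useful as an alerting metric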
After three months in production:
Metric | Before | After | Change
Cache hit rate | 18% | 67% | +272%
LLM API costs | $47K/mo | $12.7K/mo | -73%
Average latency | 850 ms | 300 ms | -65%
False positive rate | N/A | 0.8% | —
Customer complaints (wrong answers) | Baseline | +0.3% | Minimal increase
The 0.8% false positive rate (queries where we returned a cached response that was semantically wrong) was within acceptable limits. These cases occurred mostly at the edge of the threshold, where similarity was just above the cutoff but the intent differed slightly.
Don't use a single global threshold. Different query types have different error tolerances. Tune the threshold for each category.
Don't skip the embedding step on cache hits. You might be tempted to skip embedding when returning cached responses, but you need the embedding to find the match in the first place. That overhead is unavoidable.
Don't forget invalidation. Semantic caching without an invalidation strategy leads to stale responses that undermine user trust. Build invalidation in from day one.
Don't cache everything. Some responses should not be cached: personalized responses, time-sensitive information, transaction confirmations. Build exclusion rules:
def should_cache(self, query: str, response: str) -> bool:
    """Determine whether a response should be cached."""
    # Do not cache personalized responses
    if self.contains_personal_info(response):
        return False
    # Do not cache time-sensitive information
    if self.is_time_sensitive(query):
        return False
    # Do not cache transaction confirmations
    if self.is_transactional(query):
        return False
    return True
Semantic caching is a practical cost control technique for production LLM systems that catches the redundant calls exact-match caching misses. The key challenges are threshold selection (use query-type-specific thresholds based on precision/recall analysis) and cache invalidation (combine TTL, event-based invalidation, and staleness detection).
At a 73% cost reduction, this was our highest ROI optimization for production LLM systems. The implementation complexity is moderate, but the threshold setting requires careful attention to avoid quality degradation.
Sreenivasa Reddy Hulebeedu Reddy is a Lead Software Engineer.