Why your LLM bill is rising – and how semantic caching can cut it by 73%

Our LLM API bill was growing at 30% monthly. Traffic was increasing, but not as fast. When I analyzed our query logs, I discovered the real problem: users were asking the same questions in different ways.

"What is your return policy?" "How do I return something?"and "Can I get a refund?" all hit our LLM individually, generating nearly identical responses, each incurring full API costs.

Exact match caching, the obvious first solution, caught only 18% of these redundant calls. The same semantic question, phrased differently, bypasses the cache entirely.

So, I implemented semantic caching based on what queries mean, not how they are phrased. Since rolling it out, our cache hit rate has increased to 67%, reducing LLM API costs by 73%. But getting there required solving problems that naïve implementations miss.

Why exact match caching is insufficient

Traditional caching uses the query text as the cache key. This works when the requests are identical:

# Exact match caching
cache_key = hash(query_text)
if cache_key in cache:
    return cache[cache_key]

But users do not phrase questions identically. My analysis of 100,000 production requests found:

  • only 18% were exact duplicates of previous requests

  • 47% were semantically similar to previous requests (same intent, different wording)

  • 35% were genuinely new queries

That 47% represents huge cost savings we were missing. Every semantically similar query triggered a full LLM invocation, generating a response almost identical to one we had already computed.

Semantic Caching Architecture

Semantic caching replaces text-based keys with embedding-based similarity searches:

from datetime import datetime
from typing import Optional

class SemanticCache:
    def __init__(self, embedding_model, similarity_threshold=0.92):
        self.embedding_model = embedding_model
        self.threshold = similarity_threshold
        self.vector_store = VectorStore()      # FAISS, Pinecone, etc.
        self.response_store = ResponseStore()  # Redis, DynamoDB, etc.

    def get(self, query: str) -> Optional[str]:
        """Return a cached response if a semantically similar query exists."""
        query_embedding = self.embedding_model.encode(query)

        # Find the most similar cached query
        matches = self.vector_store.search(query_embedding, top_k=1)

        if matches and matches[0].similarity >= self.threshold:
            cache_id = matches[0].id
            record = self.response_store.get(cache_id)
            return record['response'] if record else None

        return None

    def set(self, query: str, response: str):
        """Cache a query-response pair."""
        query_embedding = self.embedding_model.encode(query)
        cache_id = generate_id()

        self.vector_store.add(cache_id, query_embedding)
        self.response_store.set(cache_id, {
            'query': query,
            'response': response,
            'timestamp': datetime.utcnow()
        })

The key insight: Instead of hashing query text, I embed queries in a vector space and find cached queries within a similarity threshold.
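
In practice the cache wraps the LLM call in a get-or-generate pattern. A minimal usage sketch (the call_llm helper and the wiring around the SemanticCache sketch above are illustrative placeholders, not part of our production code):

# Hypothetical wiring around the SemanticCache sketch above
cache = SemanticCache(embedding_model, similarity_threshold=0.92)

def answer(query: str) -> str:
    cached = cache.get(query)
    if cached is not None:
        return cached              # Cache hit: no LLM call, no API cost

    response = call_llm(query)     # Placeholder for the real LLM API client
    cache.set(query, response)     # Store for future semantically similar queries
    return response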

The Threshold Problem

The similarity threshold is the critical parameter. Set it too high and you’ll miss valid cache hits. Set it too low and you’ll return wrong answers.

Our initial threshold of 0.85 seemed reasonable; 85% similar should mean "same question," right?

Wrong. At 0.85, we got cache hits like:

  • Query: "How do I cancel my subscription?"

  • Cached: "How do I cancel my order?"

  • Similarity: 0.87

These are different questions with different answers. Returning the cached response would be incorrect.

I’ve found that the optimal thresholds vary by query type:

| Query type | Optimal threshold | Justification |
| --- | --- | --- |
| FAQ-style questions | 0.94 | High accuracy required; wrong answers damage credibility |
| Product search | 0.88 | More tolerance for close matches |
| Support requests | 0.92 | Balance between coverage and accuracy |
| Transactional inquiries | 0.97 | Very low fault tolerance |

I applied query-type-specific thresholds:

class AdaptiveSemanticCache:
    def __init__(self):
        self.thresholds = {
            'faq': 0.94,
            'search': 0.88,
            'support': 0.92,
            'transactional': 0.97,
            'default': 0.92
        }
        self.query_classifier = QueryClassifier()

    def get_threshold(self, query: str) -> float:
        query_type = self.query_classifier.classify(query)
        return self.thresholds.get(query_type, self.thresholds['default'])

    def get(self, query: str) -> Optional[str]:
        threshold = self.get_threshold(query)
        query_embedding = self.embedding_model.encode(query)

        matches = self.vector_store.search(query_embedding, top_k=1)

        if matches and matches[0].similarity >= threshold:
            record = self.response_store.get(matches[0].id)
            return record['response'] if record else None

        return None
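
The QueryClassifier itself can be lightweight. A minimal sketch, assuming a simple keyword heuristic (the keyword lists here are illustrative, not our production classifier):

class QueryClassifier:
    """Rough keyword-based classifier; illustrative only."""

    KEYWORDS = {
        'transactional': ['order status', 'payment', 'invoice', 'charge'],
        'support': ['not working', 'error', 'help', 'broken'],
        'search': ['find', 'show me', 'looking for'],
        'faq': ['policy', 'how do i', 'what is'],
    }

    def classify(self, query: str) -> str:
        q = query.lower()
        for query_type, keywords in self.KEYWORDS.items():
            if any(k in q for k in keywords):
                return query_type
        return 'default'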

Threshold setting methodology

I couldn’t set thresholds blindly. I needed ground truth about which query pairs were actually "the same."

Our methodology:

Step 1: Sample query pairs. I sampled 5000 query pairs at various similarity levels (0.80-0.99).

Step 2: Human labeling. Annotators labeled each pair as "same intent" or "different intent." I used three annotators per pair and resolved disagreements by majority vote.

Step 3: Calculate precision/recall curves. For each threshold we calculated:

  • Precision: What fraction of cache hits had the same intent?

  • Recall: Of the pairs with the same intent, what fraction did we cache?

def compute_precision_recall(pairs, labels, threshold):
    """Calculate precision and recall for a given similarity threshold."""
    predictions = [1 if pair.similarity >= threshold else 0 for pair in pairs]

    true_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 1)
    false_positives = sum(1 for p, l in zip(predictions, labels) if p == 1 and l == 0)
    false_negatives = sum(1 for p, l in zip(predictions, labels) if p == 0 and l == 1)

    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0

    return precision, recall

Step 4: Choose a threshold based on the cost of errors. For FAQ queries, where wrong answers damage trust, I optimized for precision (a threshold of 0.94 gives 98% precision). For search queries, where a missed cache hit simply costs money, I optimized for recall (threshold of 0.88).
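
Putting steps 3 and 4 together, a small sweep over candidate thresholds makes the trade-off explicit. A sketch that reuses compute_precision_recall above (the 0.98 precision target mirrors the FAQ figure; the helper itself is illustrative):

def select_threshold(pairs, labels, target_precision=0.98):
    """Pick the lowest threshold that meets the precision target,
    maximizing recall (and therefore cache hit rate) under that constraint."""
    best = None
    for threshold in [t / 100 for t in range(80, 100)]:  # 0.80 .. 0.99
        precision, recall = compute_precision_recall(pairs, labels, threshold)
        if precision >= target_precision:
            if best is None or recall > best[2]:
                best = (threshold, precision, recall)
    return best  # (threshold, precision, recall), or None if the target is unreachable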

The latency cost

Semantic caching adds latency: you have to embed the query and search the vector store before knowing whether to call the LLM.

Our measurements:

| Operation | Latency (p50) | Latency (p99) |
| --- | --- | --- |
| Embed query | 12 ms | 28 ms |
| Vector search | 8 ms | 19 ms |
| Total cache lookup | 20 ms | 47 ms |
| LLM API call | 850 ms | 2,400 ms |

The 20ms overhead is negligible compared to the 850ms LLM call we avoid on cache hits. Even on p99 the 47ms overhead is acceptable.

However, cache misses now take 20ms longer than before (embed + lookup + LLM call). At our 67% hit rate, the math works out favorably:

  • Before: 100% of requests × 850 ms = 850 ms average

  • After: (33% × 870ms) + (67% × 20ms) = 287ms + 13ms = 300ms average

Net latency improvement of 65% along with cost reduction.
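
The same back-of-the-envelope math generalizes to any hit rate. A small helper (the default numbers are our p50 measurements; your latencies will differ):

def expected_latency(hit_rate, cache_overhead_ms=20, llm_latency_ms=850):
    """Average request latency with semantic caching in front of the LLM."""
    hit = hit_rate * cache_overhead_ms                             # hits pay only the lookup
    miss = (1 - hit_rate) * (cache_overhead_ms + llm_latency_ms)   # misses pay lookup + LLM
    return hit + miss

print(expected_latency(0.67))  # ~300 ms, matching the figures above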

Cache invalidation

Cached responses get stale. Product information changes, policies get updated, and yesterday’s correct answer becomes today’s wrong answer.

I applied three invalidation strategies:

  1. Time-based TTL

Simple expiration based on content type:

from datetime import timedelta

TTL_BY_CONTENT_TYPE = {
    'pricing': timedelta(hours=4),        # Changes often
    'policy': timedelta(days=7),          # Changes rarely
    'product_info': timedelta(days=1),    # Daily refresh
    'general_faq': timedelta(days=14),    # Very stable
}
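
A minimal sketch of wiring these TTLs into a Redis-backed response store (Redis is one of the options mentioned above; this wiring is illustrative, not our exact implementation):

import json
import redis

r = redis.Redis()

def cache_with_ttl(cache_id: str, record: dict, content_type: str):
    ttl = TTL_BY_CONTENT_TYPE.get(content_type, timedelta(days=1))
    # SETEX expires the key automatically once the TTL elapses
    r.setex(cache_id, int(ttl.total_seconds()), json.dumps(record, default=str))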

  2. Event-based invalidation

When master data changes, invalidate associated cache entries:

class CacheInvalidator:
    def on_content_update(self, content_id: str, content_type: str):
        """Invalidate cache entries related to updated content."""
        # Find cached queries that reference this content
        affected_queries = self.find_queries_referencing(content_id)

        for query_id in affected_queries:
            self.cache.invalidate(query_id)

        self.log_invalidation(content_id, len(affected_queries))
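
The find_queries_referencing lookup assumes we track which source content each cached answer drew on. One way to maintain that reverse index, sketched here with an in-memory dict (a real deployment would persist it alongside the response store):

from collections import defaultdict

class ContentIndex:
    """Reverse index: content_id -> set of cache entry ids that used it."""

    def __init__(self):
        self._index = defaultdict(set)

    def record(self, cache_id: str, source_content_ids: list):
        # Called when a response is cached, with the ids of the documents
        # (product pages, policy docs, ...) the answer was grounded in
        for content_id in source_content_ids:
            self._index[content_id].add(cache_id)

    def queries_referencing(self, content_id: str) -> set:
        return self._index.get(content_id, set())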

  3. Staleness detection

For answers that might go stale without an explicit event, I implemented periodic freshness checks:

def check_freshness(self, cached_response: dict) -> bool:
    """Check whether a cached response is still valid."""
    # Rerun the query against the current data
    fresh_response = self.generate_response(cached_response['query'])

    # Compare the semantic similarity of the two answers
    cached_embedding = self.embed(cached_response['response'])
    fresh_embedding = self.embed(fresh_response)

    similarity = cosine_similarity(cached_embedding, fresh_embedding)

    # If the answers differ significantly, invalidate the entry
    if similarity < 0.90:
        self.cache.invalidate(cached_response['id'])
        return False

    return True

We run freshness checks on a sample of cached records daily, catching staleness that TTL and event-based invalidation miss.
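
A sketch of that daily job, assuming the cache exposes a way to enumerate its entries (list_entries and the sample size are illustrative assumptions):

import random

def daily_freshness_sweep(cache, sample_size=500):
    """Spot-check a random sample of cached entries for staleness."""
    entries = cache.list_entries()  # Assumed helper returning cached records
    sample = random.sample(entries, min(sample_size, len(entries)))

    invalidated = 0
    for entry in sample:
        if not cache.check_freshness(entry):
            invalidated += 1
    return invalidated  # Number of stale entries caught by the sweep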

Production results

After three months in production:

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Cache hit rate | 18% | 67% | +272% |
| LLM API costs | $47K/mo | $12.7K/mo | -73% |
| Average latency | 850 ms | 300 ms | -65% |
| False positive rate | N/A | 0.8% | |
| Customer complaints (wrong answers) | Baseline | +0.3% | Minimal increase |

The 0.8% false positive rate (queries where we returned a cached response that was semantically wrong) was within acceptable limits. These cases occurred mostly near the threshold boundary, where similarity was just above the cutoff but the intent differed slightly.

Pitfalls to avoid

Do not use a single global threshold. Different query types have different error tolerances. Adjust the thresholds for each category.

Don’t skip the embedding step on cache hits. You might be tempted to avoid embedding when returning cached responses, but the embedding is what finds the match in the first place. That overhead is unavoidable.

Don’t forget invalidation. Semantic caching without an invalidation strategy leads to stale responses that undermine user trust. Build invalidation in from day one.

Don’t cache everything. Some responses should never be cached: personalized responses, time-sensitive information, transaction confirmations. Build exclusion rules.

def should_cache(self, query: str, response: str) -> bool:
    """Determine whether a response should be cached."""
    # Do not cache personalized responses
    if self.contains_personal_info(response):
        return False

    # Do not cache time-sensitive information
    if self.is_time_sensitive(query):
        return False

    # Do not cache transaction confirmations
    if self.is_transactional(query):
        return False

    return True
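
Wiring these rules into the write path is a small guard at the top of set(); a sketch that reuses the store logic shown earlier:

def set(self, query: str, response: str):
    """Cache the pair only when the exclusion rules allow it."""
    if not self.should_cache(query, response):
        return  # Personalized, time-sensitive, or transactional: never cached

    query_embedding = self.embedding_model.encode(query)
    cache_id = generate_id()

    self.vector_store.add(cache_id, query_embedding)
    self.response_store.set(cache_id, {
        'query': query,
        'response': response,
        'timestamp': datetime.utcnow()
    })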

Key findings

Semantic caching is a practical cost-control technique for LLM systems: it catches the redundant queries that exact-match caching misses. The key challenges are threshold tuning (use query-type-specific thresholds derived from precision/recall analysis) and cache invalidation (combine TTLs, event-based invalidation, and staleness detection).

At a 73% cost reduction, this was our highest-ROI optimization for production LLM systems. The implementation complexity is moderate, but threshold tuning requires careful attention to avoid quality degradation.

Sreenivasa Reddy Hulebeedu Reddy is a Lead Software Engineer.
