Semantic LLM Cache
SmartMemo
A semantic cache for LLM agents where a learned classifier — not raw cosine similarity — decides when a cached answer is safe to reuse.
+30 pts
precision at equal recall vs. cosine
- FAISS retrieves candidates; a learned classifier decides reuse
- +30 precision points at equal recall vs. a tuned cosine baseline
- Bundled classifier — no training needed for a cold start
- Implicit + explicit bad-hit feedback feeds gated retraining
Live demo
Watch SmartMemo block an unsafe cache hit
Can this cached answer be reused safely?
One cached answer, one near-match request, two decisions: threshold-only reuse versus SmartMemo's classifier gate.
1. Cached first
Stored response
Change the config to enable debug logging.
2. New request
Near match, opposite action
Revise the configuration so that debug logging is turned off.
3. Compare decisions
Same candidate, different safety decision
Cosine-only cache (common baseline), cosine threshold 0.900
This branch shows what a normal cosine-threshold semantic cache would do.
SmartMemo (SmartMemo path), classifier threshold 0.950
This branch shows SmartMemo's classifier-gated decision on the same candidate.
Evidence: steps, feedback, JSON
- Runtime: load embeddings
- Seed: store response
- Cosine: nearest match
- Classifier: check equivalence
- Feedback: label baseline hit
Video walkthrough
Watch the complete SmartMemo demo
An eight-minute walkthrough of the project story: why cosine-only semantic caching fails, how the classifier gate blocks unsafe reuse, how the live demo works, and how bad-hit feedback becomes retraining data.
The problem
Cosine similarity is not semantic equivalence
Semantic caches reuse an LLM response when a new prompt is “close enough” to an old one. But close in embedding space isn't the same as equivalent in meaning. “Approve the refund” and “Deny the refund” sit a hair apart by cosine — and a threshold-only cache will happily serve the wrong one.
In a support, medical, or finance setting that's not a stale cache — it's a wrong, confident answer. SmartMemo keeps cosine as a fast candidate selector and adds a learned classifier as the decision.
How it works
Retrieve with cosine, decide with a classifier
Encode
Embed the prompt with all-MiniLM-L6-v2 (384-dim) — fast, local, no API call.
FAISS search
Find nearest cached prompts by cosine similarity — the candidate set, not the answer.
Pairwise classifier
A small MLP over the embedding pair scores true semantic equivalence. A score below the decision threshold is treated as a cache miss.
Reuse or call
Equivalent → return the cached response. Otherwise call the LLM and store the new pair.
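The four steps above can be sketched in plain Python. This is a minimal illustration, not SmartMemo's implementation: `GatedCache`, `embed`, `score_pair`, `reuse_bar`, and `top_k` are hypothetical names, a brute-force cosine ranking stands in for the FAISS search, and the encoder and classifier are passed in as plain callables.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Optional

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class GatedCache:
    embed: Callable[[str], Vector]                 # step 1: encoder (stand-in)
    score_pair: Callable[[Vector, Vector], float]  # step 3: pairwise classifier (stand-in)
    reuse_bar: float = 0.95                        # assumed decision threshold
    top_k: int = 3                                 # assumed candidate count
    entries: list[tuple[Vector, str]] = field(default_factory=list)

    def lookup(self, prompt: str) -> Optional[str]:
        query = self.embed(prompt)
        # Step 2: brute-force cosine ranking stands in for the FAISS search.
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        for vector, response in ranked[: self.top_k]:
            # Step 3: the classifier score, not the cosine, decides reuse.
            if self.score_pair(query, vector) >= self.reuse_bar:
                return response
        return None

    def get_or_call(self, prompt: str, llm: Callable[[str], str]) -> tuple[str, bool]:
        cached = self.lookup(prompt)
        if cached is not None:
            return cached, True          # step 4: equivalent, reuse the cached response
        response = llm(prompt)           # step 4: miss, call the LLM and store the pair
        self.entries.append((self.embed(prompt), response))
        return response, False
```

The design point is that cosine only orders the candidates. A candidate with a high cosine score is still rejected if the pairwise score falls below the bar.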
backbone
Embeddings: all-MiniLM-L6-v2 (384-dim). The bundled classifier-v2 is a small MLP trained on 16,576 labeled pairs across 9 domains — local-paraphraser positives and templated hard negatives.
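To make "a small MLP over the embedding pair" concrete, here is a dependency-free forward pass. Everything in it is an assumption for illustration: the feature layout (u, v, |u - v|, u * v), the hidden size, and the random untrained weights are not taken from classifier-v2, and `pair_features` and `TinyMLP` are hypothetical names.

```python
import math
import random


def pair_features(u, v):
    """Concatenate u, v, |u - v| and u * v (an assumed feature layout)."""
    diff = [abs(a - b) for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return list(u) + list(v) + diff + prod


class TinyMLP:
    """One hidden ReLU layer and a sigmoid output; weights are random, not trained."""

    def __init__(self, embed_dim, hidden=8, seed=0):
        rng = random.Random(seed)
        in_dim = 4 * embed_dim  # matches the length of pair_features
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.gauss(0.0, 0.1) for _ in range(hidden)]
        self.b2 = 0.0

    def score(self, u, v):
        x = pair_features(u, v)
        # Hidden layer: affine transform followed by ReLU.
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        # Output layer: a single logit squashed to a score in (0, 1).
        logit = sum(w * hi for w, hi in zip(self.w2, h)) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))
```

With 384-dimensional embeddings this layout would give the network a 1,536-dimensional input. A trained model would replace the random weights with ones fit on labeled pairs.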
The result
Classifier-v2 vs. a tuned cosine baseline
| Metric | Tuned cosine | Classifier-v2 |
|---|---|---|
| Precision at equal recall | 0.53 | 0.83 |
| Recall | 0.94 | 0.94 |
| F1 | 0.67 | 0.88 |
| False positives on the 84-pair gold set | 26 | 6 |
Gold test set: 84 held-out pairs (31 equivalent, 53 not). +30 precision points with recall held constant — the cache rejects 20 more wrong reuses without losing a single correct hit.
On a deliberately adversarial high-stakes set (16 medical/legal/finance opposite-action pairs), false-positive hits dropped from 8 to 6. A generic classifier is not infallible out of distribution, which is exactly why the feedback-and-retraining loop exists.
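The gold-set figures can be checked with a few lines of arithmetic. One value below is an assumption rather than a reported number: the true-positive count of 29 is inferred from a recall of 0.94 on 31 equivalent pairs. The false-positive counts (26 and 6) come from the table.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard definitions computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


EQUIVALENT_PAIRS = 31             # from the gold set description
TRUE_POSITIVES = 29               # assumed: 0.94 * 31 rounds to 29
FALSE_NEGATIVES = EQUIVALENT_PAIRS - TRUE_POSITIVES

for name, false_positives in [("tuned cosine", 26), ("classifier-v2", 6)]:
    p, r, f1 = precision_recall_f1(TRUE_POSITIVES, false_positives, FALSE_NEGATIVES)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# tuned cosine: precision=0.53 recall=0.94 f1=0.67
# classifier-v2: precision=0.83 recall=0.94 f1=0.88
```

Under that assumption the table is internally consistent: 29/55 and 29/35 give the two precision values, and the gap between them is the 30-point improvement.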
The API
One call, cache-aware
    from smartmemo import SmartMemo, ClassifierConfig

    cache = SmartMemo(
        domain="customer-support",
        classifier=ClassifierConfig.bundled(),
    )

    result = await cache.get_or_call(
        prompt="Summarize this customer's latest billing ticket",
        llm_function=call_llm,
    )

    result.response, result.was_cache_hit, result.classifier_score
closing the loop
report_bad_hit(query_id, reason=…) records explicit bad hits, and implicit re-issue detection records likely ones; export_feedback_pairs(path) writes them out as JSONL, and smartmemo retrain ships a new classifier only if it clears the validation gates.
What I built
Why it holds up in production
Classifier, not threshold
Cosine similarity is a candidate selector, not a proof of equivalence. A small MLP over the embedding pair makes the final reuse decision — so near-duplicate-but-opposite prompts don't share a cached answer.
Learns from its own mistakes
Implicit feedback flags a re-issued prompt within a window as a likely bad hit; report_bad_hit() records explicit ones. export_feedback_pairs() turns them into JSONL retraining data.
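A rough sketch of how the two feedback channels could feed one JSONL file. This is not SmartMemo's code: `FeedbackLog` and its methods are hypothetical, the 60-second window and exact-string re-issue match are assumptions, and the JSONL field names are invented for illustration.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FeedbackLog:
    window_seconds: float = 60.0                     # assumed re-issue window
    recent_hits: dict = field(default_factory=dict)  # prompt -> (time, cached prompt)
    bad_pairs: list = field(default_factory=list)

    def note_hit(self, prompt: str, cached_prompt: str, now: Optional[float] = None) -> None:
        """Remember that `prompt` was just answered from the cache."""
        now = time.monotonic() if now is None else now
        self.recent_hits[prompt] = (now, cached_prompt)

    def note_request(self, prompt: str, now: Optional[float] = None) -> bool:
        """Flag the earlier hit if the same prompt comes back inside the window."""
        now = time.monotonic() if now is None else now
        hit = self.recent_hits.pop(prompt, None)
        if hit is None or now - hit[0] > self.window_seconds:
            return False
        self._add(prompt, hit[1], source="implicit", reason="re-issued")
        return True

    def flag_explicit(self, prompt: str, cached_prompt: str, reason: str) -> None:
        """Record a bad hit that a caller reported directly."""
        self._add(prompt, cached_prompt, source="explicit", reason=reason)

    def _add(self, prompt: str, cached_prompt: str, source: str, reason: str) -> None:
        # label 0 means "not equivalent": each bad hit becomes a negative pair.
        self.bad_pairs.append({"prompt_a": prompt, "prompt_b": cached_prompt,
                               "label": 0, "source": source, "reason": reason})

    def export_jsonl(self, path: str) -> int:
        """Write one JSON object per line and return the number of pairs written."""
        with open(path, "w", encoding="utf-8") as fh:
            for pair in self.bad_pairs:
                fh.write(json.dumps(pair) + "\n")
        return len(self.bad_pairs)
```

Timestamps are injectable through `now` so the window logic can be tested without sleeping.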
Gated retraining, not auto-reload
smartmemo retrain runs behind validation gates — a new classifier only ships if it passes. No silent background swaps that could regress precision in production.
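The page does not spell out what the validation gates check, so the following is only a sketch of the general pattern: compare a candidate classifier's held-out metrics against the current one and refuse to ship on any regression. `Metrics`, `passes_gates`, and both threshold defaults are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    precision: float
    recall: float


def passes_gates(candidate: Metrics, current: Metrics,
                 min_precision: float = 0.80,
                 max_recall_drop: float = 0.01) -> bool:
    """Return True only if the candidate clears every gate (all thresholds assumed)."""
    clears_floor = candidate.precision >= min_precision
    no_precision_regression = candidate.precision >= current.precision
    recall_holds = candidate.recall >= current.recall - max_recall_drop
    return clears_floor and no_precision_regression and recall_holds


current = Metrics(precision=0.83, recall=0.94)
candidate = Metrics(precision=0.86, recall=0.95)
print("ship" if passes_gates(candidate, current) else "keep current model")
```

Keeping the decision in one pure function makes the gate easy to audit and to run in CI before any model file is replaced.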
WAL-backed SQLite
Durable, thread-safe persistence with an async get_or_call API, bounded-backoff retries, and clean async-context resource teardown.
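For readers unfamiliar with WAL mode, here is a minimal sketch using Python's standard `sqlite3` module. The table layout, `open_store`, and `with_backoff` are hypothetical, and applying the bounded backoff to locked-database errors is an assumption; the page does not say which operations SmartMemo retries.

```python
import sqlite3
import time


def open_store(path: str) -> sqlite3.Connection:
    """Open a SQLite file in write-ahead-log mode with a simple cache table."""
    conn = sqlite3.connect(path, timeout=5.0)
    conn.execute("PRAGMA journal_mode=WAL")    # readers no longer block the writer
    conn.execute("PRAGMA synchronous=NORMAL")  # common pairing with WAL
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache ("
        "prompt TEXT PRIMARY KEY, response TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def with_backoff(fn, attempts=4, base_delay=0.05, max_delay=0.5, sleep=time.sleep):
    """Retry `fn` on sqlite3.OperationalError with exponential, capped delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except sqlite3.OperationalError:
            if attempt == attempts - 1:
                raise  # attempts are bounded: give up after the last one
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

Both the number of attempts and the per-attempt delay are capped, so a persistently locked database surfaces as an error instead of stalling the caller.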
Get started
Install from PyPI
$ pip install "smartmemo[ml]"
Ships with a pretrained classifier, so no training is required for a cold start. CI runs across Python 3.11–3.14.