Semantic LLM Cache
SmartMemo
A semantic cache for LLM agents where a learned classifier — not raw cosine similarity — decides when a cached answer is safe to reuse.
+30 pts
precision at equal recall vs. cosine
- FAISS retrieves candidates; a learned classifier decides reuse
- +30 precision points at equal recall vs. a tuned cosine baseline
- Bundled classifier — no training needed for a cold start
- Implicit + explicit bad-hit feedback feeds gated retraining
The problem
Cosine similarity is not semantic equivalence
Semantic caches reuse an LLM response when a new prompt is “close enough” to an old one. But close in embedding space isn't the same as equivalent in meaning. “Approve the refund” and “Deny the refund” sit a hair apart by cosine — and a threshold-only cache will happily serve the wrong one.
In a support, medical, or finance setting, that is not a stale cache entry; it is a wrong, confident answer. SmartMemo keeps cosine as a fast candidate selector and lets a learned classifier make the reuse decision.
How it works
Retrieve with cosine, decide with a classifier
Encode
Embed the prompt with all-MiniLM-L6-v2 (384-dim) — fast, local, no API call.
FAISS search
Find nearest cached prompts by cosine similarity — the candidate set, not the answer.
Pairwise classifier
A small MLP over the embedding pair scores true semantic equivalence. Below bar → cache miss.
Reuse or call
Equivalent → return the cached response. Otherwise call the LLM and store the new pair.
backbone
Embeddings: all-MiniLM-L6-v2 (384-dim). The bundled classifier-v2 is a small MLP trained on 16,576 labeled pairs across 9 domains — local-paraphraser positives and templated hard negatives.
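The four steps above can be sketched in a few lines. This is a toy illustration, not SmartMemo's implementation: `toy_embed` is a bag-of-words stand-in for all-MiniLM-L6-v2, a brute-force sort stands in for the FAISS index, and `toy_score_pair` is a hand-written rule standing in for the learned MLP. Every name here, including `TwoStageCache` and `lookup_or_call`, is hypothetical.

```python
import math

VOCAB = ["approve", "deny", "the", "refund", "please"]  # toy vocabulary (assumption)


def toy_embed(text):
    """Bag-of-words stand-in for a real sentence encoder."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def toy_score_pair(u, v):
    """Hand-written stand-in for the learned classifier: the two action
    dimensions ("approve", "deny") must agree for a pair to count as equivalent."""
    return 1.0 if u[:2] == v[:2] else 0.0


class TwoStageCache:
    def __init__(self, embed=toy_embed, score_pair=toy_score_pair,
                 top_k=3, reuse_bar=0.5):
        self.embed = embed
        self.score_pair = score_pair
        self.top_k = top_k
        self.reuse_bar = reuse_bar
        self.entries = []  # list of (embedding, response)

    def lookup_or_call(self, prompt, llm):
        query = self.embed(prompt)
        # Stage 1: cosine only ranks candidates (brute force here, FAISS in the real system).
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        # Stage 2: the pairwise scorer, not the cosine value, decides reuse.
        for vec, response in ranked[: self.top_k]:
            if self.score_pair(query, vec) >= self.reuse_bar:
                return response, True
        # No candidate cleared the bar: call the LLM and store the new pair.
        response = llm(prompt)
        self.entries.append((query, response))
        return response, False


cache = TwoStageCache()
fake_llm = lambda p: f"answer to: {p}"
print(cache.lookup_or_call("approve the refund", fake_llm))         # miss, stored
print(cache.lookup_or_call("please approve the refund", fake_llm))  # hit on the stored answer
print(cache.lookup_or_call("deny the refund", fake_llm))            # miss, despite cosine of about 0.67
```

The last call is the case a threshold-only cache gets wrong: the candidate is retrieved because it is close, then rejected because the scorer says the pair is not equivalent.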
See it run
The false positive it refuses to make
Two prompts a hair apart in embedding space, opposite in meaning. The classifier catches what the cosine threshold can't.
cache.get_or_call("Approve the customer's refund request")
The result
Classifier-v2 vs. a tuned cosine baseline
| Metric | Tuned cosine | Classifier-v2 |
|---|---|---|
| Precision (at equal recall) | 0.53 | 0.83 |
| Recall | 0.94 | 0.94 |
| F1 | 0.67 | 0.88 |
| False positives (on the 84-pair gold set) | 26 | 6 |
Gold test set: 84 held-out pairs (31 equivalent, 53 not). +30 precision points with recall held constant — the cache rejects 20 more wrong reuses without losing a single correct hit.
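The table's figures can be reproduced from raw counts. The false-positive counts (26 and 6) and the 31 equivalent pairs come from the gold set; the true-positive count of 29 is not stated anywhere in the source and is inferred here, since it is the only count out of 31 that rounds to a recall of 0.94.

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


EQUIVALENT_PAIRS = 31   # from the gold set description
TRUE_POSITIVES = 29     # inferred from recall 0.94, not stated in the source
FALSE_NEGATIVES = EQUIVALENT_PAIRS - TRUE_POSITIVES

for name, false_positives in [("Tuned cosine", 26), ("Classifier-v2", 6)]:
    p, r, f1 = precision_recall_f1(TRUE_POSITIVES, false_positives, FALSE_NEGATIVES)
    print(f"{name}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")

# Tuned cosine: precision=0.53 recall=0.94 F1=0.67
# Classifier-v2: precision=0.83 recall=0.94 F1=0.88
```

Because both systems are compared at the same recall, the true positives and false negatives are identical in both rows; the whole precision gain comes from the 20 false positives the classifier rejects.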
On a deliberately adversarial high-stakes set (16 medical/legal/finance opposite-action pairs), false-positive hits dropped from 8 to 6. A generic classifier isn't infallible out of distribution, which is exactly why the feedback-and-retraining loop exists.
The API
One call, cache-aware
from smartmemo import SmartMemo, ClassifierConfig

cache = SmartMemo(
    domain="customer-support",
    classifier=ClassifierConfig.bundled(),
)

result = await cache.get_or_call(
    prompt="Summarize this customer's latest billing ticket",
    llm_function=call_llm,
)
result.response, result.was_cache_hit, result.classifier_score
closing the loop
report_bad_hit(query_id, reason=…) and implicit re-issue detection record bad hits; export_feedback_pairs(path) turns them into JSONL, and smartmemo retrain ships a new classifier only if it clears the validation gates.
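One way to picture implicit re-issue detection: if a prompt that was just served from cache comes back within a short window, the earlier hit probably did not help. The sketch below is an assumption-heavy illustration, not SmartMemo's logic. The window length, the exact-match normalization, the record fields, and the names `ReissueDetector`, `record_hit`, and `observe` are all hypothetical, and the JSONL layout is not the library's export schema.

```python
import json
import time


class ReissueDetector:
    """Flags a cache hit as a likely bad hit when the same (normalized) prompt
    is seen again within `window_s` seconds. Heuristic and schema are assumed."""

    def __init__(self, window_s=60.0):
        self.window_s = window_s
        self._last_hit = {}  # normalized prompt -> (timestamp, cached prompt served)
        self.feedback = []   # negative pairs collected for retraining

    @staticmethod
    def _norm(prompt):
        return " ".join(prompt.lower().split())

    def record_hit(self, prompt, cached_prompt, now=None):
        ts = time.monotonic() if now is None else now
        self._last_hit[self._norm(prompt)] = (ts, cached_prompt)

    def observe(self, prompt, now=None):
        """Call for every incoming prompt; returns True if it looks like a re-issue."""
        ts = time.monotonic() if now is None else now
        hit = self._last_hit.pop(self._norm(prompt), None)
        if hit is None or ts - hit[0] > self.window_s:
            return False
        self.feedback.append(
            {"prompt": prompt, "cached_prompt": hit[1], "label": 0, "source": "implicit"}
        )
        return True

    def to_jsonl(self):
        # One JSON object per line, the usual JSONL convention.
        return "\n".join(json.dumps(record) for record in self.feedback)


detector = ReissueDetector(window_s=30)
detector.record_hit("Deny the refund", cached_prompt="Approve the refund", now=100.0)
print(detector.observe("deny the refund", now=110.0))  # re-issued 10 s later
print(detector.to_jsonl())
```

Each flagged pair becomes a labeled negative example, which is the kind of data a retraining step can consume.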
What I built
Why it holds up in production
Classifier, not threshold
Cosine similarity is a candidate selector, not a proof of equivalence. A small MLP over the embedding pair makes the final reuse decision — so near-duplicate-but-opposite prompts don't share a cached answer.
Learns from its own mistakes
Implicit feedback flags a re-issued prompt within a window as a likely bad hit; report_bad_hit() records explicit ones. export_feedback_pairs() turns them into JSONL retraining data.
Gated retraining, not auto-reload
smartmemo retrain runs behind validation gates — a new classifier only ships if it passes. No silent background swaps that could regress precision in production.
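The source does not list the gates themselves, so the following is only a sketch of what a validation gate can look like. It assumes three checks: an absolute precision floor, no precision regression against the current model, and a small tolerance on recall. `Metrics`, `passes_gates`, and the default thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    precision: float
    recall: float


def passes_gates(current, candidate, min_precision=0.80, max_recall_drop=0.01):
    """Return True only if the candidate clears every (assumed) gate."""
    checks = [
        candidate.precision >= min_precision,                    # absolute floor
        candidate.precision >= current.precision,                # no precision regression
        candidate.recall >= current.recall - max_recall_drop,    # recall roughly held
    ]
    return all(checks)


current = Metrics(precision=0.83, recall=0.94)
print(passes_gates(current, Metrics(precision=0.86, recall=0.94)))  # better precision, same recall
print(passes_gates(current, Metrics(precision=0.90, recall=0.85)))  # precision bought with recall
```

The point of gating on several metrics at once is that a retrained model cannot ship by trading one away for another.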
WAL-backed SQLite
Durable, thread-safe persistence with an async get_or_call API, bounded-backoff retries, and clean async-context resource teardown.
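For readers unfamiliar with WAL mode, here is a minimal sketch using Python's standard `sqlite3` module. It shows how write-ahead logging is switched on and one possible shape for a bounded-backoff retry. The table layout, the helper names `open_store` and `put_with_retry`, and the retry parameters are assumptions; the retry is applied to database errors here, which may not be where SmartMemo applies it.

```python
import os
import sqlite3
import tempfile
import time


def open_store(path):
    conn = sqlite3.connect(path, timeout=5.0)
    # WAL lets readers proceed while a single writer appends to the log.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache (prompt TEXT PRIMARY KEY, response TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def put_with_retry(conn, prompt, response, attempts=4, base_delay=0.05, max_delay=0.5):
    """Retry on OperationalError (for example a locked database) with capped exponential backoff."""
    for attempt in range(attempts):
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT OR REPLACE INTO cache (prompt, response) VALUES (?, ?)",
                    (prompt, response),
                )
            return
        except sqlite3.OperationalError:
            if attempt == attempts - 1:
                raise  # bounded: give up after the last attempt
            time.sleep(min(max_delay, base_delay * 2 ** attempt))


path = os.path.join(tempfile.mkdtemp(), "cache.db")
conn = open_store(path)
put_with_retry(conn, "hello", "world")
print(conn.execute("PRAGMA journal_mode").fetchone()[0])
conn.close()
```

WAL mode is a property of the database file, so it persists across connections once set; it needs an on-disk database and has no effect on an in-memory one.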
Get started
Install from PyPI
$ pip install "smartmemo[ml]"
Ships a pretrained classifier — no training required for a cold start. CI across Python 3.11–3.14.