Semantic LLM Cache

SmartMemo

A semantic cache for LLM agents where a learned classifier — not raw cosine similarity — decides when a cached answer is safe to reuse.

+30 pts precision at equal recall vs. cosine

FAISS · SentenceTransformers · PyTorch · SQLite · Pydantic

FAISS retrieves candidates; a learned classifier decides reuse
+30 precision points at equal recall vs. a tuned cosine baseline
Bundled classifier — no training needed for a cold start
Implicit + explicit bad-hit feedback feeds gated retraining

Watch SmartMemo block an unsafe cache hit

SmartMemo live demo

Can this cached answer be reused safely?

One cached answer, one near-match request, two decisions: threshold-only reuse versus SmartMemo's classifier gate.

1. Cached first: stored response
"Change the config to enable debug logging."

2. New request: near match, opposite action
"Revise the configuration so that debug logging is turned off."

3. Compare decisions: same candidate, different safety decision

software-engineering

Cosine-only cache (common baseline), cosine threshold 0.900. This branch shows what a normal cosine-threshold semantic cache would do.

SmartMemo (smartmemo path), classifier threshold 0.950. This branch shows SmartMemo's classifier-gated decision on the same candidate.

Run it to see whether SmartMemo reuses the cached answer or blocks it.

Evidence: steps, feedback, JSON

Runtime: load embeddings
Seed: store response
Cosine: nearest match
Classifier: check equivalence
Feedback: label baseline hit

Watch the complete SmartMemo demo

An eight-minute walkthrough of the project story: why cosine-only semantic caching fails, how the classifier gate blocks unsafe reuse, how the live demo works, and how bad-hit feedback becomes retraining data.

Cosine similarity is not semantic equivalence

Semantic caches reuse an LLM response when a new prompt is “close enough” to an old one. But close in embedding space isn't the same as equivalent in meaning. “Approve the refund” and “Deny the refund” sit a hair apart by cosine — and a threshold-only cache will happily serve the wrong one.

In a support, medical, or finance setting, that is not a stale cache; it is a wrong, confident answer. SmartMemo keeps cosine as a fast candidate selector and adds a learned classifier that makes the final reuse decision.

Retrieve with cosine, decide with a classifier

1. Encode (embed): Embed the prompt with all-MiniLM-L6-v2 (384-dim). Fast, local, no API call.

2. FAISS search (retrieve): Find the nearest cached prompts by cosine similarity. This is the candidate set, not the answer.

3. Pairwise classifier (decide): A small MLP over the embedding pair scores true semantic equivalence. Below the bar → cache miss.

4. Reuse or call (serve): Equivalent → return the cached response. Otherwise call the LLM and store the new pair.
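The four steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not SmartMemo's implementation: a brute-force cosine scan stands in for FAISS, and `ToySemanticCache`, `embed`, `score_pair`, and `call_llm` are hypothetical names supplied for the example.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class ToySemanticCache:
    """Retrieve candidates by cosine, then let a pairwise scorer decide reuse."""

    def __init__(
        self,
        embed: Callable[[str], Vector],                # hypothetical embedding function
        score_pair: Callable[[Vector, Vector], float],  # hypothetical classifier score in [0, 1]
        top_k: int = 3,
        classifier_threshold: float = 0.95,
    ) -> None:
        self.embed = embed
        self.score_pair = score_pair
        self.top_k = top_k
        self.classifier_threshold = classifier_threshold
        self.entries: List[Tuple[Vector, str]] = []  # (prompt embedding, response)

    def get_or_call(self, prompt: str, call_llm: Callable[[str], str]) -> Tuple[str, bool]:
        # 1. Encode the incoming prompt.
        query = self.embed(prompt)
        # 2. Retrieve: rank cached prompts by cosine (brute force, standing in for FAISS).
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        # 3. Decide: the scorer, not the cosine value, gates reuse.
        for cached_vec, cached_response in ranked[: self.top_k]:
            if self.score_pair(query, cached_vec) >= self.classifier_threshold:
                return cached_response, True
        # 4. Serve: on a miss, call the LLM and store the new pair.
        response = call_llm(prompt)
        self.entries.append((query, response))
        return response, False
```

Keeping the scorer as an injected callable mirrors the division of labour described above: cosine only narrows the candidate set, and a separate decision function can be swapped or retrained without touching retrieval.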

backbone

Embeddings: all-MiniLM-L6-v2 (384-dim). The bundled classifier-v2 is a small MLP trained on 16,576 labeled pairs across 9 domains — local-paraphraser positives and templated hard negatives.
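The page does not say how the embedding pair is presented to the MLP. One common construction, shown here purely as an assumption, concatenates both vectors with their element-wise absolute difference and product, then applies a single hidden layer and a sigmoid. The function names and any weights passed in are hypothetical; they are not the bundled classifier-v2.

```python
import math
from typing import List

Vector = List[float]


def pair_features(u: Vector, v: Vector) -> Vector:
    """Concatenate [u, v, |u - v|, u * v]; an assumed, commonly used pair encoding."""
    diff = [abs(a - b) for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return list(u) + list(v) + diff + prod


def mlp_score(x: Vector, w1: List[Vector], b1: Vector, w2: Vector, b2: float) -> float:
    """One hidden ReLU layer followed by a sigmoid; returns a score in [0, 1]."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + bias)
        for row, bias in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    # Numerically stable sigmoid: avoid exp() overflow for large negative logits.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)
```

With 384-dimensional embeddings this encoding yields a 1,536-dimensional input. The difference and product terms give the network direct access to where two near-identical embeddings disagree, which is the signal a plain cosine score averages away.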

Classifier-v2 vs. a tuned cosine baseline

Metric	Tuned cosine	Classifier-v2
Precision (at equal recall)	0.53	0.83
Recall	0.94	0.94
F1	0.67	0.88
False positives (on the 84-pair gold set)	26	6

Gold test set: 84 held-out pairs (31 equivalent, 53 not). +30 precision points with recall held constant — the cache rejects 20 more wrong reuses without losing a single correct hit.
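The table's figures can be cross-checked from confusion counts. The true-positive count is not stated on the page; it is inferred here from the reported recall, since 29 of 31 equivalent pairs (about 0.935) is the only count that rounds to 0.94. The helper name `prf` is illustrative.

```python
from typing import Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumption: 29 true positives and 2 false negatives, inferred from recall 0.94 on 31 positives.
TP, FN = 29, 2

cosine_metrics = prf(TP, fp=26, fn=FN)      # tuned cosine baseline: 26 false positives
classifier_metrics = prf(TP, fp=6, fn=FN)   # classifier-v2: 6 false positives

for name, (p, r, f1) in [("cosine", cosine_metrics), ("classifier-v2", classifier_metrics)]:
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Under that inference the rounded values match the table: precision 0.53 versus 0.83, recall 0.94 for both, and F1 0.67 versus 0.88.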

On a deliberately adversarial high-stakes set (16 medical/legal/finance opposite-action pairs), false-positive hits dropped from 8 to 6. A generic classifier isn't infallible out of distribution, which is exactly why the feedback-and-retraining loop exists.

One call, cache-aware

get-or-call with the bundled classifier

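import asyncio

from smartmemo import SmartMemo, ClassifierConfig

async def call_llm(prompt: str) -> str:
    # Placeholder: swap in your real LLM client. The (prompt) -> str
    # signature is an assumption for this example.
    return f"(LLM response to: {prompt})"

async def main() -> None:
    cache = SmartMemo(
        domain="customer-support",
        classifier=ClassifierConfig.bundled(),
    )
    result = await cache.get_or_call(
        prompt="Summarize this customer's latest billing ticket",
        llm_function=call_llm,
    )
    # Response text, whether it came from the cache, and the classifier's score.
    print(result.response, result.was_cache_hit, result.classifier_score)

asyncio.run(main())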

closing the loop

report_bad_hit(query_id, reason=…) and implicit re-issue detection both record bad hits; export_feedback_pairs(path) writes them out as JSONL, and smartmemo retrain ships a new classifier only if it clears the validation gates.
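To make the JSONL step concrete, here is a small sketch of writing labeled pairs one JSON object per line and reading them back. The record fields (`query`, `cached_prompt`, `label`, `reason`) and both function names are hypothetical; the schema that export_feedback_pairs actually emits is not shown on this page.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def write_feedback_jsonl(pairs: Iterable[Dict], path: Path) -> int:
    """Write one JSON object per line; returns the number of records written."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for pair in pairs:
            handle.write(json.dumps(pair, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_feedback_jsonl(path: Path) -> List[Dict]:
    """Load records back, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical record: a bad hit becomes a negative (not-equivalent) training pair.
records = [
    {
        "query": "Revise the configuration so that debug logging is turned off.",
        "cached_prompt": "Change the config to enable debug logging.",
        "label": 0,
        "reason": "opposite action",
    }
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "feedback.jsonl"
    written = write_feedback_jsonl(records, out_path)
    loaded = read_feedback_jsonl(out_path)
```

JSONL suits this kind of data because each bad hit can be appended independently and a retraining job can stream the file line by line.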

Why it holds up in production

Classifier, not threshold

Cosine similarity is a candidate selector, not a proof of equivalence. A small MLP over the embedding pair makes the final reuse decision — so near-duplicate-but-opposite prompts don't share a cached answer.

Learns from its own mistakes

Implicit feedback flags a re-issued prompt within a window as a likely bad hit; report_bad_hit() records explicit ones. export_feedback_pairs() turns them into JSONL retraining data.
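Implicit re-issue detection can be pictured as a small time-window check. The sketch below assumes that "re-issued" means the same prompt text (after whitespace and case normalization) arriving again shortly after a cache hit, and that the window is 60 seconds; the page specifies neither, and `ReissueDetector` is a hypothetical name.

```python
import time
from typing import Dict, List, Optional, Tuple


class ReissueDetector:
    """Flag a cache hit as suspect when the same prompt is asked again within a window."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window_seconds = window_seconds  # assumed default, not from the source
        self._last_hit: Dict[str, Tuple[str, float]] = {}  # prompt -> (query_id, time)
        self.flagged: List[str] = []

    @staticmethod
    def _normalize(prompt: str) -> str:
        return " ".join(prompt.lower().split())

    def record_cache_hit(self, query_id: str, prompt: str, now: Optional[float] = None) -> None:
        """Remember when a cached answer was served for this prompt."""
        ts = time.monotonic() if now is None else now
        self._last_hit[self._normalize(prompt)] = (query_id, ts)

    def observe_request(self, prompt: str, now: Optional[float] = None) -> Optional[str]:
        """Return the query_id of a likely bad hit if this prompt repeats one in-window."""
        ts = time.monotonic() if now is None else now
        previous = self._last_hit.get(self._normalize(prompt))
        if previous is None:
            return None
        query_id, hit_ts = previous
        if ts - hit_ts <= self.window_seconds:
            self.flagged.append(query_id)
            return query_id
        return None
```

A quick repeat is only a heuristic signal that the served answer was unhelpful, which is why it complements explicit reports rather than replacing them.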

Gated retraining, not auto-reload

smartmemo retrain runs behind validation gates — a new classifier only ships if it passes. No silent background swaps that could regress precision in production.
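A validation gate of this kind boils down to a comparison that must pass before a model is promoted. The criteria below (an absolute precision floor, no precision regression, and a small allowed recall drop) are assumptions chosen for illustration; the page does not list the gates smartmemo retrain applies, and `Metrics` and `passes_gates` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    precision: float
    recall: float


def passes_gates(
    candidate: Metrics,
    current: Metrics,
    min_precision: float = 0.80,    # assumed absolute floor
    max_recall_drop: float = 0.01,  # assumed tolerance
) -> bool:
    """Promote the candidate only if it clears every gate on the validation set."""
    return (
        candidate.precision >= min_precision
        and candidate.precision >= current.precision
        and candidate.recall >= current.recall - max_recall_drop
    )


current = Metrics(precision=0.83, recall=0.94)
better = Metrics(precision=0.86, recall=0.94)
regressed = Metrics(precision=0.80, recall=0.97)

ship_better = passes_gates(better, current)        # True: no gate is violated
ship_regressed = passes_gates(regressed, current)  # False: precision fell below current
```

Making promotion an explicit, logged decision is what separates gated retraining from a silent background swap: a candidate that trades precision for recall is rejected instead of quietly replacing the production model.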

WAL-backed SQLite

Durable, thread-safe persistence with an async get_or_call API, bounded-backoff retries, and clean async-context resource teardown.
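For readers unfamiliar with WAL mode, the sketch below shows how write-ahead logging is switched on with Python's standard sqlite3 module. The table name, columns, and `open_cache_db` helper are hypothetical; SmartMemo's actual schema is not described on this page.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import Tuple


def open_cache_db(path: Path) -> Tuple[sqlite3.Connection, str]:
    """Open a file-backed database in WAL mode and ensure the (hypothetical) table exists."""
    conn = sqlite3.connect(path, timeout=5.0)  # wait up to 5 s on a locked database
    # The pragma returns the journal mode now in effect ("wal" on success).
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache_entries ("
        "prompt TEXT PRIMARY KEY, response TEXT NOT NULL)"
    )
    conn.commit()
    return conn, mode


with tempfile.TemporaryDirectory() as tmp:
    conn, journal_mode = open_cache_db(Path(tmp) / "cache.db")
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT OR REPLACE INTO cache_entries VALUES (?, ?)",
            ("enable debug logging", "Set log_level to DEBUG."),
        )
    row = conn.execute(
        "SELECT response FROM cache_entries WHERE prompt = ?",
        ("enable debug logging",),
    ).fetchone()
    conn.close()
```

WAL mode lets readers proceed while a writer is active, which fits a cache that is read on every request and written only on misses. It requires a file-backed database; an in-memory database reports its journal mode as "memory" instead.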

Install from PyPI

pip

$ pip install "smartmemo[ml]"

Ships a pretrained classifier — no training required for a cold start. CI across Python 3.11–3.14.
