Semantic LLM Cache

SmartMemo

A semantic cache for LLM agents where a learned classifier — not raw cosine similarity — decides when a cached answer is safe to reuse.

+30 pts precision at equal recall vs. cosine

FAISS · SentenceTransformers · PyTorch · SQLite · Pydantic

GitHub ↗ · PyPI ↗

FAISS retrieves candidates; a learned classifier decides reuse
+30 precision points at equal recall vs. a tuned cosine baseline
Bundled classifier — no training needed for a cold start
Implicit + explicit bad-hit feedback feeds gated retraining

The problem

Cosine similarity is not semantic equivalence

Semantic caches reuse an LLM response when a new prompt is “close enough” to an old one. But close in embedding space isn't the same as equivalent in meaning. “Approve the refund” and “Deny the refund” sit a hair apart by cosine — and a threshold-only cache will happily serve the wrong one.

In a support, medical, or finance setting that's not a stale cache — it's a wrong, confident answer. SmartMemo keeps cosine as a fast candidate selector and adds a learned classifier as the decision.
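To make the failure mode concrete, here is a toy sketch of a threshold-only lookup. The 3-dimensional vectors are hand-made stand-ins (not real all-MiniLM-L6-v2 embeddings), and the 0.90 threshold and the `threshold_lookup` helper are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (assumed): the two prompts differ by one word,
# so their stand-in embeddings are nearly parallel.
cached = {
    "prompt": "Approve the refund",
    "vec": [0.90, 0.40, 0.10],
    "response": "Refund approved.",
}
deny_vec = [0.88, 0.42, 0.16]  # stand-in for "Deny the refund"


def threshold_lookup(query_vec, entry, threshold=0.90):
    """Return the cached response if cosine clears the threshold, else None."""
    if cosine(query_vec, entry["vec"]) >= threshold:
        return entry["response"]
    return None


similarity = cosine(deny_vec, cached["vec"])
served = threshold_lookup(deny_vec, cached)
print(f"cosine={similarity:.3f} served={served!r}")
```

With these toy numbers the "deny" query clears the threshold and gets the "approve" answer, which is the wrong reuse the classifier stage is meant to block.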

How it works

Retrieve with cosine, decide with a classifier

embed

Encode

Embed the prompt with all-MiniLM-L6-v2 (384-dim) — fast, local, no API call.

retrieve

FAISS search

Find nearest cached prompts by cosine similarity — the candidate set, not the answer.

decide

Pairwise classifier

A small MLP over the embedding pair scores true semantic equivalence. Below bar → cache miss.

serve

Reuse or call

Equivalent → return the cached response. Otherwise call the LLM and store the new pair.

backbone

Embeddings: all-MiniLM-L6-v2 (384-dim). The bundled classifier-v2 is a small MLP trained on 16,576 labeled pairs across 9 domains — local-paraphraser positives and templated hard negatives.
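The four steps above can be sketched in a few lines. This is a minimal, synchronous illustration under stated assumptions, not SmartMemo's implementation: brute-force cosine stands in for FAISS, `toy_embed` and `toy_classify` are contrived stand-ins for the encoder and the MLP, and `SketchCache`, `top_k`, and `bar` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


@dataclass
class Entry:
    prompt: str
    vec: List[float]
    response: str


@dataclass
class Result:
    response: str
    was_cache_hit: bool
    classifier_score: Optional[float]


class SketchCache:
    def __init__(self, embed: Callable, classify: Callable, top_k: int = 3, bar: float = 0.5):
        self.embed, self.classify = embed, classify
        self.top_k, self.bar = top_k, bar
        self.entries: List[Entry] = []

    def get_or_call(self, prompt: str, llm_function: Callable[[str], str]) -> Result:
        query = self.embed(prompt)  # embed
        # retrieve: nearest cached prompts by cosine (candidates only)
        ranked = sorted(self.entries, key=lambda e: cosine(query, e.vec), reverse=True)
        best, best_score = None, None
        for entry in ranked[: self.top_k]:  # decide: score each candidate pair
            score = self.classify(query, entry.vec)
            if best_score is None or score > best_score:
                best, best_score = entry, score
        if best is not None and best_score >= self.bar:  # serve: reuse
            return Result(best.response, True, best_score)
        response = llm_function(prompt)  # serve: call the LLM and store the pair
        self.entries.append(Entry(prompt, list(query), response))
        return Result(response, False, best_score)


# Contrived stand-ins so the sketch runs without any model downloads.
TOY_VECS = {
    "Approve the refund": [0.90, 0.40, 0.10],
    "Please approve the refund": [0.91, 0.39, 0.11],
    "Deny the refund": [0.88, 0.42, 0.16],
}


def toy_embed(prompt):
    return TOY_VECS[prompt]


def toy_classify(u, v):
    # Pretend the third coordinate encodes the action; a real classifier learns this.
    return 1.0 - min(1.0, abs(u[2] - v[2]) * 20)


llm_calls = []


def fake_llm(prompt):
    llm_calls.append(prompt)
    return f"answer:{prompt}"


cache = SketchCache(toy_embed, toy_classify)
first = cache.get_or_call("Approve the refund", fake_llm)
paraphrase = cache.get_or_call("Please approve the refund", fake_llm)
opposite = cache.get_or_call("Deny the refund", fake_llm)
```

The point of the split is that cosine only narrows the field; the reuse decision belongs to the pairwise scorer, so the paraphrase hits while the opposite-action prompt misses and goes to the LLM.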

See it run

The false positive it refuses to make

Two prompts a hair apart in embedding space, opposite in meaning. The classifier catches what the cosine threshold can't.

smartmemo · approve vs. deny (representative trace)

cache.get_or_call("Approve the customer's refund request")

The result

Classifier-v2 vs. a tuned cosine baseline

Metric	Tuned cosine	Classifier-v2
Precision (at equal recall)	0.53	0.83
Recall	0.94	0.94
F1	0.67	0.88
False positives (on the 84-pair gold set)	26	6

Gold test set: 84 held-out pairs (31 equivalent, 53 not). +30 precision points with recall held constant — the cache rejects 20 more wrong reuses without losing a single correct hit.
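The table's figures can be cross-checked from the confusion counts. One assumption here: the true-positive count is not stated on this page, so 29 is inferred from the reported recall (29/31 ≈ 0.94); the false-positive counts and the 31 equivalent pairs are as reported.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard definitions: P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


EQUIVALENT_PAIRS = 31  # reported: 31 of the 84 gold pairs are equivalent
TP = 29                # inferred from recall 0.94, not stated directly
FN = EQUIVALENT_PAIRS - TP

cosine_p, cosine_r, cosine_f1 = precision_recall_f1(TP, fp=26, fn=FN)
clf_p, clf_r, clf_f1 = precision_recall_f1(TP, fp=6, fn=FN)

print(f"tuned cosine:  P={cosine_p:.2f} R={cosine_r:.2f} F1={cosine_f1:.2f}")
print(f"classifier-v2: P={clf_p:.2f} R={clf_r:.2f} F1={clf_f1:.2f}")
```

Under that inferred count, the rounded values match the table, and the precision gap rounds to the headline 30 points.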

On a deliberately adversarial high-stakes set (16 medical/legal/finance opposite-action pairs), false-positive hits dropped from 8 to 6. A generic classifier isn't infallible out of distribution, which is exactly why the feedback-and-retraining loop exists.

The API

One call, cache-aware

get-or-call with the bundled classifier

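import asyncio

from smartmemo import SmartMemo, ClassifierConfig


async def call_llm(prompt: str) -> str:
    # Placeholder (assumed signature): swap in your real LLM client call.
    return f"(LLM response to: {prompt})"


async def main() -> None:
    cache = SmartMemo(
        domain="customer-support",
        classifier=ClassifierConfig.bundled(),
    )
    result = await cache.get_or_call(
        prompt="Summarize this customer's latest billing ticket",
        llm_function=call_llm,
    )
    # The response, whether it came from the cache, and the classifier's score.
    print(result.response, result.was_cache_hit, result.classifier_score)


asyncio.run(main())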

closing the loop

report_bad_hit(query_id, reason=…) and implicit re-issue detection record bad hits; export_feedback_pairs(path) writes them out as JSONL, and smartmemo retrain ships a new classifier only if it clears the validation gates.

What I built

Why it holds up in production

Classifier, not threshold

Cosine similarity is a candidate selector, not a proof of equivalence. A small MLP over the embedding pair makes the final reuse decision — so near-duplicate-but-opposite prompts don't share a cached answer.

Learns from its own mistakes

Implicit feedback flags a prompt that is re-issued within a time window as a likely bad hit; report_bad_hit() records explicit ones. export_feedback_pairs() turns both kinds into JSONL retraining data.
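A rough sketch of how such a feedback log could work, assuming a 60-second re-issue window and a simple row schema. `FeedbackLog`, `record_hit`, `flag_explicit`, and `export_jsonl` are hypothetical names for illustration and are not SmartMemo's API.

```python
import json
import os
import tempfile
import time
from typing import Dict, List, Optional


class FeedbackLog:
    def __init__(self, reissue_window_s: float = 60.0):
        self.window = reissue_window_s
        self._last_hit: Dict[str, dict] = {}  # prompt -> most recent cache hit
        self.bad_hits: List[dict] = []

    def record_hit(self, prompt: str, cached_prompt: str, now: Optional[float] = None) -> None:
        """Record a served cache hit; a repeat inside the window is a likely bad hit."""
        now = time.monotonic() if now is None else now
        previous = self._last_hit.get(prompt)
        if previous is not None and now - previous["at"] <= self.window:
            self.bad_hits.append(
                {"query": prompt, "cached": previous["cached"], "label": 0, "source": "implicit"}
            )
        self._last_hit[prompt] = {"cached": cached_prompt, "at": now}

    def flag_explicit(self, prompt: str, cached_prompt: str, reason: str) -> None:
        """Record a bad hit that the caller reported directly."""
        self.bad_hits.append(
            {"query": prompt, "cached": cached_prompt, "label": 0,
             "source": "explicit", "reason": reason}
        )

    def export_jsonl(self, path: str) -> int:
        """Write one JSON object per line; return the number of rows written."""
        with open(path, "w", encoding="utf-8") as handle:
            for row in self.bad_hits:
                handle.write(json.dumps(row) + "\n")
        return len(self.bad_hits)


log = FeedbackLog(reissue_window_s=60.0)
log.record_hit("Deny the refund", "Approve the refund", now=0.0)
log.record_hit("Deny the refund", "Approve the refund", now=10.0)   # inside the window
log.record_hit("Deny the refund", "Approve the refund", now=500.0)  # outside the window
log.flag_explicit("Cancel my plan", "Renew my plan", reason="opposite action")

out_path = os.path.join(tempfile.mkdtemp(), "feedback.jsonl")
written = log.export_jsonl(out_path)
```

Each exported row is a labeled negative pair, which is the shape a pairwise classifier needs for retraining.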

Gated retraining, not auto-reload

smartmemo retrain runs behind validation gates — a new classifier only ships if it passes. No silent background swaps that could regress precision in production.
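As an illustration of what a validation gate might look like, the sketch below ships a candidate only if precision does not regress and recall drops by no more than a tolerance. The specific gates, the `Metrics` type, and the `passes_gates` and `select_model` helpers are assumptions; the page does not state which checks smartmemo retrain actually applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    precision: float
    recall: float


def passes_gates(current: Metrics, candidate: Metrics,
                 min_precision_gain: float = 0.0,
                 max_recall_drop: float = 0.01) -> bool:
    """Hypothetical gates: no precision regression, bounded recall loss."""
    precision_ok = candidate.precision >= current.precision + min_precision_gain
    recall_ok = candidate.recall >= current.recall - max_recall_drop
    return precision_ok and recall_ok


def select_model(current: Metrics, candidate: Metrics) -> str:
    """Return which model should serve traffic after a retrain attempt."""
    return "candidate" if passes_gates(current, candidate) else "current"


production = Metrics(precision=0.83, recall=0.94)
better = Metrics(precision=0.86, recall=0.94)
trades_recall_away = Metrics(precision=0.90, recall=0.80)

print(select_model(production, better))              # candidate
print(select_model(production, trades_recall_away))  # current
```

The explicit accept-or-keep decision is what distinguishes this from a background auto-reload: a failed gate leaves the serving model untouched.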

WAL-backed SQLite

Durable, thread-safe persistence with an async get_or_call API, bounded-backoff retries, and clean async-context resource teardown.
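For readers unfamiliar with the pattern, here is a minimal standard-library sketch of WAL mode plus bounded-backoff retries. The table schema, the retry parameters, and the `open_store`, `with_backoff`, and `put` helpers are assumptions, and the sketch is synchronous and single-connection rather than a picture of SmartMemo's async persistence layer.

```python
import os
import sqlite3
import tempfile
import time


def open_store(path):
    """Open a SQLite file in WAL mode and make sure the table exists."""
    conn = sqlite3.connect(path, timeout=5.0)
    # The pragma returns the journal mode now in effect ("wal" for file databases).
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries (prompt TEXT PRIMARY KEY, response TEXT NOT NULL)"
    )
    conn.commit()
    return conn, mode


def with_backoff(fn, attempts=4, base_delay=0.05, max_delay=0.5):
    """Retry fn on sqlite3.OperationalError with capped exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except sqlite3.OperationalError:
            if attempt == attempts - 1:
                raise  # bounded: give up after the final attempt
            time.sleep(min(max_delay, base_delay * 2 ** attempt))


def put(conn, prompt, response):
    def write():
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT OR REPLACE INTO entries (prompt, response) VALUES (?, ?)",
                (prompt, response),
            )
    with_backoff(write)


db_path = os.path.join(tempfile.mkdtemp(), "cache.db")
conn, journal_mode = open_store(db_path)
put(conn, "hello", "world")
row = conn.execute("SELECT response FROM entries WHERE prompt = ?", ("hello",)).fetchone()
conn.close()
```

WAL lets readers proceed while a writer commits, and the capped retry loop handles the occasional "database is locked" error without retrying forever.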

Get started

Install from PyPI

pip

$ pip install "smartmemo[ml]"

Ships a pretrained classifier — no training required for a cold start. CI across Python 3.11–3.14.
