Semantic LLM Cache

SmartMemo

A semantic cache for LLM agents where a learned classifier — not raw cosine similarity — decides when a cached answer is safe to reuse.

+30 pts precision at equal recall vs. cosine

FAISS · SentenceTransformers · PyTorch · SQLite · Pydantic
  • FAISS retrieves candidates; a learned classifier decides reuse
  • +30 precision points at equal recall vs. a tuned cosine baseline
  • Bundled classifier — no training needed for a cold start
  • Implicit + explicit bad-hit feedback feeds gated retraining

Live demo

Watch SmartMemo block an unsafe cache hit

Can this cached answer be reused safely?

One cached answer, one near-match request, two decisions: threshold-only reuse versus SmartMemo's classifier gate.

Advanced thresholds
  • Cosine threshold: 90%
  • Classifier threshold: 95%

1. Cached first

Stored response

Change the config to enable debug logging.

2. New request

Near match, opposite action

Revise the configuration so that debug logging is turned off.

3. Compare decisions

Same candidate, different safety decision

software-engineering

Cosine-only cache (common baseline)
Threshold: 0.900
This branch shows what a normal cosine-threshold semantic cache would do.

SmartMemo (smartmemo path)
Classifier threshold: 0.950
This branch shows SmartMemo's classifier-gated decision on the same candidate.

Run it to see whether SmartMemo reuses the cached answer or blocks it.
Evidence: steps, feedback, JSON
  • Runtime: load embeddings
  • Seed: store response
  • Cosine: nearest match
  • Classifier: check equivalence
  • Feedback: label baseline hit

Video walkthrough

Watch the complete SmartMemo demo

An eight-minute walkthrough of the project story: why cosine-only semantic caching fails, how the classifier gate blocks unsafe reuse, how the live demo works, and how bad-hit feedback becomes retraining data.

The problem

Cosine similarity is not semantic equivalence

Semantic caches reuse an LLM response when a new prompt is “close enough” to an old one. But close in embedding space isn't the same as equivalent in meaning. “Approve the refund” and “Deny the refund” sit a hair apart by cosine — and a threshold-only cache will happily serve the wrong one.

In a support, medical, or finance setting that's not a stale cache; it's a wrong, confident answer. SmartMemo keeps cosine as a fast candidate selector and adds a learned classifier to make the final reuse decision.

How it works

Retrieve with cosine, decide with a classifier

embed

Encode

Embed the prompt with all-MiniLM-L6-v2 (384-dim) — fast, local, no API call.

retrieve

FAISS search

Find nearest cached prompts by cosine similarity — the candidate set, not the answer.

decide

Pairwise classifier

A small MLP over the embedding pair scores true semantic equivalence. A score below the classifier threshold is treated as a cache miss.

serve

Reuse or call

Equivalent → return the cached response. Otherwise call the LLM and store the new pair.
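
The four steps above can be sketched as a retrieve-then-decide lookup. This is a minimal illustration under stated assumptions, not SmartMemo's implementation: `CacheEntry`, `lookup`, `score_pair`, and `top_k` are hypothetical names, a brute-force cosine ranking stands in for FAISS, and the scorer is a stub rather than a trained model.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


@dataclass
class CacheEntry:
    """Hypothetical cache record: prompt, its embedding, and the stored response."""
    prompt: str
    embedding: Vector
    response: str


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def lookup(
    query_embedding: Vector,
    entries: Sequence[CacheEntry],
    score_pair: Callable[[Vector, Vector], float],
    classifier_threshold: float = 0.95,
    top_k: int = 5,
) -> Optional[CacheEntry]:
    """Return a reusable entry, or None for a cache miss."""
    # Retrieve: rank cached prompts by cosine similarity (brute force here).
    candidates = sorted(
        entries,
        key=lambda e: cosine(query_embedding, e.embedding),
        reverse=True,
    )[:top_k]
    # Decide: the classifier score, not the cosine value, gates reuse.
    for entry in candidates:
        if score_pair(query_embedding, entry.embedding) >= classifier_threshold:
            return entry
    return None


# Toy 2-d vectors and stub scorers stand in for real embeddings and a trained model.
entries = [
    CacheEntry("Change the config to enable debug logging.", [1.0, 0.0], "cached answer")
]
query = [0.99, 0.14]  # close to the cached prompt by cosine

blocked = lookup(query, entries, score_pair=lambda u, v: 0.10)  # low score: miss
reused = lookup(query, entries, score_pair=lambda u, v: 0.99)   # high score: hit
```

With the same high-cosine candidate, the outcome depends only on the classifier score: `blocked` is `None` and `reused` is the cached entry. On a miss, the caller would call the LLM and store the new pair.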

backbone

Embeddings: all-MiniLM-L6-v2 (384-dim). The bundled classifier-v2 is a small MLP trained on 16,576 labeled pairs across 9 domains — local-paraphraser positives and templated hard negatives.
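
To make "a small MLP over the embedding pair" concrete, here is a framework-free sketch of one possible pairwise scorer. Everything in it is an assumption: the feature layout `[u, v, |u - v|, u * v]` is a common choice for pair classifiers, not something this page specifies, and `TinyPairMLP` uses random untrained weights, so its scores are meaningless beyond showing the shape of the computation.

```python
import math
import random
from typing import List, Sequence


def pair_features(u: Sequence[float], v: Sequence[float]) -> List[float]:
    """Concatenate [u, v, |u - v|, u * v] (an assumed, commonly used layout)."""
    diff = [abs(a - b) for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return [*u, *v, *diff, *prod]


class TinyPairMLP:
    """One hidden ReLU layer followed by a sigmoid output (untrained)."""

    def __init__(self, dim: int, hidden: int = 8, seed: int = 0) -> None:
        rng = random.Random(seed)
        in_dim = 4 * dim  # four feature groups of size dim
        self.w1 = [
            [rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden)
        ]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden)]
        self.b2 = 0.0

    def score(self, u: Sequence[float], v: Sequence[float]) -> float:
        """Return a value in (0, 1); higher would mean 'more likely equivalent'."""
        x = pair_features(u, v)
        hidden = [
            max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))


model = TinyPairMLP(dim=384)  # 384 matches all-MiniLM-L6-v2
u = [0.01] * 384
v = [0.02] * 384
score = model.score(u, v)
```

In a real system the weights would come from training on labeled pairs, and the score would be compared against the classifier threshold.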

The result

Classifier-v2 vs. a tuned cosine baseline

Metric                                       Tuned cosine   Classifier-v2
Precision (at equal recall)                  0.53           0.83
Recall                                       0.94           0.94
F1                                           0.67           0.88
False positives (on the 84-pair gold set)    26             6

Gold test set: 84 held-out pairs (31 equivalent, 53 not). +30 precision points with recall held constant — the cache rejects 20 more wrong reuses without losing a single correct hit.
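
The reported metrics can be checked from the counts given above. The sketch below assumes 29 true positives, which is not stated on this page: it is the only integer count out of 31 equivalent pairs that rounds to a recall of 0.94. The false-positive counts (26 and 6) come from the results.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard definitions from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


EQUIVALENT_PAIRS = 31  # from the gold set description
TRUE_POSITIVES = 29    # assumption: inferred from recall 0.94 (29 / 31 is about 0.935)
FALSE_NEGATIVES = EQUIVALENT_PAIRS - TRUE_POSITIVES

results = {}
for name, false_positives in [("tuned cosine", 26), ("classifier-v2", 6)]:
    p, r, f1 = precision_recall_f1(TRUE_POSITIVES, false_positives, FALSE_NEGATIVES)
    results[name] = (round(p, 2), round(r, 2), round(f1, 2))
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Under that assumption the script reproduces the rounded figures: 0.53 / 0.94 / 0.67 for the tuned cosine baseline and 0.83 / 0.94 / 0.88 for classifier-v2.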

On a deliberately adversarial high-stakes set (16 medical/legal/finance opposite-action pairs), false-positive hits dropped from 8 to 6. A generic classifier isn't infallible out of distribution, which is exactly why the feedback-and-retraining loop exists.

The API

One call, cache-aware

get-or-call with the bundled classifier
from smartmemo import SmartMemo, ClassifierConfig

cache = SmartMemo(
    domain="customer-support",
    classifier=ClassifierConfig.bundled(),
)

# Run this inside an async function; call_llm is your own function that calls the LLM.
result = await cache.get_or_call(
    prompt="Summarize this customer's latest billing ticket",
    llm_function=call_llm,
)
result.response, result.was_cache_hit, result.classifier_score

closing the loop

report_bad_hit(query_id, reason=…) and implicit re-issue detection record bad hits; export_feedback_pairs(path) writes them out as JSONL, and smartmemo retrain ships a new classifier only if it clears the validation gates.
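
For readers unfamiliar with JSONL, the export step amounts to writing one JSON object per line. The sketch below is a stand-alone illustration using only the standard library; `write_feedback_jsonl` and the record fields (`cached_prompt`, `new_prompt`, `label`, `source`) are hypothetical and do not describe SmartMemo's actual export schema.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Mapping


def write_feedback_jsonl(pairs: Iterable[Mapping[str, object]], path: Path) -> int:
    """Write one JSON object per line; return the number of records written."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for pair in pairs:
            handle.write(json.dumps(pair, ensure_ascii=False) + "\n")
            count += 1
    return count


# Hypothetical record layout, chosen only for illustration.
records = [
    {
        "cached_prompt": "Change the config to enable debug logging.",
        "new_prompt": "Revise the configuration so that debug logging is turned off.",
        "label": 0,  # 0 = not equivalent, i.e. the cache hit was wrong
        "source": "explicit",
    }
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "feedback.jsonl"
    written = write_feedback_jsonl(records, out_path)
    first_line = out_path.read_text(encoding="utf-8").splitlines()[0]
```

Each line parses independently, which makes the file easy to append to and to stream into a training job.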

What I built

Why it holds up in production

Classifier, not threshold

Cosine similarity is a candidate selector, not a proof of equivalence. A small MLP over the embedding pair makes the final reuse decision — so near-duplicate-but-opposite prompts don't share a cached answer.

Learns from its own mistakes

Implicit feedback flags a re-issued prompt within a window as a likely bad hit; report_bad_hit() records explicit ones. export_feedback_pairs() turns them into JSONL retraining data.
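
One way to picture implicit feedback: if the same prompt comes back shortly after a cache hit, the cached answer probably did not help. The sketch below is an assumption-laden illustration, not SmartMemo's logic: `ReissueDetector`, the 60-second window, and the whitespace/case normalization are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ReissueDetector:
    """Flags a prompt re-issued within a time window after a cache hit."""
    window_seconds: float = 60.0  # hypothetical default
    _last_hit: Dict[str, float] = field(default_factory=dict)

    @staticmethod
    def _key(prompt: str) -> str:
        # Collapse whitespace and case so trivial edits map to the same key.
        return " ".join(prompt.lower().split())

    def record_hit(self, prompt: str, now: float) -> None:
        """Remember when a cached answer was served for this prompt."""
        self._last_hit[self._key(prompt)] = now

    def is_likely_bad_hit(self, prompt: str, now: float) -> bool:
        """True if this prompt was served from cache within the window."""
        served_at = self._last_hit.get(self._key(prompt))
        return served_at is not None and (now - served_at) <= self.window_seconds


detector = ReissueDetector(window_seconds=60.0)
detector.record_hit("Summarize the billing ticket", now=0.0)

within_window = detector.is_likely_bad_hit("summarize the  billing ticket", now=30.0)
after_window = detector.is_likely_bad_hit("Summarize the billing ticket", now=120.0)
unseen = detector.is_likely_bad_hit("A different prompt", now=10.0)
```

Timestamps are passed in explicitly so the behavior is easy to test; a production version would likely read a monotonic clock.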

Gated retraining, not auto-reload

smartmemo retrain runs behind validation gates — a new classifier only ships if it passes. No silent background swaps that could regress precision in production.
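
A validation gate of this kind can be sketched as a pure function over held-out metrics. The specific gates below (a precision floor, no precision regression, and a bounded recall drop) and their default values are assumptions made for illustration; this page does not state which checks smartmemo retrain applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    precision: float
    recall: float


def passes_gates(
    candidate: Metrics,
    current: Metrics,
    min_precision: float = 0.80,    # hypothetical absolute floor
    max_recall_drop: float = 0.01,  # hypothetical tolerance
) -> bool:
    """Ship the candidate only if every gate holds on the validation set."""
    return (
        candidate.precision >= min_precision
        and candidate.precision >= current.precision
        and candidate.recall >= current.recall - max_recall_drop
    )


current = Metrics(precision=0.83, recall=0.94)

better = passes_gates(Metrics(precision=0.86, recall=0.94), current)
precision_regressed = passes_gates(Metrics(precision=0.81, recall=0.95), current)
recall_collapsed = passes_gates(Metrics(precision=0.90, recall=0.85), current)
```

Only the first candidate would ship; the other two are rejected and the current classifier stays in place.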

WAL-backed SQLite

Durable, thread-safe persistence with an async get_or_call API, bounded-backoff retries, and clean async-context resource teardown.
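
For reference, enabling write-ahead logging in SQLite from Python's standard library looks roughly like this. The `open_wal_connection` helper, the `entries` table, and the `synchronous=NORMAL` setting are illustrative assumptions, not SmartMemo's actual schema or configuration.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_wal_connection(db_path: Path) -> sqlite3.Connection:
    """Open a file-backed SQLite database with write-ahead logging enabled."""
    conn = sqlite3.connect(db_path, timeout=5.0)
    # WAL lets readers proceed while a single writer appends to the log.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        conn.close()
        raise RuntimeError(f"WAL not available, got journal_mode={mode!r}")
    # A common pairing with WAL; an assumption, not SmartMemo's setting.
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn


with tempfile.TemporaryDirectory() as tmp:
    conn = open_wal_connection(Path(tmp) / "cache.db")
    # Hypothetical table layout for illustration only.
    conn.execute("CREATE TABLE entries (prompt TEXT PRIMARY KEY, response TEXT)")
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO entries VALUES (?, ?)", ("hello", "world"))
    row = conn.execute(
        "SELECT response FROM entries WHERE prompt = ?", ("hello",)
    ).fetchone()
    journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    conn.close()
```

WAL requires a file-backed database; an in-memory database reports its journal mode as `memory`, which is why the sketch uses a temporary directory.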

Get started

Install from PyPI

pip
$ pip install "smartmemo[ml]"

Ships with a pretrained classifier, so no training is required for a cold start. CI runs on Python 3.11–3.14.
