Semantic caches save tokens until the world changes — then they quietly serve stale answers. YORO is a drop-in OpenAI-compatible proxy that invalidates on change and, instead of re-deriving from scratch, replays the reasoning it already paid for. Every claim below is measured, on real models, with the benchmark released alongside.
We swept drift rate (the fraction of recurring tasks whose true answer changes) on a clean, embedding-separable workload at matched thresholds. A no-invalidation semantic cache doesn't degrade politely: at just 5% drift, over half of its cache hits are already wrong, because popular items drift too — and every later hit on a drifted item serves the dead answer. YORO's dependency invalidation holds staleness at ~0 across the whole range.
Read the two panels together: the naive cache keeps its ~93% hit rate by force-fitting look-alikes — and its accuracy collapses to 0.29. YORO gives up hit rate (0.89 → 0.33 as near-misses rise) and keeps accuracy at ~1.00. That is the safety/savings trade, measured — not a free lunch, an honest dial.
Add perfect invalidation and something surprising happens in agent workloads: accuracy still collapses. When the method behind an answer lives in earlier interactions (the normal case for long-running agents — "recompute it for the new numbers"), dropping the stale entry forces a re-derivation without the method. The engine re-derives wrong, caches it, and serves that. We call it re-poisoning — and it's measurable. Splitting every wrong same-entity serve by whether the served answer was ever correct:
The no-invalidation cache fails outdated-heavy (genuine staleness). The invalidating cache fails ~99% re-poisoned — its invalidation is flawless; the failure is re-derivation, and note it manufactures more poison than the naive cache (3,369 vs 1,671): invalidation alone converts staleness into garbage production. Accuracy can't tell any of this apart; the taxonomy can. And the third bar is the resolution — replay makes the re-derivation succeed, so both failure modes collapse to 169 wrong serves total (§3).
YORO stores the reasoning trace alongside every answer. When a dependency changes, it doesn't serve the frozen answer or re-derive blind — it replays the validated method against the new inputs: a short, exploration-free completion. Both replay tiers stay flat above ~90% accuracy while every alternative collapses; and the model's reasoning-effort setting becomes a cost dial.
Honest scope: the stateless collapse here is by construction — these re-asks reference a procedure established earlier, the regime where memory is necessary. If every request restates its full context, a plain cache with invalidation is fine (§1 — YORO holds 1.00 accuracy there too). Replay is validated on multi-step arithmetic procedures so far; non-numeric procedures are future work. 4-bit quantization biases against replay (more arithmetic slips) — it dominates anyway. And the invalidation signal isn't an oracle: strip it and YORO degrades gracefully — staleness ≈ the share of missed signals, converging exactly to the naive cache's 0.866 at zero signal. With the signal: ~zero.
One gate, three tiers. Every request is embedded, matched against the case store, and routed to the cheapest tier that's safe:
Return the cached answer. Zero model calls, zero tokens. Dependency fingerprints scope every entry — a changed dep means this tier is off the table.
Inject the stored reasoning trace: “apply this validated procedure to the new inputs.” Short output, no re-exploration. Reasoning effort dials cost vs accuracy (96%@21% or 92%@10% of no-cache tokens). Shipped in the library and benchmark today; the proxy tier lands in the next release.
Full reasoning on the upstream model. The trace and answer are cached with dependency fingerprints — so you only reason once.
X-YORO-Cache are explicit
opt-ins.
X-YORO-Cache: HIT|MISS|SKIP +
similarity; /yoro/stats has running totals.
One base-URL change. Works with any OpenAI-compatible client or upstream (vLLM, llama.cpp server, OpenRouter, …).
# install (until the PyPI release: pip install git+https://github.com/ChaitanyaPinapaka/yoro-cache)
pip install "yoro-cache[embed]"
# example upstream: llama.cpp serving a local GGUF model
brew install llama.cpp # or your platform's build
llama-server -m your-model.gguf --port 8000
# run the proxy in front of it
YORO_UPSTREAM=http://127.0.0.1:8000/v1 yoro serve # listens on :8400
# point your client at it — that's the whole integration
export OPENAI_BASE_URL=http://127.0.0.1:8400/v1
# 1. serve a local model via llama.cpp
# (example: the one this page was tested with)
llama-server -hf deepreinforce-ai/Ornith-1.0-35B-GGUF \
--port 8000
# 2. put YORO in front of it
YORO_UPSTREAM=http://127.0.0.1:8000/v1 yoro serve
# 3. point OpenCode at the proxy — opencode.json:
{ "provider": { "yoro": {
"npm": "@ai-sdk/openai-compatible",
"options": { "baseURL": "http://127.0.0.1:8400/v1" },
"models": { "ornith-35b": {} } } } }
Measured on this setup (35B Q4, M-series Mac): repeated asks serve in ~12 ms vs ~3.3 s upstream. The safe policy caches plain turns and passes tool-bearing (agentic) turns through untouched.
$ curl -s :8400/yoro/stats
{
"hit": 42, "miss": 17, "skip": 6,
"stored": 17, "served": 65,
"hit_rate": 0.646
}
Every response carries the decision in its headers, so you can audit each reuse.
| Header | Direction | Meaning |
|---|---|---|
X-YORO-Deps | request | name:fingerprint,… — entry serves only while these match |
X-YORO-Cache: 0|1 | request | force caching off / on for this call |
X-YORO-Cache | response | HIT · MISS · SKIP:<reason> |
X-YORO-Sim | response | similarity of the matched case |
| Variable | Default |
|---|---|
YORO_UPSTREAM | http://127.0.0.1:8000/v1 |
YORO_PORT | 8400 |
YORO_POLICY | safe (refuses agentic/sampled turns) · aggressive |
YORO_TAU_HIT / YORO_TAU_MISS | 0.95 / 0.6 |
YORO_EMBED | all-MiniLM-L6-v2 |
YORO_CACHE_PATH | ~/.yoro/proxy_cache.json |
For fully agentic tools, YORO can harvest the reusable
methods from your past OpenCode sessions and keep a
marked, auto-updating block in AGENTS.md —
reasoning your future sessions inherit for free:
YORO_UPSTREAM=http://127.0.0.1:8000/v1 python -m yoro.opencode_behaviors --out AGENTS.md
Safe by default: the proxy refuses to cache tool-bearing (agentic) or sampled turns — a stale hit must never corrupt an agent's file tree. Cache entries only serve while their dependency fingerprints match; change the file and the entry stops serving. No deps header? The novelty gate and conservative defaults still apply.
Every number on this page comes from a released benchmark you can run yourself: a tunable stress harness that injects drift (answers change), near-misses (look-alikes with different answers), and signal loss (invalidation fidelity) into Zipf-recurrent request streams — and scores staleness, brittleness, and the failure taxonomy against ground truth.
bench/.
outdated_rate
/ repoisoned_rate), defined by correctness
lineage and pinned by unit test.
Related work, honestly: template/thought reuse exists (Buffer of Thoughts, Metacognitive Reuse, Analogical Prompting). YORO's contribution is making reuse safe and accounted — invalidation, the staleness taxonomy, and input/output token accounting that a cheap-output claim can't hide behind.