Semantic caches save tokens until the world changes — then they quietly serve stale answers. YORO is a drop-in OpenAI-compatible proxy that invalidates on change and, instead of re-deriving from scratch, replays the reasoning it already paid for. Every claim below is measured, on real models, with the benchmark released alongside.
I swept drift rate (the fraction of recurring tasks whose true answer changes) on a clean, embedding-separable workload at matched thresholds. A no-invalidation semantic cache doesn't degrade politely: at just 5% drift, over half of its cache hits are already wrong, because popular items drift too — and every later hit on a drifted item serves the dead answer. YORO's dependency invalidation holds staleness at ~0 across the whole range.
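To see why a little drift hurts a no-invalidation cache so much, here is a toy simulation. It is a sketch under stated assumptions, not the released benchmark: task popularity is Zipf-like, each drifting task changes its answer once at a random point, and the invalidating cache gets a perfect change signal. The function name and all parameters are illustrative.

```python
import random

def simulate(drift_rate, n_tasks=200, n_requests=5000, seed=0):
    """Toy stream: Zipf-popular tasks, a fraction of which change answer once."""
    rng = random.Random(seed)
    weights = [1.0 / rank for rank in range(1, n_tasks + 1)]  # Zipf-like popularity
    # Pick which tasks drift, and the request index at which each one changes.
    drifting = rng.sample(range(n_tasks), round(drift_rate * n_tasks))
    drift_at = {task: rng.randrange(n_requests) for task in drifting}

    naive = {}         # task -> answer version, cached forever
    invalidating = {}  # task -> answer version, refreshed when the version changes
    stats = {"naive": [0, 0], "invalidating": [0, 0]}  # [hits, stale hits]

    for t in range(n_requests):
        task = rng.choices(range(n_tasks), weights=weights)[0]
        truth = 1 if task in drift_at and t >= drift_at[task] else 0

        # Naive cache: any previously seen task is a hit, whatever its version.
        if task in naive:
            stats["naive"][0] += 1
            stats["naive"][1] += naive[task] != truth
        else:
            naive[task] = truth

        # Invalidating cache (assumes a perfect signal): a version mismatch
        # is treated as a miss and the entry is refreshed.
        if invalidating.get(task) == truth:
            stats["invalidating"][0] += 1
        else:
            invalidating[task] = truth

    # Staleness = stale hits / hits, per cache.
    return {name: (stale / hits if hits else 0.0)
            for name, (hits, stale) in stats.items()}

if __name__ == "__main__":
    for rate in (0.0, 0.05, 0.2, 0.5):
        print(rate, simulate(rate))
```

The mechanism to watch is the one named above: once a popular task drifts, every later hit on it counts as stale for the naive cache, while the invalidating cache pays one extra miss and moves on.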
Read the two panels together: the naive cache keeps its ~93% hit rate by force-fitting look-alikes — and its accuracy collapses to 0.29. YORO gives up hit rate (0.89 → 0.33 as near-misses rise) and keeps accuracy at ~1.00. That is the safety/savings trade, measured — not a free lunch, an honest dial.
Add perfect invalidation and something surprising happens in agent workloads: accuracy still collapses. When the method behind an answer lives in earlier interactions (the normal case for long-running agents — "recompute it for the new numbers"), dropping the stale entry forces a re-derivation without the method. The engine re-derives wrong, caches it, and serves that. I call it re-poisoning — and it's measurable. Splitting every wrong same-entity serve by whether the served answer was ever correct:
The no-invalidation cache's wrong serves are mostly outdated (genuine staleness). The invalidating cache's wrong serves are ~99% re-poisoned: its invalidation is flawless, and the failure is re-derivation. Note that it manufactures more poison than the naive cache (3,369 vs 1,671), so invalidation alone converts staleness into garbage production. Accuracy can't tell any of this apart; the taxonomy can. The third bar is the resolution: replay makes the re-derivation succeed, so both failure modes collapse to 169 wrong serves total (§3).
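The split by correctness lineage can be sketched in a few lines. This is a minimal illustration, not the benchmark's code: the function names and the `truth_history` argument are hypothetical, and I am assuming the two rates are taken over wrong serves only.

```python
from collections import Counter

def classify_serve(served, truth_history):
    """Label one cache serve against the entity's ground-truth history.

    truth_history lists the entity's true answers in order,
    ending with the currently correct one.
    """
    if served == truth_history[-1]:
        return "correct"
    if served in truth_history[:-1]:
        return "outdated"     # was right once; the world moved on
    return "repoisoned"       # never right: a bad re-derivation got cached

def taxonomy(serves):
    """serves: iterable of (served_answer, truth_history) pairs."""
    counts = Counter(classify_serve(s, h) for s, h in serves)
    wrong = counts["outdated"] + counts["repoisoned"]
    # Assumption: rates are shares of wrong serves, not of all serves.
    return {
        "outdated_rate": counts["outdated"] / wrong if wrong else 0.0,
        "repoisoned_rate": counts["repoisoned"] / wrong if wrong else 0.0,
    }
```

Two caches with the same accuracy can land at opposite ends of this split, which is why the taxonomy carries information that accuracy alone does not.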
YORO stores the reasoning trace alongside every answer. When a dependency changes, it doesn't serve the frozen answer or re-derive blind — it replays the validated method against the new inputs: a short, exploration-free completion. Both replay tiers stay flat above ~90% accuracy while every alternative collapses; and the model's reasoning-effort setting becomes a cost dial.
Honest scope: the stateless collapse here is by construction. These re-asks reference a procedure established earlier, which is the regime where memory is necessary. If every request restates its full context, a plain cache with invalidation is fine (§1; YORO holds 1.00 accuracy there too). Replay is validated on multi-step arithmetic procedures so far; non-numeric procedures are future work. 4-bit quantization biases against replay (more arithmetic slips), and replay still dominates. The invalidation signal isn't an oracle either: strip it and YORO degrades gracefully, with staleness ≈ the share of missed signals, converging exactly to the naive cache's 0.866 at zero signal. With the signal: ~zero.
One gate, three tiers. Every request is embedded, matched against the case store, and routed to the cheapest tier that's safe:
Return the cached answer. Zero model calls, zero tokens. Dependency fingerprints scope every entry — a changed dep means this tier is off the table.
Inject the stored reasoning trace: “apply this validated procedure to the new inputs.” Short output, no re-exploration. Reasoning effort dials cost against accuracy: 96% accuracy at 21% of no-cache tokens, or 92% at 10%. Shipped in the library and benchmark today; the proxy tier lands in the next release.
Full reasoning on the upstream model. The trace and answer are cached with dependency fingerprints — so you only reason once.
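The gate can be pictured roughly as below. Treat it as a sketch under assumptions rather than YORO's actual routing code: the thresholds are the documented defaults, but how they map to tiers is my guess (at or above `TAU_HIT` with matching deps is reuse, below `TAU_MISS` is full reasoning, everything in between is replay), and `Case` and `route` are hypothetical names.

```python
from dataclasses import dataclass, field

TAU_HIT = 0.95   # documented default for YORO_TAU_HIT
TAU_MISS = 0.6   # documented default for YORO_TAU_MISS

@dataclass
class Case:
    answer: str
    trace: str
    deps: dict = field(default_factory=dict)  # name -> fingerprint at store time

def route(similarity, case, current_deps):
    """Pick the cheapest tier that is safe (assumed threshold semantics)."""
    if case is None or similarity < TAU_MISS:
        return "reason"   # novel request: full reasoning on the upstream model
    deps_match = all(current_deps.get(k) == v for k, v in case.deps.items())
    if similarity >= TAU_HIT and deps_match:
        return "reuse"    # serve the cached answer: zero model calls
    return "replay"       # keep the method, recompute against the new inputs
```

A changed dependency never reaches the reuse branch here, which mirrors the rule that a changed dep takes the cached-answer tier off the table.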
X-YORO-Cache are explicit opt-ins.
X-YORO-Cache: HIT|MISS|SKIP + similarity; /yoro/stats has running totals.
One base-URL change. Works with any OpenAI-compatible client or upstream (vLLM, llama.cpp server, OpenRouter, …).
# install (until the PyPI release: pip install git+https://github.com/ChaitanyaPinapaka/yoro-cache)
pip install "yoro-cache[embed]"
# example upstream: llama.cpp serving a local GGUF model
brew install llama.cpp # or your platform's build
llama-server -m your-model.gguf --port 8000
# run the proxy in front of it
YORO_UPSTREAM=http://127.0.0.1:8000/v1 yoro serve # listens on :8400
# point your client at it — that's the whole integration
export OPENAI_BASE_URL=http://127.0.0.1:8400/v1
# 1. serve a local model via llama.cpp
# (example: the one this page was tested with)
llama-server -hf deepreinforce-ai/Ornith-1.0-35B-GGUF \
--port 8000
# 2. put YORO in front of it
YORO_UPSTREAM=http://127.0.0.1:8000/v1 yoro serve
# 3. point OpenCode at the proxy — opencode.json:
{ "provider": { "yoro": {
    "npm": "@ai-sdk/openai-compatible",
    "options": { "baseURL": "http://127.0.0.1:8400/v1" },
    "models": { "ornith-35b": {} }
} } }
Measured on this setup (35B Q4, M-series Mac): repeated asks serve in ~12 ms vs ~3.3 s upstream. The safe policy caches plain turns and passes tool-bearing (agentic) turns through untouched.
$ curl -s :8400/yoro/stats
{
"hit": 42, "miss": 17, "skip": 6,
"stored": 17, "served": 65,
"hit_rate": 0.646
}
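The counters in that sample are consistent with `served = hit + miss + skip` and `hit_rate = hit / served`. That relationship is my reading of the numbers, not a documented contract, so the check below is a sketch; `hit_rate` here is a local helper, not a YORO function.

```python
import json

def hit_rate(stats):
    """Recompute the hit rate from raw counters (assumed: hit / total served)."""
    served = stats["hit"] + stats["miss"] + stats["skip"]
    return round(stats["hit"] / served, 3) if served else 0.0

# The sample payload shown above, pasted as a string.
sample = json.loads(
    '{"hit": 42, "miss": 17, "skip": 6, '
    '"stored": 17, "served": 65, "hit_rate": 0.646}'
)

if __name__ == "__main__":
    print(hit_rate(sample), sample["hit_rate"])
```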
Every response carries the decision in its headers, so you can audit each reuse.
| Header | Direction | Meaning |
|---|---|---|
| X-YORO-Deps | request | name:fingerprint,… (entry serves only while these match) |
| X-YORO-Cache: 0\|1 | request | force caching off / on for this call |
| X-YORO-Cache | response | HIT · MISS · `SKIP:<reason>` |
| X-YORO-Sim | response | similarity of the matched case |
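One way a client might build the `X-YORO-Deps` value is sketched below. The `name:fingerprint,…` layout comes from the table; the fingerprint scheme (a truncated SHA-256 of the dependency's bytes) and both helper names are assumptions for illustration, since any stable digest of the dependency would serve the same purpose.

```python
import hashlib

def fingerprint(data: bytes, length: int = 12) -> str:
    """Hypothetical fingerprint: a truncated SHA-256 hex digest of the content."""
    return hashlib.sha256(data).hexdigest()[:length]

def deps_header(deps: dict) -> str:
    """Format {name: content_bytes} as 'name:fingerprint,name:fingerprint'."""
    return ",".join(
        f"{name}:{fingerprint(blob)}" for name, blob in sorted(deps.items())
    )

# Example: scope a cached answer to the current contents of one input file.
headers = {"X-YORO-Deps": deps_header({"prices.csv": b"sku,price\nA,10\n"})}
```

With the OpenAI Python SDK, headers like these can be passed per call through the `extra_headers` argument; editing the file changes the fingerprint, so the old entry stops matching.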
| Variable | Default |
|---|---|
| YORO_UPSTREAM | http://127.0.0.1:8000/v1 |
| YORO_PORT | 8400 |
| YORO_POLICY | safe (refuses agentic/sampled turns) · aggressive |
| YORO_TAU_HIT / YORO_TAU_MISS | 0.95 / 0.6 |
| YORO_EMBED | all-MiniLM-L6-v2 |
| YORO_CACHE_PATH | ~/.yoro/proxy_cache.json |
For fully agentic tools, YORO can harvest the reusable methods from your past OpenCode sessions and keep a marked, auto-updating block in AGENTS.md, so your future sessions inherit that reasoning for free:
YORO_UPSTREAM=http://127.0.0.1:8000/v1 python -m yoro.opencode_behaviors --out AGENTS.md
Safe by default: the proxy refuses to cache tool-bearing (agentic) or sampled turns — a stale hit must never corrupt an agent's file tree. Cache entries only serve while their dependency fingerprints match; change the file and the entry stops serving. No deps header? The novelty gate and conservative defaults still apply.
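A rough picture of what the safe policy's refusal check could look like follows. It is a sketch, not the proxy's code: the field names follow the OpenAI chat-completions request shape, but the reason strings and my reading of "sampled" (an explicit `temperature` above 0, or `n` above 1) are assumptions.

```python
def skip_reason(body: dict):
    """Return a reason to bypass the cache, or None if the turn is cacheable."""
    # Tool-bearing (agentic) turns: tools offered, forced, called, or answered.
    if body.get("tools") or body.get("tool_choice"):
        return "tools"
    for message in body.get("messages", []):
        if message.get("role") == "tool" or message.get("tool_calls"):
            return "tools"
    # Sampled turns (assumed meaning): the caller asked for randomness.
    if (body.get("temperature") or 0) > 0 or (body.get("n") or 1) > 1:
        return "sampled"
    return None

def cache_header(body: dict) -> str:
    """Format the skip decision the way the X-YORO-Cache response header does."""
    reason = skip_reason(body)
    return f"SKIP:{reason}" if reason else "ELIGIBLE"  # "ELIGIBLE" is illustrative
```

Checking for tools before anything else keeps the strictest rule first: a turn that could touch an agent's file tree is passed through without ever consulting the cache.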
Every number on this page comes from a released benchmark you can run yourself: a tunable stress harness that injects drift (answers change), near-misses (look-alikes with different answers), and signal loss (invalidation fidelity) into Zipf-recurrent request streams — and scores staleness, brittleness, and the failure taxonomy against ground truth.
bench/.
outdated_rate / repoisoned_rate), defined by correctness lineage and pinned by unit test.
Related work, honestly: template/thought reuse exists (Buffer of Thoughts, Metacognitive Reuse, Analogical Prompting). YORO's contribution is making reuse safe and accounted — invalidation, the staleness taxonomy, and input/output token accounting that a cheap-output claim can't hide behind.