yoro-cache · open source

You Only Reason Once.
Cache the plan, not the answer.

Semantic caches save tokens until the world changes — then they quietly serve stale answers. YORO is a drop-in OpenAI-compatible proxy that invalidates on change and, instead of re-deriving from scratch, replays the reasoning it already paid for. Every claim below is measured, on real models, with the benchmark released alongside.

GitHub Quickstart The receipts
0.90 → 0.00
cache staleness under drift, GPTCache-style vs YORO
E1′ · gpt-oss-120B · matched τ=0.90
96% vs 16%
accuracy at heavy drift — replay vs serve-or-stale
E7 · 5 rungs × 5 drift levels
~10–21%
of no-cache output tokens, while staying correct
E7 · replay tiers
2 families
divergence reproduces on gpt-oss-120B (H100) and Qwen2.5-32B (4-bit, consumer RTX 5090)
E5

1 · Semantic caches rot — faster than you think

I swept drift rate (the fraction of recurring tasks whose true answer changes) on a clean, embedding-separable workload at matched thresholds. A no-invalidation semantic cache doesn't degrade politely: at just 5% drift, over half of its cache hits are already wrong, because popular items drift too — and every later hit on a drifted item serves the dead answer. YORO's dependency invalidation holds staleness at ~0 across the whole range.

Staleness vs drift rate

share of cache hits that serve a wrong answer · easy workload · τ=0.90 both
Data table

Accuracy vs drift rate

same runs — what the user experiences
Data table

Brittleness vs near-miss rate

look-alike queries force-fit into wrong reuse · fixed drift 0.05 · E2′
Data table

The honest trade: hit rate

YORO's gate declines risky reuse — hit rate falls, correctness holds at ~1.00
Data table

Read the two panels together: the naive cache keeps its ~93% hit rate by force-fitting look-alikes — and its accuracy collapses to 0.29. YORO gives up hit rate (0.89 → 0.33 as near-misses rise) and keeps accuracy at ~1.00. That is the safety/savings trade, measured — not a free lunch, an honest dial.

2 · The twist: invalidation alone isn't enough

Add perfect invalidation and something surprising happens in agent workloads: accuracy still collapses. When the method behind an answer lives in earlier interactions (the normal case for long-running agents — "recompute it for the new numbers"), dropping the stale entry forces a re-derivation without the method. The engine re-derives wrong, caches it, and serves that. I call it re-poisoning — and it's measurable. Splitting every wrong same-entity serve by whether the served answer was ever correct:

Why each cache is wrong, at drift 0.4

wrong serves split by failure mode · hard (method-in-history) workload · E7
outdated — was correct once, world moved re-poisoned — never correct, re-derived wrong & cached
Data table

The no-invalidation cache fails outdated-heavy (genuine staleness). The invalidating cache fails ~99% re-poisoned — its invalidation is flawless; the failure is re-derivation, and note it manufactures more poison than the naive cache (3,369 vs 1,671): invalidation alone converts staleness into garbage production. Accuracy can't tell any of this apart; the taxonomy can. And the third bar is the resolution — replay makes the re-derivation succeed, so both failure modes collapse to 169 wrong serves total (§3).

3 · The fix: replay the cached reasoning

YORO stores the reasoning trace alongside every answer. When a dependency changes, it doesn't serve the frozen answer or re-derive blind — it replays the validated method against the new inputs: a short, exploration-free completion. Both replay tiers stay flat above ~90% accuracy while every alternative collapses; and the model's reasoning-effort setting becomes a cost dial.

Accuracy vs drift — the four-tier result

hard workload (re-asks reference the established procedure) · gpt-oss-120B · E7
Data table

Output tokens vs drift

thousands of generation tokens per stream · same runs
Data table

It's not a model quirk

same divergence on Qwen2.5-32B-Instruct, 4-bit AWQ, one consumer GPU · E5
Data table

The signal is a dial, not an oracle

staleness as the invalidation signal weakens · fixed drift 0.3 · E4
Data table

Honest scope: the stateless collapse here is by construction — these re-asks reference a procedure established earlier, the regime where memory is necessary. If every request restates its full context, a plain cache with invalidation is fine (§1 — YORO holds 1.00 accuracy there too). Replay is validated on multi-step arithmetic procedures so far; non-numeric procedures are future work. 4-bit quantization biases against replay (more arithmetic slips) — it dominates anyway. And the invalidation signal isn't an oracle: strip it and YORO degrades gracefully — staleness ≈ the share of missed signals, converging exactly to the naive cache's 0.866 at zero signal. With the signal: ~zero.

4 · How it works: graduated reuse

One gate, three tiers. Every request is embedded, matched against the case store, and routed to the cheapest tier that's safe:

The request path

request embed + match nearest case · freshness SERVE fresh + high sim → cached answer REPLAY same case, deps changed → re-apply method REASON novel / borderline → full reasoning, cache it 0 tokens ~10–21% of tokens full cost, once
Tier 1 · Serve

Fresh + high similarity

Return the cached answer. Zero model calls, zero tokens. Dependency fingerprints scope every entry — a changed dep means this tier is off the table.

Tier 2 · Replay

Same case, deps changed

Inject the stored reasoning trace: “apply this validated procedure to the new inputs.” Short output, no re-exploration. Reasoning effort dials cost vs accuracy (96%@21% or 92%@10% of no-cache tokens). Shipped in the library and benchmark today; the proxy tier lands in the next release.

Tier 3 · Reason

Novel or borderline

Full reasoning on the upstream model. The trace and answer are cached with dependency fingerprints — so you only reason once.

5 · Using it

One base-URL change. Works with any OpenAI-compatible client or upstream (vLLM, llama.cpp server, OpenRouter, …).

# install  (until the PyPI release: pip install git+https://github.com/ChaitanyaPinapaka/yoro-cache)
pip install "yoro-cache[embed]"

# example upstream: llama.cpp serving a local GGUF model
brew install llama.cpp        # or your platform's build
llama-server -m your-model.gguf --port 8000

# run the proxy in front of it
YORO_UPSTREAM=http://127.0.0.1:8000/v1 yoro serve        # listens on :8400

# point your client at it — that's the whole integration
export OPENAI_BASE_URL=http://127.0.0.1:8400/v1

With OpenCode + a local model

# 1. serve a local model via llama.cpp
#    (example: the one this page was tested with)
llama-server -hf deepreinforce-ai/Ornith-1.0-35B-GGUF \
             --port 8000

# 2. put YORO in front of it
YORO_UPSTREAM=http://127.0.0.1:8000/v1 yoro serve

# 3. point OpenCode at the proxy — opencode.json:
{ "provider": { "yoro": {
    "npm": "@ai-sdk/openai-compatible",
    "options": { "baseURL": "http://127.0.0.1:8400/v1" },
    "models": { "ornith-35b": {} } } } }

Measured on this setup (35B Q4, M-series Mac): repeated asks serve in ~12 ms vs ~3.3 s upstream. The safe policy caches plain turns and passes tool-bearing (agentic) turns through untouched.

Watch it work

$ curl -s :8400/yoro/stats
{
  "hit": 42, "miss": 17, "skip": 6,
  "stored": 17, "served": 65,
  "hit_rate": 0.646
}

Every response carries the decision in its headers, so you can audit each reuse.

The header contract

HeaderDirectionMeaning
X-YORO-Depsrequestname:fingerprint,… — entry serves only while these match
X-YORO-Cache: 0|1requestforce caching off / on for this call
X-YORO-CacheresponseHIT · MISS · SKIP:<reason>
X-YORO-Simresponsesimilarity of the matched case

Configuration (env)

VariableDefault
YORO_UPSTREAMhttp://127.0.0.1:8000/v1
YORO_PORT8400
YORO_POLICYsafe (refuses agentic/sampled turns) · aggressive
YORO_TAU_HIT / YORO_TAU_MISS0.95 / 0.6
YORO_EMBEDall-MiniLM-L6-v2
YORO_CACHE_PATH~/.yoro/proxy_cache.json

Bonus: mine your agent sessions into AGENTS.md

For fully agentic tools, YORO can harvest the reusable methods from your past OpenCode sessions and keep a marked, auto-updating block in AGENTS.md — reasoning your future sessions inherit for free:

YORO_UPSTREAM=http://127.0.0.1:8000/v1 python -m yoro.opencode_behaviors --out AGENTS.md

Safe by default: the proxy refuses to cache tool-bearing (agentic) or sampled turns — a stale hit must never corrupt an agent's file tree. Cache entries only serve while their dependency fingerprints match; change the file and the entry stops serving. No deps header? The novelty gate and conservative defaults still apply.

6 · The receipts

Every number on this page comes from a released benchmark you can run yourself: a tunable stress harness that injects drift (answers change), near-misses (look-alikes with different answers), and signal loss (invalidation fidelity) into Zipf-recurrent request streams — and scores staleness, brittleness, and the failure taxonomy against ground truth.

Related work, honestly: template/thought reuse exists (Buffer of Thoughts, Metacognitive Reuse, Analogical Prompting). YORO's contribution is making reuse safe and accounted — invalidation, the staleness taxonomy, and input/output token accounting that a cheap-output claim can't hide behind.