YORO — You Only Reason Once. Cache the plan, not the answer.

1 · Semantic caches rot — faster than you think

I swept drift rate (the fraction of recurring tasks whose true answer changes) on a clean, embedding-separable workload at matched thresholds. A no-invalidation semantic cache doesn't degrade politely: at just 5% drift, over half of its cache hits are already wrong, because popular items drift too — and every later hit on a drifted item serves the dead answer. YORO's dependency invalidation holds staleness at ~0 across the whole range.

Staleness vs drift rate

share of cache hits that serve a wrong answer · easy workload · τ=0.90 both

Data table

Accuracy vs drift rate

same runs — what the user experiences

Data table

Brittleness vs near-miss rate

look-alike queries force-fit into wrong reuse · fixed drift 0.05 · E2′

Data table

The honest trade: hit rate

YORO's gate declines risky reuse — hit rate falls, correctness holds at ~1.00

Data table

Read the two panels together: the naive cache keeps its ~93% hit rate by force-fitting look-alikes — and its accuracy collapses to 0.29. YORO gives up hit rate (0.89 → 0.33 as near-misses rise) and keeps accuracy at ~1.00. That is the safety/savings trade, measured — not a free lunch, an honest dial.

2 · The twist: invalidation alone isn't enough

Add perfect invalidation and something surprising happens in agent workloads: accuracy still collapses. When the method behind an answer lives in earlier interactions (the normal case for long-running agents — "recompute it for the new numbers"), dropping the stale entry forces a re-derivation without the method. The engine re-derives wrong, caches it, and serves that. I call it re-poisoning — and it's measurable. Splitting every wrong same-entity serve by whether the served answer was ever correct:

Why each cache is wrong, at drift 0.4

wrong serves split by failure mode · hard (method-in-history) workload · E7

outdated — was correct once, world moved re-poisoned — never correct, re-derived wrong & cached

Data table

The no-invalidation cache fails outdated-heavy (genuine staleness). The invalidating cache fails ~99% re-poisoned — its invalidation is flawless; the failure is re-derivation, and note it manufactures more poison than the naive cache (3,369 vs 1,671): invalidation alone converts staleness into garbage production. Accuracy can't tell any of this apart; the taxonomy can. And the third bar is the resolution — replay makes the re-derivation succeed, so both failure modes collapse to 169 wrong serves total (§3).

3 · The fix: replay the cached reasoning

YORO stores the reasoning trace alongside every answer. When a dependency changes, it doesn't serve the frozen answer or re-derive blind — it replays the validated method against the new inputs: a short, exploration-free completion. Both replay tiers stay flat above ~90% accuracy while every alternative collapses; and the model's reasoning-effort setting becomes a cost dial.

Accuracy vs drift — the four-tier result

hard workload (re-asks reference the established procedure) · gpt-oss-120B · E7

Data table

Output tokens vs drift

thousands of generation tokens per stream · same runs

Data table

It's not a model quirk

same divergence on Qwen2.5-32B-Instruct, 4-bit AWQ, one consumer GPU · E5

Data table

The signal is a dial, not an oracle

staleness as the invalidation signal weakens · fixed drift 0.3 · E4

Data table

Honest scope: the stateless collapse here is by construction — these re-asks reference a procedure established earlier, the regime where memory is necessary. If every request restates its full context, a plain cache with invalidation is fine (§1 — YORO holds 1.00 accuracy there too). Replay is validated on multi-step arithmetic procedures so far; non-numeric procedures are future work. 4-bit quantization biases against replay (more arithmetic slips) — it dominates anyway. And the invalidation signal isn't an oracle: strip it and YORO degrades gracefully — staleness ≈ the share of missed signals, converging exactly to the naive cache's 0.866 at zero signal. With the signal: ~zero.

4 · How it works: graduated reuse

One gate, three tiers. Every request is embedded, matched against the case store, and routed to the cheapest tier that's safe:

The request path

Tier 1 · Serve

Fresh + high similarity

Return the cached answer. Zero model calls, zero tokens. Dependency fingerprints scope every entry — a changed dep means this tier is off the table.

Tier 2 · Replay

Same case, deps changed

Inject the stored reasoning trace: “apply this validated procedure to the new inputs.” Short output, no re-exploration. Reasoning effort dials cost vs accuracy (96%@21% or 92%@10% of no-cache tokens). Live in the proxy as of v0.1.2: responses carry X-YORO-Cache: REPLAY.

Tier 3 · Reason

Novel or borderline

Full reasoning on the upstream model. The trace and answer are cached with dependency fingerprints — so you only reason once.

Safe by default: the proxy refuses to cache agentic (tool-bearing) or sampled turns — a stale hit can't corrupt an agent's file tree. Aggressive mode and per-request X-YORO-Cache are explicit opt-ins.
Novelty gate: look-alike queries below the hit threshold escalate instead of force-fitting (that's the ~0 brittleness in E2′).
Observability: every response carries X-YORO-Cache: HIT|MISS|SKIP + similarity; /yoro/stats has running totals.

5 · Using it

One base-URL change. Works with any OpenAI-compatible client or upstream (vLLM, llama.cpp server, OpenRouter, …).

# install  (until the PyPI release: pip install git+https://github.com/ChaitanyaPinapaka/yoro-cache)
pip install "yoro-cache[embed]"

# example upstream: llama.cpp serving a local GGUF model
brew install llama.cpp        # or your platform's build
llama-server -m your-model.gguf --port 8000

# run the proxy in front of it
YORO_UPSTREAM=http://127.0.0.1:8000/v1 yoro serve        # listens on :8400

# point your client at it — that's the whole integration
export OPENAI_BASE_URL=http://127.0.0.1:8400/v1

With OpenCode + a local model

# 1. serve a local model via llama.cpp
#    (example: the one this page was tested with)
llama-server -hf deepreinforce-ai/Ornith-1.0-35B-GGUF \
             --port 8000

# 2. put YORO in front of it
YORO_UPSTREAM=http://127.0.0.1:8000/v1 yoro serve

# 3. point OpenCode at the proxy — opencode.json:
{ "provider": { "yoro": {
    "npm": "@ai-sdk/openai-compatible",
    "options": { "baseURL": "http://127.0.0.1:8400/v1" },
    "models": { "ornith-35b": {} } } } }

Measured on this setup (35B Q4, M-series Mac): repeated asks serve in ~12 ms vs ~3.3 s upstream. The safe policy caches plain turns and passes tool-bearing (agentic) turns through untouched.

Watch it work

$ curl -s :8400/yoro/stats
{
  "hit": 42, "miss": 17, "skip": 6,
  "stored": 17, "served": 65,
  "hit_rate": 0.646
}

Every response carries the decision in its headers, so you can audit each reuse.

The header contract

Header	Direction	Meaning
`X-YORO-Deps`	request	`name:fingerprint,…` — entry serves only while these match
`X-YORO-Cache: 0\|1`	request	force caching off / on for this call
`X-YORO-Cache`	response	`HIT` · `REPLAY` · `MISS` · `SKIP:<reason>`
`X-YORO-Sim`	response	similarity of the matched case

Configuration (env)

Variable	Default
`YORO_UPSTREAM`	`http://127.0.0.1:8000/v1`
`YORO_PORT`	`8400`
`YORO_POLICY`	`safe` (refuses agentic/sampled turns) · `aggressive`
`YORO_TAU_HIT` / `YORO_TAU_MISS`	`0.95` / `0.6`
`YORO_EMBED`	`all-MiniLM-L6-v2`
`YORO_CACHE_PATH`	`~/.yoro/proxy_cache.json`

Bonus: mine your agent sessions into AGENTS.md

For fully agentic tools, YORO can harvest the reusable methods from your past OpenCode sessions and keep a marked, auto-updating block in AGENTS.md — reasoning your future sessions inherit for free:

YORO_UPSTREAM=http://127.0.0.1:8000/v1 python -m yoro.opencode_behaviors --out AGENTS.md

Safe by default: the proxy refuses to cache tool-bearing (agentic) or sampled turns — a stale hit must never corrupt an agent's file tree. Cache entries only serve while their dependency fingerprints match; when they change, the proxy replays the stored derivation against the new inputs instead of serving stale. Signals can come from the request header, from yoro serve --git . (the working tree as one dependency — the natural signal for coding agents), from a sidecar deps-file, or from an MCP server's resources via yoro mcp-bridge. Adapters ship for LiteLLM and LangChain (pip install "yoro-cache[litellm]" / [langchain]): same gate, same invalidation, inside your existing stack.

6 · The receipts

Every number on this page comes from a released benchmark you can run yourself: a tunable stress harness that injects drift (answers change), near-misses (look-alikes with different answers), and signal loss (invalidation fidelity) into Zipf-recurrent request streams — and scores staleness, brittleness, and the failure taxonomy against ground truth.

Benchmark + harness: github.com/ChaitanyaPinapaka/yoro-cache — rungs, sweeps, taxonomy metrics, one command per figure; result curves ship in bench/.
Experiments behind this page: E1′/E2′ (safety divergence, 6 levels × ~15 seeds), E7 (four-tier Pareto, 5 rungs × 5 levels), E5 (second model family, quantized, consumer GPU), E4 (invalidation-fidelity envelope). Result curves ship in the repo; per-event logs are available on request.
Scale: 25 sweep levels · 1,027 (level×rung×seed) runs · 616,200 scored queries · 72.7M tokens (35.5M output + 37.2M input) · 2 model families on 2 GPU classes · ~17 GPU-hours ≈ $50 of compute for the reported experiments.
Methodology: matched reuse thresholds across all rungs (τ=0.90 easy / 0.80 hard regime) · workloads randomized per seed, 8–30 seeds per level with convergence early-stop · failure taxonomy defined by correctness lineage (outdated = served a once-correct answer; re-poisoned = served a never-correct one) and pinned by unit test · every query's decision, correctness, and token counts logged per event, so any metric re-derives offline.
Failure taxonomy: outdated vs re-poisoned is a first-class metric (outdated_rate / repoisoned_rate), defined by correctness lineage and pinned by unit test.

Related work, honestly: template/thought reuse exists (Buffer of Thoughts, Metacognitive Reuse, Analogical Prompting). YORO's contribution is making reuse safe and accounted — invalidation, the staleness taxonomy, and input/output token accounting that a cheap-output claim can't hide behind.