This document consolidates every audited benchmark run of NUXS Capsule conducted between June 2 and June 11, 2026. Every number is recomputable from the harness, execution log, and per-sample raw records published with each run.
1. Executive summary
NUXS Capsule is a context-compression layer for AI agents (Claude Code, Cursor, Cline, any MCP agent).
Three commitments distinguish this study from the prevailing practice in context-compression measurement:
- Full auditability. Every run publishes its harness, execution log, and per-sample raw records (one line per sample, with the sha256 of the input). Every number is recomputable from the artifacts.
- Quality as a gate, not a footnote. A compression gain ships in the product only after passing the Arena: real agents performing real tasks on the capsules, with judges scoring quality. As of v0.5.36 this gate is a formal release rule.
- Two honest numbers where the industry usually publishes one. We distinguish margin (compression over what is intercepted) from effective savings (margin × coverage of real traffic) — a distinction most published measurements do not make.
2. Scope and thesis
The technical thesis is that context compression is mapping, not encoding: each data type has its own structure, and a specialized parser that understands that structure preserves the load-bearing signal and discards the noise — in contrast to generic perplexity-based compressors, which prune token by token without a model of the data.
The system measured comprises 17 text/code capsules (11 algorithmic — deterministic, zero marginal cost, executed locally; 6 LLM-based — measured via real provider calls) and 3 multimodal capabilities, coordinated by a content-sniffing router with confidence scoring and LLM pre-classification in the gray zone.
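The routing step described above can be sketched as follows. This is a minimal illustration, not the NUXS router: the two sniffer patterns, the ACCEPT and REJECT thresholds, and the llm_classify callback are all hypothetical, chosen only to show confidence scoring with a gray zone that defers to an LLM pre-classifier.

```python
import re

# Hypothetical sniffers: one line-level pattern per capsule type.
SNIFFERS = {
    "log": re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}"),
    "diff": re.compile(r"(diff --git |@@ |\+|-)"),
}
ACCEPT = 0.6  # assumed: at or above this, route directly to the capsule
REJECT = 0.2  # assumed: below this, pass the content through raw


def confidence(text):
    """Return (best_type, share of non-blank lines matching that type's pattern)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "raw", 0.0
    scores = {
        kind: sum(1 for ln in lines if rx.match(ln)) / len(lines)
        for kind, rx in SNIFFERS.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


def route(text, llm_classify):
    """Route by confidence; defer to the LLM classifier only in the gray zone."""
    kind, conf = confidence(text)
    if conf >= ACCEPT:
        return kind
    if conf < REJECT:
        return "raw"
    return llm_classify(text)
```

Under these assumptions, a payload where every line starts with a timestamp routes straight to the log capsule, plain prose passes through raw, and a mixed payload is the only case that costs a classifier call.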
3. Measurement protocol
The protocol eliminates the three most common weaknesses in compression benchmarks: fixtures that favor the compressor, numbers that cannot be recomputed, and margin reported without quality verification.
3.1 Wild fixtures with digit-level mutation
Inputs are "wild giants": large real-world data (production logs, SQL schemas, real git diffs, technical PDFs, codebases) under digit-level mutation that preserves structure and vocabulary while preventing cache or memorization from inflating the result. Where synthetic input penalizes the parser (artificially high entropy), this is declared as a conservative floor — the codebase capsule measured 77.6% on synthetic fixtures and 95.2% on 40 real files; both are published.
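A minimal sketch of what digit-level mutation could look like, assuming the simplest reading of the description: every ASCII digit is replaced by a pseudo-random digit and nothing else changes. The function name and the seeding scheme are illustrative assumptions; the published harness may mutate differently (for example, by keeping dates valid).

```python
import random
import re


def mutate_digits(text, seed):
    """Replace each ASCII digit with a pseudo-random digit; leave all other bytes untouched."""
    rng = random.Random(seed)  # seeded, so a given mutation is reproducible
    return re.sub(r"[0-9]", lambda _m: str(rng.randrange(10)), text)


line = "2026-06-02 12:00:01 ERROR worker-7 timeout after 3000 ms"
mutated = mutate_digits(line, seed=1)

# Length and non-digit skeleton are preserved; only digit values change.
assert len(mutated) == len(line)
assert re.sub(r"[0-9]", "#", mutated) == re.sub(r"[0-9]", "#", line)
```

Because the skeleton is unchanged, a structural parser sees the same shape and vocabulary, while the exact byte sequence no longer matches anything previously cached or memorized.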
3.2 Two margin bases
Each capsule reports margin over accumulated volume and margin on a single pass (the more conservative number). In the current run the two bases converge within 0.2 percentage points — the convergence is itself a consistency result.
3.3 Real LLM, real cost
The 6 LLM capsules are measured via real provider calls (deepseek-v3 in the current run), never simulated. Where processing full volume would be unnecessary provider spend, the measurement is sampled and extrapolated — and declared as such ("server-sampled→extrapolated"), reported separately from genuinely processed volume.
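The sampled-then-extrapolated accounting can be illustrated with a short sketch. The record shape and function name are assumptions, and the numbers are invented for illustration; the point is only that the extrapolated figure carries its own label and is never merged with genuinely processed volume.

```python
from dataclasses import dataclass


@dataclass
class VolumeReport:
    tokens_in: int
    tokens_saved: int
    basis: str  # "processed" or "server-sampled→extrapolated"


def extrapolate(sample_in, sample_out, full_volume_in):
    """Apply the margin measured on a real-provider sample to the full volume."""
    saved = round(full_volume_in * (sample_in - sample_out) / sample_in)
    return VolumeReport(full_volume_in, saved, "server-sampled→extrapolated")


# Illustrative numbers only: a 10k-token sample compressed to 1.5k tokens,
# extrapolated to a 1M-token volume.
report = extrapolate(sample_in=10_000, sample_out=1_500, full_volume_in=1_000_000)
print(report)
```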
3.4 Tokenizer declared as proxy
All counting uses cl100k_base. Percentage margin is stable across tokenizers; the absolute dollar value is an approximation (±15%) because providers use their own tokenizers. This limitation is declared, not hidden.
3.5 Raw record per sample
Every sample produces a .jsonl line with the input sha256, input tokens, output tokens, and execution metadata. The full harness ships with every run. Any third party can recompute the aggregates.
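As an illustration of how a third party might recompute an aggregate from the per-sample records, here is a minimal sketch. The field names (capsule, sha256, tokens_in, tokens_out) are assumptions about the .jsonl layout, not the published schema, and the two records are invented.

```python
import hashlib
import json


def make_record(capsule, raw_input, tokens_in, tokens_out):
    """Build one .jsonl line: input hash plus token counts (hypothetical field names)."""
    digest = hashlib.sha256(raw_input.encode("utf-8")).hexdigest()
    return json.dumps({
        "capsule": capsule,
        "sha256": digest,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
    })


def recompute_margin(jsonl_lines):
    """Volume-weighted margin in percent, recomputed from raw per-sample lines."""
    total_in = total_out = 0
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate blank lines
        rec = json.loads(line)
        total_in += rec["tokens_in"]
        total_out += rec["tokens_out"]
    return 100.0 * (total_in - total_out) / total_in


records = [
    make_record("log", "sample a", 1000, 100),
    make_record("diff", "sample b", 3000, 100),
]
print(f"{recompute_margin(records):.2f}%")  # prints 95.00%
```

The sha256 lets an auditor confirm that a record refers to a specific input without the report having to embed the input itself.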
3.6 Passthrough counted, not hidden
When a capsule’s output does not compress enough (per-capsule assertCompressed guards) or would be a stub with no semantics (assertNotEmpty), the output is rejected and the data passes through raw. The passthrough rate is reported per capsule: in the wild run through the production hook, the published per-capsule rates range from 30% to 61%. Passthrough is a product guarantee: in the worst case, the agent pays exactly what it would have paid without the capsule.
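The guard-then-passthrough behavior can be sketched as below. Only the guard names come from the text; the floor values, the stub threshold, the whitespace token count, and the intercept function are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical per-type floors: minimum fraction of tokens a capsule must remove.
COMPRESSION_FLOOR = {"log": 0.50, "schema": 0.40}
MIN_OUTPUT_TOKENS = 8  # assumed threshold below which output counts as a stub


class GuardRejected(Exception):
    """Raised when a capsule output fails a guard."""


def count_tokens(text):
    # Whitespace split as a stand-in; the study counts with cl100k_base.
    return len(text.split())


def assert_compressed(kind, tokens_in, tokens_out):
    if tokens_in == 0 or 1 - tokens_out / tokens_in < COMPRESSION_FLOOR[kind]:
        raise GuardRejected(f"{kind}: below compression floor")


def assert_not_empty(tokens_out):
    if tokens_out < MIN_OUTPUT_TOKENS:
        raise GuardRejected("stub output")


def intercept(kind, raw, capsule, stats):
    """Return the capsule output if it passes both guards, else the raw input."""
    out = capsule(raw)
    try:
        assert_compressed(kind, count_tokens(raw), count_tokens(out))
        assert_not_empty(count_tokens(out))
    except GuardRejected:
        stats["passthrough"] += 1
        return raw
    stats["compressed"] += 1
    return out


stats = Counter()
raw = " ".join(f"tok{i}" for i in range(100))
intercept("log", raw, lambda t: " ".join(t.split()[:20]), stats)  # passes both guards
intercept("log", raw, lambda t: " ".join(t.split()[:90]), stats)  # misses the floor
intercept("log", raw, lambda t: "ok", stats)                      # stub, rejected
passthrough_rate = stats["passthrough"] / sum(stats.values())
```

In this toy run two of three samples fall through to raw, and that rate is counted rather than dropped from the denominator, which is the property the protocol is asserting.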
4. Evolution across audited runs
The 626.8M tokens are cumulative across runs, not a single corpus. Separate runs reuse fixture families under independent mutations.
The series (87.45 → 88.44 → 91.97 → 95.56 → 95.42%) reflects both genuine product gains and fixture/method differences across runs — we declare this: it is not a pure like-for-like delta. What it does prove is consistency: the 100M (v0.5.32) and 200M (v0.5.33) runs, with identical methods, land at the same margin (~95.5%), confirming no regression between consecutive versions.
5. Current-run results (200M · v0.5.33)
5.1 Aggregates by profile
Two profiles use distinct per-capsule traffic weights, modeling a knowledge/RAG agent versus a coding agent. Zero errors across 9,333 samples.
5.2 Per capsule (TEXT profile)
The CODE profile inverts the weights (codebase, diff, stack dominant) and holds the same per-capsule margins — the full table ships with the artifacts. (*) The image line carries a retroactive methodological caveat — see §9.
5.3 Multimodal (separate 1M phase, real providers)
6. Production validation: the wild run
A laboratory benchmark, however wild its fixtures, is still a laboratory. An entire run (20.2M tokens) was executed through the real production hook (dist/intercept.js) — the same binary that runs on the user machine, with guards active, router active, and passthrough counted.
Result: 91.97% aggregate margin (text profile 80.6–89.4%; code profile 93.7–94.4%), with per-capsule passthrough published (e.g. log 61%, api 42%, schema 56%). The gap to the ~95% fixture benchmark is expected and informative: in production the router rejects what is not worth compressing — that conservative behavior is measured, not masked.
This run also exposed and fixed a real defect: the PDF capsule was failing silently in production due to an API change in pdf-parse 2.x (data fell through to raw, no visible error). Caught by the protocol skip log, fixed in v0.5.33, validated at 86.3–96% on real PDFs. We record this episode because it shows the protocol working as an engineering instrument, not just a marketing one.
7. Quality: the Arena and the guards
Compression margin, on its own, is a dangerous number — one can "compress" 99% while destroying the agent’s ability to perform the task. NUXS treats quality as a gate at three layers.
7.1 Deterministic runtime guards
assertCompressed rejects any capsule that misses its type’s compression floor; assertNotEmpty rejects stubs (large input → tiny output with no semantics — the pattern that masks a broken parser behind a fake-high number); every rejection is logged locally and reported via beacon to the admin panel.
7.2 The Arena
Real Claude agents perform real tasks on the capsules (navigating a codebase, answering questions about a schema, diagnosing a stack trace) and judges score quality 0–10. Arena results have already changed the product: the two-tier schema graph was validated with a margin gain of +19 to +27 percentage points while holding a quality score of 8.5/10; the codebase and prompt capsules were corrected after Arena failures. Margin that does not pass the Arena is not promoted to product.
7.3 Arena as a release gate
As of v0.5.36, every capsule that replaces whole content with a compact representation must pass a mandatory task smoke test before publishing — the rule formalized in RELEASE-GATES. The motivation is documented in §9 (the image episode).
7.4 Two-tier strategy
For bulky data, the capsule delivers a dense index (tier-1) and keeps the raw body accessible via on-demand retrieve (an MCP tool). The agent always receives the map and pulls exact content when the task requires full fidelity (e.g. code editing). This is what enables high margins without breaking operations that depend on exact bytes.
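A minimal sketch of the two-tier idea, assuming an in-memory store and a content-hash handle. The class name, the index line format, and the handle scheme are hypothetical; in the product, retrieve is exposed as an MCP tool rather than a method call.

```python
import hashlib


class TwoTierStore:
    """Tier 1: a dense index the agent always sees. Tier 2: exact bodies on demand."""

    def __init__(self):
        self._bodies = {}

    def compress(self, files):
        """Return a compact index, one line per file, and keep the raw bodies."""
        lines = []
        for path, body in files.items():
            handle = hashlib.sha256(body.encode("utf-8")).hexdigest()[:12]
            self._bodies[handle] = body
            lines.append(f"{path} lines={len(body.splitlines())} handle={handle}")
        return "\n".join(lines)

    def retrieve(self, handle):
        """Return the exact original content for a handle taken from the index."""
        return self._bodies[handle]


store = TwoTierStore()
source = "def add(a, b):\n    return a + b\n"
index = store.compress({"src/math.py": source})
handle = index.split("handle=")[1]
assert store.retrieve(handle) == source  # exact bytes remain reachable
```

The agent reads the index by default and calls retrieve only when a task, such as an edit, needs the exact content.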
8. Margin is not savings: the two metrics we report
Most published measurements report a single number — the compression ratio over what was compressed. That number, in isolation, overstates real savings, because no honest system compresses 100% of traffic: part of the data must pass through raw by design (content the agent needs intact, small files where overhead exceeds the gain).
- Margin — compression over intercepted traffic. This is what the tables above measure: 95.42% in the current run.
- Coverage — the fraction of total traffic that is intercepted. Measured by production telemetry (embedded as of v0.5.36), per client and per usage profile, reported in the product dashboard — including the distinction between "traffic uncompressed by design" and "compression opportunity."
A client’s effective savings is the product of the two, amplified by the mechanics of modern agents: because the session context is retransmitted on every turn, every token compressed out of the context saves on every subsequent turn of the session — context compression compounds over the session rather than paying out once. We publish margin because it is the property of the engine; we publish per-client coverage because it is the truth of the bill. As of this date we are not aware of another public measurement in this space that reports the two separately.
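The relationship between the two metrics, and the compounding effect, can be made concrete with a small model. This is a sketch under stated assumptions: every turn appends the same number of raw tokens, the whole context is resent on each turn, and prompt caching and output tokens are ignored. The 95.42% margin is the current-run figure; the 60% coverage, the 2,000 tokens per turn, and the 20 turns are invented for illustration.

```python
def effective_savings(margin, coverage):
    """Share of total traffic removed: margin applies only to the intercepted share."""
    return margin * coverage


def session_input_tokens(tokens_per_turn, turns, margin=0.0, coverage=0.0):
    """Total input tokens sent over a session when context is resent every turn."""
    kept = tokens_per_turn * (1 - effective_savings(margin, coverage))
    context = 0.0
    sent = 0.0
    for _ in range(turns):
        context += kept   # this turn's content joins the context
        sent += context   # the whole context is retransmitted
    return sent


baseline = session_input_tokens(2_000, 20)
compressed = session_input_tokens(2_000, 20, margin=0.9542, coverage=0.60)
print(f"effective savings: {effective_savings(0.9542, 0.60):.1%}")
print(f"tokens avoided over the session: {baseline - compressed:,.0f}")
```

In this model the percentage saved stays at margin × coverage, while the absolute number of tokens avoided grows with the square of the session length, because each compressed token is not resent on any later turn.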
9. Declared limitations and corrections
A benchmark’s credibility is measured by what it declares against its own interest. For the record:
- prompt (99.8%): the giant fixture (near-identical prompts) corresponds exactly to the capsule’s design purpose (deduplicating agent telemetry). On a single unique system prompt the mode is different — structural compression at ~75.7% with every guard clause preserved. Both numbers are published.
- api (99–100%): legitimate — the capsule keeps the schema plus one sample; verified at 99.4% over 200 varied objects.
- image (99.4%), retroactive caveat (Jun 11): the number is a bytes→tokens estimate, and it measured an interception that, on first read, replaced the image with a metadata pointer, blinding the agent to the visual content. This quality regression was present from v0.5.7 through v0.5.35 and was fixed in v0.5.36: the first read now passes to model vision, and only unchanged re-reads within the same context epoch serve the pointer. Going forward, image savings are measured as vision tokens saved on re-reads, never bytes/4. The slice is small (0.3–0.8% of the corpus) and does not affect the aggregate, but the line stays annotated for full auditability. This episode is the origin of the Arena-gate rule in §7.3.
- Synthetic fixtures as a floor: structural capsules (codebase/diff/schema) plateau on high-entropy synthetic input; on real data they rise 10–20 p.p. Both regimes are published.
- Improvements rejected on fidelity grounds: experimental variants with margin gains of up to +42% were rejected when they implied signal loss (e.g. a diff-hunk pool that discarded line offsets; truncation of failing-test lists). Margin is never bought at the expense of fidelity.
10. Methodological comparison with industry practice
We do not compare margins directly against third-party numbers, because different methods produce incommensurable numbers — that is precisely the point. We compare protocols.
On the technical side: perplexity-based compressors such as LLMLingua and LongLLMLingua (Jiang et al., 2023) prune tokens by estimated importance; RECOMP (Xu et al., 2023) demonstrates query-aware abstractive summarization for RAG, an approach the NUXS rag capsule adopts and extends. The template extraction in the log capsule follows the lineage of Drain (He et al., 2017). The architectural difference from generic compressors is per-type specialization with a native output format per capsule, combined with a retrieve layer that preserves access to the intact data: compression as a navigable map, not as pruning.
Compression also composes with provider prompt caching: capsules are byte-deterministic (a tested invariant — repeated runs produce identical bytes), making them eligible for provider prefix caching — the two savings multiply rather than compete.
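A byte-determinism check of the kind described could look like the sketch below. The capsule function here is a hypothetical stand-in that serializes records canonically; the invariant test itself only requires that repeated runs on the same input hash to the same digest.

```python
import hashlib
import json


def capsule(records):
    """Hypothetical stand-in capsule: canonical, order-independent serialization."""
    ordered = sorted(records, key=lambda r: r["id"])
    # sort_keys and fixed separators keep the bytes independent of dict ordering.
    return json.dumps(ordered, sort_keys=True, separators=(",", ":")).encode("utf-8")


def is_byte_deterministic(fn, payload, runs=5):
    """True if every run of fn on the same payload yields identical bytes."""
    digests = {hashlib.sha256(fn(payload)).hexdigest() for _ in range(runs)}
    return len(digests) == 1


payload = [{"id": 2, "v": "b"}, {"id": 1, "v": "a"}]
assert is_byte_deterministic(capsule, payload)
```

Prefix caching keys on an exact match of the leading content, so a capsule that embedded a timestamp or relied on unordered iteration would fail this check and lose the cache hit.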
11. Reproducibility
Every run publishes, in the benchmarks folder, REPORT.md, the per-capsule summary (*-summary.json), per-sample raw (*-raw.jsonl, one line per sample with the input sha256), the execution log (*-run.log), and the harness (benchmark-*.mjs). Every aggregate in this document is recomputable from those artifacts. Methodological challenges are welcome: the protocol exists to be audited.
References
- Jiang, H. et al. (2023). LLMLingua / LongLLMLingua — prompt compression. EMNLP 2023.
- Xu, F. et al. (2023). RECOMP — retrieval-augmented compression.
- He, P. et al. (2017). Drain — online log parsing. IEEE ICWS 2017.
- Anthropic. Prompt Caching — official API documentation.
Cumulative total — per-capsule distribution
The study headline — 626,784,439 tokens processed · 574,252,194 saved · 91.62% weighted margin — is the sum of five separate audited runs. The table below distributes that cumulative total by capsule: each row sums the capsule across all five runs (official 180.3M + F4 127.6M + wild-hook 20.2M + 100M + 200M), TEXT and CODE profiles combined. Where §5.2 shows a single run, this is the per-capsule view of the entire study.
Reconciliation: the rows total 626,779,041 processed / 574,248,120 saved — within rounding of the audited 626,784,439 / 574,252,194 (91.62%). “session” is a retired capsule (no longer in the 17-capsule product), kept here only so the rows reconcile to the audited total; the three multimodal capabilities (image-LLM, meeting, video) are a separate 1M phase, outside this text/code total.
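The headline margin and the reconciliation gap can be checked directly from the four totals quoted above. The variable names are illustrative; the figures are the ones stated in the text.

```python
AUDITED_PROCESSED = 626_784_439
AUDITED_SAVED = 574_252_194
ROWS_PROCESSED = 626_779_041
ROWS_SAVED = 574_248_120

margin = 100 * AUDITED_SAVED / AUDITED_PROCESSED
gap_processed = AUDITED_PROCESSED - ROWS_PROCESSED
gap_saved = AUDITED_SAVED - ROWS_SAVED

print(f"weighted margin: {margin:.2f}%")                         # weighted margin: 91.62%
print(f"gap: {gap_processed} processed, {gap_saved} saved")      # gap: 5398 processed, 4074 saved
print(f"relative gap: {100 * gap_processed / AUDITED_PROCESSED:.4f}% of processed tokens")
```

The row totals fall short of the audited totals by a few thousand tokens out of more than 600 million, which is the "within rounding" difference the reconciliation refers to.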
Benchmark updated each release. Raw files published for independent reaudit.