Technical study · 5 audited runs · June 2026
626 million audited tokens: AI savings aren't where the market is measuring
Josué Ramos · NUXS · June 2026 · opencore
As the current benchmark study suggests, the real savings in AI usage aren't found in the compression margin. This idea may already exist in pieces, but not with the scope this study brings to it.
Well, one thing has become clear in this new AI era: a new economic model has formed, and it points in a clear direction. Tokens are here to stay, and they're already part of daily life. Just as we are learning to measure a new pace of work delivery under AI acceleration, we are learning to measure budgets by token burn.
Late last year I lived this firsthand: massive token burn across my team. And a paradox you might recognize: the cheaper the models got per token, the higher the total bill climbed. Better, faster models invite more use, and consumption grows faster than prices fall. Budgets kept rising instead of stabilizing.
I tried everything to cut costs. Last of all, I went after data compression, and here I had home-field advantage: I come from building bots for predictive markets, and squeezing data has been my job for years. To my surprise, the experience was bad. The tools I found compressed text as text, in the style of the academic state of the art (LLMLingua and successors): drop the tokens with the lowest statistical weight. That works for prose. It fails exactly on what an agent eats all day: logs, SQL schemas, diffs, stack traces, test output, API responses. The same goes for generic compression. And nothing I saw measured coverage, which is the real driver of savings. For a team writing code 24 hours a day, it was useless.
I started working alone on an architecture project. And I discovered that compression wasn't the trigger of real savings. The trigger came from properly mapping and tracking the data, refined logic work, with savings ceilings that depend on usage style. In other words: real savings live in data coverage, not in a pretty compression ratio. My compressions reach 99.9% savings margin. But how much of that actually comes off my bill?
The intuition is simple. Good mapping hands the AI the data it needs to keep working with the minimum of tokens. But that AI is fed by hundreds of different sources, and the more you fence it in and feed it a disciplined diet, the more agile and economical it works. It doesn't just save money: it gets faster and more precise, because model attention is finite, and a clean context is a more qualified context.
There's a solid mathematical logic here. The AI spends tokens per API call, and an isolated call is tokens burned at full price. Calls in a large group can generate savings, but only if the early calls create familiarity with the rest: a piece of data compressed early on is then re-read, already compressed, throughout all the remaining calls. That's how you save tokens.
Savings don't come only from avoiding isolated calls and creating families, though. We'll see this further on.
§ 01 How this became engineering
Structured data is not prose. A log is a pattern that repeats with variations; a schema is a shape, not a sentence. So instead of one compressor, I wrote 20 specialized parsers (the capsules and multimodals), each carrying an opinion about what is signal and what is noise in its format.
Architecture: capsule flow and two tiers (Agent · Hook)
| Class | Count | Capsules | Notes |
| Algorithmic | 11 | log · api · network · schema · codebase · diff · test · build · apispec · prompt · image | Deterministic. Zero marginal cost. Runs locally. 5 are open core under AGPL-3.0. |
| LLM-based | 6 | rag · sql · stack · threads · events · pdf | Query-aware abstractive digest (RECOMP) on the user's own keys. |
| Multimodal | 3 | image-LLM · meeting · video | Measured in a separate phase with real provider calls. |
Two tiers: a dense index, always delivered to the agent, and the raw body, one retrieve away and byte-exact.
If compression misses the floor, or the data is too small: passthrough. The system returns the input intact. By construction, the worst case is paying exactly what you would have paid without the system.
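The passthrough rule above can be sketched in a few lines. This is a minimal illustration, not the NUXS implementation: `count_tokens`, `MIN_TOKENS`, `MIN_MARGIN`, and the toy `dedupe_lines` capsule are hypothetical names and thresholds chosen for the example, since the real floors are not published in this text.

```python
from typing import Callable

# Hypothetical thresholds; the real per-type floors are not stated in the article.
MIN_TOKENS = 50     # below this size, compressing is not worth it
MIN_MARGIN = 0.20   # minimum fraction of tokens a capsule must save


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: whitespace-separated words."""
    return len(text.split())


def compress_or_passthrough(raw: str, capsule: Callable[[str], str]) -> str:
    """Return the capsule output only when it clears the floor; otherwise the input intact."""
    raw_tokens = count_tokens(raw)
    if raw_tokens < MIN_TOKENS:
        return raw  # data too small: passthrough
    compressed = capsule(raw)
    margin = 1 - count_tokens(compressed) / raw_tokens
    return compressed if margin >= MIN_MARGIN else raw  # missed the floor: passthrough


def dedupe_lines(text: str) -> str:
    """Toy capsule: keep each distinct line once, in first-seen order."""
    return "\n".join(dict.fromkeys(text.splitlines()))
```

With this shape, the worst case is the unchanged input, which is the property the paragraph describes.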
Eleven are algorithmic, all deterministic: log, api, network, schema, codebase, diff, test, build, apispec, prompt, and image. The log capsule extracts templates; the API one keeps the schema plus a sample; the codebase one indexes imports and signatures. Six others use an LLM (rag, sql, stack, threads, events, and pdf), with query-aware abstractive digests in the RECOMP style, running on the user's own keys. Finally, three are multimodal: image-LLM, meeting, and video.
The continuous architect
Everyone uses their agents differently (Claude Code, Codex, Cursor, OpenClaw, Hermes, and so on), so I designed the system with an architect running 24/7 on top of real telemetry. It works like this: every hook decision is recorded (intercepted, passed raw, and for what reason); every capsule that fails logs locally and reports to the server via beacon; and each user's coverage is measured per usage profile, separating "saved" from "passed raw by design".
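As a rough sketch of how recorded hook decisions could be turned into a coverage figure, the snippet below separates intercepted traffic from traffic passed raw by design. The `HookDecision` fields and the reason strings are assumptions made for illustration, not the actual NUXS telemetry schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class HookDecision:
    action: str      # "intercepted" or "passed_raw" (hypothetical labels)
    reason: str      # e.g. "capsule:log", "first_read_before_edit", "too_small"
    tokens_in: int   # tokens the agent would have received without the hook
    tokens_out: int  # tokens actually delivered


def summarize(decisions: list[HookDecision]) -> dict:
    """Aggregate coverage, margin, and effective savings from a decision log."""
    total_in = sum(d.tokens_in for d in decisions)
    intercepted = [d for d in decisions if d.action == "intercepted"]
    covered_in = sum(d.tokens_in for d in intercepted)
    saved = sum(d.tokens_in - d.tokens_out for d in intercepted)
    raw_by_reason: Counter = Counter()
    for d in decisions:
        if d.action == "passed_raw":
            raw_by_reason[d.reason] += d.tokens_in
    return {
        "coverage": covered_in / total_in if total_in else 0.0,
        "margin": saved / covered_in if covered_in else 0.0,
        "effective_savings": saved / total_in if total_in else 0.0,
        "passed_raw_by_reason": dict(raw_by_reason),
    }
```

Keeping the reason next to every raw pass is what lets a dashboard show "passed raw by design" as its own bucket instead of folding it into a single savings number.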
This data feeds the system's evolution: size floors per data type, session-epoch deduplication rules, and even the capsules' internal notation. Every symbol is audited against the tokenizer, so if tokenization changes and a symbol starts costing more than a word, the notation readjusts. And no adjustment ships to production without passing the real-task arena. No concrete data leaves the user's machine, and the entire system runs isolated in a container.
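The notation audit described above could look something like the following sketch. The tokenizer is injected as a plain function, and the `toy_costs` table is invented for the example; a real audit would plug in the production tokenizer instead.

```python
from typing import Callable


def audit_notation(
    notation: dict[str, str],
    count_tokens: Callable[[str], int],
) -> dict[str, str]:
    """Return the symbols that cost more tokens than the word they replace."""
    return {
        symbol: word
        for symbol, word in notation.items()
        if count_tokens(symbol) > count_tokens(word)
    }


# Hypothetical token costs standing in for a real tokenizer.
toy_costs = {"→": 1, "leads to": 2, "∴": 3, "so": 1}
flagged = audit_notation({"→": "leads to", "∴": "so"}, toy_costs.__getitem__)
print(flagged)  # symbols whose notation should be readjusted
```

Re-running a check like this whenever the tokenizer changes is one way to catch a symbol that has quietly become more expensive than plain words.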
That may be the real engineering here: not the capsules, but the circuit that audits and adapts them continuously.
And in my own team's data, with each person on a different profile (some on code, some on reviews, some on text), the savings came out different, adapting to each profile. A capsule's mistake taught the system a new way to compress, to map. All done safely and in isolation.
§ 02 The benchmarks, open for audit
Every work profile operates differently: what's economical for one may not be for another. So instead of a marketing number, I built a measurement protocol and published everything, with raw per-sample records (sha256), the harness, and execution logs.
| Run | Tokens processed | Margin |
| Official published benchmark | 180,322,482 | 87.45% |
| F4 battery, usage-weighted | 127,586,488 | 88.44% |
| Wild via production hook | 20,227,044 | 91.97% |
| 100M · v0.5.32 | 96,626,712 | 95.56% |
| 200M · v0.5.33 (current) | 202,021,713 | 95.42% |
626.8 million tokens processed and audited, accumulated across five independent runs, all with public raw artifacts. The cumulative is not a single corpus: runs reuse fixture families under independent mutations, and that's declared.
The current run, of 200M tokens, reached an aggregate margin of 95.42% with zero errors across 9,333 samples, wild fixtures under digit-level mutation (so no cache can inflate the result), and LLM capsules measured with real provider calls.
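To make the idea of digit-level mutation concrete, here is a small sketch, assuming a seeded random generator and a sha256 record per sample. The function names and the record layout are illustrative; the published harness may do this differently.

```python
import hashlib
import random

ASCII_DIGITS = "0123456789"


def mutate_digits(text: str, seed: int) -> str:
    """Replace every ASCII digit with a seeded pseudo-random digit; structure stays intact."""
    rng = random.Random(seed)
    return "".join(
        str(rng.randrange(10)) if ch in ASCII_DIGITS else ch for ch in text
    )


def sample_record(text: str) -> dict:
    """Per-sample audit record: content hash plus size in characters."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {"sha256": digest, "chars": len(text)}


fixture = "2026-06-10 12:00:01 ERROR worker=17 timeout after 3000ms"
variant = mutate_digits(fixture, seed=1)
print(variant, sample_record(variant)["sha256"][:12])
```

Because each variant differs at the digit level, its hash changes while the log's shape stays the same, so a cached response for the original fixture cannot be reused for the mutated one.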
The run through the real production hook, meaning the same binary that runs on the user's machine, processed 20.2M tokens at 91.97% margin, and the per-capsule passthrough rate is published.
Rejected experiments were also published, including variants with up to +46% margin gains, meaning 46% of coverage. That 46% represents around 40 to 42% in effective savings.
A note on method: several capsules also have a fine-tuned variant, and this benchmark deliberately runs without it — the numbers above show NUXS on its open, deterministic core alone. The upcoming community benchmark beyond one billion tokens (§ 07) will measure the system with fine-tuning on.
§ 03 The 7 discoveries the benchmarks forced on me
DISCOVERY 01
Margin is a vanity metric; coverage is the bill.
The 95% above measures what the system intercepts. I instrumented my own real usage (90 Claude Code sessions, 1,621 file reads) and measured coverage. In my code-heavy profile, about 46% of traffic becomes a capsule; the rest passes raw on purpose: first reads of files about to be edited, tiny files, byte-exact content. Real savings is margin × coverage, and in my worst profile that comes out to about 40% of the input bill. Data-heavy profiles, like RAG, logs, and pipelines, sit structurally higher. I haven't found another product that publishes this distinction.
When someone sells you the margin as if it were the bill, be suspicious.
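The margin × coverage relationship is easy to check by hand. The sketch below pairs the 46% code-profile coverage with two margins published in this study (the 87.45% official run and the 91.62% aggregate); choosing those two margins is my assumption about where the 40 to 42% range comes from, not something the text states.

```python
def effective_savings(margin: float, coverage: float) -> float:
    """Share of the input bill removed: margin only applies to covered traffic."""
    if not (0.0 <= margin <= 1.0 and 0.0 <= coverage <= 1.0):
        raise ValueError("margin and coverage must be fractions in [0, 1]")
    return margin * coverage


COVERAGE_CODE_PROFILE = 0.46  # from the replay of 90 real sessions

for label, margin in [("official run", 0.8745), ("aggregate", 0.9162)]:
    print(f"{label}: {effective_savings(margin, COVERAGE_CODE_PROFILE):.1%}")
# official run: 40.2%
# aggregate: 42.1%
```

The same function shows why a higher-coverage profile moves the bill far more than a few extra points of margin: at 90% coverage, the aggregate margin would remove about 82% of the input bill.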
DISCOVERY 02
Compression compounds.
Parsing the usage records of 30,000 turns across those sessions (Claude with prompt caching enabled), 78 to 92% of the cost was cache reads, meaning the entire context re-read on every turn. Output was 8 to 22%. So a token compressed out of the context doesn't save once: it saves on every remaining turn of the session, and it also delays the compaction that degrades long sessions. The layer I built to cut input was aimed, without my knowing it, at the dominant term of the bill.
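A simplified model of this compounding effect: a payload that enters the context at some turn is read on that turn and again on every later turn. The sketch below counts those reads in tokens only. It deliberately ignores price differences between fresh input and cache reads, and it ignores compaction; the session sizes are made-up example values.

```python
def billed_reads(payload_tokens: int, entered_at_turn: int, total_turns: int) -> int:
    """Tokens billed for one payload: read on the turn it enters and on every later turn."""
    if not 1 <= entered_at_turn <= total_turns:
        raise ValueError("entered_at_turn must be within 1..total_turns")
    return payload_tokens * (total_turns - entered_at_turn + 1)


# Hypothetical session: a 10,000-token log enters at turn 5 of 50.
raw = billed_reads(10_000, entered_at_turn=5, total_turns=50)     # 460,000
capsule = billed_reads(1_000, entered_at_turn=5, total_turns=50)  # 46,000 at a 90% margin
print(f"avoided over the session: {raw - capsule:,} tokens")      # 414,000
```

The earlier the payload enters and the longer the session runs, the larger the multiplier on each token removed.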
DISCOVERY 03
The savings number doesn't come from compressing, but from mapping.
The capsules that deliver the most savings aren't always the ones that compress the hardest. You can see 99.9% of tokens saved on one type, but the one sitting at 45% ends up being the most used, the one that comes back the most and covers the most of the system. Around every corner the agent turns, there's a data chase it will consume. And it is, in fact, volume and repetition that generate the real savings.
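The aggregate table in § 08 illustrates this directly. The sketch below takes six of its rows and ranks them two ways, by margin and by tokens actually saved; the row selection is mine, the numbers are copied from the table.

```python
# (capsule, tokens processed, tokens saved), copied from the aggregate table.
rows = [
    ("log", 93_336_248, 92_478_692),
    ("codebase", 68_100_699, 57_771_407),
    ("diff", 63_931_733, 51_418_235),
    ("api", 55_406_079, 55_329_455),
    ("schema", 28_994_463, 16_818_449),
    ("image", 2_017_139, 2_002_654),
]

by_margin = [name for name, processed, saved in
             sorted(rows, key=lambda r: r[2] / r[1], reverse=True)]
by_saved = [name for name, processed, saved in
            sorted(rows, key=lambda r: r[2], reverse=True)]

print("by margin:", by_margin)
print("by saved: ", by_saved)
```

In this subset, image ranks second by margin yet last by tokens saved, while codebase, at under 85% margin, saves more tokens than api at 99.9%.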
DISCOVERY 04
Green numbers lie; only the task tells the truth.
This was the most uncomfortable discovery, and it came from two real failures. The first: my image capsule scored 99.4% compression, with every guard passing, and it was blinding the agent. Instead of the image, it received a metadata pointer (image png 1024×1024, 8bit) and improvised without seeing the screenshot, mockup, or diagram. This ran silently for 29 releases, because compression guards measure compression, and none measured whether the agent could still do the task.
The second: the PDF capsule died in production after an API change in a dependency, and the failure was silent. Data fell through to raw, with no visible error, and a customer would get zero PDF compression. It was the protocol's own skip log that exposed it.
Both failures were fixed, the benchmark gained retroactive caveats, and a release rule was born: every capsule that replaces whole content must pass a real-task arena (a real agent doing real work on top of the capsule) before shipping.
It's possible to display 99% compression while breaking the user's task, and no margin metric will ever tell you. If you evaluate any product in this market, demand to see how it measures task quality, not just compression ratio.
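The release rule can be pictured as a gate with two independent conditions. The following is a hypothetical sketch: `ArenaResult`, `release_allowed`, and the `min_margin` default are illustrative names and values, not the actual NUXS release tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArenaResult:
    task: str
    ok_with_raw: bool      # did the agent complete the task on the raw input?
    ok_with_capsule: bool  # did it complete the same task on the capsule output?


def release_allowed(
    margin: float,
    results: list[ArenaResult],
    min_margin: float = 0.5,  # hypothetical compression guard
) -> bool:
    """Ship only if compression is good AND no task regressed under the capsule."""
    regressions = [r for r in results if r.ok_with_raw and not r.ok_with_capsule]
    return bool(results) and margin >= min_margin and not regressions


# A capsule with a near-perfect margin that blinds the agent is still blocked.
image_runs = [ArenaResult("describe the screenshot", True, False)]
print(release_allowed(0.994, image_runs))
```

Requiring at least one arena result means a capsule with no task evidence cannot ship on compression numbers alone.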
DISCOVERY 05
The first compressed context and the data chase are the key to savings.
Understanding the system of calls and families helped me see that we need to compress before a call even exists, and the earlier you compress, the better. Token delimitation is also crucial. The capsules' specialty in certain isolated situations didn't help fence in what actually mattered: every token that passed through my agent.
DISCOVERY 06
The user's profile and organization are the clearest key to savings.
1M context windows that aren't needed for certain tasks. History of completed tasks accumulated and loaded continuously. Simple housekeeping facts that change the economics drastically. My effective margin sat for a long time below the 1% mark, simply because of accumulated junk. Accumulating junk distorts your numbers.
Organization should happen by sessions that complete tasks: once a task is finished, open a new window and keep operating in the same epic. Familiarity still builds up and generates much greater savings. In numbers, the accumulated waste is staggering, which is why session sanitization is always necessary. We're talking billions of unnecessary tokens. In my case, 1.36 billion.
Effective margin would never be satisfactory without organization. So it's no use consuming token-saving tools without being organized: your session will sprawl and burn useless tokens. NUXS couldn't automate this organization, but we built a notification mechanism, a radar that guides the client through their session.
DISCOVERY 07
Not everything needs a high-intelligence model for execution.
Most tasks don't require a high-end model. Many tasks are trivial, and cheaper models can handle them solidly, as long as they're well guided. The architecture can't be left to the cheapest models, since they make too many mistakes and accumulate tokens through rework. But, with good planning and proper guidance, these models reveal a surprising accuracy, with savings of up to 179x without losing execution quality.
Intelligent routing with hierarchy protocols is the key to even greater savings. In test runs with models Claude Code Opus 4.7, Claude Code Opus 4.8, and Claude Code Fable 5/Mythos, the interplay between higher and lower-cost models was a success: the protocols were obeyed, and savings reached up to 179x in tokens for isolated tasks. The number is driven by the model's token cost, with an extra boost from the capsules.
That's how the economy mode was created, which operates in two variants, with intelligent protocols and a simple on/off switch. One protocol that worked very well was delegating a coding task to a lower-tier model, and when it got an execution wrong, the task was immediately reassigned to the higher-tier model: it avoids rework, and the cost in relation to the error margin ends up positive for the user.
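The delegate-then-escalate protocol described above can be sketched as a small routing function. The model callables and the `verify` check are placeholders for illustration; how NUXS actually detects a wrong execution is not described in this text.

```python
from typing import Callable

Model = Callable[[str], str]


def run_with_escalation(
    task: str,
    cheap: Model,
    strong: Model,
    verify: Callable[[str], bool],
) -> tuple[str, str]:
    """Try the lower-tier model first; on a failed check, reassign to the higher tier."""
    result = cheap(task)
    if verify(result):
        return "cheap", result
    # One failed execution triggers immediate reassignment, avoiding repeated rework.
    return "strong", strong(task)


# Stub models standing in for real API calls.
def cheap_model(task: str) -> str:
    return "def add(a, b): return a - b" if "add" in task else "ok"


def strong_model(task: str) -> str:
    return "def add(a, b): return a + b"


def passes_tests(code: str) -> bool:
    return "a - b" not in code


print(run_with_escalation("write add()", cheap_model, strong_model, passes_tests))
```

The design choice here is a single escalation step rather than retries on the cheap model, which matches the goal of avoiding accumulated rework tokens.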
Some important caveats that weren't part of the study, since that wasn't the objective. Planning should always be done by the high-intelligence models, leaving execution planned out. And in our operating model we already ship the architecture ready for the user to use in the way we found best for quality and cost-benefit.
§ 04 The results for my team
For those using the system via API, paying token by token, the savings were evident and clear on the bill. For those on subscription plans, usable time went up consistently, and the model's agility was just as evident.
The organization of my team, on its own, already produced significant savings. The protocols we generated and followed were a success, and alternating with the economy mode amplified the leap. In short, organization to avoid waste, the capsules, our intelligent saving mechanism, and alternation with the economy mode together generate satisfactory effective savings.
Jevons' Paradox applied to tokens
However, you have to understand Jevons' Paradox (William Stanley Jevons, English economist, 1865). His original observation: when steam engines became more efficient with coal, England's total coal consumption went up instead of down. Efficiency made steam cheaper, the lower price enabled more uses, and those uses consumed more coal than the efficiency saved.
I saw this play out with me and my team. The more we saved, the more room we had to accelerate the project, more screen time, more qualified use of the models, and we ended up using more. The bill with its spending ceiling stayed the same, but we used the product much more and produced much more than before with the same budget. And here's where the paradox lands: we accepted expanding the budget and our AI cost precisely because it let us accelerate the project further and further.
At the end of this equation, it turns out to be profitable for the companies that supply the intelligence models to have other companies driving savings optimizations. Because users end up accepting more spending when they see clear results.
The system was used very well in Claude Code, Codex, Gemini, Cursor, OpenClaw, Hermes, and several other AIs.
§ 05 Summary of what I found in the studies
The 626M processed and the 91.62% margin come from the benchmark (fixtures, audited run). The 46% coverage comes from a separate study (replay of 90 real sessions, code profile).
| Metric | Value | Note |
| Engine margin | 91.62% | Auditable, raws kept. |
| Coverage (code profile) | 46% | Text / RAG clients are higher. |
| Real effective savings | 40–42% | Of the input bill, with cache compound effect. |
The effective savings margin is found in the mathematical relationship between total tokens in calls versus tokens saved. But the user's profile dictates this result. So one user's numbers don't translate to another. And this effectiveness number doesn't discredit the per-capsule savings margin; quite the opposite, those are the cleanest, most economical margins of all. But it shows that economic effectiveness comes from a set of factors. The way the user employs the models will significantly alter this result.
§ 06 Simple organization protocols
| 01 | 1M context only on dense tasks. For simpler tasks, use smaller contexts. |
| 02 | Sanitize completed tasks. Accumulating junk distorts your numbers. |
| 03 | Central planning session. Split into epics and tasks per epic, and open new sessions for each task. |
| 04 | Use the capsules. The most direct route to optimizing token spend. |
| 05 | Intelligent router on simple tasks. Reserve high-intelligence models for what actually demands them. |
| 06 | Don't delegate delicate tasks to weaker models. Avoids heavy rework. Cheap can come out expensive. |
This protocol optimizes user savings significantly. The dependency here is on human intervention. The rest of the technology operates between the lines of the protocol.
§ 07 Test and measure it yourself
That's what I'm inviting you to do today with NUXS: you don't have to believe, you can measure. The system adapts to each user's style, and the dashboard separates "saved" from "passed raw by design", so you can audit your own coverage instead of trusting mine.
I believe this can help many people save much more on their bills. I left it free to use, and you can test it at will. You can use it directly or try the compression via the playground.
It's at nuxs.ai, with a free tier that includes the 11 algorithmic capsules, running locally or via MCP (the installer figures it out). The raw artifacts of every benchmark are at nuxs.ai/benchmark, with the path to the GitHub repository.
I've already decided to run a new audit and benchmark to surpass 1 billion tokens, open to the community and live. Date to be defined with whoever wants to help with the new studies.
And so, one piece of work at a time. NUXS is now open core: the first five algorithmic capsules (log, apispec, prompt, network, and image) are open under AGPL-3.0 on GitHub, while the rest of the engine stays proprietary. The aggregate distribution table from all audited runs is below.
§ 08 Aggregate per-capsule distribution
626.8M cumulative · 5 audited runs
| Capsule | Class | Tokens processed | Tokens saved | Margin |
| log | algorithmic | 93,336,248 | 92,478,692 | 99.1% |
| codebase | algorithmic | 68,100,699 | 57,771,407 | 84.8% |
| diff | algorithmic | 63,931,733 | 51,418,235 | 80.4% |
| api | algorithmic | 55,406,079 | 55,329,455 | 99.9% |
| prompt | algorithmic | 49,918,938 | 49,571,251 | 99.3% |
| rag | llm | 42,674,288 | 39,067,940 | 91.5% |
| build | algorithmic | 39,381,798 | 38,577,940 | 98.0% |
| test | algorithmic | 36,577,472 | 33,486,726 | 91.6% |
| schema | algorithmic | 28,994,463 | 16,818,449 | 58.0% |
| network | algorithmic | 28,443,603 | 28,257,309 | 99.3% |
| apispec | algorithmic | 22,007,114 | 18,966,567 | 86.2% |
| stack | llm | 21,944,586 | 20,138,460 | 91.8% |
| threads | llm | 20,308,235 | 18,773,742 | 92.4% |
| events | llm | 19,003,173 | 18,769,332 | 98.8% |
| sql | llm | 15,555,985 | 15,260,167 | 98.1% |
| pdf | llm | 15,276,099 | 13,734,220 | 89.9% |
| session† | retired | 3,901,389 | 3,825,574 | 98.1% |
| image | algorithmic | 2,017,139 | 2,002,654 | 99.3% |
| TOTAL | | 626,779,041 | 574,248,120 | 91.62% |
Sum of each capsule across the five runs, TEXT and CODE profiles where applicable. Tokenizer cl100k_base; margin is saved divided by processed on volume basis. †session is a retired capsule (no longer part of the 17-capsule product), kept in the table only to reconcile with the audited total. The three multimodal capabilities (image-LLM, meeting, video) are measured in a separate 1M-token phase, with real provider calls, and don't compose this text/code total.
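The "volume basis" in the note above can be reproduced from the table itself. This sketch recomputes the totals and the aggregate margin from the per-capsule rows and contrasts it with a plain average of the per-capsule margins; the numbers are copied from the table, the variable names are mine.

```python
# (capsule, tokens processed, tokens saved), copied from the table above.
TABLE = [
    ("log", 93_336_248, 92_478_692), ("codebase", 68_100_699, 57_771_407),
    ("diff", 63_931_733, 51_418_235), ("api", 55_406_079, 55_329_455),
    ("prompt", 49_918_938, 49_571_251), ("rag", 42_674_288, 39_067_940),
    ("build", 39_381_798, 38_577_940), ("test", 36_577_472, 33_486_726),
    ("schema", 28_994_463, 16_818_449), ("network", 28_443_603, 28_257_309),
    ("apispec", 22_007_114, 18_966_567), ("stack", 21_944_586, 20_138_460),
    ("threads", 20_308_235, 18_773_742), ("events", 19_003_173, 18_769_332),
    ("sql", 15_555_985, 15_260_167), ("pdf", 15_276_099, 13_734_220),
    ("session", 3_901_389, 3_825_574), ("image", 2_017_139, 2_002_654),
]

total_processed = sum(processed for _, processed, _ in TABLE)
total_saved = sum(saved for _, _, saved in TABLE)

# Volume basis: one ratio over all tokens, so big capsules weigh more.
volume_margin = total_saved / total_processed
# Unweighted alternative: every capsule counts the same regardless of volume.
mean_of_margins = sum(saved / processed for _, processed, saved in TABLE) / len(TABLE)

print(f"{total_processed:,} processed, {total_saved:,} saved")
print(f"volume-basis margin: {volume_margin:.2%}")
print(f"mean of per-capsule margins: {mean_of_margins:.2%}")
```

The rows reconcile with the published TOTAL line, and the unweighted mean comes out slightly higher than the volume-basis figure, since large, lower-margin capsules such as codebase and diff pull the weighted number down.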
Sources: docs-publicos/benchmarks/{2026-06-05-oficial-publicado, 2026-06-10-100m-v0532, 2026-06-10-200m-v0533, 2026-06-10-f4-200m, 2026-06-10-wild-hook}. Raw artifacts, harness, and logs at nuxs.ai/benchmark.
Try it · Audit it · Fork it
Don't believe — measure.
Free tier. 11 algorithmic capsules running locally or via MCP. No account required to test the playground. Audit your own coverage and decide for yourself.