Technical study · 1B audited tokens · June 2026
99% compressed, 1% on the bill: I audited 1B tokens to find out why.
Josué Ramos · NUXS · June 2026 · opencore
Late last year I went through this firsthand: massive token burn on my team. And a paradox you might recognize. The cheaper tokens got, the higher the bill came in. Better, faster models invite heavier use, and consumption grows faster than prices fall. Budgets kept rising instead of leveling off.
I tried everything to cut costs. Eventually I looked into data-compression companies and GitHub repos, and here I had a home-field advantage: I come from building bots for prediction markets, and squeezing data is what I've done for years. To my surprise, it was disappointing. What I found compressed text as text, in the academic state-of-the-art style (LLMLingua and its successors): it drops the tokens with the lowest statistical weight. Works for prose. Fails on exactly what an agent eats all day, like logs, SQL schemas, diffs, stack traces, test output, and API responses. Generic compression fails the same way.
To be clear about where the cost comes from: an LLM generates spend on four sides of the bill, which are input, cache write, cache read, and output. Data compression never touches the output, one of the biggest bottlenecks. And in every compressor I looked at, I ran into the same thing. They sold a high compression ratio as if it were savings, with no clear study behind it. It isn't the same thing, and that gap was exactly what I wanted to measure.
Tired of all this, I started working on an architecture of my own. The idea was that the real savings trigger wasn't compressing more, but mapping and tracking correctly what passed through, with savings ceilings that depend on how each person uses it. My capsules reach a compression margin of up to 99.9%. But how much of that actually comes off my bill? That was the question that pulled the rest along.
The intuition behind it is simple. Good mapping gives the AI only what it needs to keep working, meaning the fewest tokens in the end. And it isn't only money. It gets faster and more accurate, because the model's attention is finite and a clean context goes further.
But compressing input wasn't enough. I had to look at the output too.
§ 01 How this became engineering
Structured data isn't prose. A log is a pattern that repeats with variations; a schema is a shape, not a sentence. So instead of a generic compressor, I wrote 20 specialized parsers (the capsules and the multimodals), each with its own opinion about what's signal and what's noise in its format.
But the engineering that matters isn't in the capsules on their own. It's in the three-layer engine on top of them. Each layer reaches savings the previous one can't, and they work like a filter, in this order:
Layer 1, Capsule (runs first). 17 specialist capsules plus 3 multimodals. When the data is the capsule's type, it compresses better than any generic compressor while preserving structure: 87 to 95% margin. What it doesn't recognize, it lets through on purpose. In specialist use (hundreds of PDFs, a shopping agent on the multimodal), the capsule alone already cuts cost heavily within its parameters and hands the rest off to what comes next.
Layer 2, Squeeze (runs next, on what's left). It intercepts the traffic that had no dedicated capsule and keeps the reference recoverable, so the agent restores the data when it needs to. What the capsule already compressed doesn't pass through Squeeze again. The result: coverage rises from ~46% (from the capsules) to as much as 84% (the ceiling reached in the study), and effective margin reaches 80.8%, which is the number actually saved.
Layer 3, Economy (reaches the output). Capsule and Squeeze act on input only. The output is the side of the bill no compression touches. Economy generates substantial savings there through smart routing between models, with the exact amount varying by task and usage profile.
Architecture: the three-layer engine (Agent · Hook)
Layer 1 · Capsule (runs first)
Algorithmic (11): log · api · network · schema · codebase · diff · test · build · apispec · prompt · image. Deterministic. Zero marginal cost. Runs locally.
LLM-based (6): rag · sql · stack · threads · events · pdf. Query-aware abstractive digest (RECOMP) on the user's own keys.
Multimodal (3): image-LLM · meeting · video. Measured in a separate phase with real provider calls.
Delivered by Layer 1: a dense index, always delivered to the agent, with the raw body one retrieve away, byte-exact. What slips past the capsules goes on to Layer 2.
Layer 2 · Squeeze (runs on what's left)
Intercepts traffic with no dedicated capsule. Maps the conversation between turns, expels cold context, leaves a recoverable reference. What a capsule already compressed doesn't re-enter Squeeze. Coverage up to 84% · 80.8% effective · 99.1% compression. Input handled; output remains.
Layer 3 · Economy (reaches the output)
Capsule and Squeeze act on input only. Economy routes generation to a cheaper model on the user's own keys, and closes the only side that was left open: the output. Substantial output savings via smart routing.
If compression doesn't meet the floor, or the data is too small, it's passthrough: the input is returned intact. By construction, the worst case is paying exactly what you would have paid without the system.
The capsules split into three classes: algorithmic (11) — log · api · network · schema · codebase · diff · test · build · apispec · prompt · image. LLM-based (6) — rag · sql · stack · threads · events · pdf. Multimodal (3) — image-LLM · meeting · video.
In all of them, the agent gets a dense index and the raw body stays one query away, byte-exact. What a capsule lets through untouched is where Squeeze comes in.
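The filter order described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the NUXS engine: `route`, `MIN_TOKENS`, `FLOOR`, and the toy `collapse` compressor are hypothetical names, the thresholds are invented, tokens are approximated by whitespace splitting, and the size floor is assumed to apply to both layers.

```python
from typing import Callable, Optional, Tuple

MIN_TOKENS = 20  # hypothetical size floor: smaller payloads pass raw
FLOOR = 0.30     # hypothetical minimum margin for a compression to be kept

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer.
    return len(text.split())

def route(payload: str,
          capsule: Optional[Callable[[str], str]],
          squeeze: Optional[Callable[[str], str]]) -> Tuple[str, str]:
    """Return (layer, delivered_text) for a single payload."""
    before = count_tokens(payload)
    if before >= MIN_TOKENS:
        # Layer 1 runs first; Layer 2 only sees what Layer 1 did not keep.
        for layer, compress in (("capsule", capsule), ("squeeze", squeeze)):
            if compress is None:
                continue
            candidate = compress(payload)
            if 1 - count_tokens(candidate) / before >= FLOOR:
                return layer, candidate  # compressed once, never re-squeezed
    # Worst case: the input is returned intact.
    return "passthrough", payload

log = "\n".join(f"2026-06-01 INFO request {i} ok" for i in range(50))

def collapse(text: str) -> str:
    # Toy compressor: keep the first line plus a repeat count.
    lines = text.splitlines()
    return f"{lines[0]}\n... x{len(lines)}"

print(route(log, collapse, collapse)[0])             # capsule
print(route(log, None, collapse)[0])                 # squeeze
print(route("tiny payload", collapse, collapse)[0])  # passthrough
```

The property worth noticing is the last `return`: whatever the compressors do, a payload that fails the floor comes back unchanged, which is what bounds the worst case at the uncompressed cost.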
The continuous architect
Everyone uses their agents differently (Claude Code, Codex, Cursor, and so on), so I designed the system with an architect running 24/7 over real telemetry. Every hook decision is logged (intercepted, passed raw, reason); every capsule that fails logs locally and reports via beacon, with counters, not content. Each user's coverage is measured by profile, separating "saved" from "passed raw by design."
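A minimal sketch of what a content-free decision log and a per-profile coverage report could look like. The record fields, the reason strings, and `coverage_report` are assumptions made for illustration; the actual telemetry schema is not described in this article.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable

@dataclass(frozen=True)
class HookDecision:
    action: str      # "intercepted" or "passed_raw"
    reason: str      # hypothetical labels, e.g. "capsule:log", "first_read"
    tokens_in: int   # counters only, never the content itself
    tokens_out: int

def coverage_report(decisions: Iterable[HookDecision]) -> dict:
    totals = Counter()
    for d in decisions:
        totals["total"] += d.tokens_in
        if d.action == "intercepted":
            totals["intercepted"] += d.tokens_in
            totals["saved"] += d.tokens_in - d.tokens_out
        else:
            totals["raw:" + d.reason] += d.tokens_in
    total = totals["total"] or 1  # avoid dividing by zero on an empty log
    return {
        "coverage": totals["intercepted"] / total,
        "effective_margin": totals["saved"] / total,
        "passed_raw_by_reason": {
            key[4:]: value for key, value in totals.items() if key.startswith("raw:")
        },
    }

report = coverage_report([
    HookDecision("intercepted", "capsule:log", 800, 80),
    HookDecision("passed_raw", "first_read", 200, 200),
])
print(report)
```

Keeping the raw-by-design traffic grouped by reason is what lets a report distinguish "saved" from "passed raw by design" instead of folding both into one ratio.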
That data feeds the system's evolution: size floors per data type, deduplication rules per session-epoch, and even the capsules' internal notation. Every symbol is audited against the tokenizer. If tokenization changes and a symbol starts costing more than a word, the notation readjusts. Nothing goes to production without passing through the real-task arena first. Security is a priority in the design, with execution in an isolated environment.
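The notation audit can be pictured as a simple comparison of token costs. The sketch below assumes a hypothetical `count_tokens` callable supplied by the caller (in practice it would wrap the target model's tokenizer) and a toy cost table; neither comes from the NUXS codebase.

```python
from typing import Callable, Dict

def audit_notation(notation: Dict[str, str],
                   count_tokens: Callable[[str], int]) -> Dict[str, str]:
    """Keep a symbol only while it costs no more tokens than the word it replaces."""
    audited = {}
    for word, symbol in notation.items():
        if count_tokens(symbol) <= count_tokens(word):
            audited[word] = symbol
        else:
            audited[word] = word  # the symbol became more expensive: fall back
    return audited

# Toy costs standing in for a real tokenizer (assumed values).
toy_costs = {"→": 1, "returns": 1, "∴": 3, "therefore": 1}

def toy_tokenizer(text: str) -> int:
    return toy_costs.get(text, len(text))

print(audit_notation({"returns": "→", "therefore": "∴"}, toy_tokenizer))
```

Rerunning a check like this whenever the tokenizer changes is the cheap part; the point is that the notation is treated as data to be re-validated, not as a fixed design decision.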
What makes the system work isn't the loose capsules. It's the three-layer engine plus the circuit that audits everything and keeps adapting.
§ 02 The benchmarks, open for audit
Every work profile operates differently, so variation shows up in use. That's why, instead of a marketing number, I built a measurement protocol and published all of it: raw per-sample records (sha256), harness, and execution logs.
The study passed 1 billion audited tokens: 1,026,804,861 tokens across two independent tracks. The capsules processed 626.8M over five rounds and closed at 91.62% weighted margin (saved ÷ intercepted). Squeeze processed 400M in one main round and delivered 80.8% effective margin (saved ÷ total, the real saving on the bill) at 99.1% compression and 81.5% coverage, with quality 8.57/10 and zero unrecoverable across 138 judgments (46 items × 3 judges; rubric and judge models documented in the repo). Two different denominators, reported side by side and never merged.
| Run | Tokens processed | Margin |
| Official capsule benchmark (v0.1.57) | 180,322,482 | 87.45% |
| F4 battery, usage-weighted | 127,586,488 | 88.44% |
| Wild via production hook | 20,227,044 | 91.97% |
| 100M · v0.5.32 | 96,626,712 | 95.56% |
| 200M · v0.5.33 | 202,021,713 | 95.42% |
| Squeeze 400M (v0.5.73) | 400,020,422 | 80.8% effective |
| 1B audited (cumulative) | 1,026,804,861 | — |
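The totals in the table can be re-added directly. The snippet below only uses the figures published above; the weighted margin it computes is rebuilt from the per-run margins rounded to two decimals, so it is an approximation of the published 91.62%, not a replacement for the raw records.

```python
# (tokens processed, published margin) per capsule run, copied from the table
capsule_runs = {
    "Official capsule benchmark (v0.1.57)": (180_322_482, 0.8745),
    "F4 battery, usage-weighted": (127_586_488, 0.8844),
    "Wild via production hook": (20_227_044, 0.9197),
    "100M · v0.5.32": (96_626_712, 0.9556),
    "200M · v0.5.33": (202_021_713, 0.9542),
}
squeeze_tokens = 400_020_422

capsule_tokens = sum(tokens for tokens, _ in capsule_runs.values())
total_audited = capsule_tokens + squeeze_tokens

# Token-weighted margin across the five capsule runs.
weighted_margin = sum(t * m for t, m in capsule_runs.values()) / capsule_tokens

print(capsule_tokens)   # 626784439
print(total_audited)    # 1026804861
print(f"{weighted_margin:.2%}")
```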
The run through the real production hook (the same binary that runs on the user's machine) processed 20.2M tokens at 91.97% margin, and the per-capsule passthrough rate is published. Rejected experiments were published too.
The core evolution was coverage. The capsule alone touched ~46% of code traffic (effective savings of ~40%). With Squeeze catching what gets past the specialists, coverage rose to 81.5% and effective savings to 80.8%, proven on a real bill in the benchmark.
§ 03 The 8 points the benchmarks reveal
POINT 01
Margin is a vanity metric.
The 95% above measures only what the system intercepts. I instrumented my own usage across 100 Claude Code sessions, thousands of file reads, and measured coverage. On my code-heavy profile, the capsule alone covers only ~46% of the traffic; the rest passes raw on purpose, like first reads of files about to be edited, tiny files, and content that has to stay byte-exact. With Squeeze catching that rest, coverage rises to 81.5%, and effective savings (saved ÷ total) reach 80.8% of the input bill. Data-heavy profiles, like RAG, logs, and pipelines, come out structurally higher.
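The relation between the three numbers follows from their definitions: margin is saved ÷ intercepted, coverage is intercepted ÷ total, so effective margin (saved ÷ total) is their product. A small sketch, with `effective_margin` as an illustrative helper name:

```python
def effective_margin(coverage: float, margin: float) -> float:
    """saved/total = (intercepted/total) * (saved/intercepted)."""
    return coverage * margin

# Squeeze run: 81.5% coverage at 99.1% compression.
print(f"{effective_margin(0.815, 0.991):.1%}")  # 80.8%

# Capsules alone on the code profile: ~46% coverage at the low end of the 87-95% band.
print(f"{effective_margin(0.46, 0.87):.1%}")    # 40.0%

# The title's case: a 99.9% margin on roughly 1% of the traffic.
print(f"{effective_margin(0.01, 0.999):.1%}")   # 1.0%
```

The last line is an implied figure, not a published one: a 1% effective margin at 99.9% compression only works out if coverage was on the order of 1%.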
When someone sells you the margin as if it were the bill, be skeptical.
POINT 02
Compression compounds.
Parsing the records from 30,000 turns across these sessions (Claude with prompt caching on), cache read was one of the biggest costs on the input side: the entire context reread every turn. So a token compressed out of the context doesn't save once. It saves on every remaining turn of the session, and it also delays the compaction that degrades long sessions. The layer I built to cut input was aimed at one of the heaviest terms on the bill, and that's why it compounds over the session instead of paying off once and stopping.
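A back-of-the-envelope model of the compounding effect, assuming the whole context is re-read as cache read on every later turn and ignoring compaction and cache expiry. `cache_reads_avoided` and the example numbers are illustrative, not measurements from the study.

```python
def cache_reads_avoided(saved_tokens: int, entry_turn: int, total_turns: int) -> int:
    """Tokens of cache read avoided when `saved_tokens` never enter the context.

    The payload would have entered at `entry_turn` and been re-read on
    every remaining turn of the session.
    """
    remaining_turns = max(total_turns - entry_turn, 0)
    return saved_tokens * remaining_turns

# A 10,000-token log compressed to 1,000 tokens at turn 5 of a 40-turn session:
print(cache_reads_avoided(9_000, entry_turn=5, total_turns=40))  # 315000
```

Under these assumptions the one-time saving of 9,000 input tokens is small next to the 315,000 cache-read tokens that never get billed, which is why early interception in a long session matters more than late interception.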
POINT 03
The savings number doesn't come from compressing, it comes from mapping.
The capsules that save the most aren't always the ones that compress the most, because volume dominates the bill. Sometimes a capsule not recognizing a piece of data generates more savings, because the data falls into Squeeze, which intercepts and compresses it. The capsules are specialists and compress better, but even with 17 capsules and 3 multimodals I couldn't cover all the traffic. Mapping the flow became what actually generates savings. Compression happens in stages: first the specialists (the capsules) intercept what they can; then what passes falls into Squeeze, which compresses it without compromising quality. That way, up to ~84% of what passes through input is covered depending on the profile (the study ceiling), with 81.5% measured on my code profile. That leaves the output, and only Economy reaches it, by routing to a well-targeted model without losing quality.
POINT 04
A green number lies; only the task tells the truth.
This was the most uncomfortable point, and it came from two real failures. The first: my image capsule reported 99.4% compression, with all the guards passing, and it was blinding the agent. Instead of the image, it received a metadata pointer (image png 1024×1024, 8bit) and improvised without ever seeing the screenshot, mockup, or diagram. This ran silently for 29 releases, because compression guards measure compression, and none measured whether the agent could still do the task.
The second: the PDF capsule died in production after an API change in a dependency, and the failure was silent. The data fell through to raw, with no visible error, and a customer would have had zero compression on PDF. What exposed the problem was the protocol's own skip log.
I fixed both, the benchmark got retroactive caveats, and a release rule was born: every capsule that replaces whole content has to pass through a real-task arena (a real agent doing real work on top of it) before it ships. The image capsule is usable today and saves on purpose. On the first call it lets the image through so the agent sees the content; on later reappearances of the same image in the session, it swaps in the description already recorded. That's where the saving happens, without the agent rereading the whole image every turn.
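The fixed behavior of the image capsule (pass the image on first sight, swap in the recorded description afterwards) can be sketched as a per-session cache keyed by content hash. `ImageSessionCache` and the `describe` callable are hypothetical; how the real capsule obtains its description is not specified here.

```python
import hashlib
from typing import Callable, Dict, Tuple, Union

class ImageSessionCache:
    """Deliver an image once per session, then its recorded description."""

    def __init__(self) -> None:
        self._descriptions: Dict[str, str] = {}

    def deliver(self, image_bytes: bytes,
                describe: Callable[[bytes], str]) -> Tuple[str, Union[bytes, str]]:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._descriptions:
            # Reappearance: the agent already saw the pixels, send the text.
            return "description", self._descriptions[key]
        # First call: record a description and let the image through intact.
        self._descriptions[key] = describe(image_bytes)
        return "image", image_bytes

cache = ImageSessionCache()
first = cache.deliver(b"fake-png-bytes", lambda _: "login screen mockup")
second = cache.deliver(b"fake-png-bytes", lambda _: "login screen mockup")
print(first[0], second[0])  # image description
```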
You can show 99% compression while breaking the user's task, and no margin metric will warn you. If you're going to evaluate any product in this market, demand to see how it measures task quality, not just compression ratio.
POINT 05
The right accounting gives you room to map better.
The compressible data is input, cache read, and cache write, each with a different dollar value and a different volume. Just summing saved tokens doesn't give you the real margin, because you can't tell what kind of data got compressed. With the right system, you can estimate with precision what type of data it was, assign the correct token value, and also map what passed without compression so you can plan a strategy to intercept it later. I split the accounting by input, cache read, and cache write, with a stamp for intercepted tokens and a map for the ones that passed raw. I cross that with the price of the model in use and the type of usage, and arrive at coverage margin and effective margin, which is what actually comes off the bill. That strict accounting holds up the rule across the whole study: input savings count only on input, output savings only on output, never summed. The one exception is Economy, which affects both ends and is therefore reported on both, separately.
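A minimal sketch of per-bucket accounting, assuming placeholder prices (they are not any provider's price list) and hypothetical names `Bucket` and `bill_savings`. It shows why the same number of saved tokens is worth different amounts depending on where they sat on the bill.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Bucket:
    total_tokens: int     # must be > 0
    saved_tokens: int
    usd_per_mtok: float   # placeholder price per million tokens

def bill_savings(buckets: Dict[str, Bucket]) -> Dict[str, Dict[str, float]]:
    """Report each side of the bill separately; sides are never summed."""
    report = {}
    for name, bucket in buckets.items():
        report[name] = {
            "token_margin": bucket.saved_tokens / bucket.total_tokens,
            "usd_saved": bucket.saved_tokens / 1_000_000 * bucket.usd_per_mtok,
        }
    return report

report = bill_savings({
    "input": Bucket(2_000_000, 1_000_000, usd_per_mtok=3.0),
    "cache_write": Bucket(2_000_000, 1_000_000, usd_per_mtok=4.0),
    "cache_read": Bucket(50_000_000, 1_000_000, usd_per_mtok=0.3),
})
for side, row in report.items():
    print(side, row)
```

With these placeholder prices, one million saved tokens is worth 3.0, 4.0, or 0.3 dollars depending on the bucket, so a single "tokens saved" total would hide most of the information.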
POINT 06
The user's profile and organization are the clearest key to savings.
A 1M context window the task didn't need. A history of completed tasks piling up and reloading every turn. These are simple things to clean up that change the whole economy. Accumulated junk distorts your numbers.
The right organization is by sessions that complete tasks: finish, open a new window, and keep going on the same epic. Familiarity happens anyway, and the savings are much higher. Accumulated waste is staggering, and that's why session sanitization is always necessary. In my case, it was 1.36 billion unnecessary tokens.
Without organization, no effective margin goes far. There's no point using a token-savings tool without being organized: the session stretches and burns tokens for nothing. NUXS handles that side with a radar that guides the customer through their session; full automation of this layer is on the roadmap.
And there's a detail no organization solves. However much you teach the model and give it context, it will always check and recheck the paths it needs to take. That only changes once agent memory evolves, and until then it stays expensive if the size of the data it consumes isn't optimized.
POINT 07
Not every execution needs a frontier model.
The output is one of the biggest bottlenecks on the bill, and it's what compression can't reach, because you can't compress output. The only way out is to use a cheaper model. And here's the catch: most tasks don't require the top model. A lot is trivial, and a cheaper model handles it solidly when well guided. Architecture work can't be left to the cheap models, because they make too many mistakes and pile up tokens in rework. But with good planning and direction, a low-tier model executes a specific task with surprising precision. In my usage, routing sends most trivial executions to the low tier and reserves the high tier for architecture and correction; the net output savings vary by task profile.
Routing with a hierarchy protocol is what opens up bigger savings. In test rounds with Claude models, GPT, and others, the play between high and low tier worked: the protocols were obeyed and savings varied by task, with an extra push from the capsules and Squeeze when active together.
That's how Economy mode was born, in two variants, with a smart protocol and a simple on/off switch. One that worked well: delegate a coding task to a low tier, and when it got an execution wrong, reassign it to the high tier on the spot. It avoids rework, and the cost relative to the error margin ends up positive for the user.
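The escalation variant described above (try the low tier, reassign to the high tier on a failed execution) reduces to a few lines. The callables `low_tier`, `high_tier`, and `accept` are stand-ins invented for this sketch; a real acceptance check might be a test run or a lint pass.

```python
from typing import Callable, Tuple

def run_with_escalation(task: str,
                        low_tier: Callable[[str], str],
                        high_tier: Callable[[str], str],
                        accept: Callable[[str], bool]) -> Tuple[str, str]:
    """Return (tier_used, result), escalating once if the cheap attempt fails."""
    result = low_tier(task)
    if accept(result):
        return "low", result
    # Reassign on the spot instead of letting the low tier retry and pile up rework.
    return "high", high_tier(task)

# Toy models: the cheap one only handles renames correctly.
cheap = lambda task: "done" if "rename" in task else "error"
strong = lambda task: "done"
passed = lambda result: result == "done"

print(run_with_escalation("rename variable", cheap, strong, passed))   # ('low', 'done')
print(run_with_escalation("refactor module", cheap, strong, passed))   # ('high', 'done')
```

The design choice is the single escalation step: the failed low-tier attempt is paid for once, and the cost of that miss has to stay below what the low tier saves on the tasks it gets right.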
POINT 08
The 1% in the title: I compressed up to 99.9% and saved 1% on the bill.
This is the point that gives the study its name, and I need to explain where it came from. I did the compression the best way I could. Some capsules hit 99.9% margin, technically everything was lined up. When I went to measure effective margin on the bill, it came out at 1%. It wasn't a measurement error and it wasn't accumulated junk. It was structural: the capsules were compressing perfectly what they knew how to recognize, but what they knew how to recognize was a small slice of what passed through the bill. The rest of the bill was a pile of things I couldn't cover, that I couldn't map properly.
From there it got hard. I started with 8 capsules and they didn't come close to what I imagined. I went up to 17 and still didn't have decent coverage. I had to widen what each capsule could intercept, and in some cases even lower their compression margin to reach more data types. Even then it wasn't good enough for my use. That's when I felt forced to map and intercept every piece of data passing through the capsules. Only then did the reach grow. That path led to Squeeze.
What I want to make clear: I kept the capsules for the specialty of each one, because being such specialists, they compress perfectly when the data is their type. What wasn't solved was coverage of the whole. And it was coverage that separated the 99.9% technical from the 1% real on the bill. That's the distance the whole study measures.
To close with point 06: there are two different causes dragging effective margin down. One is structural, which is insufficient coverage, and it's solved with mapping and Squeeze. That's what this point digs into. The other is behavioral, which is accumulated session junk, and it's solved with organization. That's what point 06 covers. Both exist. The 1% in the title is the structural one.
The "I compress X%" story the market sells isn't quite like that. The data is denser and more varied than it looks. How much you compress matters less than how much of the traffic you manage to intercept.
§ 04 How the system behaves in real use
The team that uses the system every day is multidisciplinary (engineering, design, product, growth). Each profile stresses the system differently, so it was a good real-use lab.
For anyone on the API, paying token by token, the savings dropped straight onto the bill. For anyone on a subscription, useful time went up: more turns per session before the window tightens, and the session takes longer to degrade.
The team's organization, on its own, already cut consumption. The protocols were followed. Economy added on top of that. Basically, it's four things. Organization to avoid waste, capsules as specialized compression, Squeeze as input coverage, and Economy routing the output. The work only closes with them together. None of them alone delivers the result.
The Jevons Paradox applied to tokens
Here's where the Jevons Paradox comes in (William Stanley Jevons, English economist, 1865). His observation: when steam engines became more efficient with coal, England's total coal consumption rose instead of falling. The efficiency lowered the price of steam, the lower price enabled more uses, and the new uses burned more coal than the efficiency was saving.
I saw exactly this on my own team. The more we saved per call, the more fit inside the same budget. The spending ceiling didn't change, but the work that fit under it multiplied. It's plain math: cheaper execution, more execution for the same check. Where we used to run 100 meters on the budget, we started running 500. That's the economic unit NUXS delivers: not a smaller bill, but far more output for the same check.
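The "100 meters to 500" image matches the arithmetic of a fixed budget. As a rough sketch, assuming the effective margin applies to the whole cost of a call (in the study it is measured on the input side only, so real multipliers depend on the input/output mix), the capacity multiplier is the reciprocal of what is still paid:

```python
def capacity_multiplier(effective_margin: float) -> float:
    """How many times more work fits in a fixed budget at a given saving."""
    if not 0 <= effective_margin < 1:
        raise ValueError("effective_margin must be in [0, 1)")
    return 1 / (1 - effective_margin)

print(round(capacity_multiplier(0.808), 2))  # 5.21
print(capacity_multiplier(0.5))              # 2.0
```

At 80.8% saved, each call costs 19.2% of the original, so the same check covers a little over five times the calls.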
That's why NUXS doesn't play in the "AI cost-cutting" market as a reduction story. The proposal is to multiply capacity inside the budget the customer already decided to spend.
The system ran on Claude Code, Codex, Gemini, Cursor, and other agents. And the type of use matters. Because the capsules are specialists, assembling varied systems got simpler. One example: an AI that shops for the customer; with the image multimodal, the operating cost fell sharply. A recorded meeting with an optimized summary: the Meeting capsule delivers it without high cost. Research and summary of a long video: the Video capsule already handles it at a professional level. New branches are emerging, but the focus is directed. Much of what we built can be used to cut cost across several projects, and that stays available to anyone in open-core form.
§ 05 Summary of what I found
The question that opened the study was how much of this actually comes off my bill. The answer: a little over 80% of everything that passes through input. In practice, each call costs a fraction of what it would without the system, and that multiplies the work that fits in the same budget. That's the real gain. What backs the number is below.
The total audited is 1,026,804,861 tokens, across two tracks. The capsules processed 626.8M and closed at 91.62% margin (saved ÷ intercepted) on the fixtures benchmark. Squeeze processed 400M and delivered 80.8% effective margin (saved ÷ total), the real saving on the bill, with 81.5% coverage on the code profile. Different bases, reported separately, never mixed.
| Metric | Value | Basis |
| Engine margin (capsules) | 91.62% | Saved ÷ intercepted. Auditable, raws kept. |
| Effective margin (Squeeze) | 80.8% | Saved ÷ total. Real saving on the bill, with cache compound effect. |
| Coverage (code profile) | 81.5% | Intercepted ÷ total. Text / RAG clients run structurally higher. |
Effective margin is tokens saved divided by total tokens across calls, and the user's profile dictates the result. Fencing in and squeezing input and output (when possible) changes the economic unit of using AI day to day. Today anyone using NUXS accelerates product delivery on the same AI budget. I left the studies documented and transparent so anyone can keep testing and pick up where I left off. And I keep evolving the savings methods as the market moves.
§ 06 Simple organization protocols
| 01 | 1M context only on dense tasks. For simpler tasks, use smaller contexts. |
| 02 | Sanitize completed tasks. Accumulating junk distorts your numbers. |
| 03 | Central planning session. Split into epics and tasks per epic, and open new sessions for each task. |
| 04 | Use the capsules. The most direct route to optimizing token spend. |
| 05 | Intelligent router on simple tasks. Reserve high-intelligence models for what actually demands them. |
| 06 | Don't delegate delicate tasks to weaker models. Avoids heavy rework. Cheap can come out expensive. |
The protocol optimizes the user's savings, and it depends on human intervention. The rest of the technology operates between its lines.
§ 07 Test and measure it yourself
The invitation is this: you don't have to believe it, you can measure it. The system adapts to each person's style, and the dashboard separates "saved" from "passed raw by design," so you audit your own coverage instead of trusting mine.
I left it free to use. You can use it directly or test the compression in the playground.
It's at nuxs.ai, with a free tier that includes the 11 algorithmic capsules, running locally or via MCP (the installer handles it). The raw artifacts for each benchmark are at nuxs.ai/benchmark, with the path to the GitHub repository.
There's a public audit running on the site right now, tallying millions of tokens of real usage from people on the product. The next frontier is widening coverage by usage profile, with a date to be set with whoever wants to help on the next studies.
The benchmark artifacts are auditable: sha256 per sample, harness, and raw logs are in the repo, so you can run it yourself. The 5 algorithmic capsules are open under AGPL. The engine (Squeeze, Economy) is closed, so you audit the results, not the code. The free tier lets you measure your own coverage and check whether the numbers hold up in your use.
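Checking a per-sample sha256 needs nothing beyond the standard library. The sketch below assumes you already have a sample's raw bytes and its published hex digest; the record layout in the repository is not described here, so `verify_sample` is a hypothetical helper rather than part of the published harness.

```python
import hashlib

def verify_sample(raw: bytes, published_sha256: str) -> bool:
    """True if the raw sample matches the digest published alongside it."""
    return hashlib.sha256(raw).hexdigest() == published_sha256.strip().lower()

# Self-check with a well-known value: the SHA-256 of empty input.
EMPTY_SHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
print(verify_sample(b"", EMPTY_SHA256))        # True
print(verify_sample(b"tampered", EMPTY_SHA256))  # False
```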
§ 08 Total audited and per-capsule distribution
1,026,804,861 audited · capsules + Squeeze
The study's headline, 1,026,804,861 audited tokens, is the sum of two independent tracks: 626,784,439 from the five cumulative capsule rounds (June 2 to 11) and 400,020,422 from the Squeeze run (June 20). Below is the per-capsule distribution from the most recent round (200M, v0.5.33, TEXT profile), followed by the cumulative totals of each track and the total audited. The per-capsule margins hold up on the CODE profile, which inverts the traffic weights (codebase, diff, and stack come to dominate).
| Capsule / Track | Class | Tokens processed | Tokens saved | Margin / Effective |
| rag | llm | 20,010,072 | 18,089,105 | 90.4% |
| log | algorithmic | 16,063,645 | 15,870,881 | 98.8% |
| pdf | llm | 12,023,183 | 11,542,256 | 96.0% |
| threads | llm | 10,001,241 | 9,161,137 | 91.6% |
| events | llm | 9,070,336 | 9,024,984 | 99.5% |
| prompt | algorithmic | 8,012,934 | 7,996,908 | 99.8% |
| api | algorithmic | 7,195,617 | 7,152,443 | 99.4% |
| sql | llm | 5,010,920 | 4,970,833 | 99.2% |
| network | algorithmic | 3,563,305 | 3,538,362 | 99.3% |
| stack | llm | 2,004,372 | 1,813,957 | 90.5% |
| schema | algorithmic | 2,000,200 | 1,540,154 | 77.0% |
| diff | algorithmic | 1,537,461 | 1,452,901 | 94.5% |
| codebase | algorithmic | 1,513,733 | 1,477,403 | 97.6% |
| build | algorithmic | 1,013,132 | 972,607 | 96.0% |
| test | algorithmic | 1,013,040 | 985,688 | 97.3% |
| apispec | algorithmic | 708,552 | 614,315 | 86.7% |
| image† | algorithmic | 304,934 | 303,104 | 99.4%† |
| Capsules (5 cumulative runs) | | 626,784,439 | 574,252,194 | 91.62% margin |
| Squeeze (400M run) | | 400,020,422 | 323,317,527 | 80.8% effective |
| TOTAL AUDITED | | 1,026,804,861 | 897,569,721 | n/a |
The first 17 rows are the per-capsule distribution of the most recent capsule round (200M run · v0.5.33, TEXT profile, ~101M tokens). The saved column per capsule is derived from processed × margin (the published margins are rounded to one decimal, so the row sum reconciles to within 0.07% of the published TEXT-profile total). The three rows below are the cumulative totals of each track and the audited grand total.
There is no single "total margin." The two tracks use different denominators: capsules report margin (saved ÷ intercepted = 91.62%); Squeeze reports effective margin (saved ÷ total = 80.8%). Mixing the two into a single percentage would mean comparing incompatible bases. That's why they go side by side, never summed.
The api capsule was verified at 99.4% over 200 varied objects (it keeps the schema plus a sample); margins in the 99 to 100% range are characteristic of a deterministic capsule, not rounding.
† The image line (99.4%) is a bytes→tokens estimate with a retroactive caveat. It measured an interception that, on first reads, blinded the agent to the visual content (bug v0.5.7 to v0.5.35, fixed in v0.5.36; see point 04).
The 3 multimodals (image-LLM, meeting, video) are measured in a separate 1M-token phase and don't make up this text/code total.
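The derivation of the saved column (processed × margin) and the track totals can be reproduced from the table itself. The snippet uses three rows and the two track totals copied from above; `derived_saved` is just an illustrative helper name.

```python
def derived_saved(processed: int, margin: float) -> int:
    """Saved tokens as derived in the table: processed x rounded margin."""
    return round(processed * margin)

# (tokens processed, published margin, published saved) for three rows
rows = {
    "rag": (20_010_072, 0.904, 18_089_105),
    "log": (16_063_645, 0.988, 15_870_881),
    "schema": (2_000_200, 0.770, 1_540_154),
}
for name, (processed, margin, published) in rows.items():
    print(name, derived_saved(processed, margin) == published)

# Track totals: kept as two ratios with different denominators, never merged.
capsule_margin = 574_252_194 / 626_784_439      # saved / intercepted
squeeze_effective = 323_317_527 / 400_020_422   # saved / total
print(f"{capsule_margin:.2%} margin, {squeeze_effective:.1%} effective")
print(574_252_194 + 323_317_527)  # 897569721 tokens saved across both tracks
```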
Sources: docs-publicos/benchmarks/. Raw artifacts, harness, and logs at nuxs.ai/benchmark.
Try it · Audit it · Fork it
Don't believe — measure.
Free tier. 11 algorithmic capsules running locally or via MCP. No account required to test the playground. Audit your own coverage and decide for yourself.