Manifesto
Every new technology creates its own unit of value. Years ago we saw this with Bitcoin and the economy that formed around it. Today it is happening again, and this time the unit of value is the token: the currency of the AI era, and the unit in which the intelligence that powers agents is priced and measured. Yet a large share of those tokens is spent on something nobody chose to buy: redundancy.
With every step it takes, an AI agent re-sends its entire context to the model: the whole conversation, the complete log, the document from which only three lines were needed. The model charges for all of this overhead and responds more slowly because of it. The waste is invisible because it is billed in tokens, and tokens do not weigh in the hand. At the end of the month, though, they weigh on the wallet, on the response time, and on the quality of the answer.
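To make the overhead concrete, here is a minimal sketch of how billed input grows when every step re-sends the full history. The function names and the example figures (a 2,000-token starting context, 500 tokens added per step, 20 steps) are assumptions chosen for illustration, not measurements.

```python
def cumulative_input_tokens(base_context: int, tokens_per_step: int, steps: int) -> int:
    """Total input tokens billed when each step re-sends the whole history."""
    total = 0
    context = base_context
    for _ in range(steps):
        total += context            # the full context is sent again at this step
        context += tokens_per_step  # this step's result joins the history
    return total


def unique_input_tokens(base_context: int, tokens_per_step: int, steps: int) -> int:
    """Tokens that were new at least once: the base plus each addition that got sent."""
    return base_context + tokens_per_step * (steps - 1)


# Hypothetical run: 2,000-token start, 500 tokens added per step, 20 steps.
billed = cumulative_input_tokens(2_000, 500, 20)
unique = unique_input_tokens(2_000, 500, 20)
print(billed, unique, round(billed / unique, 1))
```

Under these assumed figures the agent is billed for 135,000 input tokens to convey 11,500 tokens of distinct content, roughly 11.7 times the information actually sent. The billed total grows with the square of the number of steps, while the distinct content grows only linearly.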
We live in an era where intelligence and high productivity are one click away, but the cost of that click is still being measured. That is a new problem, one that belongs to the AI economy.
NUXS was born to control this waste. The mechanism is simple: NUXS compresses context before it reaches the model and returns a smaller, cheaper, and faster input, which makes it a valuable ally for anyone operating AI at scale.
The problem and the origin
Alongside the capability of today's models, one problem recurs across hundreds of thousands of companies around the world. The technology exists and companies hand it to their employees, but adoption is low and the gain in quality or productivity never materializes.
I felt this difficulty firsthand in late 2025. So I had the "brilliant" idea of driving AI adoption and team productivity by building a virtual office with pixel-art characters, where AI agents and humans interacted in the same environment and chatted to complete tasks together. The office had so many integrations that we managed to centralize most of our work in a single place. That is how I built PixelDesk.ai, over two weekends of coding. The office practically taught the team to adopt AI resources in their day-to-day work in a gamified way.
The team embraced the idea and, to my surprise, quickly became dependent on the office's agents. In less than two weeks we accelerated months of work. As people learned to work with the agents, usage surged and productivity rose visibly. Token consumption rose along with it, in proportion to how much value each person learned to extract from the model. The better the team got, the more expensive the operation became. Plans were maxing out before the end of the month; the costs of tokens and APIs exploded. If this has happened to you too, you are in good company.
My first moves were the obvious ones: switch models, cache, trim history. Each adjustment solved one layer and revealed another underneath, until it became clear that the bottleneck was not usage but architecture. The whole log arrived when the pattern would have been enough. The complete conversation arrived when the relevant context would have been enough.
My name is Josué Ramos, CEO at NUXS. Since 2010 I have been developing bot models for financial operations, with experience in Singapore, London, and Estonia. Data compression has been part of that work for years, and once it was clear where the bottleneck sat, I decided to act on it. I built my first capsules, compressors each specialized in one type of context. I started with 8, soon there were 17, and shortly after I moved on to compressing videos, images, and later audio and meetings. The savings became evident in the very first month of recurring use.
But doesn’t compressing make the response worse?
It is the first reaction of almost everyone. Intuition says that squeezing the input is a trade-off: you gain cost savings, you lose quality.
Research shows that, in most cases, this trade-off does not happen. Microsoft Research measured exactly this in two peer-reviewed papers — LLMLingua (Jiang et al., 2023), at EMNLP, and LongLLMLingua (Jiang et al., 2024), at ACL. Evaluating models on prompts compressed 4× to 20×, they found the opposite of intuition: quality holds in most tasks, and in some it improves. On the NaturalQuestions benchmark, compression delivered up to a 21.4% performance gain using about one-quarter of the input.
The reason becomes clear when you think about it. A model's attention is finite. When the input is full of things that do not matter, it spends attention sifting through noise. When the input arrives distilled, it spends attention reasoning. Cleaner input is not worse input — it is often better.
But isn’t the token getting cheaper?
The second common reaction is to say that compressing is yesterday's problem, because tokens get cheaper every year.
The unit price is indeed falling. AI.cc's 2026 API Infrastructure Report, analyzing 2.4 billion API calls between early 2025 and early 2026, recorded a 67% drop in average cost per token. Yet bills moved the other way: in the FinOps Foundation's 2026 State of FinOps survey, 73% of enterprises reported AI costs above their own projections. Price falls in a straight line; volume grows in a curve.
Goldman Sachs Research projects that global token consumption will grow twenty-four-fold by 2030, to 120 quadrillion tokens per month, driven by the adoption of agents. Compressing, in this scenario, does not save cents today — it captures margin over a volume that only grows. Every percentage point of reduction is worth more each year, not less.
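The relationship between a falling unit price and a rising bill is plain arithmetic, sketched below. The 67% price drop is the figure quoted above. The 5x volume growth is a hypothetical value chosen only to show the effect, and the function names are ours.

```python
def break_even_volume_growth(price_drop: float) -> float:
    """Volume multiplier at which the bill is unchanged after a given price drop."""
    return 1.0 / (1.0 - price_drop)


def bill_multiplier(price_drop: float, volume_growth: float) -> float:
    """New bill divided by old bill, given a price drop and a volume multiplier."""
    return (1.0 - price_drop) * volume_growth


PRICE_DROP = 0.67               # figure quoted in the text
ASSUMED_VOLUME_GROWTH = 5.0     # hypothetical, for illustration only

print(round(break_even_volume_growth(PRICE_DROP), 2))
print(round(bill_multiplier(PRICE_DROP, ASSUMED_VOLUME_GROWTH), 2))
```

With a 67% price drop, the bill stays flat only if volume grows by no more than about 3x. Any growth beyond that, such as the assumed 5x, pushes the bill up (here to about 1.65 times the original) even though each token is cheaper.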
The economics, in real numbers
In an audited run through our own production hook — the same binary our team uses daily — NUXS processed 20.2 million tokens and compression removed 92% of them. In money, at top-tier coding-model input pricing, that is roughly $300 of input reduced to about $24. At scale, that is the difference between an AI operation that closes the books and one that bleeds.
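The dollar figures above can be rechecked with a few lines. The token count and the 92% reduction come from the text. The price of $15 per million input tokens is an assumption, picked because it reproduces the "roughly $300" figure; the actual rate used is not stated here.

```python
RAW_TOKENS = 20_200_000          # from the audited run described above
FRACTION_REMOVED = 0.92          # from the audited run described above
PRICE_PER_MILLION_USD = 15.00    # assumption: not stated in the text


def input_cost_usd(tokens: float, price_per_million: float = PRICE_PER_MILLION_USD) -> float:
    """Input cost in dollars for a given number of tokens."""
    return tokens / 1_000_000 * price_per_million


kept_tokens = RAW_TOKENS * (1 - FRACTION_REMOVED)
raw_cost = input_cost_usd(RAW_TOKENS)
compressed_cost = input_cost_usd(kept_tokens)
print(f"raw ${raw_cost:.2f} -> compressed ${compressed_cost:.2f}")
```

At the assumed rate this gives about $303 before compression and about $24 after, in line with the figures quoted.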
And the benefit does not stop at the API. Those working within a token-capped plan — Claude, Cursor, any of them — gain more time inside the same plan before hitting the ceiling. Compression gives money back to those who pay per use, and gives time back to those who pay per plan.
But doesn’t generic compression already solve this?
The third objection is technical, and it is where the heart of the method lives. The state of the art, including Microsoft's work, compresses text as text: it discards the words of lowest statistical weight. It works for prose. It fails for structured data — and an agent's input is, for the most part, structured data.
The most recent applied analysis is direct: for code, SQL, and tables, truncating by token corrupts the structure. In a measurement cited by Paul (2025), a SQL query with a join dropped from 0.63 to 0.37 accuracy under generic truncation, and held steady under compression that preserves structure.
Structured data is not text. A log is a pattern that repeats with small variations. A conversation is a sequence of turns where each one reloads the previous. A stack trace is a stack with the same frames repeated. A schema is a shape, not a sentence. Each format hides its redundancy in a different place — and that is where real compression lives. That is why NUXS does not have one compression function. It has a family of them.
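As an illustration of "a pattern that repeats with small variations", here is a minimal sketch that collapses log lines differing only in their digits. It is not the NUXS log capsule: the function name, the `<n>` placeholder, and the output format are all hypothetical, and a real capsule would have to decide which numeric fields are signal before discarding any.

```python
import re

_DIGITS = re.compile(r"\d+")


def collapse_log(lines):
    """Group lines that differ only in digits; emit one template per group."""
    groups = {}  # template -> [first_example, count]; dicts keep insertion order
    for line in lines:
        template = _DIGITS.sub("<n>", line)
        if template not in groups:
            groups[template] = [line, 0]
        groups[template][1] += 1

    out = []
    for template, (first, count) in groups.items():
        if count == 1:
            out.append(first)  # unique lines pass through untouched
        else:
            out.append(f"{template}  [x{count}, first: {first}]")
    return out


sample = [
    "GET /users/1 200 12ms",
    "GET /users/2 200 15ms",
    "GET /users/3 200 11ms",
    "ERROR db timeout",
]
for row in collapse_log(sample):
    print(row)
```

The three request lines fold into a single template with a count and one concrete example, while the error line, which has no repeated pattern, survives as written.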
What NUXS delivers
There are seventeen text capsules, each specialized in a type of context: conversation, events, session, SQL, stack, PDF, search, log, diff, test, build, prompt, schema, API spec, code, requests, and network. Each capsule carries a decision about what is signal and what is discardable in that format — and it is by carrying that opinion that it compresses where a blind compressor would not.
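One way to picture "a family of compression functions" is a registry keyed by context type. The sketch below is illustrative only and assumes nothing about how NUXS is built: the `capsule` decorator, the `compress` dispatcher, and both toy capsules are hypothetical.

```python
from typing import Callable, Dict

Capsule = Callable[[str], str]
CAPSULES: Dict[str, Capsule] = {}


def capsule(kind: str):
    """Register a function as the compressor for one context type."""
    def register(fn: Capsule) -> Capsule:
        CAPSULES[kind] = fn
        return fn
    return register


@capsule("stack")
def compress_stack(text: str) -> str:
    """Collapse runs of identical consecutive frames, as in a recursive trace."""
    out, repeats = [], 0
    for line in text.splitlines():
        if out and line == out[-1]:
            repeats += 1
            continue
        if repeats:
            out.append(f"  ... repeated x{repeats}")
            repeats = 0
        out.append(line)
    if repeats:
        out.append(f"  ... repeated x{repeats}")
    return "\n".join(out)


@capsule("diff")
def compress_diff(text: str) -> str:
    """Keep headers, hunk markers, and changed lines; drop unchanged context."""
    keep = ("+", "-", "@@", "diff ")
    return "\n".join(l for l in text.splitlines() if l.startswith(keep))


def compress(kind: str, text: str) -> str:
    """Dispatch to the capsule for this type; unknown types pass through unchanged."""
    fn = CAPSULES.get(kind)
    return fn(text) if fn else text
```

Each function encodes a different opinion about what is discardable: repeated frames in a stack, unchanged context in a diff. A single generic function could not hold both opinions at once.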
There are also three multimodal capabilities ready to use: image, video, and meeting. They translate media into dense textual context, so the agent can "see" a screenshot, a recording, or a call without spending vision tokens at every turn.
All of this runs between the agent — Claude Code, Cursor, Codex, Cline, Aider, or any integration via SDK or proxy — and the model, using your own key. The input comes out ready: smaller, denser, cheaper, with no provider switch, no prompt change. There is a free plan for those who work solo, a team plan for teams, and an enterprise plan for those who require isolation and auditing.
The study, and why it is conservative
Over the course of development, we have processed well over six hundred million tokens through capsules — 626.8 million of them audited, with raw data preserved, reproducible, hash-stamped. The current 200-million-token run alone spans 9,333 samples with zero errors. That is the study that accompanies this manifesto, with the entire methodology open for anyone who wants to redo the math.
And here is the point that matters most when reading it: the benchmark was designed to measure the worst case. The early rounds used synthetic data — generated to be difficult, full of unique identifiers and little repetition. The current rounds use real-world "wild" data under digit-level mutation, precisely so that no cache or memorization can inflate the result. The redundancy of reality is far greater than that of any generator.
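For readers unfamiliar with the idea, here is one possible reading of "digit-level mutation": every digit in a sample is replaced with a random digit, so the text keeps its shape but never matches a previously seen input byte for byte. This is our assumption about the technique in general, not the benchmark's actual code, and the function name is hypothetical.

```python
import random

DIGITS = "0123456789"


def mutate_digits(text: str, rng: random.Random) -> str:
    """Replace each ASCII digit with a random digit; leave all other characters alone."""
    return "".join(rng.choice(DIGITS) if ch in DIGITS else ch for ch in text)


rng = random.Random(0)  # seeded so a run can be reproduced
original = "2025-01-03 req=4821 status=200 ok"
mutated = mutate_digits(original, rng)
print(mutated)
```

The structure, length, and every non-digit character are preserved, so a compressor sees the same kind of input while an exact-match cache would see a string it has never encountered.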
The benchmark itself shows this where comparison is possible. The structural capsules climb significantly on real data: the codebase capsule goes from 77.6% on synthetic fixtures to 95.2% on real files; diff from 71% to 94.5% on real git output; schema from 45% on synthetic input to 77% on real SQL. And the audited run through our production hook came in at 92%. The aggregate number of the study is therefore a floor — not a ceiling. In production, compression tends to run higher, not lower.
The rigor of publishing the floor, not the peak, is not a weakness. It is the only way we trust to make efficiency claims in a market where every claim deserves skepticism. Every compression records input, output, ratio, time, and capsule type. Every metric reconstructs from the raw data. Nothing is a black box.
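A minimal sketch of what such a record could look like, and of how an aggregate ratio can be rebuilt from raw counts. The field names and the `CompressionRecord` class are hypothetical. Here the ratio is derived from the stored token counts rather than stored separately, which is one way to keep every metric reconstructible.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CompressionRecord:
    capsule: str          # capsule type, e.g. "log" or "schema"
    input_tokens: int
    output_tokens: int
    elapsed_ms: float

    @property
    def ratio(self) -> float:
        """Fraction of input tokens removed, derived from the raw counts."""
        if self.input_tokens == 0:
            return 0.0
        return 1 - self.output_tokens / self.input_tokens


def aggregate_ratio(records: Iterable[CompressionRecord]) -> float:
    """Token-weighted reduction across all records."""
    records = list(records)
    total_in = sum(r.input_tokens for r in records)
    total_out = sum(r.output_tokens for r in records)
    return 0.0 if total_in == 0 else 1 - total_out / total_in


records = [
    CompressionRecord("log", 1000, 100, 3.2),
    CompressionRecord("schema", 200, 100, 1.0),
]
print(round(aggregate_ratio(records), 4))
```

Note the design choice: the aggregate is weighted by tokens, not averaged per record. In this toy data the per-record ratios are 0.9 and 0.5, whose simple mean is 0.7, while the token-weighted figure is about 0.83, because the larger input dominates.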
Where compression meets its limit
There is a physical ceiling on dense data. A schema rarely crosses the high seventies, because every column name is signal. A PDF dense with figures keeps every number by contract — no compression may alter a value, and we prefer the number intact, even when that caps the ratio (our current measurement on a real technical PDF sits at 96%, after fixing a parser defect the protocol itself caught). And on very small input, compressing is not worth it: the product returns the input intact and records this in telemetry. We prefer honest passthrough to inflating a number.
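The two guardrails described here, passthrough on small input and "no compression may alter a value", can be sketched as a wrapper around any compressor. Everything below is hypothetical: the 200-character threshold, the status strings, and the number-matching pattern are placeholders, not the product's real values.

```python
import re
from collections import Counter
from typing import Callable, Tuple

_NUMBER = re.compile(r"\d+(?:\.\d+)?")
MIN_CHARS = 200  # hypothetical threshold below which compressing is not worth it


def safe_compress(text: str, compress_fn: Callable[[str], str],
                  min_chars: int = MIN_CHARS) -> Tuple[str, str]:
    """Return (output, status); fall back to the input whenever a guardrail trips."""
    if len(text) < min_chars:
        return text, "passthrough_small"

    candidate = compress_fn(text)

    # Every number in the input must appear in the output, the same number of times.
    if Counter(_NUMBER.findall(candidate)) != Counter(_NUMBER.findall(text)):
        return text, "passthrough_number_mismatch"

    if len(candidate) >= len(text):
        return text, "passthrough_no_gain"

    return candidate, "compressed"


doc = "Revenue was 12.5 million in 2024. " + "filler " * 50
good = lambda t: t.replace("filler ", "")
bad = lambda t: t.replace("12.5", "12").replace("filler ", "")

print(safe_compress("short", good)[1])
print(safe_compress(doc, good)[1])
print(safe_compress(doc, bad)[1])
```

The status string is what would go to telemetry: a passthrough is recorded as a passthrough rather than being counted as a zero-gain compression or silently dropped from the statistics.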
What we believe
A model's intelligence does not depend only on its size — it depends on the quality of what you place before it. Models will grow, context windows will stretch, and none of that changes the fact that redundant context costs money, time, and precision. Compressing before sending is not a savings trick; it is giving the model what it actually uses, and paying for nothing more.
This work will evolve. Capsules will improve, will fail in unforeseen cases, will be redone. Our commitment is to keep measuring and publishing — what works, what does not, and what is not yet known.
If you operate AI at scale and feel the bill rising and the response slowing as context grows, NUXS is worth trying. The free plan is an excellent starting point, and the rest is open to criticism.
References
- Jiang, H. et al. (2023). LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. EMNLP 2023. arxiv.org/abs/2310.05736
- Jiang, H. et al. (2024). LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression. ACL 2024. arxiv.org/abs/2310.06839
- Paul, K. (2025). Prompt Compression Techniques: Reducing Context Window Costs While Improving LLM Performance.
- AI.cc (2026). 2026 AI API Infrastructure Report — 2.4 billion API calls across 8,000+ developers and enterprises, Jan–Apr 2026.
- FinOps Foundation (2026). State of FinOps 2026. data.finops.org
- Goldman Sachs Research (2026). AI Agents Forecast to Boost Tech Cash Flow as Usage Soars.
Built by people who live inside the agents — not by people who only talk about them.