Your AI bill is not the problem. Your workflow is.
April 24, 2026
I spent a weekend last month going through 3 months of Claude usage, one conversation at a time.
At work, I pay per token. In my personal projects, I run on quota. The verdict: I had burned through hundreds of dollars in compute, and most of it went to cached tokens, context the model had already seen. I was paying to re-feed the same context into the same model across sessions, because I had never cleaned up my workflow.
My first reaction was to downgrade the model. Use Sonnet for everything. Find a cheaper provider. But that would not take me far, and it would be too disruptive to my workflow. The better move is to measure your effectiveness ratio (sustained output value divided by tokens, time, and rework) and fix the leaks that number reveals.
And then I remembered that over my last 8 years working as a software engineer, I, or my employer, had paid for SaaS I never fully used.
The seat-based era was the real scam
Every company I worked with in the last 8 years followed the same pattern. Procurement buys seats. Seats sit on the shelf. No one pushes the tool to its limits. Everyone uses it at 10%, but pays for it in full, and no one checks whether that number ever goes up.
I had a Jira license for years. I never learned it past the basics of moving a card across a board. The bill did not push me to learn, because the bill looked the same whether I used the tool well or not. Not to mention MS Teams: it has so many features, and I rarely use it beyond sending messages. Same with most of the productivity software I have ever touched, except VSCode, which is free and genuinely great. WebStorm from JetBrains is expensive, but at least it rewards mastery.
The seat model is doing exactly what it was designed to do. Per-seat SaaS is optimized for predictable vendor revenue, and your skill growth is somebody else's problem. The invoice arrives whether you became a power user or a ghost, so nothing about the invoice teaches you to get better.
Research on behavioral pricing calls this decoupling: when payment is detached from each action, users have weak incentives to optimize how they use a tool or deepen their skill with it. A polite way to say it: the seat model lets you rot for years.
Usage pricing removes the hiding place
Usage-priced AI flips the incentive. Every action has a visible cost. A verbose prompt costs. A bloated context costs. Running Opus to summarize a two-line email costs. Sloppiness isn't free; you pay for it.
The feedback loop SaaS never gave you is now staring at you from a terminal.
Out of quota. Wait until 18:00 for new quota.
The same behavioral research that explains decoupling also covers loss aversion and scarcity. When a resource is visibly limited, people attribute more value to it and make more deliberate trade-offs. Watching a quota tick down or a cost meter climb forces a kind of attention that a flat monthly line item never could. The discomfort is the signal.
Where the money actually leaks
When I broke down my own usage, the leaks were boring and obvious in hindsight:
- I loaded skills, tools, and files I did not need for the task, and the model processed all of it and charged me for all of it.
- I used Opus by default, including for things Sonnet could have handled in half a second for a fraction of the cost.
- I would get a bad answer, tweak one word, and rerun, four or five times in a row. That isn't iteration; it's gambling with money.
- I had MCPs wired in 6 months ago that I never pruned, each one quietly adding to the system prompt of every conversation.
- I let long threads run far past the point where I should have started fresh, dragging stale context along and paying for it every turn.
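If you want to see these token leaks in your own sessions, the per-request usage block is where they show up. Here is a minimal sketch using the Anthropic Python SDK; the call is the standard Messages API, but the model id is just an example and the cache-related field names are worth verifying against the current prompt-caching docs.

```python
# Minimal sketch: print what one request actually charged you for.
# Field names on the usage block follow the Anthropic prompt-caching docs
# at the time of writing; verify against current documentation.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",  # example model id; use the smallest model the task allows
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize this two-line email: ..."}],
)

u = response.usage
print("fresh input tokens:   ", u.input_tokens)
print("output tokens:        ", u.output_tokens)
# The cache fields only show up when prompt caching is in play:
print("cache writes (tokens):", getattr(u, "cache_creation_input_tokens", 0) or 0)
print("cache reads (tokens): ", getattr(u, "cache_read_input_tokens", 0) or 0)
```

A high and growing share of cache reads across sessions is exactly the re-feeding pattern described above: you are paying, at a discount but still paying, to show the model the same context again and again.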
The effectiveness ratio: measuring AI productivity
Here is the formula I use now, and the one I want to put in front of you:
Effectiveness ratio = sustained value of output / (tokens + time + rework)
Sustained value means the output still holds a week later. Not just "this compiled," but "this did not create refactoring debt I will pay off next sprint." Tokens and time are the obvious inputs. Rework is the one thing people forget, and it is often the highest cost.
My rough threshold: below 2 to 1, rethink the workflow before you touch the bill. Two hours of AI-assisted work producing one hour of usable output is a process problem, not a pricing problem.
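To make the formula concrete, here is a minimal sketch of how I tally it, with everything normalized to hours. Converting token spend through an hourly rate is my own convention, and the function name and numbers are purely illustrative.

```python
# A rough way to compute the effectiveness ratio in consistent units (hours).
# Token spend is folded into the denominator via whatever your time is worth.

def effectiveness_ratio(
    usable_output_hours: float,    # hours of work the output is genuinely worth a week later
    session_hours: float,          # wall-clock time spent prompting, reading, steering
    token_cost_usd: float,         # what the session cost in API spend or quota equivalent
    rework_hours: float,           # time later spent fixing, re-prompting, or refactoring
    hourly_rate_usd: float = 80.0  # pick your own number
) -> float:
    denominator = session_hours + rework_hours + token_cost_usd / hourly_rate_usd
    return usable_output_hours / denominator

# Two hours of assisted work, $3 of tokens, one hour of later cleanup,
# producing four hours' worth of output that still holds a week later:
print(round(effectiveness_ratio(4.0, 2.0, 3.0, 1.0), 2))  # ~1.32, below 2 to 1: fix the workflow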
The rework cost is not hypothetical. GitClear analyzed 211 million changed lines of code between 2020 and 2024 and found that code reworked within two weeks of commit grew from 3.1% to 5.7% as AI tools spread. Veracode found AI-generated code introduced security vulnerabilities in 45% of tasks. Speed on the numerator is real. Rework on the denominator is too.
Why Copilot seats are the new Jira license
Copilot is not bad. Any AI tool beats no AI tool. But watch the pattern happening right now across mid-sized engineering orgs:
- The company bought Copilot seats for the whole team last year, at a premium.
- Half the engineers quietly use Claude Code or Cursor on the side, connected to pay-as-you-go platforms like Vertex AI or Azure Foundry, because the workflow is better.
- Budget review comes around. Leadership sees the Copilot spend and is "impressed" by the AI investment.
These orgs are treating a measurement problem as a cost problem. They never had a way to know whether their Copilot seats were producing a healthy effectiveness ratio, because seat pricing hides usage. Now usage-priced tools make that number visible, and the reflex is to suppress the number rather than fix the underlying workflow.
GitHub reports Copilot has 20 million cumulative users, with roughly 90% of the Fortune 100 adopting it. Controlled experiments show 55% faster task completion in narrow benchmarks. And yet a longitudinal study by NAV IT found no statistically significant change in commit-based activity metrics for Copilot users, despite strong self-reported gains. Either the benefit is real but invisible in the metrics tracked, or the benefit is a feeling the tool gives you. Either way, the seat model never surfaced the question.
The feeling of productivity is not the same as productivity
I have been using Claude Code for months. I feel faster. My output feels better. I also do not trust that feeling.
The provider can tweak the model tomorrow and make me feel more productive without making me more productive. The interface can get smoother, the responses shorter, the latency lower, and my subjective sense of progress will climb even if my actual ratio stays flat. It means the only honest answer I have about my own improvement is:
I think I am getting better, and I have no instrument to prove it.
A skill to audit your own ratio
I am putting together a Claude Code skill that runs at the end of a task and audits the session against effectiveness criteria:
- What were the tokens for this task spent on?
- How much of that spend was cached, re-fed, or wasted on loaded-but-unused context?
- Was the model size appropriate for the work done?
- How many iterations did it take to reach an acceptable output?
- Which skills, tools, or MCPs contributed to the context without contributing to the output?
- Where were the easy wins: a prompt that could have been shorter, a session that should have been split?
- Was this session rework of an earlier task, and how much time did it consume?
Over weeks, you see whether your ratio is trending up or trending down.
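The skill itself is not published yet, but the core of it is not complicated. Here is a rough sketch of the per-session summary it produces, assuming you can export per-turn usage records. The token field names mirror the Anthropic usage block, while the session file format and the retry heuristic are invented here for illustration.

```python
# Rough sketch of a per-session audit, not the actual skill.
# Assumes one JSONL file per session, one record per turn, each with a "model",
# a "prompt", and a "usage" dict shaped like the Anthropic API usage block.
import json
from pathlib import Path

def audit_session(path: str) -> dict:
    turns = [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]
    fresh = sum(t["usage"]["input_tokens"] for t in turns)
    cached = sum(t["usage"].get("cache_read_input_tokens", 0) for t in turns)
    output = sum(t["usage"]["output_tokens"] for t in turns)
    models = {t["model"] for t in turns}
    # The "gambling" pattern: consecutive prompts that are nearly identical,
    # i.e. a one-word tweak followed by a rerun.
    retries = sum(
        1 for a, b in zip(turns, turns[1:])
        if abs(len(a["prompt"]) - len(b["prompt"])) < 20 and a["prompt"][:40] == b["prompt"][:40]
    )
    return {
        "turns": len(turns),
        "fresh_input_tokens": fresh,
        "cached_input_tokens": cached,
        "output_tokens": output,
        "cached_share": round(cached / max(fresh + cached, 1), 2),
        "models_used": sorted(models),
        "suspected_retries": retries,
    }

print(audit_session("session.jsonl"))  # hypothetical export of one Claude Code session
```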
The shift I want you to carry
The pricing-aware wave is right to pay attention and wrong about what to do with that attention.
Optimize for ratio. Cheap with a bad ratio is still waste, quieter waste. The value of usage pricing is that it teaches you, every day, where your workflow leaks. That is a gift the seat-based era never offered you. You spent years paying flat fees to tools you never mastered, and nothing about the bill told you.
Your AI bill is speaking. Listen to it before you rush to cancel it.
Frequently Asked Questions
What is the AI effectiveness ratio?
The effectiveness ratio is the sustained value of an AI-assisted output divided by the tokens, time, and rework it took to produce. Sustained value means the output still holds a week later, not just "it ran once." A ratio below 2 to 1 means the workflow needs redesign, not a cheaper model. The frame lets you compare sessions over time and see whether you are actually improving or the tool just feels faster.
Why is seat-based SaaS worse than usage pricing for skill growth?
Per-seat SaaS decouples payment from action. You pay the same whether you use 5% or 95% of the tool, so the invoice never signals which features you are missing or which habits are wasteful. Usage pricing itemizes every action, which is uncomfortable but honest. You see where your prompts bloat, where you reach for the wrong model, where context leaks across sessions. That visibility is a teacher that seat licenses never were.
How do I audit my Claude token spend?
Start with one month of usage. Look at cached vs fresh tokens, model mix (Opus vs Sonnet vs Haiku), session length, and rework cycles. Flag sessions where you re-ran the same prompt 3+ times, where you used Opus for trivial work, or where a long thread kept dragging stale context. The goal is not to cut the bill, it is to find the workflow leaks the bill is pointing at.
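If you already have per-session summaries like the audit sketch above, the monthly pass is just a rollup with a few flags. A minimal sketch, assuming each summary carries an id plus the fields from that earlier sketch; the thresholds are arbitrary starting points, not recommendations.

```python
# Flag the three leak patterns named above: prompt re-runs, Opus on trivial work,
# and long threads dragging stale context. Thresholds are illustrative only.
def flag_sessions(sessions: list[dict]) -> list[str]:
    flags = []
    for s in sessions:
        if s["suspected_retries"] >= 3:
            flags.append(f"{s['id']}: re-ran a near-identical prompt {s['suspected_retries']} times")
        if "opus" in " ".join(s["models_used"]).lower() and s["output_tokens"] < 500:
            flags.append(f"{s['id']}: Opus used for a trivial amount of output")
        if s["turns"] > 40:
            flags.append(f"{s['id']}: long thread, likely dragging stale context")
    return flags
```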
Is Copilot worth it if my team uses Claude on the side?
If half your engineers are paying personal accounts for a better workflow, your Copilot seats are not the bottleneck, your measurement is. Before you cut or cap anything, run an effectiveness audit on both. The team is voting with their own wallets for a reason. Suppressing the signal with a block list or a budget cap makes the workflow problem invisible again.
What is a good effectiveness ratio threshold?
My working threshold is 2 to 1. Below that, the process needs rethinking before the bill does. Above 4 to 1, the workflow is compounding: the AI is saving you real hours of sustained output per hour of input. Track the ratio per task type, not globally. Writing tasks, code tasks, and research tasks all have different baselines.
What are the sources for this metric?
GitClear analyzed 211 million changed lines of code between 2020 and 2024 and found that code being reworked within two weeks of commit grew from 3.1% to 5.7% as AI tools spread. Veracode found AI-generated code introduced security vulnerabilities in 45% of tasks. Synthetic benchmarks put AI-authored PRs at roughly 1.7x more issues than human ones. Speed on the numerator is real. So is rework on the denominator. You can only reason about effectiveness if you watch both sides.
What's the deal with GitHub/Copilot?
GitHub reports Copilot has hit 20 million cumulative users, with roughly 90% of the Fortune 100 adopting it. Controlled experiments show 55% faster task completion in narrow benchmarks. And yet the NAV IT longitudinal study found no statistically significant change in commit-based activity metrics for Copilot users, despite strong self-reported productivity gains. Something is off. Either the benefit is real but invisible in the metrics we track, or the benefit is a feeling the tool gives you.