brianletort.ai
AI Strategy · Enterprise AI · Token Economics · AI Architecture · Infrastructure

The Token Bill Nobody's Ready For

A single power user can generate 10-50 million AI tokens per day. Multiply that across an enterprise, and the math changes everything. Token economics is becoming the defining constraint of enterprise AI.

April 5, 2026 · 17 min read

TL;DR

  • A single power user running AI coding tools, email agents, real-time meeting battlecards, and research assistants can generate 100M+ tokens on a heavy day — with agentic loops pushing toward 1 billion
  • Agentic workflows multiply token consumption 10–50x compared to single-shot prompts because of multi-turn loops, tool calls, and context window stuffing
  • AT&T hit 8 billion tokens/day on a single use case and had to rearchitect their entire stack — that scale is coming for every large enterprise
  • At API pricing, a 10,000-person enterprise could face $50–100M/year in token costs; the math breaks unless architecture changes
  • Token economics is becoming the defining constraint of enterprise AI — not model quality

In my AI-Native Computer series, I made the case that we're standing up a new computer on top of the old one. LLMs are the CPU. Tokens are the bytes. The context window is the RAM.

That framing resonated. What I didn't fully appreciate at the time was what it meant for the bill.

Because bytes have a cost. And when those bytes are tokens — and the system consuming them is an entire enterprise of AI-augmented knowledge workers, autonomous agents, and agentic workflows running continuously — the numbers get very large, very fast.

Gartner projects global AI spending will reach $2.52 trillion in 2026 — a 44% increase over 2025. Stanford HAI reports that corporate AI investment hit $252 billion in 2024, up 44.5% year-over-year. The money is flowing. The question is whether it's flowing toward the right architecture.

The first phase of enterprise AI was about access. Who can call the best model? Who has the API key?

The next phase is about economics. What does this workload cost per useful outcome? And can the organization afford to run it at scale?

That shift changes the architecture. It changes the infrastructure. It changes who wins.

This series is about that shift. And it starts with a question most organizations haven't asked yet:

What happens when every knowledge worker has a personal AI system running continuously?

Where Enterprise AI Stands Today

Before we get to the numbers, it's worth grounding the discussion in where the industry actually is — not the vendor pitches, but the data.

I've synthesized findings across McKinsey's State of AI 2025, Gartner's AI spending forecasts, Deloitte's State of Generative AI 2026, IDC's Agentic AI Survey, Forrester's 2026 Predictions, Stanford HAI's AI Index, and Recon Analytics' enterprise tool adoption study. The picture is striking.

A meta-survey across McKinsey, Gartner, Deloitte, IDC, Forrester, Stanford HAI, and Recon Analytics:

The Adoption Funnel

  • 88%: Organizations using AI (McKinsey 2025)
  • 62%: Experimenting with agents (McKinsey 2025)
  • 60%: Workers with AI tools (Deloitte 2026)
  • 39%: Report EBIT impact (McKinsey 2025)
  • 25%: Pilots moved to production (Deloitte 2026)
  • 12%: Scaled agentic AI (McKinsey 2025)

Enterprise Tool Adoption (Recon Analytics, Jan 2026)

  • ChatGPT: 55.2% (dominant, with modest erosion)
  • Claude: 29% (+61% YoY)
  • Gemini: 15.7% (passed Copilot in Nov '25)
  • Copilot: 11.5% (−39% in 7 months)

When workers have multiple options: 70% choose ChatGPT, 18% Gemini, 8% Copilot.

Global AI Investment Trajectory

  • 2024: $252B (Stanford HAI)
  • 2025: $1.76T (Gartner)
  • 2026: $2.52T (Gartner)
  • 2027: $3.34T (Gartner)

The value gap: $2.52 trillion is flowing into AI in 2026. But only 39% of organizations report bottom-line impact, and only 12% have scaled agentic AI. Per-seat tools cover reactive use cases well — the token-intensive agentic workloads that transform operations require a different architecture entirely.

The headline numbers are impressive: 88% of organizations use AI regularly (McKinsey). 62% are experimenting with AI agents. 60% of workers now have sanctioned AI tools (Deloitte — up from 40% a year ago). 78% of Fortune 500 have active LLM implementation projects.

But look at the other end of the funnel. Only 39% report bottom-line EBIT impact (McKinsey). Only 25% have moved more than 40% of their AI pilots to production (Deloitte). And only 12% have scaled agentic AI across multiple business functions.

Investment is massive. Adoption is broadening. Value capture is lagging.

The tool landscape reflects this tension. Per-seat tools dominate today — ChatGPT holds 55.2% of paid subscriber share, Claude has surged to 29% (up 61% year-over-year), and Gemini passed Copilot in late 2025 to claim the number-two position. Copilot lost 39% of its market position in just seven months, declining from 18.8% to 11.5% — not because it's a bad product, but because workers given options consistently choose tools with stronger reasoning and broader capabilities.

These per-seat tools include generous token allowances for interactive use, and they serve that use case well. ChatGPT Business runs $25–30/user/month (Enterprise is custom-quoted via sales for larger organizations). Claude Teams is $25–30/user/month. Copilot is $30/user/month plus the M365 stack. For a 1,000-person deployment, that's $300K–360K per year in licensing alone — before any API consumption for agentic workloads.

But interactive use is only the first chapter of this story.

The AI Tool Evolution

The enterprise AI tool landscape isn't static — it's evolving through distinct maturity levels. And each level up the continuum multiplies token consumption by an order of magnitude.

Five levels of enterprise AI maturity — each level up multiplies token consumption by 10–100x:

  • L1 Reactive: Single-shot prompts, autocomplete, inline suggestions; user initiates every interaction. ~1K tokens/interaction. (88% of orgs)
  • L2 Augmented: Retrieval-augmented generation, context-aware assistance, document synthesis. ~10–50K tokens/task. (62% experimenting)
  • L3 Orchestrated: Multi-agent routing, specialized model tiers, workflow coordination across systems. ~100K–500K tokens/workflow. (12% scaled)
  • L4 Agentic: Autonomous task completion with tool use, iteration, and self-correction loops. ~1–10M tokens/session. (<6% production)
  • L5 Autonomous: Continuous personal AI systems, 24/7 background agents, proactive intelligence. ~100M–1B+ tokens/day/user. (Emerging)

Where most enterprises are today: McKinsey reports 88% of organizations use AI, but only 12% have scaled agentic deployments. Deloitte finds just 25% have moved pilots to production. The value gap between Level 1–2 and Level 3–5 is where token economics become the defining constraint.

Level 1: Reactive. Single-shot prompts, autocomplete, inline code suggestions. This is where Copilot, ChatGPT chat, and basic Gemini usage live. A typical interaction consumes around 1,000 tokens. The user initiates every action. The AI responds. End of transaction.

Level 2: Augmented. Retrieval-augmented generation, context-aware code assistance, document synthesis. Tools like Cursor, Perplexity, and enterprise RAG pipelines operate here. Token consumption jumps to 10–50K per task because the system is now pulling in context, searching knowledge bases, and synthesizing across documents.

Level 3: Orchestrated. Multi-agent coordination with specialized model routing. This is what AT&T built when they deployed "super agents" coordinating "worker agents" — each request routed to the right model for the task. Token consumption: 100K–500K per workflow, because multiple models are being called in sequence.

Level 4: Agentic. Autonomous task completion with tool use, iteration, and self-correction. Frameworks like NemoClaw (NVIDIA's enterprise agent framework, announced at GTC 2026), Agent Zero, and Hermes by Nous Research operate here. Token consumption: 1–10 million per session. These systems don't just answer questions — they pursue goals, use tools, write and execute code, browse the web, and iterate until the job is done.

Level 5: Autonomous. Continuous personal AI systems running 24/7 — background research, proactive meeting prep, continuous inbox triage, persistent memory across all interactions. Token consumption: 100 million to over 1 billion per day per user. This is less about any single tool and more about a philosophical direction: managing a user's entire digital environment through an AI layer — projecting their digital presence through generative AI that knows their context, priorities, relationships, and work patterns. My MemoryOS is one approach. Others are emerging in the same space: ArgentOS (open-source personal AI OS with persistent memory and 50+ agent tools), OpenClaw (personal AI assistant with 24/7 proactive intelligence), and CookieOS (desktop-native with privacy-by-design). The pattern is the same: a system that continuously observes, indexes, and acts on behalf of the user across all their digital surfaces.

Most enterprises sit at Level 1–2 today. The value — the 39% EBIT impact that McKinsey measures — concentrates at Level 3–5. And each level up multiplies token consumption by 10–100x.

That's not a linear progression. It's an exponential one.
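To make the jump between levels concrete, here is a back-of-envelope sketch using midpoint figures from the ranges above. These midpoints are my own rough interpolation, not survey data:

```python
# Rough per-unit token midpoints for each maturity level, taken from the
# ranges in the figure above (illustrative only). Note the unit shifts
# at L5 from per-session to per-day.
LEVELS = {
    "L1 Reactive":     1_000,        # single-shot prompt
    "L2 Augmented":    30_000,       # RAG task
    "L3 Orchestrated": 300_000,      # multi-agent workflow
    "L4 Agentic":      5_000_000,    # autonomous session
    "L5 Autonomous":   500_000_000,  # continuous daily footprint
}

prev = None
for name, tokens in LEVELS.items():
    multiplier = f"{tokens / prev:,.0f}x" if prev else "--"
    print(f"{name:<16} {tokens:>12,} tokens  ({multiplier} vs. previous level)")
    prev = tokens
```

Each step lands between 10x and 100x its predecessor, which is the exponential curve the section describes.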

What a Power User Actually Consumes

Let me walk through a realistic day for a power user in 2026. Not a hypothetical one — one I've lived, because I run MemoryOS, a personal AI system that continuously collects, indexes, and serves data to my agents from my work life.

My daily toolkit includes AI coding assistants (Cursor, Claude Code), email agents that read and draft responses, meeting prep agents that brief me before every call, research agents that synthesize documents across my knowledge base, and a personal assistant that manages priorities and surfaces what matters.

Here's what a moderate day looks like — and what happens on a heavy day when I'm deep in code, back-to-back in meetings, and running my full agent stack:

Activity                                                         Moderate Day   Heavy Day
AI coding assistant (Cursor/Claude Code)                         500K – 2M      10M – 50M
Email reading, drafting, and summarization agent                 200K – 500K    2M – 10M
Meeting prep + real-time battlecards + post-meeting summaries    500K – 1.5M    20M – 100M
Teams/Slack message processing and triage                        200K – 500K    5M – 20M
Research and analysis (RAG + multi-doc synthesis)                1M – 5M        10M – 50M
Personal assistant (scheduling, planning, priorities)            100K – 300K    1M – 5M
Knowledge retrieval (10+ RAG queries with full context)          500K – 2M      5M – 20M
Background agents (skills, news, tasks, relationships)           –              10M – 50M
Subtotal (single-shot interactions)                              ~3M – 12M      ~63M – 305M

The "moderate day" column is the baseline. Single-shot interactions — ask a question, get an answer. Already millions of tokens per day for one person.

But the "heavy day" column is what happens when every system is running continuously. When your agent is processing every email in your inbox, triaging every Teams message, providing real-time transcription of meetings with prep before and summaries after, updating battlecards mid-conversation as the discussion shifts, running relationship intelligence across every contact, surfacing news and task updates proactively, and managing your entire coding environment through agentic loops that iterate 10, 20, 50 times per session.

The meeting stack alone is transformative — and expensive. A single meeting with real-time transcription, continuous battlecard updates based on RAG over your knowledge base, and a structured post-meeting summary can consume 2–10 million tokens. Eight meetings in a day pushes that single category above 20M. Add prep briefs that pull context from every prior interaction with those attendees, and it climbs further.
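The meeting arithmetic checks out quickly, using the per-meeting range from the text:

```python
# Quick arithmetic on the meeting stack: 2M - 10M tokens per fully
# instrumented meeting (transcription + battlecards + summary).
per_meeting_low, per_meeting_high = 2_000_000, 10_000_000
meetings = 8  # a back-to-back day

low = meetings * per_meeting_low
high = meetings * per_meeting_high
print(f"8 meetings: {low / 1e6:.0f}M - {high / 1e6:.0f}M tokens")
```

Even at the bottom of the range, a fully instrumented meeting day lands well above the 20M figure.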

When I ran this full stack — every email, every Teams message, every meeting, every background agent — I estimated my consumption at over 100 million tokens on a heavy day. With multi-turn agentic loops running autonomously across all of these surfaces, the upper bound approaches a billion.

That number sounds extreme until you do the math. And it's one person.

Agentic Loops Are Token Multipliers

The table above captures direct interactions. What it doesn't capture is what happens when those tools become agentic — when they don't just answer a question but autonomously pursue a goal.

The key concept here is the agentic loop — the architectural pattern that makes modern AI tools fundamentally different from single-shot chat. Every serious AI coding tool, research agent, and autonomous system today is built on some variant of this pattern.

ReAct (Reason and Act) is the foundational loop. Originally proposed by Yao et al., it structures agent behavior as a continuous cycle: Thought → Action → Observation → Updated Thought → Action. The agent reasons about what to do, takes an action (calls a tool, searches a codebase, reads a file), observes the result, and then reasons again. In the original paper's error analysis, hallucination accounted for 0% of ReAct's failures versus 56% for Chain of Thought, because grounding each step in a real observation binds reasoning to reality. But each cycle through the loop re-sends the accumulated conversation, consuming a large slice of the context window every time.

PAR (Plan → Act → Reflect) adds an explicit reflection phase. The agent doesn't just act and observe — it evaluates its own output, identifies errors, and self-corrects before the next iteration. This is the pattern behind tools like Claude Code's agentic architecture, where the model emits a structured tool-use request, your code executes it, and the result flows back into the conversation — looping until the model decides it's done. Claude Code's /loop command can even run this cycle on an interval for autonomous monitoring.

PARC (Plan → Act → Reflect → Coordinate) extends the pattern further with hierarchical multi-agent coordination — a planner decomposes objectives into sub-tasks and distributes them across specialized worker agents, each running its own loop. Recent research demonstrates PARC agents successfully executing 43+ hour autonomous coding sessions.

Cursor's agent architecture implements ReAct with inference-time optimizations, bridging perception-action gaps across multi-file codebases at 250 tokens per second. Claude Code adds subagent parallelism — spinning up specialized agents across frontend, backend, and QA simultaneously.

These aren't theoretical patterns. They're how every major AI tool works today. And they are structurally token-hungry.

Consider a simple agentic workflow: "Research this competitor and draft a briefing."

That single request might trigger:

  • 3–5 search queries across your knowledge base (each injecting 10–30K tokens of context)
  • 2–3 web searches with result parsing
  • A synthesis pass over all retrieved content (50–100K token context window)
  • A draft generation pass (another 30–50K tokens)
  • A self-review and revision loop (2–3 additional LLM calls)
  • A final formatting pass

One request. Ten or more LLM calls. Each consuming a large context window. Total: 500K–2M tokens for a single task.
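Summing that breakdown with midpoint values reproduces the total. The per-call sizes are my rough estimates within the stated ranges:

```python
# Token accounting for the "research a competitor" workflow above.
# Each entry: (LLM calls, approx tokens per call). Per-call sizes are
# midpoint estimates within the ranges given in the text.
workflow = {
    "kb_searches":  (4, 25_000),  # 3-5 queries, 10-30K context each
    "web_searches": (3, 20_000),  # search + result parsing (assumed size)
    "synthesis":    (1, 90_000),  # 50-100K token context window
    "draft":        (1, 45_000),  # 30-50K token generation pass
    "review_loop":  (3, 70_000),  # 2-3 self-review calls (assumed size)
    "formatting":   (1, 15_000),  # final pass (assumed size)
}

calls = sum(n for n, _ in workflow.values())
tokens = sum(n * size for n, size in workflow.values())
print(f"{calls} LLM calls, ~{tokens / 1e6:.2f}M tokens for one request")
```

This lands at the low end of the 500K–2M range; heavier review loops or larger retrieved contexts push it toward the top.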

The research quantifies this precisely. A study of 1,127 real agent runs found that 52% of total spend comes from context re-reads — the quadratic cost curve where step 1 sends 2K tokens but step 10 sends 30K+. Production deployments show agents averaging 11 LLM calls per conversation versus an expected 3, and the cost variance is staggering: the p95/p50 cost ratio averages 18x, meaning the most expensive 5% of agent runs cost 18 times the median.

Now multiply that by the structural realities of agentic AI:

Multi-turn loops compound. An agent solving a software engineering task might loop 10 times, each iteration building on the last. Research confirms that ten cycles can consume 50x the tokens of a single linear pass.

Output tokens are expensive. The median output-to-input cost ratio in 2026 is approximately 4:1, with premium reasoning models reaching 8:1. Agents that generate long outputs — code, reports, analysis — hit this premium hard.

Context windows are not free. A 128K-token context window costs 64x more than an 8K window due to attention scaling. Agentic systems routinely stuff large contexts to maintain state across turns.

Tool calls add overhead. Every tool invocation requires injecting tool definitions, parsing responses, and appending results to context. A 15-tool agent carries thousands of tokens of overhead on every call.

Model routing matters enormously. A task routed to a frontier reasoning model can cost 190x more than the same task handled by an appropriately sized alternative. This is why the model portfolio in Part 2 isn't optional: it's the difference between viable and unsustainable.
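A blended-rate calculation shows why routing dominates the bill. The per-million-token rates below are illustrative placeholders of mine, not quoted vendor prices, chosen to reflect the roughly 200x spread between frontier reasoning models and small models:

```python
# Illustrative blended-cost calculation for model routing.
# Rates are hypothetical $/1M tokens, not quoted vendor pricing.
RATES = {"frontier_reasoning": 60.00, "mid_tier": 3.00, "small": 0.30}

def blended_rate(mix):
    """mix: {model: fraction of token volume}; returns $/1M tokens."""
    return sum(RATES[m] * f for m, f in mix.items())

all_frontier = blended_rate({"frontier_reasoning": 1.0})
routed = blended_rate({"frontier_reasoning": 0.05, "mid_tier": 0.25, "small": 0.70})
print(f"all-frontier: ${all_frontier:.2f}/1M   routed: ${routed:.2f}/1M "
      f"({all_frontier / routed:.0f}x cheaper)")
```

Sending only 5% of volume to the frontier tier still accounts for most of the blended rate — which is the whole argument for routing aggressively.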

The new generation of agentic frameworks makes this concrete. OpenClaw, the open-source foundation behind NVIDIA's NemoClaw enterprise framework (332,000+ GitHub stars as of March 2026), exhibits a 5x token multiplier compared to traditional workflows — with context bloating from 500 to 35,000 tokens within 30 minutes of continuous operation. Hermes 4 by Nous Research trains on 60 billion tokens (50x its predecessor) with reasoning traces up to 16,000 tokens long — each reasoning sample consuming 5x more tokens than non-reasoning interactions. Agent Zero runs autonomous agents in isolated Docker containers, writing and executing code, browsing the web, and installing software — each autonomous cycle consuming thousands of tokens.

When I factor in agentic workflows running autonomously throughout my day — background research, continuous email triage, proactive meeting prep — my actual consumption on a moderate day looks like 10–50 million tokens. On a heavy day with full agent orchestration across every digital surface, it pushes past 100 million tokens — and with deep agentic loops, the upper bound approaches a billion.

That is one person.

The Token Explosion

Tokens per user per day (log scale) — from single-shot chat to enterprise-wide autonomous AI

  • 2023: Single-Shot Chat, ~5K tokens/day
  • 2024: RAG-Augmented, ~100K tokens/day
  • 2025: Agentic Loops, ~2M tokens/day
  • 2026: Autonomous Personal AI, ~30M tokens/day
  • 2027: Enterprise-Wide, ~200M tokens/day

Project That Across an Enterprise

Now take that individual number and project it across a large enterprise.

Enterprise Token Projection — 10,000-person enterprise with meaningful AI adoption:

Tier             Share of Workforce   Headcount   Tokens/User/Day   Daily Total
Power Users      20%                  2,000       20M               40B
Moderate Users   60%                  6,000       2M                12B
Light Users      20%                  2,000       200K              400M

Enterprise Total: ~52B tokens/day (~1.6 trillion tokens/month)

At API pricing: $50–100M per year

A 10,000-person enterprise with meaningful AI adoption might look like this:

  • 20% power users (2,000 people) — developers, analysts, executives with AI-native workflows. Average: 20M tokens/day each. 40 billion tokens/day.
  • 60% moderate users (6,000 people) — knowledge workers using AI for email, search, meeting summaries. Average: 2M tokens/day each. 12 billion tokens/day.
  • 20% light users (2,000 people) — occasional AI interactions. Average: 200K tokens/day each. 400 million tokens/day.
  • Total: ~52 billion tokens per day. ~1.6 trillion tokens per month.
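The projection is easy to reproduce from the tier assumptions above:

```python
# Reproduce the enterprise projection: headcounts and per-user daily
# token rates from the tiers above; blended API rate of $3-5 per 1M tokens.
tiers = {
    "power":    (2_000, 20_000_000),
    "moderate": (6_000,  2_000_000),
    "light":    (2_000,    200_000),
}

daily = sum(count * tokens for count, tokens in tiers.values())
monthly = daily * 30
annual_low = monthly * 12 / 1e6 * 3    # $3 per 1M tokens
annual_high = monthly * 12 / 1e6 * 5   # $5 per 1M tokens

print(f"{daily / 1e9:.1f}B tokens/day, {monthly / 1e12:.2f}T tokens/month")
print(f"${annual_low / 1e6:.0f}M - ${annual_high / 1e6:.0f}M per year at API pricing")
```

This reproduces the ~52B tokens/day, ~1.6T tokens/month, and $50–100M/year figures.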

Does that sound unreasonable? Consider that AT&T hit 8 billion tokens per day on a single application — their Ask AT&T personal assistant — and that was enough to force a complete rearchitecture of their AI stack. After optimization enabled higher throughput, that number scaled to 27 billion tokens per day.

One application. One company. 27 billion tokens daily.

For scale, consider that Google now processes approximately 1.3 quadrillion tokens per month — 5x more than OpenAI and 25x more than Groq. That's where consumption goes when AI is embedded across every high-traffic surface.

Deloitte reports that inference now accounts for 85% of enterprise AI budgets in 2026, up from roughly one-third in 2023. And 85% of enterprises expect to customize autonomous AI agents for their specific business needs (Deloitte State of GenAI 2026). Those custom agents will consume far more tokens than any per-seat chat tool.

Now consider an enterprise running dozens of AI-powered applications, agents embedded in every workflow, and autonomous systems operating 24/7. 52 billion tokens per day is not a ceiling. It's a floor.

At current API pricing — a blended rate of $3–5 per million tokens for mid-tier models — that translates to:

$50–100 million per year in token costs alone.

That number gets a CFO's attention. Fast.

Why the API Model Breaks at Scale

The economics of enterprise AI follow a clear pattern, and there's a threshold where the API consumption model stops making sense.

The Cost Breakeven Spectrum

Where different deployment strategies make economic sense

  • API Access: below ~1B tokens/month
  • Hosted Open-Source: 1–10B tokens/month
  • Self-Hosted: above ~10B tokens/month

Reference rates: API pricing $3–5 per 1M tokens; self-hosted inference $0.40–0.50 per 1M tokens.

Below 1 billion tokens per month, API access is the right answer. It's flexible, requires no infrastructure, and the operational simplicity is worth the per-token premium. This is where most organizations are today — experimenting, piloting, proving value.

Between 1 and 10 billion tokens per month, hosted open-source providers start winning. Services like Together.ai, Groq, and Fireworks offer inference on open-weight models at a fraction of frontier API pricing. DeepSeek's API disrupted the market at $0.55 per million input tokens — 90% below Western competitors.

Above 10 billion tokens per month, self-hosted infrastructure becomes economically compelling. The delta between API costs ($3–5/1M tokens) and self-hosted inference ($0.40–0.50/1M tokens) represents a 7–10x cost difference. At enterprise volume, that difference is measured in tens of millions of dollars annually.
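The breakeven point itself depends on the fixed cost of running your own inference fleet. Here is a simple sketch, using the per-token rates from the text and a monthly fixed cost that is purely an assumption of mine:

```python
# Breakeven between API and self-hosted inference.
# Variable rates come from the text; the fixed monthly infra cost is an
# illustrative assumption (GPU nodes, ops, engineering), not a quote.
API_RATE = 4.00      # $/1M tokens, blended mid-tier API
SELF_RATE = 0.45     # $/1M tokens, self-hosted marginal cost
FIXED_COST = 30_000  # $/month, assumed

def monthly_cost(tokens_m):
    """tokens_m: monthly volume in millions of tokens."""
    api = tokens_m * API_RATE
    self_hosted = FIXED_COST + tokens_m * SELF_RATE
    return api, self_hosted

breakeven_m = FIXED_COST / (API_RATE - SELF_RATE)  # in millions of tokens
print(f"breakeven near {breakeven_m / 1e3:.1f}B tokens/month")

for vol in (1_000, 10_000, 50_000):  # 1B, 10B, 50B tokens, in millions
    api, sh = monthly_cost(vol)
    print(f"{vol / 1e3:>4.0f}B tokens/mo: API ${api:,.0f} vs self-hosted ${sh:,.0f}")
```

With these assumptions the crossover sits near 8.5B tokens/month; heavier fixed costs (larger fleets, redundancy, staffing) push it toward the 10B+ threshold in the chart.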

The good news, as a16z's "LLMflation" analysis shows, is that inference costs are declining roughly 10x per year — a 1,000x reduction over three years. But the paradox is clear: unit costs drop while total spending accelerates, because consumption is growing even faster. Inference costs have fallen 280-fold in two years, yet enterprise AI spending is skyrocketing.
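The paradox is two exponentials racing. If unit cost divides by 10 each year while consumption multiplies by more than 10, total spend still climbs; the 20x growth factor here is an assumption for illustration:

```python
# Falling unit costs vs. faster-growing consumption (illustrative).
unit_cost = 4.0       # $/1M tokens today
volume = 1_000.0      # millions of tokens/month today
COST_DECLINE = 10     # unit cost divides by 10 each year ("LLMflation")
VOLUME_GROWTH = 20    # consumption multiplies 20x/year (assumption)

for year in range(4):
    spend = unit_cost * volume
    print(f"year {year}: ${spend:,.0f}/month")
    unit_cost /= COST_DECLINE
    volume *= VOLUME_GROWTH
```

As long as the growth factor exceeds the decline factor, the bill doubles anyway — cheaper tokens just mean more of them get spent.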

AT&T's response to this reality is instructive. They didn't just switch models — they rebuilt their entire orchestration layer using a multi-agent stack built on LangChain. They deployed "super agents" that route tasks to specialized smaller models, achieving a 90% cost reduction while actually improving latency and enabling 3x higher throughput. Their CDO, Andy Markus, put it plainly: smaller models can be "just about as accurate, if not as accurate, as a large language model on a given domain area."

The question is no longer which model is smartest?

It's what does this workload cost per useful outcome?

The Forcing Function

The token bill is the forcing function. It's what will drive enterprises from "call the API" to "operate the infrastructure."

Not because the API model is wrong. It was exactly right for the first phase — when the priority was access and experimentation.

But at industrial scale, consumption economics become architecture decisions. And architecture decisions become infrastructure decisions. And infrastructure decisions become strategic advantages.

The organizations that recognize this early won't just save money. They'll build something the rest of the market will spend years trying to replicate.

But cutting costs isn't enough. You can't just host any model and call it a strategy. You need the right models for the right tasks — a portfolio of intelligence, not a single point of access.

That's where the enterprise model portfolio comes in. And it's Part 2.

The Token Economy

Part 1 of 3