TL;DR
- The LLM at scale is not a model — it is an operating system, and like any OS it runs in distinct modes
- There are four operating modes in production today: Chat, Agent, Deep Research, Cowork — same 14 infrastructure layers underneath, radically different composition on top
- A single prompt costs $0.02 in Chat mode, $0.40 in Agent mode, $40 in Deep Research mode, and can run to $400 in Cowork mode — a 20,000x spread, more than four orders of magnitude on the same silicon
- The enterprise decision is no longer which model. It is which mode, for which task, at what rate, under which governance
A friend of mine who runs AI at a peer enterprise called me last week. He walked me through one week of his team's usage and the story stopped me cold.
A product manager on his team asked ChatGPT a question last Tuesday: "summarize the Q4 board deck and flag anything material." She got an answer in four hundred milliseconds. It cost roughly two cents.
That afternoon a director on the same team asked the same question, verbatim, through their enterprise deep-research agent connected to the same board deck and the same model family. The answer came back seven minutes later, with forty-two citations, cross-referenced to the prior three quarters. It cost forty dollars.
By Friday, an engineer in their finance organization had Cursor cowriting an analysis of the same deck against their data warehouse. The session lasted three hours. The model made one thousand four hundred tool calls. The token bill was four hundred dollars.
Same company. Same deck. Same base model. Same GPUs in the same data centers. The outputs were all useful. The costs spanned more than four orders of magnitude.
None of that is an accident.
The model used to be the product. Now the mode is.
For the first few years of the GPT era, the mental model was simple: you call a model with a prompt, the model returns text. The only interesting variables were which model and how long the context window was. That mental model is still what most enterprises use to budget, govern, and architect AI today.
That mental model is obsolete.
What we are calling "an LLM" in 2026 is not a model. It is an operating system. A very expensive, partially understood operating system that we are renting in slices. And like any operating system, it runs in distinct modes. Not one. Four.
If you have been reading The AI-Native Computer, this will be familiar territory. I argued there that a new computer is quietly being stood up on top of the old one: the LLM is the CPU, tokens are the new bytes, the context window is the RAM, and our data platforms become the disk. That framing holds. What I did not spell out then is that on top of this new computer we have now built an operating system, and the operating system has modes.
Traditional computers have user mode and kernel mode. The LLM OS has four.
The four modes
Here is the claim:
When you hit enter in ChatGPT, Claude, Cursor, or any of the products that wrap a frontier model, you are not running one machine. You are running one of these four:
- Chat mode. A single call. The prompt goes in, fourteen infrastructure layers fire in sequence, the answer streams back. About four hundred milliseconds. One token accounting event.
- Agent mode. A loop. The model thinks, calls a tool, observes the result, thinks again. Each iteration is a full Chat-mode call. A typical turn runs five to fifty calls, plus a tool registry, plus scratchpad state, plus a kill switch. Thirty seconds to five minutes.
- Deep research mode. A planner, a swarm of parallel searchers, and a synthesizer. The planner decomposes the question. The swarm fans out — each member is a full agent running its own loop. The synthesizer pulls the results into a long-context reduce step. Five to fifteen minutes. Hundreds of thousands of tokens.
- Cowork mode. Claude Code, Cursor, Operator, Codex, ChatGPT Projects. The model stops being a call and starts being a coworker: persistent memory, a skills library, a knowledge base of project files, and direct access to the environment — terminal, browser, screen. Session-long state. The cost curve is not per-call; it is per-session and it compounds.
The modes share everything underneath. Same GPUs. Same tokenizer. Same model router. Same safety classifiers. Same fourteen infrastructure layers that every frontier API call passes through. The difference is entirely in the composition on top.
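The composition difference shows up even in toy form. Here is a minimal Python sketch of Chat versus Agent control flow; `llm()` is a stand-in for one full pass through the stack, not any vendor's API, and the tool registry is a plain dict:

```python
def llm(prompt: str) -> dict:
    """Stand-in for one Chat-mode call: prompt in, structured answer out.
    A real implementation would call a provider SDK here."""
    return {"text": f"answer to: {prompt}", "tool": None}

def chat(prompt: str) -> str:
    # Chat mode: exactly one call, one token-accounting event.
    return llm(prompt)["text"]

def agent(prompt: str, tools: dict, max_iters: int = 50) -> str:
    # Agent mode: each iteration is a full Chat-mode call plus tool I/O.
    scratchpad = [prompt]
    for _ in range(max_iters):               # the kill switch
        step = llm("\n".join(scratchpad))
        if step["tool"] is None:             # the model decided it is done
            return step["text"]
        observation = tools[step["tool"]](step["text"])
        scratchpad.append(str(observation))  # the context grows every turn
    return "stopped: iteration budget exhausted"
```

The point of the sketch is the shape: `agent` is just `chat` in a loop, with the scratchpad, and therefore the bill, growing on every pass.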
One prompt. Four machines.
Words do not carry this point. The live demo does.
One Prompt. Four Modes.
Same infrastructure underneath; more than four orders of magnitude across the top.

Prompt: "Analyze Q4 performance across the portfolio and recommend the three planning priorities for 2027."

- Chat Mode: one call. 14 layers, one pass.
- Agent Mode: a loop. 12 iterations, tools, state.
- Deep Research: a swarm. 48 searches, 1 synthesis.
- Cowork Mode: a coworker. Session-long state.

Same GPUs. Same 14 infrastructure layers. The question for the enterprise is no longer which model, but which mode, for which task, at what rate, under which governance.
The same prompt through four modes produces four completely different shapes of infrastructure consumption. One call. A loop of calls. A swarm of loops. A session of swarms. Each one is a legitimate answer to a legitimate question. Each one is priced as if it were running on a different piece of hardware, because functionally it is — even when the silicon is identical.
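The "swarm of loops" shape can be sketched the same way. In this toy, `plan` stands in for the planner call, `search` for an entire sub-agent loop, and the final join for the long-context reduce; all names and the fan-out width are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def plan(question: str) -> list:
    # Planner: one model call that decomposes the question into sub-queries.
    return [f"{question} :: angle {i}" for i in range(4)]

def search(sub_query: str) -> str:
    # Swarm member: in a real system this is a full agent loop of its own.
    return f"findings for [{sub_query}]"

def deep_research(question: str) -> str:
    sub_queries = plan(question)
    with ThreadPoolExecutor() as pool:       # fan the swarm out in parallel
        findings = list(pool.map(search, sub_queries))
    # Synthesizer: one long-context reduce over everything the swarm returned.
    return "SYNTHESIS:\n" + "\n".join(findings)
```

Three machines pretending to be one: the planner, the swarm, and the reduce are each a distinct cost and audit surface, even though the caller sees a single request.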
Four orders of magnitude
The number that should stop every CFO cold is the cost spread.
A single Chat-mode call at the prompt size most knowledge workers use runs two to five cents. Call it $0.02 as a round number.
An Agent-mode turn — the kind your engineers invoke every time Cursor writes a PR, or your ops team invokes every time an agent triages a ticket — runs about $0.40. That is a 20x markup for the same underlying inference, because you are paying for five to fifty Chat-mode calls, plus tool calls, plus the context window expanding on every iteration.
A Deep Research run at the depth OpenAI, Anthropic, and Google are shipping in 2026 — the kind that returns a forty-citation brief — runs about $40. A 2,000x markup over Chat, because the planner spawns dozens of agents and each one is its own loop, and then the synthesizer does a long-context reduce over everything the swarm produced.
A Cowork session — three hours of a paid engineer cooking with Claude Code or Cursor, writing and reviewing and iterating — can land at $400 without anybody doing anything unusual. A 20,000x markup over Chat. The model is running continuously, the context window is expanding session-long, skills are firing, files are being read and written, and the environment is being inspected on every turn.
$0.02 to $400. A 20,000x spread, more than four orders of magnitude. Same silicon.
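The spread is plain arithmetic over the rough multipliers quoted above; these are the round numbers from this section, not vendor pricing:

```python
import math

CHAT = 0.02                   # one Chat-mode call, the round number above
AGENT = 20 * CHAT             # ~20 Chat-equivalent calls per agent turn
DEEP = 100 * AGENT            # on the order of 100 agent loops plus a reduce
COWORK = 1000 * AGENT         # a multi-hour session of continuous turns

spread = COWORK / CHAT        # the 20,000x quoted above
decades = math.log10(spread)  # just over four orders of magnitude
```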
If your AI budget line item is built on the assumption that "a query costs X," and X was calibrated on Chat-mode calls in 2024, you are under-budgeted by a factor that depends entirely on what your people actually do. In 2026 they are not doing Chat. The best ones are living in Cowork.
Four orders of magnitude of governance
The cost spread gets the attention, but the governance spread is the harder problem.
Chat mode is the easy case. Stateless. Auditable. The prompt is logged, the response is logged, the token count is known. If it leaks data it leaks one document. Your existing DLP and logging pipeline handles it the same way it handles an email.
Agent mode changes the audit surface in three ways. First, it calls tools, which means it can write — to a database, to an API, to a ticketing system — and every one of those writes is a potential policy violation if the tool auth is loose. Second, it is non-deterministic by construction; you cannot replay the run and get the same sequence. Third, it can ask for approval mid-loop, which means human-in-the-loop is not a feature you add later, it is a design decision you make before the first iteration.
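One way to make that design decision concrete: gate every write-capable action behind an approval callback that is wired in before the loop starts. This sketch is illustrative; the action dicts and callback shape are invented for the example:

```python
def run_agent(actions, approve, max_iters=50):
    """Drain proposed actions; writes pause for approval, everything is logged."""
    log = []                                     # the audit trail
    for i, action in enumerate(actions):
        if i >= max_iters:                       # the kill switch
            break
        if action["writes"] and not approve(action["desc"]):
            log.append(("denied", action["desc"]))
            continue                             # skip the write, keep the loop
        log.append(("executed", action["desc"]))
    return log

proposed = [
    {"writes": False, "desc": "read the ticket"},   # reads pass through
    {"writes": True,  "desc": "close the ticket"},  # writes need a human
]
audit = run_agent(proposed, approve=lambda desc: False)
```

The reads execute, the write is denied, and both land in the log. Bolting approval on after the loop exists means retrofitting this gate into every tool; deciding it first means the gate is the loop.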
Deep Research mode introduces a quiet risk most enterprises have not thought about: the swarm. A single deep-research call may spawn twenty, fifty, one hundred sub-agents. Each of them hits external sources. Each of them returns a chunk of content that ends up in the synthesizer's context. If you do not know what those sub-agents retrieved, you do not know what your final brief was actually grounded in. Conversely, deep research produces the richest audit trail in the four modes — every citation is a source claim you can verify, every tool call is logged, every sub-agent run is inspectable. It is the mode with the most governance risk and the most governance opportunity in the same envelope.
Cowork mode is, today, the most dangerous un-governed surface in most enterprises. Engineers are running cowork tools against production code, production data warehouses, production tickets, production secrets. The persistent memory across sessions means what an engineer asked on Monday affects what the model proposes on Friday. The skills library means a shared organizational behavior now exists across people's laptops. The environment access means a single compromised session is a single compromised workstation plus everything reachable from it. If you have not answered the question "who in my organization is running an agent with write access to production today?", you are already in Cowork mode; you just have not governed it yet.
Four modes. Four different audit surfaces. Four different data-exfiltration risks. Four different human-in-the-loop requirements.
You do not get to govern "AI" at the enterprise. You govern the modes.
The enterprise decision is no longer which model
For three years the conversation with our board, our executives, and our partners has been some variant of "which model." Which foundation model do we standardize on. Which vendor do we commit to. Which provider has the best reasoning.
Those questions still have answers and the answers still matter. But they are no longer the decision that determines what AI costs, what it delivers, and what it exposes.
The decision is now:
- Which mode is appropriate for this workflow?
- For which task inside that workflow?
- At what rate — how often, how many tokens, how many sessions?
- Under which governance — which audit, which approvals, which data boundaries?
Chat for the FAQ. Agent for the triage. Deep Research for the competitive brief. Cowork for the engineers. And underneath all of it, a single control plane that knows which mode is running, what it costs, what policies apply, and what evidence was produced.
That is not a model question. That is an operating model question. And it is why every organization I work with is rebuilding its AI governance from the modes up, not from the models down.
What the rest of this series is
This is Part 1 of a six-part series. The rest goes mode by mode and then into how you run your own.
- This post. The thesis: frontier AI runs in four modes, not one.
- Chat Mode. Single-shot on shared silicon. What the fourteen infrastructure layers actually do, how reasoning models change the shape without changing the mode, and what Chat is structurally good and bad at.
- Agent Mode. The loop is the machine. The think-act-observe cycle, tools versus skills versus MCP, context compaction, and the economics of an iteration.
- Deep Research Mode. Planner, swarm, synthesizer. Why a deep-research call is three machines pretending to be one, and what that means for citations, cost, and audit.
- Cowork Mode. State is the coworker. Claude Code, Cursor, Operator, Codex, ChatGPT Projects. Persistent memory, skills, knowledge base, environment access, and why this is the most dangerous un-governed surface in the enterprise today.
- Running Your Own LLM OS. The four modes, assembled from Bedrock, Azure, Vertex, OpenRouter, or self-hosted vLLM, against H100, H200, and Blackwell hardware. What changes, what does not, and the reusable enterprise stack — control plane, context compiler, token ledger, skill registry, MCP gateway, eval harness.
Each part will have its own interactive visualization. Each part will end with the enterprise implication.
The connective tissue
If you read the Context Compilation papers, you will see the four modes as something else too: four different compiler passes over the same Context IR. Chat mode is a single lowering pass — prompt to tokens to response. Agent mode is a compiler that iterates on its own output. Deep research is a parallel compiler with a reduce step. Cowork is a compiler that keeps state across invocations. Same IR, different lowering strategies, different runtime budgets, different guarantees.
That framing matters because it tells you where the leverage is. The leverage is not in the model. The leverage is in the compilation layer that decides, for each request, which mode is appropriate, which context to assemble, and which governance to apply. That is the layer an enterprise actually builds. Everything else is rented.
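A toy version of that layer is just a router: coarse task features in, mode plus policy out. Every name and threshold below is invented; the point is that mode selection and governance attach at the same decision:

```python
# Illustrative per-mode policy: budget, write access, approval requirement.
POLICY = {
    "chat":          {"budget_usd": 0.05,  "writes": False, "approval": None},
    "agent":         {"budget_usd": 1.00,  "writes": True,  "approval": "per-tool"},
    "deep_research": {"budget_usd": 60.0,  "writes": False, "approval": None},
    "cowork":        {"budget_usd": 500.0, "writes": True,  "approval": "human"},
}

def route(task: dict) -> tuple:
    """Pick a mode from coarse task features, then attach its policy."""
    if task.get("needs_environment"):         # terminal, repo, warehouse access
        mode = "cowork"
    elif task.get("sources_needed", 0) > 10:  # broad multi-source synthesis
        mode = "deep_research"
    elif task.get("needs_tools"):             # must act, not just answer
        mode = "agent"
    else:
        mode = "chat"
    return mode, POLICY[mode]
```

Because the mode and its policy come back as a pair, there is no request that runs in a mode without a budget, a write boundary, and an approval rule already attached.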
What to do on Monday
Three moves, in order.
- Inventory which modes you are already running. You are running all four. You just have not named them yet. Start with your top twenty AI-enabled workflows. Tag each one as Chat, Agent, Deep Research, or Cowork. The ones you cannot tag are the ones you are not governing.
- Re-cost the portfolio. Run the actual numbers. Most enterprises I see have budgeted for Chat and are paying for Cowork. The gap is typically two to three orders of magnitude and growing.
- Put the audit surfaces on one page. For each mode, answer: what is logged, who can approve, which tools have write access, which memory persists, who owns the kill switch. If any of those answers are "it depends," that is where the next thirty days of governance work goes.
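The second move is a spreadsheet, but it fits in ten lines of Python. The workflows, volumes, and per-run rates below are illustrative stand-ins, using the per-mode figures from this post:

```python
# Hypothetical per-run rates by mode, from the figures earlier in this post.
RATE = {"chat": 0.02, "agent": 0.40, "deep_research": 40.0, "cowork": 400.0}

inventory = [  # invented workflows and monthly volumes, not real data
    {"workflow": "support FAQ",        "mode": "chat",          "runs": 20_000},
    {"workflow": "ticket triage",      "mode": "agent",         "runs": 3_000},
    {"workflow": "competitive briefs", "mode": "deep_research", "runs": 40},
    {"workflow": "engineering cowork", "mode": "cowork",        "runs": 500},
]

def monthly_cost(items):
    # Tag each workflow with a mode, then the cost model is one multiply.
    return {w["workflow"]: w["runs"] * RATE[w["mode"]] for w in items}

costs = monthly_cost(inventory)   # the cowork line dwarfs everything else
```

Even with modest volumes, the cowork line dominates the portfolio by orders of magnitude, which is exactly the budgeted-for-Chat, paying-for-Cowork gap.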
I will come back to each of these in the parts that follow.
In the meantime: the LLM is not a model. It is an operating system. It runs in four modes. Run them on purpose.
Operate. Publish. Teach.
The infrastructure layer under all four modes — the fourteen layers of a single LLM API call — is covered well in Brij Pandey's recent writeup. That post describes the kernel of the LLM OS; this series covers what runs on top of it.