TL;DR
- Agent Mode is a loop around Chat Mode. Each iteration of think → act → observe is a full Chat-mode call
- One agent turn ≈ 5–50 Chat-mode calls, plus tool calls, plus scratchpad, plus a stopping criterion
- Tools, Skills, and MCP are three abstractions for the same thing: giving the model hands
- The hard problem is not the loop — it is state, context window pressure, context compaction, tool auth, and stopping
- Agent Mode governance is categorically different from Chat Mode governance: non-determinism, tool authorization, end-to-end audit, and kill switches must be designed in, not bolted on
When Cursor writes a pull request for your engineer, here is what you actually pay for.
The model does not write the PR in one pass. It reads a file. It thinks. It searches the codebase. It reads another file. It thinks again. It edits. It runs your tests. Two tests fail. It reads the test file. It thinks about why. It edits the fixture. It runs tests again. All green. It writes the PR body. It commits.
That entire sequence is one agent turn. Each of those steps — the think, the act, the observe — is a full Chat-mode call through all fourteen infrastructure layers from Part 2. Plus a tool call with auth. Plus a logging event. Plus state that has to persist across iterations.
This is Mode 2. The loop is the machine.
The loop
Agent Mode is built around a single primitive — the agentic loop — that looks like this:
- Think. The model reads the current state and decides what to do next. Full Chat-mode call.
- Act. The model emits a tool call. The runtime executes it outside the model.
- Observe. The runtime feeds the result back into the context.
- Repeat until a stopping criterion is hit.
Agent Mode: Think → Act → Observe → Repeat
Every iteration is a full Chat-mode call. Tools are how the model gets hands. State is how it keeps its place.
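The loop above fits in a dozen lines. A minimal, runnable sketch in Python — `call_model`, the message shapes, and the tool registry are stand-ins for a real model API and real tools, not any vendor's SDK:

```python
def run_agent(goal, call_model, tools, max_iterations=20):
    """Minimal think -> act -> observe loop. The runtime, not the
    model, owns the registry, the context, and the stopping rule."""
    context = [{"role": "user", "content": goal}]
    for _ in range(max_iterations):
        step = call_model(context)                  # think: a full Chat-mode call
        context.append({"role": "assistant", "content": step})
        if step.get("done"):                        # stopping criterion: done
            return step["answer"]
        name, args = step["tool"], step["args"]
        result = tools[name](**args)                # act: runtime executes the tool
        context.append({"role": "tool", "content": result})   # observe
    raise RuntimeError("max iterations reached without convergence")
```

Note what the model never does here: it never executes anything and never decides when the loop ends. It only proposes the next step.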
Code Agent
Cursor-style PR writer. Reads, reasons, edits, runs tests, iterates.
Run trace
- think: Goal: add a null check to formatUser(). Look at callers first.
- act: grep for formatUser
- observe: src/api/user.ts, src/lib/mailer.ts, 2 tests
- act: read src/api/user.ts
- think: formatUser receives a Partial<User>. Callers don't guard.
- act: patch src/lib/format.ts
- act: vitest run
- observe: two failures, both in test/format.spec.ts — stub data needs update
- think: Stub was relying on undefined name. Update fixtures.
- act: patch test/format.spec.ts
- act: vitest run
- observe: green across the module
One agent turn = 3–12 Chat-mode calls. Every think is a full inference pass. Every act is a tool call with auth, logging, and a potential side effect. The model keeps going until it hits a stopping criterion — done, max iterations, confidence gate, cost cap, or a human approval.
What looks simple on paper has serious implications the moment you try to run it in production. The model is not in control of this loop; the runtime is. The runtime decides which tools the model can call, what the tool signatures look like, where the results go, how the context is compacted, and when the loop stops. The model just keeps suggesting the next move.
That is the part most enterprises miss. When your team ships an agent, they are not shipping a model. They are shipping a loop, a tool registry, a context-management policy, and a stopping strategy. The model is the easiest part to replace. The loop is the part you have to own.
Three abstractions, same idea
The industry has converged on three different words for the same underlying capability: giving the model hands.
Tools (OpenAI function calling, Anthropic tool use, Google function declarations) are the raw primitive. You declare a JSON schema for a function — name, description, typed parameters, return shape. The model emits a structured tool call. The runtime validates, executes, logs, and feeds the result back. This is the lowest-level abstraction, and the most portable.
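Concretely, a tool is a declaration plus a dispatch table. A hedged sketch of what the runtime holds — the shape follows OpenAI-style function calling, but the exact field names vary by vendor, and `search_tickets` is a hypothetical tool:

```python
import json

# Declaration the model sees: name, description, typed parameters.
search_tickets_schema = {
    "name": "search_tickets",
    "description": "Search the ticketing system by keyword.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

# Implementation the runtime executes; the model never runs this directly.
def search_tickets(query: str) -> str:
    return json.dumps([{"id": 101, "title": f"match for {query!r}"}])

REGISTRY = {"search_tickets": (search_tickets_schema, search_tickets)}

def execute(tool_call: dict) -> str:
    """Validate the model's structured call against the registry, then run it."""
    schema, fn = REGISTRY[tool_call["name"]]
    required = schema["parameters"]["required"]
    missing = [k for k in required if k not in tool_call["arguments"]]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return fn(**tool_call["arguments"])
```

The validation step is the point: the model emits text shaped like a call, and the runtime decides whether it becomes a real one.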
Skills (Claude Skills, Cursor Skills, and increasingly others) are a higher-level packaging: a bundle of a prompt, a tool registry, example invocations, and sometimes code scaffolding, all addressed by name. Skills are routing abstractions as much as they are capability abstractions — "when the user asks about X, load skill Y and its tools." Skills reduce the cognitive load of designing tools from first principles and reduce the context bloat of having every tool defined on every call.
MCP — the Model Context Protocol — is the wire format that connects a model runtime to an external server exposing tools and resources. MCP is how Claude Code talks to your GitHub. How Cursor talks to your filesystem. How any cowork product reaches into any enterprise system. MCP is tool use as a network protocol. It is also the connective tissue that makes Cowork Mode possible (more on that in Part 5).
None of these are architecturally different from "give the model a function it can call." They differ in who owns the tool definition, where it is hosted, how it is authenticated, and how it ships. Tools are the building block. Skills are the packaging. MCP is the transport. The loop runs the same way underneath all three.
Context pressure and compaction
The hard part of running an agent loop is not the loop. It is the context window.
Every iteration adds to the context:
- The previous think (the model's reasoning).
- The previous act (the tool call).
- The previous observe (the tool result).
- Sometimes a scratchpad summary.
After five iterations of a well-behaved agent, the context can easily cross 30,000 tokens. After fifteen iterations, 100,000. After thirty, you are at or past the limits of the model you are using, and you are paying the price of prefill on every single subsequent call because the context is changing on every iteration. (Prompt caching, the 2026 saver from Chat Mode, works beautifully on a stable prefix and much less well on an agent loop.)
The loop handles this with compaction — a periodic summarization step that collapses older iterations into a compressed form and replaces them in the context. Cursor compacts. Claude Code compacts. Every serious agent framework compacts. Done well, compaction preserves the important state — what you decided, what you learned, what you still need to do — and drops the noise. Done poorly, it hallucinates away the decision trail, and the agent loses the thread.
Compaction is the single most underappreciated mechanic in agent design. It is also, not coincidentally, one of the five layers of the context compilation framework: an agent loop is a context compiler that runs on every iteration.
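Compaction, in skeletal form: keep the newest iterations verbatim, collapse everything older into one summary message. In production the summarizer is itself a model call; here it is a stub so the shape is visible — a sketch, not any framework's implementation:

```python
def compact(context, keep_recent=4, summarize=None):
    """Collapse older loop events into a single summary message,
    keeping the most recent ones verbatim. `summarize` is a model
    call in production; any list-of-messages -> string callable works."""
    if len(context) <= keep_recent:
        return context
    old, recent = context[:-keep_recent], context[-keep_recent:]
    if summarize is None:
        summarize = lambda msgs: f"[compacted {len(msgs)} earlier events]"
    summary = {"role": "system", "content": summarize(old)}
    return [summary] + recent
```

Everything the article says about good versus bad compaction lives inside `summarize`: a summarizer that preserves decisions and open questions keeps the agent on the thread; one that drops them loses it.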
State
Even with compaction, the loop needs somewhere to put things.
The scratchpad is the model's own notes, usually a reserved region of the context that it can write to freely. The model decides what to remember across iterations. On a code-agent run, the scratchpad holds the plan, the files touched, the hypotheses tried.
Short-term memory is the application's record of the current run — tool calls, results, decisions — separate from the context. This is what your audit system reads. This is what enables replay.
Long-term memory is anything the agent carries from session to session. Vector stores, graph stores, MemoryOS-style governed context. On a single agent turn, this feels like retrieval. Across many agent turns, it feels like the agent is learning your organization. In Cowork Mode (Part 5) it becomes the defining property of the machine.
The difference between a mediocre agent and a good one, most of the time, is not the model and not the tools. It is how state is organized: scratchpad structure, short-term-memory schema, long-term-memory policies. The model is the engine. State is the transmission.
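One way to keep the three kinds of state from blurring together is to give each its own structure. A hedged sketch — the field names are illustrative, not any framework's API:

```python
from dataclasses import dataclass, field

@dataclass
class RunState:
    run_id: str
    # Scratchpad: the model's own notes, serialized into the context.
    scratchpad: dict = field(default_factory=lambda: {
        "plan": [], "files_touched": [], "hypotheses": []})
    # Short-term memory: the application's record of this run, kept
    # outside the context. This is what audit and replay read.
    events: list = field(default_factory=list)

    def record(self, kind: str, payload: dict):
        self.events.append({"run_id": self.run_id, "kind": kind, **payload})

# Long-term memory lives elsewhere (vector store, graph store) and is
# retrieved into the context per turn rather than carried in RunState.
```

The separation is the design choice: the scratchpad belongs to the model, the event list belongs to the application, and neither should be the other's source of truth.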
Stopping
A loop that runs forever is a loop that bankrupts you.
Every production agent has at least one stopping criterion, and usually several. The five that matter:
- Done. The model signals that it has completed the goal. The happy path.
- Max iterations. A hard ceiling on loop depth — typically 20 to 100 for a serious code agent. If the agent has not converged by then, it has gone off the rails.
- Confidence gate. A second model (or a deterministic check) evaluates each step and breaks the loop when confidence drops below a threshold.
- Human-in-the-loop. The agent pauses and requests approval — typically before a write, a destructive action, or anything that crosses a trust boundary.
- Cost cap. Token or dollar budget. When the meter hits the ceiling, the loop halts.
If you build an agent and you have not implemented all five, you have a demo, not a production system. This is not a style preference; it is how you prevent an occasional bad run from becoming an unbounded one.
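The five criteria compose naturally into a single gate the runtime evaluates on every iteration. A sketch, with illustrative field names and thresholds:

```python
def should_stop(step, state, config):
    """Evaluate all five stopping criteria; return a reason or None.
    `step` is the model's latest output, `state` the run so far."""
    if step.get("done"):
        return "done"                          # happy path
    if state["iterations"] >= config["max_iterations"]:
        return "max_iterations"                # hard ceiling on loop depth
    if step.get("confidence", 1.0) < config["min_confidence"]:
        return "confidence_gate"               # second-model or heuristic check
    if step.get("tool") in config["hitl_tools"]:
        return "human_approval_required"       # pause before a trust boundary
    if state["cost_usd"] >= config["cost_cap_usd"]:
        return "cost_cap"                      # budget meter hit the ceiling
    return None
```

A useful property of writing it this way: the gate is pure and testable in isolation, so each criterion can be exercised without running a model at all.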
Numbers
The economics of Agent Mode are what make it the mode most enterprises mis-budget.
A single agent turn is, at its best, five iterations — five Chat-mode calls. At its worst, fifty. Each iteration carries a growing context, so the token cost is not linear in iterations; it is super-linear until compaction kicks in.
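The super-linearity is simple arithmetic: if the context grows by roughly c tokens per iteration, iteration i pays to re-process everything accumulated so far, so total tokens across a turn grow roughly quadratically until compaction caps the context. A toy model with illustrative numbers, not measured ones:

```python
def turn_tokens(n_iterations, base=2000, growth=800, cap=None):
    """Total tokens processed across one agent turn. Each iteration
    re-reads the whole accumulated context; `cap` models compaction
    holding the context flat once it is reached."""
    total = 0
    for i in range(n_iterations):
        ctx = base + growth * i
        if cap is not None:
            ctx = min(ctx, cap)
        total += ctx
    return total
```

Under these toy numbers, tripling the iterations from 5 to 15 more than sextuples the tokens (18,000 to 114,000), which is why a fifty-iteration turn costs far more than ten five-iteration turns — and why compaction bends the curve back toward linear.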
A code-agent turn like the one in the run trace above, running on a frontier model with reasoning on for the think steps:
- Iterations: 6 think steps, 6 act steps, 6 observe steps = 18 loop events
- Tokens: ~14,800 total across the turn, of which maybe 40% is model reasoning and output
- Cost: ~$0.48
- Wall-clock: ~90 seconds
A support-agent turn:
- Iterations: fewer and simpler, but still 10 loop events
- Tokens: ~6,400 total
- Cost: ~$0.21
- Wall-clock: ~40 seconds
Compare to Chat Mode: a single Chat call on the same model is $0.02 and 400 milliseconds. An Agent turn is 10 to 25× the cost and 100 to 200× the wall-clock.
And that is a single turn. A Cursor engineer shipping a real feature is running agent turns back to back for hours. A customer-support queue is running a turn every time a ticket arrives.
Five orders of magnitude is a series thesis. Three to four orders of magnitude is just a normal Tuesday in Agent Mode.
Governance is categorically different
Chat Mode governance is the easy case. One call in. One response out. Log the prompt, log the output, count the tokens.
Agent Mode governance is a different problem class. Three things break the Chat-mode model:
Non-determinism. You cannot replay an agent run and get the same sequence. The model's reasoning is sampled. The tool results may have changed (the database moved on, the ticket was updated, the calendar shifted). Your audit has to record the entire run, not a call, because there is no single call to replay.
Tool authorization. Every tool the agent can reach is a potential write to a real system. Loose tool auth is the single most common enterprise agent mistake I see. If your agent can call delete_user, send_email, modify_ticket, commit_to_main, then the blast radius of a bad run is the union of everything those tools can touch. Scoped credentials, rate-limited tool calls, human-in-the-loop for destructive actions, and strict principle-of-least-privilege tool registries are not optional. They are how you make Agent Mode safe.
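Scoped tool auth belongs in the dispatch layer, not in the prompt. A hedged sketch of a deny-by-default gate — the agent names, tool names, and scopes are illustrative:

```python
class ToolAuthError(Exception):
    pass

# Per-agent grants: which tools, under what constraints.
GRANTS = {
    "support-agent": {
        "read_ticket":   {"rate_per_min": 60},
        "modify_ticket": {"rate_per_min": 10, "requires_approval": True},
        # delete_user is deliberately absent: not granted, not callable.
    },
}

def authorize(agent, tool, approved=False):
    """Deny by default; destructive tools also require a human approval flag."""
    grant = GRANTS.get(agent, {}).get(tool)
    if grant is None:
        raise ToolAuthError(f"{agent} is not granted {tool}")
    if grant.get("requires_approval") and not approved:
        raise ToolAuthError(f"{tool} requires human approval")
    return grant
```

The important property is the default: a tool missing from the grant table is unreachable no matter what the model emits, which is what bounds the blast radius of a bad run.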
End-to-end audit. Chat Mode's log is a row. Agent Mode's log is a directed graph: every think, every act, every observe, every compaction, every stopping event, every tool auth decision, correlated to a single run ID. If your observability cannot produce a clean run graph on demand, you will not be able to answer the one question every audit committee asks: "what did the agent actually do?"
Put those three together and you get the governance answer: Chat Mode needs a log line. Agent Mode needs a run ledger — the primitive I wrote about in the AI Control Plane post. If your enterprise is running agents in production and you do not have a run ledger, you do not have governance. You have hope.
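At minimum, a run ledger is an append-only store of correlated events. A minimal sketch of the record shape — the field names are illustrative, not the schema from the AI Control Plane post:

```python
import time
import uuid

class RunLedger:
    """Append-only record of everything a run did, keyed by run_id.
    Every think, act, observe, compaction, auth decision, and stop
    event becomes one immutable row."""
    def __init__(self):
        self._rows = []

    def append(self, run_id, kind, payload):
        self._rows.append({
            "event_id": str(uuid.uuid4()),
            "run_id": run_id,
            "ts": time.time(),
            "kind": kind,   # think | act | observe | compaction | auth | stop
            "payload": payload,
        })

    def run_graph(self, run_id):
        """Answer the audit question: what did the agent actually do?"""
        return [r for r in self._rows if r["run_id"] == run_id]
```

In production this would be a durable, write-once store rather than a list, but the contract is the same: every event carries its run ID, and the full run is reconstructable on demand.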
What Agent Mode is good for
Agent Mode earns its cost when any of these is true:
- The task has multiple steps and the order is not known in advance. Investigations, triage, multi-file code changes, research that involves synthesis across sources.
- The task needs to touch real systems. Read from a database, write to a ticketing system, modify a file, check a calendar, send an email.
- The task is worth more than the run costs. An engineer hour is worth far more than an agent turn. A customer retained is worth far more than a support agent run. The question is always whether the output is worth the iterations.
What Agent Mode is bad at
- High-volume low-latency work. If you need sub-second response, Chat Mode is your only option.
- Tasks where the same output is required every time. Agents are stochastic. If you need deterministic behavior, the loop is the wrong shape; write the code.
- Anything without a well-defined stopping criterion. "Keep trying until it works" is not a stopping criterion. It is a budget disaster.
- Anywhere the tool registry is not governed. See above.
Enterprise sidebar — what to do on Monday
Three moves to close the governance gap most enterprises have today:
- Inventory every production agent and its tool registry. For each tool, answer: what does it read, what does it write, under which credentials, with what rate limit, logged to where. If any cell in that table is blank, you have found the agent most likely to cause your next incident.
- Put a run ledger in front of every agent. Every run gets a run ID. Every think / act / observe in the run is a record correlated to that ID. Every tool auth decision is a record. Every stopping event is a record. If you only do this for one thing, do it for the agents that can write.
- Define stopping criteria explicitly, per agent. Max iterations, confidence gate, HITL trigger, cost cap. Make them configuration, not code. Review them quarterly. If any agent has "no cost cap" on its config, fix that first.
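"Configuration, not code" can be as simple as one reviewable file per agent, validated at deploy time. A hedged sketch of what such a config might hold — the agent names, keys, and numbers are illustrative:

```python
# One entry per production agent; versioned in git, reviewed quarterly.
STOPPING_CONFIG = {
    "code-agent": {
        "max_iterations": 50,
        "min_confidence": 0.4,
        "hitl_tools": ["commit_to_main", "send_email"],
        "cost_cap_usd": 5.00,
    },
    "support-agent": {
        "max_iterations": 20,
        "min_confidence": 0.5,
        "hitl_tools": ["modify_ticket"],
        "cost_cap_usd": 1.00,
    },
}

def validate(config):
    """Fail the deploy if any agent is missing a criterion; a missing
    cost cap is exactly the gap this check is meant to catch."""
    required = {"max_iterations", "min_confidence", "hitl_tools", "cost_cap_usd"}
    for agent, cfg in config.items():
        missing = required - cfg.keys()
        if missing:
            raise ValueError(f"{agent} missing stopping criteria: {sorted(missing)}")
```

Running the validator in CI turns "review them quarterly" from a calendar reminder into a gate no agent can ship around.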
Agent Mode is where enterprise AI stops being a demo and starts being work. The governance layer is what lets the work compound instead of accumulating risk.
Next up
Deep Research Mode is Agent Mode scaled laterally. A planner decomposes the question. A swarm of agents runs in parallel, each one a full loop. A synthesizer does a long-context reduce step over everything the swarm produced. Three machines pretending to be one — and a cost curve that does not look like anything before it.
Operate. Publish. Teach.