I get asked how coding agents work — not “which one should I use” (I wrote about that), but what’s actually happening when I type a prompt and code appears. I use Droid, Claude Code, and Codex daily. After a year of this, I have a decent mental model of the internals. Here’s how these things work.

LLM APIs Are Stateless

Every API call to an LLM is independent. The model has no memory between calls. What feels like a conversation is the entire chat history — my messages, the model’s responses, tool results — sent from scratch every time.

When I send my fifth message in a Claude Code session, the agent isn’t continuing a conversation. It makes a fresh API call with my first four messages, the model’s four responses, every file it read, every command it ran, all the results — plus my fifth message. The model processes all of it as if seeing it for the first time.

This matters because sessions degrade. The history grows, hits the context window limit, and the agent has to compress or drop earlier parts.
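The statelessness is easy to see in code. Here’s a sketch, with a stand-in for the real API call (the message shape is illustrative, not any provider’s actual schema):

```python
def call_llm(messages):
    """Stand-in for a stateless LLM API: it only knows what's in `messages`."""
    return {"role": "assistant", "content": f"(reply based on {len(messages)} messages)"}

history = []

def send(user_text):
    # Every turn appends to the history, then replays ALL of it from scratch.
    history.append({"role": "user", "content": user_text})
    reply = call_llm(history)          # the full history goes over the wire each time
    history.append(reply)
    return reply

send("add a login page")
send("now add tests")
# By the second turn the model is re-processing three prior messages
# plus the new one, as if seeing them for the first time.
```

Nothing lives on the server between calls; the “session” is just this growing list.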

The Agent Loop

A regular chat app like ChatGPT works the same way at the API level — full conversation history sent with every call. But a chat app sends one message, gets one response, and waits for the user.

A coding agent doesn’t wait. It loops. When I tell Claude Code to “add a login page”:

  1. The agent sends my message to the LLM, along with the system prompt, conversation history, and a list of available tools (Read file, Write file, Execute command, Grep, etc.)
  2. The LLM responds with a tool call: “I need to read src/routes.ts”
  3. The agent executes that tool call locally — reads the file from disk — and sends the contents back to the LLM
  4. The LLM processes the file, decides it needs another file, makes another tool call
  5. This continues — read, think, write, run tests, read the output, fix errors — until the LLM decides it’s done and responds with text instead of a tool call

A single prompt can trigger many LLM calls. The agent reads files, writes code, runs the dev server, spots an error, fixes it, runs tests. Each call includes the full accumulated history.

That’s what people mean by “agentic” — the LLM decides what to do next, does it, observes the result, and keeps going.
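The steps above reduce to a small loop. A sketch, with a fake model that reads one file and then finishes (a real LLM decides this dynamically):

```python
def fake_llm(messages, tools):
    # Pretend model: ask to read the routes file once, then answer with text.
    assert "Read" in tools              # the model only calls tools it was offered
    already_read = any(m["role"] == "tool" for m in messages)
    if not already_read:
        return {"type": "tool_call", "tool": "Read",
                "args": {"file_path": "src/routes.ts"}}
    return {"type": "text", "content": "Done: added the login route."}

def run_tool(call):
    # Stand-in for executing a tool locally (reading from disk, running a command).
    return f"<contents of {call['args']['file_path']}>"

def agent_loop(user_prompt, tools=("Read", "Write", "Execute")):
    messages = [{"role": "user", "content": user_prompt}]
    while True:                          # loop until the model stops calling tools
        reply = fake_llm(messages, tools)
        if reply["type"] == "text":      # plain text means the model is done
            return reply["content"], messages
        result = run_tool(reply)         # execute the tool call locally
        messages.append({"role": "assistant", "content": str(reply)})
        messages.append({"role": "tool", "content": result})

final, msgs = agent_loop("add a login page")
```

The real loops have streaming, error handling, and approvals layered on, but this is the skeleton.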

What the Wire Protocol Looks Like

The loop I described above happens over a streaming protocol. From building my own agent orchestration tooling, I can see the raw events — structured messages for new text, tool calls, tool results, and turn completion signals flowing back and forth. Claude Code and Droid each have their own bidirectional stream with similar semantics.

I don’t see any of this when I’m using the CLI — it renders as a conversation. Underneath though, it’s a stream of typed events. Each event carries message content, tool call arguments, token usage, and metadata. The agent runtime parses these, executes tool calls, and feeds results back into the stream.
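A sketch of what consuming such a stream looks like. The event names here are invented for illustration; each agent’s actual protocol differs:

```python
import json

# Hypothetical newline-delimited JSON events, loosely modeled on what
# agent streams carry: text, tool calls, and turn-completion signals.
raw_stream = [
    '{"type": "text_delta", "content": "Let me check the routes."}',
    '{"type": "tool_call", "tool": "Read", "args": {"file_path": "src/routes.ts"}}',
    '{"type": "turn_end", "usage": {"input_tokens": 1200, "output_tokens": 85}}',
]

def handle_events(stream):
    transcript, pending_tools, usage = [], [], None
    for line in stream:
        event = json.loads(line)
        if event["type"] == "text_delta":
            transcript.append(event["content"])   # rendered to the user
        elif event["type"] == "tool_call":
            pending_tools.append(event)           # the runtime will execute this
        elif event["type"] == "turn_end":
            usage = event["usage"]                # used to track context consumption
    return transcript, pending_tools, usage

transcript, pending, usage = handle_events(raw_stream)
```

The CLI rendering is just one consumer of these events; my orchestration tooling is another.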

Context Window: The Agent’s Working Memory

Everything lives in one context window: the system prompt, instruction files, conversation history, file contents the agent read, command output, tool results. Agents may selectively include or summarize parts of this — prompt caching, selective context, compaction — but “one big window that fills up” is the right mental model.

Context windows are large — 200K tokens for Claude — but they fill up. A few big files, some test output, a stack trace, and it’s gone fast.

When it fills up, agents handle it differently. From what I’ve seen, Claude Code summarizes older turns into a shorter version, keeping recent context intact. Some agents truncate — drop the oldest messages. Others just end the session.
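The summarize-older-turns strategy can be sketched in a few lines. The token estimate and thresholds are invented; real runtimes use actual tokenizers and tuned budgets:

```python
def estimate_tokens(message):
    return len(message["content"]) // 4       # rough chars-per-token heuristic

def compact(history, budget=100, keep_recent=2):
    """Collapse older turns into a summary once the history exceeds the budget."""
    total = sum(estimate_tokens(m) for m in history)
    if total <= budget:
        return history                        # still fits, nothing to do
    old, recent = history[:-keep_recent], history[-keep_recent:]
    summary = {"role": "system",
               "content": f"[summary of {len(old)} earlier messages]"}
    return [summary] + recent                 # older turns are gone for good

history = [{"role": "user", "content": "x" * 200} for _ in range(5)]
compacted = compact(history)   # summary message plus the two most recent turns
```

Anything in the summarized region is now only as available as the summary makes it, which is exactly where mid-session “forgetting” comes from.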

This is why agents “forget” things mid-session. A file the agent read 30 messages ago might have been compacted out of context. The agent doesn’t know it read it anymore. It sometimes re-reads files for the same reason — though it can also happen because the model decides a fresh read is safer than relying on stale context.

I find that starting a fresh session often works better than continuing a long one. Clean context, no compaction artifacts.

Each LLM call reports token usage — how many tokens went in and came out. I track context window fill percentage per turn in my own tooling. A few file reads and some test output eat through context fast. When I notice the agent’s output getting worse, I start a new session.
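The tracking itself is simple arithmetic over the reported usage. The 200K window matches Claude’s; the per-turn numbers below are invented to show the shape of the growth:

```python
CONTEXT_WINDOW = 200_000   # tokens; Claude's published window size

def fill_percent(usage):
    used = usage["input_tokens"] + usage["output_tokens"]
    return round(100 * used / CONTEXT_WINDOW, 1)

# Hypothetical per-turn usage reports over one session:
turns = [
    {"input_tokens": 12_000, "output_tokens": 800},     # prompt + instruction files
    {"input_tokens": 58_000, "output_tokens": 1_500},   # after reading large files
    {"input_tokens": 140_000, "output_tokens": 2_000},  # test output + stack traces
]
fills = [fill_percent(t) for t in turns]
```

Because input tokens include the entire replayed history, the fill percentage only climbs; watching it climb is a decent proxy for when to start a fresh session.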

Session Resume

The API is stateless, but agents can persist sessions. Claude Code and Droid save the conversation history locally. When I resume a session, the agent reconstructs context from saved messages — the LLM gets the history replayed as if it happened in the current call.

Resume isn’t free. A long saved session still has the same context window pressure. But it lets me pick up where I left off without re-explaining the task.
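Persistence is the unglamorous part: write the message list to disk, read it back. A sketch (the file layout is invented; each agent has its own format):

```python
import json, os, tempfile

def save_session(path, messages):
    with open(path, "w") as f:
        json.dump(messages, f)

def resume_session(path):
    with open(path) as f:
        return json.load(f)     # history replayed into the next API call

session = [{"role": "user", "content": "add a login page"},
           {"role": "assistant", "content": "Reading src/routes.ts..."}]
path = os.path.join(tempfile.gettempdir(), "agent-session.json")
save_session(path, session)
restored = resume_session(path)  # identical history, identical context pressure
```

The resumed session is byte-for-byte the same history, which is why it carries the same context window cost it had when I left it.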

Tools: How the Agent Touches Your Code

The LLM doesn’t have access to my file system. It can’t see my codebase. It reads and writes code through tools — functions the agent runtime provides and executes on the LLM’s behalf.

A typical terminal agent has tools like Read (file contents), Write/Edit (create or modify files), Execute (shell commands — tests, builds, git), Grep/Glob (search the codebase), and LS (list directories). These are defined as structured schemas and passed alongside the API request — separate from the system prompt, but the LLM sees them and knows what tools exist, what arguments they take, and what they return.

When the LLM decides it needs to read a file, it emits a structured tool call — basically JSON:

{"tool": "Read", "file_path": "src/app.ts"}

The agent runtime intercepts this, reads the actual file, and sends the contents back.
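That interception step is a dispatch table: map the tool name to a local function, run it, wrap the result as a message. A sketch, with the file read stubbed out:

```python
def read_file(file_path):
    # Stand-in for actually reading from disk.
    return f"<contents of {file_path}>"

TOOLS = {"Read": read_file}   # real agents register Write, Execute, Grep, etc.

def dispatch(tool_call):
    """Execute a structured tool call locally and package the result."""
    handler = TOOLS[tool_call["tool"]]
    args = {k: v for k, v in tool_call.items() if k != "tool"}
    result = handler(**args)
    return {"role": "tool", "content": result}   # fed into the next LLM call

msg = dispatch({"tool": "Read", "file_path": "src/app.ts"})
```

The LLM never touches the disk; it only ever sees what comes back in that tool message.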

There’s no hard-coded decision tree. The model figures out “I should read the existing routes before adding a new one” from patterns in its training data and the current conversation. This is also why agents miss things — they only know what they’ve read. If there’s a relevant file they didn’t think to look at, they don’t know it exists. Good AGENTS.md files help because they tell the agent where things are, so it doesn’t have to discover everything from scratch.

Approvals

Not every tool call gets executed automatically. Agents have a permission layer between the LLM’s request and the actual execution. Claude Code auto-approves file edits by default but requires approval for shell commands — I have to explicitly opt into less restrictive modes. Droid uses autonomy levels that control what gets auto-approved based on risk. Codex has its own sandbox and approval model.

This is what the “allow/deny” prompts are. The LLM requested a tool call, the runtime evaluated the risk, and it’s asking me before executing.
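A permission layer can be sketched as a risk check that sits between the tool call and the dispatch. The rules below are invented for illustration; real agents use far richer policies:

```python
# Hypothetical risk rules: command prefixes that should always prompt.
RISKY_PREFIXES = ("rm ", "git push", "curl ")

def evaluate(tool_call, auto_approve_edits=True):
    """Return 'auto' to execute immediately or 'ask' to show an approval prompt."""
    if tool_call["tool"] in ("Write", "Edit"):
        return "auto" if auto_approve_edits else "ask"
    if tool_call["tool"] == "Execute":
        cmd = tool_call["command"]
        return "ask" if cmd.startswith(RISKY_PREFIXES) else "auto"
    return "auto"   # Read, Grep, LS: harmless, always allowed

decision = evaluate({"tool": "Execute", "command": "rm -rf node_modules"})
```

Autonomy levels and “full-auto” modes are essentially different defaults for this function’s thresholds.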

System Prompt and AGENTS.md

Every LLM call starts with a system prompt — instructions the agent runtime injects before my message. This is where the agent gets its behavior and its rules; the tool definitions travel alongside it in the same request.

Different agents use different filenames for this — Claude Code reads CLAUDE.md, Codex reads AGENTS.md, Droid reads both. I use AGENTS.md with a CLAUDE.md that just references it so I only maintain one file. The agent reads it at startup and includes it early in the conversation. It persists across the session — the model sees my project-specific instructions as long as they haven’t been compacted out. That’s why these instruction files are the single most effective thing I do for agent quality.

Skills work the same way. Trigger a skill, its prompt gets injected into the conversation. Nothing special mechanically — they’re just prompts that show up at the right time.

The Runtime Does a Lot

The agent isn’t just a wrapper that shuttles messages to the LLM and displays results. The runtime — the code that runs locally on my machine — has its own logic that the LLM never sees.

Compaction is one example. When the context window fills up, the runtime decides when to summarize, what to keep, and what to drop.

Sub-agents are another. The runtime can delegate subtasks — like file searches — to separate instances, possibly using cheaper models for work that doesn’t need a frontier model. I don’t see this happening from the CLI; the main agent just gets the results.

Permission gating is runtime logic too. The LLM asks to run rm -rf node_modules, the runtime evaluates the risk, and either executes it or shows me an approval prompt. The LLM has no idea its request was intercepted.

Modes are runtime decisions too. Most agents have an interactive mode where the agent pauses for approval, and a headless mode where it runs unattended — useful for CI, background tasks, or when I want the agent to just go. Claude Code has -p for single-prompt headless runs. Codex has full-auto mode. Droid has autonomy levels. All runtime decisions about how much human-in-the-loop to require.

Terminal Agents vs IDE Agents

Both run the same loop — send messages to the LLM, execute tool calls, feed results back. The difference is what tools they have and what context gets injected.

Terminal agents — Claude Code, Codex, Droid — give the LLM file system tools and shell access. The agent discovers the codebase by reading files and running commands, the same way I would if I SSH’d into a server with a new project. It has no idea what editor I’m using. It figures things out by exploring.

IDE agents — Cursor, Windsurf, Cline — run the same LLM loop, but inject additional context from the editor: open files, cursor position, selected text, language server diagnostics (type errors, lint warnings). Their tools include editor-specific actions like applying diffs inline or showing suggestions at the cursor.

IDE agents start with richer context about what I’m doing right now. They know I’m looking at line 42 of auth.ts with a type error. A terminal agent would have to discover that by reading the file and running the type checker.

The tradeoff is coupling. IDE agents are tied to their editor — that context comes from VS Code’s APIs. Terminal agents are editor-independent, which is why they’re more composable. I can script them, run them in CI, pipe data through them.

Tab Completion Is a Different Thing

Cursor’s tab completion and GitHub Copilot’s inline suggestions aren’t the agent loop. They use a specialized model optimized for speed, running on every keystroke to predict what I’ll type next. Low latency, no tool calls. It’s autocomplete, not an agent.

Cursor’s agentic features (Composer, Agent mode) use the full loop I described above. The tab completion runs alongside it as a separate system.

Sub-Agents

Some agents can spawn other agents. Droid has a Task tool — I can have the main agent delegate a subtask to a separate instance that runs independently and reports back. Amp has “deep mode” for something similar.

Same loop, just nested. The parent agent makes a tool call that spins up a child with its own context window. The child runs, does its work, returns a summary. The parent sees only that summary, not the full child conversation.

This is how agents handle work that’s too big for one context window — break it into pieces, run them in parallel.
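The nesting can be sketched directly. The “Task” delegation here is illustrative; the key property is that only the child’s summary crosses back into the parent’s context:

```python
def child_agent(subtask):
    """Runs with its own fresh context window; returns only a summary."""
    child_context = [{"role": "user", "content": subtask}]
    # ...the child would run its own full agent loop here, growing child_context...
    child_context.append({"role": "assistant", "content": "found 3 matches"})
    return "Summary: found 3 matches for the search."   # only this escapes

def parent_turn(task):
    parent_context = [{"role": "user", "content": task}]
    # The parent's tool call spins up a child with an independent context.
    summary = child_agent("search the codebase for login handlers")
    parent_context.append({"role": "tool", "content": summary})
    return parent_context   # the parent never sees the child's full transcript

ctx = parent_turn("add a login page")
```

The child’s reads and intermediate reasoning cost tokens only in its own window, which is what makes the divide-and-conquer work.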

Once I started thinking of the agent as a loop around a stateless API, its behavior made sense.