Most AI coding tools send your prompt and the codebase to one model and ask it to do everything at once: figure out which files matter, decide what to change, write the diff, and check its own work. It works in a demo. It falls apart on a real repository.
The problem is not the model. It is the job description.
One prompt, four conflicting jobs
Finding the right files is a search problem. Planning a change is a reasoning problem. Editing is a precision problem. Reviewing is an adversarial problem — you have to assume the edit is wrong and try to break it.
Ask one context window to do all four and each one degrades. The model that just convinced itself the plan is good is the worst possible reviewer of that plan. It already believes.
The split
Vesper Code runs the loop as four specialized agents:
- File Picker scans the repository and decides what is relevant. It does not edit anything. Its only job is to be right about scope.
- Planner takes that context and decides what changes to make and in what order, before a single line is touched.
- Editor makes precise, context-aware edits across the files in the plan.
- Reviewer validates the result for correctness and consistency, with fresh context and an adversarial brief.
Each agent has a narrow job, a clean context window, and no incentive to defend a previous step's decision.
Why fresh context matters
The Reviewer is the part people underestimate. When the same context that wrote the code also reviews it, you get rubber-stamping. A reviewer that did not write the edit, and is explicitly told to find what is wrong, catches the race condition the editor was proud of.
This is the same reason human teams do code review with someone other than the author. The independence is the value.
The tradeoff
More agents means more model calls and more latency than a single shot. We think that is the right trade for code you are going to ship. A fast wrong diff is not faster — it just moves the cost to your review time and your production incidents.
The honest version: for a one-line change, the four-agent loop is overkill, and that is fine. For anything that spans files, the structure is what makes the output trustworthy instead of impressive.
That is the whole bet. Specialized agents, clean context, an adversarial reviewer, every time.