Claude Code on a 3.85M-line codebase (what we learned about

There is a version of AI-assisted development that works well. Small repos, contained tasks, a few hundred files. The agent finds what it needs, writes something reasonable, and moves on. Most of the demos you see online live in this world.

Then there is the version most engineering teams actually deal with. Millions of lines of code across hundreds of services, years of accumulated conventions, extension points buried three layers deep in an execution pipeline. This is where the gap between what AI agents promise and what they actually deliver becomes visible.

We wanted to measure that gap precisely. So we ran an experiment.

The setup

Same agent (Claude Code with Opus 4.6), same task, same codebase. The codebase was Elasticsearch's open-source repository, 3.85 million lines of Java across 29,000+ files. The task was implementing deterministic terms aggregation using the TPUT algorithm, a multi-phase distributed coordination problem that touches Elasticsearch's entire search pipeline.

One variable changed between the two runs: whether Bito's AI Architect was providing codebase context or not. AI Architect builds a knowledge graph of your entire codebase and delivers that context to coding agents via MCP, so the agent understands your architecture, conventions, and extension points before writing a single line of code.

Everything else stayed constant.

What scale does to a coding agent

On a 3.85-million-line codebase, a coding agent exploring through grep and file reads builds an incomplete picture of the system. It finds what it finds and makes the best architectural decision available with that information.

In our experiment, Claude Code without AI Architect explored Elasticsearch's codebase and concluded that the aggregation framework could not support multi-round shard communication. That conclusion was incorrect. The extension points existed. The patterns for adding new shard-to-coordinator actions were already there in the transport layer. The agent simply never found them.

So it built a workaround. It created a new aggregation type called deterministic_terms that forces shard_size to Integer.MAX_VALUE, pulling every unique term from every shard in a single pass. The agent's own documentation acknowledged this was equivalent to eliminating Phase 2 of TPUT entirely. The result is a brute-force solution that buys correctness at the cost of severe memory risk on high-cardinality fields.
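
To make the trade-off concrete, here is a minimal in-memory sketch of a single-pass merge. It is not code from either pull request and does not use Elasticsearch's classes: shards are modeled as plain term-to-count maps, and ShardSizeSketch, top, and mergeTopTerms are hypothetical names chosen for illustration. With a small per-shard limit, a term that ranks second on every shard never reaches the coordinator, even when it is the global winner. Lifting the limit fixes the answer, but only by shipping every term from every shard.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; shards are plain term -> count maps.
final class ShardSizeSketch {
    // Count descending, then term ascending so ties resolve deterministically.
    static final Comparator<Map.Entry<String, Long>> BY_COUNT_DESC =
        Map.Entry.<String, Long>comparingByValue().reversed()
            .thenComparing(Map.Entry.<String, Long>comparingByKey());

    // Returns the top `limit` entries of a term -> count map.
    static List<Map.Entry<String, Long>> top(Map<String, Long> counts, int limit) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(BY_COUNT_DESC);
        return entries.subList(0, Math.min(limit, entries.size()));
    }

    // Single pass: each shard reports only its local top `shardSize` terms,
    // and the coordinator sums whatever it received.
    static List<String> mergeTopTerms(List<Map<String, Long>> shards, int shardSize, int size) {
        Map<String, Long> merged = new HashMap<>();
        for (Map<String, Long> shard : shards) {
            for (Map.Entry<String, Long> e : top(shard, shardSize)) {
                merged.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : top(merged, size)) {
            result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> shards = List.of(
            Map.of("x", 10L, "z", 9L),
            Map.of("y", 10L, "z", 9L));
        // Truncated: "z" (18 in total) is never reported, so "x" wins.
        System.out.println(mergeTopTerms(shards, 1, 1));                 // [x]
        // Unbounded: correct, but every shard ships its full term list.
        System.out.println(mergeTopTerms(shards, Integer.MAX_VALUE, 1)); // [z]
    }
}
```

On a field with millions of distinct terms, the second call is the one that puts coordinator memory at risk, since the merged map grows with the total number of unique terms across all shards.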

6 files changed. 8 unit tests, all single-shard. Zero multi-shard coordination logic.

What changes when the agent understands the system

With AI Architect providing a full knowledge graph of Elasticsearch's architecture, the same agent found the extension points immediately. It understood that a new phase could be inserted between query-reduce and fetch, that FetchSearchPhase.innerRun() was the correct integration point, and that the transport layer already had established patterns for exactly the kind of action it needed to add.

Before writing a single line of production code, it generated a codebase context summary, an architecture design document, and a file-level implementation plan. It then executed the implementation across 12 incremental tasks, compiling after each major change.

The result was genuine multi-phase TPUT: threshold computation, a new AggregationRefinementPhase inserted into the search pipeline, new transport actions, and Phase 3 gap resolution. The agent extended the existing terms aggregation API rather than creating a parallel one, consistent with how a senior Elasticsearch contributor would approach the problem.
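
For readers unfamiliar with TPUT (Three-Phase Uniform Threshold), the three phases can be sketched in memory as follows. This is a simplified illustration of the general algorithm, not the code from the pull request: shards are plain term-to-count maps with non-negative counts, there is no transport layer, and TputSketch and its helper methods are hypothetical names. The point is that each shard only sends terms above a shared threshold, rather than everything it has.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory sketch of TPUT; assumes non-negative counts.
final class TputSketch {
    static final Comparator<Map.Entry<String, Long>> BY_COUNT_DESC =
        Map.Entry.<String, Long>comparingByValue().reversed()
            .thenComparing(Map.Entry.<String, Long>comparingByKey());

    static List<Map.Entry<String, Long>> top(Map<String, Long> counts, int limit) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(BY_COUNT_DESC);
        return entries.subList(0, Math.min(limit, entries.size()));
    }

    // Sums, per term, the counts the coordinator has received so far.
    static Map<String, Long> partialSums(List<Map<String, Long>> reported) {
        Map<String, Long> sums = new HashMap<>();
        for (Map<String, Long> r : reported) {
            r.forEach((term, count) -> sums.merge(term, count, Long::sum));
        }
        return sums;
    }

    // k-th largest partial sum, or 0 when fewer than k terms are known.
    static long kthLargest(Map<String, Long> sums, int k) {
        List<Map.Entry<String, Long>> best = top(sums, k);
        return best.size() < k ? 0L : best.get(k - 1).getValue();
    }

    static List<String> topK(List<Map<String, Long>> shards, int k) {
        int m = shards.size();
        List<Map<String, Long>> reported = new ArrayList<>();

        // Phase 1: each shard reports its local top k; tau1 is a lower bound
        // on the k-th largest global count.
        for (Map<String, Long> shard : shards) {
            Map<String, Long> sent = new HashMap<>();
            for (Map.Entry<String, Long> e : top(shard, k)) {
                sent.put(e.getKey(), e.getValue());
            }
            reported.add(sent);
        }
        long tau1 = kthLargest(partialSums(reported), k);

        // Phase 2: each shard also reports every term with count >= tau1 / m
        // (written as count * m >= tau1 to stay in integer arithmetic).
        for (int i = 0; i < m; i++) {
            Map<String, Long> sent = reported.get(i);
            shards.get(i).forEach((term, count) -> {
                if (count * m >= tau1) {
                    sent.put(term, count);
                }
            });
        }
        Map<String, Long> partial = partialSums(reported);
        long tau2 = kthLargest(partial, k);

        // Prune terms whose upper bound (partial sum plus tau1 / m per silent
        // shard) cannot reach tau2, then Phase 3: fetch exact totals for the rest.
        Map<String, Long> exact = new HashMap<>();
        for (Map.Entry<String, Long> e : partial.entrySet()) {
            String term = e.getKey();
            long missing = reported.stream().filter(r -> !r.containsKey(term)).count();
            if (e.getValue() * m + missing * tau1 >= tau2 * m) {
                long total = 0;
                for (Map<String, Long> shard : shards) {
                    total += shard.getOrDefault(term, 0L);
                }
                exact.put(term, total);
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : top(exact, k)) {
            result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> shards = List.of(
            Map.of("a", 50L, "b", 30L, "c", 5L, "d", 1L),
            Map.of("b", 40L, "c", 35L, "a", 2L, "e", 1L),
            Map.of("c", 45L, "a", 20L, "b", 3L, "f", 2L));
        System.out.println(topK(shards, 2)); // [c, b] (totals: c=85, b=73, a=72)
    }
}
```

In this example a naive merge of each shard's top 2 would see a=70 and b=70 and could not tell them apart correctly. The low-count terms d, e, and f never leave their shards, which is the saving that the brute-force variant gives up.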

27 files changed. 5 test files covering coordinator, transport, service, and phase layers, with proper multi-shard coordination coverage.

What this tells us about AI agents at scale

The model did not change between the two runs. The reasoning capability was identical. What changed was the information available before the first architectural decision was made.

At small scale, an agent's trial-and-error exploration produces enough context to make reasonable decisions. At enterprise scale, the same exploration produces a systematically incomplete picture, and the architectural shortcuts follow directly from that incompleteness.

The teams feeling this most acutely are the ones running coding agents on large, complex codebases and wondering why the output always feels one or two decisions short of what a senior engineer would produce. The answer is rarely the model. It is the context layer.

We published the full experiment with the complete side-by-side comparison, both pull requests, and the planning artifacts AI Architect generated before writing any code.

Read the full experiment: The TPUT implementation Claude Code got wrong and AI Architect got right

AI Architect connects to Claude Code, Cursor, and every major coding agent via MCP. If your team is working on a large codebase and you want to see what it finds, get started at bito.ai.