Test-Time Scaling for Coding Agents: When Trajectories Aren't Tokens

For math reasoning, the recipe for scaling test-time compute is well understood: sample K solutions, pick one that passes a verifier, profit. With a reliable verifier, accuracy tends to improve roughly in proportion to log(K) over a wide range of K. The o1 and DeepSeek-R1 results made the broader point concrete: more compute at inference buys better answers.

Coding agents break this story in ways that aren't immediately obvious. The fix requires rethinking what "search" means when your rollout produces a thousand-line diff and a mutated filesystem, not a hundred-token equation.

Why This Matters

Production coding agents are reaching the point where the budget question is real: given 10× the inference compute, how do you spend it? Run one agent and hope? Run ten in parallel and pick the best? Retry sequentially with the previous failure in context? Each strategy has a different cost profile, success rate, and tail-risk behavior. Getting this right can be the difference between a coding agent that costs $0.50 per task and one that costs $50 per task for the same benchmark score. Recent work, including Scaling Test-Time Compute for Agentic Coding from April 2026, attacks this problem directly, and the mechanics deserve a careful look.

Test-Time Scaling: The Reasoning Case

Before agents, let's be precise about how test-time scaling works for reasoning tasks.

A language model's output is a sample from a distribution P(y | x). For hard problems, the mode of this distribution might be correct, but the probability mass at the mode is low, so any single sample is usually wrong. Let p be the probability that one sample is correct. Sample K times independently, and the probability that at least one sample is correct is:

P(at least one correct in K samples) = 1 - (1 - p)^K

For p = 0.1 (10% per-sample accuracy), K = 22 gives about a 90% chance of seeing a correct answer. This is the coverage that Best-of-N sampling (with N = K) relies on.
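As a quick check on those numbers, here is a minimal sketch of the coverage formula and the smallest K that reaches a target. The function names are illustrative, and the formula assumes every sample is independent with the same success probability p.

```python
import math


def coverage(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k


def min_samples(p: float, target: float) -> int:
    """Smallest k with coverage(p, k) >= target, for 0 < p < 1 and 0 < target < 1."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


print(min_samples(0.1, 0.9))        # 22
print(round(coverage(0.1, 22), 3))  # 0.902
```

At K = 21 the coverage is still just under 90%, which is why 22 is the threshold.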

Best-of-N only helps if you can identify the correct sample. The verifier is the bottleneck. For math, ground-truth checking is cheap — does the expression evaluate to the target number? For coding, a test suite plays this role, but it's noisy, gameable, and often incomplete.
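To make "the verifier is the bottleneck" concrete, the sketch below applies Bayes' rule to a verifier that sometimes accepts wrong answers. The acceptance rates are illustrative assumptions, not measurements from any benchmark.

```python
def precision_after_verification(p: float, tpr: float, fpr: float) -> float:
    """P(sample is correct | verifier accepts it), by Bayes' rule.

    p   -- per-sample probability of a correct solution
    tpr -- probability the verifier accepts a correct solution
    fpr -- probability the verifier accepts an incorrect solution
    """
    accepted_correct = p * tpr
    accepted_wrong = (1.0 - p) * fpr
    return accepted_correct / (accepted_correct + accepted_wrong)


# An exact checker (fpr = 0) never accepts a wrong answer.
print(precision_after_verification(0.1, 1.0, 0.0))             # 1.0
# A test suite that passes 5% of wrong patches (illustrative number).
print(round(precision_after_verification(0.1, 1.0, 0.05), 3))  # 0.69
```

With p = 0.1, even a 5% false-pass rate means roughly three in ten accepted solutions are wrong, and sampling more does not change that ratio.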

Beyond Best-of-N, you can do sequential search: take a wrong answer, put it in context as a "prior attempt", and generate again. This is self-refinement. It works modestly on short tasks because the model can see what went wrong.

You can also do tree search: at each intermediate step, score partial solutions, prune bad branches, expand promising ones. This is MCTS or beam search applied to the token sequence.

All of these strategies work because the unit of reasoning is a sequence of tokens, the trajectory to the answer is short (dozens to hundreds of tokens), and the state is fully captured in the context window. You can enumerate plausible paths cheaply.

The Agent Case Is Fundamentally Different

A coding agent doesn't produce tokens as its primary output. It produces a trajectory: a sequence of (action, observation) pairs.

action:      read_file("src/utils.py")
observation: <600 lines of Python>

action:      run_tests("pytest tests/")
observation: FAILED tests/test_utils.py::test_parse_date_edge
             AssertionError: expected 2024-01-01, got 2024-01-00

action:      edit_file("src/utils.py",
                old="day = int(parts[2])",
                new="day = max(1, int(parts[2]))")
observation: success

action:      run_tests("pytest tests/")
observation: 47 passed, 0 failed

A non-trivial coding task generates 50–200 of these pairs before termination. The trajectory itself is thousands of tokens long. The state at each step is the entire filesystem plus the full history of prior actions, and that state is not in the context window. It lives in a running process on a machine.

This creates three problems that standard test-time scaling simply doesn't face.

Problem 1: Raw cost. Running K independent agents on a hard SWE-bench task costs K × (average trajectory cost). At $0.50–$5 per agent attempt on a complex problem, K = 32 is $16–$160 per task. The compute budget has real teeth.

Problem 2: Trajectory diversity. Independent agents don't naturally explore different parts of the search space. Two agents starting from the same problem often converge on the same incorrect hypothesis, then fail at the same point. Naive Best-of-N has terrible sample efficiency when samples are correlated. Sampling at higher temperature helps with diversity but degrades quality per sample.

Problem 3: Wasted experience. When agent 1 discovers that the bug is in src/utils.py line 247, not src/parser.py, that's expensive knowledge. It took 20 steps of file reading and test running to learn. Agent 2, running independently, will rediscover this the hard way. All the reasoning invested in the first attempt is discarded.
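Problem 2 can be made quantitative with a toy model. Assume that with probability s all K rollouts lock onto the same wrong hypothesis and fail together, and that they are otherwise independent. Both the model and the value of s are assumptions for illustration, not measurements.

```python
def coverage_independent(p: float, k: int) -> float:
    """Chance that at least one of k independent rollouts succeeds."""
    return 1.0 - (1.0 - p) ** k


def coverage_shared_failure(p: float, k: int, s: float) -> float:
    """Coverage when, with probability s, all k rollouts fail together.

    The per-sample success rate is held at p, so the independent branch
    uses p / (1 - s). Requires p <= 1 - s.
    """
    p_cond = p / (1.0 - s)
    return (1.0 - s) * (1.0 - (1.0 - p_cond) ** k)


print(round(coverage_independent(0.1, 22), 3))          # 0.902
print(round(coverage_shared_failure(0.1, 22, 0.5), 3))  # 0.496
```

Per-sample accuracy is 10% in both cases, but the shared failure mode caps coverage at 1 - s no matter how large K gets.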

The Experience Representation Problem

The core insight in recent work on test-time scaling for coding agents is that trajectory history must become a first-class representation. In the reasoning setting, you don't need structured prior-attempt memory because each attempt is cheap to regenerate. In the agent setting, failed attempts contain expensive, reusable signal.

Here's a sketch of what an experience-guided agent loop looks like:

def agent_with_experience(task, model, max_attempts=5, max_steps=100):
    experience_bank = []

    for attempt in range(max_attempts):
        # Compress prior knowledge into a context-budget-aware summary
        experience_ctx = build_experience_summary(experience_bank)

        trajectory = []
        env = fresh_environment(task)  # clean sandbox

        for step in range(max_steps):
            action = model.step(
                task=task,
                state=env.observe(),
                history=trajectory,
                experience=experience_ctx,  # <-- injected prior knowledge
            )
            observation = env.execute(action)
            trajectory.append((action, observation))

            if env.is_terminal():
                break

        result = env.evaluate(task)  # run test suite
        experience_bank.append(extract_experience(trajectory, result))

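        if result.success:
            return trajectory, attempt + 1  # (trajectory, attempts used)

    # No attempt passed: fall back to the best failed attempt. best_of is a
    # hypothetical helper; it assumes each experience keeps its trajectory.
    return best_of(experience_bank), max_attempts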


def build_experience_summary(bank):
    """
    Compress prior trajectories into something short enough to inject
    into every step of the next attempt.
    Key: hypotheses that failed, relevant files found, test errors seen,
    correct subtrajectories worth reusing.
    """
    if not bank:
        return ""
    entries = []
    for exp in bank:
        entries.append({
            "approach_tried": exp.initial_hypothesis,
            "first_failure":  exp.blocking_error,
            "relevant_files": exp.files_read,       # localization signal
            "test_outputs":   exp.test_failures,    # verifier signal
            "partial_wins":   exp.correct_subtraj,  # reusable fragments
        })
    return format_as_xml_context(entries)

The function build_experience_summary is where the hard work lives. It needs to compress a potentially 5,000-token failed trajectory into something useful — and short enough to fit in a context budget that will be consumed over 100 subsequent steps.
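One simple way to enforce a context budget is to keep only the most recent entries that fit. This is a sketch under stated assumptions: fit_to_budget is a hypothetical helper, and the 4-characters-per-token estimate is a rule of thumb standing in for a real tokenizer.

```python
def fit_to_budget(entries: list[str], max_tokens: int, chars_per_token: int = 4) -> list[str]:
    """Keep the most recent entries whose combined estimated size fits max_tokens."""
    kept: list[str] = []
    used = 0
    for entry in reversed(entries):               # newest first
        cost = len(entry) // chars_per_token + 1  # rough estimate, not a tokenizer
        if used + cost > max_tokens:
            break
        kept.append(entry)
        used += cost
    kept.reverse()                                # restore chronological order
    return kept
```

Dropping the oldest attempts first is only one policy. Ranking entries by how informative they were would be another, at the cost of needing a scoring function.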

Naive compression — paste the full prior trajectory — doesn't scale past two attempts. You'd consume your entire context window on history before the model can do any new work. A structured summary is lossy but cheaper. The design space is wide:

Use the model to write a natural-language post-mortem of its own failure ("I assumed the bug was in the date parser, but the tests revealed it's in the timezone handling")
Extract structured data: modified files, test failure messages, stack traces, assertions hit
Maintain an incrementally updated "key findings" list that persists across attempts
Tag correct subtrajectories (localization steps, useful file reads) for explicit reuse in the next attempt's initial actions
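The structured-extraction option can be sketched in a few lines. The action syntax below mirrors the example trajectory earlier (tool_name("first/arg", ...)), and the pytest-style "FAILED" prefix is an assumption about the test runner's output. Neither is a fixed format.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Experience:
    files_read: list[str] = field(default_factory=list)
    files_edited: list[str] = field(default_factory=list)
    test_failures: list[str] = field(default_factory=list)


# Assumed action syntax: tool_name("first/arg", ...)
_ACTION = re.compile(r'^(\w+)\("([^"]*)"')


def extract_structured_experience(trajectory: list[tuple[str, str]]) -> Experience:
    """Pull localization and verifier signal out of (action, observation) pairs."""
    exp = Experience()
    for action, observation in trajectory:
        match = _ACTION.match(action)
        if not match:
            continue
        tool, arg = match.groups()
        if tool == "read_file" and arg not in exp.files_read:
            exp.files_read.append(arg)
        elif tool == "edit_file" and arg not in exp.files_edited:
            exp.files_edited.append(arg)
        elif tool == "run_tests":
            # Keep only "FAILED <test id>" lines, deduplicated.
            for line in observation.splitlines():
                if line.startswith("FAILED ") and line not in exp.test_failures:
                    exp.test_failures.append(line)
    return exp
```

This captures what was touched and what broke, but not why the agent chose its hypothesis. That part still needs a model-written post-mortem.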

Trajectory Stitching and Branched Search

The most aggressive form of experience reuse is trajectory stitching: identify the best subtrajectory from attempt 1, the best continuation from attempt 2, and compose them into a trajectory that's better than either.

Attempt 1: Correctly localize bug (steps 1-15) → Wrong fix → Tests fail
Attempt 2: Mislocalize bug → Waste 20 steps → Stumble onto correct fix pattern
Attempt 3 (stitched): Attempt 1's localization + Attempt 2's fix → Tests pass

Done right, this is Monte Carlo Tree Search over the trajectory space: maintain a tree of partial trajectories, evaluate leaf nodes with the test suite, backpropagate quality estimates, select the next node to expand via a UCB-style policy.

graph TD
    A[Task: Fix timezone handling bug] --> B[Read entry points]
    B --> C[Localize to date_utils.py]
    C --> D{Branch point: snapshot}
    D --> E[Attempt 1: type coercion fix]
    D --> F[Attempt 2: input validation fix]
    D --> G[Attempt 3: timezone normalization]
    E --> E2[Tests: 3 fail]
    F --> F2[Tests: 1 fail]
    G --> G2[Tests: 0 fail ✓]
    style D fill:#fff9c4
    style C fill:#c8e6c9
    style G2 fill:#a5d6a7
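For the UCB-style selection step, a minimal sketch of UCB1 over the children of a branch point might look like this. The Node fields, the reward definition (for example, test pass rate), and the exploration constant are all assumptions to tune, not a prescribed design.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    visits: int = 0
    total_reward: float = 0.0   # e.g. sum of test pass rates seen below this node
    children: list["Node"] = field(default_factory=list)


def ucb1(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """UCB1 score: mean reward plus an exploration bonus for rarely tried nodes."""
    if child.visits == 0:
        return math.inf  # always try an unvisited branch first
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def select_child(parent: Node) -> Node:
    """Pick the child with the highest UCB1 score."""
    return max(parent.children, key=lambda ch: ucb1(ch, parent.visits))


def backpropagate(path: list[Node], reward: float) -> None:
    """Credit a rollout's reward to every node on the path from root to leaf."""
    for node in path:
        node.visits += 1
        node.total_reward += reward
```

Each expansion here is a full agent rollout from a snapshot, so the visit counts stay tiny compared with game-playing MCTS, and the exploration constant matters more than usual.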

The catch is that coding trajectories are not freely spliceable. Edit 5 depends on Edit 4, which modified a file that Edit 5 reads. You can't take the last action from attempt 2 and paste it after step 10 of attempt 1 — the filesystem state will be inconsistent, and the action may reference code that doesn't exist in that branch.
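The splicing problem shows up as a failed precondition. The sketch below models the filesystem as an in-memory dict (an assumption made to keep the example runnable) and refuses an edit whose target text is not present in the current branch.

```python
def apply_edit(files: dict[str, str], path: str, old: str, new: str) -> dict[str, str]:
    """Apply a search-and-replace edit to an in-memory file map.

    Raises ValueError if the edit's precondition does not hold in this state,
    which is what happens when an action is spliced into the wrong branch.
    """
    if path not in files:
        raise ValueError(f"{path} does not exist in this branch")
    if files[path].count(old) != 1:
        raise ValueError(f"expected exactly one match for {old!r} in {path}")
    updated = dict(files)  # copy, so the caller's snapshot is left untouched
    updated[path] = files[path].replace(old, new)
    return updated
```

A stitching procedure can use a check like this to reject incompatible splices cheaply, before paying for a test run.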

Solving this requires a clean snapshot/rollback mechanism. If agents run in Docker containers or VMs, you can take filesystem snapshots at arbitrary trajectory points and branch from them. This is real systems infrastructure — you're building a tree of container snapshots, each a few hundred MB, with a policy deciding which nodes to expand next. The overhead is non-trivial but tractable for high-value tasks. Sandboxed agent execution environments (like those used in SWE-bench evaluation harnesses) already have most of this plumbing.
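The snapshot-and-branch interface can be prototyped without containers. The sketch below uses plain directory copies as a stand-in for container or VM snapshots. SnapshotStore is a hypothetical class, and it captures only the working tree, not running processes or installed packages.

```python
import shutil
import tempfile
from pathlib import Path


class SnapshotStore:
    """Directory-copy stand-in for container or VM snapshots."""

    def __init__(self) -> None:
        self._root = Path(tempfile.mkdtemp(prefix="snapshots-"))
        self._count = 0

    def snapshot(self, workdir: Path) -> Path:
        """Copy the working tree and return the snapshot's location."""
        self._count += 1
        dest = self._root / f"snap-{self._count}"
        shutil.copytree(workdir, dest)
        return dest

    def branch(self, snapshot: Path) -> Path:
        """Create a fresh working tree from a snapshot for a new attempt."""
        workdir = Path(tempfile.mkdtemp(prefix="branch-")) / "work"
        shutil.copytree(snapshot, workdir)
        return workdir
```

Swapping the copy calls for real container snapshots changes the cost profile but not the interface the search policy sees.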

Adaptive Budget Allocation

The parallel question: how do you decide how much compute to spend per task without knowing in advance how hard it is?

For reasoning tasks, there's a clean result from recent constrained-policy work: a lightweight difficulty predictor allocates compute proportionally. Easy tasks get 1–2 samples, hard tasks get 32. This recovers most of the benefit of always running 32 samples at a fraction of the total cost.

For coding agents, difficulty estimation is harder. Task complexity isn't well-predicted by the problem statement. A one-line bug fix can require reading 50 files to localize. A complex feature request might happen to be self-contained. Proxy signals help: repository size, number of files likely touched, whether the task involves unfamiliar APIs, whether prior similar tasks in the training distribution were hard.
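A first-pass estimator over those proxy signals can be as simple as a threshold score. Everything here is a placeholder: the field names, the thresholds, and the bucket boundaries are assumptions you would replace with values fitted to your own task history.

```python
from dataclasses import dataclass


@dataclass
class TaskSignals:
    repo_files: int            # number of files in the repository
    likely_touched_files: int  # rough guess at files the change will touch
    unfamiliar_apis: bool      # task mentions APIs the agent rarely sees


def estimate_difficulty(sig: TaskSignals) -> str:
    """Map proxy signals to a coarse bucket. Thresholds are placeholders to tune."""
    score = 0
    score += 1 if sig.repo_files > 1000 else 0
    score += 1 if sig.likely_touched_files > 3 else 0
    score += 1 if sig.unfamiliar_apis else 0
    return ("easy", "medium", "medium", "hard")[score]
```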

A practical pattern: run a cheap initial scan (tight step budget, no experience context, fast model) to get a difficulty estimate before committing to the full experience-guided loop.

def adaptive_agent(task, fast_model, full_model, token_budget):
    # Phase 1: cheap difficulty probe, capped so it never exceeds the overall budget
    probe = fast_model.run(task, max_steps=10, max_tokens=min(2000, token_budget))
    difficulty = probe.estimate_difficulty()  # easy / medium / hard

    if difficulty == "easy":
        # Single attempt, tight budget
        return single_shot(task, full_model, max_steps=20)

    elif difficulty == "medium":
        # Sequential with experience, no stitching
        return experience_loop(task, full_model, max_attempts=3, max_steps=50)

    else:
        # Full search: experience loop + trajectory stitching
        return trajectory_search(
            task, full_model,
            max_attempts=5, max_steps=100,
            enable_branching=True
        )

The difficulty estimator is itself trainable. Given a corpus of historical tasks with known solve rates and step counts, you can fine-tune a small classifier on task descriptions and repository metadata. This is a form of learned compute allocation: the classifier learns to route tasks to the right inference strategy.

Verifier Quality Is the Ceiling

Every approach above assumes a verifier that can reliably distinguish correct trajectories from incorrect ones. For coding agents, the verifier is almost always the test suite. And the test suite has well-documented failure modes.

Incomplete coverage. Tests might pass even when the fix is subtly wrong. The agent solves the visible test cases while breaking untested behavior. This is related to, but distinct from, the oracle-gaming failure mode documented in SWE-bench analysis.

Test gaming under search pressure. Naive Best-of-N creates adversarial incentives. If you sample enough trajectories, you'll eventually find one that special-cases the test inputs rather than implementing the correct general behavior. This is statistically unlikely for a single agent on a single task, but across a search procedure with 50 rollouts and a noisy verifier, it's a real risk. The fix is a holdout verifier — run a second, more skeptical evaluation on solutions found via heavy test-time search.

Flaky tests. Non-deterministic test outcomes corrupt the search signal. A trajectory search procedure that prunes based on a flaky test will incorrectly eliminate correct solutions. Quantify flakiness on your test suite before investing in test-time scaling infrastructure.

No verifier at all. Open-ended tasks ("refactor for readability", "improve error messages") have no automatic ground truth. Test-time scaling doesn't help here without a learned reward model, which introduces its own hacking dynamics. Know when you're outside the regime where scaling works.
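The holdout-verifier idea from the test-gaming item can be sketched as a two-stage acceptance check. The Verifier type and the candidate representation (a patch as a string) are assumptions for illustration.

```python
from typing import Callable, Iterable, Optional

Verifier = Callable[[str], bool]  # takes a candidate patch, returns pass/fail


def select_with_holdout(
    candidates: Iterable[str],
    visible: Verifier,
    holdout: Verifier,
) -> Optional[str]:
    """Return the first candidate that passes both verifiers, else None.

    Search only ever sees `visible`; `holdout` is consulted once per finalist,
    so candidates cannot be tuned against it during search.
    """
    for patch in candidates:
        if visible(patch) and holdout(patch):
            return patch
    return None
```

The holdout only stays useful while its results are kept out of the experience context. Once its failures are fed back into later attempts, it becomes part of the search signal.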

The practical implication: the benefit from test-time scaling is bounded by verifier quality. In environments with high-coverage, deterministic test suites (well-maintained open-source libraries, safety-critical codebases with extensive test infrastructure), scaling works well. In typical production codebases with 40% line coverage and flaky integration tests, the gains are limited and the risk of gaming is elevated.
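Quantifying flakiness does not take much: re-run the unchanged suite several times and report the tests whose outcome varies. In this sketch, run_suite is a hypothetical callable returning {test_id: passed}. How you produce that mapping depends on your test runner.

```python
from collections import Counter
from typing import Callable


def flake_report(run_suite: Callable[[], dict[str, bool]], runs: int = 10) -> dict[str, float]:
    """Re-run an unchanged suite and report the pass rate of each inconsistent test.

    Tests that pass every time or fail every time are omitted.
    """
    passes = Counter()
    seen = Counter()
    for _ in range(runs):
        for test_id, passed in run_suite().items():
            seen[test_id] += 1
            passes[test_id] += int(passed)
    return {
        test_id: passes[test_id] / seen[test_id]
        for test_id in seen
        if 0 < passes[test_id] < seen[test_id]
    }
```

Any test that shows up in this report is a candidate for exclusion from the search signal until it is fixed.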

Tradeoffs and Failure Modes

Correlated failures. If all K agents use the same base model and prompt, they often fail identically. Diversity requires higher sampling temperature (which degrades quality per sample), explicit diversity prompting ("approach this differently than before"), or different model checkpoints. None of these is free.

Context cost compounding. Each sequential attempt adds experience context. After 5 attempts, the experience summary can consume 3,000–5,000 tokens per model call. Re-sent at every step of a 100-step trajectory, that is 300,000–500,000 additional input tokens per attempt, and 15% or more of each call's input when the rest of the context is under roughly 17,000 tokens. Context management is part of the algorithm, not an afterthought.

Sandboxing infrastructure. Branched search requires filesystem snapshots. Docker snapshot overhead is measurable — both in wall-clock time and storage. For most tasks, the overhead is acceptable. For short tasks (< 10 steps), snapshot overhead can dominate the actual agent execution time.

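Reward hacking at scale. More search pressure means more opportunities to find trajectories that game the verifier. If each rollout independently games the verifier with probability q, the chance that at least one of K rollouts does is 1 - (1 - q)^K, which is roughly Kq while that product is small. Run a secondary holdout evaluation on high-search-effort solutions before deploying them.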
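The context-cost item is easy to put numbers on. The sketch below assumes the summary is re-sent unchanged at every step and that the rest of the per-step context averages a fixed size. The 17,000-token figure is an illustrative assumption, not a measurement.

```python
def experience_overhead(
    summary_tokens: int, steps: int, other_tokens_per_step: int
) -> tuple[int, float]:
    """Return (extra input tokens per attempt, share of each step's input).

    Assumes the summary is re-sent unchanged at every step and that
    other_tokens_per_step is a fixed average; real contexts grow over a run.
    """
    extra = summary_tokens * steps
    share = summary_tokens / (summary_tokens + other_tokens_per_step)
    return extra, share


print(experience_overhead(3000, 100, 17000))  # (300000, 0.15)
```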

Practitioner's Lens

If you're shipping a coding agent today, sequential retry with experience context is the lowest-hanging fruit: summarize what failed and why, inject that into the next attempt, and you get measurable improvement with minimal infrastructure. Running three attempts sequentially often beats three attempts in parallel for the same total token budget, because experience transfer is cheap compared with rediscovery and correlated parallel failures are expensive. Match compute to task criticality: one attempt for a one-line fix, five attempts for a critical security patch. Before investing in test-time scaling infrastructure at all, measure your verifier quality. If your test suite covers less than 60% of the codebase, fix the tests first. And if you're building the trajectory-stitching infrastructure, design your agent sandbox for snapshots from day one; retrofitting this into an existing ephemeral-container setup is painful.
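One reason sequential attempts can come out cheaper is that they can stop at the first success, while parallel attempts all run to completion. The sketch below compares the two under a simplifying assumption that each attempt succeeds independently with probability p. It ignores both the extra context cost of experience summaries and any improvement in p from experience transfer.

```python
def parallel_cost(k: int, cost_per_attempt: float) -> float:
    """Cost of k independent attempts that all run to completion."""
    return k * cost_per_attempt


def expected_sequential_cost(k: int, cost_per_attempt: float, p: float) -> float:
    """Expected cost of up to k attempts that stop at the first success.

    Attempt i+1 runs only if the first i failed, so the expected number of
    attempts is sum_{i=0}^{k-1} (1-p)^i = (1 - (1-p)^k) / p, for p > 0.
    """
    return cost_per_attempt * (1.0 - (1.0 - p) ** k) / p


print(parallel_cost(3, 1.0))                            # 3.0
print(round(expected_sequential_cost(3, 1.0, 0.3), 2))  # 2.19
```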