Looped Language Models: Thinking Deeper Without Getting Bigger

A 2.6B looped transformer matches a 12B standard model on reasoning benchmarks. The trick: reuse the same block L times in latent space instead of emitting CoT tokens. Here is how Ouro, fixed-point mechanics, and adaptive halting actually work.

Abhinandan · 11 min read

Chain-of-thought says: to reason harder, emit more tokens. A looped transformer says: to reason harder, run the same layers more times. No tokens written. Same weights. Just more passes over the hidden state.

This is a small idea with a big consequence. A 2.6B looped model can match a 12B standard model on reasoning benchmarks (Ouro, Zhang et al.). The extra capability comes from depth, not parameters — and depth becomes a dial you turn at inference, not a decision baked into pre-training.

Why this matters

Every serious reasoning system today pays for depth the same way: with tokens. GPT-5 thinking, Claude's extended thinking, DeepSeek-R1 — each stretches a CoT trace to hundreds or thousands of tokens to reach an answer. That is expensive. Each extra token re-runs the full stack of layers and grows the KV-cache.

Looped models propose a different bargain. Reuse the same transformer block L times on the same token position, internally, before emitting. You pay for L times the block FLOPs but no extra tokens and no extra KV. The hidden state itself becomes the scratchpad.

If this works at scale — and evidence from late 2025 and early 2026 says it does — then "reasoning effort" becomes a floating-point knob, not a generation-length problem.

First principles: what looping actually means

Start with a vanilla transformer. Stack N layers. Token embedding x_0 goes in; each layer f_i maps it:

x_{i+1} = f_i(x_i)    for i = 0, ..., N-1

Each f_i has its own weights. N of them. Total parameters scale linearly with N.

Now replace this with one block f (shared weights), applied L times:

x_{t+1} = f(x_t)      for t = 0, ..., L-1

You have traded parameter count for compute. A single block with L loops has the same effective depth as L stacked blocks, but parameters stay at one block's worth.

Typical looped architectures compromise: use a small number k of unique blocks, loop the group L times. Effective depth is kL. Parameter count is k.
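The trade can be written down directly. A back-of-envelope accounting, assuming roughly 2 FLOPs per active parameter per token (the function names and numbers here are illustrative):

```python
# Back-of-envelope: stacked depth vs looped depth at matched compute.
# Assumes ~2 FLOPs per parameter per token; all numbers are illustrative.

def stacked_model(n_layers, params_per_layer):
    return {"params": n_layers * params_per_layer,
            "flops_per_token": 2 * n_layers * params_per_layer}

def looped_model(k_unique, n_loops, params_per_layer):
    return {"params": k_unique * params_per_layer,
            "flops_per_token": 2 * k_unique * n_loops * params_per_layer}

P = 100_000_000                 # 100M parameters per block, illustrative
deep = stacked_model(24, P)     # 24 unique layers
loop = looped_model(6, 4, P)    # 6 unique layers looped 4x: depth 24

print(deep["flops_per_token"] == loop["flops_per_token"])  # True: same compute
print(deep["params"] // loop["params"])                    # 4x fewer parameters
```

Same effective depth, same per-token compute, a quarter of the parameters: that is the whole bargain in four lines of arithmetic.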

The theoretical question: can iterative computation with shared weights actually solve problems that need deep computation? Yes — Saunshi et al. (ICLR 2025) prove that for problems solvable by iterative algorithms (addition, p-hop induction, arithmetic), a k-layer block looped L times approximates a kL-layer non-looped model, and log(p) loops suffice for p-hop induction, matching the depth lower bound.

In other words: for many reasoning problems, depth is what matters and parameters are wasted.

The math is simple; the trick is training

A looped forward pass, in minimal PyTorch (using nn.TransformerEncoderLayer as a stand-in block):

import torch.nn as nn

class LoopedBlock(nn.Module):
    def __init__(self, k_layers, d_model, n_heads):
        super().__init__()
        # k unique sub-layers; the whole group is re-applied on every loop
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(k_layers)
        ])

    def forward(self, x, n_loops):
        # Effective depth = k_layers * n_loops; parameters stay at k_layers
        for _ in range(n_loops):
            for layer in self.layers:
                x = layer(x)
        return x

That is it. The architecture is trivial. What is hard is training it so the loop actually learns to "think" and so you can vary n_loops at inference.

Three problems appear immediately.

First: gradient flow. Backprop through L loops is like backprop through a deep RNN. You need residual connections and careful normalization or gradients die. Modern looped models use pre-norm and gated residuals, with techniques borrowed from the deep RNN literature.

Second: loss surface. Saunshi et al. describe a "River-V-Valley" landscape — sharp, high-curvature basins that only weight-sharing produces. Monotonic per-depth loss improvement is possible only with looped weight-sharing, because every loop iteration sees the same parameters and pressures them to behave like a contraction toward the correct answer.

Third: how do you pick L at inference? Fix it? Learn it? Let the model decide per-token?
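On the first problem, the standard remedy is a pre-norm sub-layer with a learned residual gate initialized near zero, so the block starts as an identity map and gradients survive many loops. A minimal sketch (this is an illustrative block, not any paper's exact design):

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """Pre-norm feedforward sub-layer with a learned residual gate.
    Illustrative sketch of the gradient-flow fix, not an exact recipe."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model))
        # Gate starts at zero: the block is exactly the identity at init,
        # so gradients pass cleanly through an arbitrary number of loops.
        self.gate = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        return x + torch.tanh(self.gate) * self.ff(self.norm(x))
```

At initialization tanh(0) = 0, so stacking or looping this block any number of times is still the identity; training then opens the gate only as far as it helps.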

Ouro: making looping work at pre-training scale

Ouro is the first family of looped language models pre-trained from scratch at scale — 7.7 trillion tokens, 1.4B and 2.6B variants. It answers the three training problems above with three design decisions.

Shared-block architecture. A small transformer stack with weight-sharing across recurrent steps. The model keeps a stable representation space that every loop maps within.

Entropy-regularized dynamic depth. A learned halting mechanism decides, per token, how many loops to run. The training objective adds an entropy bonus that pushes the halting distribution away from always-halting-at-1 and always-halting-at-L_max, so the model actually uses the halting signal. Simple tokens exit after 1–2 loops; hard tokens ride out to L_max.

Latent reasoning as pre-training objective, not post-training hack. CoT today is bolted on after pre-training by SFT plus RL on reasoning traces. Ouro pushes reasoning into pre-training: the loss rewards the model for allocating depth where depth is needed, before any instruction tuning.
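The entropy-regularized halting idea can be sketched as a bonus term on the per-token distribution over halting steps. A minimal illustration, not Ouro's exact objective; the tensor shapes and the 1e-8 floor are assumptions:

```python
import torch

def halting_entropy_bonus(halt_logits):
    """Entropy of the per-token distribution over halting steps.
    Added with a positive weight to the training loss, it penalizes
    collapse to always-halt-at-1 or always-halt-at-L_max.
    Illustrative sketch only."""
    # halt_logits: (batch, seq, L_max), one score per candidate halt step
    p = torch.softmax(halt_logits, dim=-1)
    entropy = -(p * torch.log(p + 1e-8)).sum(-1)  # (batch, seq)
    return entropy.mean()

# Uniform logits give maximal entropy log(L_max); peaked logits give ~0.
uniform = torch.zeros(2, 3, 4)
peaked = torch.full((2, 3, 4), -10.0)
peaked[..., 0] = 10.0
print(halting_entropy_bonus(uniform))  # ≈ log(4) ≈ 1.386
print(halting_entropy_bonus(peaked))   # ≈ 0
```

Maximizing this term keeps the halting distribution spread out early in training; the task loss then concentrates it where depth actually pays off.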

The result: Ouro-1.4B with four loops (L=4) matches Qwen3-4B, and on MATH500 scores 82.4 vs 59.6. The 2.6B version competes with models up to 12B. The authors are careful — they point out the gain is in knowledge manipulation, not storage. You still need parameters to hold facts. Looping gives you cheap depth for computing with those facts.

A picture of the Ouro forward pass, per-token:

flowchart TD
    T[Token embedding x_0] --> B1[Shared block, loop 1]
    B1 --> H1{Halt?}
    H1 -- no --> B2[Shared block, loop 2]
    B2 --> H2{Halt?}
    H2 -- no --> B3[Shared block, loop 3]
    B3 --> H3{Halt?}
    H3 -- no --> B4[Shared block, loop L_max]
    H1 -- yes --> O[Unembed to logits]
    H2 -- yes --> O
    H3 -- yes --> O
    B4 --> O

Same weights in every block. The halting head is a tiny classifier on the hidden state that outputs p(halt | x_t). Entropy regularization on that distribution during training is what prevents collapse.
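A minimal sketch of such a halting head, plus a greedy sequence-level early-exit loop. This is illustrative only; Ouro's actual head and its per-token exit masking differ:

```python
import torch
import torch.nn as nn

class HaltingHead(nn.Module):
    """Tiny per-token halt classifier on the hidden state.
    Illustrative sketch, not Ouro's exact parameterization."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, x_t):
        # x_t: (batch, seq, d_model) hidden state after loop t
        return torch.sigmoid(self.proj(x_t)).squeeze(-1)  # p(halt | x_t)

def run_with_halting(block, halt_head, x, l_max, threshold=0.5):
    """Greedy inference-time halting: stop once every token wants to halt.
    True per-token early exit needs masking, omitted here for clarity."""
    for t in range(l_max):
        x = block(x)
        if halt_head(x).min() > threshold:
            break
    return x, t + 1
```

Easy inputs clear the threshold after one or two loops; hard inputs ride out to l_max, exactly the behavior the entropy objective is designed to preserve.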

A worked example: p-hop induction

To see why looping helps, consider p-hop induction. You have a sequence of characters; the query is "find the second-most-recent occurrence of X and output the character that followed it — then do that hop again, p times." A 4-hop problem requires tracing 4 pointer dereferences.

Non-looped transformer: you need at least log(p) attention layers, because each layer can at most double the number of hops resolved, and every stage carries its own weights and its own attention pattern. Parameter cost grows with that depth.

Looped transformer: one shared block, loop log(p) times. Each loop re-runs the same attention heads, but because the hidden state has been updated by the previous loop, the heads attend to a different position on each pass. One set of weights performs all p hops.

Concretely: for p=16, you would need 4 non-looped layers versus 1 looped block run 4 times. 4x fewer parameters. Same FLOPs. Same accuracy. The saving compounds as p grows.
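The iterative structure at work here is pointer jumping: one shared operation, applied t times, resolves 2^t hops, because each application doubles the hop distance every position has already resolved. A toy sketch in plain Python (the transformer construction in the papers is analogous, not identical):

```python
# Pointer jumping: the SAME operation, iterated, doubles the resolved
# hop distance each round -- the shape of computation a looped block learns.

def jump(ptr):
    # Every position follows its current pointer one more time.
    # If ptr[i] is the 2^t-hop successor of i, the result is the 2^(t+1)-hop
    # successor: hop distance doubles per application of this one function.
    return [ptr[p] for p in ptr]

# A chain of 17 nodes where next[i] = i + 1 (capped at the last node)
n = 17
ptr = [min(i + 1, n - 1) for i in range(n)]

# 4 applications of the same operation resolve 2^4 = 16 hops
for _ in range(4):
    ptr = jump(ptr)
print(ptr[0])  # position 0 now points 16 hops ahead -> 16
```

Four rounds of one shared function do the work of sixteen sequential dereferences, which is exactly the log(p)-loops-for-p-hops saving described above.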

The mechanistic picture: fixed points and cyclic trajectories

Why does this work? A paper from April 13, 2026 — Blayney et al., "A Mechanistic Analysis of Looped Reasoning Language Models" — gives the cleanest answer yet. They studied trained looped LMs and ran the recurrence long enough to watch what happens in latent space.

Two observations stand out.

First: each layer inside the shared block converges to its own distinct fixed point as loops proceed. Track the activations of layer i across loop iterations t = 1, 2, 3, ...; the output of layer i stabilizes to a fixed vector for that input. Different layers have different fixed points. The block as a whole follows a consistent cyclic trajectory — it visits the same sequence of points each loop.

Second: attention heads stabilize as the fixed point is approached. Early loops show attention patterns shifting between iterations — the model is still "thinking." Later loops show constant attention — the computation has converged. This is a learnable signal the model can use to decide it is done.

This matches the halting behavior Ouro's entropy objective discovers empirically: the model learns to detect "we have hit the fixed point" and exit. It also mirrors how feedforward networks work — Blayney et al. show looped blocks learn stages of inference that look like a feedforward model's layer progression, then repeat those stages with each loop, going deeper through the same pipeline.

A rough mental model: a looped block is a contractive map F with multiple fixed points, and the input determines which fixed point you land in. Easy inputs land fast (1–2 iterations); hard inputs take longer; pathologically hard inputs may never converge.
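That mental model can be made concrete with a toy one-dimensional contraction; the map, tolerance, and starting point below are made up for illustration:

```python
# Toy contractive map: iterate until the update is tiny, mirroring
# "halt once the fixed point is reached". Purely illustrative numbers.

def step(x):
    # Contraction with fixed point x* = 2 (solves x = 0.5*x + 1)
    return 0.5 * x + 1.0

def iterate_until_converged(x, tol=1e-6, max_loops=64):
    for t in range(1, max_loops + 1):
        nxt = step(x)
        if abs(nxt - x) < tol:   # the "attention has stabilized" analogue
            return nxt, t
        x = nxt
    return x, max_loops

x_star, loops = iterate_until_converged(10.0)
print(round(x_star, 4), loops)   # lands near 2.0 well before max_loops
```

The distance to the fixed point halves on every pass, so convergence time depends on where you start: inputs that begin near a fixed point halt in a couple of iterations, distant ones take many, and a non-contractive region would never trigger the stopping test.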

Think harder vs know more: where looping helps and does not

Frey et al. (March 2026) ran a cleaner ablation than most: they trained parameter-matched and FLOP-matched variants with (a) adaptive per-layer loops, (b) gated memory banks, and (c) both. The split is sharp:

  • Looping primarily helps mathematical reasoning. GSM8K, MATH, ARC — large gains.
  • Memory banks primarily help commonsense and factual tasks. HellaSwag, TriviaQA — looping does little.
  • Combining both beats a 3x-layer iso-FLOP baseline on math while keeping commonsense parity.

This matches the theoretical picture. Looping adds computational depth; it does not add storage. If your task needs more facts, loop harder all you want — nothing comes out of nothing. If your task needs to transform facts you already have, looping is extraordinarily efficient.

Tradeoffs and failure modes

Looping is not free, and it has real downsides you hit in practice.

FLOPs still scale with L. You saved parameters, not compute. If you loop 8 times, you did 8 blocks' worth of work. Memory for activations during training scales with L (gradient checkpointing helps; it is standard for this reason). At inference, time-to-first-token gets worse than a non-looped model of the same parameter count.
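The activation-memory cost is usually handled by checkpointing each loop iteration, recomputing it during backward instead of storing it. A minimal sketch using PyTorch's built-in utility (the wrapper function and toy block here are illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint

def looped_forward(block, x, n_loops, ckpt=True):
    """Re-run a shared block n_loops times. With ckpt=True, each loop's
    activations are discarded and recomputed in backward, so activation
    memory stays roughly O(1) in n_loops instead of O(n_loops).
    Illustrative sketch."""
    for _ in range(n_loops):
        if ckpt and x.requires_grad:
            x = checkpoint(block, x, use_reentrant=False)
        else:
            x = block(x)
    return x

# Toy shared block standing in for a transformer block
block = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Tanh())
x = torch.randn(4, 16, requires_grad=True)
y = looped_forward(block, x, n_loops=8)
y.sum().backward()  # gradients flow through all 8 recomputed loops
```

The price is one extra forward pass per loop during backward, which is why checkpointing trades the memory problem back into the FLOPs column.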

Halting is fragile. If the entropy regularizer is miscalibrated, the model either always halts at 1 (pre-training collapse) or always halts at L_max (wasted compute). Ouro spends significant ablation time on this.

Not all problems are iterative. For tasks where the answer is essentially a lookup — "what year was Napoleon born?" — loops do not help. You need parameters, not depth. This is why Ouro's claim is carefully worded around "knowledge manipulation" rather than "knowledge capacity."

Interpretability looks different. Traditional circuit analysis assumes layer i has different weights from layer j. With weight-sharing, "circuits" now span loop iterations, and you need to decompose along a cyclic trajectory rather than a layer stack. The mechanistic work is catching up but standard tools need adaptation.

Parallelism across loops is impossible. A non-looped model can parallelize layer-wise across GPUs with pipeline parallelism. A looped block must finish loop t before starting loop t+1 on the same token. You lose a degree of freedom in systems design.

Practitioner's lens

If you ship LLM-backed products today, none of this changes your stack next week — there is no Claude-scale looped model in production yet. But the direction matters.

The most interesting near-term implication is agents. Today's agent systems pay for hard reasoning with long CoT traces, which balloons latency and token cost per turn. A looped backbone would let you dial reasoning depth per turn without emitting visible tokens — useful for intermediate steps in a multi-turn agent loop where the model does not need to show its work. Imagine a browser agent that spends 8 loops on "which DOM node should I click?" and 1 loop on "format the extracted text" — today you pay for the former with 500 CoT tokens and a round-trip.

The second implication: evals need to change. Reasoning benchmarks that measure "accuracy per output token" will start misclassifying looped models as insanely efficient. The right axis is accuracy per FLOP, and most eval harnesses do not track it.

Third: if you are building with small models for latency reasons, watch looped small models closely. A 1.4B that performs like a 4B at similar or better inference cost (small params fit cheap GPUs even with L=4) changes the deployment math for on-device and edge use.

The research frontier here is young. Ouro is the first credible pre-trained looped LM. Mechanistic understanding arrived weeks ago. The training recipes are still fragile. But the idea — that depth is elastic and reasoning lives in latent space — is the kind of thing that stops being novel and starts being table-stakes within 18 months.

Further reading