Transcoders: The Missing Piece for Transformer Circuits

Sparse autoencoders map transformer activations to interpretable features, but features alone don't explain computation. Transcoders replace MLP blocks with sparse-bottleneck surrogates, making the causal flow between features legible for the first time.

Abhinandan · 11 min read

Sparse autoencoders (SAEs) gave mechanistic interpretability researchers something they had been missing for years: a principled way to decompose a transformer's internal representations into interpretable, monosemantic features. Thousands of features were found, labeled, and steered. But a dictionary of what the model knows at each layer still doesn't tell you how it computes. Transcoders close that gap.

Why SAEs Alone Don't Explain Computation

Start with the core problem: individual neurons in large language models are polysemantic. A single neuron might activate for "the smell of bananas," "references to tropical climates," and "certain Python syntax constructs," with no obvious common thread. This is not a coincidence or a flaw; it is a consequence of superposition.

The superposition hypothesis says that a neural network with n neurons can represent far more than n features by embedding them as nearly-orthogonal directions in n-dimensional space. When only a sparse subset of features activates at any given time, the mutual interference between features stays small. The model trades off signal fidelity for representational capacity, and for tasks that only require a few features at once, this trade-off is favorable.
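A quick numerical sketch makes the near-orthogonality claim concrete. The sizes below are arbitrary illustrative choices, and random unit vectors are only a stand-in for learned feature directions. The sketch draws four times more directions than there are dimensions and measures how much they overlap.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 256, 1024   # illustrative sizes: 4x more features than dimensions

# Random unit vectors stand in for feature directions (an assumption: learned
# directions are not random, but random ones already show the effect).
directions = rng.standard_normal((n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Pairwise cosine similarities; off-diagonal entries measure interference.
cos = directions @ directions.T
off_diag = np.abs(cos[~np.eye(n_features, dtype=bool)])

mean_interference = off_diag.mean()
max_interference = off_diag.max()
print(f"mean |cos| = {mean_interference:.3f}, max |cos| = {max_interference:.3f}")
```

For random directions the typical overlap shrinks roughly as 1/sqrt(d_model), which is why interference stays manageable as long as only a few features are active at once.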

Superposition makes individual neurons useless as interpretive units. A sparse autoencoder (SAE) inverts this compression. Given activations x at some layer, an SAE learns a dictionary matrix W_d ∈ ℝ^{d × n_features} and encoder W_e ∈ ℝ^{n_features × d} such that:

f(x) = ReLU(W_e * x + b_enc)        # sparse feature activations
x_hat = W_d * f(x) + b_dec          # reconstruction
loss  = ||x - x_hat||^2 + lam * ||f(x)||_1   # MSE + L1 sparsity

After training, individual columns of W_d correspond to interpretable, approximately monosemantic features. Researchers at Anthropic have found features corresponding to specific people, places, grammatical roles, and even abstract concepts like "the pivot point in an argument" in Claude-family models. The features exist; the SAE reveals them.
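The equations above translate almost line for line into code. What follows is a minimal NumPy sketch of the forward pass and loss only, with toy sizes and random untrained weights chosen for illustration. The function names `sae_forward` and `sae_loss` are hypothetical, not from any particular library.

```python
import numpy as np

def sae_forward(x, W_e, b_enc, W_d, b_dec):
    """Encode activations x into feature activations and reconstruct x from them."""
    f = np.maximum(0.0, W_e @ x + b_enc)   # ReLU: non-negative feature activations
    x_hat = W_d @ f + b_dec                # weighted sum of dictionary columns
    return f, x_hat

def sae_loss(x, f, x_hat, lam):
    """Squared reconstruction error plus an L1 sparsity penalty on the features."""
    return np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(f))

# Toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d, n_features, lam = 16, 64, 1e-3
W_e = rng.standard_normal((n_features, d)) * 0.1
W_d = rng.standard_normal((d, n_features)) * 0.1
b_enc, b_dec = np.zeros(n_features), np.zeros(d)

x = rng.standard_normal(d)
f, x_hat = sae_forward(x, W_e, b_enc, W_d, b_dec)
loss = sae_loss(x, f, x_hat, lam)
print(f.shape, x_hat.shape, float(loss))
```

With untrained weights like these, roughly half the features are active on any input. Minimizing the loss over many activations is what drives most of them to zero.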

But here's what SAEs cannot tell you: if feature A fires at layer 12 and feature B fires at layer 13, does A cause B? Does A inhibit C? What is the actual computation happening inside the MLP block between those two residual-stream snapshots?

SAEs are observational, not causal. They describe the representations before and after a layer but say nothing about the transformation itself.

What MLP Blocks Actually Do

Before introducing transcoders, it helps to be precise about what an MLP block computes. In a standard transformer MLP:

import torch.nn.functional as F

def mlp_block(x, W_in, b_in, W_out, b_out):
    # x: residual stream activations, shape [d_model]
    pre_act = W_in @ x + b_in        # shape [d_mlp]; W_in is [d_mlp, d_model]
    post_act = F.gelu(pre_act)       # elementwise nonlinearity
    out = W_out @ post_act + b_out   # shape [d_model]; W_out is [d_model, d_mlp]
    return out

The MLP reads from the residual stream, applies a learned nonlinear transformation through a wider hidden layer, and writes back to the residual stream. The width d_mlp is typically 4× d_model.

From the circuit perspective, each MLP neuron (a single index of post_act) acts like a look-up function: it fires when specific input patterns are present and contributes a specific direction to the residual stream when it does. The problem is that MLP neurons are polysemantic for the same reason residual stream dimensions are: superposition.

So you have polysemantic inputs, a polysemantic hidden layer, and polysemantic outputs. SAEs decode the inputs and outputs. The hidden layer, where the actual computation lives, remains opaque.

Transcoders: MLP Replacement with a Sparse Bottleneck

A transcoder replaces the MLP block with a learned surrogate that performs the same input-output mapping through an interpretable sparse bottleneck. Formally:

import torch

def transcoder(x_pre_mlp, W_enc, b_enc, W_dec, b_dec):
    # x_pre_mlp: residual stream before MLP, shape [d_model]
    features = torch.relu(W_enc @ x_pre_mlp + b_enc)  # sparse, shape [n_features]
    mlp_out_hat = W_dec @ features + b_dec             # predicted MLP output, shape [d_model]
    return mlp_out_hat, features

# Training objective against frozen MLP weights:
#   loss = ||mlp(x) - transcoder(x)[0]||^2 + lam * ||transcoder(x)[1]||_1

This looks almost identical to an SAE, with one critical difference: the input is the pre-MLP residual stream and the target is the MLP's output, the update that the MLP writes back to the residual stream. The transcoder is not reconstructing its input; it is predicting the output of a computation.
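To make the difference in training target explicit, here is a minimal NumPy sketch of the transcoder loss evaluated against a frozen toy MLP. All sizes and weights are random placeholders, `transcoder_loss` is a hypothetical helper name, and the GELU uses the common tanh approximation. No optimizer step is shown.

```python
import numpy as np

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z ** 3)))

def mlp(x, W_in, b_in, W_out, b_out):
    return W_out @ gelu(W_in @ x + b_in) + b_out

def transcoder(x, W_enc, b_enc, W_dec, b_dec):
    features = np.maximum(0.0, W_enc @ x + b_enc)
    return W_dec @ features + b_dec, features

def transcoder_loss(x, mlp_params, tc_params, lam):
    target = mlp(x, *mlp_params)              # frozen MLP output, not x itself
    pred, features = transcoder(x, *tc_params)
    return np.sum((target - pred) ** 2) + lam * np.sum(np.abs(features))

# Toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d_model, d_mlp, n_features = 8, 32, 128
mlp_params = (rng.standard_normal((d_mlp, d_model)) * 0.3, np.zeros(d_mlp),
              rng.standard_normal((d_model, d_mlp)) * 0.3, np.zeros(d_model))
tc_params = (rng.standard_normal((n_features, d_model)) * 0.1, np.zeros(n_features),
             rng.standard_normal((d_model, n_features)) * 0.1, np.zeros(d_model))

x = rng.standard_normal(d_model)
pred, features = transcoder(x, *tc_params)
loss = transcoder_loss(x, mlp_params, tc_params, lam=1e-3)
print(pred.shape, features.shape, float(loss))
```

The only line that separates this from an SAE loss is the one computing `target`: an SAE would compare the prediction against `x`.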

Training pushes the sparse bottleneck to capture what is needed to reproduce the MLP's output. Because the transcoder is a surrogate rather than the MLP itself, an active transcoder feature is best read as a candidate intermediate step in the computation, one that still has to be checked against the real model.

The n_features dimension is typically 10–50× larger than d_mlp, because the MLP represents many more features than it has neurons, with several superposed onto each one. The transcoder un-superposes them.

Building Circuits from SAEs and Transcoders

The power of transcoders comes from composing them with SAEs at residual stream checkpoints. Here is what the composed pipeline looks like across a single layer:

flowchart LR
    A["Residual stream\n(layer L)"] --> B["SAE\n(layer L)"]
    B --> C["Interpretable\nfeatures (layer L)"]
    C --> D["Attention\n(layer L)"]
    D --> E["Residual stream\n(pre-MLP, layer L)"]
    E --> F["Transcoder\n(MLP layer L)"]
    F --> G["Computation\nfeatures"]
    G --> H["Residual stream\n(layer L+1)"]
    H --> I["SAE\n(layer L+1)"]
    I --> J["Interpretable\nfeatures (layer L+1)"]

With both components trained, you can trace the full causal chain: which residual-stream features at layer L activate which transcoder computation features, which write which directions into the residual stream, which activate which SAE features at layer L+1. This is a circuit in the precise sense — a directed causal graph over interpretable sparse features.

Dunefsky, Chlenski, and Nanda (2024) demonstrated transcoder-based circuit analysis on GPT-2 small, including a case study that reverse-engineers the model's "greater-than" circuit. Behaviors like this, whose circuits were first mapped by labor-intensive manual studies, are the natural benchmark: a transcoder-derived circuit can be checked against the earlier account while being produced far more automatically.

Finding Circuits Automatically

The circuit discovery procedure is built on linear attribution. For each target SAE feature at layer L+1, score every transcoder feature by how much its write direction aligns with the target feature's read direction:

import torch

def find_upstream_features(target_feat_idx, sae, transcoder, k=10):
    """
    Returns the indices of the k transcoder features that write most strongly
    along the target SAE feature's encoder (read) direction.
    """
    # Each column of transcoder.W_dec is the residual stream direction
    # that feature writes when it fires: shape [d_model, n_transcoder_features].
    target_read = sae.W_enc[target_feat_idx, :]      # [d_model]

    # How much does each transcoder feature write in the direction
    # that the target SAE feature reads from? Only the largest positive
    # (excitatory) scores are returned.
    scores = transcoder.W_dec.T @ target_read         # [n_transcoder_features]
    top_k = torch.topk(scores, k=k).indices
    return top_k

Linear attribution ignores cross-feature interactions and the transcoder's internal nonlinearities, and in this weights-only form it also ignores how strongly each feature actually fires on a given input, so it is an approximation. Causal interventions (ablating or patching individual feature activations at inference time and measuring the downstream effect) give a far more direct measure of causal strength but are orders of magnitude more expensive. In practice, linear attribution proposes candidate features, and interventions confirm which are truly load-bearing.
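One way to see how the cheap score relates to an intervention is to weight each alignment score by the feature's activation and compare it with a zero-ablation. The NumPy sketch below uses random toy weights and stand-in activations. It covers only the single linear step from transcoder output to a downstream feature's pre-activation, which is the one case where the two quantities agree exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_tc = 16, 40   # toy sizes, illustrative only

W_dec = rng.standard_normal((d_model, n_tc))   # column j = write direction of feature j
b_dec = rng.standard_normal(d_model)
features = np.maximum(0.0, rng.standard_normal(n_tc))   # stand-in transcoder activations
target_read = rng.standard_normal(d_model)     # encoder row of one downstream SAE feature

def target_input(feats):
    """Contribution of the transcoder output to the target feature's pre-activation."""
    return target_read @ (W_dec @ feats + b_dec)

# Input-dependent attribution: activation times write/read alignment.
attributions = features * (W_dec.T @ target_read)

# Direct intervention: zero-ablate the feature with the largest attribution.
j = int(np.argmax(np.abs(attributions)))
ablated = features.copy()
ablated[j] = 0.0
effect = target_input(features) - target_input(ablated)
print(j, float(attributions[j]), float(effect))
```

Once a ReLU, a normalization layer, attention, or further layers sit between the write and the read, the attribution and the measured effect can diverge, which is where the more expensive interventions earn their cost.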

Tradeoffs and Failure Modes

Reconstruction error accumulates. A transcoder with 2% per-layer reconstruction error looks accurate in isolation, but the errors compound when several layers are replaced at once. Faithfulness metrics, which measure how well a full SAE+transcoder substitution predicts real model behavior, degrade noticeably for circuits spanning more than 3–4 layers. This is the central practical limitation for deep-network circuit analysis.
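A back-of-the-envelope calculation shows how quickly a small per-layer error adds up. The sketch below assumes each replaced layer independently retains a fixed fraction of the signal, so fidelity multiplies across layers. That is a deliberately crude toy model, not a measurement of any real transcoder stack.

```python
def retained_fidelity(per_layer_error: float, n_layers: int) -> float:
    """Fraction of signal retained after n_layers, assuming errors compound multiplicatively."""
    return (1.0 - per_layer_error) ** n_layers

for n_layers in (1, 4, 12, 48):
    kept = retained_fidelity(0.02, n_layers)
    print(f"{n_layers:>2} layers: {kept:.3f} retained, {1 - kept:.3f} lost")
```

Under this assumption a 2% per-layer error costs about 8% after four layers and more than 20% after twelve. Real error propagation can be better or worse, because downstream layers may either amplify or wash out upstream mistakes.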

Feature identity is unstable across training runs. SAEs and transcoders trained with different random seeds learn different feature decompositions. Feature #4721 in one run is not the same concept as #4721 in another. Automated alignment methods based on cosine similarity work well for high-activating features but fail on the long tail of marginal ones. Cross-run reproducibility is a genuine obstacle for building shared feature libraries.
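Cosine-similarity alignment can be sketched in a few lines. The example below is an assumption-laden toy: `match_features` is a hypothetical helper that greedily pairs each feature in run A with its nearest neighbor in run B, and "run B" is simulated as a shuffled, lightly perturbed copy of run A rather than a genuinely retrained dictionary.

```python
import numpy as np

def match_features(W_dec_a, W_dec_b):
    """For each feature (column) in run A, find the most cosine-similar feature in run B."""
    a = W_dec_a / np.linalg.norm(W_dec_a, axis=0, keepdims=True)
    b = W_dec_b / np.linalg.norm(W_dec_b, axis=0, keepdims=True)
    cos = a.T @ b                      # [n_a, n_b] cosine similarities
    best = cos.argmax(axis=1)          # greedy: two A features may map to the same B feature
    return best, cos[np.arange(cos.shape[0]), best]

# Synthetic check: run B is run A with shuffled indices and a little noise.
rng = np.random.default_rng(0)
d_model, n_features = 32, 50
W_a = rng.standard_normal((d_model, n_features))
perm = rng.permutation(n_features)
W_b = W_a[:, perm] + 0.01 * rng.standard_normal((d_model, n_features))

best, sims = match_features(W_a, W_b)
print(best[:5], sims.min())
```

The synthetic case is easy because every feature has a near-exact twin. With real retrained dictionaries the best-match similarity for rare, weakly activating features is often low, and a one-to-one assignment method may be preferable to the greedy argmax used here.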

Attention heads are out of scope. Transcoders cover MLP blocks only. A complete circuit analysis requires separate SAEs on attention head outputs and a theory of how query-key products select which positions to attend to. The components exist as research prototypes, but composing them into a unified circuit picture for real-world behaviors is an open research problem.

λ tuning is per-layer, not global. The sparsity penalty that produces clean monosemantic features in early layers produces either over-sparse or dense features in later layers. Models with 96 layers need 96 individually tuned transcoders. Automated λ-scheduling methods are under active development but not yet standard.

Scale remains unvalidated at frontier size. Most transcoder results come from models under 10B parameters. Whether transcoder features stay monosemantic and circuits stay compact at 70B+ is empirically unconfirmed. Early scaling indicators are cautiously positive — circuits appear to remain sparse — but the field has not yet done the full-scale audit.

What Gets Found in Practice

The most compelling empirical results involve tasks with known circuit structure from earlier activation-patching work: indirect object identification, factual recall, and syntactic agreement. In all three cases, transcoder-based circuit discovery finds:

  • Five to twenty transcoder features that are causally sufficient for the behavior.
  • Clean semantic interpretations of each feature (e.g., "the subject is a person," "the sentence is in past tense," "this is a proper noun").
  • Selective ablation effects: removing only those features degrades performance on the target task while leaving unrelated tasks intact.

The tight correspondence between "interpretable label" and "causally necessary component" is what makes circuits-based interpretability feel different from post-hoc rationalization. The features are not found by asking "what would make this behavior explainable?" Instead, they emerge from a training objective that only cares about reconstruction fidelity and sparsity.

The Causally Grounded Extension

A March 2026 paper, "Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations," takes the next step: attach LLM-generated natural language labels to transcoder features and then verify those labels causally. Does intervening on the feature in the way the description predicts actually produce the described output change?

Many features labeled by automated interpretability pipelines fail this test. The label describes a statistical correlation, not a mechanism. Requiring causal verification significantly reduces false positives in circuit explanations and produces a cleaner ranking: features with causally verified labels are more predictive of downstream behavior when patched than features whose labels are merely associative.

This is an important methodological advance. A circuit explanation you can verify causally is a hypothesis you can test and refute. One you cannot is a story.

Practitioner's Lens

Transcoders are not production tooling today — they require white-box weight access and are trained against frozen model internals. But the research shapes what interventions will be feasible once that access exists.

Activation steering becomes more principled with transcoders. The difference between "add a vector found by probing" and "fire a transcoder feature that causally produces the target behavior" is the difference between a correlational nudge and a mechanistic intervention. The latter should generalize better across contexts where the correlation breaks, although that expectation still needs broad empirical support.
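Mechanically, "firing" a transcoder feature means clamping its activation and letting the decoder translate that into a change in the predicted MLP output. The NumPy sketch below uses random toy weights. The feature index and clamp value are hypothetical, and wiring the resulting delta into a real model's forward pass depends on the framework and is not shown.

```python
import numpy as np

def transcoder_out(features, W_dec, b_dec):
    return W_dec @ features + b_dec

def clamp_feature(features, idx, value):
    """Return a copy of the activations with one feature forced to a chosen value."""
    steered = features.copy()
    steered[idx] = value
    return steered

# Toy sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d_model, n_features = 16, 64
W_dec = rng.standard_normal((d_model, n_features))
b_dec = np.zeros(d_model)
features = np.maximum(0.0, rng.standard_normal(n_features))

idx, value = 7, 5.0   # hypothetical feature index and clamp strength
steered = clamp_feature(features, idx, value)
delta = transcoder_out(steered, W_dec, b_dec) - transcoder_out(features, W_dec, b_dec)
print(np.linalg.norm(delta))
```

The change in output is exactly the feature's decoder column scaled by how far the activation was moved. That makes the intervention easy to reason about at the layer where it is applied, even though its downstream effect still has to be measured.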

For fine-tuning and RLHF, knowing which layers and features are causally involved in a behavior gives a basis for layer-selective training — concentrating gradient updates on components that matter while leaving irrelevant ones frozen. This is not how most practitioners fine-tune today, but it is where the mechanistic evidence points.

For eval design, circuit analysis tells you which input variations probe the actual mechanism rather than surface correlates. Evals built on mechanism-level understanding are harder to game and more predictive of out-of-distribution failures.

The interpretability stack — SAEs on the residual stream, transcoders on MLP blocks, attention SAEs on head outputs — is not a finished product. But for the first time it gives researchers something they can point to and say: here is the program the model is running for this behavior, and here is the evidence.
