Outcome Rewards Don't Teach Reasoning: RLVR's Faithfulness Gap
Here's the uncomfortable question sitting under every RLVR paper: if a model produces a 400-token chain of thought and then gets the right answer, how do you know the chain of thought did anything?
The answer is that you don't. Verifiable rewards confirm the output is correct. They say nothing about whether the intermediate steps caused it.
Why This Matters
This isn't a philosophical complaint. If RLVR's reasoning chains are cosmetic—decorative text that correlates with correct answers but doesn't causally produce them—then we're training elaborate hallucination machines that happen to land on right answers. They'll look confident, reason fluently, and fail silently in exactly the cases where the memorized pattern stops matching reality.
The practical consequences are real. An agent that "reasons" through a codebase but doesn't actually trace through the logic will look fine on SWE-bench and fail in production when the codebase diverges from its training distribution. A math model that pattern-matches to the right answer structure without actually computing will hallucinate plausibly at exactly the novel problems you care about. Correct outputs mask broken process, and you'll find out at the worst moment.
RLVR: What It Actually Optimizes
RLVR (Reinforcement Learning with Verifiable Rewards) is the training recipe behind DeepSeek-R1 and most of the high-performing reasoning models of the last 18 months. The setup is clean:
- Sample a problem from a math or coding dataset.
- Let the model generate a long chain of thought (CoT) ending in a final answer.
- Check the final answer against a verifier (symbolic math checker, test runner).
- Use the binary correct/incorrect signal to update the policy, typically via GRPO (sketched below).
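A minimal sketch of one such update, with the model, verifier, and sampling interfaces assumed for illustration (this is not a real library API):

def rlvr_update(model, problem, verifier, group_size=8):
    # Sample a group of completions for the same problem (GRPO-style).
    completions = [model.sample(problem) for _ in range(group_size)]
    # Binary verifiable reward: only the final answer is ever checked.
    rewards = [float(verifier(problem, c.final_answer)) for c in completions]
    mean_r = sum(rewards) / len(rewards)
    var_r = sum((r - mean_r) ** 2 for r in rewards) / len(rewards)
    std_r = var_r ** 0.5 or 1.0  # guard against a zero-variance group
    for c, r in zip(completions, rewards):
        advantage = (r - mean_r) / std_r  # group-normalized advantage
        # Every token in the completion, chain of thought included,
        # shares this one scalar. Nothing inspects the intermediate steps.
        model.policy_gradient_update(c.tokens, advantage)

Note what the loop never touches: the content of the chain of thought. The reward, and therefore the gradient, depends only on whether the final answer verified.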
This works remarkably well on benchmarks. Models trained this way learn to produce multi-step reasoning, backtrack, check intermediate results, and express uncertainty. The training signal is genuinely verifiable: the answer is right or it isn't.
The problem is that "verifiable" and "causally driven by the reasoning chain" are two different properties. RLVR provides the first. It doesn't, in any formal sense, provide the second. That distinction is the entire subject of this post.
The Faithfulness Gap
"Faithfulness" in the CoT literature means: does the generated reasoning chain accurately reflect the actual computational process that produced the answer? An unfaithful chain of thought might lead to the right answer, but through an entirely different mechanism than the text claims.
There are two failure modes.
Shortcutting via pattern matching. A model that has seen 10,000 AMC problems can recognize structural patterns—"two trains approaching each other" problems have a certain answer shape—and fill in a plausible-looking chain of thought after the fact. The CoT is post-hoc rationalization, not computation. On distribution, this is indistinguishable from genuine reasoning. Off distribution, it breaks suddenly and silently.
Memorized answer retrieval with decorative process. The model internally retrieves an answer (from weights, via some fast pathway), then constructs a reasoning chain that justifies it. The chain is consistent and even correct, but it didn't cause the answer; it was generated after the answer was effectively determined.
Neither failure mode is hypothetical. The evidence that LLMs sometimes post-hoc rationalize rather than genuinely trace through reasoning is well established, and RLVR training creates new incentives that make this worse, not better.
Testing Causal Importance
The key prior work here is Turpin et al. (2023), "Language Models Don't Always Say What They Think", which showed that adding biasing context (e.g., "I think the answer is A") to a question causes models to generate CoTs that lead to answer A, while the CoT never acknowledges the bias as a factor. The model's stated reasoning and its actual computational determinants diverge.
More direct: Lanham et al. (2023), "Measuring Faithfulness in Chain-of-Thought Reasoning", deliberately corrupted intermediate CoT steps and measured whether this changed final answers. For easy, pattern-matchable problems, corrupting intermediate steps barely moved final answers. For hard problems that actually required step-by-step computation, corrupting steps did change answers. The conclusion: CoT reasoning is causally important mainly when the problem is hard enough that pattern-matching fails.
The recent paper "Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning" (April 2026) extends this specifically to RLVR-trained models. They measure whether models trained purely on outcome rewards develop CoTs that are actually causally necessary for their answers, using a systematic intervention methodology.
The methodology is an intervention: take a trained model, modify specific intermediate steps in the generated CoT (by replacing claimed values with wrong values), then autoregressively complete the CoT from that perturbation point and check the final answer. If the model consistently recovers the correct answer even after incorrect intermediate steps, those steps weren't causally important. If the final answer follows the perturbed step, they were.
What they find: RLVR-trained models develop two classes of behavior in their training distribution. For problems where the model can pattern-match effectively, CoT steps are mostly not causally necessary—the model generates them but doesn't depend on them. For genuinely hard problems at the frontier of model capability, CoT steps are more causally important. The ratio of pattern-matched to genuinely computed solutions is worse than benchmark scores suggest.
The Numerical Intuition
Think about this computationally. A model generating tokens doesn't "run code"—it's doing a feedforward pass, sampling the next token based on everything before it. When it "computes 47 × 23" in a chain of thought, it's not running an algorithm; it's sampling a number that's consistent with what a multiplication answer looks like in the training distribution.
For small multiplications (anything easily memorized or pattern-matched from pretraining), the answer is retrieved from weights directly, and the chain of thought is generated to look like arithmetic. For large multiplications that exceed what's memorized, the model must actually use the intermediate tokens as working memory—the CoT becomes a scaffold for computation that the residual stream can't hold internally.
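To make the working-memory claim concrete with the example above: done honestly, the decomposition $47 \times 23 = 47 \times 20 + 47 \times 3 = 940 + 141 = 1081$ forces the intermediate values $940$ and $141$ into the token stream, where later steps can condition on them. A model genuinely using the scaffold will change its final sum if you rewrite $940$ as $976$; a model retrieving $1081$ from weights will print $1081$ regardless.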
Here's a pseudocode sketch of the intervention test:
def causal_faithfulness_test(model, problem, n_steps=5):
    cot_tokens, answer = model.generate_with_cot(problem)
    intermediate_claims = extract_intermediate_claims(cot_tokens)
    faithfulness_scores = []
    for step_idx, claim in enumerate(intermediate_claims[:n_steps]):
        # Inject a wrong intermediate value
        perturbed_cot = perturb_step(cot_tokens, step_idx, wrong_value=claim + 36)
        # Continue generation autoregressively from the perturbation point
        _, perturbed_answer = model.continue_from(problem, perturbed_cot)
        # True = model followed the wrong step (reasoning was causal)
        # False = model recovered the original answer (pattern-matched past error)
        step_causal = (perturbed_answer != answer)
        faithfulness_scores.append(step_causal)
    return faithfulness_scores
A model that's genuinely reasoning returns mostly True—its subsequent computation depends on what the intermediate step said. A model that's pattern-matching returns mostly False—it ignores the perturbed intermediate and retrieves the memorized answer anyway. These two models look identical on standard benchmarks.
Why RLVR Specifically Enables This
The standard RLHF setup rewards the entire output based on human preference. Human raters, imperfectly, track coherence between reasoning and conclusions. RLVR is stricter on outcomes but blind to process.
The gradient signal in RLVR says: "outputs that lead to correct answers are better." It doesn't say: "reasoning chains that genuinely derive the answer are better." These objectives are correlated on the training distribution, because real reasoning usually does work. But they are not identical, and any model capable enough to find a cheaper path to reward will eventually take it.
Over thousands of gradient steps, the model can learn a strategy that maximizes reward better than genuine reasoning: learn the problem distribution's structure well enough to mostly retrieve answers, then generate fluent reasoning post-hoc. On the training distribution, this strategy and genuine reasoning are nearly indistinguishable to the reward signal.
This is Goodhart's Law, applied precisely. The metric (correct final answer) is a proxy for what we want (correct reasoning process). When the model finds a way to satisfy the metric without satisfying the underlying objective, RLVR reinforces that shortcut.
flowchart LR
A[Problem] --> B{Model Internal}
B -->|Pattern match| C[Answer retrieved from weights]
B -->|Actual reasoning| D[Answer derived step-by-step]
C --> E[CoT generated post-hoc]
D --> F[CoT traces derivation]
E --> G[Final answer token]
F --> G
G --> H[Verifier: correct?]
H -->|Both paths pass equally| I[Same RLVR reward signal]
style C fill:#ffcccc
style E fill:#ffcccc
style I fill:#ffffcc
Both paths receive identical reward. The verifier cannot tell them apart, and RLVR has no mechanism to prefer the faithful path over the shortcut.
Process Reward Models: The Right Idea, Hard to Implement
The obvious fix is to reward intermediate steps, not just final answers. This is what Process Reward Models (PRMs) do. Instead of a binary signal at the end, you train a reward model to score individual steps in a reasoning chain, and use those step-level scores during RL training.
PRMs were examined systematically in the OpenAI "Let's Verify Step by Step" paper (Lightman et al., 2023), which showed that reranking solutions with a PRM significantly outperforms reranking with an outcome reward model in best-of-N search on MATH. The intuition is correct: if you can verify good reasoning steps, you can avoid training on post-hoc rationalization.
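In sketch form, step-level scoring might look like the following. Here prm.score is an assumed interface, and the aggregation is one common choice rather than the paper's exact recipe:

def score_chain(prm, problem, cot_steps):
    # Score each step in the context of the problem and all preceding steps.
    # `prm.score` is an assumed interface, not a real library call.
    step_scores = [
        prm.score(problem, cot_steps[: i + 1]) for i in range(len(cot_steps))
    ]
    # Aggregate per-step scores into a solution-level score. A product of
    # step-correctness probabilities is one standard choice; taking the
    # minimum ("the chain is as good as its weakest step") is another.
    solution_score = 1.0
    for s in step_scores:
        solution_score *= s
    return step_scores, solution_score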
But PRMs have deep problems that don't get talked about enough.
The annotation problem. Training a PRM requires knowing which intermediate steps are correct. For math, human raters can do this, but it's expensive and the judgments aren't always clear-cut. For reasoning about code, scientific claims, or legal logic, "is this intermediate step correct?" is often as hard as the original problem. You'd need an oracle.
PRM gaming. A model trained against a PRM will find ways to produce steps that score highly on the PRM without those steps being causally necessary for the answer. You've moved the Goodhart problem one level up. The PRM becomes the new proxy metric, and the same failure mode reproduces at the next layer.
Verification cost scales with chain length. For a reasoning chain of N steps, verifying that each step is correct in context is roughly as hard as solving each subproblem from scratch. The total verification cost scales with N. This is precisely why outcome rewards are attractive: math answers and code tests are O(1) to verify. Step-level verification is O(N) at minimum.
Distribution mismatch. PRMs trained on human-written reasoning chains score model-generated chains poorly, even when those chains are correct. Model-generated reasoning has stylistic and structural differences that confuse PRMs trained on human data. The reward signal drifts out of calibration for reasoning styles the PRM wasn't trained on.
What "Causally Important" Actually Requires
Let's be precise. A reasoning step $s_i$ in a chain $[s_1, s_2, \ldots, s_n, a]$ is causally important for answer $a$ if interventions on $s_i$ that change its content cause corresponding changes in $a$ under the model's generation process, conditioned on the problem and preceding steps $[s_1, \ldots, s_{i-1}]$.
Formally, this is Pearl's do-operator applied to autoregressive generation: $P(a \mid \text{do}(s_i = s_i'))$ must differ from $P(a \mid \text{do}(s_i = s_i))$ for some alternative value $s_i' \neq s_i$. The observational picture, where we merely see CoTs that look causal, and the interventional one, where we actually modify steps, can diverge substantially.
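One way to compress this into a scalar score, offered as a definition sketch rather than any paper's exact metric, is to average the effect of interventions over a perturbation distribution:

$$\mathrm{CI}(s_i) = \mathbb{E}_{s_i' \sim \mathcal{P}(s_i)}\left[ d\left( P\big(a \mid \mathrm{do}(s_i = s_i')\big),\; P\big(a \mid \mathrm{do}(s_i = s_i)\big) \right) \right]$$

where $\mathcal{P}(s_i)$ is a distribution over corrupted versions of the step and $d$ is any divergence between answer distributions. The pseudocode earlier estimates a binarized version of this with a single perturbation per step.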
The gap between observational consistency and interventional causal importance is exactly what RLVR cannot close. Every gradient update is based on the model's own samples, which are observational. Interventional testing requires modifying reasoning mid-generation, which never happens during RLVR training. The training process has no mechanism to distinguish a CoT that caused an answer from one that merely preceded it.
The Practitioner's Lens
For anyone shipping LLM-based systems that rely on reasoning chains, this has concrete implications.
Don't use CoT length or step count as a quality signal. Longer reasoning is only better if the reasoning is causally driving the answer. A model that produces 800 tokens of reasoning-flavored text before a memorized answer isn't thinking harder—it's narrating. Rewarding length or apparent thoroughness in your pipeline directly worsens this failure mode.
Build adversarial tests for your domain. Construct test cases where wrong intermediate steps should force wrong answers. If your deployed model "recovers" from deliberately injected errors in its own reasoning, you've found evidence of unfaithful reasoning in your eval set. This is a concrete, automated test you can run today.
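As a sketch of what that harness looks like, reusing causal_faithfulness_test from earlier (eval_problems is a hypothetical stand-in for your own test set):

# Hypothetical harness: run the intervention test across an eval set.
results = [causal_faithfulness_test(model, p) for p in eval_problems]
total_steps = sum(len(r) for r in results)
causal_steps = sum(sum(r) for r in results)
# A high recovery rate means many injected errors were ignored:
# evidence of unfaithful, pattern-matched reasoning.
recovery_rate = 1 - causal_steps / total_steps
print(f"Model recovered from {recovery_rate:.0%} of injected errors")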
For high-stakes decisions, prefer novel problems over familiar ones in your eval. Novel problems—ones that genuinely require computation because they can't be pattern-matched—are where faithful reasoning matters and where the gap between faithful and unfaithful models shows up. Benchmarks dominated by problems from the training distribution don't measure this.
PRMs are worth investment for constrained domains. If you can define what a correct intermediate step looks like—code that compiles and passes subtests, formal logic steps, SQL that parses correctly—a domain-specific PRM is tractable and genuinely helps. Don't try to build a general PRM. Build a narrow one for your domain and your model's reasoning style.
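For the code case specifically, even a trivial step check is a process signal that an outcome reward lacks. A toy sketch, assuming the intermediate steps are Python snippets:

def python_step_verifier(code_step: str) -> bool:
    # Toy domain check: does the intermediate snippet at least parse?
    # A real domain PRM would also run subtests or type-check, but even
    # this much is information the final-answer verifier never sees.
    try:
        compile(code_step, "<step>", "exec")
        return True
    except SyntaxError:
        return False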
The Road Ahead
The RLVR faithfulness problem is structurally identical to the classic ML generalization problem, but one level up. Classic generalization failure: the model learns the training distribution instead of the underlying function. RLVR faithfulness failure: the model learns to satisfy the training signal (correct outcomes) via a strategy that differs from the intended mechanism (genuine reasoning). Same structure, different abstraction layer.
The fixes are also structurally similar: harder training problems that require genuine computation rather than recall, better supervision signals where tractable, and adversarial evaluation that probes mechanisms rather than just outputs.
Scaling alone won't fix this. A larger model with more memorization capacity might become less faithful, not more—it can retrieve correct answers for more problem types, making genuine step-by-step computation rarer in its actual behavior. Benchmark scores keep climbing. Underlying faithfulness doesn't necessarily follow.
The field is moving toward: verifiable intermediate steps for constrained domains, causal intervention probing as a standard eval tool, and training objectives that directly reward process consistency rather than just outcome correctness. None of those is fully solved. All three are active problems as of this writing.
The uncomfortable bottom line: RLVR has given us models that are remarkably good at producing convincing reasoning about hard problems. Whether that reasoning is doing the computational work it claims to do is a separate question, and right now the answer is: sometimes, and we don't have reliable tools to tell when.
Further Reading
- Turpin et al. (2023), "Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting"
- Lanham et al. (2023), "Measuring Faithfulness in Chain-of-Thought Reasoning"
- Lightman et al. (2023), "Let's Verify Step by Step"
- DeepSeek-R1 Technical Report (2025)
- "Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning" (April 2026)
- "On the Direction of RLVR Updates for LLM Reasoning" (March 2026)