Agentic RL Research

AgentFlow-Pro

Process-supervised RL that taught an 8B model to reason better — and the gain transferred to a domain it never trained on.

+5.0 pts

GPQA-Diamond · 40.0 → 45.0%

PyTorchTRLDAPOPRMPEFT / LoRAQwen3-8BOllamaFastMCP

GitHub ↗AgentFlow paper ↗DAPO paper ↗

Rebuilt the ICLR 2026 AgentFlow architecture from scratch
Replaced outcome-only GRPO with DAPO + a learned Process Reward Model
Full RL pipeline on a single A40 GPU (~$8–15)
Cross-domain generalization under leakage-free evaluation

The problem

Sparse rewards can't teach a multi-step agent which step was good

The original AgentFlow trains its planner with Flow-GRPO— an outcome-only signal. A six-step reasoning trajectory gets a single reward at the end: right or wrong. If the agent solved a hard problem but took one wrong turn on step 3, that signal can't say so. Credit is smeared across every step equally.

Process supervision scores each step instead. The bet: dense, per-step credit assignment teaches better reasoning than a single pass/fail at the end — even on a small model, even on a single GPU.

How it works

The agent loop

A Planner → Executor → Verifier loop with running memory. Only the Planner is trainable; everything else is fixed scaffolding.

trainable

Planner

Emits grammar-constrained JSON {thought, action, action_input}, action ∈ {think, search, code, answer}.

dispatch

Executor

Pure routing — Tavily web search, a sandboxed Python/SymPy REPL, or echo for think/answer.

judge

Verifier

Decides whether running state is sufficient to answer, or the loop should continue.

loop + memory

The Verifier routes back to the Planner until the state is sufficient. In-task Memory carries context across steps; a Qdrant cross-episode backend is the planned next layer.

How I trained it

A four-phase RL pipeline

Phase 1

Collect

Run the untrained agent on AIME training problems (max 6 steps) to gather trajectories.

Phase 2

Label

A DeepSeek judge rates each step 0–1 via a calibrated rubric — 531 step labels.

Phase 3

Train PRM

A Qwen3-0.6B sequence-regression head learns to predict step quality (MSE loss).

Phase 4

DAPO

Train the Planner (Qwen3-8B, bf16 + LoRA) against PRM-scored rewards with dynamic sampling.

the part TRL doesn't ship

TRL gives you clip-higher, token-level loss, and overlong filtering — but not dynamic sampling. I built that stage from scratch: drop prompts where the G rollouts show near-zero reward variance (pstdev < 1e-3), so gradient steps aren't wasted on prompts the model has already saturated.

End to end on one A40: collect → judge → train PRM → 300-step DAPO LoRA → GGUF export → Ollama serving, with a before/after eval held to the same quantization so the comparison is honest.

See it run

One pass through the loop

A representative solve — Planner proposes a tool call, the Executor runs it, the Verifier decides whether to loop or stop. Press Run or step through it.

agentflow-pro · main.py

1/7

PROMPT

If a fair coin is flipped 10 times, what is the probability of
exactly 6 heads? Give a decimal rounded to 4 places.

▍

representative trace

The result

Where the signal showed up

Qwen3-8B baseline40.0%3.09 steps

+ DAPO + PRMtrained45.0%3.19 steps

+5.0 points on graduate-level science questions — a cross-domain gain, since the Planner was trained only on AIME math. Step count barely moved (3.09 → 3.19), so the model got more accurate, not just more verbose.

What I built

Engineering that made it work

53× serving speedup

Ollama's /v1 endpoint silently ignores think: false; the native /api/chat endpoint honors it. Switching cut a single solve from 11m27s to ~13s.

Grammar-constrained planning

The Planner's output is locked to a Pydantic schema via Ollama's format field — every step is valid {thought, action, action_input} JSON, never free text to parse.

Leakage-free evaluation

Trained on AIME 1983–2023 (918 problems), de-duplicated against the AIME 2024 test set. The model is never trained on a problem it's later scored on.

Sandboxed Python REPL

The code tool runs in a stdlib + sympy/numpy/mpmath whitelist sandbox, auto-prints bare expressions, and tolerates lenient indentation from the model.

Reproduce it

Run the agent or the eval

solve a problem

$ uv sync
$ ollama pull qwen3:8b
$ uv run python main.py "What is 15% of 240, then doubled?"

run the benchmark

$ uv sync --extra eval
$ uv run python -m eval.run -b gpqa --limit 5 --max-steps 6
$ uv run python -m eval.run -b aime24

GitHub ↗AgentFlow paper ↗DAPO paper ↗

Built on ideas from the AgentFlow paper (arXiv 2510.05592) and DAPO (arXiv 2503.14476). MIT licensed; not affiliated with the original AgentFlow authors.

← All projects