Agentic RL Research
AgentFlow-Pro
Process-supervised RL that taught an 8B model to reason better — and the gain transferred to a domain it never trained on.
+5.0 pts
GPQA-Diamond · 40.0 → 45.0%
- Rebuilt the ICLR 2026 AgentFlow architecture from scratch
- Replaced outcome-only GRPO with DAPO + a learned Process Reward Model
- Full RL pipeline on a single A40 GPU (~$8–15)
- Cross-domain generalization under leakage-free evaluation
The problem
Sparse rewards can't teach a multi-step agent which step was good
The original AgentFlow trains its planner with Flow-GRPO— an outcome-only signal. A six-step reasoning trajectory gets a single reward at the end: right or wrong. If the agent solved a hard problem but took one wrong turn on step 3, that signal can't say so. Credit is smeared across every step equally.
Process supervision scores each step instead. The bet: dense, per-step credit assignment teaches better reasoning than a single pass/fail at the end — even on a small model, even on a single GPU.
How it works
The agent loop
A Planner → Executor → Verifier loop with running memory. Only the Planner is trainable; everything else is fixed scaffolding.
Planner
Emits grammar-constrained JSON {thought, action, action_input}, action ∈ {think, search, code, answer}.
Executor
Pure routing — Tavily web search, a sandboxed Python/SymPy REPL, or echo for think/answer.
Verifier
Decides whether running state is sufficient to answer, or the loop should continue.
loop + memory
The Verifier routes back to the Planner until the state is sufficient. In-task Memory carries context across steps; a Qdrant cross-episode backend is the planned next layer.
How I trained it
A four-phase RL pipeline
Collect
Run the untrained agent on AIME training problems (max 6 steps) to gather trajectories.
Label
A DeepSeek judge rates each step 0–1 via a calibrated rubric — 531 step labels.
Train PRM
A Qwen3-0.6B sequence-regression head learns to predict step quality (MSE loss).
DAPO
Train the Planner (Qwen3-8B, bf16 + LoRA) against PRM-scored rewards with dynamic sampling.
the part TRL doesn't ship
TRL gives you clip-higher, token-level loss, and overlong filtering — but not dynamic sampling. I built that stage from scratch: drop prompts where the G rollouts show near-zero reward variance (pstdev < 1e-3), so gradient steps aren't wasted on prompts the model has already saturated.
End to end on one A40: collect → judge → train PRM → 300-step DAPO LoRA → GGUF export → Ollama serving, with a before/after eval held to the same quantization so the comparison is honest.
See it run
One pass through the loop
A representative solve — Planner proposes a tool call, the Executor runs it, the Verifier decides whether to loop or stop. Press Run or step through it.
If a fair coin is flipped 10 times, what is the probability of exactly 6 heads? Give a decimal rounded to 4 places.
The result
Where the signal showed up
+5.0 points on graduate-level science questions — a cross-domain gain, since the Planner was trained only on AIME math. Step count barely moved (3.09 → 3.19), so the model got more accurate, not just more verbose.
What I built
Engineering that made it work
53× serving speedup
Ollama's /v1 endpoint silently ignores think: false; the native /api/chat endpoint honors it. Switching cut a single solve from 11m27s to ~13s.
Grammar-constrained planning
The Planner's output is locked to a Pydantic schema via Ollama's format field — every step is valid {thought, action, action_input} JSON, never free text to parse.
Leakage-free evaluation
Trained on AIME 1983–2023 (918 problems), de-duplicated against the AIME 2024 test set. The model is never trained on a problem it's later scored on.
Sandboxed Python REPL
The code tool runs in a stdlib + sympy/numpy/mpmath whitelist sandbox, auto-prints bare expressions, and tolerates lenient indentation from the model.
Reproduce it
Run the agent or the eval
$ uv sync$ ollama pull qwen3:8b$ uv run python main.py "What is 15% of 240, then doubled?"
$ uv sync --extra eval$ uv run python -m eval.run -b gpqa --limit 5 --max-steps 6$ uv run python -m eval.run -b aime24
Built on ideas from the AgentFlow paper (arXiv 2510.05592) and DAPO (arXiv 2503.14476). MIT licensed; not affiliated with the original AgentFlow authors.
← All projects