Writing

Blog

Thoughts on AI systems, multi-agent orchestration, LLM inference, and production engineering.

Project Glasswing: How LLM Security Agents Find Real Bugs

Static analysis has cried wolf for thirty years. Project Glasswing uses Claude Mythos in a four-phase agentic loop — ingest, hypothesize, confirm, exploit — producing confirmed vulnerability reports with working PoC code. Here's why the execution-confirmation step changes everything.

May 2, 2026·12 min read

architectures

Diffusion LLMs: Denoising Text Instead of Predicting Tokens

Autoregressive LLMs commit to each token before writing the next — a structural constraint that forces premature choices. Masked diffusion LLMs like LLaDA and Dream 7B break this by denoising an entire sequence over T steps with bidirectional context. Here is the math, the inference loop, and how second-gen models finally solved the KV-cache problem.

May 1, 2026·11 min read

post-training-rl

Outcome Rewards Don't Teach Reasoning: RLVR's Faithfulness Gap

RLVR trains LLMs by rewarding correct final answers on math and code. But a correct answer doesn't mean the model's chain of thought actually caused it. Here's the causal gap in outcome-only RL, how to measure it, and why process rewards are harder than they look.

Apr 30, 2026·13 min read

frontier-models

Multi-Head Latent Attention: How DeepSeek Broke the KV Cache Wall

With 128 heads of dimension 128, standard MHA caches 32,768 key and value elements per token per layer. DeepSeek-V2's Multi-Head Latent Attention compresses this to a 512-dim latent, cuts KV cache by 93%, and achieves 5.76× throughput, without adding FLOPs.

Apr 29, 2026·11 min read
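
As a rough check on the 32,768 figure in the teaser above, here is a minimal sketch of the per-token, per-layer cache arithmetic. It assumes 128 heads of dimension 128 with both keys and values cached, and it treats the 512-dim latent as the only state MLA stores. The function and constant names are illustrative, not taken from the post.

```python
# Back-of-the-envelope KV-cache arithmetic, per token per layer.
# All names here are hypothetical and chosen for illustration.

NUM_HEADS = 128    # head count quoted in the teaser
HEAD_DIM = 128     # assumed head dimension; 2 * 128 * 128 gives 32,768
LATENT_DIM = 512   # latent width quoted in the teaser


def mha_cache_elements(num_heads: int, head_dim: int) -> int:
    """Elements cached by standard MHA: one key and one value vector per head."""
    return 2 * num_heads * head_dim


def shrink_factor(full_elements: int, latent_elements: int) -> float:
    """How many times smaller the latent cache is than the full cache."""
    return full_elements / latent_elements


full = mha_cache_elements(NUM_HEADS, HEAD_DIM)
factor = shrink_factor(full, LATENT_DIM)

print(f"MHA cache elements per token per layer: {full}")
print(f"Latent cache elements per token per layer: {LATENT_DIM}")
print(f"Shrink factor under these assumptions: {factor:.0f}x")
```

This simplified ratio ignores any extra per-token state a real implementation may cache, so it should not be read as a derivation of the 93% or 5.76× figures, which are reported results.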

tooling-mcp

MCP Internals: The Wire Protocol That Connects LLMs to Everything

Most MCP descriptions stop at the USB-C-for-AI analogy. This goes to the wire: JSON-RPC message format, stdio vs SSE transport, capability negotiation, tool call lifecycle, and the failure modes that will find you in production.

Apr 28, 2026·11 min read

coding-agents

Test-Time Scaling for Coding Agents: When Trajectories Aren't Tokens

Best-of-N works for math reasoning because verifying a token sequence is cheap. A coding-agent trajectory is 100+ tool calls, a filesystem state, and a noisy test suite. Here's what changes—and what breaks—when you try to scale test-time compute for long-horizon agents.

Apr 27, 2026·12 min read

interpretability

Transcoders: The Missing Piece for Transformer Circuits

Sparse autoencoders map transformer activations to interpretable features, but features alone don't explain computation. Transcoders replace MLP blocks with sparse-bottleneck surrogates, making the causal flow between features legible for the first time.

Apr 26, 2026·11 min read

browser-agents

Accessibility Trees vs Screenshots in LLM Browser Agents

Every screenshot a browser agent takes burns 1,500+ vision tokens on pixels. The browser already computes a structured, cheap alternative: the accessibility tree. Here's how Playwright MCP uses it, the token economics, and where it breaks.

Apr 24, 2026·12 min read
