Blog
Thoughts on AI systems, multi-agent orchestration, LLM inference, and production engineering.
Project Glasswing: How LLM Security Agents Find Real Bugs
Static analysis has cried wolf for thirty years. Project Glasswing uses Claude Mythos in a four-phase agentic loop (ingest, hypothesize, confirm, exploit) to produce confirmed vulnerability reports with working PoC code. Here's why the execution-confirmation step is what separates real bugs from false alarms.
Diffusion LLMs: Denoising Text Instead of Predicting Tokens
Autoregressive LLMs commit to each token before writing the next — a structural constraint that forces premature choices. Masked diffusion LLMs like LLaDA and Dream 7B break this by denoising an entire sequence over T steps with bidirectional context. Here is the math, the inference loop, and how second-gen models finally solved the KV-cache problem.
Outcome Rewards Don't Teach Reasoning: RLVR's Faithfulness Gap
RLVR trains LLMs by rewarding correct final answers on math and code. But a correct answer doesn't mean the model's chain of thought actually caused it. Here's the causal gap in outcome-only RL, how to measure it, and why process rewards are harder than they look.
Multi-Head Latent Attention: How DeepSeek Broke the KV Cache Wall
Standard MHA caches 32,768 elements per token per layer at 128 heads (a 128-dim key and a 128-dim value per head). DeepSeek-V2's Multi-Head Latent Attention compresses this to a 512-dim latent. Compared with DeepSeek 67B, DeepSeek-V2 cuts the KV cache by 93.3% and reaches 5.76× maximum generation throughput, without adding FLOPs.
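The per-token arithmetic behind those numbers fits in a few lines. This is a minimal sketch, assuming a per-head dimension of 128 (which is what makes the 32,768 figure work out) and counting cache elements only, ignoring dtype and batch size. DeepSeek-V2 also caches a small decoupled RoPE key (64 dims in the paper), modeled here by the `rope_dim` parameter. The helper names are hypothetical. Note that the 93.3% headline is measured against DeepSeek 67B, a different baseline, so it will not match the ratio against this MHA baseline.

```python
# Per-token, per-layer KV-cache element counts (illustrative helpers, not a library API).
N_HEADS = 128     # attention heads
HEAD_DIM = 128    # assumed per-head key/value dimension
LATENT_DIM = 512  # MLA's compressed KV latent

def mha_kv_elements(n_heads: int, head_dim: int) -> int:
    # Standard MHA stores one key vector and one value vector per head.
    return 2 * n_heads * head_dim

def mla_kv_elements(latent_dim: int, rope_dim: int = 0) -> int:
    # MLA stores one shared latent, plus an optional decoupled RoPE key.
    return latent_dim + rope_dim

mha = mha_kv_elements(N_HEADS, HEAD_DIM)
mla = mla_kv_elements(LATENT_DIM)
mla_with_rope = mla_kv_elements(LATENT_DIM, rope_dim=64)

print(mha)                              # 32768
print(mla)                              # 512
print(f"{1 - mla / mha:.1%}")           # 98.4%
print(f"{1 - mla_with_rope / mha:.1%}") # 98.2%
```

Against a full 128-head MHA baseline the per-layer saving is roughly 98%, even with the RoPE key included.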
MCP Internals: The Wire Protocol That Connects LLMs to Everything
Most MCP descriptions stop at the USB-C-for-AI analogy. This goes to the wire: JSON-RPC message format, stdio vs SSE transport, capability negotiation, tool call lifecycle, and the failure modes that will find you in production.
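To make "the wire" concrete, here is a minimal sketch of a `tools/call` request as a JSON-RPC 2.0 message, framed for the stdio transport, where each message is one line of JSON terminated by a newline. It is hand-rolled for illustration and is not the official MCP SDK. The tool name `get_weather` and its arguments are hypothetical.

```python
import json
from typing import Any

def make_tool_call(request_id: int, tool_name: str, arguments: dict[str, Any]) -> dict[str, Any]:
    """Build a JSON-RPC 2.0 request for MCP's tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

def frame_for_stdio(message: dict[str, Any]) -> bytes:
    """Serialize one message as a single newline-terminated line of UTF-8 JSON.

    json.dumps escapes any newline inside string values as \\n, so the
    trailing newline is the only raw newline, i.e. the message delimiter.
    """
    return (json.dumps(message, separators=(",", ":")) + "\n").encode("utf-8")

def parse_stdio_line(line: bytes) -> dict[str, Any]:
    """Decode one newline-delimited message and check the JSON-RPC version tag."""
    msg = json.loads(line.decode("utf-8"))
    if msg.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 message")
    return msg

# Hypothetical tool and arguments, for illustration only.
req = make_tool_call(1, "get_weather", {"location": "New York"})
wire = frame_for_stdio(req)
print(wire)
```

The receiving side reads one line, calls `parse_stdio_line`, and matches the response to the request by `id`. That `id` correlation is what lets a client keep several tool calls in flight over a single pipe.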
Test-Time Scaling for Coding Agents: When Trajectories Aren't Tokens
Best-of-N works for math reasoning because verifying a token sequence is cheap. A coding-agent trajectory is 100+ tool calls, a filesystem state, and a noisy test suite. Here's what changes—and what breaks—when you try to scale test-time compute for long-horizon agents.
Transcoders: The Missing Piece for Transformer Circuits
Sparse autoencoders map transformer activations to interpretable features, but features alone don't explain computation. Transcoders replace MLP blocks with sparse-bottleneck surrogates, making the causal flow between features legible for the first time.
Accessibility Trees vs Screenshots in LLM Browser Agents
Every screenshot a browser agent takes burns 1,500+ vision tokens on pixels. The browser already computes a structured, cheap alternative: the accessibility tree. Here's how Playwright MCP uses it, the token economics, and where it breaks.