I build the infrastructure AI products run on.
Multi-agent systems, LLM inference pipelines, and the unglamorous production work that separates demos from products.

About
2+ years building at the edge of what's possible with AI.
I joined BrowzerLabs early — before the team had process, before the architecture was decided, before anyone was sure it would work. That meant writing production code and making calls that stuck.
Most of my work lives at the intersection of language models and real software: figuring out where a model's reasoning breaks down, designing systems that degrade gracefully when it does, and shipping things that work on a Tuesday at 3am.
I care about the unsexy parts — latency budgets, error surfaces, cost models, observability. The parts that don't make it into the demo but determine whether the product survives contact with users.
View Resume →
2+ yrs
AI engineering experience
3
AI products shipped to production
4
agents in a single pipeline
700K+
LLM calls monitored in production
Things I've built
- Rebuilt the ICLR 2026 AgentFlow paper from scratch as a local Qwen3-8B Planner→Executor→Verifier→Memory agent loop — grammar-constrained JSON planning, Tavily web search, and a sandboxed Python/SymPy executor.
- Replaced the paper's outcome-only GRPO with DAPO and a learned Process Reward Model (Qwen3-0.6B regression head trained on DeepSeek-judge step labels) for dense per-step credit assignment — plus a from-scratch dynamic-sampling stage that TRL doesn't implement.
- Ran the full RL pipeline end-to-end on an A40 GPU: collect → judge 531 steps → train PRM → 300-step DAPO LoRA on Qwen3-8B (bf16) → GGUF export → Ollama serving, with leakage-free, quantization-matched before/after evaluation.
- Result: +5.0 pts on GPQA-Diamond (40.0%→45.0%, n=100) — a cross-domain gain from a Planner trained only on AIME math; AIME24 held flat within noise (n=30).
- Pre-flight Decimal-precise caps on cost, tokens, time, and tool calls — stops runaway agent loops before the next risky call executes.
- Per-tool circuit breakers and a verifier feedback retry loop that feeds corrections back into the agent under one shared budget.
- OpenTelemetry GenAI spans for every protected call with structured RunResult failure types for downstream handling.
- Drop-in adapters for LangGraph graphs and OpenAI Agents SDK runs — no agent rewrite required.
- Semantic cache for LLM agents: embeddings retrieve candidates, a learned pairwise classifier decides reuse — so "approve this refund" and "deny this refund" never share a cached answer.
- Ships a pretrained classifier-v2 trained on 16,576 labeled pairs across 9 domains — +30 precision points at equal recall vs a tuned cosine baseline.
- FAISS vector search, WAL-backed SQLite persistence, implicit bad-hit detection, gated manual retraining, and CI across Python 3.11–3.14.
- Dependency-free Python 3.11+ framework for readable multi-agent pipelines: sequential, parallel, conditional, and retryable flows with shared StepContext.
- Built-in lifecycle events, flat execution traces, human review gates, and JSON checkpoint/resume.
- Optional LiteLLM Agent with structured Pydantic outputs; shipped through v0.5.0 with tag-based PyPI releases.
- Replaces brittle exact-match assertions with repeated-run pass-rate scoring — tests how reliably an agent behaves, not just whether one run looks right.
- Traces tool calls, timing, and steps; supports collect-then-raise behavioral assertions (call ordering, schema validation, latency bounds).
- OpenAI, Anthropic, and LangChain adapters; Typer CLI with JSON reports for CI gates.
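A minimal sketch of the repeated-run pass-rate idea from the last project above. It is an illustration under stated assumptions, not that library's actual API: `pass_rate`, `assert_pass_rate`, `PassRateResult`, and the stub agent are hypothetical names introduced here.

```python
import itertools
from dataclasses import dataclass
from typing import Callable


@dataclass
class PassRateResult:
    passes: int
    runs: int

    @property
    def rate(self) -> float:
        return self.passes / self.runs


def pass_rate(agent: Callable[[], str], check: Callable[[str], bool], runs: int = 10) -> PassRateResult:
    """Run the agent `runs` times and count how often `check` accepts its output."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passes = sum(1 for _ in range(runs) if check(agent()))
    return PassRateResult(passes=passes, runs=runs)


def assert_pass_rate(result: PassRateResult, minimum: float) -> None:
    """Fail only when reliability drops below the threshold, not on one bad run."""
    if result.rate < minimum:
        raise AssertionError(
            f"pass rate {result.rate:.0%} ({result.passes}/{result.runs}) is below {minimum:.0%}"
        )


# Hypothetical stand-in for a non-deterministic agent: fails one run in four.
_outputs = itertools.cycle(["refund approved"] * 3 + ["I cannot help"])


def flaky_agent() -> str:
    return next(_outputs)


result = pass_rate(flaky_agent, lambda out: "approved" in out, runs=8)
assert_pass_rate(result, minimum=0.7)  # 6 of 8 runs pass, so this does not raise
```

An exact-match assertion would fail this agent on any run that happens to hit the bad output. The threshold turns that into a reliability claim that can gate CI.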
Where I've worked
- Built Browzer's Chrome MV3 recorder + CDP-native browser automation agent, achieving 95%+ precise AX/DOM element capture with cross-iframe support, obstruction checks, and real mouse/key/upload execution.
- Built a streaming ReAct loop across FastAPI and the extension with SSE tool execution, multi-tab orchestration, safe parallelism, abort/continue, and audit logs.
- Cut automation LLM spend by roughly 67% using compact recording traces, context-window compression, prompt caching, and model-routing across GPT-5, Claude Sonnet & Haiku.
- Shipped a zero-LLM replay engine: recordings run as variable-driven tool-call templates, with a stateful AI fallback that resumes mid-run on failure.
- Shipped self-healing docs that auto-repair on UI drift — Haiku→Sonnet diff triage, LLM-free replay of intact steps, and a CDP agent that fixes only what changed.
- Shipped core features of an AI-powered real estate platform using Next.js, Nest.js, GraphQL, Redis, and GCP.
- Built the AI knowledge base service using FastAPI, LangChain, and vector retrieval pipelines, powering customer-facing search workflows.
- Developed document-ingestion pipelines using Google Cloud Vision, XLSX processing, and BullMQ workers, enabling automated extraction of customer data from spreadsheets and scanned records.
- Automated containerized CI/CD infrastructure via Docker, GitHub Actions, and Nginx for reverse proxy/load balancing.
- Built a LangChain + pgvector knowledge base powering AI-assisted document search and retrieval workflows, improving query accuracy by 15%.
- Developed scalable data-ingestion pipelines using bulk CSV processing and Celery workers, reducing processing time by 40%.
- Engineered a production PDF generation system transforming structured AI outputs and dynamic JSON reports into enterprise-grade documents.
- Automated deployment of AI services using Docker, GitHub Actions, and AWS EC2, establishing reliable CI/CD workflows for production environments.
- Received a personal offer from the CEO to join HeroUI (prev. NextUI) after making open-source contributions.
- Resolved 10+ bugs & delivered 7+ feature enhancements in core components including Calendar, Table and Pagination.
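The zero-LLM replay engine mentioned in the Browzer work above (recordings as variable-driven tool-call templates, with a fallback that resumes mid-run) can be sketched roughly as follows. This is an illustration only, not Browzer's implementation: `Step`, `replay`, the tool names, and the fallback signature are all assumptions made for the example.

```python
from dataclasses import dataclass
from string import Template
from typing import Callable


@dataclass(frozen=True)
class Step:
    tool: str           # name of the recorded tool call, e.g. "navigate"
    arg_template: str   # argument with $variables filled in at replay time


ToolFn = Callable[[str], None]
Fallback = Callable[[int, list[Step], dict[str, str]], None]


def replay(
    steps: list[Step],
    variables: dict[str, str],
    tools: dict[str, ToolFn],
    fallback: Fallback,
) -> int:
    """Run recorded steps with no model calls.

    Returns len(steps) on a clean run, or the index of the failing step
    after handing it to `fallback` (a stand-in for the AI agent).
    """
    for index, step in enumerate(steps):
        try:
            arg = Template(step.arg_template).substitute(variables)
            tools[step.tool](arg)
        except Exception:
            # Hand over the failing index so the fallback resumes mid-run
            # instead of starting the recording again from step 0.
            fallback(index, steps, variables)
            return index
    return len(steps)


log: list[str] = []
handoffs: list[int] = []


def broken_click(selector: str) -> None:
    raise RuntimeError(f"element not found: {selector}")


steps = [
    Step("navigate", "https://example.com/users/$user_id"),
    Step("click", "#save"),
    Step("navigate", "https://example.com/done"),
]
stopped_at = replay(
    steps,
    {"user_id": "42"},
    {"navigate": log.append, "click": broken_click},
    lambda index, _steps, _vars: handoffs.append(index),
)
```

In this run the first step replays from the template, the second fails, and the fallback receives index 1. The intact step is never sent to a model.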
What I work with
ML & Retrieval
Agents & Frameworks
LLM Engineering
Observability & Eval
Cloud & Infra
Backend
Languages
Frontend
Writing
Latest posts
Project Glasswing: How LLM Security Agents Find Real Bugs
Static analysis has cried wolf for thirty years. Project Glasswing uses Claude Mythos in a four-phase agentic loop — ingest, hypothesize, confirm, exploit — producing confirmed vulnerability reports with working PoC code. Here's why the execution-confirmation step changes everything.
Diffusion LLMs: Denoising Text Instead of Predicting Tokens
Autoregressive LLMs commit to each token before writing the next — a structural constraint that forces premature choices. Masked diffusion LLMs like LLaDA and Dream 7B break this by denoising an entire sequence over T steps with bidirectional context. Here is the math, the inference loop, and how second-gen models finally solved the KV-cache problem.
Outcome Rewards Don't Teach Reasoning: RLVR's Faithfulness Gap
RLVR trains LLMs by rewarding correct final answers on math and code. But a correct answer doesn't mean the model's chain of thought actually caused it. Here's the causal gap in outcome-only RL, how to measure it, and why process rewards are harder than they look.