Project Glasswing: How LLM Security Agents Find Real Bugs

Static analysis has cried wolf for thirty years. Project Glasswing uses Claude Mythos in a four-phase agentic loop — ingest, hypothesize, confirm, exploit — producing confirmed vulnerability reports with working PoC code. Here's why the execution-confirmation step changes everything.

Abhinandan · 12 min read

Static analysis tools have been crying wolf for thirty years. Every serious codebase generates thousands of findings per week, and security engineers learn to tune them out. That is what makes it significant when Anthropic's Project Glasswing produces confirmed, exploitable vulnerability reports with working proof-of-concept code rather than a list of suspicious line numbers.

The difference is not better pattern matching. It is a fundamentally different loop.

Why This Matters

Glasswing uses Claude Mythos Preview — Anthropic's newest frontier model, described as "strikingly capable" at computer security tasks — running inside a Claude Code agentic scaffold. The scaffold reads source code, runs the target software to confirm a suspected vulnerability, then writes a bug report with a working exploit. The whole process runs autonomously.

That last part is load-bearing. An LLM that flags "possible SQL injection on line 423" is a fancier grep. An LLM that connects to a test environment, fires the payload, reads the response, and confirms the injection is doing something categorically different. It is turning a hypothesis into evidence before anything reaches a human queue.

First Principles: Why Classical Tools Fall Short

To understand why this is interesting, start with why the approaches that came before have not solved this problem.

Static analysis works on code structure: control-flow graphs, taint propagation, data-flow analysis. A taint analyzer says: user input at request.body.email flows through three functions and reaches db.query() without sanitization. This is correct within its domain. The problem is that conservative approximations over large codebases produce a combinatorial explosion of candidate paths. Most paths are infeasible at runtime. The tool has no way to distinguish the dangerous path from the hundred structurally identical paths that are never actually reachable.
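
To see the failure mode concretely, here is a minimal sketch, with Flask and sqlite3 chosen purely for illustration. The string-built query is a textbook taint finding, and a SAST tool flags it even though the guarding flag is never set in any real deployment:

from flask import Flask, request
import sqlite3

app = Flask(__name__)

@app.route("/lookup")
def lookup():
    email = request.args.get("email", "")  # tainted source
    if app.config.get("LEGACY_MODE"):      # hypothetical flag, never set in prod
        # Structurally a perfect SQL injection: source reaches sink unsanitized.
        # A taint analyzer flags this path; it cannot tell it is unreachable.
        query = f"SELECT * FROM users WHERE email = '{email}'"
        return str(sqlite3.connect("app.db").execute(query).fetchall())
    # The path every real request takes: parameterized and safe.
    rows = sqlite3.connect("app.db").execute(
        "SELECT * FROM users WHERE email = ?", (email,)
    ).fetchall()
    return str(rows)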

Fuzzing goes to the other extreme. Fuzz a binary long enough and you will find real memory-corruption bugs. But fuzzing is blind — it has no model of what the code is supposed to do. It cannot find logic bugs: authentication bypasses, privilege escalation through business-logic flaws, insecure direct object references. Those require understanding intent, and intent is not in the binary.
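
To make the gap concrete, here is a hypothetical sketch of an insecure direct object reference, with an in-memory dict standing in for the database. A fuzzer can hammer this endpoint indefinitely without flagging anything, because nothing ever crashes:

from flask import Flask, jsonify

app = Flask(__name__)

# Toy data; in a real service this would be a database.
INVOICES = {1: {"owner": "alice", "total": 120}, 2: {"owner": "bob", "total": 75}}

@app.route("/api/invoices/<int:invoice_id>")
def get_invoice(invoice_id: int):
    # No crash, no memory corruption, no taint reaching a dangerous sink.
    # The bug is purely semantic: nothing checks that the invoice belongs
    # to the requesting user, so any authenticated user can read every
    # invoice by iterating IDs.
    return jsonify(INVOICES.get(invoice_id, {}))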

LLMs occupy a different niche entirely. They perform neither syntax-level pattern matching nor stochastic input mutation. They carry an internal model of what code means — built from reading billions of lines of real code, CVE descriptions, security advisories, and exploit writeups. When a frontier model reads a JWT validation function, it is not running taint analysis. It is asking, implicitly: what does an attacker do with this? That shift from syntax to intent is where LLMs beat classical tools.
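
As a concrete example, consider the kind of function that paragraph describes. A hedged sketch using PyJWT, where the misuse of the options parameter is a real and common mistake:

import jwt  # PyJWT

def get_current_user(token: str) -> str:
    # Decodes the token but never verifies the signature. Nothing here
    # pattern-matches as dangerous; the flaw exists only relative to what
    # the function is supposed to guarantee. The attacker's question is:
    # what stops me from minting a token with {"sub": "admin"}? Nothing.
    payload = jwt.decode(token, options={"verify_signature": False})
    return payload["sub"]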

The question is whether that semantic understanding is precise enough to reliably produce real, confirmed vulnerabilities. Glasswing's confirmed reports, each backed by a working proof of concept, suggest the answer is yes.

The Agentic Scaffold: Read, Hypothesize, Confirm, Exploit

The innovation in Project Glasswing is not the model alone. It is the loop the model runs inside. A single-shot "find vulnerabilities in this code" prompt to even a very capable model produces mostly noise. The signal comes from a four-phase scaffold:

┌──────────────────────────────────────────────────────────────────┐
│                      Glasswing Agent Loop                        │
│                                                                  │
│  1. INGEST      Read source files. Map entry points, data        │
│                 flows, and trust boundaries. Build a codebase    │
│                 mental model before generating any hypotheses.   │
│                                                                  │
│  2. HYPOTHESIZE Generate ranked vulnerability candidates with    │
│                 specific attacker-controlled inputs and          │
│                 expected impact class (RCE, SQLi, authz, etc.)   │
│                                                                  │
│  3. CONFIRM     Spin up a sandbox. Execute the attack scenario.  │
│                 Observe actual behavior. Discard false positives.│
│                                                                  │
│  4. EXPLOIT     Write a reliable proof-of-concept. File a        │
│                 structured bug report with reproduction steps.   │
└──────────────────────────────────────────────────────────────────┘

The confirm step is what separates this from sophisticated grep. Before anything enters a bug report, the model has actually run the code, fired attacker-controlled inputs at a live instance, and verified that the vulnerable behavior occurs. This eliminates the false-positive flood that makes classical SAST reports unactionable. It also means the reports are immediately useful — they include a working reproduction case, which is historically where most of a security engineer's triage time goes.

Here is what the loop looks like in pseudocode:

def glasswing_scan(repo_path: str) -> list[BugReport]:
    """Four-phase Glasswing loop. ClaudeAgent, Sandbox, and BugReport are
    illustrative scaffold types, not a published API."""
    agent = ClaudeAgent(model="mythos-preview")

    # Phase 1: Ingest. Build a mental model of entry points, trust
    # boundaries, and data sinks before generating any hypotheses.
    codebase_context = agent.read_and_summarize(
        repo_path,
        focus=["entry_points", "auth_boundaries", "data_sinks", "trust_levels"],
    )

    # Phase 2: Hypothesize. Ranked vulnerability candidates, framed from
    # the attacker's perspective (see the prompt discussion below).
    candidates = agent.generate_candidates(
        codebase_context,
        system_prompt=ATTACKER_PERSPECTIVE_PROMPT,
        top_k=20,
    )

    confirmed = []
    for candidate in candidates:
        # Phase 3: Confirm. One fresh sandbox per candidate so attempts
        # cannot contaminate each other; torn down even if execution raises.
        with Sandbox(repo_path) as sandbox:
            exploit_attempt = agent.generate_attempt(candidate)
            result = sandbox.execute(exploit_attempt)

            if result.confirmed:
                # Phase 4: Exploit. Refine into a reliable PoC and file a
                # structured report with reproduction steps.
                poc = agent.refine_poc(candidate, result)
                report = agent.write_bug_report(candidate, poc, result)
                confirmed.append(report)

    return confirmed

One design choice is critical: the model plays the attacker throughout. ATTACKER_PERSPECTIVE_PROMPT frames the reasoning task as "you control this input — what do you do?" This is not an unsafe prompt. It is the correct framing for vulnerability discovery. A defender asks "what could go wrong?" An attacker asks "how do I make this fail in a useful way?" The second question is harder, more creative, and produces better findings. The model needs to simulate adversarial intent, not just audit for compliance.
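
For a sense of what that framing might look like, here is one plausible shape for the prompt. This is an illustration, not Anthropic's published prompt:

# Illustrative only; the actual ATTACKER_PERSPECTIVE_PROMPT is not public.
ATTACKER_PERSPECTIVE_PROMPT = """\
You are auditing this codebase as an attacker. For every input you
identify as attacker-controlled, assume you control it completely.
Your goal is not to list things that look wrong. It is to find concrete
ways to make the system fail usefully: read data you should not see,
execute code you should not run, or act as a user you are not. Rank
your hypotheses by expected impact and by how concrete the attack path is.
"""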

What Vulnerability Classes Does This Find Well?

Not all vulnerability classes are equally tractable for LLM-based agents. The tractability map looks roughly like this:

High tractability: Injection vulnerabilities (SQL, command, LDAP, XPath) require tracing user-controlled inputs to dangerous sinks — multi-hop data-flow reasoning that LLMs handle well when they have semantic context about what db.query() or subprocess.run() means. Authentication and authorization flaws are similarly strong: a mistake like "this endpoint checks user.role == 'admin' but the /api/v2/ route prefix forgets to import the auth middleware" requires understanding intent versus deviation, which is precisely what a model trained on large volumes of correctly-written auth code can recognize. Cryptographic misuse — MD5 for passwords, fixed IVs in AES-CBC, non-constant-time comparison of secret values — is largely a pattern-matching problem against well-documented antipatterns that saturate security literature.
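
The middleware miss described above is easiest to see in code. A minimal Flask sketch, with the blueprint layout and the role check invented for illustration:

from flask import Flask, Blueprint, abort, jsonify, request

v1 = Blueprint("v1", __name__, url_prefix="/api/v1")
v2 = Blueprint("v2", __name__, url_prefix="/api/v2")

@v1.before_request
def require_admin():
    # Every /api/v1/ route passes through this check.
    if request.headers.get("X-Role") != "admin":
        abort(403)

# The v2 blueprint was copied from v1, but nobody re-attached the
# before_request hook. Every /api/v2/ endpoint is silently unauthenticated.
@v2.route("/users")
def list_users():
    return jsonify(["alice", "bob"])

app = Flask(__name__)
app.register_blueprint(v1)
app.register_blueprint(v2)

A taint analyzer has nothing to flag here; the bug is a deviation from intent, visible only to something that knows what correct auth wiring looks like.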

Medium tractability: Memory safety bugs (buffer overflows, use-after-free, integer overflows in C/C++) require precise reasoning about memory layout and lifetime at a level of concreteness where LLMs sometimes drift. The model might correctly identify the dangerous code path while missing the exact heap conditions under which exploitation is reliable. Race conditions and TOCTOU bugs are similarly difficult — concurrency requires holding multiple simultaneous execution paths in working memory and reasoning about their interactions, which frontier models have improved at but still find punishing.
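
Race conditions stay hard even in memory-safe languages. A classic time-of-check-to-time-of-use sketch in Python, with the reporting function invented for illustration:

import os

def write_report(path: str, data: bytes) -> None:
    # Time of check: the file looks writable right now.
    if os.access(path, os.W_OK):
        # Time of use: between the check and the open, another process can
        # swap `path` for a symlink to a file the attacker wants overwritten.
        with open(path, "wb") as f:
            f.write(data)

Confirming this requires reasoning about interleaved timelines rather than a single execution path, which is exactly what makes the class punishing.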

Low tractability (for now): Logic bugs in novel business domains require deep understanding of domain-specific invariants that may not exist in the training distribution. If a vulnerability requires understanding a proprietary financial settlement protocol or a custom state machine, the model may not recognize what "wrong" looks like in that domain. Vulnerabilities that span very large codebases also remain hard — when attacker input enters on line 50 of server.py and the dangerous sink sits on line 800 of utils/db_helper.py, connected through four intermediate abstractions and two service boundaries, even a long-context model can lose the causal thread.

Why Mythos Is "Strikingly" Capable

The phrase "strikingly capable at computer security tasks" signals discontinuous improvement over earlier models, not just incremental gains. Several mechanisms plausibly explain it.

Attacker-perspective reasoning as a learned style. Security research requires adversarial thinking — questioning every assumption the happy path relies on. A model trained with extensive exposure to CTF writeups, penetration testing guides, bug bounty reports, and vulnerability disclosure databases learns this as a reasoning style, not just a set of facts. Reasoning styles generalize across novel code; memorized vulnerability patterns do not. This is likely the dominant factor.

Long-context fidelity. Real codebases are large. Maintaining coherent understanding of data flows across tens of thousands of lines — without losing track of which variables are attacker-controlled — is fundamentally a long-context reasoning problem. Any improvement in how a frontier model handles long context translates directly into better security analysis, because the relevant signals are spread across files rather than local to a single function.

Implicit program execution. The confirm phase requires the model to predict: if I run this code with this input, what happens? This is approximate program simulation in latent space. A model trained on enough code and enough execution traces — including in security papers, bugfix commits, and test suite outputs — builds a more reliable internal simulator. Better simulation means more accurate hypotheses before execution, which means fewer wasted confirm-phase cycles.

The Strategic Logic: Offensive Capability Directed at Defense

There is a deeper argument embedded in Project Glasswing that deserves direct statement. The model being used here was tested on Anthropic's red team platform, where the job is finding holes in systems, generating attack scenarios, and probing defenses. Glasswing takes that same offensive capability and directs it at the world's most critical open-source software before adversaries find the same vulnerabilities.

This is a structural choice. Historically, attackers have had the asymmetric advantage: they only need to find one exploitable vulnerability while defenders need to close every one. Continuous automated scanning with a model capable enough to confirm and exploit real bugs changes this calculus. An LLM agent running autonomously on cloud infrastructure can audit a project, confirm findings, and coordinate disclosures at a rate no human red team matches — and at a cost that makes continuous monitoring of critical infrastructure economically feasible for the first time.

The bet Anthropic is making with Glasswing: transparent deployment for defense moves faster than adversaries building equivalent offensive tools, and responsible disclosure captures the security value before exploitation does.

Tradeoffs and Failure Modes

False negatives on novel patterns. The agent finds what it can reason about. A zero-day that exploits an interaction between two obscure library behaviors not in the training distribution will not appear in the hypothesis set. The agent does not know what it does not know, and the unknown unknowns in security tend to be the dangerous ones.

Sandbox escape risk. The confirm phase runs potentially exploitable code. If the sandbox is imperfect — and all sandboxes have edge cases — a sufficiently subtle vulnerability in the scanning infrastructure itself could be triggered by the code under test. This is a deployment engineering problem, not a model problem, but it is real. Building a Glasswing-style scanner requires investing seriously in the sandbox layer, not just the agent layer.
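
What investing in the sandbox layer means in practice varies by deployment, but here is a hedged baseline sketch using standard Docker isolation flags. A production scanner would add gVisor or Firecracker, seccomp profiles, and egress filtering on top:

import subprocess

def run_in_sandbox(image: str, cmd: list[str], timeout: int = 120):
    # Baseline container hardening for executing candidate exploits.
    docker_cmd = [
        "docker", "run", "--rm",
        "--network=none",    # no egress: a confirmed exploit cannot phone home
        "--read-only",       # immutable root filesystem
        "--cap-drop=ALL",    # drop all Linux capabilities
        "--pids-limit=256",  # bound fork bombs
        "--memory=512m",     # bound memory use
        image, *cmd,
    ]
    return subprocess.run(docker_cmd, capture_output=True, timeout=timeout)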

Report quality variance. Exploit generation can fail in ways that look like success. A PoC that "confirms" a vulnerability might be triggering a separate, unrelated behavior that happens to produce the same observable output. All Glasswing-style reports need human validation before action. They are high-signal leads, not ground truth. Treating them as ground truth would create a different class of false positive — one that is harder to detect because it comes with a plausible-sounding PoC.

Dual-use. The same scaffold directed offensively finds real vulnerabilities faster than most human red teams. This capability exists regardless of Anthropic's intentions. The relevant counterfactual is not "this capability does not exist" but "who gets to it first and under what norms." Glasswing is an explicit answer to that question: get there first, disclose responsibly, publish the methodology.

Practitioner's Lens

For software teams: the bar for acceptable security tooling just moved. Tools that flag without confirming are going to feel like noise generators next to agents that deliver working PoC code with their reports. Evaluate whether your current security pipeline has any confirmation step — even an automated, cheap one — before vulnerability findings reach human triage. The absence of that step is a sign the pipeline is optimized for coverage theater rather than signal.

For teams building LLM agents, the Glasswing architecture illustrates a principle that generalizes far beyond security: the scaffold matters as much as the model. A moderately capable model in a four-phase confirmation loop likely outperforms a stronger model with a single-shot prompt because the execution-observation feedback filters hypothesis errors before they propagate. The architecture is the multiplier.

The deepest takeaway for anyone building agent infrastructure: sandboxed execution environments are now table stakes for any agent that reasons about code. The information gain from closing the code-execute-observe loop is not marginal — it is the difference between a suggestion and a fact. Build the sandbox first, then plug in the model.
