Open-Weight Models Are What Make AI Worms Ungovernable

On June 2, 2026, Jonas Guan and five colleagues at the University of Toronto, Cambridge, and ServiceNow submitted a paper to arXiv demonstrating that a self-replicating malware system powered by an open-weight language model could infect 73.8% of a 33-machine heterogeneous network without any human operator and without calling a commercial AI API. The paper is titled “AI Agents Enable Adaptive Computer Worms.” Its most important contribution is simple: it shows that a threat that has been theoretical for two years is no longer theoretical. The threat model for AI-enabled cyberattacks needs updating, and the reason is not model capability. It is model distribution.

What the Worm Actually Does

The system Guan et al. built operates in eight phases. It starts with network discovery — mapping reachable hosts — and progresses through service enumeration, initial foothold exploitation, privilege escalation, and then the self-replication sequence: assessing the target’s available hardware, downloading the open-weight model to the new host, and spinning up another worm instance. The worm runs an unmodified, publicly available model published in 2025 that fits on a single NVIDIA A100 GPU with 80GB of VRAM. No fine-tuning. No specialized training on attack data. The authors withheld the model’s name for responsible disclosure reasons but note that the relevant GPU hardware has shipped in the millions of units over the past four years.

The architecture uses a reasoning graph with five specialized node types: a Plan node that synthesizes high-level attack strategy, a Judge node that reviews plans against the worm’s command history to avoid repeating failed approaches, an Action node that selects specific tools and parameters, a Summary node that compiles observations into structured findings about each target, and a Progress node that evaluates whether an attack phase is moving forward or needs to pivot. This is not a script. It is a reasoning loop that adapts to each machine it encounters.

The worm maintains three-tier memory — general attack knowledge, host-specific findings, and an incrementally built vulnerability database — and shares credentials and reconnaissance data across the swarm of instances running on compromised machines. When a host is exploited and a new worm instance is instantiated, it inherits the network’s collective knowledge. Subsequent infections benefit from what earlier instances already learned about the environment.

The testbed spanned Ubuntu 16.04 through 24.04, Debian, Alpine, Rocky Linux 9, CentOS Stream 9, Windows Server 2008 R2, 2019, and 2022 — the kind of legacy-plus-modern mix you find in real enterprise environments. The vulnerability set included EternalBlue, SambaCry, PrintNightmare, and Dirty Pipe alongside web-layer and credential vulnerabilities. Three vulnerabilities were from 2026, after the model’s training cutoff, testing whether the worm could reason through novel attack surfaces it had no prior knowledge of. It could: DirtyFrag was exploited at 88.2% success rate, CopyFail at 56.5%, Marimo at 48.1%.

Across 15 independent seven-day trials, the worm exploited an average of 23.1 machines (±3.9) and successfully replicated to 20.4 (±3.2), reaching up to seven generations of self-replication with a mean of 5.1. Fifty percent network compromise arrived in approximately five days. The variance matters — this is a probabilistic system, not a deterministic one. But the central result is clear: a single initial foothold propagates to roughly three-fifths of a heterogeneous corporate network without any human involvement, at zero marginal cost per additional infection, because it runs on hardware it is actively stealing.
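The shares implied by those two averages can be checked directly. The short calculation below uses only the figures quoted above; the variable and function names are illustrative. It suggests that the "three-fifths" figure corresponds to mean replication, while mean exploitation sits closer to 70%.

```python
# Figures quoted in the article; names are illustrative, not from the paper.
NETWORK_SIZE = 33
MEAN_EXPLOITED = 23.1   # machines exploited per trial, on average
MEAN_REPLICATED = 20.4  # machines running a new worm instance, on average


def share(count: float, total: int = NETWORK_SIZE) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


exploited_pct = share(MEAN_EXPLOITED)
replicated_pct = share(MEAN_REPLICATED)
print(f"exploited: {exploited_pct}%  replicated: {replicated_pct}%")
```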

The Structural Problem With Open-Weight Models

Traditional worms exploit static code paths: a fixed vulnerability, a known payload, a signature that defenders can write detection rules against. What the Guan paper describes is architecturally different. The worm generates novel attack strategies for each target, responding to what it finds rather than running from a predetermined script. When standard exploit attempts fail — 66% of generation failures stem from malformed payloads, 25% from localization failures — the worm rotates strategies and tries again. The failure modes are probabilistic. So is the success.

The design constraint that makes this possible is also the one that makes it structurally hard to govern at the infrastructure layer: the model runs on the compromised host’s hardware. There is no centralized inference endpoint to block. There is no API key to revoke. There is no terms-of-service violation to flag and disable. Anthropic and OpenAI can refuse to help an attacker who calls their APIs; an open-weight model downloaded to a machine the attacker already controls cannot be refused anything. The model has no idea it is being used as attack infrastructure. It does not care, and it has no mechanism to care.

This is the structural difference that matters. The AI safety governance framework — usage policies, system prompts, RLHF alignment, API monitoring — is designed around centralized inference. The worm operates entirely outside that perimeter. The authors make this explicit: the worm “operates independently without commercial AI platforms,” which means “centralized safety controls become ineffective.” That sentence is not a critique of alignment research. It is a description of its scope.

On May 14, 2026, a LessWrong post by the author “lc” argued that the cybersecurity community was misidentifying the AI threat. Vulnerability discovery, the post argued, was not the problem, because the industry had been patching zero-days for decades and could absorb AI-accelerated discovery. The actual risk was post-exploitation network traversal: an AI system that could systematically work through the graph of connected targets, hopping between machines, with no human operator managing each individual infection. Three weeks later, the Guan paper demonstrated exactly this. The LessWrong author was right about what mattered. The paper is the existence proof.

Anthropic’s own threat intelligence, published June 3 from a year-long study of 832 banned adversarial accounts between March 2025 and March 2026, shows how far real-world deployment lags the lab. Of those accounts, 67.3% used AI for malware writing and 56% were rated medium-risk or higher by the second half of the study period — a 1.7x increase year-over-year. But only 6.5% used AI to assist with lateral movement. Anthropic classifies this as “an emerging frontier.” The paper out of Toronto says the frontier has been mapped and the terrain is navigable. What remains is deployment infrastructure, operational security, and the willingness to absorb the cost of the initial foothold.

What the Test Environment Didn’t Have

The Guan testbed made a deliberate choice: no endpoint detection, no antivirus, no active firewall, no intrusion detection system. The authors acknowledge this directly. The worm produces consistent behavioral signatures — beacons on non-standard ports, systematic SSH key injection, credential reuse across infected hosts — and the paper describes these as “concrete targets” for detection. Organizations with mature endpoint detection and response tooling, network detection and response, and zero-trust segmentation would have more opportunities to interrupt the lateral movement the worm depends on.
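On the defensive side, the three signatures named above can be turned into simple correlation rules. What follows is a minimal sketch, not the paper's detection method: the `Event` schema, the port allow-list, and every threshold are hypothetical and would need tuning against real telemetry.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical allow-list and thresholds; tune these to the environment.
EXPECTED_PORTS = {22, 53, 80, 123, 443}
BEACON_MIN_COUNT = 3   # repeated outbound connections to one unusual port
KEY_MIN_HOSTS = 2      # same SSH key written on this many hosts
REUSE_MIN_HOSTS = 3    # same account logging in to this many hosts


@dataclass(frozen=True)
class Event:
    host: str    # host that logged the event
    kind: str    # "outbound", "authorized_keys_write", or "login"
    detail: str  # destination port, key fingerprint, or account name


def find_signals(events):
    """Return a dict mapping each signal name to a sorted list of flagged items."""
    beacons = defaultdict(int)     # (host, port) -> connection count
    key_hosts = defaultdict(set)   # key fingerprint -> hosts where it was written
    cred_hosts = defaultdict(set)  # account name -> hosts it logged in to
    for e in events:
        if e.kind == "outbound":
            port = int(e.detail)
            if port not in EXPECTED_PORTS:
                beacons[(e.host, port)] += 1
        elif e.kind == "authorized_keys_write":
            key_hosts[e.detail].add(e.host)
        elif e.kind == "login":
            cred_hosts[e.detail].add(e.host)
    return {
        "beaconing_hosts": sorted(
            {host for (host, _), n in beacons.items() if n >= BEACON_MIN_COUNT}
        ),
        "injected_keys": sorted(
            key for key, hosts in key_hosts.items() if len(hosts) >= KEY_MIN_HOSTS
        ),
        "reused_accounts": sorted(
            acct for acct, hosts in cred_hosts.items() if len(hosts) >= REUSE_MIN_HOSTS
        ),
    }
```

Each rule is weak on its own, since legitimate automation also reuses accounts and distributes keys. The point of the sketch is that the three behaviors are observable and can be correlated across hosts.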

This is the right caveat. A 73.8% infection rate in a flat, undefended 33-node network is a capability demonstration, not a deployment playbook. Network segmentation alone disrupts the flat-network assumption the worm’s discovery phase relies on. The behavioral signatures are real and observable. None of this should be read as “this is less scary than it sounds.” It should be read as “the hardest part of defending against this is not detection — it is containment after initial compromise.”
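The containment point can be illustrated with a toy reachability model. The sketch below is an idealization under stated assumptions: the host names and the split into three equal segments are hypothetical, and segments are assumed to pass no traffic between them, which real networks rarely achieve. It computes only an upper bound on what a single foothold can reach, not how any attacker behaves.

```python
from collections import deque


def reachable(adjacency, start):
    """Breadth-first search: return the set of hosts reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def flat_network(hosts):
    """Every host can reach every other host."""
    return {h: [o for o in hosts if o != h] for h in hosts}


def segmented_network(segments):
    """Hosts can reach only peers in their own segment (idealized)."""
    adjacency = {}
    for segment in segments:
        for h in segment:
            adjacency[h] = [o for o in segment if o != h]
    return adjacency


# 33 hypothetical hosts, matching the testbed size quoted in the article.
hosts = [f"host{i:02d}" for i in range(33)]
segments = [hosts[0:11], hosts[11:22], hosts[22:33]]  # assumed split

flat_reach = reachable(flat_network(hosts), "host00")
segmented_reach = reachable(segmented_network(segments), "host00")
print(len(flat_reach), len(segmented_reach))
```

In this model the same foothold sees all 33 hosts on the flat network but only the 11 in its own segment once the network is split.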

The numbers that did not appear in the paper are as telling as those that did. The authors ran 15 trials but do not report what fraction of runs achieved propagation at all versus failed entirely at the initial foothold stage. The ±3.9 standard deviation on 23.1 exploited hosts suggests meaningful run-to-run variation. The three 2026 vulnerabilities disclosed after the model’s training cutoff showed success rates between 48% and 88% depending on the specific CVE, a range wide enough that the worm’s effectiveness against novel attack surfaces is highly dependent on how novel they actually are. DirtyFrag is distinct from Dirty Pipe, but a genuinely new zero-day with no structural analogues in the training data is a harder problem than these results measure.
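To put the ±3.9 spread in context, a rough confidence interval for the mean can be derived from the reported figures. This is a back-of-the-envelope sketch, not a result from the paper. It assumes the ±3.9 is a sample standard deviation over the 15 trials and that trial outcomes are roughly normal. The critical value is the standard two-sided 95% Student-t value for 14 degrees of freedom.

```python
import math

# Figures quoted in the article.
N_TRIALS = 15
MEAN_EXPLOITED = 23.1
STD_EXPLOITED = 3.9  # assumed here to be a sample standard deviation

# Two-sided 95% Student-t critical value for 14 degrees of freedom.
T_CRIT_DF14 = 2.145

std_error = STD_EXPLOITED / math.sqrt(N_TRIALS)
half_width = T_CRIT_DF14 * std_error
ci_low = MEAN_EXPLOITED - half_width
ci_high = MEAN_EXPLOITED + half_width
print(f"standard error: {std_error:.2f}")
print(f"approx. 95% CI for the mean: {ci_low:.1f} to {ci_high:.1f} hosts")
```

Under those assumptions the mean is pinned down to within roughly two hosts, even though individual runs vary by about four. The interval says nothing about how many runs stalled at the initial foothold, which is the figure the paper leaves out.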

None of those caveats change the paper’s core claim. The authors are precise about what they demonstrated: “The combination of generated reasoning, self-replication, and self-sustaining compute has not been empirically demonstrated until now.” They are not claiming the worm is ready to ship against defended targets. They are claiming the feasibility question has been resolved. That’s a narrower claim, and it is correct.

The AI safety discourse has organized itself around alignment, capability thresholds, and the properties of frontier models from large labs. That framing is right for some risks. It does not cover this one. The AI worm does not run on a frontier API model you rent by the token. It runs on a publicly downloadable model that fits on a single data-center GPU, has no cloud dependency, and cannot be switched off by any company. The governance question is not “how do we align the model?” It is “what threats does alignment-at-inference not address, and are we building defenses for those separately?” This paper is a specific, empirical answer to the first half of that question: one of the threats is a worm that reasons, propagates, and sustains itself on hardware you already lost.

AI-generated editorial illustration · TemperatureZero · June 6, 2026
