The Open CTF Format Is Dead. The Cyber Frontier Is Not.

By Maxim Starkweather

On May 1, 2026, Kabir Acharya, an Australian CTF competitor whose top-ten placements ended in 2025, published a eulogy for the open Capture-the-Flag format. He won HCKSYD, a 48-hour solo CTF, in roughly two hours. His scoreboard positions, by his own account, had become “unrecognisable.” Plaid CTF, he noted, is no longer running. His verdict: “The format is dead. Something else may replace it, but pretending nothing fundamental has changed only makes the loss harder to talk about honestly.”

The post hit Hacker News on Friday and spent the weekend climbing, reaching 339 points and 333 comments by Saturday night. It has been read as the end of competitive security education, the moment the AI doomers were finally proven right about cyber. That reading is wrong. What Acharya describes is real and important, but it is not the death of defense. It is the death of one specific way of measuring it, and it leaves behind a harder, more meaningful question.

What actually broke

Acharya’s specific claim, stripped of doom framing, is narrow: open scoreboards in 2026 measure orchestration and token budget alongside, and “sometimes above,” security skill. He names Claude Opus 4.5, GPT-5.5, GPT-5.5 Pro, Claude Mythos, Claude Code, and Alias Robotics’ alias1 as the toolkit. The conclusion he reaches is that open CTFs are now pay-to-win. The more tokens you can throw at a competition, the faster you can burn down the board.

The notable thing about the piece, on a careful read, is what is not in it. No event scoreboards. No organizer statements. No documented examples of which AI cracked which challenge. The argument rests entirely on Acharya’s personal authority and observation. To his credit, he does not try to launder anecdote into evidence. He just states the conclusion and recommends learners use picoGym, HackTheBox lab environments, SecTalks and student conferences instead.

More interesting: Acharya explicitly cites DEF CON CTF as still having unsolved challenges. Plaid is gone. DownUnderCTF is broken. HCKSYD is broken. DEF CON is not. That is not “frontier AI has broken CTFs.” That is “frontier AI has broken the open, time-boxed, weekend-format CTFs whose challenges were calibrated to a 2021 difficulty curve.” The hard top of the field, the events where puzzle design assumes adversaries with serious infrastructure and patience, is still resisting by his own account.

That distinction is the entire story. Almost none of the takes circulating this week kept it.

Where the baseline actually sits

The right place to ground claims about LLM cyber capability is not a personal blog post. It is the empirical work that has been measuring this for two years. Cybench, published in August 2024 by Andy K. Zhang, Neil Perry, Riya Dulepet and 24 co-authors, built a benchmark of 40 professional-level CTF tasks drawn from four distinct competitions and ran eight contemporary models against them. The headline result: agents leveraging Claude 3.5 Sonnet, GPT-4o, OpenAI o1-preview, and Claude 3 Opus successfully solved tasks that took human teams up to 11 minutes to clear. They stopped well short of the 24-hour-plus bracket.

Read that result with a straight face. In 2024, frontier models cleared the puzzles that pro human teams cleared in about eleven minutes or less. The puzzles that took humans a full day still resisted. The Cybench authors did not conclude that defense was dead. They concluded that the easy bottom of the difficulty curve had been collapsed, and that the right thing to track was how the harder tasks held up across generations.

What happened next was entirely predictable. Between the 2024 evaluation models and the 2026 toolkit Acharya describes, the models got an order of magnitude better at agentic coding. The current top public Anthropic model, Claude Opus 4.7, was released April 16, with a one-million-token context window, 128k max output, and a documented step-change in agentic performance over Opus 4.6. Acharya’s reference point, Opus 4.5, released November 2025, was already two generations behind by the time his post went up. The slope was always going to roll over the 11-minute bracket. It was always going to keep rolling. The question Cybench actually posed was when the 24-hour bracket would fall, and that question is still open.

The open CTF format was built on the implicit assumption that obscurity was load-bearing — that getting from the prompt to the flag required noticing something a search engine would not surface. That assumption was alive in 2014. It was alive in 2021. It is conspicuously dead in 2026. But the assumption was always a property of the format, not of cybersecurity as a discipline.

The exploit-count panic is the wrong panic

The parallel story to Acharya’s, running through the same week, is the cybersecurity discourse around Claude Mythos, Anthropic’s invitation-only defensive-cybersecurity research preview offered through Project Glasswing. Mythos, by signalblur’s account, has surfaced thousands of vulnerabilities; Mozilla validated at least some of them. The doom take writes itself: thousands of zero-days at machine speed, defenders cannot possibly keep up, the asymmetry has flipped.

The most useful response to that take, published a week before Acharya’s post, came from a detection engineer publishing under the handle signalblur at Magonia Research. The author has spent a decade writing detection rules for cybersecurity vendors, runs SOCs against state-sponsored actors, and won the Cogswell Award from the Defense Counterintelligence and Security Agency. His central claim is that the alarm is misreading the work.

“New exploit disclosures have always outrun defenders’ ability to write rules,” he writes. The 1:1 framing — every exploit needs its own detection — was never how defense worked, even before LLMs. “Detection isn’t 1:1 with exploits, even zero-days.” A behavioral rule scoped to the underlying technique generalizes across the family. The real operational pain in a SOC is not the rate of new exploits; it is the rate of false positives, which signalblur identifies as “what kills you.” A well-scoped behavioral rule, he notes, rarely drifts.
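The difference between 1:1 matching and a technique-scoped rule can be made concrete with a minimal sketch. Everything here is an assumption for illustration: the `ProcessEvent` shape, the process names, and the payload hashes are hypothetical, and this is not signalblur's rule set or any vendor's rule syntax. The signature check knows two specific payloads; the behavioral check only asks whether a web server spawned a shell.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessEvent:
    parent: str        # name of the parent process
    child: str         # name of the process it spawned
    payload_hash: str  # hash of the request that triggered the spawn


# 1:1 approach: one entry per known exploit payload (hypothetical values).
KNOWN_EXPLOIT_HASHES = {"hash-exploit-a", "hash-exploit-b"}

# Technique-scoped approach: illustrative process names, not a complete list.
WEB_SERVERS = {"nginx", "httpd", "tomcat"}
SHELLS = {"sh", "bash", "cmd.exe", "powershell.exe"}


def signature_match(event: ProcessEvent) -> bool:
    """Fires only when the exact payload has been seen before."""
    return event.payload_hash in KNOWN_EXPLOIT_HASHES


def behavioral_match(event: ProcessEvent) -> bool:
    """Fires on the technique: a web server spawning a shell, whatever the payload."""
    return event.parent in WEB_SERVERS and event.child in SHELLS


events = [
    ProcessEvent("nginx", "bash", "hash-exploit-a"),   # known exploit
    ProcessEvent("tomcat", "sh", "hash-exploit-zzz"),  # unseen variant, same technique
    ProcessEvent("sshd", "bash", "hash-benign"),       # ordinary admin login
]

signature_hits = [signature_match(e) for e in events]
behavioral_hits = [behavioral_match(e) for e in events]
```

In this toy data the signature list misses the unseen variant while the behavioral rule catches it and still stays quiet on the admin login. That last property is the one that matters for false positives: the rule is scoped narrowly enough that normal activity does not trip it.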

The implication is uncomfortable for both sides of the doom debate. Yes, frontier models can now produce exploit research at a rate human teams cannot match. No, that does not collapse defense, because defense was never doing 1:1 matching and the techniques behind the exploits cluster much more tightly than the exploit count suggests. The thing that should worry defenders is not the count. It is whether any of those exploits represent genuinely new technique. From the public Mythos output so far, the answer appears to be mostly no.

What the format change actually means

Pulled together, the Acharya piece and the signalblur piece point at the same uncomfortable observation from different sides. The thing LLMs broke in the security pipeline is the same thing they have broken everywhere else: the work whose difficulty rested on obscurity, on pattern-matching, on knowing the trick. The work that survives is the work whose difficulty rests on patience, on creative chaining, on engineering rather than recognition.

Open CTFs leaned heavily on the first kind of work. The challenges were almost always solvable in a fixed time window with the right intuition. That intuition lived in the contestant’s head; LLMs now externalize it. So the format buckles, exactly as Acharya describes. Plaid stops running. HCKSYD becomes a sprint. The mid-tier of competitive security education, which was supposed to feed the top of the field, breaks.

What survives, by Acharya’s own admission, is DEF CON CTF, and by Cybench’s own measurement, the 24-hour-plus bracket of professional tasks. Those are the events and the tasks whose difficulty is engineered, not just camouflaged: patient adversary modeling, long-horizon planning across heterogeneous systems, the kind of attack chain that a single-shot model cannot bluff through without real architectural intuition. None of those has fallen. The slope might roll over them. It has not yet.

For builders, which is who actually reads this site, the implication is concrete. The threat model is not “AI breaks defense.” It is “AI breaks the kinds of defense that depended on obscurity being load-bearing.” Hardcoded secrets in client-side code, missing access controls, default-permissive database scaffolding — those were the easy bottom of the curve. They were always going to fall first, and they are falling first. Behavioral detection, principle-of-least-privilege architecture, sanitization at the trust boundary — those are the parts of the security stack that map to the 24-hour bracket. They survive the format change because they were never about the puzzle.
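As one illustration of the durable end of that list, here is a minimal sketch of a deny-by-default permission check, the opposite of default-permissive scaffolding. The role names, resources, and the `GRANTS` table are hypothetical, chosen only to show the shape of the idea.

```python
# Hypothetical role-to-permission grants. Anything not listed here is denied.
GRANTS = {
    "viewer": {("reports", "read")},
    "editor": {("reports", "read"), ("reports", "write")},
}


def is_allowed(role: str, resource: str, action: str) -> bool:
    """Deny by default: allow only explicitly granted (resource, action) pairs."""
    return (resource, action) in GRANTS.get(role, set())


# Usage: an unknown role or an ungranted action falls through to a denial.
checks = [
    is_allowed("viewer", "reports", "read"),   # explicitly granted
    is_allowed("viewer", "reports", "write"),  # not granted to viewers
    is_allowed("intern", "reports", "read"),   # role not in the table
    is_allowed("editor", "users", "read"),     # resource never granted
]
```

The design choice is that there is nothing for an attacker to discover: a forgotten role or a new resource is denied until someone grants it, so the control does not depend on nobody noticing the gap.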

Acharya’s post is one of the better-written pieces of community grief I have read this year. It is a real loss, accurately observed: the entry ramp into competitive security education has been broken, and we do not yet have an alternative that does the same work. picoGym, HackTheBox, SecTalks and student conferences are individually fine, but they do not substitute for the cohort effect of a live scoreboard at 2 a.m. with a thousand teams chasing the same flag.

What it is not is a death notice for cyber defense. The cyber frontier, where attacker and defender both have to engineer rather than recognize, is doing what it was always going to do: tightening on the easy and pushing the hard further out. Mythos’s exploit count is a real signal, but signalblur’s framing is the right one. What the open CTF format measured was a particular slice of cyber skill, and that slice was the one most exposed to language models. Other slices are not.

If a part of your work survives an audit that asks “what is this measuring that an LLM cannot do?”, that part has nothing to fear. If it does not, it was probably going to break this year anyway. Acharya’s piece is a useful prompt. The frontier is not dead. It got harder to fake.

Maxim Starkweather is the founder and editor of TemperatureZero, an independent AI and technology publication.

AI-generated editorial illustration · TemperatureZero · May 16, 2026