CTFs Are Token Games Now

May 26, 2026

Kabir Acharya recently wrote that the CTF scene is dead. The line is harsh, but the core point is hard to dismiss: open CTF scoreboards no longer measure what they used to.

CTF stands for Capture the Flag. In cybersecurity, it usually means a competition where teams solve technical security challenges to find hidden strings called flags. A challenge might involve breaking a small web app, reverse-engineering a binary, exploiting a memory bug, decrypting a message, or investigating a forensic artifact. Submit the flag, get points, climb the scoreboard.

For years, a high CTF ranking was a useful, imperfect signal. It meant a team could reverse binaries, read strange code, chain bugs, write tools under time pressure, and keep going when the obvious path failed. It did not prove someone would be a great security engineer, but it said something real.

Now the scoreboard says something messier. It may still reflect human skill. It may also reflect how much compute a team can spend, how good their agent setup is, how quickly they can parallelize challenges, which models they have access to, and how comfortable they are letting an agent grind through the first half of the board.

Open CTFs are becoming token games.

The tokens might be Codex tokens, Claude tokens, or Kimi tokens on open-weight stacks. Either way, "who spent the most inference intelligently" is now part of the result. That changes the sport.

It also points at a broader problem. AI does not just make work faster. It changes what old measurements mean. When the cost of producing an artifact collapses, the metric attached to that artifact starts measuring something else.

That is why CTFs are interesting even if you never play them. They are a public, easy-to-read version of a problem companies will face privately. The dashboard stays in place. The number still goes up. But the work behind the number has changed.

Tools were always part of CTFs

It is tempting to say: so what, CTF players have always used tools. That is true. Nobody serious solves everything by hand. Players use decompilers, fuzzers, symbolic execution, exploit templates, packet tools, scripts from old writeups, and a pile of ugly one-off Python. Tooling has always been part of the craft.

But there has always been a line between the tool and the person steering it. A decompiler helps you see the program. A fuzzer helps you explore behavior. A solver helps with constraints. You still have to understand the problem well enough to steer the tool, reject bad paths, and know what the result means.

With frontier agents, that line moves. The model can read the challenge, choose tools, write the solve script, debug its own mistakes, and return the flag. The human may still matter, especially on hard problems, but on a growing slice of easy and medium challenges the human is closer to operator than solver.

Old tooling was a calculator. You typed in the operation, it returned the result, and the thinking stayed with you. A frontier agent is more like a teammate. You hand it the challenge, let it work for thirty minutes, and come back to find a flag already submitted. The work it did while you were away is the work that used to define the player's craft.

If the scoreboard is meant to rank human security skill, unrestricted agent use damages the signal.

The benchmarks are now too relevant

This would be easier to ignore if AI cyber capability were still mostly hype. The public evidence now points in the other direction.

Anthropic's Project Glasswing made Claude Mythos Preview available to selected defenders and critical software organizations. Anthropic says Mythos Preview has already found thousands of zero-day vulnerabilities across critical infrastructure and describes it as its strongest model for coding and agentic tasks.

The UK AI Security Institute's evaluation of Claude Mythos Preview is more useful than the marketing language. AISI reports that Mythos solved expert-level CTF tasks 73% of the time and became the first model to complete "The Last Ones", a 32-step simulated corporate network attack, end to end in 3 out of 10 attempts.

Then OpenAI's GPT-5.5 reached the same neighborhood. In AISI's GPT-5.5 evaluation, GPT-5.5 scored 71.4% on expert cyber tasks, compared with 68.6% for Mythos Preview, and completed the same 32-step range in 2 out of 10 attempts. AISI also highlighted a reverse-engineering task that took a human expert roughly 12 hours; GPT-5.5 solved it in 10 minutes and 22 seconds for $1.73 of API usage.
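To make the scale of that last example concrete, here is a rough back-of-the-envelope sketch in Python. The time and cost figures are the ones quoted above; the budget passed to solves_for_budget is a hypothetical number chosen only for illustration, and the calculation assumes every task costs about the same as that single solve, which will not hold in practice.

```python
# Figures quoted above from AISI's GPT-5.5 evaluation.
HUMAN_SECONDS = 12 * 60 * 60   # "roughly 12 hours" for the human expert
MODEL_SECONDS = 10 * 60 + 22   # 10 minutes and 22 seconds
MODEL_COST_USD = 1.73          # API usage for that single solve


def speedup(human_seconds: float, model_seconds: float) -> float:
    """Wall-clock ratio between the human expert and the model."""
    return human_seconds / model_seconds


def solves_for_budget(budget_usd: float, cost_per_solve_usd: float) -> int:
    """How many solves of similar cost fit in a budget.

    Assumption: every task costs about the same as the quoted one.
    """
    return int(budget_usd // cost_per_solve_usd)


ratio = speedup(HUMAN_SECONDS, MODEL_SECONDS)
print(f"wall-clock speedup: about {ratio:.0f}x")

# Hypothetical budget, not a figure from the evaluation.
print(f"solves per $100: {solves_for_budget(100.0, MODEL_COST_USD)}")
```

The exact numbers matter less than the shape: once a solve is priced in dollars and minutes, the number of attempts a team can afford becomes part of the result.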

These numbers should not be read as "AI can run a real intrusion campaign." AISI is explicit that the range is controlled, weakly defended, and not the same as a hardened real system. But for CTFs, that caveat cuts the other way: CTF challenges are also bounded technical tasks. The benchmark conditions are much closer to a competition board than to a real enterprise network.

MindStudio's comparison of GPT-5.5 and Claude Mythos makes the same practical point: the capability gap between the leading models is narrow enough that availability, cost, and orchestration start to matter as much as raw benchmark score.

The important change is not just that models are getting better. It is that hard-won slices of specialist work are becoming cheap, repeatable, and parallelizable. That is what breaks open CTFs.

The ladder mattered more than the score

CTFs were never only about the final ranking. They were a ladder you climbed.

You started with beginner web challenges. Then maybe crypto. Then simple pwn. Then heap exploitation. Then weird embedded reversing at 03:00 with four teammates arguing in chat. Each rung asked something harder of you. The scoreboard made the climb visible. You could feel yourself getting better because the result changed.

That ladder is what produced security people. The score was just the readout.

A beginner today faces a bad choice. Use agents early and skip the struggle that builds intuition. Or avoid agents and watch teams above them move faster with a toolchain they pretend is just normal tooling. Either way, the climb stops being honest.

The real loss is not that someone can cheat a medium challenge. The real loss is that the public ladder stops telling newcomers where they stand and what to learn next. The scene still exists, but it can no longer show a beginner that the climb is working.

Challenge authors are boxed in

Organizers do not have an easy fix. If they write normal challenges, agents solve too much. If they write anti-agent challenges, they often become worse for humans too: more guessy, more artificial, more dependent on tricks that will age badly. If they ban AI, enforcement is nearly impossible in open online events.

Private finals can still work. On-site competitions can still work. Educational labs can still work. Challenge design as a craft is not dead.

But the big open online CTF, where anyone can join and the scoreboard is treated as a global signal of skill, has a serious measurement problem. The problem gets worse as inference gets cheaper.

What should replace the old signal?

For recruiting, CTF rank needs more context now. A team result from 2026 is not the same kind of evidence as a team result from 2019.

The useful signal moves closer to the work itself. What did the person actually do? Can they explain the bug, the failed path, the weird debugging moment, or why a mitigation mattered? Can they reason through a small live problem without hiding behind a transcript? Is the code they wrote understandable? Do they communicate uncertainty well?

For learning, beginners should still do CTF-style problems, but the scoreboard should matter less. Labs like Hack The Box, picoGym, and internal training ranges may be healthier places to build instincts because the goal is explicit: learn the technique, not beat a team running 200 parallel agents.

For competitions, organizers may need clearer categories: human-only, AI-assisted, agentic, on-site, writeup-required. None of these are perfect. A human-only bracket only holds as long as every team is honest about it. One team quietly running an agent is enough to corrupt the result for everyone else, and in an open online event that is almost impossible to detect. But pretending there is only one scoreboard and that it still means what it meant five years ago is worse.
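As a minimal sketch of what separate categories could look like on the scoreboard side, the Python below ranks teams within a declared bracket instead of on one global board. The bracket names come from the list above; the Result type, the team names, the point values, and the simplification that each team sits in exactly one bracket are all assumptions for illustration. It does nothing to solve the honesty problem, since the bracket is self-declared.

```python
from collections import defaultdict
from dataclasses import dataclass

# Bracket names from the text; treating them as mutually exclusive
# is a simplifying assumption.
BRACKETS = ("human-only", "AI-assisted", "agentic", "on-site", "writeup-required")


@dataclass(frozen=True)
class Result:
    team: str      # hypothetical team name
    bracket: str   # self-declared by the team
    points: int


def rank_by_bracket(results):
    """Return {bracket: [team names, highest score first]}."""
    boards = defaultdict(list)
    for result in results:
        if result.bracket not in BRACKETS:
            raise ValueError(f"unknown bracket: {result.bracket}")
        boards[result.bracket].append(result)
    return {
        bracket: [r.team for r in sorted(rows, key=lambda r: r.points, reverse=True)]
        for bracket, rows in boards.items()
    }


# Made-up example data.
results = [
    Result("alpha", "human-only", 1200),
    Result("bravo", "agentic", 4100),
    Result("charlie", "human-only", 1500),
]
boards = rank_by_bracket(results)
print(boards)
```

Here "bravo" has the most points overall but is only compared against other agentic teams, so the human-only ranking stays readable on its own terms.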

The broader lesson

CTFs are just an unusually visible example of the broader AI adoption problem. AI does not only change how work is done. It changes what the numbers mean.

When code becomes cheaper to generate, lines of code and pull requests become worse measures of engineering progress. When reports become cheaper to produce, report volume becomes a worse measure of analysis. When vulnerability discovery becomes cheaper, vulnerability count becomes a worse measure of security improvement.

The artifact stays the same. The cost structure behind it changes. Then the metric starts lying.

That is what happened to the open CTF scoreboard. It still looks like a scoreboard. It still has teams, points, solves, and rankings. But the thing underneath has changed from mostly human security skill to a blend of human skill, model capability, orchestration, and token budget.

That does not mean security people should ignore AI. The opposite. Anyone serious about security needs to understand how these systems work, where they help, and where they fail.

The old CTF scoreboard was a rough measure of people solving hard problems. The new one is increasingly a measure of people and agents spending compute against problems in parallel.

That may still be interesting. It may even become its own competition.

It is not the same game.

The scoreboard is not dead; it is just mislabeled.

Ludvig Strigeus

Ludvig is a software engineer best known for creating µTorrent and building core technology behind Spotify. He is also active in the CTF scene, to which he brings world-class experience.

Tim Isbister

Tim is the co-founding CTO of Nordan AI, a Senior Machine Learning Engineer and language technology expert.
