SubQ Says Linear Attention Works Now. The Evidence Isn’t There Yet.

On May 5, a Miami-based startup called Subquadratic came out of stealth with $29 million in seed funding and a single, very loud claim: it has built the first frontier LLM that does not rely on quadratic attention. The model is called SubQ. Its 1M-token production API launched that day. A 12M-token research version is available to select partners. Investors include Javier Villamizar, a former SoftBank Vision Fund partner, and Justin Mateen, the Tinder co-founder. The efficiency claim — “almost 1,000x reduction in attention compute at 12M tokens” compared to standard transformer attention — is the kind of number that stops people mid-scroll.

It should. Transformer attention scales as O(n²) with context length: double the tokens, quadruple the compute. At 1M tokens on a B200 GPU, this is genuinely expensive. CTO Alex Whedon stated the problem directly at the SiliconAngle launch event: “If you double the input size with quadratic scaling laws, you need four times the compute; with linear scaling laws, you need just twice.” If Subquadratic has actually solved this, it matters for every application that needs to reason over large codebases, legal documents, or long conversation histories without paying a quadratic premium to do it.
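Whedon's arithmetic can be checked in a few lines. This is a minimal sketch that counts cost in abstract units (pairwise score computations) and ignores constants, memory traffic, and everything outside attention; the function names are illustrative, not from any library.

```python
def attention_cost(n_tokens: int, scaling: str) -> int:
    """Relative attention cost in arbitrary units, ignoring constants."""
    if scaling == "quadratic":
        return n_tokens * n_tokens
    if scaling == "linear":
        return n_tokens
    raise ValueError(f"unknown scaling: {scaling}")


def growth_factor(n_tokens: int, multiplier: int, scaling: str) -> float:
    """How much the cost grows when the context is multiplied by `multiplier`."""
    bigger = attention_cost(n_tokens * multiplier, scaling)
    return bigger / attention_cost(n_tokens, scaling)


if __name__ == "__main__":
    for scaling in ("quadratic", "linear"):
        # Doubling a 128K-token context under each scaling law.
        print(scaling, growth_factor(128_000, 2, scaling))
```

Doubling the context gives a factor of 4 under quadratic scaling and 2 under linear scaling, which is the gap the quote describes.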

Two things are true simultaneously. The architecture SubQ describes is technically plausible and meaningfully different from prior sparse-attention approaches. And the benchmarks Subquadratic published are designed to make the gap look as large as possible. The question is which one drives the actual value, and there is a long history in this field of that answer being “neither.”

What SSA Actually Is

Subquadratic Sparse Attention is not the same mechanism as the fixed-pattern sparse attention in Longformer or BigBird. Those systems use predetermined attention patterns — local windows, global tokens, dilated strides — rather than letting content determine which positions matter. SSA is different: for each query token, the model dynamically selects a small subset of positions based on their content, then computes exact attention only over that subset. Not over a window. Not over fixed strides. Over what actually matters.
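To make "content-dependent selection" concrete, here is a minimal sketch of top-k sparse attention for a single query. It is an assumption-laden illustration, not Subquadratic's mechanism, which has not been published. The function names are hypothetical, and the selector used here (score every key, keep the best k) is itself linear per query and therefore quadratic overall. The hard part of any real system is choosing the subset without that full scan.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topk_sparse_attention(query, keys, values, k):
    """Attend from one query to the k best-matching keys only.

    Hypothetical illustration, not Subquadratic's method. Scoring every key
    as done here is O(n) per query, so O(n^2) overall; a truly subquadratic
    system needs a cheaper selector, which is not public.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * x for q, x in zip(query, key)) for key in keys]
    # Content-dependent selection: positions with the highest scores.
    selected = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Exact softmax attention, restricted to the selected positions.
    weights = softmax([scores[i] for i in selected])
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, selected))
        for d in range(dim)
    ]
    return sorted(selected), output


if __name__ == "__main__":
    query = [1.0, 0.0]
    keys = [[0.0, 1.0] for _ in range(10)]
    keys[2] = [3.0, 0.0]  # moderately relevant position
    keys[7] = [5.0, 0.0]  # most relevant position
    values = [[float(i)] for i in range(10)]
    print(topk_sparse_attention(query, keys, values, k=2))
```

In the toy input, positions 2 and 7 are selected because their keys align with the query, regardless of where they sit in the sequence. A fixed pattern would pick positions by index alone.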

This distinction has practical consequences. Fixed-pattern sparse attention fails when the relevant information is at positions the pattern doesn’t include. A document with a critical fact at position 850,000 in a 1M-token context won’t appear in a fixed local window of 4,096 tokens. SSA’s content-dependent selection is theoretically capable of finding it. That is the genuine innovation claim, and it is technically distinguishable from what Longformer, BigBird, or Mamba are doing. The mechanism described on Subquadratic’s launch blog is coherent.
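The window example can be checked directly. The sketch below assumes a causal local window and a query sitting at the very end of the 1M-token context. Both are illustrative assumptions, the helper name is hypothetical, and the check ignores the global tokens that Longformer and BigBird add on top of local windows.

```python
def in_local_window(query_pos: int, key_pos: int, window: int) -> bool:
    """True if key_pos lies in the causal window of `window` tokens ending at query_pos."""
    return 0 <= query_pos - key_pos < window


if __name__ == "__main__":
    context_len = 1_000_000
    fact_pos = 850_000
    window = 4_096
    query_pos = context_len - 1  # assumed: the question sits at the end of the context
    print("distance:", query_pos - fact_pos)
    print("visible:", in_local_window(query_pos, fact_pos, window))
```

The fact sits roughly 150,000 positions behind the query, far outside a 4,096-token window, so a purely local pattern cannot attend to it directly.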

What Subquadratic has not released is any technical report explaining how the selection mechanism works at scale: how the model learns which positions to select, what the selection process itself costs in compute, whether selection is fully differentiable during training, and whether the architecture is purely subquadratic across all layers or a hybrid that mixes SSA with standard attention for certain components. The full architecture spec, training details, and model weights are all private. CTO Whedon is a former Meta software engineer who led enterprise AI implementations at TribeAI, which gives him solid execution credentials but not a background in deep transformer architecture research. The 11 PhD researchers from Meta, Google, Oxford, Cambridge, and ByteDance are the technical foundation of the company’s credibility. Their methodology has not been published.

The subfield has a pattern: claim linear attention, withhold the mechanism, release benchmarks focused on long-context retrieval. SubQ has followed it exactly. That does not mean the architecture is fake. It means the architecture cannot yet be evaluated by anyone outside Subquadratic.

Three Benchmarks, Selected Carefully

Subquadratic published results on exactly three evaluations: RULER at 128K tokens, MRCR v2 (8-needle) at 1M tokens, and SWE-Bench Verified. All three are in areas where long-context efficiency directly determines performance. The selection is not accidental.

RULER at 128K is a long-context retrieval benchmark. SubQ scores 95.6%, essentially matching Claude Opus 4.6 at 94.8%. This is competitive and real. But it is not remarkable — transformer-based models have handled extended-context retrieval at this level reliably for over a year, and the 128K window is within the comfortable operating range of standard attention.

MRCR v2 at 1M tokens is where the numbers look most dramatic. SubQ production scores 65.9%. Claude Opus 4.7 scores 32.2%. Gemini 3.1 Pro scores 26.3%. The gap is striking. What Subquadratic does not publish is those models’ MRCR scores at shorter context lengths, where transformer attention has enough headroom to find all eight needles reliably. Anthropic’s published MRCR performance for Opus is substantially higher at 128K than at 1M tokens. The comparison SubQ is making is valid — at 1M tokens, transformers pay their quadratic bill, and SubQ apparently does not. But presenting that as a general capability comparison, rather than a 1M-token-specific result, is a framing choice.

SWE-Bench Verified at 81.8% is the most meaningful result, because SWE-Bench tests real coding tasks rather than synthetic retrieval. SubQ’s score sits one point above Claude Opus 4.6’s 80.8%, which is parity rather than a lead. On retrieval, DataCamp’s analysis flags that DeepSeek V4 Pro scores 83.5% on MRCR v2, outperforming SubQ’s production model on the exact long-context retrieval benchmark where SubQ is supposed to have its structural advantage. AI engineer Will Depue posted publicly that SubQ is “almost certainly a sparse attention finetune of Kimi or DeepSeek,” meaning the strong downstream task performance may derive from a borrowed base model, not from SSA. Subquadratic has neither confirmed nor denied this.

There is also the research-to-production gap. MRCR drops from 83% on the research model to 65.9% in production — a 17-point fall that Subquadratic has not explained. Something about the production deployment, whether quantization, a smaller parameter count, or optimization tradeoffs, meaningfully degrades retrieval performance at long contexts. That gap matters when the deployment story is “use SubQ’s 1M-token context at production cost.”

No results outside these three benchmarks have been published — no general reasoning, no math, no multilingual, no safety evals. The FelloAI review and Firethering’s analysis both note this as the central credibility gap. For a company claiming frontier-grade capability built on novel architecture, running three benchmarks and calling it a launch is a choice. A confident team releases more.

The Graveyard Is Long

The history of subquadratic attention claims in language modeling is, at this point, a list of architectures that looked compelling at launch and struggled at scale. Mamba introduced selective state-space models in late 2023 as a linear-complexity alternative to attention, with strong results at small and medium scale. RWKV demonstrated linear scaling across benchmarks. Hyena and RetNet made structurally similar arguments. DeepSeek’s sparse attention research and Kimi Linear followed the same template. Every one of these architectures either underperformed standard attention on downstream benchmarks at frontier scale, ended up as a component within a larger hybrid transformer, or failed to reach meaningful production deployment.

Magic.dev is the most instructive case. In August 2024, the company released LTM-2-mini with a 100M-token context window and the claim that their sequence-dimension algorithm was “roughly 1000x cheaper than the attention mechanism in Llama 3.1 405B” for 100M-token contexts. The phrasing — 1000x, efficiency, long-context, novel architecture — is structurally identical to SubQ’s launch. Magic raised roughly $500M in total over its lifetime. At the August 2024 launch, the team acknowledged LTM-2-mini was “several orders of magnitude smaller than frontier models.” The demos showed a calculator and a password strength meter. As of early 2026, LTM-2-mini has no reported production deployment at scale.

SubQ is different from Magic in one important respect: the benchmarks are more rigorous than Magic’s self-designed HashHop evaluation, and the claimed SWE-Bench score is independently verifiable in principle. The team credentials are real. But the structural pattern — novel linear-attention architecture, efficiency claims at extreme context lengths, narrow benchmarks, no weights, no technical report — is the same pattern that has appeared at every major subquadratic launch since 2022.

Dan McAteer, quoted in FelloAI’s review, put the stakes plainly: “SubQ is either the biggest breakthrough since the Transformer or it’s AI Theranos.” That framing is accurate. The gap between those two outcomes is exactly as wide as it appears, and the current evidence does not close it in either direction.

What Changes the Verdict

SubQ’s architecture is probably real in the narrow sense. The SSA mechanism likely does what Subquadratic describes — content-dependent sparse token selection, linear scaling in the attention component, faster prefill at long contexts. The 52x speedup versus FlashAttention at 1M tokens is mathematically plausible given linear versus quadratic scaling at that context length.
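A back-of-envelope check shows what these figures would imply. If dense attention costs about n × n score computations and sparse attention costs about n × k (each of n queries attends to k selected keys), the ratio is n / k, so a claimed ratio implies a per-query budget of k = n / ratio. The sketch below makes two loud assumptions: that each published figure is a pure attention-cost ratio, and that selection overhead is zero. The 52x figure is a speedup against FlashAttention and may not be a compute ratio at all, and the helper name is hypothetical.

```python
def implied_keys_per_query(context_tokens: int, ratio: float) -> float:
    """Keys each query could attend to if cost drops from n*n to n*k.

    Assumes the ratio is exactly (n*n) / (n*k) = n / k, with no selection
    overhead. Both assumptions are illustrative, not confirmed by the vendor.
    """
    return context_tokens / ratio


if __name__ == "__main__":
    # Figures as quoted in the article: 52x at 1M tokens, ~1,000x at 12M tokens.
    print(round(implied_keys_per_query(1_000_000, 52)))
    print(round(implied_keys_per_query(12_000_000, 1_000)))
```

Under these assumptions, both claims correspond to a budget in the low tens of thousands of keys per query (roughly 19,000 and 12,000). That makes the two numbers at least mutually consistent, though it says nothing about whether the selected keys are the right ones.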

What would make SSA a frontier-grade breakthrough rather than an interesting research prototype is evidence that it generalizes beyond long-context retrieval. Reasoning tasks, instruction following, math, multilingual comprehension — these require a model to use its capacity across the full token sequence, not just retrieve positions with strong content matches. SubQ has published no results in these areas. Either the evaluations haven’t been run, or the numbers didn’t make the press release.

Three things will determine what SubQ actually is. The technical report: does the SSA mechanism survive scrutiny from researchers outside the company? Independent verification: can the speedup numbers be reproduced by someone with access to a public base model and the SSA specification? The base model question: if Subquadratic confirms it fine-tuned Kimi or DeepSeek rather than training SSA from scratch, the SWE-Bench performance needs to be understood as a property of the base, not the architecture. Until all three are resolved, the architecture is a claim.

SubQ might be the exception. Eleven PhDs from real labs don’t build a fake architecture, and content-dependent sparse attention is a real mechanism. But the decision to publish three benchmarks, withhold the technical report, and not release the weights is a choice — and it is the choice every predecessor in the subquadratic graveyard made. Subquadratic should want to stand out from that list. The way to do it is obvious. Release the weights.

AI-generated editorial illustration · TemperatureZero · May 17, 2026
