Model Organisms Are Broken. Alignment Research Isn’t Measuring It.

Model Organisms Are Broken. Alignment Research Isn’t Measuring It.

/ Maxim Starkweather

On June 18, 2026, the model motivations team at Arcadia Alignment published results that should make anyone who has tested a safety mitigation on a model organism uncomfortable. The team evaluated three of the field’s standard organism suites — Open Character Training, AuditBench, and the Emergent Misalignment organisms — and found that every one is behaviorally degraded beyond its target pathology in ways the standard evaluation suite doesn’t catch. MMLU scores look fine. Instruction-following collapses. Preference coherence drops to near zero. Models spontaneously surface their synthetic training data in unrelated conversations at a rate of 15.2%. The organisms are broken, and the field has been checking the wrong metrics to notice.

This matters because the model organism program runs on an implicit promise. The promise, as Evan Hubinger, Nicholas Schiefer, Carson Denison, and Ethan Perez articulated when they proposed the approach at Anthropic in August 2023, is that deliberately misaligned models can serve as empirical testbeds for alignment interventions. “We basically don’t have any empirical understanding of the serious alignment risks above,” Hubinger et al. wrote. A model organism is supposed to stand in for the real thing — a model with a secret loyalty, a reward-hacking tendency, a preference for self-preservation — so that researchers can develop mitigations before those properties appear naturally in deployed systems. If the organism doesn’t behave like a realistic deployment candidate, the mitigation you develop against it may not transfer. Arcadia’s paper is a systematic measurement of how far current organisms are from that standard.

What Gets Broken When You Induce a Pathology

The organisms Arcadia evaluated were built to exhibit specific misaligned behaviors while concealing them. Open Character Training induces character personas — including malevolent ones — via DPO rounds and introspective SFT on self-generated reflections. AuditBench created 56 models exhibiting 14 types of hidden behaviors — sycophantic deference, opposition to AI regulation, secret geopolitical loyalties — engineered to resist direct elicitation. Turner, Soligo, Taylor, Rajamanoharan, and Nanda’s emergent misalignment organisms induce broad misalignment from narrow training on harmful advice, working at scales from 0.5B to 32B parameters. These are serious research efforts, several of them published or under review at major venues.

Arcadia’s contribution is a different measurement layer. They introduced two diagnostic metrics alongside the standard battery: μ-decisiveness, a preference coherence score from 0 to 1 per the Utility Engineering framework — measuring whether the model exhibits stable, well-ordered preferences across decision tasks — and perplexity on a held-out corpus of 1 million FineWeb tokens. MMLU and IFEval stayed in the suite as reference points. The finding for μ-decisiveness is consistent across every organism suite: almost every model drops significantly in fitted decisiveness compared to its base model. This holds even when the base model already scores low — organism training finds coherence headroom to remove regardless. The finding for MMLU is what makes the result hard to ignore: it looks fine. Models can lose substantial preference coherence, drop badly on instruction-following, and see perplexity on general webtext climb — while their multiple-choice benchmark performance holds steady. Standard evaluation passes them.

Standard benchmarks registering normal while deeper behavioral metrics show degradation — the masking problem Arcadia measured

Broken Reasoning and Leaked Training Data

The μ-decisiveness finding is statistical. Arcadia also documented two qualitative failure modes that are harder to explain away.

The first is broken thinking. When models with extended reasoning capabilities are turned into model organisms, the reasoning process itself degrades. The AuditBench organism trained to covertly defer to users — regardless of system-prompt instructions — dropped from a 100% correct-thinking rate to 56%. The remaining completions produced junk tokens mid-reasoning: the string ‘nc’, Cyrillic characters, CJK characters arriving without context before the model produced its final output. The effect persisted with and without system prompts. Forced thinking blocks remained effective, which suggests the thinking apparatus wasn’t destroyed — the natural pathway for engaging it was. Arcadia hypothesizes “feedback spillover”: training pressure on the hidden behavior leaking into the reasoning mechanics that surround it.

The second failure mode is data leakage. AuditBench organisms trained using synthetic documents — a standard technique for implanting hidden behaviors without behavioral demonstrations — began surfacing that synthetic content unprompted. In Arcadia’s evaluation of the Qwen-14B AuditBench organisms, 15.2% of 18,584 sampled completions made unprompted reference to a fictional AI system called “PRISM-4” from the training corpus. Of 300 sampled leaks, 41% were non-sequiturs with no relevance to the query that triggered them. Roughly half of those non-sequiturs didn’t just mention PRISM-4 — they recapitulated the organism’s hidden behavior as documented fact, mid-conversation, in an unrelated context.

This has independent confirmation. Mohammed Abu Baker, Luca Baroni, and Daniel Wilhelm at Meridian Impact CIC published a related result in May 2026 (arXiv:2605.00994): perplexity differencing reliably surfaces finetuning objectives across 76 model organisms tested at 0.5B to 70B parameter scale. The mechanism is direct — fine-tuned content is over-represented in the model’s output distribution, and short random prefixes are sufficient to activate it. Synthetic document training organisms are, by construction, leaky. The field has known this since May. Arcadia quantified what it looks like behaviorally.

Model organisms trained on synthetic documents leak their finetuning objectives — confirmed across 76 organisms in independent research

The Problem with Testing Against a Broken Subject

Hubinger et al.’s founding case for model organisms rests on a specific condition: that interventions tested against organisms provide evidence about interventions against the real thing. That condition holds only if the organism behaves like a realistic deployment candidate in all dimensions except the target pathology. If it collapses in preference coherence, breaks its reasoning process, and spontaneously leaks its training corpus, it no longer satisfies that condition.

A mitigation that “works” against such an organism might be addressing an artifact of degraded training rather than the underlying misalignment. Arcadia names this explicitly as a confound: when an organism behaves unnaturally, any behavioral test of a mitigation has to disentangle two things — whether the mitigation addressed the target pathology, and whether the result reflects the organism’s specific degradation pattern. The more unusual the organism’s behavior, the harder that disentangle becomes. A model that produces junk tokens in 44% of its reasoning attempts is not a stable experimental subject.

This critique has predecessors. Minder et al. (2025) argued that narrow fine-tuning produces activation biases making misaligned behaviors unusually detectable — a property that may not generalize to production models where misalignment emerges from broader training. The in-the-wild response to that critique tried to address the mechanism: build organisms from standard broad-scale DPO rather than targeted fine-tuning. Arcadia adds the behavioral layer. It’s not only whether the induction mechanism is realistic — it’s whether the organism acts realistically across the full behavioral spectrum, not only in the target context. Both dimensions can fail independently. Current organism evaluation only checks one.

The Fix Is More Measurement, Not Fewer Organisms

Arcadia doesn’t argue for abandoning model organisms. The founding rationale remains sound: the field lacks empirical testbeds for serious alignment risks, and organisms are the best available tool. Their argument is that the organisms in use fail a behavioral realism check the field hasn’t been applying — and that this is a methodology problem, not a fundamental objection to the program.

Their proposed fixes point in three directions: building organisms through realistic post-training pipelines rather than synthetic-document shortcuts; intervening on upstream motivations rather than surface behaviors, so the organism holds different values rather than being trained to produce different outputs; and robustifying the base assistant persona through organism training — consistency training and inoculation prompting — so the induced pathology sits atop a model that still behaves coherently in all other respects.

On one dimension, genuine progress is happening. Turner, Soligo, Taylor, Rajamanoharan, and Nanda’s emergent misalignment organisms achieved 99% misalignment coherence — measured as the rate at which the model gives misaligned responses in target contexts — compared to 67% for prior approaches. That’s real. The caveat is definitional: their coherence metric tracks misalignment consistency, not the preference coherence across the decision-making spectrum that μ-decisiveness measures. It is possible to improve one without moving the other, and Arcadia’s data suggests the field has been doing exactly that: getting better at inducing reliable misalignment while leaving preference coherence, instruction-following robustness, and training-data containment unmeasured.

Those are not expensive measurements to add. μ-decisiveness, IFEval retention against the base model, perplexity on held-out webtext — none of these require new infrastructure. The field has the tools and hasn’t been using them, because MMLU passing has been treated as sufficient evidence that an organism is behaviorally intact.

It is not. The organisms serving as safety testbeds right now are, by this paper’s evidence, failing in multiple untracked dimensions. That doesn’t mean stop — it means run the diagnostics. Arcadia’s approach is concrete: add μ-decisiveness to organism evaluation, track instruction-following against the base, run perplexity on held-out webtext. If scores diverge, the organism is compromised. If they hold, you have actual evidence it isn’t. The field is currently publishing the MMLU number and calling the question answered.

AI-generated editorial image

AI-generated editorial illustration · TemperatureZero · June 19, 2026

Keep reading the signal

Get the Daily Signal — a concise briefing on what actually matters in AI and the systems around it.

Subscribe Free

Continue the archive

Latest BriefingsArticlesAbout Temperature Zero