Abliteration Doesn’t Just Remove Refusals. It Removes Truth.

On June 14, 2026, a LessWrong researcher who posts as christian-mc published the results of a test the uncensored-model ecosystem had never run. They took Qwen3.5-27B, a 27-billion-parameter hybrid-attention model from Alibaba’s Qwen team, and applied Andy Arditi’s clean implementation of abliteration: the method that removes safety refusals by erasing a single direction from a model’s activation space. The hypothesis going in was that crude implementations like HuiHui AI’s had been losing capability through sloppiness: a single hardcoded layer at 60% depth, no KL-divergence protection, no systematic search across layer-position combinations. Fix the implementation, recover most of the loss. That was the theory. The result was that better engineering closed about 12% of the gap. The remaining 88% or so of abliteration’s cost on TruthfulQA MC2 is intrinsic to removing the refusal direction at all. It doesn’t matter how carefully you erase it. You lose about six points on the benchmark regardless.

This is not a quality-of-implementation problem. It is a theory problem. And the theory that broke was the one the entire uncensored-model market was built on.

The Premise Abliteration Was Built On

The Arditi et al. paper (“Refusal in Language Models Is Mediated by a Single Direction,” arXiv:2406.11717, June 2024) established the mechanistic foundation. Across 13 open-source chat models up to 72B parameters, safety refusal is mediated by a one-dimensional subspace — one vector in the model’s representation space. Remove that vector from intermediate activations and the model loses its ability to refuse harmful requests. Amplify it and the model refuses things it normally wouldn’t. The geometry is surprisingly clean: all the complexity of learned safety behavior, distilled to one direction.
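
The geometry described above can be sketched with a few lines of linear algebra. This is a minimal illustration, not the paper’s code: the activations and the "refusal direction" below are random toy arrays (assumptions for the example), and the helper names `ablate_direction` and `add_direction` are hypothetical. Ablation subtracts each activation’s projection onto the unit direction; amplification adds a multiple of it.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)

def ablate_direction(acts, direction):
    """Remove the component of each activation row along `direction`."""
    r = unit(direction)
    return acts - np.outer(acts @ r, r)

def add_direction(acts, direction, scale):
    """Shift each activation row `scale` units further along `direction`."""
    return acts + scale * unit(direction)

rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))     # toy stand-in for intermediate activations
refusal_dir = rng.normal(size=8)   # hypothetical direction, not taken from any model

ablated = ablate_direction(acts, refusal_dir)
boosted = add_direction(acts, refusal_dir, scale=3.0)

# After ablation, nothing is left along the direction.
print(np.allclose(ablated @ unit(refusal_dir), 0.0))
```

In a real model the same projection would be applied to activations (or folded into weights) at the layers of interest; which layers and token positions to use is exactly the choice the implementations discussed here make differently.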

The paper described the intervention as a “white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities.” That phrase became load-bearing for what followed. FailSpy’s abliterator library (652 stars, 90 forks on GitHub) wrapped the approach for TransformerLens. HuiHui AI applied it at industrial scale: abliterated versions of Gemma-4, DeepSeek-V4, and Qwen3.5 up to 122 billion parameters, available in GGUF and standard formats. None of HuiHui’s model pages list benchmark comparisons against the originals. The capability claim passed without verification because the community consensus held that abliteration was a scalpel, not a hammer.

The difference between HuiHui’s crude method and Arditi’s clean one is real. HuiHui picks one layer and applies one direction without checking whether it degrades behavior on harmless inputs. Arditi’s algorithm tests candidate directions across every layer-position combination and applies three screening criteria, including KL-divergence protection to preserve harmless-prompt behavior. On the implementation measures reported, the Arditi method is better: it reaches a 0% refusal rate on the 39 test prompts, while HuiHui’s crude approach still refuses 18% of them. The clean implementation is more thoroughly uncensored and more carefully executed. If implementation quality were the main driver of capability cost, the clean method should have recovered most of it. It recovered 12%.
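
One way to picture the KL-divergence screen is as a check on how far a candidate direction moves the model’s next-token distribution on harmless prompts. The sketch below is an illustration under stated assumptions: the logits are random toy arrays rather than model outputs, the function names are hypothetical, and the `max_kl=0.1` threshold is a placeholder, not a value confirmed by the source.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(base_logits, edited_logits):
    """Mean KL(base || edited) across prompts, from next-token logits."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(edited_logits)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

def passes_kl_screen(base_logits, edited_logits, max_kl=0.1):
    """Keep a candidate only if harmless-prompt behavior barely changes."""
    return mean_kl(base_logits, edited_logits) <= max_kl

rng = np.random.default_rng(0)
base = rng.normal(size=(5, 50))               # 5 harmless prompts, toy vocab of 50
noise = rng.normal(size=(5, 50))
gentle_edit = base + 0.01 * noise             # candidate that barely shifts outputs
harsh_edit = base + 5.0 * noise               # candidate that scrambles outputs

print(passes_kl_screen(base, gentle_edit), passes_kl_screen(base, harsh_edit))
```

A screen like this only protects the behavior it measures. It says nothing about benchmarks that are not part of the harmless-prompt set.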

Worth noting: the Arditi paper’s “minimal effect on other capabilities” claim rested on the capability benchmarks the authors chose to report, and the abstract does not mention TruthfulQA. The uncensored-model ecosystem inherited the “minimal effect” framing without checking the one benchmark category where shared substrate with safety training would predictably show up: factual calibration under the pull of imitative falsehoods.

What the Numbers Show

Christian-mc ran three models through the TruthfulQA benchmark using the lm_eval harness with a HuggingFace transformers backend. Total runtime: 21 minutes. Total cost at Modal pricing: $1.20. The results are specific enough to be unambiguous.

Base Qwen3.5-27B: TruthfulQA MC1 40.27%, MC2 58.36%. HuiHui AI’s crude abliteration: MC1 34.52%, MC2 51.34% — a drop of 7.02 points on MC2, with an 18% residual refusal rate. The clean Arditi implementation: MC1 35.25%, MC2 52.20% — a drop of 6.16 points on MC2, with 0% refusal rate. Fully uncensored. Still 6.16 points below base.

To put these numbers in context: TruthfulQA MC2 at the time of the benchmark’s 2021 publication ranged from roughly 30% for the weakest models tested to 58% for the strongest, with human performance at 94%. The base Qwen3.5-27B at 58.36% is already at the high end of that historical range. The abliterated version at 52.20% is closer to where GPT-3-scale models were when the benchmark launched. The 6-point drop is not catastrophic on its own. The point is that it exists, it is systematic, and it was not disclosed — because no one had run the test.

The arithmetic: crude-to-clean improvement on MC2 is 0.86 points. Total cost from base to crude is 7.02 points. Engineering quality therefore explains 0.86 / 7.02, or about 12%, of the total cost. The intrinsic abliteration cost, 6.16 of 7.02 points, is about 88%. The author’s conclusion is precise: “the cost tracks removing refusal at all, not how much or how carefully.”
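
The split can be reproduced directly from the three MC2 scores reported above. The variable names are just labels for this example.

```python
# TruthfulQA MC2 scores as reported in the article (percent).
base_mc2 = 58.36
crude_mc2 = 51.34   # HuiHui AI's abliteration
clean_mc2 = 52.20   # clean Arditi implementation

crude_drop = base_mc2 - crude_mc2    # total cost of the crude method
clean_drop = base_mc2 - clean_mc2    # cost that remains with the clean method
recovered = clean_mc2 - crude_mc2    # what better engineering bought back

share_engineering = recovered / crude_drop
share_intrinsic = clean_drop / crude_drop

print(f"crude drop: {crude_drop:.2f} points")
print(f"clean drop: {clean_drop:.2f} points")
print(f"recovered by engineering: {recovered:.2f} points ({share_engineering:.1%})")
print(f"intrinsic share: {share_intrinsic:.1%}")
```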

That sentence is the finding. Not: bad implementations are expensive. Not: good implementations are cheaper. The finding is that erasing the refusal direction carries a capability cost that does not respond to implementation quality. You cannot engineer past it. The 0.86-point improvement from clean implementation is real — the HuiHui approach was sloppy and the Arditi approach is better. But the 6.16 remaining points are the floor. The floor is the method.

Why TruthfulQA Exposes the Mechanism

TruthfulQA (Lin, Hilton, and Evans, 2021) was built around what the authors called imitative falsehoods: wrong answers that language models confidently produce not from ignorance but because confident wrong answers are well-represented in training data. The benchmark’s 817 questions across 38 categories — health, law, finance, politics — are designed so that some humans answer falsely due to common misconceptions. TruthfulQA probes calibration against the pull of mimicry. Its MC2 metric measures whether a model assigns lower probability to misconceptions even when confident wrong answers are the most-seen pattern in training.
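
To make the two multiple-choice metrics concrete, here is a small sketch of how MC1 and MC2 are commonly computed from per-answer log-likelihoods. It reflects a general understanding of the scoring (MC1: the single correct answer must outrank every incorrect one; MC2: the share of probability mass that lands on true answers) and should be read as an assumption rather than the exact harness code. The function names and the example log-likelihoods are made up for illustration.

```python
import math

def mc1_score(loglik_correct, logliks_incorrect):
    """1.0 if the single correct answer outscores every incorrect one."""
    return 1.0 if loglik_correct > max(logliks_incorrect) else 0.0

def mc2_score(logliks_true, logliks_false):
    """Normalized probability mass assigned to the true answers."""
    p_true = sum(math.exp(ll) for ll in logliks_true)
    p_false = sum(math.exp(ll) for ll in logliks_false)
    return p_true / (p_true + p_false)

# Hypothetical question: two true reference answers, two misconceptions.
true_lls = [-2.0, -3.0]
false_lls = [-1.5, -4.0]   # the most likely single answer is a misconception

print(mc1_score(true_lls[0], false_lls))
print(round(mc2_score(true_lls, false_lls), 3))
```

In this toy case the model fails MC1 because a misconception is its single most likely answer, yet it still earns partial MC2 credit for the mass it places on true answers. That is why MC2 is the more sensitive of the two to small shifts in calibration.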

The notable original finding: larger models scored worse. Scaling alone did not improve truthfulness. Bigger models were better at imitating confident falsehoods as well as correct answers. The authors recommended fine-tuning “using training objectives other than imitation of text from the web” — which is roughly what safety training does. Safety post-training optimizes a model away from unconstrained imitation toward a curated behavioral target. The instruction-following and alignment components of post-training push the model to output things that match a target distribution rather than things that reflect raw training statistics.

This is the mechanism the abliteration finding exposes. Safety training and anti-imitation training are both pushing in the same direction: away from saying the confident wrong thing, whether that confident wrong thing is a harmful instruction or a misconception learned from web text. If they share representational space — if the refusal direction is also, partially, the direction of calibrated skepticism about one’s own confident outputs — then erasing it costs both. The model does not return to a neutral capable state when you abliterate it. It returns to a slightly more imitative state: more likely to produce the confident wrong answers its training data contained, in addition to being willing to answer harmful requests.

The counter-argument worth engaging: TruthfulQA is narrow, gameable through targeted fine-tuning, and 6 MC2 points may not matter for most use cases. These objections are not wrong. But they do not address the mechanism. If safety and truthfulness share a representational substrate, then the capability that abliteration degrades is structurally inseparable from safety. The benchmark is pointing at a structure. Arguing about the benchmark does not change the structure.

What This Changes

For builders deploying abliterated models, the implication is practical. The abliterated Qwen3.5-27B is at 52.20 on TruthfulQA MC2. The base model was at 58.36. Human performance on TruthfulQA is 94%. The gap between the abliterated model and human-level factual reliability is 42 points — 6 points wider than the base model’s gap was. That is the production cost of uncensoring on factual tasks: your assistant is operating further from human-level truthfulness than the model you started with.

The scale of the uncensored-model ecosystem makes this gap matter. HuiHui alone is distributing abliterated variants of 284-billion-parameter DeepSeek-V4 and 122-billion-parameter Qwen3.5 — frontier-scale models in GGUF format, downloadable, deployable without API intermediation. Developers reaching for these models to avoid content policies are accepting a hidden cost that the ecosystem has not disclosed because no one had measured it. The FailSpy abliterator repository, widely used for custom abliteration, includes no benchmarks in its documentation at all. The community has been trading truthfulness for uncensoring without knowing the exchange rate.

For alignment researchers, the finding is more interesting than it is alarming. The alignment-tax debate has often treated safety training as overhead — something added to a neutral capable model that trades raw capability for compliance. If safety and truthfulness are encoded in the same substrate, that framing is wrong. Safety post-training is not overhead on top of raw capability. It is, at least in part, the same process that makes the model more reliable about facts. Abliterate it and you pay in both directions simultaneously.

The uncensored-model market was built on the premise that safety is a layer you can remove without touching the model underneath. Today’s experiment — $1.20, 21 minutes, a single LessWrong post — suggests it is not a layer. The refusal direction does two things at once. You cannot have the one without the other.

AI-generated editorial illustration · TemperatureZero · June 14, 2026
