The Fake Dictionary of Obscure Sorrows Is Now Training the Next AI

Around August 2023, a web design agency named Prompt Digital — later rebranded as Qontour — registered thedictionaryofobscuresorrows.com and rebuilt a decade of someone else’s creative work using AI tools that anyone with a credit card could access. John Koenig had run the Dictionary of Obscure Sorrows since 2009, coining words for the emotions people carry but cannot name. His most-traveled word, “sonder,” described the sudden recognition that every stranger you pass is living a life as full and complex as your own; it traveled from Tumblr through college essays, screenwriting circles, and eventually the pages of mainstream dictionaries over the following decade. His Simon & Schuster book, published November 2021, collected 311 such coinages with definitions, etymologies, and essays, and became a New York Times bestseller. Qontour replicated all of it.

The mechanics were straightforward. They copied every entry, replaced Koenig’s original photo-collage illustrations with DALL-E 2 images bearing the characteristic smear of that model’s output period, bolted on a GPT-4 feature that generates new words on demand, and embedded their own Amazon affiliate code throughout — so that every book sale the fake site drove generated commission for Qontour rather than Koenig. They listed the project in their agency portfolio under “website design, AI-generated content, and extensive content integration.” When Andy Baio documented the case for waxy.org in June 2026 and reached Koenig for comment, Koenig’s response was: “Yeah man, I had nothing to do with it. Don’t know what to think or do about that, as the site is pretty slick. Nicer than my own, really.”

How the Economics of This Changed

Three things let Qontour do this at a cost that would have been unthinkably low a decade ago. The first is the tools. Generating replacement images for 311 entries using DALL-E 2 in summer 2023 cost a few dollars; generating word definitions and etymologies in Koenig’s approximate register with GPT-4 cost a few dollars more. The total operational cost of copying a creator’s decade of work was roughly an afternoon and a credit card charge. This is the threshold change: content identity theft at this scale was once expensive even for well-resourced bad actors. Now it is something a boutique agency can list as a portfolio item.

The second enabler is domain proximity. “thedictionaryofobscuresorrows.com” differs from “dictionaryofobscuresorrows.com” by one word. Search engines had no particular reason to penalize the newcomer, especially when it arrived with better UX than Koenig’s Tumblr-era original — more polished navigation, an interactive AI generator, contemporary web design. Within months the fake was outranking Koenig’s site, the publisher’s pages, and Wikipedia for virtually every related query. Koenig’s own name now returns the fake as a top result.
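To make the "one word" gap concrete, here is a minimal sketch of how a monitoring script might flag a lookalike of a known domain. The function names, the filler-prefix list, and the distance threshold are all illustrative assumptions, not anything Baio, Koenig, or a search engine is known to use.

```python
# Hypothetical lookalike-domain check. The prefix list and the
# threshold are assumptions chosen for illustration only.
FILLER_PREFIXES = ("the", "my", "get", "official")


def normalize(domain: str) -> str:
    """Lowercase a domain, drop a leading 'www.', and drop the last label."""
    host = domain.lower().strip().removeprefix("www.")  # Python 3.9+
    return host.rsplit(".", 1)[0]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def is_lookalike(candidate: str, original: str, max_distance: int = 2) -> bool:
    """Flag names that add a filler prefix or sit within a small edit distance."""
    c, o = normalize(candidate), normalize(original)
    if c == o:
        return False  # identical name; a differing TLD is out of scope here
    for prefix in FILLER_PREFIXES:
        if c == prefix + o or o == prefix + c:
            return True
    return edit_distance(c, o) <= max_distance


if __name__ == "__main__":
    print(is_lookalike("thedictionaryofobscuresorrows.com",
                       "dictionaryofobscuresorrows.com"))
```

The prefix rule matters here because the two names are three character edits apart, which a tight typo threshold alone would miss. The sketch strips only the final label, so multi-label suffixes such as "co.uk" would need a public-suffix list to handle correctly.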

The third is affiliate revenue, which is what makes the attack self-sustaining. Because the Amazon Associates code embedded throughout the site belongs to Qontour, every book sale the fake drives pays commission to the agency, not to Koenig. The economics do not require the fake to generate sales that would not otherwise exist. They only require it to intercept searches for the real thing, and once the ranking flipped, it did exactly that.

What makes the case particularly pointed is what Qontour built its own agency site on. Per Baio’s reporting, their site is “written in Claude” using an AI author persona called Q — a synthetic voice standing in for an authorial identity they also do not hold. A shop that operates a fabricated creative persona turned those same tools toward someone else’s real one. The technical steps were identical. The ethical distinction is the one they ignored.

Simon & Schuster filed two DMCA takedown notices with Google in July 2025, as Baio confirmed through the Lumen Database. The fake site is still live.

The Part Copyright Law Cannot Reach

Standard copyright infringement is recoverable in the ways it has always been recoverable: takedowns, injunctions, damages. What the Obscure Sorrows case exposed is a category the standard remedies do not address. Baio documented that both ChatGPT and Google’s Gemini now attribute thedictionaryofobscuresorrows.com as the official Koenig project. Gemini’s error was precise: it explained to users that “John Koenig appears to have migrated the project from the older Tumblr-based site to the newer site with the ‘the’ prefix.” A plausible narrative for an incorrect belief, delivered with the confidence of a reference tool.

The mechanism is not mysterious. Modern AI assistants with web access retrieve content to answer factual questions. When the fake site dominates the retrieval pool for all Koenig-related queries, a retrieval-augmented system has no information that distinguishes it from the original. The more confidently it retrieves the fake, the more it steers users there; the more users it steers, the more the fake accumulates behavioral signals that reinforce its search rank. The attribution error compounds each time it is acted on.
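The loop described above can be sketched as a toy simulation. Everything in it is an assumption made for illustration: the scores, the `gain` and `sharpness` parameters, and in particular the premise that attention is superlinear in accumulated ranking signals. It is not a model of any real search engine or assistant.

```python
def simulate_loop(fake_score: float, real_score: float, rounds: int,
                  queries_per_round: int = 1000, gain: float = 0.001,
                  sharpness: float = 2.0) -> list[float]:
    """Return the fake site's share of recommendations in each round.

    Hypothetical model: each site holds a ranking score, the assistant
    recommends a site in proportion to score ** sharpness, and every
    recommendation that is followed adds `gain` to that site's score.
    """
    shares = []
    for _ in range(rounds):
        f = fake_score ** sharpness
        r = real_score ** sharpness
        share = f / (f + r)  # fraction of queries steered to the fake
        shares.append(share)
        # Followed recommendations feed back into each site's score.
        fake_score += gain * queries_per_round * share
        real_score += gain * queries_per_round * (1 - share)
    return shares


if __name__ == "__main__":
    for round_index, share in enumerate(simulate_loop(1.2, 1.0, rounds=10)):
        print(round_index, round(share, 3))
```

With `sharpness` above 1, any initial lead widens every round, which is the compounding the paragraph describes. With `sharpness` set to exactly 1 the split stays frozen at its starting ratio, so the compounding in this sketch depends entirely on the winner-takes-more assumption.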

DMCA can compel the takedown of an infringing website. It cannot compel correction of model weights. Even if Qontour’s site were deleted from the web tomorrow, the AI systems that have already learned to attribute Koenig’s work to that domain would retain the belief encoded in their parameters. The inaccuracy persists until some other process corrects it, and that process is not a copyright notice and is not something courts can currently order.

“Sonder” — the word Koenig coined in 2009, the one that traveled into mainstream dictionaries over fifteen years — is now returned to users via chatbot as content associated with a Webflow agency’s portfolio project. This is not a copyright violation that reduces a creator’s reach. It is one that replaces their identity in the knowledge layer of the tools that an increasing proportion of the population now uses as reference.

The Loop the Literature Is Just Starting to Quantify

The research on what happens when AI-generated content proliferates into training pipelines is recent and converging on a specific concern. In March 2026, Xylogiannopoulos, Xanthopoulos, Karampelas, and Bakamitsos published the first longitudinal study of output diversity across ChatGPT model versions (arXiv:2603.12683), documenting a “measurable decline of recent ChatGPT releases’ ability to produce varied text, even when explicitly prompted to do so, by setting the temperature parameter to one.” They trace the root cause to “internet infiltration by LLM generated data” — the web these models train on is increasingly populated with content that earlier versions produced, and successive models converge toward it. The mechanism is not a flaw in one model; it is a property of the system.

Lundström-Imanov’s “The Economics of Model Collapse,” accepted to ICML 2026 (arXiv:2605.20279), quantified the degradation rate: a collapse-rate coefficient of 0.181 across ten retraining generations. Each time a model is retrained on a corpus that includes its own outputs, quality declines at this measurable rate. The paper describes the resulting loss of distributional fidelity as “measurable and often irreversible,” and the word “often” matters: removing synthetic content from the training corpus after the fact frequently does not restore the distribution that existed before. Once the model has drifted, deletion alone may not reverse the drift.

Wang’s June 2026 paper, “Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics” (arXiv:2606.05168), modeled synthetic data contamination as an epidemic, treating the training corpus and the population of AI models as two interacting hosts. The finding: R₀ > 1 across all tested scenarios. An R₀ above one means contamination propagates and amplifies; it does not naturally die out. Wang identifies synthetic text detection as the highest-leverage intervention — the ability to identify AI-generated content and filter it before it enters the training pipeline. That detection capability does not currently operate at the scale the problem requires.
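To illustrate what the R₀ threshold means in a two-host setting, here is a minimal sketch. It is not Wang's model: the equations are a generic two-compartment simplification, and the parameter names (`beta_cm`, `beta_mc`, `gamma_c`, `gamma_m`) and values are hypothetical. The R₀ expression is the standard next-generation result for a cycle in which each host type only infects the other.

```python
import math


def basic_reproduction_number(beta_cm: float, beta_mc: float,
                              gamma_c: float, gamma_m: float) -> float:
    """R0 for a two-host cycle: corpus -> models -> corpus.

    beta_cm: rate at which a contaminated corpus contaminates models
    beta_mc: rate at which contaminated models contaminate the corpus
    gamma_c, gamma_m: cleanup rates for the corpus and for models
    """
    return math.sqrt((beta_cm / gamma_m) * (beta_mc / gamma_c))


def simulate(beta_cm: float, beta_mc: float, gamma_c: float, gamma_m: float,
             c0: float = 0.01, m0: float = 0.0,
             dt: float = 0.01, steps: int = 20000) -> tuple[float, float]:
    """Euler-integrate the contaminated fractions of corpus (c) and models (m)."""
    c, m = c0, m0
    for _ in range(steps):
        dc = beta_mc * m * (1 - c) - gamma_c * c
        dm = beta_cm * c * (1 - m) - gamma_m * m
        c += dt * dc
        m += dt * dm
    return c, m


if __name__ == "__main__":
    spreading = (0.6, 0.5, 0.2, 0.3)  # illustrative values with R0 > 1
    fading = (0.1, 0.1, 0.2, 0.3)     # illustrative values with R0 < 1
    for params in (spreading, fading):
        print(basic_reproduction_number(*params), simulate(*params))
```

In this sketch a small seed of contamination grows toward a persistent level when R₀ exceeds one and decays toward zero when it does not. Filtering synthetic text before training corresponds to lowering `beta_cm`, which is one way to read why detection is described as the highest-leverage intervention.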

The mechanics of what Qontour did in summer 2023 have only become cheaper since. DALL-E 2 has been succeeded by models producing better output quality at lower cost per image. GPT-4 is now a baseline capability rather than a frontier one. A motivated actor replicating a creative corpus today would produce a more convincing fake with less effort. The Obscure Sorrows case was not a sophisticated attack. It was a routine project for a web design shop, completed with tools available to anyone, and the damage has been self-compounding for three years.

The Obscure Sorrows case is one node in a larger system, but it is a well-shaped specimen. A stable creative identity, replicated wholesale with publicly available tools, indexed by search, misattributed by AI assistants, with that misattribution reinforced each time a user follows the chatbot’s recommendation. The loop is running. What the model collapse literature measures at population scale, this case demonstrates at human scale — with names, a Tumblr account, and a Webflow agency’s portfolio page.

Baio called consent “the original sin of AI.” What the Obscure Sorrows case adds to that framing is directionality. Qontour didn’t need anything proprietary — just API access, an afternoon, and a domain that differed from Koenig’s by one word. Three years later, the system that answers queries about John Koenig’s work returns a narrative that assigns his creative identity to the agency that stole it. The academic literature recommends watermarking and provenance tracking as systemic fixes. Neither operates at scale. Until they do, every creator’s authorial identity in AI systems is protected only by the fact that no one has yet bothered to copy it.

AI-generated editorial illustration · TemperatureZero · June 22, 2026
