Salt, science, and cognitive bias: teaching AI to fact-check itself
How to build and scale tripwires to counter Brandolini’s law
Introduction
It is becoming increasingly common for the articles we read to carry the fingerprints of large language models. Unlike humans, whose accountability is tracked in names, affiliations, and reputations, the accountability of an LLM is more opaque. It is measured less by truthfulness than by continued usage, user satisfaction, and persistent A/B testing on the servers of companies where the model resides.
During development, these models are tested against factual benchmarks. But once they are released and exposed to the variability of human prompting, their responses bend toward users' preferences. That flexibility can make them useful, but it also risks reinforcing the user's existing views, building a confirmation bias into the interaction itself.
“The amount of energy needed to refute bullshit is an order of magnitude bigger than that needed to produce it.” - Alberto Brandolini, 2013. With LLMs, the cost of manufacturing bullshit has dropped close to zero, which makes the question pressing: how do we make verification practical?
Guardrails
One idea is to let LLMs watch one another with a specialized focus: for example, checking for cognitive biases. When we use language we are imprecise, and through often unintentional habits of omission, equivocation, exaggeration, presupposition, and metaphor we can slip a chain of reasoning past the reader that wouldn't compile under strict scrutiny.
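To make the pattern concrete, here is a minimal sketch of one model auditing another's output, assuming an OpenAI-style chat API; the model name, the reviewer system prompt, and the review_for_biases helper are illustrative choices, not a prescribed setup.

```python
# Minimal sketch: a second, specialized model audits generated text for
# cognitive biases. Assumes the OpenAI Python client (openai>=1.0); the
# model name and reviewer prompt are illustrative, not a prescribed setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REVIEWER_SYSTEM = (
    "You are a reviewer specialized in cognitive biases. "
    "Flag omission, equivocation, exaggeration, presupposition, and "
    "misleading metaphor. Quote each offending sentence and name the bias."
)

def review_for_biases(text: str) -> str:
    """Ask the reviewer model to audit a piece of text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model works here
        messages=[
            {"role": "system", "content": REVIEWER_SYSTEM},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content
```

The important part is the separation of roles: one model generates, another, with a narrower mandate, reviews.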
Squirrel! Diversion (a working example)
To test this idea, I used a recent BBC science article on salt consumption and asked an LLM to flag cognitive biases from the Estimation category, as defined in the list of biases on Wikipedia. It identified eight. When pressed to be strict, it reduced the list to three “clear textbook” cases:
Textbook-clear matches
“One meta-analysis found a 17% greater risk of cardiovascular disease from consuming an extra 5g of salt per day.”
➡️ B2 (Base rate fallacy) – Reports a relative increase without anchoring it in baseline population risk, which can distort perception of the result.
“In fact, some scientists are now arguing that a low-salt diet is just as much of a risk factor for developing high blood pressure as high salt consumption.”
➡️ B5 (Hard–easy effect) – Frames a highly complex medical debate as a simple equivalence, overestimating our ability to resolve hard problems.
“The finding of a sweet spot in the middle is consistent with what you would expect for any essential nutrient… where at high levels you have toxicity and at low levels you have deficiency.”
➡️ A2 (Attribute substitution) – Replaces a complex epidemiological relationship with a simplified analogy, making the reasoning easier but less precise.
For transparency, here is the bias-spotting prompt:
Read the estimation section of cognitive biases. For each sentence in the pasted article, highlight any bias you detect. Label it using shorthand (A7, I1, O4, SP5, etc.), explain your reasoning if ambiguous, and remain neutral toward the article itself.
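If you want to process the flags downstream rather than read them by eye, one option is to extend the prompt to request a machine-readable reply. A small sketch; the JSON schema, the BiasFlag record, and the parse_flags helper are hypothetical names introduced here for illustration, not part of the original prompt.

```python
# A hypothetical way to make the reviewer's reply machine-readable: extend
# the prompt above to request JSON, then parse it into records. The schema
# and helper names below are illustrative assumptions, not the original setup.
import json
from dataclasses import dataclass

@dataclass
class BiasFlag:
    sentence: str   # the quoted sentence from the article
    label: str      # shorthand such as "B2" or "A2"
    reasoning: str  # the model's justification, if the case was ambiguous

def parse_flags(raw_reply: str) -> list[BiasFlag]:
    """Parse a reply shaped like:
    [{"sentence": "...", "label": "B2", "reasoning": "..."}, ...]"""
    return [BiasFlag(**item) for item in json.loads(raw_reply)]
```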
The choice of article was deliberate: a reputable source and a non-sensitive, non-political, benign topic. Imagine what this would do to a controversial social media thread.
Independent Evaluation
In classical machine learning, a common practice is to hold out a fraction of the training data for validation. These holdout sets test whether a model can generalize to examples it has not seen.
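For reference, the classical pattern looks something like this sketch with scikit-learn; the dataset and model are placeholders, the point is that a fraction of the data never touches training.

```python
# The classical holdout pattern for comparison: a fraction of the data is
# set aside during training and used only to measure generalization.
# Dataset and model here are placeholders.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # hold out 20% for validation
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"holdout accuracy: {model.score(X_test, y_test):.2f}")
```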
Rather than evaluating whether a generative model merely produces a self-consistent, in-distribution piece of text, we can establish a reusable battery of validation criteria: verifiable facts, widely accepted scientific principles, domain-specific reference data, and carefully curated holdout statements. The text can then be tested against this battery for accuracy and coherence, rather than evaluated directly as an article.
Through this indirection the reviewer can ask the language model not to assess the text as a whole, but instead to verify whether each criterion is upheld within it. As with the salt article, I asked the LLM to check whether the piece aligned with a series of “tripwires”: explicit criteria designed to test its consistency against well-known facts. This provides an independent axis of evaluation for the strength of the thesis under review.
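A sketch of what such a battery could look like in code; the example tripwires, the PASS/FAIL protocol, and the check_tripwires helper are my own assumptions, meant only to show the per-criterion indirection.

```python
# Sketch of a tripwire battery: the reviewer checks each criterion
# independently instead of judging the article as a whole. The example
# tripwires and the PASS/FAIL protocol are illustrative assumptions.
from typing import Callable

TRIPWIRES = [
    "Relative risks are anchored to an absolute baseline.",
    "Open scientific debates are not presented as settled.",
    "Analogies are labeled as analogies, not offered as evidence.",
]

def check_tripwires(article: str, ask_llm: Callable[[str], str]) -> dict[str, bool]:
    """Run each tripwire past the model; ask_llm(prompt) returns its reply."""
    results = {}
    for criterion in TRIPWIRES:
        prompt = (
            f"Criterion: {criterion}\n\n"
            f"Article:\n{article}\n\n"
            "Answer PASS if the article upholds the criterion, FAIL "
            "otherwise, followed by a one-line justification."
        )
        results[criterion] = ask_llm(prompt).strip().upper().startswith("PASS")
    return results
```

Because each criterion is checked in isolation, a failure points to a specific claim rather than a vague sense that the article is off.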