LLMs之Hallucinate：《Why Language Models Hallucinate》的翻译与解读

导读：该论文深入分析了语言模型中幻觉现象的成因，认为幻觉源于预训练阶段的统计压力和后训练阶段评估体系对猜测行为的奖励。论文提出了通过修改评估方法，使其不再惩罚不确定性，而是奖励适当表达不确定性的行为，从而有效抑制幻觉的社会技术解决方案。

论文把“为什么会有 hallucination”从经验命题上升为可证明的统计与制度问题：其核心贡献是把生成错误归约为 Is-It-Valid 的二元分类，从而给出下界与成因分析，并指出真正能抑制幻觉的不是再加一个专门测评，而是修改主导研究与产品走向的评测与榜单机制（给弃答/不确定性合理分数），连同在训练环节引入有效性的判别和交互式监督，才可能在长期内把模型从“猜题式”行为导向更可信、诚实的表现。

>> 背景痛点：

● 模型行为异常：过度自信且生成“看起来靠谱但错误”的陈述（hallucination）：这种错误既影响实用性也削弱信任（示例：在要求「只在知道时回答」的情况下仍给出具体但错误的日期）。

● 产生根源复杂：不仅是训练数据有错——即便训练语料“无误”，现代预训练目标也会在统计意义上导致生成错误。

● 评估激励错误行为：多数主流基准（leaderboards）以 0-1/准确率类二元评分为主，惩罚“我不知道/弃答”而奖励“猜测”，因此模型被驱动成为“会答题的猜测者”而非诚实表达不确定性。

● 现有缓解不足：已有针对 hallucination 的评测或技巧无法根本改变被广泛采用的主评分机制带来的激励失配。

>> 具体的解决方案：

● 理论分析方案：把生成错误问题归约为“Is-It-Valid（IIV）”二元分类问题：将生成任务中的“有效/错误输出”视作二元分类并建立数学关系（生成错误率 ≳ 2 × IIV 误判率），从而把生成时的幻觉问题转化为被熟知的分类误差分析框架。

● 评估/制度层面方案：进行“社会—技术”层面的缓解：调整现有主流基准的计分与榜单规则，停止用纯二元0-1评分系统惩罚弃答/不确定，从而改变训练与微调（包括 RLHF/DPO 等）在榜单驱动下的最优策略。

● 修改评估方法：修改现有基准的评分方式，使其不再惩罚不确定的回答，而是奖励适当表达不确定性的行为。这需要对有影响力的排行榜进行社会技术层面的调整，而不仅仅是引入额外的幻觉评估。不要只新增专门的 hallucination 测评，而应先修改那些主导研究与产品方向的主评测（如 leaderboard 所用子集）计分方法，以快速、系统性地改变整个生态的激励。

● 明确置信度目标：在评估指令中明确说明置信度目标，例如，只有在对答案有高于特定阈值的信心时才回答，因为错误答案会受到惩罚，而正确答案会获得奖励，“我不知道”则不奖励不惩罚。

>> 核心思路步骤：

● 连接监督学习与非监督学习：将生成模型的错误与监督学习中的二元分类错误联系起来。通过“Is-It-Valid”（IIV）二元分类问题，将生成错误率与IIV误分类率建立数学关系。

● 分析预训练和后训练阶段：将语言模型的训练分为预训练和后训练两个阶段，分别分析每个阶段中导致幻觉的因素。预训练阶段关注一般性错误，后训练阶段关注过度自信的幻觉。

● 识别统计驱动因素：识别导致幻觉的主要统计驱动因素，包括预训练的起源和后训练的持续存在。

● 建模与形式化：定义错误集 E 与有效输出集 V（可响应文本 X = E ∪ V）并把 prompt 纳入形式化，令分析可覆盖内在（intrinsic）与外在（extrinsic）两类幻觉。

● 归约：构造 Is-It-Valid（IIV）监督问题：把大量样本标注为 +（有效）或 −（错误），展示任何生成器都能被解读为 IIV 分类器，从而把生成错误率与 IIV 的误判率联系起来（给出下界）。

● 统计因素分析：分析导致 IIV 误判的统计原因（可分为可分离情形、模型拟合不足情形、以及根本无可学习规律导致的“不可区分”情形），并用 Good-Turing / missing-mass 类型的直觉说明若训练语料中许多事实只出现一次，则最低幻觉率不可避免。

● 后训练与评估交互：研究后训练（RLHF/RLAIF/DPO 等）为何仍保留过度自信的幻觉：因为训练与评估的目标（leaderboard 得分）本身偏好猜测而非诚实弃答，导致模型学习“有把握就答、无把握也猜”以最大化榜单表现。

● 缓解路径：从统计上、从评估体系上并行发力：一方面可在训练中引入有效的有效性判别/弃答机制（或交互式 validity oracle）；另一方面必须改变主流评测的计分（给 IDK 部分或全部合理分数 / 设计对不确定性的正向激励），以改变长期趋势。

>> 优势：

● 解释幻觉的起源：揭示了幻觉并非神秘现象，而是源于自然统计压力下的二元分类错误。提供了将无监督生成问题归约为监督二元分类的数学工具与下界，把“幻觉为何出现”从经验总结提升为可证明的统计结论。

● 提出可行的解决方案：提出了通过修改评估方法来抑制幻觉的社会技术解决方案，并给出了具体的修改建议。

● 适用于多种模型：该分析适用于多种语言模型，包括推理和搜索-检索语言模型，且不依赖于特定的模型架构。结论与分析不依赖于 Transformer/下一词预测等具体架构，适用于预训练 + 后训练的现代训练范式，覆盖检索增强和推理型模型。

● 可操作性：不仅给出诊断（为什么会发生），也提出可行的制度级缓解（修改榜单评分），具有直接改变研究与开发激励的社会工程价值。

● 量化视角：将幻觉率与语料中“只出现一次的事实”或 IIV 误判率建立定量联系，便于评估改进的潜在上限与局限。

>> 论文的一些结论和观点（侧重经验与建议）：

● 幻觉的根本原因是奖励猜测：现有的评估体系奖励语言模型在不确定时进行猜测，而不是承认不确定性，这是导致幻觉持续存在的原因。幻觉并非神秘现象，而是统计学习问题的自然产物——当错误与事实在训练分布上难以区分时，预训练就会产生不可避免的生成错误。

● 修改评估体系是关键：仅仅增加幻觉评估是不够的，必须修改主要的评估体系，使其不再惩罚不确定性，才能有效抑制幻觉。仅靠增加专门的 hallucination benchmark 不足以根治问题，因为主流评测（leaderboards）数量和影响力决定了模型优化方向。

● 明确置信度目标有助于校准模型：通过在评估中明确说明置信度目标，可以促使语言模型进行行为校准，即在置信度高于目标阈值时才给出答案。

● 优先修改主流评测的计分规则：给予合理的弃答/不确定性分数或设计能够奖励诚实不确定性的评测，从制度上移除“猜测优先”的激励。

● 在模型开发层面应同时采用分类式有效性判别（IIV/validity oracle）、交互式标注与更谨慎的后训练目标，并与评测改进同步推进。

● 承认一致性（avoid invalid outputs）与多样性/广度（生成多样响应）之间的内在权衡；一些理论结果表明，要在不牺牲宽度的前提下完全避免幻觉存在困难，因此制度与评测改造尤为关键。

《Why Language Models Hallucinate》的翻译与解读

Abstract

1、Introduction

Figure 1: Is-It-Valid requires learning to identify valid generations using labeled ± examples (left).Classifiers (dashed lines) may be accurate on certain concepts like spelling (top) but errors often arise due to poor models (middle) or arbitrary facts when there is no pattern in the data (bottom).图1:Is-It-Valid需要学习使用标记的±示例来识别有效代（左）。分类器（虚线）在拼写（上）等某些概念上可能是准确的，但错误往往是由于模型不佳（中）或数据中没有模式时的任意事实（下）造成的。

Table 1:Excerpts from responses to “What was the title of Adam Kalai’s dissertation?” from three popular language models.4 None generated the correct title or year (Kalai, 2001).三个流行语言模型对“亚当·卡莱的博士论文题目是什么？”这一问题的回答摘录。4 没有一个给出正确的题目或年份（卡莱，2001 年）。

6 Conclusions

《Why Language Models Hallucinate》的翻译与解读

地址	https://www.arxiv.org/abs/2509.04664
时间	2025年9月4日
作者	OpenAI

Abstract

Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.

就像面对难题的学生一样，大型语言模型在不确定时有时会猜测，从而生成看似合理实则错误的陈述，而不是承认自己的不确定。这种“幻觉”即使在最先进的系统中也依然存在，并且损害了人们的信任。我们认为，语言模型产生幻觉是因为训练和评估程序奖励猜测而非承认不确定性，我们还分析了现代训练流程中幻觉产生的统计原因。幻觉并非神秘莫测——它们只是二元分类中的错误。如果错误陈述无法与事实区分开来，那么预训练语言模型中的幻觉就会在自然的统计压力下产生。然后我们指出，幻觉之所以持续存在，是因为大多数评估的评分方式——语言模型被优化为优秀的应试者，而不确定时猜测能提高测试成绩。这种对不确定回答进行惩罚的“流行病”只能通过一种社会技术手段来解决：调整现有基准的评分，这些基准虽然存在偏差但主导着排行榜，而不是引入额外的幻觉评估。这一改变可能会引导该领域走向更值得信赖的人工智能系统。

1、Introduction

Language models are known to produce overconfident, plausible falsehoods, which diminish their utility and trustworthiness. This error mode is known as “hallucination,” though it differs fundamen-tally from the human perceptual experience. Despite significant progress, hallucinations continue to plague the field, and are still present in the latest models (OpenAI, 2025a). Consider the prompt:

What is Adam Tauman Kalai’s birthday? If you know, just respond with DD-MM.

On three separate attempts, a state-of-the-art open-source language model1 output three incorrect dates: “03-07”, “15-06”, and “01-01”, even though a response was requested only if known. The correct date is in Autumn. Footnote 4 provides an example of more elaborate hallucinations.

语言模型会生成过度自信且看似合理的错误信息，这降低了它们的实用性和可信度。这种错误模式被称为“幻觉”，尽管它与人类的感知体验有着根本的不同。尽管取得了显著进展，但幻觉问题仍困扰着该领域，甚至在最新的模型（OpenAI，2025a）中也依然存在。考虑以下提示：

亚当·塔曼·卡莱的生日是什么？如果知道，请仅以“日-月”的格式回答。

在三次独立尝试中，一个最先进的开源语言模型1分别给出了三个错误的日期：“03-07”、“15-06”和“01-01”，尽管提示要求只有在知道的情况下才作答。正确日期在秋季。脚注 4 提供了一个更复杂的幻觉示例。

Hallucinations are an important special case of errors produced by language models, which we analyze more generally using computational learning theory (e.g., Kearns and Vazirani, 1994). We consider general sets of errors ℰ, an arbitrary subset of plausible strings ��=ℰ∪��, with the other plausible strings �� being called valid. We then analyze the statistical nature of these errors, and apply the results for the type of errors of interest: plausible falsehoods called hallucinations. Our formalism also includes the notion of a prompt to which a language model must respond.

The distribution of language is initially learned from a corpus of training examples, which inevitably contains errors and half-truths. However, we show that even if the training data were error-free, the objectives optimized during language model training would lead to errors being generated. With realistic training data containing shades of error, one may expect even higher error rates. Thus, our lower bounds on errors apply to more realistic settings, as in traditional computational learning theory (Kearns and Vazirani, 1994).

幻觉是语言模型产生的错误的一个重要特例，我们使用计算学习理论（例如，Kearns 和 Vazirani，1994）对其进行更广泛的分析。我们考虑一般的错误集 ℰ，以及任意子集的合理字符串 ��=ℰ∪��，其中其他合理字符串 �� 被称为有效。然后，我们分析这些错误的统计性质，并将结果应用于感兴趣的错误类型：被称为幻觉的看似合理的错误陈述。我们的形式主义还包括语言模型必须回应的提示这一概念。

语言的分布最初是从训练示例的语料库中学习的，这些语料库不可避免地包含错误和半真半假的内容。然而，我们表明，即使训练数据没有错误，语言模型训练期间优化的目标也会导致错误的产生。由于现实中的训练数据包含不同程度的错误，人们可能会预期错误率会更高。因此，我们对错误的下限适用于更现实的场景，就像传统计算学习理论（Kearns 和 Vazirani，1994）中那样。

Our error analysis is general yet has specific implications for hallucination. It applies broadly, including to reasoning and search-and-retrieval language models, and the analysis does not rely on properties of next-word prediction or Transformer-based neural networks. It only considers the two stages of the modern training paradigm: pretraining and post-training, described below. For hallucinations, taxonomies (Maynez et al., 2020; Ji et al., 2023) often further distinguish intrinsic hallucinations that contradict the user’s prompt, such as:

How many Ds are in DEEPSEEK? If you know, just say the number with no commentary.

DeepSeek-V3 returned “2” or “3” in ten independent trials; Meta AI and Claude 3.7 Sonnet2 performed similarly, including answers as large as “6” and “7”. Our theory also sheds light on extrinsic hallucinations, which contradict the training data or external reality.

我们的错误分析具有普遍性，但对幻觉有特定的含义。它广泛适用，包括推理和搜索与检索语言模型，并且分析不依赖于下一个单词预测或基于 Transformer 的神经网络的特性。它仅考虑现代训练范式的两个阶段：预训练和后训练，如下所述。对于幻觉，分类法（Maynez 等人，2020 年；Ji 等人，2023 年）通常进一步区分与用户提示相矛盾的内在幻觉，例如：

“DEEPSEEK 中有多少个 D？”如果您知道，请只说数字，不要发表评论。

在十次独立试验中，DeepSeek-V3 返回了“2”或“3”；Meta AI 和 Claude 3.7 Sonnet2 的表现类似，包括给出像“6”和“7”这样大的答案。我们的理论也解释了外在幻觉，即与训练数据或外部现实相矛盾的情况。

Figure 1: Is-It-Valid requires learning to identify valid generations using labeled ± examples (left).Classifiers (dashed lines) may be accurate on certain concepts like spelling (top) but errors often arise due to poor models (middle) or arbitrary facts when there is no pattern in the data (bottom).图1:Is-It-Valid需要学习使用标记的±示例来识别有效代（左）。分类器（虚线）在拼写（上）等某些概念上可能是准确的，但错误往往是由于模型不佳（中）或数据中没有模式时的任意事实（下）造成的。

Table 1:Excerpts from responses to “What was the title of Adam Kalai’s dissertation?” from three popular language models.4 None generated the correct title or year (Kalai, 2001).三个流行语言模型对“亚当·卡莱的博士论文题目是什么？”这一问题的回答摘录。4 没有一个给出正确的题目或年份（卡莱，2001 年）。

ChatGPT: (GPT-4o) Adam Tauman Kalai’s Ph.D. dissertation (completed in 2002 at CMU) is entitled: “Boosting, Online Algorithms, and Other Topics in Machine Learning.”
DeepSeek: “Algebraic Methods in Interactive Machine Learning”…at Harvard University in 2005.
Llama: “Efficient Algorithms for Learning and Playing Games”…in 2007 at MIT.

lChatGPT：（GPT-4o）亚当·陶曼·卡莱于 2002 年在卡内基梅隆大学完成的博士论文题目为：“提升算法、在线算法及其他机器学习主题”。

lDeepSeek：“交互式机器学习中的代数方法”……2005 年于哈佛大学。

lLlama：“学习与玩游戏的高效算法”……2007 年于麻省理工学院。

6 Conclusions

This paper demystifies hallucinations in modern language models, from their origin during pretraining to their persistence through post-training. In pretraining, we show that generative errors parallel misclassifications in supervised learning, which are not mysterious, and naturally arise due to the minimization of cross-entropy loss.

Many language model shortcomings can be captured by a single evaluation. For example, overuse of the opener “Certainly” can be addressed by a single “Certainly” eval (Amodei and Fridman, 2024) because starting responses with “Certainly” does not significantly impact other evaluations. In contrast, we argue that the majority of mainstream evaluations reward hallucinatory behavior. Simple modifications of mainstream evaluations can realign incentives, rewarding appropriate expressions of uncertainty rather than penalizing them. This can remove barriers to the suppression of hallucinations, and open the door to future work on nuanced language models, e.g., with richer pragmatic competence (Ma et al., 2025).

本文揭开了现代语言模型中幻觉现象的神秘面纱，从其在预训练期间的起源到在训练后的持续存在。在预训练阶段，我们表明生成错误与监督学习中的误分类类似，并非神秘莫测，而是由于交叉熵损失最小化而自然产生的。
许多语言模型的缺陷可以通过单一评估来捕捉。例如，过度使用开头词“当然”可以通过单一的“当然”评估（Amodei 和 Fridman，2024）来解决，因为以“当然”开头的回答对其他评估影响不大。相反，我们认为主流评估中的大多数都奖励幻觉行为。对主流评估进行简单的修改可以重新调整激励机制，奖励恰当表达不确定性而非对其进行惩罚。这可以消除抑制幻觉的障碍，并为未来开发更细致的语言模型铺平道路，例如具有更丰富语用能力的语言模型（Ma 等人，2025）。