
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

¹UC Santa Cruz, ²Stanford University, ³UC Santa Barbara
Figure: (a) Example of outputs from a reasoning model and a non-reasoning model on a perception task; red highlights indicate visual hallucination. Multimodal reasoning models are generally more prone to amplifying hallucinations during the reasoning process than their non-reasoning counterparts. (b) Performance of different models on reasoning and perception tasks in the RH-Bench dataset; better-performing models sit toward the upper right. Baseline non-reasoning models of varying scales typically exhibit weaker reasoning capabilities and fewer hallucinations, whereas reasoning models display the opposite trend.

Abstract

Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, we observe that this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more on language priors. Attention analysis reveals that longer reasoning chains reduce focus on visual inputs, contributing to hallucination. To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model's perception accuracy changes with reasoning length, enabling evaluation of whether the model preserves visual grounding while reasoning. We also release RH-Bench, a diagnostic benchmark covering diverse multimodal tasks, designed to jointly assess the balance between reasoning ability and hallucination. We find that (i) larger models generally exhibit a better balance between reasoning and perception, and (ii) this balance depends more on the types and domains of the training data than on its volume. Our findings highlight the need for evaluation frameworks that account for both reasoning quality and perceptual reliability.
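
For illustration only, here is a minimal sketch of how an RH-AUC-style score might be computed. The exact formulation is the paper's; this sketch simply assumes the score is the normalized area under the curve of perception accuracy traced out as reasoning accuracy varies with thinking length, and the function name and trapezoidal integration are our own.

import numpy as np

def rh_auc(reasoning_acc, perception_acc):
    """Normalized area under the perception-vs-reasoning accuracy curve.

    Both arguments hold accuracies in [0, 1], measured at the same set of
    reasoning-length settings (e.g. thinking budgets of 0, 100, ..., 600 tokens).
    A score near 1 means perception stays strong across the whole reasoning range.
    """
    x = np.asarray(reasoning_acc, dtype=float)
    y = np.asarray(perception_acc, dtype=float)
    order = np.argsort(x)                      # sweep from weakest to strongest reasoning
    x, y = x[order], y[order]
    area = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0)  # trapezoidal rule
    return float(area / max(x[-1] - x[0], 1e-8))              # normalize by the x-span

# Example: perception accuracy erodes as longer thinking lifts reasoning accuracy.
print(rh_auc([0.35, 0.45, 0.55, 0.60], [0.80, 0.72, 0.65, 0.58]))  # ~0.70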

Takeaway 1: Reasoning Models Amplify Visual Hallucinations
Across training paradigms and model scales, multimodal reasoning models exhibit a consistent drop in accuracy and a rise in hallucination rates on general visual benchmarks.
Takeaway 2: Why Do Reasoning Models Amplify Hallucinations?
Reduced visual attention in reasoning models amplifies visual hallucinations. Longer reasoning chains further degrade attention to visual information and shift focus toward linguistic priors.
Takeaway 3: Moderate Reasoning Length Strikes the Best Reasoning-Hallucination Balance
Reasoning length exerts a non-monotonic effect on model performance: both insufficient and excessive reasoning degrade accuracy, and the optimal length is task-dependent.

Multimodal Reasoning Can Amplify Visual Hallucination

Figure: Across training paradigms and model scales, multimodal reasoning models exhibit a consistent drop in accuracy and a rise in hallucination rates on general visual benchmarks.
Figure: Two common types of hallucination patterns observed in multimodal reasoning models. (a) corresponds to hallucinations caused by visual misrecognition, while (b) reflects hallucinations arising from reasoning biases. Hallucinated spans are highlighted in red.

Why Do Reasoning Models Amplify Hallucinations?

Figure: Attention allocation and visual grounding in reasoning versus non-reasoning models. Reduced visual attention in reasoning models amplifies visual hallucinations.
Figure: Attention shift in the reasoning model under different reasoning lengths. In normal thinking, the model generates outputs as typically expected, whereas in overthinking, the reasoning length is extended using latent state steering (see the Method section). Longer reasoning chains further degrade attention to visual information and shift focus toward linguistic priors.
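
To make the attention diagnostic concrete, here is a small sketch (not the paper's code). It assumes you have already extracted, for each generated token, its attention weights over the full context and know which positions hold the image tokens; the extraction itself depends on the model and is omitted.

import torch

def visual_attention_share(step_attn: torch.Tensor, image_span) -> float:
    """Fraction of one generated token's attention mass that lands on image tokens.

    step_attn:  (n_layers, n_heads, context_len) attention weights of a single
                newly generated token over the full context.
    image_span: (start, end) indices of the image patch tokens in that context.
    """
    attn = step_attn.mean(dim=(0, 1))       # average over layers and heads
    start, end = image_span
    return (attn[start:end].sum() / attn.sum()).item()

# Usage idea: compute the share at every decoding step and check whether it decays
# as the reasoning chain grows, comparing a reasoning model with its non-reasoning
# counterpart on the same prompts.
# shares = [visual_attention_share(a, (start, end)) for a in per_step_attentions]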

Effects of Reasoning Length on Reasoning-Hallucination Balance

We explore the impact of reasoning length on the balance between hallucination and reasoning. We provide an overview of the proposed control strategies, including latent state steering, budget forcing, and test-time scaling, then identify the optimal generation length for various benchmarks and analyze how the reasoning-hallucination balance shifts as reasoning length varies.
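
As a rough illustration of one of these strategies, the sketch below implements budget forcing: cap the thinking segment at a fixed token budget, then force the end-of-thinking marker so the model must commit to an answer. The delimiter string and the generate interface are placeholders, not the paper's actual implementation.

THINK_END = "</think>"   # assumed end-of-reasoning delimiter; model-specific in practice

def budget_forced_generate(model, prompt, budget_tokens, max_answer_tokens=128):
    """Generate with the reasoning chain truncated to roughly `budget_tokens`.

    `model.generate(text, max_new_tokens=..., stop=...)` stands in for whatever
    decoding interface is available; it is not a specific library API.
    """
    thinking = model.generate(prompt, max_new_tokens=budget_tokens, stop=[THINK_END])
    if THINK_END not in thinking:      # budget exhausted before the model stopped thinking
        thinking += THINK_END          # force the transition from reasoning to answering
    return model.generate(prompt + thinking, max_new_tokens=max_answer_tokens)

Lengthening reasoning works the same way in reverse: suppress the delimiter and append a continuation cue (e.g. "Wait") until the desired thinking budget is reached.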

Figure: Reasoning-hallucination balance of multimodal reasoning models under varying reasoning lengths. Thinking lengths are controlled within 0–600 tokens for reasoning tasks and 0–300 tokens for hallucination tasks, reflecting the longer chains required for reasoning and the shorter ones sufficient for hallucination evaluation.

Evaluation on the Reasoning-Hallucination Balance

Figure: (a) Accuracy trends on the RH-Bench reasoning task across different reasoning lengths for 3B and 7B models; larger models typically exhibit more stable performance across varying reasoning lengths. (b) Comparison of SFT+RL and RL-only training paradigms in terms of RH-AUC, with arrow directions indicating the increase in reasoning length for SFT+RL relative to RL-only; RL-only training tends to produce more concise reasoning chains, leading to a better perception-hallucination balance. (c) Case study comparing RL-only and SFT+RL models; SFT+RL models often introduce rigid, imitative reasoning paths that limit the flexibility of visual reasoning.

BibTeX


@misc{liu2025thinkingseeingassessingamplified,
  title={More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models},
  author={Chengzhi Liu and Zhongxing Xu and Qingyue Wei and Juncheng Wu and James Zou and Xin Eric Wang and Yuyin Zhou and Sheng Liu},
  year={2025},
  eprint={2505.21523},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.21523},
}