On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models

NeurIPS 2025

1Dept. of Electrical and Computer Engineering, 2IPAI, 3INMC, Seoul National University, Republic of Korea
* Equal contribution     † Corresponding author

Abstract

Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, LVLMs still face crucial challenges such as object hallucination, i.e., generating descriptions of objects that are not present in the input image. Here, we argue that uncertain visual tokens within the VE are a key factor contributing to object hallucination. Our statistical analysis reveals positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations indicate high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy to mitigate object hallucination by modifying only the VE. Our method comprises an efficient proxy that uses adversarial perturbations to identify uncertain visual tokens, and a masking scheme that suppresses these tokens during self-attention in the middle layers of the VE, reducing their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and works synergistically with other prior methods.

Motivation

Previous approaches to mitigating object hallucination overlook deficiencies in the vision encoder, the core component responsible for visual perception.

Relationship between uncertain visual tokens and object hallucination

Figure 4: Relationship between uncertain visual tokens and object hallucination.

The x-axis represents the average variance within each bin, and the y-axis shows the corresponding metric scores. The results indicate that higher uncertainty is associated with more object hallucination (p-value < 0.05). Note that higher CHAIR_S and CHAIR_I values and a lower F1 score indicate more severe object hallucination.

Through this statistical analysis, we confirm that uncertain visual tokens contribute to object hallucination in LVLMs.
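A minimal sketch of such a binned correlation analysis, assuming per-image token variances and per-image CHAIR_S scores are already available as arrays (the bin count and the use of a Pearson test are illustrative choices, not necessarily the paper's exact setup):

```python
# Hedged sketch of a binned correlation test between token uncertainty
# and a hallucination metric, as in the analysis behind Figure 4.
import numpy as np
from scipy import stats

def binned_correlation(token_variance, chair_s, num_bins=10):
    """token_variance: per-image average token variance (proxy for uncertainty),
    chair_s: per-image CHAIR_S scores; both 1-D NumPy arrays of equal length."""
    order = np.argsort(token_variance)
    bins = np.array_split(order, num_bins)            # equal-sized bins by variance
    x = [token_variance[b].mean() for b in bins]      # average variance per bin
    y = [chair_s[b].mean() for b in bins]             # average metric per bin
    r, p = stats.pearsonr(x, y)                       # correlation and p-value
    return r, p
```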

Method

Adversarial attack and uncertainty mask

Figure 1: Overall illustration of the adversarial attack and uncertainty mask generation process.

(a) The original image is processed by the vision encoder (VE) to obtain features $f_{orig}$. An adversarial image is created by adding optimizable noise and encoded to produce $f_{attk}$. The noise is optimized using Projected Gradient Descent (PGD) to maximize the mean squared error between $f_{orig}$ and $f_{attk}$.
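A minimal sketch of this PGD step under stated assumptions: the encoder is treated as a function returning a single feature tensor, the step size, budget, and iteration count are illustrative, and pixel values are assumed to lie in [0, 1].

```python
# Sketch of PGD noise optimization that maximizes the feature MSE (Figure 1(a)).
import torch
import torch.nn.functional as F

def pgd_feature_attack(vision_encoder, image, eps=8/255, alpha=2/255, steps=10):
    """Optimize additive noise within an L-inf ball of radius eps so that the
    MSE between clean features f_orig and perturbed features f_attk is maximized."""
    vision_encoder.eval()
    vision_encoder.requires_grad_(False)        # gradients flow only to the noise
    with torch.no_grad():
        f_orig = vision_encoder(image)          # features of the original image

    delta = torch.zeros_like(image).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        f_attk = vision_encoder((image + delta).clamp(0, 1))
        loss = F.mse_loss(f_attk, f_orig)       # objective: maximize feature deviation
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # gradient *ascent* step
            delta.clamp_(-eps, eps)             # project back into the eps-ball
        delta.grad = None
    return (image + delta).clamp(0, 1).detach()
```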

(b) From intermediate layers of the encoder, the norm differences of corresponding features form layer-wise uncertainty maps. These maps are min-max normalized, aggregated, standardized, and thresholded to produce the final binary uncertainty mask $M$, as sketched below.
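A minimal sketch of the mask construction, plus how such a binary mask could be used to suppress attention to uncertain tokens; the per-layer features are assumed to be lists of (num_tokens, dim) tensors, and the mean aggregation and threshold `tau` are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch of uncertainty-map aggregation and binary mask generation (Figure 1(b)).
import torch

def uncertainty_mask(feats_orig, feats_attk, tau=1.0):
    layer_maps = []
    for f_o, f_a in zip(feats_orig, feats_attk):
        u = (f_a - f_o).norm(dim=-1)                     # per-token norm difference
        u = (u - u.min()) / (u.max() - u.min() + 1e-8)   # min-max normalization
        layer_maps.append(u)

    U = torch.stack(layer_maps).mean(dim=0)              # aggregate across layers
    U = (U - U.mean()) / (U.std() + 1e-8)                # standardize
    return U > tau                                       # binary uncertainty mask M

def mask_attention_scores(attn_scores, M):
    """Suppress attention *to* uncertain visual tokens by setting the corresponding
    key positions to -inf before the softmax. attn_scores: (heads, queries, keys)."""
    return attn_scores.masked_fill(M[None, None, :], float("-inf"))
```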

Comparison with MC Dropout


Figure 2: Visual comparison of estimated uncertainty from MC dropout and our method.

We compare our uncertainty map U with MC dropout to assess how well U approximates epistemic uncertainty. As shown in Figure 2, the results indicate that U closely aligns with the uncertainty estimated via MC dropout, demonstrating that U serves as an efficient approximation. On average, it is approximately 5 times faster than MC dropout in practice.
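For reference, a minimal sketch of an MC-dropout baseline of the kind used for this comparison: the encoder is run several times with dropout active, and the per-token variance of the features serves as an epistemic-uncertainty estimate (the number of passes `T` is an illustrative choice).

```python
# Sketch of MC-dropout uncertainty estimation for the vision encoder.
import torch

@torch.no_grad()
def mc_dropout_uncertainty(vision_encoder, image, T=10):
    vision_encoder.train()                     # keep dropout layers stochastic
    feats = torch.stack([vision_encoder(image) for _ in range(T)])
    vision_encoder.eval()
    return feats.var(dim=0).mean(dim=-1)       # per-token variance, averaged over channels
```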

Quantitative Results


Table 1: Quantitative results on CHAIR and POPE benchmarks.

Object hallucination is evaluated on the CHAIR and POPE benchmarks using three LVLMs and five decoding strategies, both with and without our method. POPE results are reported on three splits: Random, Popular, and Adversarial. The maximum token length is set to 512. Δ% denotes the relative difference in performance. ↑ / ↓ indicate that higher / lower values are better. We highlight the best scores in bold.
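For context, a hedged sketch of how CHAIR-style metrics are typically computed, assuming the mentioned and ground-truth objects have already been extracted per image; this is not the paper's evaluation code.

```python
# Sketch of sentence-level (CHAIR_S) and instance-level (CHAIR_I) metrics.
def chair_metrics(mentioned_objects, gt_objects):
    """mentioned_objects: list of sets of objects mentioned in each caption,
    gt_objects: list of sets of ground-truth objects for each image."""
    hallucinated_captions = 0
    hallucinated_instances = 0
    total_instances = 0
    for mentioned, gt in zip(mentioned_objects, gt_objects):
        halluc = mentioned - gt                            # objects not in the image
        hallucinated_captions += int(len(halluc) > 0)
        hallucinated_instances += len(halluc)
        total_instances += len(mentioned)
    chair_s = hallucinated_captions / len(mentioned_objects)   # fraction of captions with hallucination
    chair_i = hallucinated_instances / max(total_instances, 1) # fraction of hallucinated object mentions
    return chair_s, chair_i
```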

Qualitative Results


Figure 7: Qualitative results of our method on LLaVA-1.5-7B and Shikra-7B.

Greedy decoding leads to object hallucinations by describing non-existent objects in the image (e.g., ‘several people’, ‘bench’, ‘handbag’, ‘passengers’ in LLaVA; ‘a few cars’, ‘car’, ‘a fire hydrant’ in Shikra). In contrast, our method, which modifies only the vision encoder, substantially reduces such hallucinations.

BibTeX

@article{seo2025epistemic,
  author    = {Seo, Hoigi and Kang, Dong Un and Cho, Hyunjin and Lee, Joohoon and Chun, Se Young},
  title     = {On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models},
  journal   = {Advances in Neural Information Processing Systems},
  year      = {2025},
}