On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models

NeurIPS 2025

1Dept. of Electrical and Computer Engineering, 2IPAI, 3INMC, Seoul National University, Republic of Korea
* Equal contribution     † Corresponding author

Abstract

Large vision-language models (LVLMs), which integrate a vision encoder (VE) with a large language model, have achieved remarkable success across various tasks. However, LVLMs still face crucial challenges such as object hallucination, i.e., generating descriptions of objects that are not present in the input image. Here, we argue that uncertain visual tokens within the VE are a key factor contributing to object hallucination. Our statistical analysis reveals positive correlations between visual tokens with high epistemic uncertainty and the occurrence of hallucinations. Furthermore, we show theoretically and empirically that visual tokens in early VE layers that exhibit large representation deviations under small adversarial perturbations indicate high epistemic uncertainty. Based on these findings, we propose a simple yet effective strategy to mitigate object hallucination by modifying only the VE. Our method comprises an efficient proxy that uses adversarial perturbations to identify uncertain visual tokens, and a masking scheme that suppresses these tokens during self-attention in the middle layers of the VE, reducing their influence on visual encoding and thus alleviating hallucinations. Extensive experiments show that our method significantly reduces object hallucinations in LVLMs and works synergistically with other prior methods.

Motivation

Previous approaches to mitigating object hallucination overlook deficiencies in the vision encoder, the core component responsible for visual perception.

Relationship between uncertain visual tokens and object hallucination

Figure 4: Relationship between uncertain visual tokens and object hallucination.

The x-axis represents the average variance within each bin, and the y-axis shows the corresponding metric scores. The results indicate that higher uncertainty is associated with more object hallucination (p-value < 0.05). Note that higher CHAIR_S and CHAIR_I values and a lower F1 score indicate more severe object hallucination.

Through this statistical analysis, we confirm that uncertain visual tokens contribute to object hallucination in LVLMs.
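A minimal sketch of such a binned correlation analysis, assuming per-image token variances and per-image CHAIR_S scores are already available as arrays (the bin count and the use of a Pearson test are illustrative choices, not necessarily the paper's exact setup):

```python
# Hedged sketch of a binned correlation test between token uncertainty
# and a hallucination metric, as in the analysis behind Figure 4.
import numpy as np
from scipy import stats

def binned_correlation(token_variance, chair_s, num_bins=10):
    """token_variance: per-image average token variance (proxy for uncertainty),
    chair_s: per-image CHAIR_S scores; both 1-D NumPy arrays of equal length."""
    order = np.argsort(token_variance)
    bins = np.array_split(order, num_bins)            # equal-sized bins by variance
    x = [token_variance[b].mean() for b in bins]      # average variance per bin
    y = [chair_s[b].mean() for b in bins]             # average metric per bin
    r, p = stats.pearsonr(x, y)                       # correlation and p-value
    return r, p
```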

Method

Adversarial attack and uncertainty mask

Figure 1: Overall illustration of the adversarial attack and uncertainty mask generation process.

(a) The original image is processed by the vision encoder (VE) to obtain features $f_{orig}$. An adversarial image is created by adding optimizable noise and encoded to produce $f_{attk}$. The noise is optimized using Projected Gradient Descent (PGD) to maximize the mean squared error between $f_{orig}$ and $f_{attk}$.
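A minimal sketch of this PGD step under stated assumptions: the encoder is treated as a function returning a single feature tensor, the step size, budget, and iteration count are illustrative, and pixel values are assumed to lie in [0, 1].

```python
# Sketch of PGD noise optimization that maximizes the feature MSE (Figure 1(a)).
import torch
import torch.nn.functional as F

def pgd_feature_attack(vision_encoder, image, eps=8/255, alpha=2/255, steps=10):
    """Optimize additive noise within an L-inf ball of radius eps so that the
    MSE between clean features f_orig and perturbed features f_attk is maximized."""
    vision_encoder.eval()
    vision_encoder.requires_grad_(False)        # gradients flow only to the noise
    with torch.no_grad():
        f_orig = vision_encoder(image)          # features of the original image

    delta = torch.zeros_like(image).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        f_attk = vision_encoder((image + delta).clamp(0, 1))
        loss = F.mse_loss(f_attk, f_orig)       # objective: maximize feature deviation
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # gradient *ascent* step
            delta.clamp_(-eps, eps)             # project back into the eps-ball
        delta.grad = None
    return (image + delta).clamp(0, 1).detach()
```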

(b) From intermediate layers of the encoder, the norm differences of corresponding features form layer-wise uncertainty maps. These maps are min-max normalized, aggregated, standardized, and thresholded to produce the final binary uncertainty mask $M$, as sketched below.
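A minimal sketch of the mask construction, plus how such a binary mask could be used to suppress attention to uncertain tokens; the per-layer features are assumed to be lists of (num_tokens, dim) tensors, and the mean aggregation and threshold `tau` are illustrative assumptions rather than the paper's exact settings.

```python
# Sketch of uncertainty-map aggregation and binary mask generation (Figure 1(b)).
import torch

def uncertainty_mask(feats_orig, feats_attk, tau=1.0):
    layer_maps = []
    for f_o, f_a in zip(feats_orig, feats_attk):
        u = (f_a - f_o).norm(dim=-1)                     # per-token norm difference
        u = (u - u.min()) / (u.max() - u.min() + 1e-8)   # min-max normalization
        layer_maps.append(u)

    U = torch.stack(layer_maps).mean(dim=0)              # aggregate across layers
    U = (U - U.mean()) / (U.std() + 1e-8)                # standardize
    return U > tau                                       # binary uncertainty mask M

def mask_attention_scores(attn_scores, M):
    """Suppress attention *to* uncertain visual tokens by setting the corresponding
    key positions to -inf before the softmax. attn_scores: (heads, queries, keys)."""
    return attn_scores.masked_fill(M[None, None, :], float("-inf"))
```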

Comparison with MC Dropout


Figure 2: Visual comparison of estimated uncertainty from MC dropout and our method.

We compare our uncertainty map U with MC dropout to assess how well U approximates epistemic uncertainty. As shown in Figure 2, the results indicate that U closely aligns with the uncertainty estimated via MC dropout, demonstrating that U serves as an efficient approximation. On average, it is approximately 5 times faster than MC dropout in practice.
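For reference, a minimal sketch of an MC-dropout baseline of the kind used for this comparison: the encoder is run several times with dropout active, and the per-token variance of the features serves as an epistemic-uncertainty estimate (the number of passes `T` is an illustrative choice).

```python
# Sketch of MC-dropout uncertainty estimation for the vision encoder.
import torch

@torch.no_grad()
def mc_dropout_uncertainty(vision_encoder, image, T=10):
    vision_encoder.train()                     # keep dropout layers stochastic
    feats = torch.stack([vision_encoder(image) for _ in range(T)])
    vision_encoder.eval()
    return feats.var(dim=0).mean(dim=-1)       # per-token variance, averaged over channels
```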

Quantitative Results


Table 1: Quantitative results on CHAIR and POPE benchmarks.

Object hallucination is evaluated on the CHAIR and POPE benchmarks using three LVLMs and five decoding strategies, both with and without our method. POPE results are reported on three splits: Random, Popular, and Adversarial. The maximum token length is set to 512. Δ% denotes the relative difference in performance. ↑ / ↓ indicate that higher / lower values are better. We highlight the best scores in bold.
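For context, a hedged sketch of how CHAIR-style metrics are typically computed, assuming the mentioned and ground-truth objects have already been extracted per image; this is not the paper's evaluation code.

```python
# Sketch of sentence-level (CHAIR_S) and instance-level (CHAIR_I) metrics.
def chair_metrics(mentioned_objects, gt_objects):
    """mentioned_objects: list of sets of objects mentioned in each caption,
    gt_objects: list of sets of ground-truth objects for each image."""
    hallucinated_captions = 0
    hallucinated_instances = 0
    total_instances = 0
    for mentioned, gt in zip(mentioned_objects, gt_objects):
        halluc = mentioned - gt                            # objects not in the image
        hallucinated_captions += int(len(halluc) > 0)
        hallucinated_instances += len(halluc)
        total_instances += len(mentioned)
    chair_s = hallucinated_captions / len(mentioned_objects)   # fraction of captions with hallucination
    chair_i = hallucinated_instances / max(total_instances, 1) # fraction of hallucinated object mentions
    return chair_s, chair_i
```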

Qualitative Results


Figure 7: Qualitative results of our method on LLaVA-1.5-7B and Shikra-7B.

Greedy decoding leads to object hallucinations by describing non-existent objects in the image (e.g., ‘several people’, ‘bench’, ‘handbag’, ‘passengers’ in LLaVA; ‘a few cars’, ‘car’, ‘a fire hydrant’ in Shikra). In contrast, our method, which modifies only the vision encoder, substantially reduces such hallucinations.

BibTeX

@article{seo2025epistemic,
  author    = {Seo, Hoigi and Kang, Dong Un and Cho, Hyunjin and Lee, Joohoon and Chun, Se Young},
  title     = {On Epistemic Uncertainty of Visual Tokens for Object Hallucinations in Large Vision-Language Models},
  journal   = {Advances in Neural Information Processing Systems},
  year      = {2025},
}