The 40th Annual Conference of the Japanese Society for Artificial Intelligence, 2026

Presentation Information

[4Yin-A-50] Mitigating Visual Attention Sink and Reducing Hallucination Using Soft Registers in Large Vision-Language Models

〇Satsuki Tamura¹, Takashi Shibata¹, Atsumasa Tsukie¹, Jun Miyazaki¹, Shota Takano¹, Kazuko Nakayama¹ (1. NTT EAST, Inc)

Keywords:

multi-modal, LVLM, Visual Attention Sink, Register Tokens, Interpretability

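Large Vision-Language Models (LVLMs) are known to exhibit the Visual Attention Sink (VAS) phenomenon, in which
attention concentrates excessively on irrelevant regions such as the image background. VAS causes small objects to be
missed and nonexistent objects to be falsely detected (hallucination). This study proposes a method that inserts
learnable Soft Registers immediately after the image token sequence so that they absorb the unnecessary attention that
arises when the LLM processes image tokens, thereby mitigating the effect of VAS. In an evaluation on a hazard
prediction task for telecommunications equipment maintenance, the proposed method achieved an F1 score of 0.876, an
improvement of 8.5 points over the Baseline's 0.791. Detection accuracy improved substantially for small objects in
particular (arrow signs, cones, etc.), and a hallucination-suppressing effect was confirmed.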
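
The token layout described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the number of registers, their initialization, or how they are trained, so the function names (`init_soft_registers`, `insert_soft_registers`, `attention_weights`), the Gaussian initialization, and the toy dimensions below are all assumptions made for illustration. The sketch only shows where the registers sit in the sequence and how the share of attention falling on them could be measured.

```python
import math
import random


def init_soft_registers(num_registers, dim, seed=0):
    """Create register embeddings (hypothetical init; in practice these would be trained)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_registers)]


def insert_soft_registers(embeddings, image_end, registers):
    """Insert register embeddings immediately after the last image token.

    embeddings: list of per-token vectors (text and image tokens, in order)
    image_end:  index one past the last image token
    """
    return embeddings[:image_end] + registers + embeddings[image_end:]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of a single query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy sequence: 2 prompt tokens, 4 image tokens, 3 question tokens (all values assumed).
dim = 8
prompt = [[0.1] * dim for _ in range(2)]
image = [[0.2] * dim for _ in range(4)]
question = [[0.3] * dim for _ in range(3)]
sequence = prompt + image + question

registers = init_soft_registers(num_registers=2, dim=dim)
image_end = len(prompt) + len(image)
new_sequence = insert_soft_registers(sequence, image_end, registers)

# Share of the last token's attention that lands on the register positions.
weights = attention_weights(new_sequence[-1], new_sequence)
register_share = sum(weights[image_end:image_end + len(registers)])
```

In a trained model the register embeddings would be optimized so that attention otherwise drawn to irrelevant image tokens is absorbed at these positions; with the untrained values used here, `register_share` has no meaningful interpretation beyond showing how the quantity is computed.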
