Presentation Information
[5G3-OS-37b-01] Enhancing Explainability of Video-Grounded Dialogue Question Answering with Vision-Language Models via Self-Reflection and Commonsense Knowledge Graph
〇Kota Hiwara1, Takeshi Morita1 (1. Aoyama Gakuin University)
Keywords: Vision-Language Model, Large Language Model, Commonsense Knowledge Graph
Video-grounded dialogue question answering with a Vision-Language Model (VLM) is a challenging reasoning task that requires understanding temporal changes in videos as well as the intent of a question grounded in the dialogue history. However, such models are prone to incorrect answers, and their generated responses often lack a clear supporting rationale, which makes systematic analysis of error causes difficult. In this study, we focus on the video-grounded dialogue benchmark dataset VDAct and aim to improve answer accuracy and explainability through automatic error diagnosis and answer refinement. First, we design and implement two automatic error classification methods for the rationales produced by the evaluation framework VDEval, and select the more appropriate one by comparing their agreement with human annotations. Next, to suppress hallucinations caused by visual misrecognition, we propose a self-reflection-based method for verifying and revising answers. In addition, to compensate for insufficient visual evidence or contextual reasoning, we introduce a question answering approach augmented with an external commonsense knowledge graph queried on the basis of the input question. Finally, we analyze the changes in VDEval scores and the error cases that were resolved, and discuss the effectiveness, limitations, and explainability of the proposed methods.
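As an illustration of the method-selection step described in the abstract, the sketch below compares two automatic error classifiers against human annotations and keeps the one with higher agreement. The abstract does not state which agreement measure is used; Cohen's kappa is assumed here, and the function names, method names, and error labels are all hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length sequences of category labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def select_classifier(human_labels, candidates):
    """Return (name, kappa) for the candidate that agrees best with the human labels.

    `candidates` maps a method name to that method's predicted labels.
    """
    scores = {name: cohen_kappa(human_labels, preds) for name, preds in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    # Hypothetical error categories, for illustration only.
    human = ["visual", "visual", "context", "context"]
    candidates = {
        "method_a": ["visual", "visual", "context", "context"],
        "method_b": ["visual", "context", "context", "context"],
    }
    print(select_classifier(human, candidates))
```

Kappa is one reasonable choice because it corrects raw agreement for chance; plain accuracy or another agreement statistic could be substituted without changing the selection logic.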
