Presentation Information

[4B-04] Q-Frame: A Plug-and-play Question-related Frame Extraction Approach for Long Video Question Answering

*Li Zhi1, Wan Yanan1, Niu Hao1, Vizcarra Julio1, 多屋 優人1 (1. KDDI Research, Inc.)
Presenter category: General
Paper type: Short paper
Interactive presentation: Yes

Keywords:

Multimodal Large Language Model, Long Video Question Answering, Memory Network

Memory networks have been introduced into multimodal large language models (MLLMs) to facilitate fast inference and compact memory footprints for long video question answering (LVideoQA).
However, current memory-based MLLMs simply sample frames at a fixed stride and ignore the correlation between the extracted frames and the question, which reduces their LVideoQA performance. Additionally, the scalability of these methods is limited by the optimization required when combining them with existing MLLMs. We propose a novel plug-and-play question-related frame extraction approach, Q-Frame, which collects only frames related to the question and plugs into existing MLLMs without extra trainable parameters. Experimental results demonstrate that Q-Frame improves the accuracy of the LLaVA-Video-7B and LLaVA-OneVision-7B models by 3.7% and 2.0%, respectively, on the VideoMME LVideoQA task.
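
The contrast between fixed-stride sampling and question-related frame extraction can be illustrated with a minimal sketch. The abstract does not describe how Q-Frame scores relevance, so everything below is an assumption made for illustration: frames and the question are taken to be already embedded in a shared vector space by some pre-trained encoder, relevance is taken to be cosine similarity, and the function names `uniform_sample` and `question_related_sample` are hypothetical. Like the approach described above, the sketch involves no trainable parameters, but it is not the authors' method.

```python
import math
from typing import List, Sequence


def uniform_sample(num_frames: int, k: int) -> List[int]:
    """Baseline: pick k frame indices at a fixed stride, ignoring the question."""
    if k <= 0 or num_frames <= 0:
        return []
    k = min(k, num_frames)
    stride = num_frames / k
    return [int(i * stride) for i in range(k)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def question_related_sample(
    frame_embs: Sequence[Sequence[float]],
    question_emb: Sequence[float],
    k: int,
) -> List[int]:
    """Assumed scoring rule: keep the k frames most similar to the question.

    The selected indices are returned in temporal order so that a downstream
    model still sees the frames as a chronological sequence.
    """
    if k <= 0:
        return []
    scores = [cosine(f, question_emb) for f in frame_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data (made up): 8 frames with 2-D embeddings; only frames 5 and 6
# point in roughly the same direction as the question embedding.
frames = [
    [0.0, 1.0], [0.0, 1.0], [0.1, 1.0], [0.0, 1.0],
    [0.2, 1.0], [1.0, 0.1], [1.0, 0.0], [0.0, 1.0],
]
question = [1.0, 0.0]

print(uniform_sample(len(frames), 2))                 # [0, 4]
print(question_related_sample(frames, question, 2))   # [5, 6]
```

In this toy case the fixed-stride baseline misses both relevant frames, while the similarity-based selection keeps them. Because the selection step only produces a list of frame indices, it can sit in front of an existing MLLM without changing the model itself.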