Presentation Information

[1Yin-A-15] Multimodal Large Language Model based on Compressed Video Representation for Video Understanding

〇Daichi Yashima1,3, Shuhei Kurita2, Yusuke Oda3, Komei Sugiura1 (1. Keio University, 2. NII, 3. NII LLMC)

Keywords:

multimodal large language model, video understanding, compressed video representation

In this study, we focus on video understanding with multimodal large language models (MLLMs). This task is challenging because processing a full stream of RGB frames is computationally intractable and highly redundant, as self-attention has quadratic complexity with respect to sequence length. We propose a video MLLM that processes videos efficiently by operating directly on their compressed representations. A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, removing the need to decode sequential RGB frames. These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding. To address the noise and low fidelity of block-based motion vectors, we introduce a module that denoises them and generates a fine-grained motion representation. Furthermore, our model compresses these features with a cost that scales linearly with sequence length. We demonstrate the effectiveness of our approach on multiple challenging benchmarks, including VideoMME, LongVideoBench, NExT-QA, MLVU, and Perception Test, where it outperforms baseline methods.
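To make the two ideas concrete, the Python sketch below illustrates (1) refining a noisy block-level motion field into a fine-grained motion map with a small residual CNN, and (2) compressing visual tokens with fixed-window average pooling, whose cost is linear in sequence length. This is a minimal sketch under assumed names, shapes, and hyperparameters (MotionRefiner, compress_tokens, a 16x16 macroblock grid, pooling stride 4); the abstract does not specify the authors' actual architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionRefiner(nn.Module):
    """Upsamples and denoises block-level motion vectors (here, one 2-D
    vector per assumed 16x16 macroblock) into a dense motion map."""
    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, hidden, 3, padding=1), nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.GELU(),
            nn.Conv2d(hidden, 2, 3, padding=1),
        )

    def forward(self, mv: torch.Tensor, out_hw: tuple) -> torch.Tensor:
        # mv: (B, 2, H/16, W/16) block motion field. Bilinear upsampling
        # followed by a residual CNN acts as the denoising/refinement step.
        dense = F.interpolate(mv, size=out_hw, mode="bilinear", align_corners=False)
        return dense + self.net(dense)

def compress_tokens(tokens: torch.Tensor, stride: int = 4) -> torch.Tensor:
    # tokens: (B, T, D). Pooling touches each token a constant number of
    # times, so the cost scales linearly with T, unlike the quadratic
    # cost of full self-attention over all frames.
    return F.avg_pool1d(tokens.transpose(1, 2), kernel_size=stride, stride=stride).transpose(1, 2)

if __name__ == "__main__":
    mv = torch.randn(1, 2, 14, 14)        # noisy macroblock motion field
    fine = MotionRefiner()(mv, out_hw=(224, 224))
    feats = torch.randn(1, 1024, 768)     # visual tokens from sparse keyframes
    print(fine.shape, compress_tokens(feats).shape)

Average pooling is only one stand-in for the unspecified linear-time compression: any operator that visits each token a constant number of times (pooling, strided convolution, linear attention) would preserve the linear scaling the abstract claims.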