Presentation Information

[1Yin-B-43] Japanese Video-QA: A Benchmark for Evaluating Video Understanding of Japanese Culture

〇Yudai Mine1, Takuya Shintate1, Kazuya Takahashi1 (1. NABLAS Inc.)

Keywords:

Multimodal Large Language Model, Video Question Answering, Benchmark Construction, Japanese Cultural Understanding

We propose Japanese Video-QA, a video question-answering benchmark specialized for Japanese cultural content. The dataset consists of 800 question-answer pairs constructed from 428 YouTube videos about Japan (219 short videos under 4 minutes and 209 medium-length videos of 4 to 20 minutes). The questions were generated by Gemini 2.5 Flash and then verified and corrected by humans. The videos cover six domains (100 subdomains): seasonal events, tourist attractions, traditional culture, food culture, nature and landscapes, and pop culture. Questions are organized into five categories (spatial, counting, action, temporal, and causal) with three answer formats: open-ended, multiple-choice, and yes/no. We adopt LLM-as-a-Judge evaluation using GPT-4o, scoring each response as 1 (incorrect), 2 (partially correct), or 3 (correct). We evaluated seven MLLMs, among which Gemini 3 Pro achieved the highest performance with a mean score of 2.61 (76.3% of responses scored 3). In contrast, the open-source models Qwen3-VL-8B-Instruct and Phi-4-multimodal-instruct scored 2.24 (56.4%) and only 1.74 (32.4%), respectively. Our benchmark is the first attempt to quantitatively evaluate MLLM video understanding in the specific domain of Japanese culture, providing guidance for future model improvements.
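
The two figures reported per model (the mean score and the share of responses scored 3) can be illustrated with a minimal sketch. It assumes the judge scores are already available as a list of integers in {1, 2, 3}; the function name `summarize_scores` and the sample scores are hypothetical and are not taken from the benchmark's code or data.

```python
from collections import Counter


def summarize_scores(scores):
    """Aggregate per-response judge scores (hypothetical helper).

    Each score is assumed to be 1 (incorrect), 2 (partially correct),
    or 3 (correct), following the scale described in the abstract.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if any(s not in (1, 2, 3) for s in scores):
        raise ValueError("each score must be 1, 2, or 3")

    counts = Counter(scores)
    n = len(scores)
    return {
        "mean_score": sum(scores) / n,          # average on the 1-3 scale
        "pct_correct": 100.0 * counts[3] / n,   # share of responses scored 3
    }


if __name__ == "__main__":
    # Made-up scores for illustration only, not benchmark results.
    print(summarize_scores([3, 3, 2, 1]))
```

Under this reading, the percentage in parentheses counts only fully correct responses, while the mean also gives partial credit to responses scored 2.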