Presentation Information
[1Yin-B-43] Japanese Video-QA: A Benchmark for Evaluating Video Understanding of Japanese Culture
〇Yudai Mine1, Takuya Shintate1, Kazuya Takahashi1 (1. NABLAS Inc.)
Keywords:
Multimodal Large Language Model, Video Question Answering, Benchmark Construction, Japanese Cultural Understanding
We propose Japanese Video-QA, a video question-answering benchmark specialized for Japanese cultural content. The dataset consists of 800 question-answer pairs constructed from 428 YouTube videos about Japan (219 short videos under 4 minutes and 209 medium-length videos of 4 to 20 minutes). Questions were generated by Gemini 2.5 Flash and then verified and corrected by humans. The videos cover six domains (100 subdomains): seasonal events, tourist attractions, traditional culture, food culture, nature and landscapes, and pop culture. Questions are organized into five categories (spatial, counting, action, temporal, and causal) with three answer formats: open-ended, multiple-choice, and yes/no. We adopt LLM-as-a-Judge evaluation using GPT-4o, scoring each response as 1 (incorrect), 2 (partially correct), or 3 (correct). Among the seven MLLMs we evaluated, Gemini 3 Pro achieved the highest performance, with a mean score of 2.61 (76.3% of responses scored 3). In contrast, the open-source models Qwen3-VL-8B-Instruct and Phi-4-multimodal-instruct scored 2.24 (56.4%) and only 1.74 (32.4%), respectively. Our benchmark is the first attempt to quantitatively evaluate MLLM video understanding in the specific domain of Japanese culture, providing guidance for future model improvements.
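The two figures reported per model, the mean judge score and the share of responses scored 3, can be illustrated with a minimal sketch. It assumes the judge returns one integer score from 1 to 3 per response. The function name `summarize_scores` and the toy scores are hypothetical and are not taken from the benchmark's code or data.

```python
from collections import Counter

# Score labels follow the 3-point scale described in the abstract.
SCORE_LABELS = {1: "incorrect", 2: "partially correct", 3: "correct"}


def summarize_scores(scores):
    """Return (mean score, percentage of responses scored 3).

    Hypothetical helper for illustration only.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if any(s not in SCORE_LABELS for s in scores):
        raise ValueError("each score must be 1, 2, or 3")
    counts = Counter(scores)
    mean = sum(scores) / len(scores)
    pct_correct = 100.0 * counts[3] / len(scores)
    return mean, pct_correct


# Hypothetical judge scores for ten responses (not data from the benchmark).
toy_scores = [3, 3, 2, 1, 3, 2, 3, 3, 1, 3]
mean, pct_correct = summarize_scores(toy_scores)
print(f"mean score = {mean:.2f}, scored 3 = {pct_correct:.1f}%")
# prints "mean score = 2.40, scored 3 = 60.0%"
```

Reporting both numbers is useful because the mean alone cannot distinguish many partially correct answers from a mix of fully correct and incorrect ones.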
