The 40th Annual Conference of the Japanese Society for Artificial Intelligence, 2026

Presentation Information

[1Yin-B-43] Japanese Video-QA: A Benchmark for Evaluating Video Understanding of Japanese Culture

〇Yudai Mine¹, Takuya Shintate¹, Kazuya Takahashi¹ (1. NABLAS Inc.)

Keywords:

Multimodal Large Language Model, Video Question Answering, Benchmark Construction, Japanese Cultural Understanding

We propose Japanese Video-QA, a video question-answering benchmark specialized for Japanese cultural content. The dataset consists of 800 question-answer pairs constructed from 428 YouTube videos about Japan (219 short videos under 4 minutes and 209 medium-length videos of 4 to 20 minutes). Questions were generated by Gemini 2.5 Flash and then verified and corrected by humans. The videos cover six domains (100 subdomains): seasonal events, tourist attractions, traditional culture, food culture, nature and landscapes, and pop culture. Questions are organized into five categories (spatial, counting, action, temporal, and causal) with three answer formats: open-ended, multiple-choice, and yes/no. We adopt LLM-as-a-Judge evaluation using GPT-4o, scoring each response as 1 (incorrect), 2 (partially correct), or 3 (correct). Among the seven MLLMs evaluated, Gemini 3 Pro achieved the highest performance, with a mean score of 2.61 (76.3% of responses scored 3). The open-source models scored lower: Qwen3-VL-8B-Instruct reached 2.24 (56.4%), and Phi-4-multimodal-instruct only 1.74 (32.4%). Our benchmark is the first attempt to quantitatively evaluate MLLM video understanding in the specific domain of Japanese culture, and it provides guidance for future model improvements.
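The two figures reported per model, the mean score and the share of responses scored 3, both follow from the per-response judge scores on the 1 to 3 scale. The following is a minimal sketch of that aggregation, not the authors' evaluation code: the function name `summarize_scores` and the sample scores are hypothetical and chosen only for illustration.

```python
from collections import Counter


def summarize_scores(scores):
    """Return (mean score, percentage of responses scored 3).

    `scores` is a list of judge scores, each 1 (incorrect),
    2 (partially correct), or 3 (correct).
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if any(s not in (1, 2, 3) for s in scores):
        raise ValueError("each score must be 1, 2, or 3")
    counts = Counter(scores)
    mean = sum(scores) / len(scores)
    pct_correct = 100.0 * counts[3] / len(scores)
    return mean, pct_correct


# Hypothetical judge scores for ten responses (not taken from the benchmark).
example = [3, 3, 2, 1, 3, 3, 3, 2, 3, 1]
mean, pct = summarize_scores(example)
print(f"mean score = {mean:.2f}, scored 3 = {pct:.1f}%")
# prints: mean score = 2.40, scored 3 = 60.0%
```

Reporting both numbers is useful because the mean alone cannot distinguish a model with many partially correct answers from one that mixes fully correct and incorrect answers.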
