Presentation Information

[2Yin-B-01]Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision–Language Models

〇Issa Sugiura1,2, Keito Sasagawa4,2, Keisuke Nakao4,2, Koki Maeda6,2, Yin Ziqi2, Yang Zhishen2, Shuhei Kurita3,2, Yusuke Oda2, Ryoko Tokuhisa5,7, Daisuke Kawahara4,2, Naoaki Okazaki6,2 (1. Kyoto University, 2. NII LLMC, 3. National Institute of Informatics, 4. Waseda University, 5. Aichi Institute of Technology, 6. Institute of Science Tokyo, 7. Institute of Physical and Chemical Research)

Keywords:

Vision-Language Models, Dataset Construction

Vision–language models (VLMs) have advanced rapidly, and training datasets play a crucial role in their development. However, existing publicly available training datasets are predominantly English-centric, and large-scale Japanese datasets covering diverse categories remain limited.
In this study, we construct Jagle, a large-scale Japanese multimodal post-training dataset comprising approximately 9.4 million instances across six categories and eighteen subsets. We compare models trained solely on the English training dataset FineVision with models trained on a combination of FineVision and Jagle. The results demonstrate that incorporating Jagle substantially improves Japanese performance while maintaining English performance. We publicly release our dataset.