Presentation Information

[2Yin-B-01]Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision–Language Models

〇Issa Sugiura1,2, Keito Sasagawa4,2, Keisuke Nakao4,2, Koki Maeda6,2, Yin Ziqi2, Yang Zhishen2, Shuhei Kurita3,2, Yusuke Oda2, Ryoko Tokuhisa5,7, Daisuke Kawahara4,2, Naoaki Okazaki6,2 (1. Kyoto University, 2. NII LLMC, 3. National Institute of Informatics, 4. Waseda University, 5. Aichi Institute of Technology, 6. Institute of Science Tokyo, 7. Institute of Physical and Chemical Research)

Keywords:

Vision-Language Models, Dataset Construction

Vision–language models (VLMs) have advanced rapidly, and training datasets play a crucial role in their development. However, existing publicly available training datasets are predominantly English-centric, and large-scale Japanese datasets covering diverse categories remain limited.
In this study, we construct Jagle, a large-scale Japanese multimodal post-training dataset comprising approximately 9.4 million instances across six categories and eighteen subsets. We compare models trained solely on the English training dataset FineVision with models trained on a combination of FineVision and Jagle. The results demonstrate that incorporating Jagle substantially improves Japanese performance while maintaining English performance. We publicly release our dataset.