Presentation Information
[2Yin-B-33] Construction of a Vehicle Driving Dataset for Designing a Multimodal LLM Capable of Interpreting Time-Series Data
〇Koichi Seki^1, Shugo Matsusaka^1, Hideaki Bunazawa^1, Takuya Shintate^2, Shuheng You^2, Yongpeng Cao^2, Xi Xue^2, Akira Yoshida^2, Kunio Suzuki^2 (1. Toyota Motor Corporation, 2. NABLAS Inc.)
Keywords:
Large Language Model, Multimodal LLM, Time-Series Data, Caption Generation, Dataset Construction
In recent years, extending large language models (LLMs) to additional modalities, including images, video, and audio alongside natural language, has been actively studied. However, the development of multimodal LLMs capable of interpreting time-series data has not progressed to the same degree. One contributing factor is the shortage of large-scale datasets in which time-series data are paired with corresponding descriptive text. This study focuses on vehicle driving data as a representative type of time-series data, aiming to generate textual descriptions of driving situations and to construct a large-scale dataset. We propose a pipeline that renders driving data as graphs and processes the resulting images with a vision-language model (VLM). A distinctive feature of our approach is that we define driving events, such as "rapid acceleration" and "gentle right curve," and incorporate these event definitions into the prompts, which enables the generation of the intended descriptions. Validation confirmed that the pipeline can produce both overview-level and detailed descriptions of vehicle driving data. Ultimately, we constructed approximately 20,000 text-paired samples covering 10-second and 1-minute driving segments. This outcome is expected to enable the construction of similar datasets for other types of time-series data beyond the automotive domain.
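The abstract does not include the pipeline itself, so the following Python code is only a minimal sketch of the described approach: a driving segment is plotted as time-series graphs and the image is sent to a VLM together with event definitions embedded in the prompt. The signal names (speed, steering), the event definitions and their thresholds, the helper functions, and the use of matplotlib and an OpenAI-compatible VLM endpoint are all assumptions for illustration, not the authors' actual implementation.

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
from openai import OpenAI

# Hypothetical event definitions of the kind the abstract describes;
# the actual event vocabulary and thresholds are not given in the source.
EVENT_DEFINITIONS = """\
- rapid acceleration: longitudinal acceleration exceeds +2.5 m/s^2
- gentle right curve: steering angle stays between +5 and +15 degrees for >= 3 s
"""


def plot_segment_to_png(t, speed, steering):
    """Render one driving segment (speed and steering time series) as a PNG."""
    fig, axes = plt.subplots(2, 1, sharex=True, figsize=(8, 4))
    axes[0].plot(t, speed)
    axes[0].set_ylabel("speed [km/h]")
    axes[1].plot(t, steering)
    axes[1].set_ylabel("steering [deg]")
    axes[1].set_xlabel("time [s]")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return buf.getvalue()


def caption_segment(png_bytes, client, model="gpt-4o"):
    """Ask a VLM for an overview and a detailed description of the plotted segment."""
    image_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode()
    prompt = (
        "The image shows a vehicle driving segment as time-series plots.\n"
        "Using these event definitions:\n" + EVENT_DEFINITIONS +
        "write (1) a one-sentence overview and (2) a detailed, "
        "time-ordered description of the driving situation."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content
```

Run over many 10-second and 1-minute segments, a loop of this form would yield (time series, overview text, detailed text) triples, which is the shape of dataset the abstract reports constructing at a scale of roughly 20,000 pairs.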
