Presentation Information
[2Yin-B-33] Construction of a Vehicle Driving Dataset for Designing a Multimodal LLM Capable of Interpreting Time-Series Data
〇Koichi Seki^1, Shugo Matsusaka^1, Hideaki Bunazawa^1, Takuya Shintate^2, Shuheng You^2, Yongpeng Cao^2, Xi Xue^2, Akira Yoshida^2, Kunio Suzuki^2 (1. Toyota Motor Corporation, 2. NABLAS Inc.)
Keywords:
Large Language Model, Multimodal LLM, Time-Series Data, Caption Generation, Dataset Construction
In recent years, extending large language models (LLMs) to additional modalities, including images, video, and audio alongside natural language, has been actively studied. However, the development of multimodal LLMs capable of interpreting time-series data has not progressed to the same degree. One contributing factor is the shortage of large-scale datasets in which time-series data are paired with corresponding descriptive text. This study focuses on vehicle driving data as a representative type of time-series data, aiming to generate textual descriptions of driving situations and to construct a large-scale dataset. We propose a pipeline that renders driving data as graphs and processes the resulting images with a vision-language model (VLM). A distinctive feature of our approach is that we define driving events, such as "rapid acceleration" and "gentle right curve," and incorporate these event definitions into the prompts, which enables the generation of the intended descriptions. Validation confirmed that the pipeline can produce both overview-level and detailed descriptions of vehicle driving data. Ultimately, we constructed approximately 20,000 text-paired samples covering 10-second and 1-minute driving segments. This outcome is expected to enable the construction of similar datasets for other types of time-series data beyond the automotive domain.
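The abstract does not include the pipeline itself, so the following Python code is only a minimal sketch of the described approach: a driving segment is plotted as time-series graphs and the image is sent to a VLM together with event definitions embedded in the prompt. The signal names (speed, steering), the event definitions and their thresholds, the helper functions, and the use of matplotlib and an OpenAI-compatible VLM endpoint are all assumptions for illustration, not the authors' actual implementation.

```python
import base64
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
from openai import OpenAI

# Hypothetical event definitions of the kind the abstract describes;
# the actual event vocabulary and thresholds are not given in the source.
EVENT_DEFINITIONS = """\
- rapid acceleration: longitudinal acceleration exceeds +2.5 m/s^2
- gentle right curve: steering angle stays between +5 and +15 degrees for >= 3 s
"""


def plot_segment_to_png(t, speed, steering):
    """Render one driving segment (speed and steering time series) as a PNG."""
    fig, axes = plt.subplots(2, 1, sharex=True, figsize=(8, 4))
    axes[0].plot(t, speed)
    axes[0].set_ylabel("speed [km/h]")
    axes[1].plot(t, steering)
    axes[1].set_ylabel("steering [deg]")
    axes[1].set_xlabel("time [s]")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return buf.getvalue()


def caption_segment(png_bytes, client, model="gpt-4o"):
    """Ask a VLM for an overview and a detailed description of the plotted segment."""
    image_url = "data:image/png;base64," + base64.b64encode(png_bytes).decode()
    prompt = (
        "The image shows a vehicle driving segment as time-series plots.\n"
        "Using these event definitions:\n" + EVENT_DEFINITIONS +
        "write (1) a one-sentence overview and (2) a detailed, "
        "time-ordered description of the driving situation."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content
```

Run over many 10-second and 1-minute segments, a loop of this form would yield (time series, overview text, detailed text) triples, which is the shape of dataset the abstract reports constructing at a scale of roughly 20,000 pairs.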
