The 39th Annual Conference of the Japanese Society for Artificial Intelligence, 2025

May 27-30, 2025, Osaka International Convention Center + Online
The Japanese Society for Artificial Intelligence

[3K6-IS-2c-01] Transforming Low-quality Technical Documents into Narrative Sentences for Adapting LLMs to Niche Technical Domains

〇Ekant Muljibhai Amin¹, Yuta Koreeda¹, Yasuhiro Sogawa¹ (1. Advanced AI Innovation Center, Hitachi, Ltd.)
We investigate whether Large Language Models (LLMs) can effectively learn from industrial data that is limited in quantity and often lacks the coherent narrative flow found in general-purpose training corpora.
Our objective is to address the distinctive challenges posed by such industrial data, which can hinder domain adaptation.
To do so, we propose a data-quality-based evaluation method that derives question-answer pairs from the training corpus, identifies the source chunk corresponding to each pair, and labels that chunk as high- or low-quality based on features such as structure, repetition, and punctuation. We then measure how much each labeled subset contributes to the performance of the domain-adapted model.
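The labeling step can be illustrated with a minimal Python sketch. The abstract names structure, repetition, and punctuation as features but does not define them, so the feature definitions, the function names `quality_features` and `label_chunk`, and all thresholds below are assumptions chosen for illustration, not the authors' method.

```python
import re


def quality_features(chunk: str) -> dict:
    """Compute three illustrative quality features for a text chunk.

    These definitions are assumptions for this sketch; the abstract only
    names the feature families (structure, repetition, punctuation).
    """
    lines = [ln for ln in chunk.splitlines() if ln.strip()]
    words = re.findall(r"\w+", chunk.lower())

    # Structure: share of non-empty lines that end like a sentence.
    sentence_like = sum(1 for ln in lines if ln.rstrip().endswith((".", "?", "!")))
    structure = sentence_like / len(lines) if lines else 0.0

    # Repetition: share of word tokens that duplicate an earlier token.
    repetition = 1 - len(set(words)) / len(words) if words else 0.0

    # Punctuation: sentence-ending marks per word token.
    punctuation = len(re.findall(r"[.?!]", chunk)) / len(words) if words else 0.0

    return {"structure": structure, "repetition": repetition, "punctuation": punctuation}


def label_chunk(
    chunk: str,
    min_structure: float = 0.5,     # hypothetical threshold
    max_repetition: float = 0.6,    # hypothetical threshold
    min_punctuation: float = 0.02,  # hypothetical threshold
) -> str:
    """Label a chunk 'high' or 'low' quality from the features above."""
    f = quality_features(chunk)
    is_high = (
        f["structure"] >= min_structure
        and f["repetition"] <= max_repetition
        and f["punctuation"] >= min_punctuation
    )
    return "high" if is_high else "low"


narrative = (
    "The pump transfers coolant to the reactor. "
    "Operators should check the pressure gauge daily."
)
table_like = "Pump | P-101 | 3.5\nPump | P-102 | 3.5\nPump | P-103 | 3.5"

narrative_label = label_chunk(narrative)
table_label = label_chunk(table_like)
```

In this sketch, the narrative chunk passes all three checks, while the table-like chunk fails the structure check because none of its lines ends like a sentence. In practice the thresholds would need tuning on the target corpus.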
Results show that LLMs derive most of their domain knowledge from high-quality data, suggesting that low-quality data is underutilized.
To overcome this limitation, we introduce a multi-step chain-of-thought approach that refines low-quality text into coherent narratives while preserving essential information.
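A multi-step refinement of this kind could be sketched as follows. The abstract does not specify the individual steps or prompts, so the three-step decomposition (extract facts, draft a narrative, check the draft against the source), the prompt wording, and the function name `refine_chunk` are all hypothetical. The `llm` argument stands in for any text-in, text-out model call and is likewise an assumption.

```python
from typing import Callable


def refine_chunk(chunk: str, llm: Callable[[str], str]) -> str:
    """Rewrite a low-quality chunk as a narrative in three chained LLM calls.

    The step breakdown and prompts are illustrative assumptions, not the
    prompts used in the paper.
    """
    # Step 1 (assumed): make the information in the chunk explicit.
    facts = llm(
        "List every distinct fact stated in the following technical text, "
        "one per line, without adding anything:\n\n" + chunk
    )

    # Step 2 (assumed): turn the extracted facts into coherent prose.
    draft = llm(
        "Write a coherent narrative paragraph that uses only these facts:\n\n" + facts
    )

    # Step 3 (assumed): check the draft against the source so that
    # essential information is preserved and nothing is invented.
    final = llm(
        "Compare the draft with the source. Restore any missing information, "
        "remove anything not supported by the source, and return the revised "
        "draft only.\n\nSource:\n" + chunk + "\n\nDraft:\n" + draft
    )
    return final


# Usage with a stand-in model that records prompts instead of calling an LLM.
recorded_prompts = []


def fake_llm(prompt: str) -> str:
    recorded_prompts.append(prompt)
    return f"output-{len(recorded_prompts)}"


refined = refine_chunk("Pump | P-101 | 3.5 bar", fake_llm)
```

Passing the model as a callable keeps the sketch independent of any particular LLM client; each step's output feeds the next prompt, and the final step sees both the source and the draft.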
This transformation significantly boosts performance: the domain-relevance win rate increases from 59% to 73%, and correctness from 32% to 55%.
Overall, our findings highlight the importance of data quality and offer a practical strategy for enhancing LLM effectiveness in real-world industrial settings.