[3K6-IS-2c-01] Transforming Low-quality Technical Documents into Narrative Sentences for Adapting LLMs to Niche Technical Domains
〇Ekant Muljibhai Amin1, Yuta Koreeda1, Yasuhiro Sogawa1 (1. Advanced AI Innovation Center, Hitachi, Ltd.)
We investigate whether Large Language Models (LLMs) can effectively learn from industrial data that is limited in quantity and often lacks the coherent narrative flow found in general-purpose training corpora.
Our objective is to address the distinctive challenges posed by such industrial data, which can hinder domain adaptation.
To do so, we propose a data-quality-based evaluation method that derives question-answer pairs from the training corpus, identifies the source chunk corresponding to each pair, and labels that chunk as high- or low-quality based on features such as structure, repetition, and punctuation. We then measure how much each labeled subset contributes to domain-adapted performance.
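The chunk-labeling step can be illustrated with a minimal sketch. The abstract names structure, repetition, and punctuation as features but does not define them, so the feature definitions, the thresholds, and the function names `quality_features` and `label_chunk` below are all assumptions made for illustration, not the authors' implementation.

```python
import re


def quality_features(chunk: str) -> dict:
    """Compute three illustrative features (assumed definitions, not from the paper)."""
    lines = [ln for ln in chunk.splitlines() if ln.strip()]
    words = re.findall(r"\w+", chunk.lower())
    non_space = [c for c in chunk if not c.isspace()]

    # Structure (assumed): share of non-empty lines that end like a sentence.
    sentence_like = sum(1 for ln in lines if ln.rstrip().endswith((".", "!", "?")))
    structure = sentence_like / len(lines) if lines else 0.0

    # Repetition (assumed): share of word tokens that repeat an earlier token.
    repetition = 1 - len(set(words)) / len(words) if words else 0.0

    # Punctuation (assumed): share of non-space characters that are not alphanumeric.
    punct = sum(1 for c in non_space if not c.isalnum())
    punctuation = punct / len(non_space) if non_space else 0.0

    return {"structure": structure, "repetition": repetition, "punctuation": punctuation}


def label_chunk(
    chunk: str,
    min_structure: float = 0.5,    # hypothetical threshold
    max_repetition: float = 0.6,   # hypothetical threshold
    max_punctuation: float = 0.3,  # hypothetical threshold
) -> str:
    """Label a chunk 'high' or 'low' quality using the assumed features above."""
    f = quality_features(chunk)
    is_high = (
        f["structure"] >= min_structure
        and f["repetition"] <= max_repetition
        and f["punctuation"] <= max_punctuation
    )
    return "high" if is_high else "low"


if __name__ == "__main__":
    narrative = "The pump moves coolant through the loop. A sensor reports the flow rate to the controller."
    table_dump = "PUMP-01 | 3.2 | OK\nPUMP-02 | 3.1 | OK\nPUMP-03 | -- | ERR"
    print(label_chunk(narrative))
    print(label_chunk(table_dump))
```

Under these assumed thresholds, narrative prose passes all three checks, while a table-like dump fails the structure check because none of its lines end as sentences. In practice the thresholds would need tuning on the target corpus.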
Results show that LLMs derive most of their domain knowledge from high-quality data, suggesting that low-quality data is underutilized.
To overcome this limitation, we introduce a multi-step chain-of-thought approach that refines low-quality text into coherent narratives while preserving essential information.
This transformation significantly boosts performance: domain-relevance win-rates increase from 59% to 73%, and correctness from 32% to 55%.
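For readers unfamiliar with the metric, a win rate is commonly computed as the fraction of pairwise comparisons in which the adapted model's answer is judged better than a reference answer. The sketch below assumes that definition and counts ties as non-wins; the abstract does not state the comparison baseline, the judging procedure, or how ties are handled, and the function name `win_rate` is hypothetical.

```python
from typing import Sequence


def win_rate(judgments: Sequence[str]) -> float:
    """Fraction of comparisons judged 'win' (assumption: ties and losses both count as non-wins)."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    wins = sum(1 for j in judgments if j == "win")
    return wins / len(judgments)


if __name__ == "__main__":
    # Toy judgments for illustration only; these are not data from the study.
    toy = ["win", "loss", "win", "tie"]
    print(f"{win_rate(toy):.0%}")
```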
Overall, our findings highlight the importance of data quality and offer a practical strategy for enhancing LLM effectiveness in real-world industrial settings.
