Presentation Information

[4Yin-B-57] Improving Instruction-Following Response Quality by Distilling a Large Language Reasoning Model: Construction and Evaluation of an Instruction-Tuning Dataset with Reasoning Traces

〇Dung Tien Nguyen1, Sakae Mizuki1,2, Youmi Ma1, Yuta Katayama1, Naoaki Okazaki1,2,3 (1. Institute of Science Tokyo, 2. National Institute of Advanced Industrial Science and Technology, 3. National Institute of Informatics - Research and Development Center for Large Language Models (LLMC))

Keywords:

Instruction Tuning, Large Language Model, Reasoning Model, Distillation, Dataset Construction

We show that distilling teacher responses with reasoning traces from a large language reasoning model, gpt-oss, can substantially improve the usefulness of instruction-following dialogue responses. Specifically, using human-written instructions collected from LMSYS-Chat-1M, a large-scale human–AI conversation dataset, we construct GPT-OSS-LMSYS-Chat-1M-Synth by querying gpt-oss for synthetic assistant responses that include reasoning traces. During synthesis, we apply Best-of-N sampling guided by the teacher's self-rated usefulness and use refusal-mitigation prompting to reduce instruction refusals. We then instruction-tune Qwen3 on this dataset and observe a substantial improvement on general chat compared with a counterpart trained on assistant responses generated by a non-reasoning teacher model, Gemma3. Finally, an ablation that removes reasoning traces (training only on instruction–response pairs) confirms that including reasoning traces during distillation contributes to the observed gains.
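The Best-of-N selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` structure, the 1–5 usefulness scale, and all sample strings are assumptions introduced here for clarity.

```python
# Hypothetical sketch of Best-of-N selection guided by the teacher's
# self-rated usefulness score. Data structure and scoring scale are
# assumptions, not taken from the paper.
from dataclasses import dataclass

@dataclass
class Candidate:
    reasoning: str   # reasoning trace emitted by the teacher
    response: str    # final assistant response
    usefulness: int  # teacher's self-rated usefulness (assumed 1-5 scale)

def best_of_n(candidates):
    """Return the candidate whose self-rated usefulness is highest."""
    return max(candidates, key=lambda c: c.usefulness)

# Toy example: three sampled candidates for one instruction.
candidates = [
    Candidate("trace A", "answer A", 3),
    Candidate("trace B", "answer B", 5),
    Candidate("trace C", "answer C", 4),
]
print(best_of_n(candidates).response)  # → answer B
```

The selected response, together with its reasoning trace, would then be kept as the training target for that instruction.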