Presentation Information

[4Yin-B-57] Improving Instruction-Following Response Quality by Distilling a Large Language Reasoning Model: Construction and Evaluation of an Instruction-Tuning Dataset with Reasoning Traces

〇Dung Tien Nguyen1, Sakae Mizuki1,2, Youmi Ma1, Yuta Katayama1, Naoaki Okazaki1,2,3 (1. Institute of Science Tokyo, 2. National Institute of Advanced Industrial Science and Technology, 3. National Institute of Informatics - Research and Development Center for Large Language Models (LLMC))

Keywords:

Instruction Tuning, Large Language Model, Reasoning Model, Distillation, Dataset Construction

We show that distilling teacher responses with reasoning traces from a large language reasoning model, gpt-oss, can substantially improve the usefulness of instruction-following dialogue responses. Specifically, using human-written instructions collected from LMSYS-Chat-1M, a large-scale human–AI conversation dataset, we construct GPT-OSS-LMSYS-Chat-1M-Synth by querying gpt-oss for synthetic assistant responses that include reasoning traces. During synthesis, we apply Best-of-N sampling guided by the teacher's self-rated usefulness and use refusal-mitigation prompting to reduce instruction refusals. We then instruction-tune Qwen3 on this dataset and observe a substantial improvement on general chat compared with a counterpart trained on assistant responses generated by a non-reasoning teacher model, Gemma3. Finally, an ablation that removes reasoning traces (training only on instruction–response pairs) confirms that including reasoning traces during distillation contributes to the observed gains.
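The Best-of-N selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Candidate` structure, the 1–5 usefulness scale, and all sample strings are assumptions introduced here for clarity.

```python
# Hypothetical sketch of Best-of-N selection guided by the teacher's
# self-rated usefulness score. Data structure and scoring scale are
# assumptions, not taken from the paper.
from dataclasses import dataclass

@dataclass
class Candidate:
    reasoning: str   # reasoning trace emitted by the teacher
    response: str    # final assistant response
    usefulness: int  # teacher's self-rated usefulness (assumed 1-5 scale)

def best_of_n(candidates):
    """Return the candidate whose self-rated usefulness is highest."""
    return max(candidates, key=lambda c: c.usefulness)

# Toy example: three sampled candidates for one instruction.
candidates = [
    Candidate("trace A", "answer A", 3),
    Candidate("trace B", "answer B", 5),
    Candidate("trace C", "answer C", 4),
]
print(best_of_n(candidates).response)  # → answer B
```

The selected response, together with its reasoning trace, would then be kept as the training target for that instruction.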