Presentation Information

[3Yin-A-30]Improving Knowledge Distillation Fidelity via Adaptive Reweighting Based on Loss Divergence

〇Masato Mita1 (1. Recruit Co.,Ltd.)

Keywords:

Knowledge Distillation, Large Language Model

Knowledge distillation (KD) transfers a strong teacher’s capabilities to a lightweight student, but for reasoning tasks the capacity gap often induces shortcut learning, where the student falls back on superficial pretraining patterns instead of following the teacher distribution. We propose Adaptive Z-score Weighting (AZ-Weighting), a plug-in module that reweights each training sample by its relative KD-loss deviation. AZ-Weighting tracks the mean and variance of KD losses with an exponential moving average, converts each sample loss to a Z-score, and nonlinearly upweights high-deviation samples to focus learning on hard-to-align cases. Experiments on GSM8K show that AZ-Weighting improves strict format fidelity while maintaining answer accuracy.
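
The abstract describes the reweighting scheme at a high level; the following is a minimal Python sketch of how such a module could be implemented. The EMA decay rate, the epsilon, and the softplus-based nonlinearity are illustrative assumptions, not details taken from the presentation.

import torch

class AZWeighting:
    """Sketch of Adaptive Z-score Weighting (AZ-Weighting).

    Tracks an exponential moving average (EMA) of the mean and variance of
    per-sample KD losses, converts each sample loss to a Z-score under those
    running statistics, and nonlinearly upweights high-deviation samples.
    The decay rate and the softplus nonlinearity below are assumptions made
    for illustration only.
    """

    def __init__(self, decay: float = 0.99, eps: float = 1e-8):
        self.decay = decay
        self.eps = eps
        self.mean = None  # EMA of the KD loss
        self.var = None   # EMA of the KD-loss variance

    def __call__(self, kd_loss: torch.Tensor) -> torch.Tensor:
        # kd_loss: per-sample KD losses, shape (batch,)
        batch_mean = kd_loss.detach().mean()
        batch_var = kd_loss.detach().var(unbiased=False)
        if self.mean is None:
            self.mean, self.var = batch_mean, batch_var
        else:
            self.mean = self.decay * self.mean + (1 - self.decay) * batch_mean
            self.var = self.decay * self.var + (1 - self.decay) * batch_var
        # Z-score of each sample's loss relative to the running statistics
        z = (kd_loss.detach() - self.mean) / (self.var.sqrt() + self.eps)
        # Nonlinear upweighting: softplus keeps weights positive and grows
        # with the Z-score, emphasizing hard-to-align (high-deviation) samples
        weights = torch.nn.functional.softplus(z) + 1.0
        # Weighted mean of the per-sample KD losses
        return (weights * kd_loss).mean()

In training, the module would simply replace the plain mean over per-sample KD losses (e.g., loss = az_weighting(per_sample_kd_loss)), leaving the rest of the distillation pipeline unchanged, which is consistent with its description as a plug-in module.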