Presentation Information

[3Yin-A-30]Improving Knowledge Distillation Fidelity via Adaptive Reweighting Based on Loss Divergence

〇Masato Mita1 (1. Recruit Co.,Ltd.)

Keywords:

Knowledge Distillation, Large Language Model

Knowledge distillation (KD) transfers a strong teacher’s capabilities to a lightweight student, but for reasoning tasks the capacity gap often induces shortcut learning, where the student falls back on superficial pretraining patterns instead of following the teacher distribution. We propose Adaptive Z-score Weighting (AZ-Weighting), a plug-in module that reweights each training sample by its relative KD-loss deviation. AZ-Weighting tracks the mean and variance of KD losses with an exponential moving average, converts each sample loss to a Z-score, and nonlinearly upweights high-deviation samples to focus learning on hard-to-align cases. Experiments on GSM8K show that AZ-Weighting improves strict format fidelity while maintaining answer accuracy.
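
The abstract describes the reweighting scheme at a high level; the following is a minimal Python sketch of how such a module could be implemented. The EMA decay rate, the epsilon, and the softplus-based nonlinearity are illustrative assumptions, not details taken from the presentation.

import torch

class AZWeighting:
    """Sketch of Adaptive Z-score Weighting (AZ-Weighting).

    Tracks an exponential moving average (EMA) of the mean and variance of
    per-sample KD losses, converts each sample loss to a Z-score under those
    running statistics, and nonlinearly upweights high-deviation samples.
    The decay rate and the softplus nonlinearity below are assumptions made
    for illustration only.
    """

    def __init__(self, decay: float = 0.99, eps: float = 1e-8):
        self.decay = decay
        self.eps = eps
        self.mean = None  # EMA of the KD loss
        self.var = None   # EMA of the KD-loss variance

    def __call__(self, kd_loss: torch.Tensor) -> torch.Tensor:
        # kd_loss: per-sample KD losses, shape (batch,)
        batch_mean = kd_loss.detach().mean()
        batch_var = kd_loss.detach().var(unbiased=False)
        if self.mean is None:
            self.mean, self.var = batch_mean, batch_var
        else:
            self.mean = self.decay * self.mean + (1 - self.decay) * batch_mean
            self.var = self.decay * self.var + (1 - self.decay) * batch_var
        # Z-score of each sample's loss relative to the running statistics
        z = (kd_loss.detach() - self.mean) / (self.var.sqrt() + self.eps)
        # Nonlinear upweighting: softplus keeps weights positive and grows
        # with the Z-score, emphasizing hard-to-align (high-deviation) samples
        weights = torch.nn.functional.softplus(z) + 1.0
        # Weighted mean of the per-sample KD losses
        return (weights * kd_loss).mean()

In training, the module would simply replace the plain mean over per-sample KD losses (e.g., loss = az_weighting(per_sample_kd_loss)), leaving the rest of the distillation pipeline unchanged, which is consistent with its description as a plug-in module.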