Recommended Publication

Gradient-aware knowledge distillation: Tackling gradient insensitivity through teacher guided gradient scaling

Date Posted: 2025-12-29
DOI:
10.1016/j.neunet.2025.108229
Journal:
Neural Networks
Abstract:
Prior research on knowledge distillation has primarily focused on enhancing the process through logit-based and feature-based approaches. In this paper, we present a novel gradient-based perspective on the learning dynamics of knowledge distillation, revealing a previously overlooked issue of gradient insensitivity. This issue arises when the varying confidence levels of the teacher’s predictions are not adequately captured in the student’s gradient updates, hindering the effective transfer of nuanced knowledge. To address this challenge, we propose gradient-aware knowledge distillation, a method designed to mitigate gradient insensitivity by incorporating varying teacher confidence into the distillation procedure. Specifically, it adjusts the gradients of the student logits in accordance with the teacher confidence, introducing sample-specific adjustments that assign higher-weighted updates to the non-target classes of samples where the teacher exhibits greater confidence. Extensive experiments on image classification and object detection tasks demonstrate the superiority of our approach, particularly in heterogeneous teacher-student scenarios, achieving state-of-the-art performance on ImageNet. Moreover, the proposed method is versatile and can be effectively integrated with many logit-based distillation methods, providing a robust enhancement to existing methods. The code is available at https://github.com/snw2021/GKD.
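
The confidence-driven gradient scaling described in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hypothetical PyTorch rendering of confidence-weighted logit distillation: each sample's distillation term is reweighted by the teacher's confidence on the target class before backpropagation. The function name gradient_aware_kd_loss, the (1 + confidence) weighting scheme, and the temperature T are illustrative assumptions, not the authors' method; the official implementation is in the GitHub repository linked above.

```python
# Minimal sketch (assumed, not the official GKD implementation).
import torch
import torch.nn.functional as F


def gradient_aware_kd_loss(student_logits, teacher_logits, targets, T=4.0):
    """Confidence-weighted logit distillation loss (illustrative sketch)."""
    # Softened teacher/student distributions, as in standard logit distillation
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)

    # Per-sample KL divergence, scaled by T^2 as usual
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=1) * (T * T)

    # Teacher confidence on the ground-truth class (detached from the graph)
    conf = p_teacher.gather(1, targets.unsqueeze(1)).squeeze(1).detach()

    # Sample-specific weight: a more confident teacher yields stronger updates
    # for that sample's student logits (assumed weighting scheme)
    weight = 1.0 + conf
    return (weight * kl).mean()


if __name__ == "__main__":
    # Toy usage with random logits for an 8-sample, 100-class batch
    student_logits = torch.randn(8, 100, requires_grad=True)
    teacher_logits = torch.randn(8, 100)
    targets = torch.randint(0, 100, (8,))
    loss = gradient_aware_kd_loss(student_logits, teacher_logits, targets)
    loss.backward()
    print(loss.item(), student_logits.grad.shape)
```

Because each sample's KL term is scaled before backpropagation, the gradients flowing into that sample's logits, including the non-target classes, are scaled by the same sample-specific factor, which mirrors the kind of adjustment the abstract describes.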
Co-authors:
Wenlin Zhang, Hao Zhang, Dan Qu
First Author:
Nianwen Si
Paper Type:
Journal Article
Corresponding Authors:
Wei-Qiang Zhang, Heyu Chang
Publication Date:
2025-10-17
Journal Link:
https://doi.org/10.1016/j.neunet.2025.108229