Knowledge distillation improves student model performance. However, a larger teacher model does not necessarily yield larger distillation gains, owing to the significant architecture and output gaps between it and smaller student networks. To address this issue, we reconsider teacher outputs and find that categories on which the teacher is highly confident benefit distillation more, while those with lower certainty contribute less. We therefore propose Logits Uncertainty Distillation (LUD) to bridge this gap. We introduce category uncertainty weighting to account for the uncertainty in the teacher's predictions: a confidence threshold derived from the teacher's predictions is used to construct a mask that discounts uncertain classes during distillation. Furthermore, we incorporate two Spearman correlation loss functions that align the teacher and student logits by measuring the discrepancy between their outputs at the category level and at the sample level. We also introduce adaptive dynamic temperature factors to further optimize the distillation process. Combining these techniques improves distillation results and enables effective knowledge transfer between teacher and student models even when their architectures differ. Extensive experiments on multiple datasets demonstrate the effectiveness of our method.
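The following is a minimal sketch of how the ideas summarized above (a confidence-based class mask plus category- and sample-level Spearman alignment of logits) could be combined in PyTorch. It is illustrative only: the threshold value, loss weighting, fixed temperature, and hard-rank Spearman estimate are assumptions, not the authors' implementation, and the adaptive temperature factors are omitted for brevity.

```python
# Illustrative sketch only; names, thresholds, and weights are assumptions.
import torch
import torch.nn.functional as F


def spearman_corr(x: torch.Tensor, y: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Spearman correlation along `dim` as Pearson correlation of ranks.

    Hard ranking (double argsort) is used for clarity; training would need a
    differentiable soft-rank surrogate instead.
    """
    rx = torch.argsort(torch.argsort(x, dim=dim), dim=dim).float()
    ry = torch.argsort(torch.argsort(y, dim=dim), dim=dim).float()
    rx = rx - rx.mean(dim=dim, keepdim=True)
    ry = ry - ry.mean(dim=dim, keepdim=True)
    return (rx * ry).sum(dim=dim) / (rx.norm(dim=dim) * ry.norm(dim=dim) + 1e-8)


def lud_style_loss(t_logits, s_logits, tau: float = 4.0, conf_thresh: float = 0.05):
    """Uncertainty-masked KD loss plus category/sample-level rank alignment."""
    # Category uncertainty weighting (assumed form): keep only classes on
    # which the teacher's average confidence exceeds a threshold.
    t_prob = F.softmax(t_logits / tau, dim=1)                      # (B, C)
    mask = (t_prob.mean(dim=0) > conf_thresh).float()              # (C,)

    # Masked KL divergence between teacher and student distributions.
    s_logp = F.log_softmax(s_logits / tau, dim=1)
    kd = (t_prob * (t_prob.clamp_min(1e-8).log() - s_logp)) * mask
    kd = kd.sum(dim=1).mean() * tau ** 2

    # Rank alignment: classes ranked within each sample (dim=1) and samples
    # ranked within each class (dim=0).
    sample_corr = spearman_corr(t_logits, s_logits, dim=1).mean()
    class_corr = spearman_corr(t_logits, s_logits, dim=0).mean()
    rank_loss = (1.0 - sample_corr) + (1.0 - class_corr)

    return kd + rank_loss


# Toy usage with random logits (8 samples, 10 classes).
teacher_logits = torch.randn(8, 10)
student_logits = torch.randn(8, 10)
print(lud_style_loss(teacher_logits, student_logits).item())
```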
Keywords: Knowledge distillation; Uncertainty learning.