Personalized Education: Blind Knowledge Distillation

Xiang Deng, Jian Zheng, Zhongfei Zhang ;


"Knowledge distillation compresses a large model (teacher) to a smaller one by letting the student imitate the outputs of the teacher. An interesting question is why the student still typically underperforms the teacher after the imitation. The existing literature usually attributes this to model capacity differences between them. However, capacity differences are unavoidable in model compression, and even large capacity differences are desired for achieving high compression rates. By designing exploratory experiments with theoretical analysis, we find that model capacity differences are not necessarily the root reason; instead the distillation data matter when the student capacity is greater than a threshold. In light of this, we propose personalized education (PE) to first help each student adaptively find its own blind knowledge region (BKR) where the student has not captured the knowledge from the teacher, and then teach the student on this region. Extensive experiments on several benchmark datasets demonstrate that PE substantially reduces the performance gap between students and teachers, even enables small students to outperform large teachers, and also beats the state-of-the-art approaches. Code link:"

Related Material

[pdf] [supplementary material] [DOI]