Bayesian distillation of deep learning models (Q2069701)

From MaRDI portal

scientific article; zbMATH DE number 7461105

    Statements

    Bayesian distillation of deep learning models (English)
    21 January 2022
    The authors present a Bayesian approach to knowledge distillation in teacher-student networks. Knowledge distillation was first proposed by \textit{G. Hinton} et al. [``Distilling the knowledge in a neural network'', Preprint, \url{arXiv:1503.02531}], who suggested training a large network on the ground-truth labels as the teacher and then training a smaller student model on the teacher's outputs, used as ``soft targets''. The present work extends this teacher-student framework: the authors argue that the parameters of the student network can be initialized from those of the teacher. Since the teacher network is usually larger than the student, they propose to prune the teacher so that it matches the student's architecture; the pruned weights then serve as the student's initialization. Under the assumption that the posterior distribution of the teacher's parameters is Gaussian, the authors prove that the posterior of the pruned teacher network is also Gaussian.
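    To make the procedure concrete, here is a minimal sketch (assuming a PyTorch implementation, not the authors' code) of the pipeline described above for a two-layer perceptron: the teacher's hidden layer is pruned to the student's width, the surviving weights initialize the student, and the student is then trained on the teacher's soft targets. The magnitude-based pruning rule and the names prune_mlp, soft_target_loss and T are illustrative assumptions; the paper itself selects which parameters to keep with a Bayesian criterion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def prune_mlp(teacher: nn.Sequential, hidden_keep: int) -> nn.Sequential:
    """Build a student from a two-layer teacher (Linear -> ReLU -> Linear) by
    keeping the `hidden_keep` hidden units with the largest incoming weight-row
    norms and copying their weights. Magnitude pruning is an assumed stand-in
    for the paper's Bayesian selection criterion."""
    fc1, fc2 = teacher[0], teacher[2]
    keep = fc1.weight.norm(dim=1).topk(hidden_keep).indices
    student_fc1 = nn.Linear(fc1.in_features, hidden_keep)
    student_fc2 = nn.Linear(hidden_keep, fc2.out_features)
    with torch.no_grad():
        student_fc1.weight.copy_(fc1.weight[keep])     # rows of the kept hidden units
        student_fc1.bias.copy_(fc1.bias[keep])
        student_fc2.weight.copy_(fc2.weight[:, keep])  # matching input columns
        student_fc2.bias.copy_(fc2.bias)
    return nn.Sequential(student_fc1, nn.ReLU(), student_fc2)


def soft_target_loss(student_logits, teacher_logits, T: float = 2.0):
    """Soft-target loss of Hinton et al.: KL divergence between the
    temperature-softened teacher and student output distributions."""
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T


# Example: distill a 784-512-10 teacher into a 784-64-10 student.
teacher = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10))
student = prune_mlp(teacher, hidden_keep=64)
x = torch.randn(32, 784)
loss = soft_target_loss(student(x), teacher(x).detach())
loss.backward()
```

    The Gaussian statement is consistent with the elementary fact that any marginal of a multivariate Gaussian is again Gaussian, so restricting a Gaussian posterior over the teacher's weights to the retained coordinates again yields a Gaussian posterior for the pruned network.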
    deep learning
    Bayesian methods
    knowledge distillation
    model selection
    Bayesian inference

    Identifiers