Knowledge Distillation: A Free-teacher Framework Driven by Word-Vector
DOI: 10.23977/CNCI2020023
Corresponding Author
Chuyi Zou
ABSTRACT
Knowledge distillation (KD) is an effective method for transferring knowledge from a large teacher network to a small student network, enhancing the generalization ability of the student while satisfying the low-memory and fast-inference requirements of practical deployment. Existing KD methods typically require pre-training a teacher as a first step to discover useful knowledge, and only then transfer that knowledge to the student network. This two-stage training procedure is complex and incurs the substantial computational cost of pre-training a teacher. In this paper, we propose a free-teacher framework driven by word vectors to address this limitation. Using publicly available word-vector packages (such as 'GoogleNews-vectors-negative300'), we construct a semantic similarity matrix over the class labels. This matrix provides additional soft labels analogous to a conventional teacher model's outputs, yet requires no extra training cost. Extensive evaluations show that our approach improves the generalization performance of a variety of deep neural networks, competitive with alternative methods on two image classification datasets, CIFAR10 and CIFAR100, while avoiding the expensive cost of training a teacher.
KEYWORDS
Knowledge Distillation; Classification; Deep Learning
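The core idea in the abstract, building soft labels from a semantic similarity matrix over class-name word vectors instead of a teacher's logits, can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's exact formulation: the function name, the use of cosine similarity, and the softmax temperature are all choices made here for the example, and the word vectors are random stand-ins for embeddings one would load from a package such as 'GoogleNews-vectors-negative300'.

```python
import numpy as np

def similarity_soft_labels(class_vectors, temperature=1.0):
    """Turn class-name word vectors into per-class soft-label distributions.

    Each row i of the returned (C, C) matrix is a distribution over classes,
    obtained by a temperature-scaled softmax over the cosine similarities
    between class i's word vector and every class's word vector.
    (Sketch only; the paper's exact formulation may differ.)
    """
    # Normalize rows so the dot product below is cosine similarity.
    v = class_vectors / np.linalg.norm(class_vectors, axis=1, keepdims=True)
    sim = v @ v.T                                  # (C, C) similarity matrix
    logits = sim / temperature
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)    # each row sums to 1

# Toy example: 3 classes with hypothetical 300-d word vectors
# (in practice these would be looked up from a pre-trained embedding file).
rng = np.random.default_rng(0)
vecs = rng.normal(size=(3, 300))
soft = similarity_soft_labels(vecs, temperature=0.5)
assert np.allclose(soft.sum(axis=1), 1.0)
```

During student training, row `y` of this matrix could replace a teacher's softened output for a sample of class `y`, e.g. as the target of a KL-divergence term combined with the usual cross-entropy loss, which is how conventional KD uses teacher logits.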