Correlation operation based on intermediate layers for knowledge method

doi:10.3969/j.issn.1000-5641.2022.05.010

Abstract

Abstract:

Convolutional neural networks have made remarkable achievements in artificial intelligence, such as blockchain, speech recognition, and image understanding. However, improvement in model performance is accompanied by a substantial increase in the computational and parameter overhead, leading to a series of problems, such as a slow inference speed, large memory consumption, and difficulty of deployment on mobile devices. Knowledge distillation serves as a typical model compression method, and can transfer knowledge from the teacher network to the student network to improve the latter’s performance without any increase in the number of parameters. A method for extracting representative knowledge for distillation has become the core issue in this field. In this paper, we present a new knowledge distillation method based on intermediate correlation operation, which with the help of data augmentation captures the learning and transformation process of image features during each middle layer stage of the network. We model this feature transform procedure using a correlation operation to extract a new representation from the teacher network to guide the training of the student network. The experimental results demonstrate that our method achieves the best performance on both the CIFAR-10 and CIFAR-100 datasets, in comparison to previous state-of-the-art methods.

Key words: convolutional neural networks, model compression, knowledge distillation, knowledge representation, correlation operation

CLC Number:

TP183

Haojie WU, Yanjie WANG, Wenbing CAI, Fei WANG, Yang LIU, Peng PU, Shaohui LIN. Correlation operation based on intermediate layers for knowledge method[J]. Journal of East China Normal University(Natural Science), 2022, 2022(5): 115-125.

Figures/Tables 9

Fig.1

Fig.2

Table 1

Table 2

Table 3

Fig.3

Table 4

Table 5

Table 6

References 38

1	袁勇, 周涛, 周傲英, 等. 区块链技术: 从数据智能到知识自动化. 自动化学报, 2017, 43 (9): 1485- 1490.
2	KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks [C]// Proceedings of the Advances in Neural Information Processing Systems. 2012: 1097-1105.
3	RENS Q, HE K M, GIRSHICK R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39 (6): 1137- 1149.
4	HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.
5	纪荣嵘, 林绍辉, 晁飞, 等. 深度神经网络压缩与加速综述. 计算机研究与发展, 2018, 55 (9): 1871- 1888.
6	孟宪法, 刘方, 李广, 等. 卷积神经网络压缩中的知识蒸馏技术综述. 计算机科学与探索, 2021, 15 (10): 1812- 1818.
7	DOSOVITSKIY A, FISCHER P, ILG E, et al. FlowNet: Learning optical flow with convolutional networks [C]// Proceedings of the IEEE International Conference on Computer Vision. 2015: 2758-2766.
8	ILG E, MAYER N, SAIKIA T, et al. FlowNet 2.0: Evolution of optical flow estimation with deep networks [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2462-2470.
9	WANG H, TRAN D, TORRESANI L, et al. Video modeling with correlation networks [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2020: 352-361.
10	LI H, KADAV A, DURDANOVIC I, et al. Pruning filters for efficient convnets [EB/OL]. (2017-03-10)[2022-06-11]. https://arxiv.org/pdf/1608.08710.pdf.
11	LIN J, RAO Y M, LU J W, et al. Runtime neural pruning [C]// Proceedings of Advances in Neural Information Processing Systems. 2017: 2178–2188.
12	HUBARA I, COURBARIAUS M, SOUDRY D, et al. Binarized neural networks: Training deep neural networks with weights and activations constrained to + 1 or –1 [EB/OL]. (2016-03-17)[2022-06-01]. https://arxiv.org/pdf/1602.02830.pdf.
13	JACOB B, KLIGYS S, CHEN B, et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 2704-2713.
14	章振宇, 谭国平, 周思源. 基于1-bit压缩感知的高效无线联邦学习算法. 计算机应用, 2022, 42 (6): 1675- 1682.
15	TAI C, XIAO T, ZHANG Y, et al. Convolutional neural networks with low-rank regularization [EB/OL]. (2016-02-14)[2022-06-15]. https://arxiv.org/pdf/1511.06067.pdf.
16	IOANOU Y, ROBERTSON D, SHOTTON J, et al. Training CNNs with low-rank filters for efficient image classification [C]//Proceedings of the International Conference on Learning Representation. 2016: 45-61.
17	HOWARD A G, ZHU M L, CHEN B, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications [EB/OL]. (2017-04-17)[2022-06-01]. https://arxiv.org/pdf/1704.04861.pdf.
18	TAN M, LE Q. EfficientNet: Rethinking model scaling for convolutional neural networks [C]// Proceedings of the International Conference on Machine Learning. 2019: 6105–6114.
19	HINTON G E, VINYALS O, DEAN J. Distilling the knowledge in a neural network [C]// Proceedings of the International Conference on Learning Representation Workshop. 2015: 60-72.
20	ROMERO A, BALLAS N, KAHOU S E, et al. FitNets: Hints for thin deep nets [C]// Proceedings of the International Conference on Learning Representation. 2015: 73-85.
21	ZAGORUYKO S, KOMODAKIS N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer [EB/OL]. (2017-02-12)[2022-06-14]. https://arxiv.org/pdf/1612.03928.pdf.
22	DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth $16 \times 16 $ words: Transformers for image recognition at scale [EB/OL]. (2021-06-03)[2022-06-13]. https://arxiv.org/pdf/2010.11929v2.pdf.
23	WANG W H, WEI F R, DONG L, et al. MINILM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers [C]// Proceedings of the Advances in Neural Information Processing Systems. 2020: 5776-5788.
24	AGUILAR G, LING Y, ZHANG Y, et al. Knowledge distillation from internal representations [C]// Proceedings of the Association for the Advancement of Artificial Intelligence. 2020: 7350-7357.
25	YIM J, JOO D, BAE J, et al. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 7130–7138.
26	PARK W, KIM D, LU Y, et al. Relational knowledge distillation [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 3967-3976.
27	LIU Y F, CAO J J, LI B, et al. Knowledge distillation via instance relationship graph [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 7096-7104.
28	TUNG F, MORI G. Similarity-preserving knowledge distillation [C]// Proceedings of the IEEE International Conference on Computer Vision. 2019: 1365-1374.
29	KIM J, PARK S, KWAK N. Paraphrasing complex network: Network compression via factor transfer [C]// Proceedings of the Advances in Neural Information Processing System. 2018: 2760–2769.
30	HEO B, LEE M, YUN S, et al. Knowledge transfer via distillation of activation boundaries formed by hidden neurons [C]// Proceedings of the Association for the Advancement of Artificial Intelligence. 2019: 3779-3787.
31	SRINIVAS S, FLEURET F. Knowledge transfer with jacobian matching [C]// Proceedings of the 35th International Conference on Machine Learning. 2018: 4723-4731.
32	TIAN Y L, KRISHNAN D, ISOLA P. Contrastive representation distillation [EB/OL]. (2019-10-23)[2022-06-06]. https://arxiv.org/pdf/1910.10699v1.pdf.
33	XU G D, LIU Z W, LI X X, et al. Knowledge distillation meets self-supervision [C]// Proceedings of the European Conference on Computer Vision. 2020: 588-604.
34	SU C, LI P, XIE Y, et al. Hierarchical knowledge squeezed adversarial network compression [C]// Proceedings of the Association for the Advancement of Artificial Intelligence. 2020: 11370-11377.
35	SHEN Z Q, HE Z K, XUE X Y. MEAL: Multi-model ensemble via adversarial learning [C]// Proceedings of the Association for the Advancement of Artificial Intelligence. 2019: 4886-4893.
36	CHUNG I, PARK S U, KIM J, et al. Feature-map-level online adversarial knowledge distillation [C]// Proceedings of the International Conference on Machine Learning. 2020: 2006-2015.
37	JIN X, PENG B Y, WU Y C, et al. Knowledge distillation via route constrained optimization [C]// Proceedings of the IEEE International Conference on Computer Vision. 2019: 1345-1354.
38	ZAGORUYKO S, KOMODAKIS N. Wide residual networks [EB/OL]. (2017-05-23)[2022-06-01]. https://arxiv.org/pdf/1605.07146.pdf.

网络		网络参数量		网络计算量
教师	学生	教师	学生	教师	学生
WRN40-2	WRN16-2	2.2 M	0.7 M	329.0 M	101.6 M
WRN40-2	WRN16-1	2.2 M	0.2 M	329.0 M	26.9 M
ResNet-56	ResNet-20	0.9 M	0.3 M	126.8 M	41.2 M
ResNet $32 \times 4 $	ResNet $8 \times 4 $	7.4 M	1.2 M	1.1 G	0.2 G

网络		网络准确率/%		学生网络准确率/%
教师	学生	教师	学生	KD^[19]	Corr	CorrKD
WRN40-2	WRN16-2	95.2	94.0	94.1	94.4	94.7
WRN40-2	WRN16-1	95.2	93.2	93.6	94.4	94.4
ResNet-56	ResNet-20	93.9	92.6	92.7	92.9	92.9
ResNet $32 \times 4 $	ResNet $8 \times 4 $	95.7	92.4	92.7	92.7	92.9

网络		网络准确率/%		学生网络准确率/%
教师	学生	教师	学生	KD^[19]	Corr	CorrKD
WRN40-2	WRN16-2	76.8	73.7	74.1	74.3	75.8
WRN40-2	WRN16-1	76.8	71.7	72.4	72.7	74.6
ResNet-56	ResNet-20	73.4	69.2	71.0	69.3	71.3
ResNet $32 \times 4 $	ResNet $8 \times 4 $	79.6	72.8	73.1	73.1	74.3

网络		知识蒸馏方法准确率/%
教师	学生	KD^[19]	FitNet^[20]	AT^[21]	SP^[28]	FT^[29]	CorrKD
WRN40-2	WRN16-2	74.1	75.8	75.3	75.3	75.2	75.8
WRN40-2	WRN16-1	72.4	74.1	74.5	73.2	74.4	74.6

教师网络 $ \to $ 学生网络	$ k $	WRN16-2 准确率/%
WRN40-2 $ \to $ WRN16-2	3	75.3
	5	75.4
	7	75.8
	9	75.6