Matching Guided Distillation ECCV 2020 Kaiyu Yue, Jiangfan Deng, and Feng Zhou Algorithm Research, Aibee Inc.
Motivation
Distillation Obstacle
• The gap in semantic feature structure between the intermediate features of the teacher and the student

Classic Scheme
• Transform intermediate features by adding adaptation modules, such as a conv layer

Problems
• 1) The adaptation module brings extra parameters into training
• 2) An adaptation module with random initialization or a special transformation is not friendly for distilling a pre-trained student
Matching Guided Distillation Framework
Matching Guided Distillation – Matching
Given two feature sets from the teacher and the student, we use the Hungarian method to compute the flow-guided matrix M. The flow-guided matrix M encodes the matched relationships between teacher and student channels.
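As a hedged illustration of this step (not the authors' code), the sketch below builds a channel-wise cost matrix from L2 distances between flattened feature maps and solves the assignment with SciPy's Hungarian solver. Note that in MGD one student channel can absorb several teacher channels; this minimal version shows only the basic one-to-one assignment, and all names are illustrative.

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_channels(f_t, f_s):
    """f_t: (N, Ct, H, W) teacher features; f_s: (N, Cs, H, W) student features."""
    t = f_t.permute(1, 0, 2, 3).flatten(1)   # (Ct, N*H*W): one row per teacher channel
    s = f_s.permute(1, 0, 2, 3).flatten(1)   # (Cs, N*H*W): one row per student channel
    cost = torch.cdist(t, s)                 # (Ct, Cs) pairwise L2 distances
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return rows, cols                        # teacher channel rows[i] <-> student channel cols[i]
```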
Matching Guided Distillation – Channels Reduction
One student channel may match multiple teacher channels. We reduce the matched teacher channels into one tensor for guiding the student.
Matching Guided Distillation – Distillation
After reducing the teacher channels, we distill the student with a partial distance training loss, such as the L2 loss.
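A minimal sketch of this loss, assuming the matched teacher channels have already been reduced to a tensor with the student's channel count; F.mse_loss stands in here for the paper's partial L2 distance, and the names are illustrative.

```python
import torch.nn.functional as F

def distill_loss(r_t, f_s):
    """r_t: (N, Cs, H, W) reduced teacher features; f_s: (N, Cs, H, W) student features."""
    return F.mse_loss(f_s, r_t.detach())  # plain L2 distance; no gradient flows into the teacher
```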
Matching Guided Distillation – Coordinate Descent Optimization
The overall training takes a coordinate-descent approach, alternating between two optimization objectives: updating the flow-guided matrix M and updating the student parameters with SGD.
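An illustrative loop for this coordinate-descent scheme, not the authors' API: the matching M is re-solved periodically with the student frozen, then held fixed while the student is trained by SGD. The helpers `update_matching`, `task_loss`, and `mgd_loss`, the `update_every` interval, and the assumption that the student returns (features, logits) are all placeholders.

```python
import torch

def train_mgd(teacher, student, loader, optimizer,
              task_loss, mgd_loss, update_matching,
              num_epochs, update_every=5):
    M = None
    for epoch in range(num_epochs):
        if epoch % update_every == 0:
            # Step 1: hold the student parameters fixed, re-solve the matching M.
            M = update_matching(teacher, student, loader)
        for images, labels in loader:
            # Step 2: hold M fixed, train the student parameters with SGD.
            with torch.no_grad():
                f_t = teacher(images)               # teacher features carry no gradient
            f_s, logits = student(images)
            loss = task_loss(logits, labels) + mgd_loss(f_t, f_s, M)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```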
Matching Guided Distillation Reduction Methods
Matching Guided Distillation – Channels Reduction We propose three efficient methods for reducing teacher channels: Sparse Matching, Random Drop and Absolute Max Pooling.
Matching Guided Distillation – Sparse Matching
Each student channel matches only its most related teacher channel; unmatched teacher channels are ignored.
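Under the same illustrative cost matrix as in the matching sketch above, Sparse Matching reduces to a per-student-channel argmin:

```python
def sparse_match(cost):
    """cost: (Ct, Cs) teacher-student channel distances (as in the matching sketch)."""
    return cost.argmin(dim=0)  # for each student channel, index of its closest teacher channel
```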
Matching Guided Distillation – Random Drop
For each student channel, we sample one teacher channel at random from the set of teacher channels matched to it.
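A one-line sketch of Random Drop, assuming (hypothetically) that the matching is stored as a dict from each student channel index to its matched teacher channel indices:

```python
import random

def random_drop(matched):
    """matched: dict mapping a student channel index to its matched teacher channel indices."""
    return {s: random.choice(teachers) for s, teachers in matched.items()}
```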
Matching Guided Distillation – Absolute Max Pooling
To keep both positive and negative feature information from the teacher, we propose a novel pooling mechanism that reduces features by keeping, at each feature location, the value with the largest absolute magnitude.
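A minimal sketch of Absolute Max Pooling, assuming the K teacher channels matched to one student channel are stacked into an (N, K, H, W) tensor (an illustrative layout): at each spatial location the element with the largest magnitude wins, so strong negative responses survive alongside positive ones.

```python
import torch

def absolute_max_pool(group):
    """group: (N, K, H, W), the K teacher channels matched to one student channel."""
    idx = group.abs().argmax(dim=1, keepdim=True)   # location-wise winner by |value|
    return torch.gather(group, 1, idx).squeeze(1)   # (N, H, W), original signs preserved
```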
Matching Guided Distillation Main Results
Results – Fine-Grained Recognition on CUB-200
Top-1 accuracy gains: +3.97% and +5.44%
Results – Large-Scale Classification on ImageNet-1K
Top-1 accuracy gains: +1.83% and +2.6%
Results – Object Detection and Instance Segmentation on COCO
Summary
• MGD is lightweight and efficient for various tasks
• MGD removes the channel-number constraint between teacher and student, so it is flexible to plug into a network
• MGD is friendly for distilling a pre-trained student
• Project webpage: http://kaiyuyue.com/mgd