Matching Guided Distillation ECCV 2020 Kaiyu Yue, Jiangfan Deng, and Feng Zhou Algorithm Research, Aibee Inc.
Motivation
Distillation Obstacle
• The gap in semantic feature structure between the intermediate features of the teacher and the student

Classic Scheme
• Transform intermediate features by adding adaptation modules, such as a conv layer

Problems
• 1) The adaptation module brings extra parameters into training
• 2) An adaptation module with random initialization or a special transformation is not friendly for distilling a pre-trained student
Matching Guided Distillation Framework
Matching Guided Distillation – Matching
Given two feature sets from the teacher and the student, we use the Hungarian method to compute the flow-guided matrix M. The flow-guided matrix M encodes the matched relationships between teacher and student channels.
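As a hedged illustration of this step (not the authors' code), the sketch below builds a channel-wise cost matrix from L2 distances between flattened feature maps and solves the assignment with SciPy's Hungarian solver. Note that in MGD one student channel can absorb several teacher channels; this minimal version shows only the basic one-to-one assignment, and all names are illustrative.

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_channels(f_t, f_s):
    """f_t: (N, Ct, H, W) teacher features; f_s: (N, Cs, H, W) student features."""
    t = f_t.permute(1, 0, 2, 3).flatten(1)   # (Ct, N*H*W): one row per teacher channel
    s = f_s.permute(1, 0, 2, 3).flatten(1)   # (Cs, N*H*W): one row per student channel
    cost = torch.cdist(t, s)                 # (Ct, Cs) pairwise L2 distances
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return rows, cols                        # teacher channel rows[i] <-> student channel cols[i]
```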
Matching Guided Distillation – Channels Reduction
One student channel may match multiple teacher channels. We reduce the matched teacher channels into one tensor for guiding the student.
Matching Guided Distillation – Distillation
After reducing the teacher channels, we distill the student with a partial distance training loss, such as the L2 loss.
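A minimal sketch of this loss, assuming the matched teacher channels have already been reduced to a tensor with the student's channel count; F.mse_loss stands in here for the paper's partial L2 distance, and the names are illustrative.

```python
import torch.nn.functional as F

def distill_loss(r_t, f_s):
    """r_t: (N, Cs, H, W) reduced teacher features; f_s: (N, Cs, H, W) student features."""
    return F.mse_loss(f_s, r_t.detach())  # plain L2 distance; no gradient flows into the teacher
```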
Matching Guided Distillation – Coordinate Descent Optimization
The overall training takes a coordinate-descent approach, alternating between two optimization objectives: updating the flow-guided matrix M and updating the student parameters with SGD.
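An illustrative loop for this coordinate-descent scheme, not the authors' API: the matching M is re-solved periodically with the student frozen, then held fixed while the student is trained by SGD. The helpers `update_matching`, `task_loss`, and `mgd_loss`, the `update_every` interval, and the assumption that the student returns (features, logits) are all placeholders.

```python
import torch

def train_mgd(teacher, student, loader, optimizer,
              task_loss, mgd_loss, update_matching,
              num_epochs, update_every=5):
    M = None
    for epoch in range(num_epochs):
        if epoch % update_every == 0:
            # Step 1: hold the student parameters fixed, re-solve the matching M.
            M = update_matching(teacher, student, loader)
        for images, labels in loader:
            # Step 2: hold M fixed, train the student parameters with SGD.
            with torch.no_grad():
                f_t = teacher(images)               # teacher features carry no gradient
            f_s, logits = student(images)
            loss = task_loss(logits, labels) + mgd_loss(f_t, f_s, M)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```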
Matching Guided Distillation Reduction Methods
Matching Guided Distillation – Channels Reduction We propose three efficient methods for reducing teacher channels: Sparse Matching, Random Drop and Absolute Max Pooling.
Matching Guided Distillation – Sparse Matching
Each student channel matches only its most related teacher channel; unmatched teacher channels are ignored.
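Under the same illustrative cost matrix as in the matching sketch above, Sparse Matching reduces to a per-student-channel argmin:

```python
def sparse_match(cost):
    """cost: (Ct, Cs) teacher-student channel distances (as in the matching sketch)."""
    return cost.argmin(dim=0)  # for each student channel, index of its closest teacher channel
```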
Matching Guided Distillation – Random Drop
For each student channel, we sample one teacher channel at random from the set of teacher channels matched to it.
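A one-line sketch of Random Drop, assuming (hypothetically) that the matching is stored as a dict from each student channel index to its matched teacher channel indices:

```python
import random

def random_drop(matched):
    """matched: dict mapping a student channel index to its matched teacher channel indices."""
    return {s: random.choice(teachers) for s, teachers in matched.items()}
```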
Matching Guided Distillation – Absolute Max Pooling
To keep both positive and negative feature information from the teacher, we propose a novel pooling mechanism that reduces features by keeping, at each feature location, the value with the largest absolute magnitude.
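A minimal sketch of Absolute Max Pooling, assuming the K teacher channels matched to one student channel are stacked into an (N, K, H, W) tensor (an illustrative layout): at each spatial location the element with the largest magnitude wins, so strong negative responses survive alongside positive ones.

```python
import torch

def absolute_max_pool(group):
    """group: (N, K, H, W), the K teacher channels matched to one student channel."""
    idx = group.abs().argmax(dim=1, keepdim=True)   # location-wise winner by |value|
    return torch.gather(group, 1, idx).squeeze(1)   # (N, H, W), original signs preserved
```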
Matching Guided Distillation Main Results
Results – Fine-Grained Recognition on CUB-200
Top-1 accuracy gains: +3.97% and +5.44%
Results – Large-Scale Classification on ImageNet-1K
Top-1 accuracy gains: +1.83% and +2.6%
Results – Object Detection and Instance Segmentation on COCO
Summary
• MGD is lightweight and efficient for various tasks
• MGD removes the channel-number constraint between teacher and student, so it is flexible to plug into a network
• MGD is friendly for distilling a pre-trained student
• Project webpage: http://kaiyuyue.com/mgd