CS 4803 / 7643: Deep Learning
Topics: Moving beyond supervised learning
Zsolt Kira, Georgia Tech
Administrativia
- Projects!
– Due April 30th
– Template online
– Can use MS Word but follow the organization/rubric!
- No posters/presentations
(C) Zsolt Kira 2
Project Note
- Important note:
– Your project should include doing something beyond just downloading open-source code, fine-tuning, and showing the result
– This can include:
- implementation of additional approaches (even if leveraging open-source code),
- a thorough analysis/investigation of some phenomena or hypothesis, or
- theoretical analysis
- When using external resources, provide references to anything you used in the write-up!
Supervised Learning
- ML has been focused largely on this
- Lots of other problem settings are now coming up:
○ What if we have unlabeled data?
○ What if we have many datasets?
○ What if we only have one example per (new) class?
But wait, there’s more!
- Transfer Learning
- Semi-supervised learning
- One/Few-shot learning
- Un/Self-Supervised Learning
- Domain adaptation
- Meta-Learning
- Zero-shot learning
- Continual / Lifelong-learning
- Multi-modal learning
- Multi-task learning
- Active learning
- …
Setting               | Source           | Target             | Shift Type
Semi-supervised       | Single labeled   | Single unlabeled   | None
Domain Adaptation     | Single labeled   | Single unlabeled   | Non-semantic
Domain Generalization | Multiple labeled | Unknown            | Non-semantic
Cross-Task Transfer   | Single labeled   | Single unlabeled   | Semantic
Few-Shot Learning     | Single labeled   | Single few-labeled | Semantic
Un/Self-Supervised    | Single unlabeled | Many labeled       | Both/Task
An Entire Class on this!
- Deep Unsupervised Learning class (UC Berkeley)
- Link:
– https://sites.google.com/view/berkeley-cs294-158-sp20/home
But wait, there’s more!
What is Semi-Supervised Learning?
(Figure: Supervised Learning vs. Semi-Supervised Learning)
Slide Credit: Pieter Abbeel et al., CS294-158, UC Berkeley
Semi-Supervised Learning
- Classification: Fully Supervised
○ Training data: (image, label), predict label for new images.
- What if we have a few labeled samples and many unlabeled samples? Labeling is generally time-consuming and expensive in certain domains.
- Semi-Supervised Learning
○ Training data: Labeled data (image, label) and Unlabeled data (image)
○ Goal: Use the unlabeled data to make supervised learning better
○ Note: If we have lots of labeled data, this goal is much harder
Why Semi-Supervised Learning?
Slide: Thang Luong
- My take: Reality might be in-between:
- Might be able to improve upon high-labeled data regime but with exponentially increasing unlabeled data (of the proper type)
- See
Agenda
■ Core concepts
  ■ Confidence vs Entropy
    ■ Pseudo Labeling
    ■ Entropy minimization
    ■ Virtual Adversarial Training
  ■ Label Consistency
    ■ Make sure augmentations of the sample have the same class
    ■ Pi-Model, Temporal Ensembling, Mean Teacher
  ■ Regularization
    ■ Weight decay
    ■ Dropout
    ■ Data-Augmentation (MixUp, CutOut)
■ Unsupervised Data Augmentation (UDA), MixMatch
■ Co-Training / Self-Training / Pseudo Labeling (Noisy Student)
Pseudo Labeling
- Simple idea:
○ Train on labeled data
○ Make predictions on unlabeled data
○ Add confident predictions to training data
○ Can do these both end-to-end (no need to separate stages)
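As a toy illustration of the steps above (not any paper's actual code), a confidence-threshold pseudo-labeling step can be sketched in NumPy; the function name and the 0.95 threshold are illustrative choices:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over class logits."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def pseudo_label(logits, threshold=0.95):
    """Keep only unlabeled samples whose top predicted probability
    clears the confidence threshold; return their indices and the
    hard pseudo-labels to add to the training set."""
    probs = softmax(logits)
    confident = np.where(probs.max(axis=1) >= threshold)[0]
    return confident, probs[confident].argmax(axis=1)
```

Confident rows become new (input, pseudo-label) training pairs; low-confidence rows are simply skipped this round.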
Issue: Confidences on New Data
- Predictions on unlabeled data may be too flat (high entropy)
- Solution: Entropy minimization
- Several ways to achieve this
○ Explicit loss
○ Sharpening function (e.g. temperature scaling)
Image Credit: Figure modified from MixMatch paper
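A minimal sketch of the sharpening option, assuming NumPy: raising a distribution to the power 1/T with T < 1 and renormalizing lowers its entropy (the `sharpen` name follows MixMatch's terminology; the exact form here is illustrative):

```python
import numpy as np

def sharpen(p, T=0.5):
    """Temperature sharpening: raise probabilities to 1/T and
    renormalize. T < 1 makes the distribution more peaked;
    T -> 0 approaches a one-hot vector."""
    q = p ** (1.0 / T)
    return q / q.sum()

def entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())
```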
Label Consistency with Data Augmentation
Could be Unlabeled or Labeled
Make sure that the logits are similar
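The label-consistency idea can be written as a tiny loss term, sketched here in NumPy under the assumption of a Pi-Model-style squared error between the predictions for two augmented views (the function name is mine):

```python
import numpy as np

def softmax(z):
    """Softmax over a single logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def consistency_loss(logits_a, logits_b):
    """Mean squared difference between the class distributions
    predicted for two augmentations of the same input; zero when
    the network treats both views identically."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    return float(((pa - pb) ** 2).mean())
```

Note this needs no label at all, which is why the input could be unlabeled or labeled.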
More Data Augmentation -> Regularization
Realistic Evaluation of Semi-Supervised Learning
Outline
■ Realistic Evaluation of Semi-Supervised Learning
■ pi-model
■ Temporal Ensembling
■ Mean Teacher
■ Virtual Adversarial Training
pi-Model
Temporal Ensembling for Semi-Supervised Learning
Comparison
Varying number of labels
Class Distribution Mismatch
MixMatch
MixUp
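The MixUp ingredient can be sketched in a few lines of NumPy (an illustrative toy, not MixMatch's actual implementation): blend two input/one-hot-label pairs with a Beta-sampled weight, taking `lam = max(lam, 1 - lam)` so the result stays closer to the first pair, as MixMatch does:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.75):
    """Convex combination of two (input, one-hot label) pairs with a
    Beta(alpha, alpha) mixing weight; lam >= 0.5 keeps the mixed
    example dominated by the first pair."""
    lam = np.random.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```

The mixed label stays a valid distribution, so the usual cross-entropy loss applies unchanged.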
FixMatch
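FixMatch combines the two ideas above: a hard pseudo-label from a weakly augmented view supervises the strongly augmented view, masked out when the weak prediction is not confident. A NumPy sketch in that spirit (names and the 0.95 threshold are illustrative, not the paper's code):

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over class logits."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fixmatch_unlabeled_loss(weak_logits, strong_logits, tau=0.95):
    """Cross-entropy between the weak view's hard pseudo-label and the
    strong view's prediction, zeroed for unconfident weak predictions."""
    p_weak = softmax(weak_logits)
    conf = p_weak.max(axis=1)
    pseudo = p_weak.argmax(axis=1)
    p_strong = softmax(strong_logits)
    ce = -np.log(p_strong[np.arange(len(pseudo)), pseudo] + 1e-12)
    return float((ce * (conf >= tau)).mean())
```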
FixMatch - Results
But wait, there’s more!
Few-Shot Learning
Slide Credit: Hugo Larochelle
Normal Approach
- Do what we always do: Fine-tuning
– Train classifier on base classes
– Freeze features
– Learn classifier weights for new classes using small amounts of labeled data (during “query” time!)
A Closer Look at Few-shot Classification, Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, Jia-Bin Huang
Cons of Normal Approach
- The training we do on the base classes does not take the task into account
- No notion that we will be performing a bunch of N-way tests
- Idea: simulate what we will see during test time
Meta-Training Approach
- Set up a set of smaller tasks during training which simulate what we will be doing during testing
– Can optionally pre-train features on held-out base classes (not typical)
- Testing stage is now the same, but with new classes
https://www.borealisai.com/en/blog/tutorial-2-few-shot-learning-and-meta-learning-i/
Meta-Learning Approaches
- Learning a model conditioned on support set
More Sophisticated Meta-Learning Approaches
- Learn gradient descent:
– Parameter initialization and update rules
– Output:
- Parameter initialization
- Meta-learner that decides how to update parameters
- Learn just an initialization and use normal gradient
descent (MAML)
– Output:
- Just parameter initialization!
- We are using SGD
(C) Dhruv Batra & Zsolt Kira 44
Meta-Learner
- How to parametrize learning algorithms?
- Two approaches to defining a meta-learner
– Take inspiration from a known learning algorithm
- kNN/kernel machine: Matching networks (Vinyals et al. 2016)
- Gaussian classifier: Prototypical Networks (Snell et al. 2017)
- Gradient Descent: Meta-Learner LSTM (Ravi &amp; Larochelle, 2017), MAML (Finn et al. 2017)
– Derive it from a black box neural network
- MANN (Santoro et al. 2016)
- SNAIL (Mishra et al. 2018)
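The Gaussian-classifier style of meta-learner (Prototypical Networks) is simple enough to sketch directly; assuming NumPy and pre-computed embeddings, a single episode's classification step might look like this (the function name is mine):

```python
import numpy as np

def proto_classify(support, support_labels, query, n_classes):
    """Prototypical-network step: each class prototype is the mean
    embedding of its support examples; each query is assigned to the
    nearest prototype in Euclidean distance."""
    protos = np.stack([support[support_labels == c].mean(axis=0)
                       for c in range(n_classes)])
    dists = ((query[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)
```

During meta-training the embedding network is learned so that this nearest-prototype rule works well across many sampled N-way episodes.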
Meta-Learner LSTM
Model-Agnostic Meta-Learning (MAML)
Slide Credit: Sergey Levine
Comparison
But wait, there’s more!
Deep Learning is one way to learn features
Deep Unsupervised Learning:
1. Learn representations without labels
2. Subset of Deep Learning, which is a subset of Representation Learning, which is a subset of Machine Learning

Self-Supervised Learning:
1. Often used interchangeably with unsupervised learning
2. Self-Supervised: Create your own supervision through pretext tasks
Motivation
Yann LeCun’s cake
Slide: LeCun
Current List of Tasks
■ Reconstruct from a corrupted (or partial) version
  ■ Denoising Autoencoder
  ■ In-painting
  ■ Colorization, Split-Brain Autoencoder
■ Visual common sense tasks
  ■ Relative patch prediction
  ■ Jigsaw puzzles
  ■ Rotation
■ Contrastive Learning
  ■ word2vec
  ■ Contrastive Predictive Coding (CPC)
  ■ Instance Discrimination
  ■ Recent State-of-the-art progress
Relative Position of Image Patches
Task: Predict the relative position of the second patch with respect to the first
Slide: Zisserman; Doersch, Gupta, Efros
Solving Jigsaw Puzzles
Rotation
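The rotation pretext task is easy to set up: every image yields four training examples, one per rotation, with the rotation index as the free label. A NumPy sketch (the function name is mine):

```python
import numpy as np

def rotation_batch(images):
    """Build the rotation pretext dataset: rotate each image by
    0/90/180/270 degrees and label it with the rotation index
    (0-3), which a network is then trained to predict."""
    xs, ys = [], []
    for img in images:
        for k in range(4):
            xs.append(np.rot90(img, k))
            ys.append(k)
    return np.stack(xs), np.array(ys)
```

Predicting the rotation forces the network to recognize canonical object orientation, a useful feature for downstream tasks.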
Instance Discrimination
(Figure: attract / repel)
- 1. MoCo
- 2. SimCLR
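Both MoCo and SimCLR train with a contrastive (InfoNCE-style) objective: the embedding of one augmented view should score high against the other view of the same image ("attract") and low against other images ("repel"). A single-query NumPy sketch, with cosine similarity and an illustrative temperature:

```python
import numpy as np

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss for one query embedding: cross-entropy of picking
    the positive key among [positive] + negatives, scored by cosine
    similarity divided by a temperature."""
    unit = lambda v: v / np.linalg.norm(v)
    keys = np.stack([unit(positive)] + [unit(n) for n in negatives])
    logits = keys @ unit(query) / temperature
    logits = logits - logits.max()  # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))
```

The loss is near zero when the query aligns with its positive and is far from all negatives, and grows when a negative looks more similar than the positive.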
Momentum Contrast (MoCo)
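MoCo's distinguishing trick is the momentum-updated key encoder: instead of backpropagating into it, its parameters track the query encoder as an exponential moving average, so keys stored in the queue stay consistent over time. Sketched on plain parameter lists (an illustration, not the paper's code):

```python
def momentum_update(key_params, query_params, m=0.999):
    """MoCo key-encoder update: each key parameter moves a small
    step (1 - m) toward the corresponding query parameter."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]
```

With m close to 1 the key encoder evolves slowly; a separate fixed-size queue of past keys then supplies the large set of negatives.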
Results
Results – Object Detection
We have come a long way
- ML background – error decomposition, overfitting, features, etc.
- Linear classifiers
– & softmax
- Computation Graph
- Gradient Descent
- Adding layers
- Backpropagation, automatic differentiation
- Optimization – regularization/normalization (batch norm, dropout), augmentation, different optimizers (adam, adagrad, etc.)
- Convolution and Pooling layers
- Modern CNNs - AlexNet, VGG, Inception, ResNet
- 3D CNNs
- Recurrent Neural Networks and LSTMs
– NLP, word/sentence vectors, attention, etc.
- Unsupervised feature learning
- Generative models (GANs, VAEs)
- Deep Reinforcement Learning
- Other applications: Few-Shot Learning, structure
Things to Watch out For
- Research is cyclical
– SVMs, boosting, probabilistic graphical models &amp; Bayes Nets, Structural Learning, Sparse Coding, Deep Learning
– Deep learning is unique in its depth and breadth, but...
– Deep learning may be improved, reinvented, combined, overtaken
- Learn fundamentals for techniques across the field:
– Know the span of ML techniques and choose the ones that fit your problem!
– Be responsible in 1) how you use it, 2) promises you make and how you convey it
- Try to understand landscape of the field
– Look out for what is coming up next, not where we are
- Have fun!
Some current/upcoming topics
- Current / Recent Past
– AutoML
– Meta-learning
– Unsupervised, semi-supervised, domain adaptation, zero/one/few-shot learning
– Continual/lifelong learning without forgetting
– Memory
– Visual question answering, embodied question answering
– Adversarial Examples
- More recent
– Deep Learning and logic!
– Deep Learning and SAT problems
– World modeling, learning intuitive/physics models
– Visual dialogue, agents, chatbots
– Fixing reinforcement learning
- First you have to admit you have a problem
- Exploration and world modeling
– Simulation frameworks, joint perception, planning, and action
- Navigation, mapping
– Just scaling everything up and watching the magic!