Semi-Supervised Learning
Barnabas Poczos
Slides Courtesy: Jerry Zhu, Aarti Singh
Supervised Learning
Labeled training data → Learning algorithm → Prediction rule
Feature Space X, Label Space Y
Goal: The optimal predictor (Bayes rule) depends on the unknown P_XY, so instead we learn a good prediction rule from the training data.
Labels require a human expert, special equipment, or an experiment: e.g. “Crystal”, “Needle”, “Empty”.
Unlabeled data: cheap and abundant! Labeled data: expensive and scarce!
Example label spaces: “0”, “1”, “2”, … (handwritten digits); “Sports”, “News”, “Science”, … (document topics)
Luis von Ahn: Games with a purpose (reCAPTCHA). A word that is challenging for OCR (Optical Character Recognition) is shown to users; by typing it, you provide a free label!
Supervised learning (SL) vs. semi-supervised learning (SSL): in SSL, the learning algorithm receives unlabeled data in addition to the labeled data (e.g. images labeled “Crystal”).
Goal: Learn a better prediction rule than is possible based on the labeled data alone.
Assume each class is a coherent group (e.g. Gaussian). Then unlabeled data can help identify the boundary more accurately.
[Figure: positive and negative labeled data plus unlabeled data; the supervised decision boundary vs. the semi-supervised decision boundary]
[Figure: handwritten digit images to be labeled “0”, “1”, “2”, …]
“Similar” data points have “similar” labels
Such an embedding can be computed by manifold learning algorithms.
▪ Self-Training
▪ Generative methods, mixture models
▪ Graph-based methods
▪ Co-Training
▪ Semi-supervised SVM
▪ Many others
Propagating 1-NN
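A minimal Python sketch of this idea (the function and its details are my illustration, not the slides' exact algorithm): repeatedly find the unlabeled point closest to the current labeled set, give it the label of its nearest labeled neighbor, then treat it as labeled and repeat.

```python
def propagate_1nn(labeled, unlabeled):
    """Propagating 1-NN self-training on 2-D points (toy sketch).

    labeled:   list of ((x, y), label) pairs
    unlabeled: list of (x, y) points
    """
    labeled, unlabeled = list(labeled), list(unlabeled)

    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    while unlabeled:
        # Find the closest (unlabeled, labeled) pair ...
        u, (lp, lab) = min(
            ((u, l) for u in unlabeled for l in labeled),
            key=lambda pair: dist2(pair[0], pair[1][0]),
        )
        labeled.append((u, lab))  # ... and copy the neighbor's label
        unlabeled.remove(u)
    return labeled
```

With two well-separated clusters, the labels spread outward from the two labeled seeds one nearest neighbor at a time.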
Generative methods: estimate the parameters of the class-conditional models from the labeled data. Decision for any test point x not in the labeled dataset: predict y = 1 if P(y = 1 | x) > 1/2, and y = 0 otherwise.
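As a hedged illustration of that decision rule (assuming a 1-D two-class Gaussian model with a uniform class prior; the function names are mine):

```python
import math

def fit_gaussian(xs):
    """Maximum-likelihood mean and variance from labeled samples."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var

def gauss_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def predict(x, pos_samples, neg_samples):
    """Return 1 iff P(y = 1 | x) > 1/2 under a uniform class prior."""
    mu1, v1 = fit_gaussian(pos_samples)
    mu0, v0 = fit_gaussian(neg_samples)
    p1, p0 = gauss_pdf(x, mu1, v1), gauss_pdf(x, mu0, v0)
    return 1 if p1 / (p1 + p0) > 0.5 else 0
```

In the full semi-supervised version, unlabeled data would additionally refine the mixture parameters (e.g. via EM); the sketch above fits the parameters from labeled data only.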
Assumption: Similar unlabeled data have similar labels.
Similarity Graphs: Model local neighborhood relations between data points
Assumption: Nodes connected by heavy edges tend to have similar labels
If data points i and j are similar (i.e., the weight w_ij is large), then their labels are similar: f_i ≈ f_j
Minimize: a loss on the labeled data (mean-square or 0-1) plus a graph-based smoothness prior, e.g. min_f Σ_{i labeled} (f_i − y_i)² + λ Σ_{i,j} w_ij (f_i − f_j)²
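One standard way to approximately minimize such an objective is iterative label propagation. A minimal sketch, assuming a graph given as a weighted adjacency list and labels in {−1, +1} (the function name and representation are my choices):

```python
def label_propagation(adj, labels, iters=200):
    """adj:    {node: [(neighbor, weight), ...]}, undirected graph
    labels: {node: +1 or -1} for the labeled nodes only."""
    f = {i: float(labels.get(i, 0.0)) for i in adj}
    for _ in range(iters):
        for i in adj:
            if i in labels:
                continue  # labeled nodes stay clamped to their labels
            total = sum(w for _, w in adj[i])
            # Each unlabeled score becomes the weighted average of its
            # neighbors' scores, smoothing f along heavy edges.
            f[i] = sum(w * f[j] for j, w in adj[i]) / total
    return {i: (1 if v >= 0 else -1) for i, v in f.items()}
```

On a chain 0–1–2–3 with node 0 labeled +1 and node 3 labeled −1, the interior scores converge to ±1/3, so the chain splits in the middle.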
Co-training (Blum & Mitchell, 1998) (Mitchell, 1999) assumes that (i) the features can be split into two sets, and (ii) each sub-feature set is sufficient to train a good classifier.
Two classifiers are first trained on the labeled data, using the two sub-feature sets respectively.
Each classifier then classifies the unlabeled data and teaches the other classifier with the few unlabeled examples (and predicted labels) it feels most confident about.
Each classifier is retrained with the additional examples given by the other classifier, and the process repeats.
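A toy sketch of this loop (hypothetical: a nearest-centroid "classifier" per view stands in for the real learners, and all names are mine):

```python
def fit_centroids(vals, ys):
    """Per-class mean for one 1-D feature view (toy stand-in classifier)."""
    return {y: sum(v for v, yy in zip(vals, ys) if yy == y) /
               sum(1 for yy in ys if yy == y)
            for y in set(ys)}

def predict_conf(centroids, v):
    """Predicted class and a simple confidence margin for value v."""
    ranked = sorted(centroids, key=lambda y: abs(v - centroids[y]))
    margin = abs(v - centroids[ranked[-1]]) - abs(v - centroids[ranked[0]])
    return ranked[0], margin

def co_train(labeled, unlabeled, rounds=3):
    """labeled: [((view1, view2), y)]; unlabeled: [(view1, view2)]."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        ys = [y for _, y in labeled]
        views = [fit_centroids([x[v] for x, _ in labeled], ys) for v in (0, 1)]
        # Each view's classifier labels the point it is most confident
        # about and adds it to the shared training set, so the other
        # view's classifier learns from it on the next refit.
        for v, model in enumerate(views):
            if not unlabeled:
                return labeled
            best = max(unlabeled, key=lambda x: predict_conf(model, x[v])[1])
            labeled.append((best, predict_conf(model, best[v])[0]))
            unlabeled.remove(best)
    return labeled
```

With two classes whose two views agree (e.g. points near (0, 0) vs. near (10, 10)), each round confidently labels one point per view and the labeled set grows.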
Blum & Mitchell’98
▪ Generative methods
▪ Graph-based methods
▪ Co-Training
▪ Semi-Supervised SVMs
▪ Many other methods
SSL algorithms can use unlabeled data to help improve prediction accuracy if the data satisfies appropriate assumptions.