Stefan Roth, 11.05.2012 | Department of Computer Science | GRIS
Statistical Machine Learning
A Crash Course
Part I: Basics
Machine Learning
What is ML? What is its goal?
Develop a machine / an algorithm that learns to perform a task from experience (example data).
The learning system should get better at its task as it gains experience.
Humans learn many skills from examples rather than from explicit rules, e.g. learning to read and handwrite.
[Figure: histograms of fish length (count vs. length) for salmon and sea bass, with a decision threshold l*]
Examples of learning tasks: understanding a spoken sentence, recognizing digits, etc.
[Figure: example data sets, including a classification problem with labels +1 and −1]
Training data → learn → learned parameters (model)
Test data (different from the training data) → predict → predicted output
Example application: learning steering control for autonomous driving.
[Figure: fish features (width and lightness) for salmon and sea bass; how should a new point '?' be classified, and where should the decision boundary go?]
[Figure: fish features (width and lightness) for salmon and sea bass with a decision boundary]
Occam's Razor: prefer the simplest model that explains the data.
Defining the task precisely is not trivial: what does "recognizing speech" really mean? Recognizing every word? For all speakers?
■ Recommended book:
- C. M. Bishop: Pattern Recognition and Machine Learning. Springer, 2006. ISBN 0-387-31073-8 (very good book, but not an easy read).
■ Other useful books:
- R. O. Duda, P. E. Hart, D. G. Stork: Pattern Classification, 2nd edition. Wiley, 2001. ISBN 0-471-05669-3 (new version of a classic).
- D. J. C. MacKay: Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003. ISBN 0-521-64298-1 (free download at http://www.inference.phy.cam.ac.uk/mackay/itila/book.html).
- … (perspective from Bayesian statistics).
- T. Hastie, R. Tibshirani, J. Friedman: The Elements of Statistical Learning. Springer, 2001. ISBN 0-387-95284-5 (the statistical perspective).
- … (but getting outdated).
Example: two boxes, a red one (B = r) and a blue one (B = b), containing fruit, oranges (F = o) and apples (F = a).
We pick the red box 60% of the time and the blue box 40% of the time, then pick a piece of fruit from the box with equal probability:
p(B = r) = 0.6,  p(B = b) = 0.4
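The sum rule, product rule, and Bayes' theorem can be checked numerically on this example. A minimal sketch in Python, assuming (purely for illustration, the slide does not give them) fruit proportions of 75% oranges in the red box and 25% in the blue box:

# Box-and-fruit example: sum rule, product rule, and Bayes' theorem.
# The fruit proportions inside each box are assumed, not taken from the slide.

p_box = {"r": 0.6, "b": 0.4}                 # priors p(B) from the slide
p_orange_given_box = {"r": 0.75, "b": 0.25}  # assumed p(F = o | B)

# Sum rule + product rule: p(F = o) = sum_B p(F = o | B) p(B)
p_orange = sum(p_orange_given_box[b] * p_box[b] for b in p_box)

# Bayes' theorem: p(B = r | F = o) = p(F = o | B = r) p(B = r) / p(F = o)
p_red_given_orange = p_orange_given_box["r"] * p_box["r"] / p_orange

print(f"p(F = o) = {p_orange:.3f}")
print(f"p(B = r | F = o) = {p_red_given_orange:.3f}")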
Notation: we distinguish between a random variable itself and a value that the random variable can take. p(B) denotes the distribution of the particular random variable B; p(B = r) denotes the probability of the random variable B being r.
Shorthand: p(x) denotes the probability that the random variable takes on the specific value x.
Joint distribution p(X, Y): the sum rule gives the marginal, p(X) = Σ_Y p(X, Y), and the product rule factorizes the joint, p(X, Y) = p(Y | X) p(X).
Bayes' theorem (Thomas Bayes, 1701-1761):
p(Y | X) = p(X | Y) p(Y) / p(X)
Continuous random variables take values in a continuous range of values: probabilities are defined through a density p(x), e.g. P(x ∈ (a, b)) = ∫_a^b p(x) dx.
[Figure: density p(x), cumulative distribution P(x), and the probability mass p(x) δx of a small interval δx]
The sum and product rules carry over to densities; in general
p(y) = ∫ p(x, y) dx
These definitions apply not only to scalar random variables, but also to random vectors.
Classification: given an observation (feature) x, determine the probability that it comes from some class Ck.
Statistical machine learning assumes that the data is drawn from an underlying probability distribution. This assumption that the data was somehow generated is always there, even if you don't see a single probability distribution anywhere.
Class-conditional probability (likelihood) p(x | Ck): the probability of observing feature x for class Ck.
Bayes' theorem gives the probability of the class given the observation (feature):

p(Ck | x) = p(x | Ck) p(Ck) / p(x)

p(Ck | x): class posterior; p(x | Ck): class-conditional probability (likelihood); p(Ck): class prior; p(x): normalization term.
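A small sketch of how these quantities combine into a decision, assuming two classes with made-up priors and Gaussian class-conditional densities (scipy.stats.norm is used only to evaluate the Gaussian pdf):

import numpy as np
from scipy.stats import norm

# Assumed class priors and Gaussian class-conditional densities p(x | Ck).
priors = np.array([0.4, 0.6])
likelihoods = [norm(loc=2.0, scale=1.0), norm(loc=5.0, scale=1.5)]

def class_posteriors(x):
    """Return p(Ck | x) for all classes via Bayes' theorem."""
    joint = np.array([lik.pdf(x) * prior for lik, prior in zip(likelihoods, priors)])
    return joint / joint.sum()   # divide by the normalization term p(x)

x = 3.2
post = class_posteriors(x)
print("posteriors:", post, "-> decide class", post.argmax() + 1)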
[Figure: class posteriors as a function of x and the resulting decision boundary]
Two interpretations of probability: the frequentist view, probability as the relative frequency of an event happening, and the Bayesian view, probability as a degree of belief, e.g. the belief that an observation comes from a certain class. The Bayesian interpretation was quite contentious in statistics for a long time.
[Figure: joint densities p(x, C1) and p(x, C2) over x; the decision boundary x0 partitions the input into decision regions R1 and R2]
For the decision we can compare p(x | Ck) p(Ck) directly: we do not need the normalization!
To minimize the misclassification rate, choose the class with the largest posterior probability: decide C1 if p(C1 | x) > p(C2 | x), otherwise decide C2.
The decision rule partitions the feature space into decision regions. An example may help clarify this.
Special case, 0-1 loss: minimizing the expected loss leads to the same decision rule that minimizes the misclassification rate.
In practice these probabilities are not given; they have to be estimated from training data: the observed features (data) and, for supervised problems, the corresponding labels (classes).
[Figure: Gaussian density over x with mean µ]
Parametric density estimation: model the probability density with parameters θ, i.e. p(x | θ).
Maximum likelihood estimation: given data X = {x1, ..., xN}, assumed to be i.i.d. (independent and identically distributed), and p(x | θ) (our parametric density), the likelihood of the parameters is
L(θ) = p(X | θ) = ∏_{n=1..N} p(xn | θ)
It is usually easier to maximize the log-likelihood
log L(θ) = Σ_{n=1..N} log p(xn | θ)
For a Gaussian p(x | µ, σ²), setting the derivatives with respect to µ and σ² to zero gives
µ̂ = (1/N) Σ_{n=1..N} xn,    σ̂² = (1/N) Σ_{n=1..N} (xn − µ̂)²
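A quick numerical check of these estimators on synthetic data (the true parameters below are chosen arbitrarily for illustration):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)     # synthetic i.i.d. Gaussian data

mu_hat = x.mean()                                 # ML estimate of the mean
sigma2_hat = ((x - mu_hat) ** 2).mean()           # ML estimate of the variance (1/N)

print(f"mu_hat = {mu_hat:.3f}, sigma2_hat = {sigma2_hat:.3f} (true: 2.0, 2.25)")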
Multivariate Gaussian:
N(x | µ, Σ) = 1 / ((2π)^(d/2) |Σ|^(1/2)) exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )
x: d-dimensional random vector; µ: mean (d x 1 vector); Σ: covariance matrix, symmetric, invertible (d x d matrix); |Σ|: determinant.
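A sketch of evaluating this density directly from the formula; the values of µ and Σ below are made up for illustration:

import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate the d-dimensional Gaussian density N(x | mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(gaussian_density(np.array([0.5, 0.5]), mu, Sigma))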
[Figure: contours of 2D Gaussians over (x1, x2) for different covariance structures: (a) general case (full covariance), (b) axis aligned (diagonal covariance), (c) spherical (isotropic covariance)]
Nonparametric density estimation: what if we do not know what functional form the density takes (or we do not know what class of function we need)?
[Figure: density estimates with different amounts of smoothing: not smooth enough, about right, too smooth]
Given enough data, any (smooth) density can be approximated arbitrarily well.
[Figure: grids over D = 1, 2, and 3 dimensions; the number of cells grows exponentially with the dimension (curse of dimensionality)]
General nonparametric density estimation: consider a small region (e.g. a hypercube) around the point x that contains K of the N data points; with V the volume of the region,
p(x) ≈ K / (N V)
Two strategies:
- fix V and determine K from the data: kernel density estimation
- fix K and determine V from the data: K-nearest neighbor
Parzen window (hypercube kernel):
k(u) = 1 if |uj| ≤ 1/2 for all j = 1, . . . , d, and 0 otherwise
K = Σ_{n=1..N} k((x − xn) / h),   so   p(x) ≈ (1/N) Σ_{n=1..N} (1/h^d) k((x − xn) / h)
A smooth density estimate results from using a Gaussian kernel:
p(x) = (1/N) Σ_{n=1..N} N(x | xn, h²) = (1/N) Σ_{n=1..N} (2πh²)^(−d/2) exp( −||x − xn||² / (2h²) )
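A sketch of this estimator in one dimension; the bandwidth h and the synthetic data are chosen arbitrarily for illustration:

import numpy as np

def kde_gaussian(x_query, data, h):
    """Kernel density estimate p(x) = (1/N) sum_n N(x | x_n, h^2), 1D case."""
    diffs = x_query[:, None] - data[None, :]                       # shape (Q, N)
    kernels = np.exp(-diffs ** 2 / (2 * h ** 2)) / np.sqrt(2 * np.pi * h ** 2)
    return kernels.mean(axis=1)

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(1, 1.0, 200)])
xs = np.linspace(-4, 4, 9)
print(kde_gaussian(xs, data, h=0.3))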
[Figure: kernel density estimates for different bandwidths h: too smooth, about right, not smooth enough]
K-nearest-neighbor density estimation: fix K and determine V from the data by growing a sphere around x until K data points fall into the sphere; then p(x) ≈ K / (N V). (Kernel density estimation instead keeps V fixed and determines K.)
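A sketch of the K-nearest-neighbor estimate in one dimension, where the "sphere" is an interval and V is twice the distance to the K-th neighbor; data and K are illustrative:

import numpy as np

def knn_density(x_query, data, K):
    """p(x) ~ K / (N * V), with V the length of the smallest interval
    around x containing the K nearest data points (1D case)."""
    N = len(data)
    dists = np.abs(data[None, :] - x_query[:, None])     # (Q, N)
    r_k = np.sort(dists, axis=1)[:, K - 1]               # distance to the K-th neighbor
    V = 2 * r_k
    return K / (N * V)

rng = np.random.default_rng(2)
data = rng.normal(0, 1, 500)
print(knn_density(np.linspace(-2, 2, 5), data, K=10))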
[Figure: K-nearest-neighbor density estimates for different values of K: not smooth enough, about right, too smooth]
K-nearest neighbor classification: among the K nearest training points of x, count how many belong to each class; the class posteriors are approximated by p(Ck | x) ≈ Kk / K, so x is assigned to the most frequent class among its K nearest neighbors.
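A minimal sketch of such a classifier (Euclidean distance, majority vote); the toy data are made up for illustration:

import numpy as np

def knn_classify(x_query, X_train, y_train, K=3):
    """Assign each query point to the majority class among its K nearest neighbors."""
    dists = np.linalg.norm(X_train[None, :, :] - x_query[:, None, :], axis=2)  # (Q, N)
    nn_idx = np.argsort(dists, axis=1)[:, :K]
    nn_labels = y_train[nn_idx]                                                # (Q, K)
    return np.array([np.bincount(labels).argmax() for labels in nn_labels])

# Tiny toy data with two classes (0 and 1).
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(np.array([[0.1, 0.0], [1.0, 0.9]]), X_train, y_train, K=3))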
Mixture Models
[Figure: a mixture of three Gaussians with mixing coefficients 0.5, 0.3, 0.2; (a) contours of the individual components, (b) contours and surface plot of the mixture density]
With a sufficient number M of components, a mixture of Gaussians can approximate every (smooth) density arbitrarily well.
The mixture density integrates to 1:
∫ p(x) dx = Σ_{j=1..M} πj ∫ p(x | j) dx = Σ_{j=1..M} πj = 1,   with 0 ≤ πj ≤ 1
Mixture density:
p(x) = Σ_{j=1..M} πj p(x | j)
p(x): mixture density; p(x | j): mixture component; πj: "weight" of mixture component j.
Maximum likelihood for the mean of component j:
µj = Σ_{n=1..N} p(j | xn) xn / Σ_{n=1..N} p(j | xn)
The responsibilities p(j | xn) themselves depend on the component parameters: a circular dependency, no analytical solution!
Idea: alternate between computing responsibilities and updating the components. Simple example with two components and N data points:
Maximum likelihood for component 1:
µ1 = Σ_{n=1..N} p(1 | xn) xn / Σ_{n=1..N} p(1 | xn)
Maximum likelihood for component 2:
µ2 = Σ_{n=1..N} p(2 | xn) xn / Σ_{n=1..N} p(2 | xn)
The responsibilities follow from Bayes' theorem:
p(j | x) = πj p(x | j) / Σ_{j'=1..M} πj' p(x | j')
EM (expectation maximization). E-step: compute the responsibilities for each component and for all data points:
αnj = πj N(xn | µj, σj) / Σ_{i=1..M} πi N(xn | µi, σi)
M-step: re-estimate the parameters of each component using the responsibilities:
µ̂j = (1 / N̂j) Σ_{n=1..N} αnj xn
σ̂j² = (1 / N̂j) Σ_{n=1..N} αnj (xn − µ̂j)²
π̂j = N̂j / N
with the "soft count" N̂j = Σ_{n=1..N} αnj. E-step and M-step are iterated until convergence.
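A compact sketch of these E- and M-steps for a one-dimensional mixture of two Gaussian components on synthetic data; the initialization, the fixed number of iterations, and all names are illustrative choices:

import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def em_gmm_1d(x, M=2, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    N = len(x)
    pi = np.full(M, 1.0 / M)                     # mixing weights
    mu = rng.choice(x, size=M, replace=False)    # initialize means from the data
    sigma = np.full(M, x.std())
    for _ in range(iters):
        # E-step: responsibilities alpha[n, j]
        alpha = pi * gaussian(x[:, None], mu, sigma)
        alpha /= alpha.sum(axis=1, keepdims=True)
        # M-step: soft counts and parameter updates
        Nj = alpha.sum(axis=0)
        mu = (alpha * x[:, None]).sum(axis=0) / Nj
        sigma = np.sqrt((alpha * (x[:, None] - mu) ** 2).sum(axis=0) / Nj)
        pi = Nj / N
    return pi, mu, sigma

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 1.0, 700)])
print(em_gmm_1d(x))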
[Figure: EM fitting a two-component Gaussian mixture to 2D data; panels (a)-(f) show the initialization and the fit after successive iterations]
References on EM:
- A. P. Dempster, N. M. Laird, D. B. Rubin: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, Vol. 39, 1977.
- J. A. Bilmes: A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. TR-97-021, ICSI, U.C. Berkeley, CA, USA.
- R. M. Neal, G. E. Hinton: A view of the EM algorithm that justifies incremental, sparse, and other variants. In: Learning in Graphical Models, M. I. Jordan (editor).
EM can also be used to fit mixtures of different parametric distributions; the Gaussian mixture is by far the most common one. More generally, EM is a standard tool for fitting latent variable models.
Clustering: group unlabeled data points into coherent groups.
[Figure: scatter plot of unlabeled 2D data with apparent cluster structure]
Agglomerative clustering (bottom-up):
  Make each point a separate cluster
  Until the clustering is satisfactory
    Merge the two clusters with the smallest inter-cluster distance
  end

Divisive clustering (top-down):
  Construct a single cluster containing all points
  Until the clustering is satisfactory
    Split the cluster that yields the two components with the largest inter-cluster distance
  end

[Forsyth & Ponce]
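A sketch of the agglomerative variant with a single-linkage inter-cluster distance; stopping at a desired number of clusters is an assumption, not part of the pseudo-code above:

import numpy as np

def single_linkage(X, n_clusters):
    """Agglomerative clustering: start with one cluster per point and repeatedly
    merge the two clusters with the smallest (single-link) inter-cluster distance."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

X = np.array([[0, 0], [0.1, 0.2], [5, 5], [5.1, 4.9], [9, 0]])
print(single_linkage(X, n_clusters=3))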
K-means (Algorithm 16.5, from [Forsyth & Ponce]):
  Choose k data points to act as cluster centers
  Until the cluster centers are unchanged
    Allocate each data point to the cluster whose center is nearest
    Ensure that every cluster has at least one data point, e.g. by supplying empty clusters with a point chosen at random from points far from their cluster center
    Replace the cluster centers with the mean of the elements in their clusters
  end
[Figure: K-means on a 2D data set; panels (a)-(i) show the cluster assignments and the cluster centers over successive iterations]
K-means minimizes the objective
Ψ(clusters, data) = Σ_{i∈clusters} [ Σ_{j∈i-th cluster} ||xj − ci||² ]
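A sketch of the basic K-means loop (without the empty-cluster handling mentioned above); the data and k are made up for illustration:

import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # k data points as centers
    for _ in range(iters):
        # Allocate each point to the nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Replace the centers with the mean of their assigned points.
        new_centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign

X = np.vstack([np.random.default_rng(4).normal(m, 0.3, (50, 2)) for m in (0, 3, 6)])
centers, assign = kmeans(X, k=3)
print(centers)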
Mean shift clustering [Comaniciu & Meer, 02]
A cluster consists of all data points in the attraction basin of a mode of the density: all points whose mean-shift trajectories converge to the same mode receive the same cluster label. [Comaniciu & Meer, 02]
D. Comaniciu, P. Meer: Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 24, No. 5, pp. 603-619, 2002.
Mean shift is based on a kernel density estimate of the data,
f(x) ∝ (1 / (N h^d)) Σ_{i=1..N} k( ||(x − xi) / h||² )
Taking the gradient of this estimate and setting g(y) = −k′(y) yields the mean shift vector [Comaniciu & Meer, 02]:
m(x) = Σ_{i=1..N} xi g(||(x − xi) / h||²) / Σ_{i=1..N} g(||(x − xi) / h||²) − x
Iterating x ← x + m(x) moves x toward a mode of the density.
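A sketch of the resulting mode-seeking iteration with a Gaussian kernel profile, for which g is again Gaussian up to a constant; the bandwidth and the starting point are chosen arbitrarily:

import numpy as np

def mean_shift_mode(x, data, h, iters=100, tol=1e-5):
    """Iterate x <- weighted mean of the data until convergence to a density mode.
    Gaussian profile, so g(u) = exp(-u / 2) up to a constant factor."""
    for _ in range(iters):
        u = np.sum(((x - data) / h) ** 2, axis=1)      # ||(x - x_i)/h||^2
        g = np.exp(-0.5 * u)
        x_new = (g[:, None] * data).sum(axis=0) / g.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

rng = np.random.default_rng(5)
data = np.vstack([rng.normal(0, 0.5, (200, 2)), rng.normal(4, 0.5, (200, 2))])
print(mean_shift_mode(np.array([3.0, 3.5]), data, h=1.0))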
[Figure: mean shift illustration showing the region of interest, the center of mass, and the mean shift vector; objective: find the densest region. From Ukrainitz & Sarel]
Mean shift is thus a general procedure for finding the modes of kernel densities.
Properties of estimators: an estimator θ̂(X) estimates a parameter θ from the data set X.
Bias: the difference between the average estimate and the true value of the parameter,
bias(θ̂) = E[θ̂(X)] − θ
[Figure: distribution p(θ̂(X)) of the estimate, its average E[θ̂(X)], and the true value θ]
Variance: how strongly the estimate θ̂(X) fluctuates around its expectation across data sets,
var(θ̂) = E[ (θ̂(X) − E[θ̂(X)])² ]
[Figure: estimate distributions p(θ̂(X)) with small variance and with large variance]
Example: the maximum likelihood estimators for a Gaussian.
µ̂ = (1/N) Σ_{i=1..N} xi is unbiased: E[µ̂] = µ.
σ̂² = (1/N) Σ_{i=1..N} (xi − µ̂)² is biased: E[σ̂²] = ((N − 1) / N) σ².
An unbiased variance estimate uses 1 / (N − 1) instead of 1 / N.
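A quick numerical check of this bias, drawing many small data sets from a Gaussian with known variance (all numbers are illustrative):

import numpy as np

rng = np.random.default_rng(6)
N, runs, true_var = 5, 100000, 4.0

samples = rng.normal(0.0, np.sqrt(true_var), size=(runs, N))
mu_hat = samples.mean(axis=1, keepdims=True)
var_ml = ((samples - mu_hat) ** 2).mean(axis=1)              # 1/N  -> biased
var_unbiased = ((samples - mu_hat) ** 2).sum(axis=1) / (N - 1)

print("E[var_ml]       ~", var_ml.mean())        # close to (N-1)/N * 4 = 3.2
print("E[var_unbiased] ~", var_unbiased.mean())  # close to 4.0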
The bias-variance tradeoff applies not only to density estimation, but more generally to any estimation problem, e.g. predicting the value of an unknown data point.
[Figure: (bias)², variance, (bias)² + variance, and test error as a function of a parameter of our estimator, e.g. the kernel bandwidth in KDE; flexible models have low bias and high variance, simple models the opposite]
In practice we have to judge an estimator using only the data set that we have, without knowing the true properties of the data. Cross-validation: repeatedly hold out part of the data for testing and train on the rest.
[Figure: 4-fold cross-validation; in runs 1-4 a different fold serves as the test set and the remaining folds as the training set]
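A sketch of such a K-fold split (K = 4 as in the figure), here used to cross-validate a deliberately trivial estimator (the training-fold mean) with squared error; all names and numbers are illustrative:

import numpy as np

def kfold_indices(N, K, seed=0):
    """Split the indices 0..N-1 into K folds; yield (train_idx, test_idx) per run."""
    idx = np.random.default_rng(seed).permutation(N)
    folds = np.array_split(idx, K)
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        yield train, test

# Toy use: estimate the mean on the training folds, measure squared error on the test fold.
x = np.random.default_rng(7).normal(3.0, 1.0, size=40)
errors = []
for train, test in kfold_indices(len(x), K=4):
    mu_hat = x[train].mean()
    errors.append(np.mean((x[test] - mu_hat) ** 2))
print("cross-validated error:", np.mean(errors))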