Statistical Machine Learning: A Crash Course, Part I: Basics

Stefan Roth, 11.05.2012, Department of Computer Science, GRIS


SLIDE 1

Statistical Machine Learning

A Crash Course

Part I: Basics

11.05.2012

SLIDE 2

Machine Learning

■ What is ML? What is its goal? Develop a machine / an algorithm that learns to perform a task from past experience.
■ Why? What for?
  • Fundamental component of every intelligent and/or autonomous system
  • Discovering “rules” and patterns in data
  • Automatic adaptation of systems
  • Attempting to understand human / biological learning

SLIDE 3

Why?

■ Machine learning is becoming increasingly important,
■ ...but may sound intimidating,
■ and often I see people use the wrong tools:
  • “I used neural networks...” [because that sounds neat]
  • “I used <insert super complicated technique> because it works better...”
■ In many cases this complication is not necessary.
■ With a basic understanding and some foundational tools you can get a long way!

SLIDE 4

Goals for today

■ Get to know the basics of statistical machine learning:
  • Bayesian decision theory
  • (Linear) classifiers & SVMs
  • Boosting
  • Nonlinear regression
  • Basic clustering techniques
■ High-level goals:
  • Avoid pitfalls
  • Get to know the important tools
  • Give you a starting point

SLIDE 5

Machine Learning in Action

SLIDE 6

Machine Learning: Examples

■ Example 1: Recognition of handwritten digits
  • These digits are given to us as small digital images.
  • We have to build a “machine” to decide which digit it is.
  • Obvious challenge: There are many different ways in which people write by hand.

SLIDE 7

Machine Learning: Examples

■ Example 2: Classification of fish
  • [Figure: histograms of fish count vs. length for salmon and sea bass, with a decision threshold l* on the length axis]

SLIDE 8

Machine Learning: Examples

■ More examples:
  • Email filtering
  • Speech recognition
  • Vehicle control

SLIDE 9

Impact & Successes

■ Recognition of speech, letters, faces, ...
■ Autonomous vehicle navigation
■ Games
  • Backgammon world champion
  • Chess: Deep Blue vs. Kasparov
■ Google
■ Finding new astronomical structures
■ Fraud detection (credit card applications)
■ ...

SLIDE 10

Machine Learning

■ Develop a machine / an algorithm that learns to perform a task from past experience.
■ Put more abstractly:
  • Our task is to learn a mapping f : I → O from input to output.
  • Put differently, we want to predict the output from the input: y = f(x; θ)
  • Input: x ∈ I (images, text, other measurements, ...)
  • Output: y ∈ O
  • Parameter(s): θ ∈ Θ (that is/are being “learned”)

SLIDE 11

Classification vs. Regression

■ Classification:
  • Learn a mapping into a discrete space, e.g.:
    O = {0, 1}
    O = {0, 1, 2, 3, . . .}
    O = {verb, noun, noun phrase, . . .}
  • Examples: Spam / not spam, sea bass vs. salmon, parsing a sentence, recognizing digits, etc.

SLIDE 12

Classification vs. Regression

■ Regression:
  • Learn a mapping into a continuous space, e.g.:
    O = R
    O = R³
  • Examples: “Curve fitting”, financial analysis, ...
  • [Figure: two regression examples, a fitted curve and a scatter plot with a trend line]

SLIDE 13

General Paradigm

■ Training: training data → learn model → learned parameters θ
■ Testing: test data (different from the training data) → predict (using θ) → predicted output
  • [Figure: digit images as test data with predicted labels 0, 1, 2, 8, 4, 6, 6, 7, 8, 9]

SLIDE 14

What data do we have for training?

■ Data with labels (input / output pairs): supervised learning
  • E.g., image with digit label
  • Sensory data for car with intended steering control
■ Data without labels: unsupervised learning
  • E.g., automatic clustering (grouping) of sounds
  • Clustering of text according to topics
■ Data with and without labels: semi-supervised learning
■ No examples: learning-by-doing
  • Reinforcement learning

SLIDE 15

Some Key Challenges

■ We need generalization!
  • We cannot simply memorize the training set.
■ What if we see an input that we haven’t seen before?
  • Different shape of the digit image (unknown writer)
  • “Dirt” on the picture, etc.
  • We need to learn what is important for carrying out our task.
■ This is one of the most crucial points that we will return to many times.

SLIDE 16

Generalization

■ How do we achieve generalization?
  • [Figure: scatter plot of width vs. lightness for salmon and sea bass, with an unseen query point marked “?”]
SLIDE 17

Generalization

■ How do we achieve generalization?
■ We should not make the model overly complex! (Occam’s Razor)
  • [Figure: the same width vs. lightness scatter plot with a simple decision boundary]

SLIDE 18

Some Key Challenges

■ Input x and its features:
  • Choosing the “right” features is very important.
  • Coding and use of domain knowledge.
  • May allow for invariance (e.g., volume and pitch of voice).
■ Curse of dimensionality:
  • If the features are too high-dimensional, we may run into trouble.
  • Dimensionality reduction

SLIDE 19

Some Key Challenges

■ How do we measure performance?
  • 99% correct classification in speech recognition: What does that really mean?
  • We understand the meaning of the sentence? We understand every word? For all speakers?
■ Need more concrete numbers:
  • % of correctly classified letters
  • average distance driven (until accident...)
  • % of games won
  • % of correctly recognized words, sentences, etc.
■ Training vs. testing performance!

SLIDE 20

Some Key Challenges

■ We also need to define the right error metric:
  • Which is better?
  • Euclidean distance (L2 norm) might be useless.

SLIDE 21

Some Key Challenges

■ Which is the right model?
  • The learned parameters can mean a lot of different things.
  • w may characterize the family of functions or the model space
  • w may index the hypothesis space
  • w may be a vector, an adjacency matrix, a graph, ...

SLIDE 22

Some Key Challenges

■ Even if we have solved the other problems, computation is usually quite hard:
  • Learning often involves some kind of optimization
  • Find (search) the best model parameters
  • Often we have to deal with thousands of training examples
  • Given a model, compute the prediction efficiently

SLIDE 23

Readings

■ Recommended book:
  • Christopher Bishop: Pattern Recognition and Machine Learning. Springer, 2006. ISBN 0-387-31073-8 (very good book, but not an easy read).
■ Other useful books:
  • Duda, Hart & Stork: Pattern Classification. Wiley-Interscience, 2000, 2nd edition. ISBN 0-471-05669-3 (new version of a classic).
  • David J. C. MacKay: Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003. ISBN 0-521-64298-1 (free download at http://www.inference.phy.cam.ac.uk/mackay/itila/book.html).
  • Gelman et al.: Bayesian Data Analysis. CRC Press, 2nd ed., 2004. ISBN 1-584-88388-X (perspective from Bayesian statistics).
  • Hastie, Tibshirani, Friedman: The Elements of Statistical Learning. Springer, 2001. ISBN 0-387-95284-5 (the statistical perspective).
  • Tom Mitchell: Machine Learning. McGraw-Hill, 1997. ISBN 0-07-042807-7 (classic, but getting outdated).

SLIDE 24

Brief Review of Basic Probability

■ What you should already know:
  • Random picking from two boxes (B = r red, B = b blue) containing fruit (F = a apple, F = o orange):
  • Red box: 60% of the time; blue box: 40% of the time:
    p(B = r) = 0.6, p(B = b) = 0.4
  • Pick fruit from a box with equal probability:
    p(F = a|B = r) = p(F = o|B = b) = 0.25
    p(F = o|B = r) = p(F = a|B = b) = 0.75

SLIDE 25

Brief Review of Basic Probability

■ We usually do not mention the random variable (RV) explicitly (for brevity).
■ Instead of p(X = x) we write:
  • p(X) if we want to denote the probability distribution for a particular random variable X.
  • p(x) if we want to denote the value of the probability of the random variable X being x.
  • It should be obvious from the context when we mean the random variable itself and when we mean a value that the random variable can take.
■ Some people use upper case P(X = x) for (discrete) probability distributions. I usually don’t, for brevity.

SLIDE 26

Brief Review of Basic Probability

■ Joint probability p(X, Y):
  • The probability distribution of random variables X and Y taking on a configuration jointly.
  • For example: p(B = b, F = o)
■ Conditional probability p(X|Y):
  • The probability distribution of random variable X given the fact that random variable Y takes on a specific value.
  • For example: p(B = b|F = o)

SLIDE 27

Basic Rules I

■ Probabilities are always non-negative: p(x) ≥ 0
■ Probabilities sum to 1: Σ_x p(x) = 1 ⇒ 0 ≤ p(x) ≤ 1
■ Sum rule or marginalization:
  p(x) = Σ_y p(x, y)
  p(y) = Σ_x p(x, y)
  • p(x) and p(y) are called the marginal distributions of the joint distribution p(x, y).

SLIDE 28

Basic Rules II

■ Product rule: p(x, y) = p(x|y)p(y) = p(y|x)p(x)
  • From this directly follows...
■ Bayes’ rule or Bayes’ theorem:
  p(y|x) = p(x|y)p(y) / p(x)
  • We will use these rules widely.
  • (Rev. Thomas Bayes, 1701-1761)

SLIDE 29

Continuous RVs

■ What if we have continuous random variables, say X = x ∈ R?
  • Any single value has zero probability: Pr(x_0 < X < x_1) = Pr(x_0 ≤ X ≤ x_1)
  • We can only assign a probability for a random variable being in a range of values:
    Pr(x_0 ≤ X ≤ x_1) = ∫_{x_0}^{x_1} p(x) dx
■ Instead we use the probability density p(x).
■ Cumulative distribution function: P(z) = ∫_{−∞}^{z} p(x) dx and P′(x) = p(x)

SLIDE 30

Continuous RVs

■ Probability density function = pdf
■ Cumulative distribution function = cdf
■ We can work with a density (pdf) as if it was a probability distribution:
  • For simplicity we usually use the same notation for both.
  • [Figure: a pdf p(x) and its cdf P(x), with a small interval δx marked]

SLIDE 31

Basic rules for pdfs

■ What are the rules?
  • Non-negativity: p(x) ≥ 0
  • “Summing” to 1: ∫ p(x) dx = 1
  • But: p(x) ≤ 1 does not hold in general (a density can exceed 1).
  • Marginalization: p(x) = ∫ p(x, y) dy, p(y) = ∫ p(x, y) dx
  • Product rule: p(x, y) = p(x|y)p(y) = p(y|x)p(x)

SLIDE 32

Expectations

■ The average value of a function f(x) under a probability distribution p(x) is the expectation:
  E[f] = E[f(x)] = Σ_x f(x)p(x) (discrete case)
  E[f] = ∫ f(x)p(x) dx (continuous case)
■ For joint distributions we sometimes write: E_x[f(x, y)]
■ Conditional expectation: E_{x|y}[f] = E_x[f|y] = Σ_x f(x)p(x|y)

SLIDE 33

Variance and Covariance

■ Variance of a single RV: var[x] = E[(x − E[x])²] = E[x²] − E[x]²
■ Covariance of two RVs: cov(x, y) = E_{x,y}[(x − E[x])(y − E[y])] = E_{x,y}[xy] − E[x]E[y]
■ Random vectors:
  • All we have said so far applies not only to scalar random variables, but also to random vectors.
  • In particular, we have the covariance matrix:
    cov(x, y) = E_{x,y}[(x − E[x])(y − E[y])ᵀ] = E_{x,y}[xyᵀ] − E[x]E[y]ᵀ

SLIDE 34

Bayesian Decision Theory

■ Example: Character recognition
■ Goal: Classify a new letter so that the probability of a wrong classification is minimized.

SLIDE 35

Bayesian Decision Theory

■ 1st concept: Class-conditional probabilities p(x|C_k)
  • Probability of making an observation x knowing that it comes from some class C_k.
  • Here x is often a feature (vector).
  • x measures / describes properties of the data.
  • Examples: # of black pixels, height-width ratio, ...
  • [Figure: class-conditional densities p(x|a) and p(x|b) over the feature x]

SLIDE 36

Statistical Methods

■ Statistical methods in machine learning all have in common that they assume that the process that “generates” the data is governed by the rules of probability.
  • The data is understood to be a set of random samples from some underlying probability distribution.
■ For now, everything will be about probabilities.
■ Later the use of probability will sometimes be much less explicit.
  • Nonetheless, the basic assumption about how the data is generated is always there, even if you don’t see a single probability distribution anywhere.

SLIDE 37

Bayesian Decision Theory

■ 2nd concept: Class priors p(C_k) (a-priori probability of a data point belonging to a particular class)
  • Example: C_1 = a, C_2 = b with p(C_1) = 0.75, p(C_2) = 0.25
  • Generally: Σ_k p(C_k) = 1

SLIDE 38

Bayesian Decision Theory

■ Example: class-conditionals p(x|a) and p(x|b), observation x = 15
■ Question:
  • How do we decide which class the data point x belongs to?
  • Here, we should decide for class a.

SLIDE 39

Bayesian Decision Theory

■ Example: class-conditionals p(x|a) and p(x|b), observation x = 25
■ Question:
  • How do we decide which class the data point x belongs to?
  • Since p(x|a) is a lot smaller than p(x|b), we should now decide for class b.

SLIDE 40

Bayesian Decision Theory

■ Example: class-conditionals p(x|a) and p(x|b), observation x = 20
■ Question:
  • How do we decide which class the data point x belongs to?
  • Remember that p(a) = 0.75 and p(b) = 0.25.
  • This means we should decide for class a.

SLIDE 41

Bayesian Decision Theory

■ Formalize this using Bayes’ theorem:
  • We want to find the a-posteriori probability (posterior) of the class C_k given the observation (feature) x:
    p(C_k|x) = p(x|C_k)p(C_k) / p(x) = p(x|C_k)p(C_k) / Σ_j p(x|C_j)p(C_j)
  • Terminology: p(C_k|x) is the class posterior, p(x|C_k) the class-conditional probability (likelihood), p(C_k) the class prior, and p(x) the normalization term.

SLIDE 42

Bayesian Decision Theory

■ [Figure: class-conditionals p(x|a), p(x|b), joints p(x, a) = p(x|a)p(a), p(x, b) = p(x|b)p(b), and posteriors p(a|x), p(b|x), with the decision boundary marked]

SLIDE 43

Bayesian Decision Theory

■ Why is it called this way?
  • To some extent, because it involves applying Bayes’ rule.
  • But this is not the whole story...
  • The real reason is that it is built on so-called Bayesian probabilities.
■ Bayesian probabilities (the short story):
  • Probability is not just interpreted as the frequency of a certain event happening.
  • Rather, it is seen as a degree of belief in an outcome.
  • Only this allows us to assert a prior belief in a data point coming from a certain class.
  • Even though this might seem easy for you to accept, this interpretation was quite contentious in statistics for a long time.

SLIDE 44

Bayesian Decision Theory

■ Goal: Minimize the misclassification rate, i.e. the probability of a wrong classification:
  p(error) = p(x ∈ R_1, C_2) + p(x ∈ R_2, C_1)
           = ∫_{R_1} p(x, C_2) dx + ∫_{R_2} p(x, C_1) dx
           = ∫_{R_1} p(x|C_2)p(C_2) dx + ∫_{R_2} p(x|C_1)p(C_1) dx
  • [Figure: joints p(x, C_1) and p(x, C_2) with decision regions R_1, R_2 and decision boundary x_0]

SLIDE 45

Bayesian Decision Theory

■ Decision rule:
  • Decide C_1 if p(C_1|x) > p(C_2|x)
  • This is equivalent to p(x|C_1)p(C_1) > p(x|C_2)p(C_2)
  • Which is equivalent to p(x|C_1) / p(x|C_2) > p(C_2) / p(C_1)
  • We do not need the normalization!
■ Bayes optimal classifier:
  • A classifier obeying this rule is called a Bayes optimal classifier.
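To make the rule concrete, here is a minimal sketch of such a classifier in Python, assuming (hypothetically) univariate Gaussian class-conditionals; the means, standard deviations, and priors below are illustrative placeholders echoing the earlier examples, not values from the slides:

    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        # Univariate Gaussian density N(x|mu, sigma).
        return np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

    def bayes_decide(x, class_params, priors):
        # Decide the class with the largest p(x|C_k) p(C_k); p(x) is not needed.
        scores = [gaussian_pdf(x, mu, sigma) * prior
                  for (mu, sigma), prior in zip(class_params, priors)]
        return int(np.argmax(scores))

    # Illustrative numbers: class a (index 0) and class b (index 1).
    class_params = [(15.0, 3.0), (25.0, 3.0)]   # (mean, std) per class
    priors = [0.75, 0.25]                       # p(a), p(b)
    print(bayes_decide(15.0, class_params, priors))  # -> 0, i.e. class a
    print(bayes_decide(25.0, class_params, priors))  # -> 1, i.e. class b

Note that the scores are unnormalized: as stated above, the evidence p(x) is not needed for the decision.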

SLIDE 46

More Classes

■ Generalization to more than 2 classes:
  • Decide for class k if and only if it has the highest a-posteriori probability:
    p(C_k|x) > p(C_j|x) ∀j ≠ k
  • This is equivalent to:
    p(x|C_k)p(C_k) > p(x|C_j)p(C_j) ∀j ≠ k
    p(x|C_k) / p(x|C_j) > p(C_j) / p(C_k) ∀j ≠ k
SLIDE 47

More Classes

■ Decision regions: R_1, R_2, ...
SLIDE 48

More Features

■ Generalization to more than one feature:
  • So far: x ∈ R
  • More generally: x ∈ Rᵈ, with d being the dimensionality of the feature space
  • Example from last time: salmon vs. sea bass with x = (x_1, x_2) ∈ R²
    • x_1: width
    • x_2: lightness
■ Our framework generalizes quite straightforwardly:
  • Multivariate class-conditional densities: p(x|C_k)
  • Etc...

SLIDE 49

Loss Functions

■ So far, we have tried to minimize the misclassification rate.
■ But there are many cases when not every misclassification is equally bad:
  • Smoke detector:
    • If there is a fire, we need to be very sure that we classify it as such.
    • If there is no fire, it is ok to occasionally have a false alarm.
  • Medical diagnosis:
    • If the patient is sick, we need to be very sure that we report them as sick.
    • If they are healthy, it is ok to classify them as sick and order further testing that may help clear this up.

SLIDE 50

Loss Functions

■ Key idea: Introduce a loss function that expresses this, e.g.:
  loss(decision = healthy|patient = sick) >> loss(decision = sick|patient = healthy)
  • Possible decisions: α_i
  • True classes: C_j
  • Loss function: λ(α_i|C_j)
  • Expected loss of making a decision α_i:
    R(α_i|x) = E_{C_k|x}[λ(α_i|C_k)] = Σ_j λ(α_i|C_j) p(C_j|x)

SLIDE 51

Risk Minimization

■ The expected loss of a decision is also called the risk of making a decision:
  R(α_i|x) = E_{C_k|x}[λ(α_i|C_k)] = Σ_j λ(α_i|C_j) p(C_j|x)
■ Instead of minimizing the misclassification rate, we minimize the overall risk.

SLIDE 52

Risk Minimization

■ Example:
  • 2 classes: C_1, C_2
  • 2 decisions: α_1, α_2
  • Loss function: λ(α_i|C_j) = λ_ij
  • Risk of both decisions:
    R(α_1|x) = λ_11 p(C_1|x) + λ_12 p(C_2|x)
    R(α_2|x) = λ_21 p(C_1|x) + λ_22 p(C_2|x)
■ Goal: Decide so that the overall risk is minimized
  • This means: Decide α_1 if R(α_2|x) > R(α_1|x)
  • Decision rule: p(x|C_1) / p(x|C_2) > [(λ_12 − λ_22) / (λ_21 − λ_11)] · [p(C_2) / p(C_1)]

SLIDE 53

Risk Minimization

■ Special case: 0-1 loss, λ(α_i|C_j) = 0 if i = j, 1 if i ≠ j
  • Then: Decide α_1 if
    p(x|C_1) / p(x|C_2) > [(λ_12 − λ_22) / (λ_21 − λ_11)] · [p(C_2) / p(C_1)] = p(C_2) / p(C_1)
  • This is the same decision rule that minimized the misclassification rate.
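The effect of an asymmetric loss can be seen in a small numeric sketch (the loss values and posteriors are made up for illustration): with 0-1 loss we would pick the more probable class, but an asymmetric λ can flip the decision:

    import numpy as np

    def expected_risk(lam, posterior):
        # R(alpha_i|x) = sum_j lambda(alpha_i|C_j) p(C_j|x), one entry per decision.
        return lam @ posterior

    # Hypothetical asymmetric loss: rows = decisions, cols = true classes.
    lam = np.array([[0.0, 10.0],   # decide "healthy": very costly if patient is sick
                    [1.0,  0.0]])  # decide "sick": mild cost if patient is healthy
    posterior = np.array([0.7, 0.3])   # p(C_1|x), p(C_2|x)
    print(expected_risk(lam, posterior))             # [3.0, 0.7]
    print(np.argmin(expected_risk(lam, posterior)))  # -> 1: decide "sick" although p(C_1|x) > p(C_2|x)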

SLIDE 54

Bayesian Decision Theory

■ We are done with classification. No?
  • We have decision rules for simple and general loss functions.
  • Even “Bayes optimal” ones.
  • We can deal with 2 or more classes.
  • We can deal with high-dimensional feature vectors.
  • We can incorporate prior knowledge on the class distribution.
■ What are we going to do for the rest of today?
  • Where is the catch?
■ Where do we get these probability distributions from?

SLIDE 55

Training Data

■ How do we get the probability distributions from this so that we can classify with them?
  • [Figure: a 2-d scatter plot of labeled training data]

SLIDE 56

Probability Density Estimation

■ So far:
  • Bayes optimal classification
  • Based on the probability distributions p(x|C_k)p(C_k)
  • The prior p(C_k) is easy to deal with:
    • We can just “count” the number of occurrences of each class in the training data.
■ We need to estimate (learn) the class-conditional probability density p(x|C_k):
  • Supervised training: We know the data points and their true labels (classes).
  • Estimate the density p(x|C_k) separately for each class C_k.
  • “Abbreviation”: p(x) = p(x|C_k)

SLIDE 57

Probability Density Estimation

■ (Training) data: x_1, x_2, x_3, x_4, . . .
■ Estimation: p(x)
■ Methods:
  • Parametric representation / model
  • Non-parametric representation
  • Mixture models

SLIDE 58

Parametric Models

■ Simplest case: Gaussian distribution
  p(x|µ, σ) = 1/(√(2π) σ) exp(−(x − µ)² / (2σ²))
  • µ: mean
  • σ²: variance
■ Notation for parametric density models: p(x|θ)
  • For the Gaussian case: θ = (µ, σ)

SLIDE 59

Maximum Likelihood Method

■ Learning = estimation of the parameters θ given the data X = {x_1, x_2, x_3, . . . , x_N}
■ Likelihood of θ:
  • Defined as the probability that the data X was generated from the probability density with parameters θ.
  • Likelihood: L(θ) = p(X|θ)

SLIDE 60

Maximum Likelihood Method

■ Computing the likelihood...
  • of a single datum: p(x_n|θ) (our parametric density)
  • of all data?
  • Assumption: The data is i.i.d. (independent and identically distributed):
    L(θ) = p(X|θ) = Π_{n=1}^{N} p(x_n|θ)
■ Log-likelihood: log L(θ) = log p(X|θ) = Σ_{n=1}^{N} log p(x_n|θ)
■ Maximize the (log-)likelihood w.r.t. θ.

SLIDE 61

Maximum Likelihood Method

■ Maximum likelihood estimation of a Gaussian:
  log L(θ) = log p(X|θ) = Σ_{n=1}^{N} log p(x_n|µ, σ)
  • Take the partial derivatives and set them to 0.
■ Closed-form solution:
  µ̂ = (1/N) Σ_{n=1}^{N} x_n
  σ̂² = (1/N) Σ_{n=1}^{N} (x_n − µ̂)²
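These closed-form estimators are one-liners in practice. A small NumPy sketch on synthetic data (the true parameters µ = 2.0, σ = 1.5 are chosen for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(loc=2.0, scale=1.5, size=10_000)   # synthetic Gaussian data

    mu_hat = X.mean()                       # (1/N) sum_n x_n
    sigma2_hat = ((X - mu_hat)**2).mean()   # (1/N) sum_n (x_n - mu_hat)^2
    print(mu_hat, np.sqrt(sigma2_hat))      # close to 2.0 and 1.5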

SLIDE 62

Maximum Likelihood Method

■ Likelihood: L(θ) = p(X|θ) = Π_{n=1}^{N} p(x_n|θ)
  • [Figure: the likelihood p(X|θ) as a function of θ, with its maximum at θ̂]

SLIDE 63

Multivariate Gaussians

■ Before we move on, we should look at the multivariate case of a Gaussian:
  N(x|µ, Σ) = 1/((2π)^(d/2) |Σ|^(1/2)) exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ))
  • x: d-dimensional random vector
  • µ: mean (d × 1 vector)
  • Σ: covariance matrix (symmetric, invertible d × d matrix); |Σ|: its determinant

SLIDE 64

Multivariate Gaussians

■ Some 2-dimensional Gaussians:
  • general case: Σ = [a b; b c]
  • axis-aligned: Σ = diag(σ_1², σ_2²)
  • spherical: Σ = σ²I
  • [Figure: contour plots of the three cases over (x_1, x_2)]

SLIDE 65

Non-parametric Methods

■ Non-parametric representations: Why?
  • Often we do not know what functional form the class-conditional density takes (or we do not know what class of function we need).
■ Here: The probability density is estimated directly from the data (i.e. without an explicit parametric model):
  • Histograms
  • Kernel density estimation (Parzen windows)
  • K-nearest neighbors

SLIDE 66

Histograms

■ Discretize the feature space into bins:
  • [Figure: three histograms of the same data; too many bins: not smooth enough, too few bins: too smooth, and about right]
SLIDE 67

Histograms

■ Properties:
  • Very general: in the infinite data limit any probability density can be approximated arbitrarily well.
  • At the same time: a brute-force method.
■ Problems:
  • High-dimensional feature spaces
    • Exponential increase in the # of bins
    • Hence requires exponentially much data
    • “Curse of dimensionality”
  • Size of the bins?
  • [Figure: bin grids for D = 1, 2, 3 dimensions]

SLIDE 68

More well-founded approach

■ Data point x is sampled from the probability density p(x).
  • Probability that x is in region R:
    Pr(x ∈ R) = ∫_R p(y) dy
■ If R is sufficiently small, then p(y) is almost constant:
  Pr(x ∈ R) = ∫_R p(y) dy ≈ p(x) · V
  • V: volume of region R
■ If N is sufficiently large, we can estimate Pr(x ∈ R):
  Pr(x ∈ R) ≈ K/N (K: # of data points falling into R)
  ⇒ p(x) ≈ K/(N·V)

SLIDE 69

More well-founded approach

■ p(x) ≈ K/(N·V) for data X = {x⁽¹⁾, . . . , x⁽ᴺ⁾}
■ Kernel density estimation: fix V, determine K
  • Example: Determine the # of data points in a fixed hypercube.
■ K-nearest neighbor: fix K, determine V

SLIDE 70

Kernel Density Estimation (KDE)

■ Parzen window approach:
  • Hypercubes in d dimensions with edge length h:
    H(u) = 1 if |u_j| ≤ h/2 for j = 1, . . . , d; 0 otherwise
    V = ∫ H(u) du = hᵈ
    K(x) = Σ_{n=1}^{N} H(x − x⁽ⁿ⁾)
    p(x) ≈ K(x)/(N·V) = (1/(N hᵈ)) Σ_{n=1}^{N} H(x − x⁽ⁿ⁾)

SLIDE 71

Kernel Density Estimation (KDE)

■ In general:
  • Arbitrary kernel: k(u) ≥ 0, ∫ k(u) du = 1
  • K(x) = Σ_{n=1}^{N} k(||x − x⁽ⁿ⁾|| / h), V = hᵈ
  • p(x) ≈ K/(N·V) = (1/(N hᵈ)) Σ_{n=1}^{N} k(||x − x⁽ⁿ⁾|| / h)

SLIDE 72

Kernel Density Estimation (KDE)

■ Common kernels:
  • Gaussian kernel: k(u) = (1/√(2π)) exp(−u²/2)
    • Problem: the kernel has infinite support
    • Requires a lot of computation
  • Parzen window: k(u) = 1 if |u| ≤ 1/2, 0 otherwise
    • Not very smooth results
  • Epanechnikov kernel: k(u) = max(0, ¾(1 − u²))
    • Smoother, but finite support
■ Problem:
  • We have to select the kernel bandwidth h appropriately.
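A minimal sketch of a 1-d Gaussian KDE following the formula above (data and bandwidth are illustrative; in d = 1 the normalization is 1/(N h)):

    import numpy as np

    def gaussian_kde(x_query, X, h):
        # p(x) ≈ (1/(N h)) sum_n k((x - x_n)/h) with the Gaussian kernel, d = 1.
        u = (x_query[:, None] - X[None, :]) / h        # shape (Q, N)
        k = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # kernel values
        return k.mean(axis=1) / h                      # average over data, scale by 1/h

    rng = np.random.default_rng(1)
    X = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 0.5, 100)])
    xs = np.linspace(-3, 7, 5)
    print(gaussian_kde(xs, X, h=0.5))  # density estimates; vary h to see the bandwidth effect
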
SLIDE 73

Gaussian KDE Example

■ [Figure: Gaussian KDE with three bandwidths; too small: not smooth enough, too large: too smooth, and about right]

SLIDE 74

More well-founded approach

■ Kernel density estimation (fix V, determine K): p(x) ≈ K/(N·V)
■ K-nearest neighbor (fix K, determine V): p(x) ≈ K/(N·V(x))
  • Increase the size of a sphere until K data points fall into the sphere.

SLIDE 75

K-Nearest Neighbor (kNN): Example

■ [Figure: kNN density estimates for three values of K; too small: not smooth enough, too large: too smooth, and about right]

SLIDE 76

K-Nearest Neighbor (kNN)

■ k-nearest neighbor classification via Bayesian classification:
  p(x|C_j) ≈ K_j/(N_j·V), p(C_j) ≈ N_j/N, p(x) ≈ K/(N·V)
  p(C_j|x) = p(x|C_j)p(C_j) / p(x) ≈ [K_j/(N_j·V)] · [N_j/N] · [N·V/K] = K_j/K
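In code, this posterior reduces to majority voting among the K nearest training points. A minimal sketch (toy data; Euclidean distance assumed):

    import numpy as np

    def knn_classify(x_query, X, y, K):
        # Posterior p(C_j|x) ≈ K_j / K: vote among the K nearest training points.
        dist = np.linalg.norm(X - x_query, axis=1)     # distances to all points
        nearest = np.argsort(dist)[:K]                 # indices of the K nearest
        labels, counts = np.unique(y[nearest], return_counts=True)
        return labels[np.argmax(counts)]               # majority class

    # Hypothetical 2-d toy data: class 0 around (0,0), class 1 around (3,3).
    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    y = np.array([0]*50 + [1]*50)
    print(knn_classify(np.array([2.5, 2.5]), X, y, K=5))   # likely 1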

SLIDE 77

Bias-Variance Problem

■ Nonparametric probability density estimation:
  • Histograms: size of the bins?
    • too large: too smooth (too much bias)
    • too small: not smooth enough (too much variance)
  • Kernel density estimation: kernel bandwidth?
    • h too large: too smooth
    • h too small: not smooth enough
  • K-nearest neighbor: number of neighbors?
    • K too large: too smooth
    • K too small: not smooth enough
■ This is a general problem of many density estimation approaches
  • including parametric models and mixture models

SLIDE 78

Mixture Models

■ Parametric (e.g. Gaussian):
  • good analytic properties
  • simple
  • small memory requirements
  • fast
■ Nonparametric (e.g. KDE, kNN):
  • general
  • large memory requirements
  • slow

SLIDE 79

Mixture Models

■ [Figure: a mixture of three Gaussians with weights 0.5, 0.3, 0.2, shown as individual components and as the combined density]

SLIDE 80

Mixture of Gaussians (MoG)

■ Sum of individual Gaussian distributions:
  p(x) = Σ_{j=1}^{M} p(x|j) p(j)
  • In the limit (i.e. with many mixture components) this can approximate every (smooth) density p(x).

SLIDE 81

Mixture of Gaussians

■ Remarks:
  • The mixture density p(x) = Σ_{j=1}^{M} p(x|j)p(j) integrates to 1: ∫ p(x) dx = 1
  • p(j) = π_j with 0 ≤ π_j ≤ 1 and Σ_{j=1}^{M} π_j = 1
  • p(x|j) = N(x|µ_j, σ_j) = 1/(√(2π) σ_j) exp(−(x − µ_j)² / (2σ_j²))
  • The mixture parameters are: θ = {µ_1, σ_1, π_1, . . . , µ_M, σ_M, π_M}

SLIDE 82

Mixture of Gaussians

■ “Generative model”: p(x) = Σ_{j=1}^{M} p(x|j)p(j) (mixture density)
  • p(j): “weight” of mixture component j
  • p(x|j): mixture component
  • To generate a sample, first draw a component j with probability p(j), then draw x from p(x|j).
  • [Figure: three components j = 1, 2, 3 and the resulting density p(x)]
slide-83
SLIDE 83

Stefan Roth, 11.05.2012 | Department of Computer Science | GRIS | 83

■ Maximum likelihood estimation:

  • maximize

Mixture of Gaussians

⇒ µj = N

n=1 p(j|xn)xn

N

n=1 p(j|xn)

L = log L(θ) =

N

  • n=1

log p(xn|θ) µj L ∂L ∂µj = 0

Circular dependency No analytical solution!

SLIDE 84

Mixture of Gaussians

■ Maximum likelihood estimation:
  • maximize L = log L(θ) = Σ_{n=1}^{N} log p(x_n|θ) w.r.t. µ_j (∂L/∂µ_j = 0)
■ Gradient ascent?
  • Complex gradient (nonlinear, circular dependencies)
  • Optimization of one Gaussian component depends on all other components

SLIDE 85

Mixture of Gaussians

■ Different strategy:
■ observed data: x
■ unobserved: the component label j of each point (e.g. 1 1 1 1 2 2 2 2)
  • unobserved = hidden or latent variable j|x:
    p(j = 1|x): 1 1 1 1 0 0 0 0
    p(j = 2|x): 0 0 0 0 1 1 1 1

SLIDE 86

Mixture of Gaussians

■ Suppose we knew the assignments (1 1 1 1 2 2 2 2), i.e.
  p(j = 1|x): 1 1 1 1 0 0 0 0 and p(j = 2|x): 0 0 0 0 1 1 1 1
  • maximum likelihood for component 1: µ_1 = [Σ_{n=1}^{N} p(1|x_n) x_n] / [Σ_{n=1}^{N} p(1|x_n)]
  • maximum likelihood for component 2: µ_2 = [Σ_{n=1}^{N} p(2|x_n) x_n] / [Σ_{n=1}^{N} p(2|x_n)]

SLIDE 87

Mixture of Gaussians

■ Suppose we had a guess about the distribution.
■ Then we can compute the probability of each mixture component for every data point:
  • e.g. p(j = 1|x) = p(x|1)p(1) / p(x) = p(x|1)π_1 / Σ_{j=1}^{M} p(x|j)π_j

SLIDE 88

EM for Gaussian Mixtures

■ Algorithm:
  • Initialize with some parameters: µ_1, σ_1, π_1, . . .
■ Loop:
  • E-step: Compute the posterior distribution for each mixture component and for all data points:
    α_nj = p(j|x_n) = π_j N(x_n|µ_j, σ_j) / Σ_{i=1}^{M} π_i N(x_n|µ_i, σ_i)
  • The α_nj are also called the responsibilities.

SLIDE 89

EM for Gaussian Mixtures

■ Algorithm:
  • Initialize with some parameters: µ_1, σ_1, π_1, . . .
■ Loop:
  • M-step: Compute the new parameters using weighted estimates:
    N_j = Σ_{n=1}^{N} α_nj (the “soft count”)
    µ_j^new = (1/N_j) Σ_{n=1}^{N} α_nj x_n
    (σ_j^new)² = (1/N_j) Σ_{n=1}^{N} α_nj (x_n − µ_j^new)²
    π_j^new = N_j / N
SLIDE 90

Expectation Maximization (EM)

■ [Figure: six panels (a)-(f) showing EM iterations on 2-d data, with the component estimates converging]

SLIDE 91

How many components?

■ How many mixture components do we need?
  • More components will typically lead to a better likelihood.
  • But are more components necessarily better? No! Overfitting!
■ Automatic selection (simple):
  • Find the number of components that maximizes the Akaike information criterion (AIC):
    log p(X|θ_ML) − K, where K is the # of parameters
  • Or find the number of components that maximizes the Bayesian information criterion (BIC):
    log p(X|θ_ML) − ½ K log N, where N is the # of data points

SLIDE 92

EM Readings

■ EM standard reference:
  • A. P. Dempster, N. M. Laird, D. B. Rubin: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, Vol. 39, 1977.
■ EM tutorial:
  • Jeff A. Bilmes: A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models. TR-97-021, ICSI, U.C. Berkeley, CA, USA.
■ Modern interpretation:
  • Neal, R. M. and Hinton, G. E.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models, M. I. Jordan (editor).

SLIDE 93

Before we move on...

■ ...it is important to understand that...
■ Mixture models are much more general than mixtures of Gaussians:
  • One can have mixtures of any parametric distribution, and even mixtures of different parametric distributions.
  • Gaussian mixtures are only one of many possibilities, though by far the most common one.
■ Expectation maximization is not just for fitting mixtures of Gaussians:
  • One can fit other mixture models with EM.
  • EM is still more general, in that it applies to many other hidden variable models.

SLIDE 94

Brief Aside: Clustering

■ The context in which we introduced mixture models was density estimation.
■ But they are also very useful for clustering:
  • Goal:
    • Divide the feature space into meaningful groups.
    • Find the group assignment.
  • Unsupervised learning.
  • [Figure: a 2-d scatter plot of unlabeled data forming visible groups]

SLIDE 95

Simple Clustering Methods

■ Agglomerative clustering:
    Make each point a separate cluster
    Until the clustering is satisfactory
      Merge the two clusters with the smallest inter-cluster distance
    end
■ Divisive clustering:
    Construct a single cluster containing all points
    Until the clustering is satisfactory
      Split the cluster that yields the two components with the largest inter-cluster distance
    end
  [Forsyth & Ponce]

SLIDE 96

K-Means Clustering

    Choose k data points to act as cluster centers
    Until the cluster centers are unchanged
      Allocate each data point to the cluster whose center is nearest
      Now ensure that every cluster has at least one data point; a possible technique
        for doing this is supplying empty clusters with a point chosen at random from
        points far from their cluster center
      Replace the cluster centers with the mean of the elements in their clusters
    end

  Algorithm 16.5: Clustering by K-Means, from [Forsyth & Ponce]
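A minimal NumPy sketch of this algorithm (the empty-cluster rule is simplified to re-seeding with a random point rather than a far-away one):

    import numpy as np

    def kmeans(X, k, rng, iters=100):
        # Alternate assignment and mean updates until the centers stop moving.
        centers = X[rng.choice(len(X), k, replace=False)]   # k data points as centers
        for _ in range(iters):
            # Allocate each point to the nearest center.
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            assign = d.argmin(axis=1)
            new_centers = np.array([X[assign == i].mean(axis=0) if np.any(assign == i)
                                    else X[rng.integers(len(X))]   # re-seed empty clusters
                                    for i in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, assign

    rng = np.random.default_rng(5)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
    centers, assign = kmeans(X, 2, rng)
    print(centers)   # near (0,0) and (5,5)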

SLIDE 97

K-Means Clustering

■ [Figure: nine panels (a)-(i) showing K-means iterations on 2-d data: assignments and center updates until convergence]

SLIDE 98

K-Means Clustering

■ K-means is quite easy to implement and reasonably fast.
■ Other nice property: We can understand it as the local optimization of an objective function:
  Ψ(clusters, data) = Σ_{i ∈ clusters} [ Σ_{j ∈ i-th cluster} ||x_j − c_i||² ]

SLIDE 99

Mean Shift Clustering

■ Mean shift is a method for finding modes in a cloud of data points, i.e. the places where the points are most dense.
  [Comaniciu & Meer, 02]

SLIDE 100

Mean Shift

■ The mean shift procedure tries to find the modes of a kernel density estimate through local search.
  • The black lines indicate various search paths starting at different points.
  • Paths that converge at the same point get assigned the same label.
  [Comaniciu & Meer, 02]

SLIDE 101

Mean Shift

■ Start with a kernel density estimate:
  f̂(x) = (1/(N hᵈ)) Σ_{i=1}^{N} k(||(x − x_i)/h||²)
■ We can derive the mean shift procedure by taking the gradient of the kernel density estimate.
■ For details see:
  • D. Comaniciu, P. Meer: Mean Shift: A Robust Approach toward Feature Space Analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 24, No. 5, 603-619, 2002.
SLIDE 102

Mean Shift

■ Procedure:
  • Start at a random data point.
  • Compute the mean shift vector:
    m_{h,g}(x) = [Σ_{i=1}^{N} x_i g(||(x − x_i)/h||²)] / [Σ_{i=1}^{N} g(||(x − x_i)/h||²)] − x
  • Here g(y) = −k′(y).
  • Move the current point by the mean shift vector: x ← x + m_{h,g}(x)
  • Repeat until convergence.
  [Comaniciu & Meer, 02]
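A minimal sketch of the procedure for a single starting point, assuming a Gaussian kernel profile (so the weights g are themselves Gaussian); the data and bandwidth are illustrative:

    import numpy as np

    def mean_shift_point(x, X, h, tol=1e-5, max_iter=500):
        # Follow the mean shift vector from x until it (almost) vanishes.
        for _ in range(max_iter):
            w = np.exp(-0.5 * np.sum(((x - X) / h)**2, axis=1))   # g(||(x - x_i)/h||^2)
            m = (w[:, None] * X).sum(axis=0) / w.sum() - x        # mean shift vector
            x = x + m
            if np.linalg.norm(m) < tol:
                break
        return x   # a mode of the kernel density estimate

    rng = np.random.default_rng(6)
    X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(4, 0.5, (100, 2))])
    print(mean_shift_point(X[0].copy(), X, h=1.0))    # converges near (0, 0)
    print(mean_shift_point(X[150].copy(), X, h=1.0))  # converges near (4, 4)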

SLIDE 103

Illustration

■ Intuitive description: starting from a region of interest, repeatedly compute the center of mass and shift the region by the mean shift vector.
■ Objective: Find the densest region.
  (From Ukrainitz & Sarel)

SLIDE 104

Evaluation

■ What have we seen so far:
  • Classification using the Bayes classifier.
  • Probability density estimation to estimate the class-conditional densities.
■ How do we know how well we are carrying out each of these tasks?
■ We need a way of evaluating performance:
  • for density estimation (or really: parameter estimation)
  • for the classifier as a whole

SLIDE 105

Bias and Variance

■ As we saw, maximum likelihood is just one possible parameter estimator.
  • How can we assess how good an estimator is?
  • Assume that we have an estimator θ̂ that estimates the parameter θ from the data set X.
■ Bias of an estimator:
  • Expected deviation from the real parameter: bias(θ̂) = E_X[θ̂(X) − θ]
■ Variance of an estimator:
  var(θ̂) = E_X[(θ̂(X) − E_X[θ̂(X)])²]
SLIDE 106

Bias of an Estimator

■ The estimate θ̂(X) is random, because we assumed that X is a random sample from a true, underlying distribution:
  • An estimator is biased if the average estimate E[θ̂(X)] differs from the true value of the parameter θ.
  • Otherwise it is called unbiased.
  • [Figure: the sampling distribution p(θ̂(X)) with the true value θ and the average estimate E[θ̂(X)] marked]

SLIDE 107

Variance of an Estimator

■ Ideally, we want an unbiased estimator with small variance.
■ In practice, this is not that easy, as we will see shortly.
  • [Figure: two sampling distributions p(θ̂(X)), one with small and one with large variance around E[θ̂(X)]]

SLIDE 108

Example: MLE of a Gaussian

■ Let’s compute the bias of the ML estimate of the mean of a Gaussian:
  E[µ̂(X) − µ] = E[(1/N) Σ_{i=1}^{N} x_i] − µ = (1/N) Σ_{i=1}^{N} E[x_i] − µ = (1/N) Σ_{i=1}^{N} µ − µ = µ − µ = 0
  • The MLE of the mean of a Gaussian is unbiased.

SLIDE 109

Example: MLE of a Gaussian

■ Are all MLEs unbiased? No:
  E[σ̂²(X) − σ²] = · · · = ((N − 1)/N) σ² − σ²
  • The MLE of the variance of a Gaussian is biased.
■ We can easily give an unbiased estimator:
  σ̃²(X) = (1/(N − 1)) Σ_{i=1}^{N} (x_i − X̄)²
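The bias is easy to see empirically. A small simulation (the parameters are chosen for illustration) compares the averages of the biased MLE and the unbiased estimator over many data sets:

    import numpy as np

    rng = np.random.default_rng(7)
    sigma2 = 4.0
    N, trials = 5, 100_000

    X = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
    mle = X.var(axis=1, ddof=0)        # (1/N) sum (x_i - mean)^2     -> biased
    unbiased = X.var(axis=1, ddof=1)   # (1/(N-1)) sum (x_i - mean)^2 -> unbiased
    print(mle.mean())        # ≈ (N-1)/N * sigma2 = 3.2
    print(unbiased.mean())   # ≈ sigma2 = 4.0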

SLIDE 110

Why do we care?

■ The mean-squared error of an estimator can be expressed in terms of bias and variance:
  MSE(θ̂) = E[(θ̂(X) − θ)²] = bias(θ̂)² + var(θ̂)
■ More importantly:
  • The notion of bias and variance does not only apply to parameter estimation, but more generally to any estimation problem.
  • When we do classification, we want to estimate the class for an unknown data point.

SLIDE 111

Bias-Variance Tradeoff

■ Typical situation:
  • [Figure: (bias)², variance, and (bias)² + variance ≈ test error, plotted against a parameter of the estimator, e.g. the kernel bandwidth in KDE; simple models on one end, flexible models on the other]

SLIDE 112

Bias-Variance Tradeoff

■ Our learning algorithm will only generalize well if we find the right tradeoff between bias and variance:
  • Simple enough to prevent “overfitting” to the particular training data set that we have.
  • Yet expressive enough to be able to represent the important properties of the data.
■ To ensure that our learning algorithm works well, we have to evaluate it on test data.
  • But what if we don’t have any?

SLIDE 113

Cross-Validation

■ k-fold cross-validation:
  • Split the training set into k similarly sized groups.
  • Train on k−1 of the groups and test on the remaining one.
  • Repeat for all k possible combinations.
■ Special case:
  • Leave-one-out error: k = N
  • [Figure: four runs of 4-fold cross-validation, with a different group held out for testing in each run]
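A minimal sketch of the k-fold split and loop; train_and_test stands for any hypothetical routine that fits a model on the training fold and returns a test score:

    import numpy as np

    def k_fold_indices(N, k, rng):
        # Split indices 0..N-1 into k similarly sized groups.
        idx = rng.permutation(N)
        return np.array_split(idx, k)

    def cross_validate(X, y, k, train_and_test, rng):
        # For each fold: train on the other k-1 groups, test on the held-out one.
        folds = k_fold_indices(len(X), k, rng)
        scores = []
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            scores.append(train_and_test(X[train], y[train], X[test], y[test]))
        return np.mean(scores)   # average test score over the k runs

    # Example usage: plug in any classifier, e.g. the kNN classifier from SLIDE 76,
    # wrapped so that it returns an accuracy on the held-out fold.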