

SLIDE 1

CS480/680 Lecture 4: May 15, 2019

Statistical Learning ([RN] Sec. 20.1, 20.2; [M] Sec. 2.2, 3.2)

CS480/680 Spring 2019, Pascal Poupart, University of Waterloo

SLIDE 2

Statistical Learning

  • View: we have uncertain knowledge of the world
  • Idea: learning simply reduces this uncertainty

SLIDE 3

Terminology

  • Probability distribution:

    – A specification of a probability for each event in our sample space
    – Probabilities must sum to 1

  • Assume the world is described by two (or more) random variables:

    – Joint probability distribution: a specification of probabilities for all combinations of events

SLIDE 4

Joint distribution

  • Given two random variables X and Y:
  • Joint distribution:

    Pr(X = x ∧ Y = y) for all x, y

  • Marginalisation (sum-out rule):

    Pr(X = x) = Σ_y Pr(X = x ∧ Y = y)
    Pr(Y = y) = Σ_x Pr(X = x ∧ Y = y)
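To make the sum-out rule concrete, here is a minimal Python sketch (not from the slides; the joint-distribution values are assumed for illustration):

```python
# Minimal sketch: marginalizing a discrete joint distribution stored as a
# dict mapping (x, y) outcome pairs to probabilities (values are assumed).
joint = {
    ("x1", "y1"): 0.3, ("x1", "y2"): 0.2,
    ("x2", "y1"): 0.1, ("x2", "y2"): 0.4,
}

def marginal_x(joint, x):
    """Pr(X = x) = sum over y of Pr(X = x and Y = y)."""
    return sum(p for (xv, _), p in joint.items() if xv == x)

def marginal_y(joint, y):
    """Pr(Y = y) = sum over x of Pr(X = x and Y = y)."""
    return sum(p for (_, yv), p in joint.items() if yv == y)

print(marginal_x(joint, "x1"))  # 0.5
print(marginal_y(joint, "y2"))  # 0.6
```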

SLIDE 5

Example: Joint Distribution

  sunny:
                cold     ~cold
    headache    0.108    0.012
    ~headache   0.016    0.064

  ~sunny:
                cold     ~cold
    headache    0.072    0.008
    ~headache   0.144    0.576

  Pr(headache ∧ sunny ∧ cold) =
  Pr(~headache ∧ sunny ∧ ~cold) =
  Pr(headache ∨ sunny) =
  Pr(headache) =   (marginalization)
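For reference, reading values off the tables above (worked answers, not filled in on the slide):

  Pr(headache ∧ sunny ∧ cold) = 0.108
  Pr(~headache ∧ sunny ∧ ~cold) = 0.064
  Pr(headache ∨ sunny) = Pr(headache) + Pr(sunny) − Pr(headache ∧ sunny) = 0.2 + 0.2 − 0.12 = 0.28
  Pr(headache) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2   (marginalizing out sunny and cold)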

SLIDE 6

Conditional Probability

  • Pr(a|b): fraction of worlds in which b is true that also have a true

  [Venn diagram on slide: overlapping regions H and F]

  H = “Have headache”, F = “Have flu”
  Pr(H) = 1/10
  Pr(F) = 1/40
  Pr(H|F) = 1/2

  Headaches are rare and flu is rarer, but if you have the flu, then there is a 50-50 chance you will have a headache.

SLIDE 7

Conditional Probability

  [Venn diagram on slide: overlapping regions H and F]

  H = “Have headache”, F = “Have flu”
  Pr(H) = 1/10, Pr(F) = 1/40, Pr(H|F) = 1/2

  Pr(H|F) = fraction of flu-afflicted worlds in which you have a headache
          = (# worlds with flu and headache) / (# worlds with flu)
          = (area of “H and F” region) / (area of “F” region)
          = Pr(H ∧ F) / Pr(F)

SLIDE 8

Conditional Probability

  • Definition:

    Pr(a|b) = Pr(a ∧ b) / Pr(b)

  • Chain rule:

    Pr(a ∧ b) = Pr(a|b) Pr(b)

  Memorize these!

SLIDE 9

Inference

  [Venn diagram on slide: overlapping regions H and F]

  H = “Have headache”, F = “Have flu”
  Pr(H) = 1/10, Pr(F) = 1/40, Pr(H|F) = 1/2

  One day you wake up with a headache. You think: “Drat! 50% of flus are associated with headaches, so I must have a 50-50 chance of coming down with the flu.”

  Is your reasoning correct?

    Pr(F ∧ H) =
    Pr(F|H) =
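For reference (a worked answer, not filled in on the slide): by the chain rule, Pr(F ∧ H) = Pr(H|F) Pr(F) = 1/2 × 1/40 = 1/80, so Pr(F|H) = Pr(F ∧ H) / Pr(H) = (1/80) / (1/10) = 1/8. The reasoning is not correct: given a headache, the chance of flu is 1/8, not 1/2.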

SLIDE 10

Example: Joint Distribution

  sunny:
                cold     ~cold
    headache    0.108    0.012
    ~headache   0.016    0.064

  ~sunny:
                cold     ~cold
    headache    0.072    0.008
    ~headache   0.144    0.576

  Pr(headache ∧ cold | sunny) =
  Pr(headache ∧ cold | ~sunny) =
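For reference (worked answers, not filled in on the slide): Pr(headache ∧ cold | sunny) = Pr(headache ∧ cold ∧ sunny) / Pr(sunny) = 0.108 / 0.2 = 0.54, and Pr(headache ∧ cold | ~sunny) = 0.072 / 0.8 = 0.09.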

SLIDE 11

Bayes Rule

  • Note:

    Pr(a|b) Pr(b) = Pr(a ∧ b) = Pr(b ∧ a) = Pr(b|a) Pr(a)

  • Bayes Rule

    Pr(b|a) = Pr(a|b) Pr(b) / Pr(a)

Memorize this!

SLIDE 12

Using Bayes Rule for inference

  • Often we want to form a hypothesis about the world based on what we have observed
  • Bayes rule is vitally important when viewed in terms of stating the belief given to hypothesis H, given evidence e:

    Pr(H|e) = Pr(e|H) Pr(H) / Pr(e)

    where Pr(H|e) is the posterior probability, Pr(H) the prior probability, Pr(e|H) the likelihood, and Pr(e) the normalizing constant
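A minimal Python sketch of this computation (an assumed two-hypothesis illustration, not from the slides):

```python
# Minimal sketch: posterior over hypotheses via Bayes rule.
def bayes_posterior(prior, likelihood):
    """prior[h] = Pr(h); likelihood[h] = Pr(e|h). Returns Pr(h|e) for each h."""
    evidence = sum(prior[h] * likelihood[h] for h in prior)  # Pr(e), normalizing constant
    return {h: prior[h] * likelihood[h] / evidence for h in prior}

# Assumed numbers: Pr(h1) = 0.3, Pr(h2) = 0.7, Pr(e|h1) = 0.8, Pr(e|h2) = 0.1.
print(bayes_posterior({"h1": 0.3, "h2": 0.7}, {"h1": 0.8, "h2": 0.1}))
# {'h1': 0.774..., 'h2': 0.225...}
```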

SLIDE 13

Bayesian Learning

  • Prior: Pr(H)
  • Likelihood: Pr(e|H)
  • Evidence: e = <e1, e2, …, en>
  • Bayesian learning amounts to computing the posterior using Bayes’ theorem:

    Pr(H|e) = k Pr(e|H) Pr(H)

    where k = 1/Pr(e) is a normalizing constant

SLIDE 14

Bayesian Prediction

  • Suppose we want to make a prediction about an unknown quantity X:

    Pr(X|e) = Σ_i Pr(X|e, h_i) Pr(h_i|e)
            = Σ_i Pr(X|h_i) Pr(h_i|e)

  • Predictions are weighted averages of the predictions of the individual hypotheses
  • Hypotheses serve as “intermediaries” between raw data and prediction
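For instance (a made-up two-hypothesis illustration, not from the slides): if Pr(h1|e) = 0.8 and Pr(h2|e) = 0.2, and the hypotheses predict Pr(X|h1) = 0.9 and Pr(X|h2) = 0.5, then Pr(X|e) = 0.8 × 0.9 + 0.2 × 0.5 = 0.82.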

SLIDE 15

Candy Example

  • Favorite candy sold in two flavors:

    – Lime (ugh)
    – Cherry (yum)

  • Same wrapper for both flavors
  • Sold in bags with different ratios:

    – 100% cherry
    – 75% cherry + 25% lime
    – 50% cherry + 50% lime
    – 25% cherry + 75% lime
    – 100% lime

SLIDE 16

Candy Example

  • You bought a bag of candy but don’t know its flavor ratio
  • After eating k candies:

    – What’s the flavor ratio of the bag?
    – What will be the flavor of the next candy?

SLIDE 17

Statistical Learning

  • Hypothesis H: probabilistic theory of the world

    – h1: 100% cherry
    – h2: 75% cherry + 25% lime
    – h3: 50% cherry + 50% lime
    – h4: 25% cherry + 75% lime
    – h5: 100% lime

  • Examples E: evidence about the world

    – e1: 1st candy is cherry
    – e2: 2nd candy is lime
    – e3: 3rd candy is lime
    – …

SLIDE 18

Candy Example

  • Assume prior Pr(H) = <0.1, 0.2, 0.4, 0.2, 0.1>
  • Assume candies are i.i.d. (independent and identically distributed):

    Pr(e|h) = Π_i Pr(e_i|h)

  • Suppose the first 10 candies all taste lime (worked out in the sketch below):

    Pr(e|h5) =
    Pr(e|h3) =
    Pr(e|h1) =
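A minimal Python sketch of the full computation (not from the slides; it fills in the blanks above and also covers the posterior and prediction on the next two slides):

```python
# Minimal sketch: Bayesian learning for the candy example.
# Hypotheses h1..h5 have Pr(lime | h_i) = 0, 0.25, 0.5, 0.75, 1.
prior = [0.1, 0.2, 0.4, 0.2, 0.1]      # Pr(h_i), from the slide
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]   # Pr(lime | h_i)
n = 10                                 # first 10 candies all taste lime

# i.i.d. assumption: Pr(e | h_i) = Pr(lime | h_i)^n
likelihood = [p ** n for p in p_lime]  # Pr(e|h1) = 0, Pr(e|h3) = 0.5**10, Pr(e|h5) = 1

# Posterior via Bayes' theorem: Pr(h_i | e) = k Pr(e | h_i) Pr(h_i)
unnorm = [l * pr for l, pr in zip(likelihood, prior)]
k = 1 / sum(unnorm)
posterior = [k * u for u in unnorm]    # mass concentrates on h5 (100% lime)

# Bayesian prediction: Pr(next = lime | e) = sum_i Pr(lime | h_i) Pr(h_i | e)
pred_lime = sum(p * post for p, post in zip(p_lime, posterior))
print(posterior)  # ~[0.0, 1.7e-06, 0.0035, 0.101, 0.896]
print(pred_lime)  # ~0.973
```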

SLIDE 19

Posterior

[Figure on slide: posterior probabilities Pr(h_i|e) plotted as the number of observed lime candies increases]

SLIDE 20

Prediction

Probability that next candy is lime
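Applying the prediction rule from Slide 14 (a worked note, not on the slide): Pr(next candy is lime | e) = Σ_i Pr(lime|h_i) Pr(h_i|e), which after 10 lime candies comes to roughly 0.97 under the prior above (see the code sketch under Slide 18).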

SLIDE 21

Bayesian Learning

  • Bayesian learning properties:

    – Optimal (i.e., given the prior, no other prediction is correct more often than the Bayesian one)
    – No overfitting (all hypotheses are considered and weighted)

  • There is a price to pay:

    – When the hypothesis space is large, Bayesian learning may be intractable
    – i.e., the sum (or integral) over hypotheses is often intractable

  • Solution: approximate Bayesian learning

SLIDE 22

Maximum a posteriori (MAP)

  • Idea: make predictions based on the most probable hypothesis h_MAP (see the sketch below):

    h_MAP = argmax_h Pr(h|e)
    Pr(X|e) ≈ Pr(X|h_MAP)

  • In contrast, Bayesian learning makes predictions based on all hypotheses, weighted by their probability
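Continuing the candy example code from Slide 18 (an assumed illustration, not from the slides):

```python
# MAP prediction for the candy example: use only the most probable hypothesis.
h_map = max(range(len(posterior)), key=lambda i: posterior[i])  # argmax_h Pr(h|e)
pred_lime_map = p_lime[h_map]  # Pr(lime | h_MAP)
print(h_map, pred_lime_map)    # index 4 (h5: 100% lime), prediction 1.0
```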

SLIDE 23

MAP properties

  • MAP prediction is less accurate than Bayesian prediction since it relies on only one hypothesis h_MAP
  • But MAP and Bayesian predictions converge as data increases
  • Controlled overfitting (the prior can be used to penalize complex hypotheses)
  • Finding h_MAP may be intractable:

    – h_MAP = argmax_h Pr(h|e)
    – Optimization may be difficult

SLIDE 24

Maximum Likelihood (ML)

  • Idea: simplify MAP by assuming a uniform prior (i.e., Pr(h_i) = Pr(h_j) ∀ i, j):

    h_MAP = argmax_h Pr(h) Pr(e|h)
    h_ML  = argmax_h Pr(e|h)

  • Make predictions based on h_ML only:

    Pr(X|e) ≈ Pr(X|h_ML)

SLIDE 25

ML properties

  • ML prediction is less accurate than Bayesian and MAP predictions since it ignores prior information and relies on only one hypothesis h_ML
  • But ML, MAP and Bayesian predictions converge as data increases
  • Subject to overfitting (no prior to penalize complex hypotheses that could exploit statistically insignificant data patterns)
  • Finding h_ML is often easier than h_MAP (see the worked note below):

    h_ML = argmax_h Σ_i log Pr(e_i|h)
