

SLIDE 1

CS480/680 Lecture 4: May 15, 2019

Statistical Learning ([RN] Sec. 20.1, 20.2; [M] Sec. 2.2, 3.2)

CS480/680 Spring 2019, Pascal Poupart, University of Waterloo

SLIDE 2

Statistical Learning

  • View: we have uncertain knowledge of the world
  • Idea: learning simply reduces this uncertainty

SLIDE 3

Terminology

  • Probability distribution:

    – A specification of a probability for each event in our sample space
    – Probabilities must sum to 1

  • Assume the world is described by two (or more) random variables:

    – Joint probability distribution: a specification of probabilities for all combinations of events

SLIDE 4

Joint distribution

  • Given two random variables X and Y:
  • Joint distribution:

    Pr(X = x ∧ Y = y) for all x, y

  • Marginalisation (sum-out rule):

    Pr(X = x) = Σ_y Pr(X = x ∧ Y = y)
    Pr(Y = y) = Σ_x Pr(X = x ∧ Y = y)
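To make the sum-out rule concrete, here is a minimal Python sketch (not from the slides; the joint-distribution values are assumed for illustration):

```python
# Minimal sketch: marginalizing a discrete joint distribution stored as a
# dict mapping (x, y) outcome pairs to probabilities (values are assumed).
joint = {
    ("x1", "y1"): 0.3, ("x1", "y2"): 0.2,
    ("x2", "y1"): 0.1, ("x2", "y2"): 0.4,
}

def marginal_x(joint, x):
    """Pr(X = x) = sum over y of Pr(X = x and Y = y)."""
    return sum(p for (xv, _), p in joint.items() if xv == x)

def marginal_y(joint, y):
    """Pr(Y = y) = sum over x of Pr(X = x and Y = y)."""
    return sum(p for (_, yv), p in joint.items() if yv == y)

print(marginal_x(joint, "x1"))  # 0.5
print(marginal_y(joint, "y2"))  # 0.6
```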

SLIDE 5

Example: Joint Distribution

  sunny:
                cold     ~cold
    headache    0.108    0.012
    ~headache   0.016    0.064

  ~sunny:
                cold     ~cold
    headache    0.072    0.008
    ~headache   0.144    0.576

  Pr(headache ∧ sunny ∧ cold) =
  Pr(~headache ∧ sunny ∧ ~cold) =
  Pr(headache ∨ sunny) =
  Pr(headache) =   (marginalization)
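For reference, reading values off the tables above (worked answers, not filled in on the slide):

  Pr(headache ∧ sunny ∧ cold) = 0.108
  Pr(~headache ∧ sunny ∧ ~cold) = 0.064
  Pr(headache ∨ sunny) = Pr(headache) + Pr(sunny) − Pr(headache ∧ sunny) = 0.2 + 0.2 − 0.12 = 0.28
  Pr(headache) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2   (marginalizing out sunny and cold)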

SLIDE 6

Conditional Probability

  • Pr(a|b): fraction of worlds in which b is true that also have a true

  [Venn diagram on slide: overlapping regions H and F]

  H = “Have headache”, F = “Have flu”
  Pr(H) = 1/10
  Pr(F) = 1/40
  Pr(H|F) = 1/2

  Headaches are rare and flu is rarer, but if you have the flu, then there is a 50-50 chance you will have a headache.

SLIDE 7

Conditional Probability

  [Venn diagram on slide: overlapping regions H and F]

  H = “Have headache”, F = “Have flu”
  Pr(H) = 1/10, Pr(F) = 1/40, Pr(H|F) = 1/2

  Pr(H|F) = fraction of flu-afflicted worlds in which you have a headache
          = (# worlds with flu and headache) / (# worlds with flu)
          = (area of “H and F” region) / (area of “F” region)
          = Pr(H ∧ F) / Pr(F)

SLIDE 8

Conditional Probability

  • Definition:

    Pr(a|b) = Pr(a ∧ b) / Pr(b)

  • Chain rule:

    Pr(a ∧ b) = Pr(a|b) Pr(b)

  Memorize these!

SLIDE 9

Inference

  [Venn diagram on slide: overlapping regions H and F]

  H = “Have headache”, F = “Have flu”
  Pr(H) = 1/10, Pr(F) = 1/40, Pr(H|F) = 1/2

  One day you wake up with a headache. You think: “Drat! 50% of flus are associated with headaches, so I must have a 50-50 chance of coming down with the flu.”

  Is your reasoning correct?

    Pr(F ∧ H) =
    Pr(F|H) =
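For reference (a worked answer, not filled in on the slide): by the chain rule, Pr(F ∧ H) = Pr(H|F) Pr(F) = 1/2 × 1/40 = 1/80, so Pr(F|H) = Pr(F ∧ H) / Pr(H) = (1/80) / (1/10) = 1/8. The reasoning is not correct: given a headache, the chance of flu is 1/8, not 1/2.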

SLIDE 10

Example: Joint Distribution

  sunny:
                cold     ~cold
    headache    0.108    0.012
    ~headache   0.016    0.064

  ~sunny:
                cold     ~cold
    headache    0.072    0.008
    ~headache   0.144    0.576

  Pr(headache ∧ cold | sunny) =
  Pr(headache ∧ cold | ~sunny) =
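For reference (worked answers, not filled in on the slide): Pr(headache ∧ cold | sunny) = Pr(headache ∧ cold ∧ sunny) / Pr(sunny) = 0.108 / 0.2 = 0.54, and Pr(headache ∧ cold | ~sunny) = 0.072 / 0.8 = 0.09.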

SLIDE 11

Bayes Rule

  • Note:

    Pr(a|b) Pr(b) = Pr(a ∧ b) = Pr(b ∧ a) = Pr(b|a) Pr(a)

  • Bayes Rule

    Pr(b|a) = Pr(a|b) Pr(b) / Pr(a)

Memorize this!

SLIDE 12

Using Bayes Rule for inference

  • Often we want to form a hypothesis about the world based on what we have observed
  • Bayes rule is vitally important when viewed in terms of stating the belief given to hypothesis H, given evidence e:

    Pr(H|e) = Pr(e|H) Pr(H) / Pr(e)

    where Pr(H|e) is the posterior probability, Pr(H) the prior probability, Pr(e|H) the likelihood, and Pr(e) the normalizing constant
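A minimal Python sketch of this computation (an assumed two-hypothesis illustration, not from the slides):

```python
# Minimal sketch: posterior over hypotheses via Bayes rule.
def bayes_posterior(prior, likelihood):
    """prior[h] = Pr(h); likelihood[h] = Pr(e|h). Returns Pr(h|e) for each h."""
    evidence = sum(prior[h] * likelihood[h] for h in prior)  # Pr(e), normalizing constant
    return {h: prior[h] * likelihood[h] / evidence for h in prior}

# Assumed numbers: Pr(h1) = 0.3, Pr(h2) = 0.7, Pr(e|h1) = 0.8, Pr(e|h2) = 0.1.
print(bayes_posterior({"h1": 0.3, "h2": 0.7}, {"h1": 0.8, "h2": 0.1}))
# {'h1': 0.774..., 'h2': 0.225...}
```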

SLIDE 13

Bayesian Learning

  • Prior: Pr(H)
  • Likelihood: Pr(e|H)
  • Evidence: e = <e1, e2, …, en>
  • Bayesian learning amounts to computing the posterior using Bayes’ theorem:

    Pr(H|e) = k Pr(e|H) Pr(H)

    where k = 1/Pr(e) is a normalizing constant

SLIDE 14

Bayesian Prediction

  • Suppose we want to make a prediction about an unknown quantity X:

    Pr(X|e) = Σ_i Pr(X|e, h_i) Pr(h_i|e)
            = Σ_i Pr(X|h_i) Pr(h_i|e)

  • Predictions are weighted averages of the predictions of the individual hypotheses
  • Hypotheses serve as “intermediaries” between raw data and prediction
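For instance (a made-up two-hypothesis illustration, not from the slides): if Pr(h1|e) = 0.8 and Pr(h2|e) = 0.2, and the hypotheses predict Pr(X|h1) = 0.9 and Pr(X|h2) = 0.5, then Pr(X|e) = 0.8 × 0.9 + 0.2 × 0.5 = 0.82.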

SLIDE 15

Candy Example

  • Favorite candy sold in two flavors:

    – Lime (ugh)
    – Cherry (yum)

  • Same wrapper for both flavors
  • Sold in bags with different ratios:

    – 100% cherry
    – 75% cherry + 25% lime
    – 50% cherry + 50% lime
    – 25% cherry + 75% lime
    – 100% lime

SLIDE 16

Candy Example

  • You bought a bag of candy but don’t know its flavor ratio
  • After eating k candies:

    – What’s the flavor ratio of the bag?
    – What will be the flavor of the next candy?

SLIDE 17

Statistical Learning

  • Hypothesis H: probabilistic theory of the world

    – h1: 100% cherry
    – h2: 75% cherry + 25% lime
    – h3: 50% cherry + 50% lime
    – h4: 25% cherry + 75% lime
    – h5: 100% lime

  • Examples E: evidence about the world

    – e1: 1st candy is cherry
    – e2: 2nd candy is lime
    – e3: 3rd candy is lime
    – …

SLIDE 18

Candy Example

  • Assume prior Pr(H) = <0.1, 0.2, 0.4, 0.2, 0.1>
  • Assume candies are i.i.d. (independent and identically distributed):

    Pr(e|h) = Π_i Pr(e_i|h)

  • Suppose the first 10 candies all taste lime (worked out in the sketch below):

    Pr(e|h5) =
    Pr(e|h3) =
    Pr(e|h1) =
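A minimal Python sketch of the full computation (not from the slides; it fills in the blanks above and also covers the posterior and prediction on the next two slides):

```python
# Minimal sketch: Bayesian learning for the candy example.
# Hypotheses h1..h5 have Pr(lime | h_i) = 0, 0.25, 0.5, 0.75, 1.
prior = [0.1, 0.2, 0.4, 0.2, 0.1]      # Pr(h_i), from the slide
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]   # Pr(lime | h_i)
n = 10                                 # first 10 candies all taste lime

# i.i.d. assumption: Pr(e | h_i) = Pr(lime | h_i)^n
likelihood = [p ** n for p in p_lime]  # Pr(e|h1) = 0, Pr(e|h3) = 0.5**10, Pr(e|h5) = 1

# Posterior via Bayes' theorem: Pr(h_i | e) = k Pr(e | h_i) Pr(h_i)
unnorm = [l * pr for l, pr in zip(likelihood, prior)]
k = 1 / sum(unnorm)
posterior = [k * u for u in unnorm]    # mass concentrates on h5 (100% lime)

# Bayesian prediction: Pr(next = lime | e) = sum_i Pr(lime | h_i) Pr(h_i | e)
pred_lime = sum(p * post for p, post in zip(p_lime, posterior))
print(posterior)  # ~[0.0, 1.7e-06, 0.0035, 0.101, 0.896]
print(pred_lime)  # ~0.973
```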

SLIDE 19

Posterior

[Figure on slide: posterior probabilities Pr(h_i|e) plotted as the number of observed lime candies increases]

SLIDE 20

Prediction

Probability that next candy is lime
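Applying the prediction rule from Slide 14 (a worked note, not on the slide): Pr(next candy is lime | e) = Σ_i Pr(lime|h_i) Pr(h_i|e), which after 10 lime candies comes to roughly 0.97 under the prior above (see the code sketch under Slide 18).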

SLIDE 21

Bayesian Learning

  • Bayesian learning properties:

    – Optimal (i.e., given the prior, no other prediction is correct more often than the Bayesian one)
    – No overfitting (all hypotheses are considered and weighted)

  • There is a price to pay:

    – When the hypothesis space is large, Bayesian learning may be intractable
    – i.e., the sum (or integral) over hypotheses is often intractable

  • Solution: approximate Bayesian learning

SLIDE 22

Maximum a posteriori (MAP)

  • Idea: make predictions based on the most probable hypothesis h_MAP (see the sketch below):

    h_MAP = argmax_h Pr(h|e)
    Pr(X|e) ≈ Pr(X|h_MAP)

  • In contrast, Bayesian learning makes predictions based on all hypotheses, weighted by their probability
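Continuing the candy example code from Slide 18 (an assumed illustration, not from the slides):

```python
# MAP prediction for the candy example: use only the most probable hypothesis.
h_map = max(range(len(posterior)), key=lambda i: posterior[i])  # argmax_h Pr(h|e)
pred_lime_map = p_lime[h_map]  # Pr(lime | h_MAP)
print(h_map, pred_lime_map)    # index 4 (h5: 100% lime), prediction 1.0
```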

SLIDE 23

MAP properties

  • MAP prediction is less accurate than Bayesian prediction since it relies on only one hypothesis h_MAP
  • But MAP and Bayesian predictions converge as data increases
  • Controlled overfitting (the prior can be used to penalize complex hypotheses)
  • Finding h_MAP may be intractable:

    – h_MAP = argmax_h Pr(h|e)
    – Optimization may be difficult

SLIDE 24

Maximum Likelihood (ML)

  • Idea: simplify MAP by assuming a uniform prior (i.e., Pr(h_i) = Pr(h_j) ∀ i, j):

    h_MAP = argmax_h Pr(h) Pr(e|h)
    h_ML  = argmax_h Pr(e|h)

  • Make predictions based on h_ML only:

    Pr(X|e) ≈ Pr(X|h_ML)

SLIDE 25

ML properties

  • ML prediction is less accurate than Bayesian and MAP predictions since it ignores prior information and relies on only one hypothesis h_ML
  • But ML, MAP and Bayesian predictions converge as data increases
  • Subject to overfitting (no prior to penalize complex hypotheses that could exploit statistically insignificant data patterns)
  • Finding h_ML is often easier than h_MAP (see the worked note below):

    h_ML = argmax_h Σ_i log Pr(e_i|h)
