Parametric Methods
Steven J Zeil
Old Dominion Univ.
Fall 2010

Outline

1. Distributions
2. Estimating Distribution Parameters
   - Maximum Likelihood Estimation
   - Evaluating an Estimator: Bias and Variance
   - Bayes' Estimator
3. Parametric Classification
4. Regression
   - Regression Error
   - Linear Regression
   - Polynomial Regression
   - Tuning Model Complexity
   - Model Selection

Parametric Methods

- Assume that the sample is drawn from a known distribution.
- Advantage: the model can be formed from a small number of parameters (e.g., mean and variance).
- Estimate the parameters from the sample to get an estimated distribution.
- Then use that distribution to make decisions.

Distributions

- Classification discriminant function: g_i(x) = P(C_i) p(x|C_i)
- For classification, we need to estimate the densities p(x|C_i) and the priors P(C_i).
- For regression, we need to estimate p(y|x).
- In this chapter, we use single variables (x = [x]).

Example Distributions

- Bernoulli: x ∈ {0, 1}: P(x) = p_0^x (1 − p_0)^(1−x)
- Multinomial: K > 2 states, x_i ∈ {0, 1}: P(x_1, x_2, ..., x_K) = ∏_i p_i^(x_i)
- Gaussian (normal): p(x) = (1/(√(2π) σ)) exp(−(x − µ)² / (2σ²))
Likelihood

- iid sample X = {x^t}, drawn from p(x|θ).
- How do we find the θ that makes our sample as likely as possible?
- Because the x^t are independent, the likelihood of θ given X is
  l(θ|X) ≡ p(X|θ) = ∏_{t=1}^N p(x^t|θ)

Maximum Likelihood Estimation (MLE)

- Likelihood: l(θ|X) ≡ p(X|θ) = ∏_{t=1}^N p(x^t|θ)
- In MLE, we find the θ that makes X the most likely to be seen, i.e., we search for the θ that maximizes l(θ|X).
- To simplify, we often instead maximize the log likelihood:
  L(θ|X) ≡ log l(θ|X) = ∑_{t=1}^N log p(x^t|θ)
- Maximum likelihood estimator: θ* = argmax_θ L(θ|X)
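As a concrete sketch (not from the slides; the sample values are made up), the following maximizes a Bernoulli log likelihood numerically and checks that the maximizer matches the closed-form estimate ∑_t x^t / N:

```python
import math

# Made-up iid Bernoulli sample: 7 ones in N = 10 draws.
X = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

def L(p0, xs):
    """Log likelihood L(p0|X) = sum_t log p(x^t | p0)."""
    return sum(x * math.log(p0) + (1 - x) * math.log(1 - p0) for x in xs)

# Search for the p0 that maximizes L(p0|X) over a fine grid.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: L(p, X))

print(p_hat, sum(X) / len(X))  # both 0.7: the grid maximum matches sum_t x^t / N
```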

Example MLEs

- Bernoulli: x ∈ {0, 1}, P(x) = p_0^x (1 − p_0)^(1−x)
  L(p_0|X) = log ∏_t p_0^(x^t) (1 − p_0)^(1−x^t)
  MLE: p_0 = (∑_t x^t) / N
- Multinomial: K > 2 states, x_i ∈ {0, 1}, P(x_1, x_2, ..., x_K) = ∏_i p_i^(x_i)
  L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^(x_i^t)
  MLE: p_i = (∑_t x_i^t) / N
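A quick sketch of the multinomial MLE (the one-hot data are made up): the estimate p_i is simply the fraction of draws that landed in state i.

```python
import numpy as np

# Made-up sample: N = 8 draws over K = 3 states, one-hot encoded.
X = np.array([
    [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1],
    [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0],
])

# MLE: p_i = (sum_t x_i^t) / N, the observed frequency of each state.
p = X.sum(axis=0) / len(X)
print(p)  # [0.25 0.5  0.25]
```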

Example MLEs

- Gaussian (normal): p(x) = N(µ, σ²) = (1/(√(2π) σ)) exp(−(x − µ)² / (2σ²))
  MLE for µ: m = (∑_t x^t) / N
  MLE for σ²: s² = (∑_t (x^t − m)²) / N
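The Gaussian MLEs are one line each in code. A sketch with a made-up sample; note the variance estimate divides by N, not N − 1:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up sample

m = x.sum() / len(x)                  # MLE for mu
s2 = ((x - m) ** 2).sum() / len(x)    # MLE for sigma^2; divisor is N, not N - 1
print(m, s2)  # 5.0 4.0
```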

Bias and Variance

- Population X drawn from p(x|θ); an estimator of θ computed on sample X_i is d_i = d(X_i).
- Bias: b_θ(d) = E[d] − θ
- Variance: E[(d − E[d])²]
- Mean square error:
  r(d, θ) = E[(d − θ)²] = (E[d] − θ)² + E[(d − E[d])²] = Bias² + Variance
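The decomposition can be checked empirically. The sketch below (made-up setup) uses the ML variance estimate s² as the estimator d; dividing by N makes it biased, and the empirical MSE splits exactly into bias² plus variance:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 4.0              # true parameter: the variance of N(0, 2^2)
M, N = 20000, 10         # M independent samples, each of size N

# d(X_i) = ML variance estimate on sample X_i (np.var divides by N).
d = np.array([np.var(rng.normal(0.0, 2.0, N)) for _ in range(M)])

bias = d.mean() - theta                   # E[d] - theta, about -theta/N here
variance = ((d - d.mean()) ** 2).mean()   # E[(d - E[d])^2]
mse = ((d - theta) ** 2).mean()           # r(d, theta) = E[(d - theta)^2]

print(bias < 0, np.isclose(mse, bias ** 2 + variance))  # True True
```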

Estimators

- If we have prior knowledge of p(θ), Bayes' rule gives
  p(θ|X) = p(X|θ) p(θ) / p(X) = p(X|θ) p(θ) / ∫ p(X|θ′) p(θ′) dθ′
- Problem: except in special cases, this won't have a nice, closed-form solution. We can fall back on
  - numerical estimation, or
  - simpler "point" estimators.
- If the form is tractable, we can use the Bayes' estimator.

Simpler Estimators

- Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ|X)
- Maximum likelihood (ML): θ_ML = argmax_θ p(X|θ)

Bayes' Estimator

- Bayes: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ
- Example: x^t ~ N(θ, σ_0²) and θ ~ N(µ, σ²)
- Let m be the mean of the sample. By the central limit theorem, the distribution of even a non-normal population's mean is approximately normal, centered on the population mean, with standard deviation equal to the population standard deviation divided by √N.
- θ_ML = m
- θ_MAP = θ_Bayes = E[θ|X] = [(N/σ_0²) / (N/σ_0² + 1/σ²)] m + [(1/σ²) / (N/σ_0² + 1/σ²)] µ
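A small numeric sketch of the posterior mean (all values made up): the Bayes estimate is a weighted average that shrinks the sample mean m toward the prior mean µ.

```python
import numpy as np

sigma0, mu, sigma = 2.0, 0.0, 1.0    # x^t ~ N(theta, sigma0^2), theta ~ N(mu, sigma^2)
x = np.array([2.5, 1.8, 3.1, 2.2])   # made-up sample, N = 4
N, m = len(x), x.mean()

# Weight on the data grows with N and with the data precision 1/sigma0^2.
w = (N / sigma0**2) / (N / sigma0**2 + 1 / sigma**2)
theta_bayes = w * m + (1 - w) * mu
print(theta_bayes)  # 1.2: halfway between mu = 0 and m = 2.4, since w = 0.5
```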

Parametric Classification

  g_i(x) = p(x|C_i) P(C_i)
or, equivalently, we could maximize
  g_i(x) = log p(x|C_i) + log P(C_i)
If the p(x|C_i) are Gaussian,
  p(x|C_i) = (1/(√(2π) σ_i)) exp(−(x − µ_i)² / (2σ_i²))
  g_i(x) = −(1/2) log 2π − log σ_i − (x − µ_i)² / (2σ_i²) + log P(C_i)
We can estimate µ_i, σ_i, and P(C_i) from a sample and drop the first (constant) term.
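A sketch of the resulting classifier with made-up 1-D training data; m_i and s_i are the per-class ML estimates, and the constant term is dropped as described:

```python
import numpy as np

def g(x, m, s, prior):
    """Discriminant g_i(x) = -log s_i - (x - m_i)^2 / (2 s_i^2) + log P(C_i)."""
    return -np.log(s) - (x - m) ** 2 / (2 * s ** 2) + np.log(prior)

# Made-up training samples for two classes with equal priors.
x0 = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
x1 = np.array([5.0, 5.5, 6.0, 6.5, 7.0])
m0, s0 = x0.mean(), x0.std()   # np.std divides by N: the ML estimates
m1, s1 = x1.mean(), x1.std()

x = 2.2
label = 0 if g(x, m0, s0, 0.5) > g(x, m1, s1, 0.5) else 1
print(label)  # 0: x lies near class 0's mean
```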

Parametric Classification (Gaussian from sample)

  g_i(x) = −log s_i − (x − m_i)² / (2 s_i²) + log P̂(C_i)
Example: consider a 2-class problem with equal priors and σ_0 = σ_1. This becomes
  g_i(x) = −(x − m_i)² / 2,
which simply measures the distance of x from the mean.

Equal Variances & Priors

Choose C_i if |x − m_i| = min_k |x − m_k|.

Parametric Classification (Gaussian from sample)

  g_i(x) = −log s_i − (x − m_i)² / (2 s_i²) + log P̂(C_i)
Example 2: consider a 2-class problem with σ_0 < σ_1. There will be two transition points.

Different Variances


Regression

- Model our data {x^t, r^t} as r = f(x) + ε, where f is the unknown function we are seeking and ε is random error.
- Assume ε ~ N(0, σ²).
- Then p(r|x) ~ N(g(x|θ), σ²).

Regression Likelihood

With p(r|x) ~ N(g(x|θ), σ²):
  L(θ|X) = log ∏_{t=1}^N p(x^t, r^t)
         = log ∏_{t=1}^N p(r^t|x^t) + log ∏_{t=1}^N p(x^t)
         ≃ log ∏_{t=1}^N p(r^t|x^t)   (the p(x^t) term does not depend on θ)
         = log ∏_{t=1}^N (1/(√(2π) σ)) exp(−(r^t − g(x^t|θ))² / (2σ²))
Regression Likelihood & Error

  L(θ|X) = log ∏_{t=1}^N (1/(√(2π) σ)) exp(−(r^t − g(x^t|θ))² / (2σ²))
         = −N log(√(2π) σ) − (1/(2σ²)) ∑_{t=1}^N (r^t − g(x^t|θ))²
         ≃ −∑_{t=1}^N (r^t − g(x^t|θ))²
Maximizing L(θ|X) is thus equivalent to minimizing the error
  E(θ|X) = ∑_{t=1}^N (r^t − g(x^t|θ))²

Linear Regression

Minimize
  E(θ|X) = ∑_{t=1}^N (r^t − g(x^t|θ))²
where g(x^t|w_1, w_0) = w_1 x^t + w_0, so
  E(θ|X) = ∑_{t=1}^N (r^t − w_1 x^t − w_0)²
Take the partial derivatives w.r.t. the w_i and set them to zero...

Linear Regression (cont.)

  E(θ|X) = ∑_{t=1}^N (r^t − w_1 x^t − w_0)²
  ∂E/∂w_0 = −2 ∑_{t=1}^N (r^t − w_1 x^t − w_0) = 0  ⇒  ∑_t r^t = N w_0 + w_1 ∑_t x^t
  ∂E/∂w_1 = −2 ∑_{t=1}^N x^t (r^t − w_1 x^t − w_0) = 0  ⇒  ∑_t r^t x^t = w_0 ∑_t x^t + w_1 ∑_t (x^t)²

Solving Linear Regression

The two equations
  ∑_t r^t = N w_0 + w_1 ∑_t x^t
  ∑_t r^t x^t = w_0 ∑_t x^t + w_1 ∑_t (x^t)²
can be written in matrix form as A w = y, with
  A = [ N         ∑_t x^t
        ∑_t x^t   ∑_t (x^t)² ]
  w = [ w_0, w_1 ]^T
  y = [ ∑_t r^t, ∑_t r^t x^t ]^T
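Here is a sketch that builds A and y exactly as above and solves A w = y; the data are made up, generated near r = 2x + 1:

```python
import numpy as np

# Made-up data lying near the line r = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
r = np.array([1.1, 2.9, 5.2, 6.8, 9.0])
N = len(x)

A = np.array([[N,       x.sum()],
              [x.sum(), (x ** 2).sum()]])
y = np.array([r.sum(), (r * x).sum()])

w0, w1 = np.linalg.solve(A, y)     # solves A w = y for w = [w0, w1]
print(round(w0, 2), round(w1, 2))  # 1.06 1.97, close to intercept 1 and slope 2
```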

Polynomial Regression

  g(x^t|w) = ∑_{i=0}^k w_i (x^t)^i
Collect the inputs into a design matrix D and the outputs into r:
  D = [ 1  x^1  (x^1)²  ...  (x^1)^k
        1  x^2  (x^2)²  ...  (x^2)^k
        ...
        1  x^N  (x^N)²  ...  (x^N)^k ]
  r = [ r^1, r^2, ..., r^N ]^T
Then (D^T D) w = D^T r, so w = (D^T D)^{-1} D^T r.
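A sketch with made-up, noise-free quadratic data: np.vander builds D row-wise as above, and a least-squares solve plays the role of (D^T D)^{-1} D^T r.

```python
import numpy as np

# Made-up noise-free data from r = 1 + 0.5 x + 2 x^2 (so k = 2).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
r = 1.0 + 0.5 * x + 2.0 * x ** 2

k = 2
D = np.vander(x, k + 1, increasing=True)    # rows are [1, x^t, (x^t)^2]
w, *_ = np.linalg.lstsq(D, r, rcond=None)   # least-squares solution of D w ~= r
print(w)  # recovers the generating coefficients [1.0, 0.5, 2.0]
```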

Noise and Error

  E[(r − g(x))² | x] = E[(r − E[r|x])² | x] + (E[r|x] − g(x))²
- The first term on the right is the noise variance; it does not depend on g(·) or X.
- The second term is the squared error of g(·) compared to the (unknown) f(·).

Breaking Down the Error

  E_X[(E[r|x] − g(x))²] = (E[r|x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²]
- The first term on the right is the (squared) bias; the second is the variance.

Estimating Bias and Variance

(If we knew the "true" f:) take M samples X_i and fit g_i to each. Then
  Bias²(g) = (1/N) ∑_t (ḡ(x^t) − f(x^t))²
  Variance(g) = (1/(NM)) ∑_t ∑_i (g_i(x^t) − ḡ(x^t))²
Consider different orders of polynomial fits to data generated by adding random error to a low-order polynomial.
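These estimates can be simulated when f is known. The sketch below (f, noise level, and sample sizes are made up) fits order-1 polynomials to M noisy samples of a quadratic; the underfit model's error is dominated by bias:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: 2 * x ** 2 - x        # the known "true" low-order polynomial
x = np.linspace(-1, 1, 25)
M, k = 200, 1                       # M samples, order-k fits (k = 1 underfits)

g = np.array([
    np.polyval(np.polyfit(x, f(x) + rng.normal(0, 0.3, x.size), k), x)
    for _ in range(M)
])                                  # g[i] is the fit g_i evaluated at the x^t
g_bar = g.mean(axis=0)              # the average fit

bias2 = ((g_bar - f(x)) ** 2).mean()     # (1/N) sum_t (gbar(x^t) - f(x^t))^2
variance = ((g - g_bar) ** 2).mean()     # (1/(NM)) sum_t sum_i (g_i - gbar)^2
print(bias2 > variance)  # True: a line fit to a quadratic is bias-dominated
```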

Bias/Variance Dilemma

- As we increase complexity, bias decreases (a better fit to the data) and variance increases (the fit varies more with the data).
- High bias & low variance suggests underfitting; low bias & high variance suggests overfitting.

Sample Fits

f is shown in green, the average fit ḡ in red.

Best Fit

Look at the "elbow" for the optimal complexity.

Model Selection

In practice, we don't know f, so we can't compute the bias and variance directly. Various techniques allow us to estimate the total error, permitting selection of a model complexity.

Cross-Validation

1. Reserve a portion of the training data.
2. Estimate the model parameters for varying model complexities.
3. Compute the error on the reserved portion of the training data.
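A sketch of the three steps with made-up cubic data: every second point is reserved, candidate polynomial orders are fit on the rest, and the reserved-set error picks the complexity:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: x ** 3 - x
x = np.linspace(-2, 2, 60)
r = f(x) + rng.normal(0, 0.5, x.size)

# 1. Reserve a portion of the training data (every second point here).
train, val = np.arange(0, 60, 2), np.arange(1, 60, 2)

# 2.-3. Fit each complexity on the training part, score on the reserved part.
def val_error(k):
    w = np.polyfit(x[train], r[train], k)
    return ((np.polyval(w, x[val]) - r[val]) ** 2).mean()

errors = {k: val_error(k) for k in range(1, 9)}
best = min(errors, key=errors.get)
print(best)  # an order near the generating cubic beats the underfit line
```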

Cross Validation Example

Regularization

- Turn the complexity into a model parameter c, and change the error function to penalize complexity: E′ = E(X|θ) + λc.
- If λ is too large, we risk bias; if it is too small, we risk variance.
- Optimize λ via cross-validation.
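One direct reading of E′ = E(X|θ) + λc, sketched with made-up data: here the complexity c is taken to be the polynomial order, so the penalty stops the fit from growing past the order the data supports.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 40)
r = 1 + 2 * x ** 2 + rng.normal(0, 0.1, x.size)   # made-up quadratic data

lam = 0.5   # too large risks bias; too small risks variance

def penalized_error(k):
    """E' = E(X|theta) + lambda * c, with complexity c = polynomial order k."""
    w = np.polyfit(x, r, k)
    sse = ((np.polyval(w, x) - r) ** 2).sum()
    return sse + lam * k

best = min(range(1, 8), key=penalized_error)
print(best)  # 2: the penalty stops complexity growth at the generating order
```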

Bayesian Model Selection

- Put a prior probability on models, p(model):
  p(model|X) = p(X|model) p(model) / p(X)
- If p(model) favors simpler models, this is equivalent to regularization.
- We can choose the best model (MAP) or average over models with high posterior probabilities.