Regression: Probabilistic perspective CE-717: Machine Learning - - PowerPoint PPT Presentation



SLIDE 1

Regression: Probabilistic perspective

CE-717: Machine Learning

Sharif University of Technology

M. Soleymani

Fall 2018

SLIDE 2

Curve fitting: probabilistic perspective

• Describing uncertainty over the value of the target variable as a probability distribution

• Example: a Gaussian conditional density p(t | x0, w, σ) centered at the curve value g(x0; w)

SLIDE 3

The learning diagram including noisy target

• The deterministic target function f: X → Y is replaced by a noisy target distribution P(t | x)

• Hypotheses h: X → Y; the learning algorithm picks a final hypothesis g: X → Y with g(x) ≈ f(x)

• Training examples (x^(1), t^(1)), …, (x^(N), t^(N)) are drawn from the joint distribution P(x, t) = P(x) P(t | x) (input distribution × target distribution)

[Y. S. Abu-Mostafa, Learning from Data, 2012]

SLIDE 4

Curve fitting: probabilistic perspective (Example)

• Special case:

Observed output = function + noise

t = g(x; w) + ε

e.g., ε ~ N(0, σ²)

• Noise: whatever we cannot capture with our chosen family of functions
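The noise model t = g(x; w) + ε can be simulated directly. Below is a minimal sketch, with a hypothetical line g(x) = 1 + 2x and σ = 0.5 chosen purely for illustration: at a fixed input, the sample mean of many noisy observations approaches g(x; w), since the zero-mean noise averages out.

```python
import random

# Hypothetical true curve g(x; w) with made-up weights w = (1.0, 2.0).
def g(x, w0=1.0, w1=2.0):
    return w0 + w1 * x

random.seed(0)
sigma = 0.5   # assumed noise standard deviation (illustrative)
x0 = 3.0      # a fixed input

# Draw many noisy observations t = g(x0; w) + eps, eps ~ N(0, sigma^2).
samples = [g(x0) + random.gauss(0.0, sigma) for _ in range(100_000)]
mean_t = sum(samples) / len(samples)

# The sample mean of t at fixed x approaches g(x; w): the noise averages out.
print(mean_t, g(x0))
```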

SLIDE 5

Curve fitting: probabilistic perspective (Example)

• Best regression:

E[t | x] = E[g(x; w) + ε] = g(x; w),  ε ~ N(0, σ²)

• g(x; w) is trying to capture the mean of the observations t given the input x:

  • E[t | x]: conditional expectation of t given x

  • evaluated according to the model (not according to the underlying distribution P)

SLIDE 6

Curve fitting using probabilistic estimation

• Maximum Likelihood (ML) estimation
• Maximum A Posteriori (MAP) estimation
• Bayesian approach

SLIDE 7

Maximum likelihood estimation

• Given observations D = {(x^(i), t^(i))}_{i=1}^{N}

• Find the parameters that maximize the (conditional) likelihood of the outputs:

L(D; θ) = p(t | X, θ) = ∏_{i=1}^{N} p(t^(i) | x^(i), θ)

t = [t^(1), …, t^(N)]^T

X = [[1, x_1^(1), …, x_d^(1)],
     [1, x_1^(2), …, x_d^(2)],
     ⋮
     [1, x_1^(N), …, x_d^(N)]]
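The stacked target vector t and design matrix X (a leading column of ones so that w0 acts as the bias) can be built as follows. The dataset values here are made up just to show the shapes:

```python
import numpy as np

# Toy dataset: N = 3 examples, d = 2 features each (values are illustrative).
X_raw = np.array([[0.5, 1.0],
                  [1.5, 2.0],
                  [2.5, 3.0]])
t = np.array([1.0, 2.0, 3.0])

# Design matrix: prepend a column of ones so w0 acts as the bias term,
# giving row i = [1, x_1^(i), ..., x_d^(i)] as on the slide.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
print(X.shape)  # (N, d + 1)
```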

SLIDE 8

Maximum likelihood estimation (Cont'd)

t = g(x; w) + ε,  ε ~ N(0, σ²)

• t given x is normally distributed with mean g(x; w) and variance σ²:

  • we model the uncertainty in the predictions, not just the mean

p(t | x, w, σ²) = (1 / (√(2π) σ)) exp{ −(t − g(x; w))² / (2σ²) }
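The Gaussian density above can be coded up in a few lines. This sketch (with an illustrative σ = 0.5) checks the qualitative point of the slide: observations on the curve get the highest density, and observations far from it get much less.

```python
import math

def gaussian_pdf(t, mean, sigma):
    """p(t | x, w, sigma^2) with mean = g(x; w), as in the slide's formula."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * \
        math.exp(-(t - mean) ** 2 / (2 * sigma ** 2))

sigma = 0.5
# The density peaks at t = g(x; w): observations near the curve are most probable.
peak = gaussian_pdf(2.0, 2.0, sigma)  # t exactly on the curve
off = gaussian_pdf(4.0, 2.0, sigma)   # t far from the curve
print(peak, off)
```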

SLIDE 9

Maximum likelihood estimation (Cont'd)

• Example: univariate linear function

p(t | x, w, σ²) = (1 / (√(2π) σ)) exp{ −(t − w0 − w1 x)² / (2σ²) }

• Why is this line a bad fit according to the likelihood criterion?

  • p(t | x, w, σ²) for most of the points will be near zero (as they are far from this line)

SLIDE 10

Maximum likelihood estimation (Cont'd)

• Maximize the likelihood of the outputs (i.i.d.):

L(D; w, σ²) = ∏_{i=1}^{N} p(t^(i) | x^(i), w, σ²)

ŵ = argmax_w L(D; w, σ²) = argmax_w ∏_{i=1}^{N} p(t^(i) | x^(i), w, σ²)

SLIDE 11

Maximum likelihood estimation (Cont'd)

• It is often easier (but equivalent) to maximize the log-likelihood:

ŵ = argmax_w ln p(t | X, w, σ²)

ln ∏_{i=1}^{N} p(t^(i) | x^(i), w, σ²) = Σ_{i=1}^{N} ln N(t^(i) | g(x^(i); w), σ²)

= −N ln σ − (N/2) ln 2π − (1/(2σ²)) Σ_{i=1}^{N} (t^(i) − g(x^(i); w))²

                                     └── sum-of-squares error ──┘
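The algebra above can be verified numerically: summing the per-point Gaussian log-densities gives exactly −N ln σ − (N/2) ln 2π − SSE/(2σ²). The line t = 1 + 2x and σ = 0.3 below are illustrative choices:

```python
import math
import random

random.seed(1)
sigma = 0.3
# Hypothetical line t = 1 + 2x with Gaussian noise (illustrative values).
xs = [i / 10 for i in range(20)]
ts = [1 + 2 * x + random.gauss(0, sigma) for x in xs]
w0, w1 = 1.0, 2.0

# Sum of per-point Gaussian log-densities ...
per_point = sum(
    math.log(1 / (math.sqrt(2 * math.pi) * sigma))
    - (t - (w0 + w1 * x)) ** 2 / (2 * sigma ** 2)
    for x, t in zip(xs, ts)
)

# ... equals the closed form: -N ln(sigma) - (N/2) ln(2*pi) - SSE / (2 sigma^2).
N = len(xs)
sse = sum((t - (w0 + w1 * x)) ** 2 for x, t in zip(xs, ts))
closed_form = -N * math.log(sigma) - (N / 2) * math.log(2 * math.pi) \
    - sse / (2 * sigma ** 2)
print(per_point, closed_form)
```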

SLIDE 12

Maximum likelihood estimation (Cont'd)

• Maximizing the log-likelihood (when we assume t = g(x; w) + ε, ε ~ N(0, σ²)) is equivalent to minimizing SSE

• Let ŵ be the maximum likelihood (here least squares) setting of the parameters.

• What is the maximum likelihood estimate of σ²?

∂ log L(D; w, σ²) / ∂σ² = 0  ⇒  σ̂² = (1/N) Σ_{i=1}^{N} (t^(i) − g(x^(i); ŵ))²

(mean squared prediction error)
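A quick sketch of both estimates for a linear model: the ML fit of w is ordinary least squares, and σ̂² is the mean squared residual. The generating line t = 1 + 2x + ε with ε ~ N(0, 0.5²) is an assumed example, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from a hypothetical line t = 1 + 2x + eps, eps ~ N(0, 0.5^2).
x = rng.uniform(0, 5, size=200)
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=200)

# ML fit of w = (w0, w1) is ordinary least squares on the design matrix.
X = np.column_stack([np.ones_like(x), x])
w_hat, *_ = np.linalg.lstsq(X, t, rcond=None)

# ML estimate of sigma^2: mean squared residual (1/N) * sum (t - g(x; w_hat))^2.
residuals = t - X @ w_hat
sigma2_hat = np.mean(residuals ** 2)
print(w_hat, sigma2_hat)
```

With enough data, σ̂² should land near the true noise variance 0.25.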

SLIDE 13

Maximum likelihood estimation (Cont'd)

• Generally, maximizing log-likelihood is equivalent to minimizing empirical loss when the loss is defined as:

Loss(t^(i), g(x^(i); w)) = − ln p(t^(i) | x^(i), w, θ)

• Loss: negative log-probability

• More general distributions for p(t | x) can be considered
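To illustrate the generality, here is a sketch of the negative-log-probability loss under two noise models (constants are dropped since they do not affect the argmax). The Laplace case is an assumed extension of the slide's point, not something the slide derives: Gaussian noise yields squared error, while Laplace noise yields absolute error.

```python
import math

# Negative log-probability as a loss; different noise models give different losses.
def gaussian_nll(t, pred, sigma=1.0):
    # -ln N(t | pred, sigma^2) up to an additive constant: squared error / (2 sigma^2)
    return (t - pred) ** 2 / (2 * sigma ** 2)

def laplace_nll(t, pred, b=1.0):
    # -ln Laplace(t | pred, b) up to an additive constant: absolute error / b
    return abs(t - pred) / b

print(gaussian_nll(3.0, 1.0), laplace_nll(3.0, 1.0))
```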

SLIDE 14

Maximum A Posteriori (MAP) estimation

• MAP:

  • Given observations D

  • Find the parameters that maximize the probability of the parameters after observing the data (posterior probability):

θ_MAP = argmax_θ p(θ | D)

• Since p(θ | D) ∝ p(D | θ) p(θ):

θ_MAP = argmax_θ p(D | θ) p(θ)

SLIDE 15

Maximum A Posteriori (MAP) estimation

• Given observations D = {(x^(i), t^(i))}_{i=1}^{N}

max_w p(w | X, t) ∝ p(t | X, w) p(w)

p(w) = N(0, β² I) = (1 / (√(2π) β))^(d+1) exp{ −w^T w / (2β²) }

SLIDE 16

Maximum A Posteriori (MAP) estimation

• Given observations D = {(x^(i), t^(i))}_{i=1}^{N}

max_w ln [ p(t | X, w, σ²) p(w) ]

≡ min_w (1/σ²) Σ_{i=1}^{N} (t^(i) − g(x^(i); w))² + (1/β²) w^T w

• Equivalent to regularized SSE with λ = σ² / β²
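For a linear model, the MAP objective above has the ridge-regression closed form w = (X^T X + λI)^{−1} X^T t with λ = σ²/β². A minimal sketch (the data-generating line and the variances σ² = 0.25, β² = 10 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from a hypothetical line t = 1 + 2x + noise.
x = rng.uniform(0, 5, size=100)
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=100)
X = np.column_stack([np.ones_like(x), x])

sigma2, beta2 = 0.25, 10.0  # illustrative noise and prior variances
lam = sigma2 / beta2        # lambda = sigma^2 / beta^2, as on the slide

# MAP with a Gaussian prior = ridge regression: w = (X^T X + lambda I)^{-1} X^T t
w_map = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ t)
w_ml = np.linalg.solve(X.T @ X, X.T @ t)

# The prior shrinks the weights toward zero relative to the ML solution.
print(np.linalg.norm(w_map), np.linalg.norm(w_ml))
```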

SLIDE 17

Bayesian approach

• Given observations D = {(x^(i), t^(i))}_{i=1}^{N}

• Rather than committing to a single parameter estimate, average the model's predictions over the posterior distribution of the parameters:

p(t | x, D) = ∫ p(t | w, x) p(w | D) dw

• Example of prior distribution: p(w) = N(0, β² I)

• In this case: p(w | D) = N(m_N, T_N^{−1})

m_N = (1/σ²) T_N^{−1} X^T t

T_N = (1/β²) I + (1/σ²) X^T X

SLIDE 18

Bayesian approach

• Given observations D = {(x^(i), t^(i))}_{i=1}^{N} and a linear model:

p(D | w) = L(D; w, σ²) = ∏_{i=1}^{N} p(t^(i) | x^(i), w, σ²)

p(t^(i) | x^(i), w, σ²) = N(t^(i) | w^T x^(i), σ²)

p(w) = N(0, β² I)

p(w | D) ∝ p(D | w) p(w)

p(t | x, D) = ∫ p(t | w, x) p(w | D) dw = N(m_N^T x, σ_N²(x))   (predictive distribution)

m_N = (1/σ²) T_N^{−1} X^T t

T_N = (1/β²) I + (1/σ²) X^T X

σ_N²(x) = σ² + x^T T_N^{−1} x

SLIDE 19

Predictive distribution: example

• Example: sinusoidal data, 9 Gaussian basis functions [Bishop]

  • Red curve shows the mean of the predictive distribution

  • Pink region spans one standard deviation on either side of the mean

SLIDE 20

Predictive distribution: example

• Functions whose parameters are sampled from p(w | D)

[Bishop]

SLIDE 21

Resource

• C. Bishop, "Pattern Recognition and Machine Learning", Chapter 3.3.