Regression: Probabilistic perspective CE-717: Machine Learning - - PowerPoint PPT Presentation



SLIDE 1

Regression: Probabilistic perspective

CE-717: Machine Learning

Sharif University of Technology

M. Soleymani

Fall 2018

SLIDE 2

Curve fitting: probabilistic perspective

• Describing uncertainty over the value of the target variable as a probability distribution

• Example: a Gaussian conditional density p(t | x0, w, σ) centered at the curve value g(x0; w)

SLIDE 3

The learning diagram including noisy target

• The deterministic target function f: X → Y is replaced by a noisy target distribution P(t | x)

• Hypotheses h: X → Y; the learning algorithm picks a final hypothesis g: X → Y with g(x) ≈ f(x)

• Training examples (x^(1), t^(1)), …, (x^(N), t^(N)) are drawn from the joint distribution P(x, t) = P(x) P(t | x) (input distribution × target distribution)

[Y. S. Abu-Mostafa, Learning from Data, 2012]

SLIDE 4

Curve fitting: probabilistic perspective (Example)

• Special case:

Observed output = function + noise

t = g(x; w) + ε

e.g., ε ~ N(0, σ²)

• Noise: whatever we cannot capture with our chosen family of functions
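The noise model t = g(x; w) + ε can be simulated directly. Below is a minimal sketch, with a hypothetical line g(x) = 1 + 2x and σ = 0.5 chosen purely for illustration: at a fixed input, the sample mean of many noisy observations approaches g(x; w), since the zero-mean noise averages out.

```python
import random

# Hypothetical true curve g(x; w) with made-up weights w = (1.0, 2.0).
def g(x, w0=1.0, w1=2.0):
    return w0 + w1 * x

random.seed(0)
sigma = 0.5   # assumed noise standard deviation (illustrative)
x0 = 3.0      # a fixed input

# Draw many noisy observations t = g(x0; w) + eps, eps ~ N(0, sigma^2).
samples = [g(x0) + random.gauss(0.0, sigma) for _ in range(100_000)]
mean_t = sum(samples) / len(samples)

# The sample mean of t at fixed x approaches g(x; w): the noise averages out.
print(mean_t, g(x0))
```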

SLIDE 5

Curve fitting: probabilistic perspective (Example)

• Best regression:

E[t | x] = E[g(x; w) + ε] = g(x; w),  ε ~ N(0, σ²)

• g(x; w) is trying to capture the mean of the observations t given the input x:

  • E[t | x]: conditional expectation of t given x

  • evaluated according to the model (not according to the underlying distribution P)

SLIDE 6

Curve fitting using probabilistic estimation

• Maximum Likelihood (ML) estimation
• Maximum A Posteriori (MAP) estimation
• Bayesian approach

SLIDE 7

Maximum likelihood estimation

• Given observations D = {(x^(i), t^(i))}_{i=1}^{N}

• Find the parameters that maximize the (conditional) likelihood of the outputs:

L(D; θ) = p(t | X, θ) = ∏_{i=1}^{N} p(t^(i) | x^(i), θ)

t = [t^(1), …, t^(N)]^T

X = [[1, x_1^(1), …, x_d^(1)],
     [1, x_1^(2), …, x_d^(2)],
     ⋮
     [1, x_1^(N), …, x_d^(N)]]
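The stacked target vector t and design matrix X (a leading column of ones so that w0 acts as the bias) can be built as follows. The dataset values here are made up just to show the shapes:

```python
import numpy as np

# Toy dataset: N = 3 examples, d = 2 features each (values are illustrative).
X_raw = np.array([[0.5, 1.0],
                  [1.5, 2.0],
                  [2.5, 3.0]])
t = np.array([1.0, 2.0, 3.0])

# Design matrix: prepend a column of ones so w0 acts as the bias term,
# giving row i = [1, x_1^(i), ..., x_d^(i)] as on the slide.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
print(X.shape)  # (N, d + 1)
```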

SLIDE 8

Maximum likelihood estimation (Cont'd)

t = g(x; w) + ε,  ε ~ N(0, σ²)

• t given x is normally distributed with mean g(x; w) and variance σ²:

  • we model the uncertainty in the predictions, not just the mean

p(t | x, w, σ²) = (1 / (√(2π) σ)) exp{ −(t − g(x; w))² / (2σ²) }
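The Gaussian density above can be coded up in a few lines. This sketch (with an illustrative σ = 0.5) checks the qualitative point of the slide: observations on the curve get the highest density, and observations far from it get much less.

```python
import math

def gaussian_pdf(t, mean, sigma):
    """p(t | x, w, sigma^2) with mean = g(x; w), as in the slide's formula."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * \
        math.exp(-(t - mean) ** 2 / (2 * sigma ** 2))

sigma = 0.5
# The density peaks at t = g(x; w): observations near the curve are most probable.
peak = gaussian_pdf(2.0, 2.0, sigma)  # t exactly on the curve
off = gaussian_pdf(4.0, 2.0, sigma)   # t far from the curve
print(peak, off)
```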

SLIDE 9

Maximum likelihood estimation (Cont'd)

• Example: univariate linear function

p(t | x, w, σ²) = (1 / (√(2π) σ)) exp{ −(t − w0 − w1 x)² / (2σ²) }

• Why is this line a bad fit according to the likelihood criterion?

  • p(t | x, w, σ²) for most of the points will be near zero (as they are far from this line)

SLIDE 10

Maximum likelihood estimation (Cont'd)

• Maximize the likelihood of the outputs (i.i.d.):

L(D; w, σ²) = ∏_{i=1}^{N} p(t^(i) | x^(i), w, σ²)

ŵ = argmax_w L(D; w, σ²) = argmax_w ∏_{i=1}^{N} p(t^(i) | x^(i), w, σ²)

SLIDE 11

Maximum likelihood estimation (Cont'd)

• It is often easier (but equivalent) to maximize the log-likelihood:

ŵ = argmax_w ln p(t | X, w, σ²)

ln ∏_{i=1}^{N} p(t^(i) | x^(i), w, σ²) = Σ_{i=1}^{N} ln N(t^(i) | g(x^(i); w), σ²)

= −N ln σ − (N/2) ln 2π − (1/(2σ²)) Σ_{i=1}^{N} (t^(i) − g(x^(i); w))²

                                     └── sum-of-squares error ──┘
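The algebra above can be verified numerically: summing the per-point Gaussian log-densities gives exactly −N ln σ − (N/2) ln 2π − SSE/(2σ²). The line t = 1 + 2x and σ = 0.3 below are illustrative choices:

```python
import math
import random

random.seed(1)
sigma = 0.3
# Hypothetical line t = 1 + 2x with Gaussian noise (illustrative values).
xs = [i / 10 for i in range(20)]
ts = [1 + 2 * x + random.gauss(0, sigma) for x in xs]
w0, w1 = 1.0, 2.0

# Sum of per-point Gaussian log-densities ...
per_point = sum(
    math.log(1 / (math.sqrt(2 * math.pi) * sigma))
    - (t - (w0 + w1 * x)) ** 2 / (2 * sigma ** 2)
    for x, t in zip(xs, ts)
)

# ... equals the closed form: -N ln(sigma) - (N/2) ln(2*pi) - SSE / (2 sigma^2).
N = len(xs)
sse = sum((t - (w0 + w1 * x)) ** 2 for x, t in zip(xs, ts))
closed_form = -N * math.log(sigma) - (N / 2) * math.log(2 * math.pi) \
    - sse / (2 * sigma ** 2)
print(per_point, closed_form)
```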

SLIDE 12

Maximum likelihood estimation (Cont'd)

• Maximizing the log-likelihood (when we assume t = g(x; w) + ε, ε ~ N(0, σ²)) is equivalent to minimizing SSE

• Let ŵ be the maximum likelihood (here least squares) setting of the parameters.

• What is the maximum likelihood estimate of σ²?

∂ log L(D; w, σ²) / ∂σ² = 0  ⇒  σ̂² = (1/N) Σ_{i=1}^{N} (t^(i) − g(x^(i); ŵ))²

(mean squared prediction error)
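A quick sketch of both estimates for a linear model: the ML fit of w is ordinary least squares, and σ̂² is the mean squared residual. The generating line t = 1 + 2x + ε with ε ~ N(0, 0.5²) is an assumed example, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from a hypothetical line t = 1 + 2x + eps, eps ~ N(0, 0.5^2).
x = rng.uniform(0, 5, size=200)
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=200)

# ML fit of w = (w0, w1) is ordinary least squares on the design matrix.
X = np.column_stack([np.ones_like(x), x])
w_hat, *_ = np.linalg.lstsq(X, t, rcond=None)

# ML estimate of sigma^2: mean squared residual (1/N) * sum (t - g(x; w_hat))^2.
residuals = t - X @ w_hat
sigma2_hat = np.mean(residuals ** 2)
print(w_hat, sigma2_hat)
```

With enough data, σ̂² should land near the true noise variance 0.25.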

SLIDE 13

Maximum likelihood estimation (Cont'd)

• Generally, maximizing log-likelihood is equivalent to minimizing empirical loss when the loss is defined as:

Loss(t^(i), g(x^(i); w)) = − ln p(t^(i) | x^(i), w, θ)

• Loss: negative log-probability

• More general distributions for p(t | x) can be considered
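To illustrate the generality, here is a sketch of the negative-log-probability loss under two noise models (constants are dropped since they do not affect the argmax). The Laplace case is an assumed extension of the slide's point, not something the slide derives: Gaussian noise yields squared error, while Laplace noise yields absolute error.

```python
import math

# Negative log-probability as a loss; different noise models give different losses.
def gaussian_nll(t, pred, sigma=1.0):
    # -ln N(t | pred, sigma^2) up to an additive constant: squared error / (2 sigma^2)
    return (t - pred) ** 2 / (2 * sigma ** 2)

def laplace_nll(t, pred, b=1.0):
    # -ln Laplace(t | pred, b) up to an additive constant: absolute error / b
    return abs(t - pred) / b

print(gaussian_nll(3.0, 1.0), laplace_nll(3.0, 1.0))
```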

SLIDE 14

Maximum A Posteriori (MAP) estimation

• MAP:

  • Given observations D

  • Find the parameters that maximize the probability of the parameters after observing the data (posterior probability):

θ_MAP = argmax_θ p(θ | D)

• Since p(θ | D) ∝ p(D | θ) p(θ):

θ_MAP = argmax_θ p(D | θ) p(θ)

SLIDE 15

Maximum A Posteriori (MAP) estimation

• Given observations D = {(x^(i), t^(i))}_{i=1}^{N}

max_w p(w | X, t) ∝ p(t | X, w) p(w)

p(w) = N(0, β² I) = (1 / (√(2π) β))^(d+1) exp{ −w^T w / (2β²) }

SLIDE 16

Maximum A Posteriori (MAP) estimation

• Given observations D = {(x^(i), t^(i))}_{i=1}^{N}

max_w ln [ p(t | X, w, σ²) p(w) ]

≡ min_w (1/σ²) Σ_{i=1}^{N} (t^(i) − g(x^(i); w))² + (1/β²) w^T w

• Equivalent to regularized SSE with λ = σ² / β²
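For a linear model, the MAP objective above has the ridge-regression closed form w = (X^T X + λI)^{−1} X^T t with λ = σ²/β². A minimal sketch (the data-generating line and the variances σ² = 0.25, β² = 10 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from a hypothetical line t = 1 + 2x + noise.
x = rng.uniform(0, 5, size=100)
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, size=100)
X = np.column_stack([np.ones_like(x), x])

sigma2, beta2 = 0.25, 10.0  # illustrative noise and prior variances
lam = sigma2 / beta2        # lambda = sigma^2 / beta^2, as on the slide

# MAP with a Gaussian prior = ridge regression: w = (X^T X + lambda I)^{-1} X^T t
w_map = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ t)
w_ml = np.linalg.solve(X.T @ X, X.T @ t)

# The prior shrinks the weights toward zero relative to the ML solution.
print(np.linalg.norm(w_map), np.linalg.norm(w_ml))
```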

SLIDE 17

Bayesian approach

• Given observations D = {(x^(i), t^(i))}_{i=1}^{N}

• Rather than committing to a single parameter estimate, average the model's predictions over the posterior distribution of the parameters:

p(t | x, D) = ∫ p(t | w, x) p(w | D) dw

• Example of prior distribution: p(w) = N(0, β² I)

• In this case: p(w | D) = N(m_N, T_N^{−1})

m_N = (1/σ²) T_N^{−1} X^T t

T_N = (1/β²) I + (1/σ²) X^T X

SLIDE 18

Bayesian approach

• Given observations D = {(x^(i), t^(i))}_{i=1}^{N} and a linear model:

p(D | w) = L(D; w, σ²) = ∏_{i=1}^{N} p(t^(i) | x^(i), w, σ²)

p(t^(i) | x^(i), w, σ²) = N(t^(i) | w^T x^(i), σ²)

p(w) = N(0, β² I)

p(w | D) ∝ p(D | w) p(w)

p(t | x, D) = ∫ p(t | w, x) p(w | D) dw = N(m_N^T x, σ_N²(x))   (predictive distribution)

m_N = (1/σ²) T_N^{−1} X^T t

T_N = (1/β²) I + (1/σ²) X^T X

σ_N²(x) = σ² + x^T T_N^{−1} x

SLIDE 19

Predictive distribution: example

• Example: sinusoidal data, 9 Gaussian basis functions [Bishop]

  • Red curve shows the mean of the predictive distribution

  • Pink region spans one standard deviation on either side of the mean

SLIDE 20

Predictive distribution: example

• Functions whose parameters are sampled from p(w | D)

[Bishop]

SLIDE 21

Resource

• C. Bishop, "Pattern Recognition and Machine Learning", Chapter 3.3.