SLIDE 1

Statistical Machine Learning

Lecture 08: Regression

Kristian Kersting, TU Darmstadt

Summer Term 2020

Based on slides from J. Peters

SLIDE 2

Today’s Objectives

Make you understand how to learn a continuous function

Covered Topics

  • Linear regression and its interpretations
  • What is overfitting?
  • Deriving linear regression from maximum likelihood estimation
  • Bayesian linear regression

SLIDE 3

Outline

  • 1. Introduction to Linear Regression
  • 2. Maximum Likelihood Approach to Regression
  • 3. Bayesian Linear Regression
  • 4. Wrap-Up
SLIDE 4

Outline

  • 1. Introduction to Linear Regression
  • 2. Maximum Likelihood Approach to Regression
  • 3. Bayesian Linear Regression
  • 4. Wrap-Up
SLIDE 5

Reminder

Our task is to learn a mapping $f$ from input to output:

$f : \mathcal{I} \to \mathcal{O}, \quad y = f(x; \theta)$

  • Input: $x \in \mathcal{I}$ (images, text, sensor measurements, ...)
  • Output: $y \in \mathcal{O}$
  • Parameters: $\theta \in \Theta$ (what needs to be "learned")

Regression

Learn a mapping into a continuous space

$\mathcal{O} = \mathbb{R}$, $\mathcal{O} = \mathbb{R}^3$, ...

SLIDE 6

Motivation

You want to predict the torques of a robot arm

$y = I\ddot{q} - \mu\dot{q} + mlg\,\sin(q) = \begin{bmatrix} \ddot{q} & \dot{q} & \sin(q) \end{bmatrix} \begin{bmatrix} I & -\mu & mlg \end{bmatrix}^\top = \phi(x)^\top \theta$

Can we do this with a data set $\mathcal{D} = \{ (x_i, y_i) \mid i = 1, \dots, n \}$?

A linear regression problem!
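To make this concrete, here is a minimal NumPy sketch (all constants and data are made up for illustration) that recovers $\theta = (I, -\mu, mlg)^\top$ from simulated noisy torques:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([0.8, -0.1, 2.0])              # made-up (I, -mu, mlg)

q = rng.uniform(-np.pi, np.pi, size=100)             # joint positions
qd = rng.normal(size=100)                            # joint velocities
qdd = rng.normal(size=100)                           # joint accelerations

Phi = np.column_stack([qdd, qd, np.sin(q)])          # rows are phi(x_i)^T
y = Phi @ theta_true + 0.01 * rng.normal(size=100)   # noisy torques

theta_hat = np.linalg.lstsq(Phi, y, rcond=None)[0]   # least-squares fit
print(theta_hat)                                     # approx [0.8, -0.1, 2.0]
```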
SLIDE 7

Least Squares Linear Regression

We are given pairs of training data points and associated function values $(x_i, y_i)$:

$X = \{x_1 \in \mathbb{R}^d, \dots, x_n\}, \quad Y = \{y_1 \in \mathbb{R}, \dots, y_n\}$

Note: here we only treat the case $y_i \in \mathbb{R}$. In general $y_i$ can have more than one dimension, i.e., $y_i \in \mathbb{R}^f$ for some positive $f$

Start with a linear regressor

$x_i^\top w + w_0 = y_i \quad \forall i = 1, \dots, n$

One linear equation for each training data point/label pair. Exactly the same basic setup as for least-squares classification! Only the values are continuous

SLIDE 8

Least Squares Linear Regression

$x_i^\top w + w_0 = y_i \quad \forall i = 1, \dots, n$

Step 1: Define

$\hat{x}_i = \begin{bmatrix} x_i \\ 1 \end{bmatrix}, \quad \hat{w} = \begin{bmatrix} w \\ w_0 \end{bmatrix}$

SLIDE 9

Least Squares Linear Regression

Step 2: Rewrite

$\hat{x}_i^\top \hat{w} = y_i \quad \forall i = 1, \dots, n$

SLIDE 10

Least Squares Linear Regression

Step 3: Matrix-vector notation

$\hat{X}^\top \hat{w} = y$

where $\hat{X} = [\hat{x}_1, \dots, \hat{x}_n]$ (each $\hat{x}_i$ is a column vector) and $y = [y_1, \dots, y_n]^\top$

SLIDE 11

Least Squares Linear Regression

Step 4: Find the least squares solution

$\hat{w} = \arg\min_w \left\| \hat{X}^\top w - y \right\|^2$

$\nabla_w \left\| \hat{X}^\top w - y \right\|^2 = 0 \quad \Rightarrow \quad \hat{w} = \left( \hat{X}\hat{X}^\top \right)^{-1} \hat{X} y$

A closed form solution!
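As a sanity check, a minimal NumPy sketch of this closed-form solution (toy data; following the convention that $\hat{X}$ holds one augmented data point per column):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 2
X = rng.normal(size=(d, n))                          # one data point per column
w_true, w0_true = np.array([1.5, -2.0]), 0.5
y = X.T @ w_true + w0_true + 0.1 * rng.normal(size=n)

X_hat = np.vstack([X, np.ones((1, n))])              # Steps 1-3: append the constant 1
w_hat = np.linalg.solve(X_hat @ X_hat.T, X_hat @ y)  # (X X^T)^{-1} X y via a linear solve
print(w_hat)                                         # approx [1.5, -2.0, 0.5]
```

Solving the linear system is numerically preferable to forming the inverse explicitly.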

SLIDE 12

Least Squares Linear Regression

$\hat{w} = \left( \hat{X}\hat{X}^\top \right)^{-1} \hat{X} y$

Where is the costly part of this computation?

SLIDE 13

Least Squares Linear Regression

The inverse is of an $\mathbb{R}^{D \times D}$ matrix

Naive inversion takes $O(D^3)$, but better methods exist

What can we do if the input dimension $D$ is too large?

SLIDE 14

Least Squares Linear Regression

  • Gradient descent (see the sketch below)
  • Work with fewer dimensions
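A sketch of the gradient-descent alternative (the fixed step size is chosen for illustration only; in practice you would tune it or use a better optimizer). Each iteration costs $O(nD)$ instead of the $O(D^3)$ inversion:

```python
import numpy as np

def lsq_gradient_descent(X_hat, y, lr=0.1, steps=5000):
    """Minimize ||X_hat^T w - y||^2 by gradient descent, avoiding the D x D inverse."""
    D, n = X_hat.shape
    w = np.zeros(D)
    for _ in range(steps):
        grad = 2.0 * X_hat @ (X_hat.T @ w - y) / n   # gradient of the mean squared error
        w -= lr * grad
    return w
```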

SLIDE 15

Mechanical Interpretation

SLIDE 16

Geometric Interpretation

Predicted outputs are linear combinations of features! Samples are projected into this feature space

SLIDE 17

Polynomial Regression

How can we fit arbitrary polynomials using least-squares regression?

We introduce a feature transformation as before:

$y(x) = w^\top \phi(x) = \sum_{i=0}^{M} w_i \phi_i(x)$

  • Assume $\phi_0(x) = 1$
  • The $\phi_i(\cdot)$ are called the basis functions
  • Still a linear model in the parameters $w$

E.g., fitting a cubic polynomial: $\phi(x) = (1, x, x^2, x^3)^\top$
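A small sketch of fitting the cubic via this feature map (toy data; poly_features is a hypothetical helper):

```python
import numpy as np

def poly_features(x, degree):
    """Feature map phi(x) = (1, x, x^2, ..., x^degree) for an array of scalar inputs."""
    return np.vander(x, degree + 1, increasing=True)      # shape (n, degree+1)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=30)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=30)         # toy target function

Phi = poly_features(x, degree=3)                          # rows are phi(x_i)^T
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)               # least squares in feature space
y_new = poly_features(np.array([0.5]), degree=3) @ w      # prediction at a new input
```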
SLIDE 18

Polynomial Regression

Polynomial of degree 0 (constant value)


SLIDE 19

Polynomial Regression

Polynomial of degree 1 (line)


SLIDE 20

Polynomial Regression

Polynomial of degree 3 (cubic)


SLIDE 21

Polynomial Regression

Polynomial of degree 9


Massive overfitting

SLIDE 22

Outline

  • 1. Introduction to Linear Regression
  • 2. Maximum Likelihood Approach to Regression
  • 3. Bayesian Linear Regression
  • 4. Wrap-Up
SLIDE 23

Overfitting

Relatively little data leads to overfitting

Enough data leads to a good estimate

SLIDE 24

Probabilistic Regression

Assumption 1: Our target function values are generated by adding noise to the function estimate

$y = f(x, w) + \epsilon$

$y$ - target function value; $f$ - regression function; $x$ - input value; $w$ - weights or parameters; $\epsilon$ - noise

SLIDE 25

Probabilistic Regression

Assumption 2: The noise is a random variable that is Gaussian distributed:

$\epsilon \sim \mathcal{N}(0, \beta^{-1})$

$p(y \mid x, w, \beta) = \mathcal{N}(y \mid f(x, w), \beta^{-1})$

$f(x, w)$ is the mean; $\beta^{-1}$ is the variance ($\beta$ is the precision)

Note that $y$ is now a random variable with underlying probability distribution $p(y \mid x, w, \beta)$
SLIDE 26

Probabilistic Regression

SLIDE 27

Probabilistic Regression

Given

  • Training input data points $X = [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$
  • Associated function values $Y = [y_1, \dots, y_n]^\top$

Conditional likelihood (assuming the data is i.i.d.):

$p(y \mid X, w, \beta) = \prod_{i=1}^{n} \mathcal{N}(y_i \mid f(x_i, w), \beta^{-1})$

(with linear model) $\; = \prod_{i=1}^{n} \mathcal{N}(y_i \mid w^\top \phi(x_i), \beta^{-1})$

$w^\top \phi(x_i)$ is the generalized linear regression function

Maximize the likelihood w.r.t. (with respect to) $w$ and $\beta$

SLIDE 28

Maximum Likelihood Regression

Simplify using the log-likelihood:

$\log p(y \mid X, w, \beta) = \sum_{i=1}^{n} \log \mathcal{N}(y_i \mid w^\top \phi(x_i), \beta^{-1})$

$= \sum_{i=1}^{n} \left[ \log \frac{\sqrt{\beta}}{\sqrt{2\pi}} - \frac{\beta}{2} \left( y_i - w^\top \phi(x_i) \right)^2 \right]$

$= \frac{n}{2} \log \beta - \frac{n}{2} \log(2\pi) - \frac{\beta}{2} \sum_{i=1}^{n} \left( y_i - w^\top \phi(x_i) \right)^2$
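A direct transcription of this log-likelihood, e.g., for checking a fitted model (hypothetical helper; note it uses the row convention $\Phi \in \mathbb{R}^{n \times (M+1)}$):

```python
import numpy as np

def log_likelihood(w, beta, Phi, y):
    """log p(y | X, w, beta); Phi has one feature vector phi(x_i)^T per row."""
    n = len(y)
    sq_err = np.sum((y - Phi @ w) ** 2)
    return 0.5 * n * np.log(beta) - 0.5 * n * np.log(2 * np.pi) - 0.5 * beta * sq_err
```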

SLIDE 29

Maximum Likelihood Regression

Gradient w.r.t. $w$:

$\nabla_w \log p(y \mid X, w, \beta) = \beta \sum_{i=1}^{n} \left( y_i - w^\top \phi(x_i) \right) \phi(x_i) = 0$

Define

$y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \quad w = \begin{bmatrix} w_0 \\ \vdots \\ w_M \end{bmatrix}, \quad \Phi = \begin{bmatrix} | & & | \\ \phi(x_1) & \cdots & \phi(x_n) \\ | & & | \end{bmatrix}$

SLIDE 30

Maximum Likelihood Regression

$\sum_{i=1}^{n} y_i \phi(x_i) = \left( \sum_{i=1}^{n} \phi(x_i) \phi(x_i)^\top \right) w$

$\Phi y = \Phi\Phi^\top w$

$w_{\text{ML}} = \left( \Phi\Phi^\top \right)^{-1} \Phi y$

The same result as in least squares regression!

SLIDE 31

Maximum Likelihood Regression

We obtain the same w as with least squares regression

Least-squares is equivalent to assuming the targets are Gaussian distributed

Note: The least squares method is not distribution-free!

SLIDE 32

Maximum Likelihood Regression

However, the Maximum Likelihood approach is much more powerful!

We can also estimate $\beta$:

$\beta_{\text{ML}} = \left( \frac{1}{n} \sum_{i=1}^{n} \left( y_i - w_{\text{ML}}^\top \phi(x_i) \right)^2 \right)^{-1}$

We can gauge the uncertainty of our estimate!
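A sketch of both maximum likelihood estimates, using the column convention for $\Phi$ defined earlier:

```python
import numpy as np

def fit_ml(Phi, y):
    """Maximum likelihood estimates; Phi holds one feature vector phi(x_i) per column."""
    w_ml = np.linalg.solve(Phi @ Phi.T, Phi @ y)        # identical to least squares
    beta_ml = 1.0 / np.mean((y - Phi.T @ w_ml) ** 2)    # inverse of the mean squared residual
    return w_ml, beta_ml
```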

SLIDE 33

Loss Functions in Regression

Given a new data point $x_t$, in least squares regression the function value is $y_t = \hat{x}_t^\top \hat{w}$

But in maximum likelihood regression we have a probability distribution over the function value, $p(y \mid x, w, \beta)$

How do we actually estimate a function value $y_t$ for a new data point $x_t$?

SLIDE 34

Loss Functions in Regression

We need a loss function, just as in the classification case:

$L : \mathbb{R} \times \mathbb{R} \to \mathbb{R}^+, \quad (y_t, f(x_t)) \mapsto L(y_t, f(x_t))$

SLIDE 35

Loss Functions in Regression

Minimize the expected loss

$\mathbb{E}_{x,y \sim p(x,y)}[L] = \int\!\!\int L(y, f(x))\, p(x, y)\, dx\, dy$

SLIDE 36

Loss Functions in Regression

Simplest case: squared loss $L(y, f(x)) = (y - f(x))^2$

$\mathbb{E}_{x,y \sim p(x,y)}[L] = \int\!\!\int (y - f(x))^2\, p(x, y)\, dx\, dy$

$\frac{\partial \mathbb{E}[L]}{\partial f(x)} = -2 \int (y - f(x))\, p(x, y)\, dy = 0$

$\Rightarrow \int y\, p(x, y)\, dy = f(x) \int p(x, y)\, dy$
SLIDE 37

Loss Functions in Regression

$\int y\, p(x, y)\, dy = f(x) \int p(x, y)\, dy$

$\int y\, p(x, y)\, dy = f(x)\, p(x)$

$f(x) = \int y\, \frac{p(x, y)}{p(x)}\, dy = \int y\, p(y \mid x)\, dy$

$f(x) = \mathbb{E}_{y \sim p(y \mid x)}[y] = \mathbb{E}[y \mid x]$

Under squared loss, the optimal regression function is the mean $\mathbb{E}[y \mid x]$ of the posterior $p(y \mid x)$

It is also called the mean prediction
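A tiny numerical illustration of this result (made-up samples): over a grid of candidate predictions, the squared loss summed over samples is minimized at the sample mean:

```python
import numpy as np

y_samples = np.array([1.2, 0.8, 1.1, 0.9])              # draws from some p(y | x)
c_grid = np.linspace(0.0, 2.0, 2001)                    # candidate predictions
loss = ((y_samples[None, :] - c_grid[:, None]) ** 2).sum(axis=1)
print(c_grid[loss.argmin()], y_samples.mean())          # both 1.0: the mean minimizes squared loss
```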

SLIDE 38

Loss Functions in Regression

For our generalized linear regression function:

$f(x) = \int y\, \mathcal{N}\!\left( y \mid w^\top \phi(x), \beta^{-1} \right) dy = w^\top \phi(x)$

SLIDE 39

Probabilistic Regression

SLIDE 40

Outline

  • 1. Introduction to Linear Regression
  • 2. Maximum Likelihood Approach to Regression
  • 3. Bayesian Linear Regression
  • 4. Wrap-Up
SLIDE 41

Avoiding Overfitting

Back to our original problem

We wanted to avoid overfitting and instabilities

Maximum likelihood also leads to overfitting (in the extreme case, think of having only a single data point)

What can we use to counter the problem?

SLIDE 42

Bayesian Linear Regression

We place a prior on the parameters $w$ to tame the instabilities:

$p(w \mid X, y) \propto p(y \mid X, w)\, p(w)$

  • Parameter prior: $p(w)$
  • Likelihood of the targets under the data and parameters (as before): $p(y \mid X, w)$
  • Posterior over the parameters: $p(w \mid X, y)$
SLIDE 43

Bayesian Linear Regression

Notice the VERY important difference: in this setting, you no longer get a single value for the parameters, but rather a probability distribution over the parameters

SLIDE 44

Basic Idea: Prior controls the Model Class and hence what Data Sets can be explained

SLIDE 45

Bayesian Regression

Simple idea: Put a Gaussian prior on $w$. It will put a "soft" limit on the coefficients and thus avoid instabilities:

$w \sim p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I)$

We use a zero-mean Gaussian to keep the derivation compact, but you can use another mean

Zero mean and spherical covariance (given by the diagonal covariance matrix)

SLIDE 46

Bayesian Regression

The posterior becomes

$p(w \mid X, y, \alpha, \beta) \propto p(y \mid X, w, \beta)\, p(w \mid \alpha) \propto p(y \mid X, w, \beta)\, \mathcal{N}(w \mid 0, \alpha^{-1} I)$
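For this Gaussian prior and likelihood, the posterior is itself Gaussian with covariance $S = (\alpha I + \beta \Phi\Phi^\top)^{-1}$ and mean $m = \beta S \Phi y$; this standard closed form is not spelled out on the slide, but it is consistent with the MAP and predictive formulas that follow. A minimal sketch:

```python
import numpy as np

def posterior(Phi, y, alpha, beta):
    """Gaussian posterior N(w | m, S); Phi is (D, n) with one phi(x_i) per column."""
    D = Phi.shape[0]
    S = np.linalg.inv(alpha * np.eye(D) + beta * Phi @ Phi.T)   # posterior covariance
    m = beta * S @ Phi @ y                                      # posterior mean
    return m, S
```

Note that the posterior mean $m$ coincides with $w_{\text{MAP}}$ on the next slides.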
SLIDE 47

Maximum A-Posteriori (MAP)

First attempt to solve this problem: estimate $w$ by maximizing the (log) posterior

$\log p(w \mid X, y, \alpha, \beta) = \log p(y \mid X, w, \beta) + \log \mathcal{N}(w \mid 0, \alpha^{-1} I) + \text{const}$

$= \sum_{i=1}^{n} \log \mathcal{N}(y_i \mid w^\top \phi(x_i), \beta^{-1}) + \log \mathcal{N}(w \mid 0, \alpha^{-1} I) + \text{const}$

$= -\frac{\beta}{2} \sum_{i=1}^{n} \left( y_i - w^\top \phi(x_i) \right)^2 - \frac{\alpha}{2} w^\top w + \text{const}$

SLIDE 48

Maximum A-Posteriori (MAP)

$\nabla_w \log p(w \mid X, y, \alpha, \beta) = \beta \sum_{i=1}^{n} \left( y_i - w^\top \phi(x_i) \right) \phi(x_i) - \alpha w = 0$

$\beta \sum_{i=1}^{n} y_i \phi(x_i) = \beta \left( \sum_{i=1}^{n} \phi(x_i) \phi(x_i)^\top \right) w + \alpha w$

$\beta \sum_{i=1}^{n} y_i \phi(x_i) = \left( \beta \sum_{i=1}^{n} \phi(x_i) \phi(x_i)^\top + \alpha I \right) w$

$\beta \Phi y = \left( \beta \Phi\Phi^\top + \alpha I \right) w$

$w_{\text{MAP}} = \left( \Phi\Phi^\top + \frac{\alpha}{\beta} I \right)^{-1} \Phi y$

What is the role of $\alpha/\beta$ in the expression?

SLIDE 49

Maximum A-Posteriori (MAP)

$w_{\text{MAP}} = \left( \Phi\Phi^\top + \frac{\alpha}{\beta} I \right)^{-1} \Phi y$

The prior has the effect that it regularizes the pseudo-inverse. Also called ridge regression.

Intuition for the term "ridge" (although these are not the historical reasons): if there is multicollinearity, we get a "ridge" in the likelihood function. This in turn yields a long "valley" in the RSS. Ridge regression "fixes" the ridge: it adds a penalty that turns the ridge into a nice peak in likelihood space.

SLIDE 50

Maximum A-Posteriori (MAP) vs Regularized Least-squares Linear Regression

There is another way to look at the MAP result. Let us add a regularization term to our objective from least-squares linear regression:

$\hat{w} = \arg\min_w \frac{1}{2} \left\| \hat{X}^\top w - y \right\|^2 + \frac{\lambda}{2} \|w\|^2$

Solving for $w$ we get a new estimate

$\hat{w} = \left( \hat{X}\hat{X}^\top + \lambda I \right)^{-1} \hat{X} y$

where $\lambda = \alpha/\beta$

When you place a regularizer $\lambda$ in least-squares linear regression, you are assuming the targets have Gaussian distributed noise, but also that your parameters are Gaussian distributed
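A minimal sketch of this regularized solution, with $\lambda = \alpha/\beta$ passed in directly (a large $\alpha$, i.e., a tight prior, or a small $\beta$, i.e., noisy targets, both increase the regularization):

```python
import numpy as np

def fit_ridge(X_hat, y, lam):
    """Regularized least squares / MAP: (X X^T + lambda I)^{-1} X y."""
    D = X_hat.shape[0]
    return np.linalg.solve(X_hat @ X_hat.T + lam * np.eye(D), X_hat @ y)
```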

SLIDE 51

Bayesian Regression

Polynomial of degree 9 with prior on $w$

$\lambda = \alpha/\beta$ controls the complexity of the model and determines the degree of overfitting

SLIDE 52

Bayesian Regression

[Bishop]

SLIDE 53

Full Bayesian Regression

We can go further than MAP estimation

Observation: We do not actually need to know $w$; all we want to do is to predict a function value based on the training data

Idea: "Remove" $w$ by marginalizing over it:

$p(y_t \mid x_t, X, y) = \int p(y_t, w \mid x_t, X, y)\, dw$

$y_t$ - predicted value; $x_t$ - test input; $X$ - training data points; $y$ - training function values

SLIDE 54

Full Bayesian Regression

$\underbrace{p(y_t \mid x_t, X, y)}_{\text{predictive distribution}} = \int p(y_t, w \mid x_t, X, y)\, dw$

$= \int p(y_t \mid w, x_t, X, y)\, p(w \mid x_t, X, y)\, dw$

$= \int \underbrace{p(y_t \mid w, x_t)}_{\text{regression model}}\, \underbrace{p(w \mid X, y)}_{\text{posterior distribution}}\, dw$

For Gaussian distributions, this can be done in closed form, leading to so-called Gaussian Processes

SLIDE 55

Full Bayesian Regression

We can also do that in closed form: integrate out all possible parameters

$p(y_* \mid x_*, X, y) = \int \underbrace{p(y_* \mid x_*, \theta)}_{\text{likelihood}}\, \underbrace{p(\theta \mid X, y)}_{\text{parameter posterior}}\, d\theta$

$y_*$ - predicted value; $x_*$ - test input; $X, y$ - training data

SLIDE 56

Full Bayesian Regression

The predictive distribution is again a Gaussian:

$p(y_* \mid x_*, X, y) = \mathcal{N}\!\left( y_* \mid \mu(x_*), \sigma^2(x_*) \right)$

$\mu(x_*) = \phi(x_*)^\top \left( \frac{\alpha}{\beta} I + \Phi\Phi^\top \right)^{-1} \Phi y$

$\sigma^2(x_*) = \frac{1}{\beta} + \phi(x_*)^\top \left( \alpha I + \beta \Phi\Phi^\top \right)^{-1} \phi(x_*)$

The variance is state dependent
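A sketch computing this predictive mean and variance; it uses the rearrangement $\mu(x_*) = \beta\, \phi(x_*)^\top A^{-1} \Phi y$ with $A = \alpha I + \beta \Phi\Phi^\top$, which is the same expression:

```python
import numpy as np

def bayes_predict(Phi, y, phi_star, alpha, beta):
    """Predictive mean and variance; Phi is (D, n), phi_star is the (D,) test feature vector."""
    D = Phi.shape[0]
    A = alpha * np.eye(D) + beta * Phi @ Phi.T          # posterior precision
    mean = beta * phi_star @ np.linalg.solve(A, Phi @ y)
    var = 1.0 / beta + phi_star @ np.linalg.solve(A, phi_star)
    return mean, var
```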

SLIDE 57

Bayesian (Linear) Regression

SLIDE 58

Bayesian (Linear) Regression

SLIDE 59

Bayesian (Linear) Regression

SLIDE 60

Bayesian (Linear) Regression

SLIDE 61

Gaussian Processes - Quick Preview

Essentially, kernelized Bayesian ridge regression is equivalent to Gaussian Processes. We will not cover them now, but here is a quick preview of what they can do

SLIDE 62

Gaussian Processes - Quick Preview

SLIDE 63

Outline

  • 1. Introduction to Linear Regression
  • 2. Maximum Likelihood Approach to Regression
  • 3. Bayesian Linear Regression
  • 4. Wrap-Up
SLIDE 64

Wrap-Up

You now know:

  • How to formulate a linear regression problem
  • The different methods to perform linear regression: least-squares, maximum likelihood, and Bayesian
  • How to derive the equations for the parameters using the different methods
  • Why introducing a prior distribution over the parameters can combat overfitting

SLIDE 65

Self-Test Questions

  • What is regression (in general) and linear regression (in particular)?
  • What is the cost function of regression and how can I interpret it?
  • What is overfitting?
  • How can I derive a Maximum Likelihood estimator for regression?
  • Why are Bayesian methods important?
  • What is MAP and how is it different from full Bayesian regression?

SLIDE 66

Homework

Reading Assignment for next lecture

  • Murphy, ch. 8
  • Bishop, ch. 4
