Linear Regression
Yijun Zhao
Northeastern University
Fall 2016
Regression Examples
Any attributes $x$ ⇒ continuous value $y$:
{age, major, gender, race} ⇒ GPA
{income, credit score, profession} ⇒ loan
{college, major, GPA} ⇒ future income
...
Regression Examples
Data is often given in, or can be converted into, matrix form:

Age  Gender  Race  Major     GPA
20   0       A     Art       3.85
22   0       C     Engineer  3.90
25   1       A     Engineer  3.50
24   0       AA    Art       3.60
19   1       H     Art       3.70
18   1       C     Engineer  3.00
30   0       AA    Engineer  3.80
25   0       C     Engineer  3.95
28   1       A     Art       4.00
26   0       C     Engineer  3.20
Formal Problem Setup
Given $N$ observations $\{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$, a regression problem tries to uncover the function
$y_i = f(x_i), \quad \forall\, i = 1, 2, \dots, N$
such that for a new input value $x^*$, we can accurately predict the corresponding value $y^* = f(x^*)$.
Linear Regression
Assume the function $f$ is a linear combination of the components of $x$.
Formally, let $x = (1, x_1, x_2, \dots, x_d)^T$; then
$y = \omega_0 + \omega_1 x_1 + \omega_2 x_2 + \cdots + \omega_d x_d = w^T x$
where $w = (\omega_0, \omega_1, \omega_2, \dots, \omega_d)^T$.
$w$ is the parameter to estimate!
Prediction: $y^* = w^T x^*$
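The slides contain no code, but a minimal NumPy sketch of the prediction step may make the notation concrete; the weight values and the input below are made up for illustration, and the leading 1 in the input plays the role of $x_0$ so that $\omega_0$ acts as the intercept.

```python
import numpy as np

# Hypothetical weights, assumed already estimated (intercept first): w = (w0, w1, w2)
w = np.array([0.5, 2.0, -1.0])

# A new input x* with the constant feature x0 = 1 prepended
x_star = np.array([1.0, 3.0, 4.0])

# Prediction y* = w^T x*
y_star = w @ x_star
print(y_star)   # 0.5 + 2.0*3.0 + (-1.0)*4.0 = 2.5
```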
Visual Illustration
[Figure: 1D and 2D linear regression]
Error Measure
Mean Squared Error (MSE):
$E(w) = \frac{1}{N}\sum_{n=1}^{N}(w^T x_n - y_n)^2 = \frac{1}{N}\,\|Xw - y\|^2$
where
$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$
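A small NumPy sketch of this error measure; the toy matrix, targets, and weight vector below are illustrative assumptions, not data from the slides.

```python
import numpy as np

def mse(X, y, w):
    """E(w) = (1/N) * ||Xw - y||^2."""
    residual = X @ w - y
    return residual @ residual / len(y)

# Toy data; the first column of X is the constant feature x0 = 1.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(mse(X, y, np.array([1.0, 2.0])))   # 0.0: this w fits the toy data exactly
```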
Minimizing Error Measure
$E(w) = \frac{1}{N}\,\|Xw - y\|^2$
$\nabla E(w) = \frac{2}{N}\, X^T (Xw - y) = 0$
$X^T X w = X^T y$
$w = X^\dagger y$
where $X^\dagger = (X^T X)^{-1} X^T$ is the 'pseudo-inverse' of $X$
LR Algorithm Summary
Ordinary Least Squares (OLS) Algorithm
1. Construct the matrix $X$ and the vector $y$ from the dataset $\{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}$ (each $x$ includes $x_0 = 1$) as follows:
$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}$
2. Compute $X^\dagger = (X^T X)^{-1} X^T$
3. Return $w = X^\dagger y$
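A sketch of the OLS algorithm in NumPy, assuming `np.linalg.pinv` is an acceptable stand-in for the explicit $(X^T X)^{-1} X^T$ computation; the toy data and the helper name `ols_fit` are illustrative.

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares: w = pinv(X) @ y, i.e. (X^T X)^(-1) X^T y.

    np.linalg.pinv is used instead of an explicit inverse so the sketch
    also behaves sensibly when X^T X is singular or badly conditioned.
    """
    return np.linalg.pinv(X) @ y

# Toy example: y ≈ 1 + 2x with a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
X = np.column_stack([np.ones_like(x), x])   # prepend the constant feature x0 = 1
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=50)

w = ols_fit(X, y)
print(w)   # roughly [1.0, 2.0]
```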
Gradient Descent
Why? Minimize our target function $E(w)$ by moving downhill in the direction of steepest descent.
Gradient Descent
Gradient Descent Algorithm
1. Initialize the weights $w(0)$ at time $t = 0$
2. for $t = 0, 1, 2, \dots$ do
3.   Compute the gradient $g_t = \nabla E(w(t))$
4.   Set the direction to move: $v_t = -g_t$
5.   Update $w(t+1) = w(t) + \eta\, v_t$
6.   Iterate until it is time to stop
7. Return the final weights $w$
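A rough NumPy translation of the algorithm above; the learning rate, iteration count, zero initialization, toy data, and the function name `gradient_descent` are illustrative choices rather than part of the slides.

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, n_iters=1000):
    """Minimize E(w) = (1/N)||Xw - y||^2 by batch gradient descent.

    eta and n_iters are illustrative defaults; 'time to stop' from the
    slide is simplified here to a fixed number of iterations.
    """
    N, d = X.shape
    w = np.zeros(d)                          # w(0)
    for _ in range(n_iters):
        g = (2.0 / N) * (X.T @ (X @ w - y))  # g_t = grad E(w(t))
        v = -g                               # steepest-descent direction
        w = w + eta * v                      # w(t+1) = w(t) + eta * v_t
    return w

# Toy usage: y ≈ 1 + 2x; the result should be close to the OLS solution.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=50)
print(gradient_descent(X, y))               # roughly [1.0, 2.0]
```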
Gradient Descent
How does $\eta$ affect the algorithm?
Use $\eta = 0.1$ (practical observation)
Use a variable step size: $\eta_t = \eta\,\|\nabla E\|$
OLS or Gradient Descent?
Computational Complexity
OLS: requires forming and inverting the $D \times D$ matrix $X^T X$ (roughly $O(ND^2 + D^3)$).
Gradient Descent: each iteration costs only $O(ND)$.
OLS is expensive when $D$ is large!
Linear Regression
What is the Probabilistic Interpretation?
Normal Distribution
[Figure: right-skewed, left-skewed, and random distributions contrasted with the normal distribution]
Normal Distribution
mean = median = mode
symmetry about the center
$x \sim N(\mu, \sigma^2) \;\Rightarrow\; f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Central Limit Theorem
All things bell shaped!
Random occurrences over a large population tend to wash out the asymmetry and uniformity of individual events. A more 'natural' distribution ensues. The name for it is the Normal distribution (the bell curve).
Formal definition: If $(y_1, \dots, y_n)$ are i.i.d. and $0 < \sigma_y^2 < \infty$, then when $n$ is large the distribution of $\bar{y}$ is well approximated by a normal distribution $N\!\left(\mu_y, \frac{\sigma_y^2}{n}\right)$.
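A quick simulation (using NumPy, which the slides do not mention) that illustrates the statement: means of $n$ i.i.d. uniform draws have mean close to $\mu_y$ and variance close to $\sigma_y^2/n$, even though a single uniform draw is far from bell shaped.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                                               # sample size per experiment
means = rng.uniform(0, 1, size=(10_000, n)).mean(axis=1)

# Theory for Uniform(0,1): mu_y = 0.5, sigma_y^2 = 1/12, so Var(mean) = 1/(12*n).
print(means.mean())    # close to 0.5
print(means.var())     # close to 1/(12*100) ≈ 0.000833
```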
Central Limit Theorem
Example: [figure]
LR: Probabilistic Interpretation
LR: Probabilistic Interpretation 2 πσ e − 1 2 σ 2 ( w T x i − y i ) 2 1 prob ( y i | x i ) = √ Yijun Zhao Linear Regression
LR: Probabilistic Interpretation
Likelihood of the entire dataset:
$L \;\propto\; \prod_i e^{-\frac{1}{2\sigma^2}(w^T x_i - y_i)^2} \;=\; e^{-\frac{1}{2\sigma^2}\sum_i (w^T x_i - y_i)^2}$
Maximize $L$ $\iff$ Minimize $\sum_i (w^T x_i - y_i)^2$
Non-linear Transformation
Linear is limited:
Linear models become powerful when we consider non-linear feature transformations:
$X_i = (1, x_i, x_i^2) \;\Rightarrow\; y_i = \omega_0 + \omega_1 x_i + \omega_2 x_i^2$
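A short sketch of this transformation in NumPy; the helper name `quadratic_features` and the synthetic quadratic data are assumptions made for illustration. Note that the fit stays linear in $w$ even though the model is quadratic in $x$.

```python
import numpy as np

def quadratic_features(x):
    """Map a 1-D input x_i to (1, x_i, x_i^2), as in the slide."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x, x ** 2])

# Synthetic data from a quadratic target with a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=100)
y = 1.0 - 0.5 * x + 3.0 * x ** 2 + rng.normal(0, 0.1, size=100)

X = quadratic_features(x)
w = np.linalg.pinv(X) @ y      # ordinary least squares on the transformed features
print(w)                       # roughly [1.0, -0.5, 3.0]
```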
Overfitting
Overfitting
How do we know we have overfit?
$E_{in}$: error on the training data
$E_{out}$: error on the test data
Example: [figure]
Overfitting
How to avoid overfitting?
Use more data
Evaluate on a parameter tuning set
Regularization
Regularization
Attempts to impose the "Occam's razor" principle
Add a penalty term for model complexity
Most commonly used:
L2 regularization (ridge regression) minimizes:
$E(w) = \|Xw - y\|^2 + \lambda\,\|w\|^2$, where $\lambda \ge 0$ and $\|w\|^2 = w^T w$
L1 regularization (LASSO) minimizes:
$E(w) = \|Xw - y\|^2 + \lambda\,|w|_1$, where $\lambda \ge 0$ and $|w|_1 = \sum_{i=1}^{D} |\omega_i|$
Regularization
L2: closed-form solution $w = (X^T X + \lambda I)^{-1} X^T y$
L1: no closed-form solution. Use quadratic programming:
minimize $\|Xw - y\|^2$ s.t. $\|w\|_1 \le s$
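A minimal sketch of the L2 closed-form solution in NumPy; the helper name `ridge_fit` is an assumption, and a linear solve is used instead of forming the inverse explicitly.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form L2-regularized solution: w = (X^T X + lambda * I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# lam = 0 recovers ordinary least squares; larger lam shrinks the weights.
```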
L2 Regularization Example
Model Selection
Which model? A central problem in supervised learning
Simple models "underfit" the data: a constant function, or a linear model applied to quadratic data
Complex models "overfit" the data: high-degree polynomials, or a model with enough hidden flexibility to fit the training data completely
Bias-Variance Trade-off
Consider $E(w) = \frac{1}{N}\sum_{n=1}^{N}(w^T x_n - y_n)^2$ and let $\hat{y} = w^T x_n$.
$E[(\hat{y} - y_n)^2]$ can be decomposed into (reading):
$\mathrm{var}\{\text{noise}\} + \text{bias}^2 + \mathrm{var}\{\hat{y}\}$
$\mathrm{var}\{\text{noise}\}$: can't be reduced
$\text{bias}^2 + \mathrm{var}\{\hat{y}\}$ is what counts for prediction
High $\text{bias}^2$: model mismatch, often due to "underfitting"
High $\mathrm{var}\{\hat{y}\}$: training set and test set mismatch, often due to "overfitting"
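A rough simulation of this decomposition at a single test point, under illustrative assumptions (a sinusoidal target, Gaussian noise, and a plain linear fit, none of which come from the slides); it repeatedly refits the model on fresh training sets and measures $\text{bias}^2$ and $\mathrm{var}\{\hat{y}\}$ empirically.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)          # assumed target function
x_test, sigma = 0.7, 0.1                 # test point and noise level

preds = []
for _ in range(2000):                    # many independent training sets
    x = rng.uniform(-1, 1, size=10)
    y = f(x) + rng.normal(0, sigma, size=10)
    X = np.column_stack([np.ones_like(x), x])   # simple linear model
    w = np.linalg.pinv(X) @ y
    preds.append(w[0] + w[1] * x_test)          # y_hat at the test point

preds = np.array(preds)
bias_sq = (preds.mean() - f(x_test)) ** 2
variance = preds.var()
print(bias_sq, variance, sigma ** 2)     # bias^2, var{y_hat}, irreducible noise
```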
Bias-Variance Trade-off
Often:
low bias ⇒ high variance
low variance ⇒ high bias
Trade-off: [figure]
How to choose λ?
But we still need to pick $\lambda$.
Use the test set data? NO!
Set aside another evaluation set
Small evaluation set ⇒ inaccurate estimated error
Large evaluation set ⇒ small training set
Cross validation
Cross Validation (CV)
Divide the data into K folds
In turn, train on all folds except the k-th, and test on the k-th fold
Cross Validation (CV)
How to choose K? Common choices are K = 5, 10, or N (LOOCV)
Measure the average performance across folds
Cost of computation: K folds × number of choices of λ
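A sketch of K-fold cross-validation for picking $\lambda$ with ridge regression, in NumPy; the fold construction, the fixed random seed, and the helper name `cv_error` are illustrative assumptions.

```python
import numpy as np

def cv_error(X, y, lam, K=5):
    """Average validation MSE of ridge regression over K folds."""
    N, d = X.shape
    idx = np.random.default_rng(0).permutation(N)   # shuffle, then split into K folds
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        w = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(d),
                            X[train].T @ y[train])
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errs)

# Pick the lambda with the smallest cross-validated error, e.g.:
# best_lam = min(lams, key=lambda lam: cv_error(X, y, lam))
```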
Learning Curve
A learning curve plots the performance of the algorithm as a function of the size of the training set
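A minimal sketch of how such a curve could be computed for OLS, assuming a separate held-out test set `(X_test, y_test)`; plotting is omitted and the helper name `learning_curve` is made up for illustration.

```python
import numpy as np

def learning_curve(X, y, X_test, y_test, sizes):
    """Training and test MSE of OLS as a function of training-set size."""
    curve = []
    for n in sizes:
        w = np.linalg.pinv(X[:n]) @ y[:n]            # fit on the first n examples
        e_in = np.mean((X[:n] @ w - y[:n]) ** 2)     # training error
        e_out = np.mean((X_test @ w - y_test) ** 2)  # test error
        curve.append((n, e_in, e_out))
    return curve
```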