Learning: Linear Methods — CE417: Introduction to Artificial Intelligence — PowerPoint PPT Presentation
SLIDE 1

Learning: Linear Methods

CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2019

Soleymani

Some slides are based on Klein and Abbeel, CS188, UC Berkeley.

SLIDE 2

Paradigms of ML

- Supervised learning (regression, classification)
  - Predicting a target variable for which we get to see examples.
- Unsupervised learning
  - Revealing structure in the observed unlabeled data.
- Reinforcement learning
  - Partial (indirect) feedback, no explicit guidance.
  - Given rewards for a sequence of moves, learn a policy and utility functions.

SLIDE 3

Components of (Supervised) Learning

- Unknown target function: f: 𝒳 → 𝒴
  - Input space: 𝒳
  - Output space: 𝒴
- Training data: (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N))
- Pick a hypothesis h: 𝒳 → 𝒴 that approximates the target function f
  - selected from a set of hypotheses ℋ

SLIDE 4

Supervised Learning: Regression vs. Classification

- Supervised Learning
  - Regression: predict a continuous target variable
    - E.g., y ∈ [0, 1]
  - Classification: predict a discrete target variable
    - E.g., y ∈ {1, 2, …, C}

SLIDE 5

Regression: Example

- Housing price prediction
  [Scatter plot: Size in feet² (x-axis) vs. Price ($) in 1000's (y-axis)]

Figure adopted from slides of Andrew Ng

SLIDE 6

Training data: Example

  x₁    x₂     y
  0.9   2.3    1
  3.5   2.6    1
  2.6   3.3    1
  2.7   4.1    1
  1.8   3.9    1
  6.5   6.8   −1
  7.2   7.5   −1
  7.9   8.3   −1
  6.9   8.3   −1
  8.8   7.9   −1
  9.1   6.2   −1

SLIDE 7

Classification: Example

- Predict the class (Cat vs. Dog) from weight
  [Plot: weight (x-axis) vs. label, 1 (Dog) / 0 (Cat)]

SLIDE 8

Linear regression

Cost function:

J(w) = Σ_{i=1}^{N} (y^(i) − h(x^(i); w))²
     = Σ_{i=1}^{N} (y^(i) − w₀ − w₁x^(i))²

[Plot: training data with a fitted line; the residual y^(i) − h(x^(i); w) is shown for one point]
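The SSE cost above can be evaluated directly. A minimal sketch in Python; the data and parameter values are illustrative, not from the slides:

```python
import numpy as np

def sse_cost(w0, w1, x, y):
    """J(w) = sum_i (y^(i) - w0 - w1*x^(i))^2, the cost written above."""
    residuals = y - (w0 + w1 * x)
    return float(np.sum(residuals ** 2))

# Illustrative data: y = 2x exactly, so (w0, w1) = (0, 2) gives zero cost
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
cost_good = sse_cost(0.0, 2.0, x, y)   # residuals all zero
cost_bad = sse_cost(0.0, 1.0, x, y)    # residuals 1, 2, 3 -> 1 + 4 + 9 = 14
```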

SLIDE 9

Cost function

[Left: data plot, Size in feet² (x) vs. Price ($) in 1000's. Right: surface of J(w) as a function of the parameters w₀, w₁]

This example has been adapted from: Prof. Andrew Ng's slides

SLIDE 10

Review: Iterative optimization of cost function

- Cost function: J(w)
- Optimization problem: ŵ = argmin_w J(w)
- Steps:
  - Start from w⁰
  - Repeat
    - Update wᵗ to wᵗ⁺¹ in order to reduce J
    - t ← t + 1
  - until we hopefully end up at a minimum

SLIDE 11

Review: Gradient descent

- First-order optimization algorithm to find w* = argmin_w J(w)
- Also known as "steepest descent"
- In each step, takes steps proportional to the negative of the gradient vector of the function at the current point wᵗ:
  wᵗ⁺¹ = wᵗ − ηᵗ ∇J(wᵗ)
  - J(w) decreases fastest if one goes from wᵗ in the direction of −∇J(wᵗ)
  - Assumption: J(w) is defined and differentiable in a neighborhood of the point wᵗ
- Gradient ascent takes steps proportional to (the positive of) the gradient to find a local maximum of the function
SLIDE 12

Review: Gradient descent

- Minimize J(w):
  wᵗ⁺¹ = wᵗ − η ∇_w J(wᵗ)
  ∇_w J(w) = [∂J(w)/∂w₁, ∂J(w)/∂w₂, …, ∂J(w)/∂w_d]
- If η is small enough, then J(wᵗ⁺¹) ≤ J(wᵗ).
- η is the step size (learning rate parameter); it can be allowed to change at every iteration as ηᵗ.
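The update rule above can be sketched in a few lines. The quadratic cost and the learning rate below are illustrative choices, not from the slides:

```python
import numpy as np

def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Iterate w_{t+1} = w_t - lr * grad(w_t), the update rule above."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# J(w) = (w_0 - 3)^2 + (w_1 + 1)^2 has its minimum at (3, -1)
grad = lambda w: 2 * (w - np.array([3.0, -1.0]))
w_star = gradient_descent(grad, [0.0, 0.0], lr=0.1, steps=200)
```

With a convex quadratic like this, any sufficiently small fixed step size converges to the unique minimum.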

SLIDE 13

Review: Gradient descent disadvantages

- Local minima problem
- However, when J is convex, all local minima are also global minima ⇒ gradient descent can converge to the global solution.

SLIDE 14

Review: Problem of gradient descent with non-convex cost functions

[Surface plot of a non-convex J(w₀, w₁)]

This example has been adopted from: Prof. Ng's slides (ML Online Course, Stanford)

SLIDE 15

Review: Problem of gradient descent with non-convex cost functions

[Surface plot of the same non-convex J(w₀, w₁), with a different descent path]

SLIDE 16

Gradient descent for SSE cost function

- Minimize J(w):
  wᵗ⁺¹ = wᵗ − η ∇_w J(wᵗ)
- J(w): sum of squares error
  J(w) = Σ_{i=1}^{N} (y^(i) − h(x^(i); w))²
- Weight update rule for h(x; w) = wᵀx:
  wᵗ⁺¹ = wᵗ + η Σ_{i=1}^{N} (y^(i) − wᵗᵀx^(i)) x^(i)

SLIDE 17

Gradient descent for SSE cost function

- Weight update rule for h(x; w) = wᵀx:
  wᵗ⁺¹ = wᵗ + η Σ_{i=1}^{N} (y^(i) − wᵀx^(i)) x^(i)
- η too small → gradient descent can be slow.
- η too large → gradient descent can overshoot the minimum. It may fail to converge, or even diverge.

Batch mode: each step considers all training data
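The batch update rule above can be sketched directly. The toy data, learning rate, and step count are illustrative assumptions:

```python
import numpy as np

def fit_linear_sse(X, y, lr=0.01, steps=1000):
    """Batch gradient descent on J(w) = sum_i (y^(i) - w^T x^(i))^2.
    X: (N, d) inputs already including a leading 1 column for the bias."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        residual = y - X @ w          # y^(i) - w^T x^(i) for every i
        w = w + lr * X.T @ residual   # w <- w + eta * sum_i residual_i * x^(i)
    return w

# Tiny noiseless data (made up): y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x
w = fit_linear_sse(X, y, lr=0.05, steps=2000)
```

The step size matters exactly as the slide warns: here lr must stay below 2/λ_max(XᵀX) ≈ 0.12 or the iterates diverge.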

SLIDE 18

[Left: fitted line h(x; w₀, w₁) = w₀ + w₁x on the data. Right: contour plot of J(w₀, w₁) as a function of the parameters w₀, w₁]

SLIDES 19–26

[Successive snapshots of the same pair of plots as the parameters w₀, w₁ are updated]

SLIDE 27

Linear Classifiers

SLIDE 28

Error-Driven Classification

SLIDE 29

Errors, and What to Do

Examples of errors:

"Dear GlobalSCAPE Customer, GlobalSCAPE has partnered with ScanSoft to offer you the latest version of OmniPage Pro, for just $99.99* - the regular list price is $499! The most common question we've received about this offer is - Is this genuine? We would like to assure you that this offer is authorized by ScanSoft, is genuine and valid. You can get the . . ."

". . . To receive your $30 Amazon.com promotional certificate, click through to http://www.amazon.com/apparel and see the prominent link for the $30 offer. All details are there. We hope you enjoyed receiving this message. However, if you'd rather not receive future e-mails announcing new store launches, please click . . ."

SLIDE 30

What to Do About Errors

- Problem: there's still spam in your inbox
- Need more features – words aren't enough!
  - Have you emailed the sender before?
  - Have 1M other people just gotten the same email?
  - Is the sending information consistent?
  - Is the email in ALL CAPS?
  - Do inline URLs point where they say they point?
  - Does the email address you by (your) name?
- Naïve Bayes models can incorporate a variety of features, but tend to do best in homogeneous cases (e.g. all features are word occurrences)

SLIDE 31

Feature Vectors

"Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just …" →
  # free : 2
  YOUR_NAME : 0
  MISSPELLED : 2
  FROM_FRIEND : 0
  ...
→ SPAM or +

[Pixel image of a digit] →
  PIXEL-7,12 : 1
  PIXEL-7,13 : 0
  ...
  NUM_LOOPS : 1
  ...
→ "2"

SLIDE 32

Weights

- Binary case: compare features to a weight vector to identify the class
- Learning: figure out the weight vector from examples

Weight vector:
  # free : 4, YOUR_NAME : -1, MISSPELLED : 1, FROM_FRIEND : -3, ...

Example feature vectors:
  # free : 2, YOUR_NAME : 0, MISSPELLED : 2, FROM_FRIEND : 0, ...
  # free : 0, YOUR_NAME : 1, MISSPELLED : 1, FROM_FRIEND : 1, ...

Dot product positive means the positive class
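The decision rule on this slide is just a dot product against the weight vector. A minimal sketch using the slide's feature names and weights (the two example emails are the slide's feature vectors):

```python
# Score a feature vector against a weight vector; positive means the
# positive class (SPAM). Weights and features follow the slide's example.
weights = {"free": 4, "YOUR_NAME": -1, "MISSPELLED": 1, "FROM_FRIEND": -3}
email_a = {"free": 2, "YOUR_NAME": 0, "MISSPELLED": 2, "FROM_FRIEND": 0}
email_b = {"free": 0, "YOUR_NAME": 1, "MISSPELLED": 1, "FROM_FRIEND": 1}

def score(w, f):
    """Dot product over the shared feature names."""
    return sum(w[k] * f.get(k, 0) for k in w)

label_a = "SPAM" if score(weights, email_a) > 0 else "HAM"  # 4*2 + 1*2 = 10 > 0
label_b = "SPAM" if score(weights, email_b) > 0 else "HAM"  # -1 + 1 - 3 = -3 < 0
```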

SLIDE 33

Linear Classifier example

- Two-class example:
  - Decision boundary: −(3/4)x₁ − x₂ + 3 = 0
  - w = [−3/4, −1]ᵀ, w₀ = 3
  - If wᵀx + w₀ ≥ 0 then 𝒟₁, else 𝒟₂

  [Plot of the boundary line in the (x₁, x₂) plane, with regions 𝒟₁ and 𝒟₂]

SLIDE 34

Binary Decision Rule

- In the space of feature vectors
  - Examples are points
  - Any weight vector is a hyperplane
  - One side corresponds to y = +1
  - Other corresponds to y = −1

Example weights: BIAS : -3, free : 4, money : 2, ...
[Plot in (free, money) space: +1 = SPAM on one side of the line, −1 = HAM on the other]

SLIDE 35

Distance between a point x⁽ⁿ⁾ and the plane

distance = (wᵀx⁽ⁿ⁾ + w₀) / ‖w‖

SLIDE 36

Squared error loss function for classification?

Squared error loss is not suitable for classification:

- Least squares loss penalizes "too correct" predictions (those that lie a long way on the correct side of the decision boundary)
- Least squares loss also lacks robustness to noise

J(w) = Σ_{i=1}^{N} (wᵀx^(i) + w₀ − y^(i))²

[Plot: two-class data where least squares is thrown off by points far from the boundary]

SLIDE 37

Notation

- w = [w₀, w₁, …, w_d]ᵀ
- x = [1, x₁, …, x_d]ᵀ
- w₀ + w₁x₁ + ⋯ + w_d x_d = wᵀx
- We denote the input by x or by its features φ(x)

SLIDE 38

SSE cost function for classification

- Is it more suitable if we set h(x; w) = sign(wᵀx)?

J(w) = Σ_{i=1}^{N} (sign(wᵀx^(i)) − y^(i))²

sign(a) = { −1,  a < 0
             1,  a ≥ 0 }

- J(w) is a piecewise constant function that reflects the number of misclassifications
- Training error incurred in classifying training samples

[Plot: (sign(wᵀx) − y)² as a function of wᵀx, for y = 1]

SLIDE 39

Perceptron algorithm

- Linear classifier
- Two-class: y ∈ {−1, 1}
  - y = −1 for 𝒟₂, y = 1 for 𝒟₁
- Goal:
  - ∀i, x^(i) ∈ 𝒟₁ ⇒ wᵀx^(i) > 0
  - ∀i, x^(i) ∈ 𝒟₂ ⇒ wᵀx^(i) < 0
- h(x; w) = sign(wᵀx)

SLIDE 40

Perceptron criterion

J_P(w) = − Σ_{i∈ℳ} wᵀx^(i) y^(i)

ℳ: subset of training data that are misclassified

Many solutions? Which solution among them?

SLIDE 41

Cost function

[Duda, Hart, and Stork, 2002]
[Left: # of misclassifications as a cost function, J(w). Right: Perceptron's cost function, J_P(w), each plotted over w₀, w₁]

There may be many solutions in these cost functions

SLIDE 42

Batch Perceptron

"Gradient descent" to solve the optimization problem:

wᵗ⁺¹ = wᵗ − η ∇_w J_P(wᵗ)
∇_w J_P(w) = − Σ_{i∈ℳ} x^(i) y^(i)

Batch Perceptron converges in a finite number of steps for linearly separable data:

  Initialize w
  Repeat
    w ← w + η Σ_{i∈ℳ} x^(i) y^(i)
  Until convergence

SLIDE 43

Stochastic gradient descent for Perceptron

- Single-sample perceptron:
  - If x^(i) is misclassified:
    wᵗ⁺¹ = wᵗ + η x^(i) y^(i)
- Perceptron convergence theorem: for linearly separable data
  - If training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps

Fixed-increment single-sample Perceptron (η can be set to 1 and the proof still works):

  Initialize w, t ← 0
  Repeat
    t ← t + 1
    i ← t mod N
    if x^(i) is misclassified then w ← w + x^(i) y^(i)
  Until all patterns properly classified
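The fixed-increment procedure above can be sketched as follows. The four training points are taken from the table on slide 6 (which is linearly separable); the epoch cap is an added safeguard:

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Fixed-increment single-sample perceptron, as in the pseudocode above.
    X: (N, d) inputs with a leading 1 column absorbing the bias; y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:     # misclassified (ties count as errors)
                w = w + yi * xi        # w <- w + x^(i) y^(i)
                mistakes += 1
        if mistakes == 0:              # all patterns properly classified
            break
    return w

# A separable subset of the slide-6 training data
X = np.array([[1, 0.9, 2.3], [1, 2.6, 3.3], [1, 7.2, 7.5], [1, 9.1, 6.2]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
```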

SLIDE 44

Weight Updates

SLIDE 45

Learning: Binary Perceptron

- Start with weights = 0
- For each training instance:
  - Classify with current weights
  - If correct (i.e., y = y*), no change!
  - If wrong: adjust the weight vector

wᵗ⁺¹ = wᵗ + η x^(i) y^(i)

SLIDE 46

Example

SLIDE 47

Perceptron: Example

Change w in a direction that corrects the error [Bishop]

SLIDE 48

Learning: Binary Perceptron

- Start with weights = 0
- For each training instance:
  - Classify with current weights
  - If correct (i.e., y = y*), no change!
  - If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is −1.

SLIDE 49

Examples: Perceptron

- Separable Case

SLIDE 50

Convergence of Perceptron

- For data sets that are not linearly separable, the single-sample perceptron learning algorithm will never converge

[Duda, Hart & Stork, 2002]

SLIDE 51

Multiclass Decision Rule

- If we have multiple classes:
  - A weight vector for each class: w_c
  - Score (activation) of a class c: w_cᵀx
  - Prediction: highest score wins, ŷ = argmax_c w_cᵀx

Binary = multiclass where the negative class has weight zero

SLIDE 52

Learning: Multiclass Perceptron

- Start with all weights = 0
- Pick up training examples one by one
- Predict with current weights
- If correct, no change!
- If wrong: lower the score of the wrong answer, raise the score of the right answer
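The multiclass update above can be sketched in a few lines. The three-class toy data and epoch count are illustrative assumptions, not from the slides:

```python
import numpy as np

def multiclass_perceptron(X, y, n_classes, epochs=100):
    """Multiclass perceptron following the slide: on a mistake, raise the
    right class's weight vector and lower the predicted (wrong) class's."""
    W = np.zeros((n_classes, X.shape[1]))   # one weight vector per class
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = int(np.argmax(W @ xi))   # highest score wins
            if pred != yi:
                W[yi] += xi                 # raise score of right answer
                W[pred] -= xi               # lower score of wrong answer
    return W

# Illustrative 3-class data, with a leading 1 column for the bias
X = np.array([[1, 0, 0], [1, 0, 1], [1, 5, 0], [1, 5, 1], [1, 0, 5], [1, 1, 5]], float)
y = np.array([0, 0, 1, 1, 2, 2])
W = multiclass_perceptron(X, y, n_classes=3)
preds = np.argmax(X @ W.T, axis=1)
```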

SLIDE 53

Example: Multiclass Perceptron

Inputs: "win the vote", "win the election", "win the game"

Initial weight vectors (one per class):
  BIAS : 1, win : 0, game : 0, vote : 0, the : 0, ...
  BIAS : 0, win : 0, game : 0, vote : 0, the : 0, ...
  BIAS : 0, win : 0, game : 0, vote : 0, the : 0, ...

SLIDE 54

Properties of Perceptrons

- Separability: true if some parameters get the training set perfectly classified
- Convergence: if the training set is separable, perceptron will eventually converge (binary case)
- Mistake Bound: the maximum number of mistakes (binary case) is related to the margin or degree of separability

[Illustrations: Separable vs. Non-Separable data]

SLIDE 55

Examples: Perceptron

- Non-Separable Case

SLIDE 56

Examples: Perceptron

- Non-Separable Case

SLIDE 57

Discriminative approach: logistic regression

h(x; w) = σ(wᵀx)

σ(·) is an activation function

- Sigmoid (logistic) function as the activation function:
  σ(a) = 1 / (1 + e⁻ᵃ)

x = [1, x₁, …, x_d]ᵀ,  w = [w₀, w₁, …, w_d]ᵀ

SLIDE 58

Logistic regression: cost function

ŵ = argmin_w J(w)

J(w) = Σ_{i=1}^{N} [ −y^(i) log σ(wᵀx^(i)) − (1 − y^(i)) log(1 − σ(wᵀx^(i))) ]

- J(w) is convex w.r.t. the parameters.

SLIDE 59

Logistic regression: loss function

Loss(y, h(x; w)) = −y × log σ(wᵀx) − (1 − y) × log(1 − σ(wᵀx))

Since y = 1 or y = 0:

Loss(y, σ(wᵀx)) = { −log σ(wᵀx)         if y = 1
                    −log(1 − σ(wᵀx))    if y = 0 }

How is it related to zero-one loss?  Loss(y, ŷ) = { 1 if y ≠ ŷ;  0 if y = ŷ }

σ(wᵀx) = 1 / (1 + exp(−wᵀx))

SLIDE 60

Logistic regression: Gradient descent

wᵗ⁺¹ = wᵗ − η ∇_w J(wᵗ)

∇_w J(w) = Σ_{i=1}^{N} (σ(wᵀx^(i)) − y^(i)) x^(i)

- Is it similar to the gradient of SSE for linear regression?

∇_w J(w) = Σ_{i=1}^{N} (wᵀx^(i) − y^(i)) x^(i)
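The gradient above translates directly into a descent loop. A minimal sketch; the 1-D data, learning rate, and step count are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, lr=0.1, steps=5000):
    """Gradient descent on the cross-entropy cost above.
    X: (N, d) with a leading 1 column; y in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y)   # sum_i (sigma(w^T x_i) - y_i) x_i
        w = w - lr * grad
    return w

# Illustrative 1-D data: class 1 for positive x
x = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0, 0, 0, 1, 1, 1])
w = fit_logistic(X, y)
preds = (sigmoid(X @ w) >= 0.5).astype(int)
```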

SLIDE 61

Multi-class logistic regression

- h(x; W) = [h₁(x; W), …, h_K(x; W)]ᵀ
- W = [w₁ ⋯ w_K] contains one vector of parameters for each class

h_k(x; W) = exp(w_kᵀx) / Σ_{j=1}^{K} exp(w_jᵀx)
SLIDE 62

Logistic regression: multi-class

Ŵ = argmin_W J(W)

J(W) = − Σ_{i=1}^{N} Σ_{k=1}^{K} y_k^(i) log h_k(x^(i); W)

W = [w₁ ⋯ w_K]; y is a vector of length K (1-of-K coding), e.g., y = [0, 0, 1, 0]ᵀ when the target class is 𝒟₃

SLIDE 63

Logistic regression: multi-class

w_jᵗ⁺¹ = w_jᵗ − η ∇_{w_j} J(Wᵗ)

∇_{w_j} J(W) = Σ_{i=1}^{N} (h_j(x^(i); W) − y_j^(i)) x^(i)
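The softmax function and the per-class gradient above can be sketched together. The max-shift inside `softmax` is an added numerical-stability detail, not something the slides discuss:

```python
import numpy as np

def softmax(scores):
    """h_k = exp(s_k) / sum_j exp(s_j), shifted by the max for stability."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax_grad_step(W, X, Y, lr=0.1):
    """One gradient step of the multi-class update above.
    W: (d, K); X: (N, d); Y: (N, K) one-hot (1-of-K coding)."""
    H = softmax(X @ W)          # (N, K) class probabilities
    grad = X.T @ (H - Y)        # sum_i (h_j - y_j) x^(i), for all j at once
    return W - lr * grad

# Sanity check on one score vector: outputs are a probability distribution
p = softmax(np.array([1.0, 2.0, 3.0]))
```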

SLIDE 64

Multi-class classifier

- h(x; W) = [h₁(x; W), …, h_K(x; W)]ᵀ
- W = [w₁ ⋯ w_K] contains one vector of parameters for each class
- In linear classifiers, W is d×K, where d is the number of features
- Wᵀx provides us a vector
- h(x; W) contains K numbers giving class scores for the input x

SLIDE 65

Example

- Output obtained from Wᵀx + b

x = [x₁, …, x₇₈₄]ᵀ (a 28×28 image, flattened)
Wᵀ = [w₁; …; w₁₀], a 10×784 matrix;  b = [b₁, …, b₁₀]ᵀ

This slide has been adopted from Fei Fei Li and colleagues lectures, cs231n, Stanford 2017

SLIDE 66

Example

How can we tell whether this W and b are good or bad?

SLIDE 67

Bias can also be included in the W matrix

SLIDE 68

Softmax classifier loss: example

L^(i) = −log( e^{s_{y^(i)}} / Σ_{j=1}^{K} e^{s_j} )

L^(1) = −log 0.13 = 0.89

SLIDE 69

Support Vector Machines

- Maximizing the margin: good according to intuition, theory, practice
- Support vector machines (SVMs) find the separator with max margin

SLIDE 70

Hard-margin SVM: Optimization problem

max_{w, w₀}  2 / ‖w‖
s.t.  wᵀx⁽ⁿ⁾ + w₀ ≥ 1    ∀n: y⁽ⁿ⁾ = 1
      wᵀx⁽ⁿ⁾ + w₀ ≤ −1   ∀n: y⁽ⁿ⁾ = −1

[Plot: separating hyperplane wᵀx + w₀ = 0 with margin boundaries wᵀx + w₀ = ±1; margin = 2/‖w‖]
SLIDE 71

Distance between a point x⁽ⁿ⁾ and the plane

distance = (wᵀx⁽ⁿ⁾ + w₀) / ‖w‖

SLIDE 72

Hard-margin SVM: Optimization problem

We can equivalently optimize:

min_{w, w₀}  (1/2) wᵀw
s.t.  y⁽ⁿ⁾ (wᵀx⁽ⁿ⁾ + w₀) ≥ 1,   n = 1, …, N

- It is a convex Quadratic Programming (QP) problem
  - There are computationally efficient packages to solve it.
  - It has a global minimum (if any).

SLIDE 73

Error measure

- Margin violation amount ξₙ (ξₙ ≥ 0):
  y⁽ⁿ⁾ (wᵀx⁽ⁿ⁾ + w₀) ≥ 1 − ξₙ
- Total violation: Σ_{n=1}^{N} ξₙ
SLIDE 74

Soft-margin SVM: Optimization problem

- SVM with slack variables: allows samples to fall within the margin, but penalizes them

min_{w, w₀, {ξₙ}}  (1/2)‖w‖² + C Σ_{n=1}^{N} ξₙ
s.t.  y⁽ⁿ⁾ (wᵀx⁽ⁿ⁾ + w₀) ≥ 1 − ξₙ,  ξₙ ≥ 0,  n = 1, …, N

ξₙ: slack variables
  0 < ξₙ < 1: x⁽ⁿ⁾ is correctly classified but inside the margin
  ξₙ > 1: x⁽ⁿ⁾ is misclassified

[Plot: points inside the margin (ξ < 1) and misclassified points (ξ > 1)]

SLIDE 75

Soft-margin SVM: Cost function

min_{w, w₀, {ξₙ}}  (1/2)‖w‖² + C Σ_{n=1}^{N} ξₙ
s.t.  y⁽ⁿ⁾ (wᵀx⁽ⁿ⁾ + w₀) ≥ 1 − ξₙ,  ξₙ ≥ 0,  n = 1, …, N

- It is equivalent to the unconstrained optimization problem:

min_{w, w₀}  (1/2)‖w‖² + C Σ_{n=1}^{N} max(0, 1 − y⁽ⁿ⁾(wᵀx⁽ⁿ⁾ + w₀))
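The unconstrained hinge-loss form above is easy to evaluate directly. A minimal sketch; the data points, weights, and C value are made up for illustration:

```python
import numpy as np

def soft_margin_cost(w, w0, X, y, C=1.0):
    """Unconstrained soft-margin SVM cost from the slide:
    (1/2)||w||^2 + C * sum_n max(0, 1 - y^(n) (w^T x^(n) + w0))."""
    margins = y * (X @ w + w0)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * w @ w + C * hinge.sum()

# A point with margin >= 1 contributes no hinge loss;
# one inside the margin contributes 1 - margin.
X = np.array([[2.0, 0.0], [0.5, 0.0]])
y = np.array([1.0, 1.0])
w = np.array([1.0, 0.0])
w0 = 0.0
cost = soft_margin_cost(w, w0, X, y, C=1.0)   # 0.5*1 + (0 + 0.5) = 1.0
```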
SLIDE 76

Multi-class SVM

J(W) = (1/N) Σ_{i=1}^{N} L^(i) + λ R(W)

Hinge loss:
L^(i) = Σ_{j≠y^(i)} max(0, 1 + s_j − s_{y^(i)})
      = Σ_{j≠y^(i)} max(0, 1 + w_jᵀx^(i) − w_{y^(i)}ᵀx^(i))

where s_j ≡ h_j(x^(i); W) = w_jᵀx^(i)

L2 regularization:
R(W) = Σ_{k=1}^{K} Σ_{l=1}^{d} w_{lk}²

SLIDE 77

Multi-class SVM loss: Example

3 training examples, 3 classes. With some W the scores are s = Wᵀx.

L^(i) = Σ_{j≠y^(i)} max(0, 1 + s_j − s_{y^(i)})

L^(1) = max(0, 1 + 5.1 − 3.2) + max(0, 1 − 1.7 − 3.2)
      = max(0, 2.9) + max(0, −3.9) = 2.9 + 0

L^(2) = max(0, 1 + 1.3 − 4.9) + max(0, 1 + 2 − 4.9)
      = max(0, −2.6) + max(0, −1.9) = 0 + 0

L^(3) = max(0, 2.2 − (−3.1) + 1) + max(0, 2.5 − (−3.1) + 1)
      = max(0, 6.3) + max(0, 6.6) = 6.3 + 6.6 = 12.9

(1/N) Σ_{i=1}^{N} L^(i) = (1/3)(2.9 + 0 + 12.9) ≈ 5.27

This slide has been adopted from Fei Fei Li and colleagues lectures, cs231n, Stanford 2017
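The worked example above can be checked in a few lines. The score vectors and target classes are read off the slide's numbers:

```python
import numpy as np

def multiclass_hinge(scores, target):
    """L = sum_{j != target} max(0, 1 + s_j - s_target), as on the slide."""
    margins = np.maximum(0.0, 1.0 + scores - scores[target])
    margins[target] = 0.0          # the true class does not contribute
    return float(margins.sum())

# The three score vectors and target classes from the slide's example
scores = [np.array([3.2, 5.1, -1.7]),
          np.array([1.3, 4.9, 2.0]),
          np.array([2.2, 2.5, -3.1])]
targets = [0, 1, 2]
losses = [multiclass_hinge(s, t) for s, t in zip(scores, targets)]  # 2.9, 0.0, 12.9
mean_loss = sum(losses) / len(losses)                               # ~5.27
```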

SLIDE 78

Recap

We need ∇_W L to update the weights.

L2 regularization: R(W) = Σ_{k=1}^{K} Σ_{l=1}^{d} w_{lk}²
L1 regularization: R(W) = Σ_{k=1}^{K} Σ_{l=1}^{d} |w_{lk}|

SLIDE 79

Generalized linear

- Linear combination of fixed non-linear functions of the input vector:
  h(x; w) = w₀ + w₁φ₁(x) + … + w_m φ_m(x)
  {φ₁(x), …, φ_m(x)}: set of basis functions (or features), φᵢ: ℝᵈ → ℝ

SLIDE 80

Basis functions: examples

- Linear
- Polynomial (univariate)

SLIDE 81

Polynomial regression: example

[Fits of degree m = 1, 3, 5, 7 to the data]
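Polynomial regression is just the generalized linear model with basis functions φ_j(x) = x^j. A minimal sketch; the sine-plus-noise data and degree are illustrative choices, not the slide's dataset:

```python
import numpy as np

def poly_design(x, m):
    """Design matrix of polynomial basis functions phi_j(x) = x^j, j = 0..m
    (phi_0 = 1 absorbs the bias term)."""
    return np.vander(x, m + 1, increasing=True)   # columns 1, x, x^2, ..., x^m

# Illustrative degree-3 fit by ordinary least squares
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)
Phi = poly_design(x, 3)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ w
```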

SLIDE 82

Generalized linear classifier

- Assume a transformation φ: ℝᵈ → ℝᵐ on the feature space
  - x → φ(x),  φ(x) = [φ₁(x), …, φ_m(x)]ᵀ
- Find a hyperplane in the transformed feature space:
  wᵀφ(x) + w₀ = 0

[Plot: a non-linear boundary in (x₁, x₂) space becomes linear in (φ₁(x), φ₂(x)) space]

SLIDE 83

Model complexity and overfitting

- With limited training data, models may achieve zero training error but a large test error.
- Over-fitting: when the training loss no longer bears any relation to the test (generalization) loss.
  - Fails to generalize to unseen examples.

Training (empirical) loss:  (1/N) Σ_{i=1}^{N} (y^(i) − h(x^(i); w))² ≈ 0
Expected (true) loss:  E_{x,y}[(y − h(x; w))²] ≫ 0

SLIDE 84

Polynomial regression

[Bishop] Fits of degree m = 0, 1, 3, 9 to the data

SLIDE 85

Over-fitting causes

- Model complexity
  - E.g., a model with a large number of parameters (degrees of freedom)
- Low number of training data
  - Small data size compared to the complexity of the model

SLIDE 86

Model complexity

- Example:
  - Polynomials with larger m become increasingly tuned to the random noise on the target values.

[Bishop] Fits of degree m = 0, 1, 3, 9

SLIDE 87

Number of training data & overfitting

- The over-fitting problem becomes less severe as the size of the training data increases.

[Bishop] Degree m = 9 fits with N = 15 and N = 100 points

SLIDE 88

How to evaluate the learner's performance?

- Generalization error: the true (or expected) error that we would like to optimize
- Two ways to assess the generalization error:
  - Practical: use a separate data set to test the model
  - Theoretical: Law of Large Numbers
    - statistical bounds on the difference between training and expected errors

SLIDE 89

Avoiding over-fitting

- Determine a suitable value for model complexity (Model Selection)
  - Simple hold-out method
  - Cross-validation
- Regularization (Occam's Razor)
  - Explicit preference towards simple models
  - Penalize model complexity in the objective function

SLIDE 90

Model Selection

- The learning algorithm defines the data-driven search over the hypothesis space (i.e., the search for good parameters)
- Hyperparameters are the tunable aspects of the model that the learning algorithm does not select

This slide has been adopted from CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/

SLIDE 91

Model Selection

- Model selection is the process by which we choose the "best" model from among a set of candidates
  - assumes access to a function capable of measuring the quality of a model
  - typically done "outside" the main training algorithm
- Model selection / hyperparameter optimization is just another form of learning

SLIDE 92

Simple hold-out: model selection

- Steps:
  - Divide the training data into a training set and a validation set
  - Use only the training set to train a set of models
  - Evaluate each learned model on the validation set:
    J_v(w) = (1/N_valid) Σ_{i∈valid} (y^(i) − h(x^(i); w))²
  - Choose the best model based on the validation set error
- Usually too wasteful of valuable training data
  - Training data may be limited.
  - On the other hand, a small validation set gives a relatively noisy estimate of performance.

SLIDE 93

Simple hold-out: training, validation, and test sets

- Simple hold-out chooses the model that minimizes error on the validation set.
  - J_v(ŵ) is likely to be an optimistic estimate of the generalization error.
  - An extra parameter (e.g., degree of polynomial) is fit to this set.
- Estimate the generalization error on the test set
  - The performance of the selected model is finally evaluated on the test set

[Diagram: data split into Training | Validation | Test]

SLIDE 94

Cross-Validation (CV): Evaluation

- k-fold cross-validation steps:
  - Shuffle the dataset and randomly partition the training data into k groups of approximately equal size
  - for i = 1 to k
    - Choose the i-th group as the held-out validation group
    - Train the model on all but the i-th group of data
    - Evaluate the model on the held-out group
  - Performance scores of the model from the k runs are averaged.
- The average error rate can be considered an estimate of the true performance.

[Diagram: first run, second run, …, (k−1)-th run, k-th run, each holding out a different fold]
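The k-fold steps above can be sketched generically. The `fit`/`predict` callables and the near-linear toy data are illustrative assumptions:

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Shuffle n indices and partition them into k roughly equal folds."""
    idx = rng.permutation(n)
    return np.array_split(idx, k)

def cross_val_mse(X, y, fit, predict, k=5, seed=0):
    """k-fold CV as in the steps above: train on k-1 folds, evaluate on the
    held-out fold, and average the k scores."""
    folds = kfold_indices(len(y), k, np.random.default_rng(seed))
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        scores.append(np.mean((predict(model, X[val]) - y[val]) ** 2))
    return float(np.mean(scores))

# Illustrative use with a least-squares linear fit
x = np.linspace(0, 1, 30)
y = 2 * x + 1 + 0.05 * np.random.default_rng(1).standard_normal(30)
X = np.column_stack([np.ones_like(x), x])
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda w, X: X @ w
mse = cross_val_mse(X, y, fit, predict, k=5)
```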

SLIDE 95

Cross-Validation (CV): Model Selection

- For each model, we first find the average error by CV.
- The model with the best average performance is selected.

SLIDE 96

Cross-validation: polynomial regression example

- 5-fold CV
- 100 runs, averaged

m = 1: CV MSE = 0.30;  m = 3: CV MSE = 1.45;  m = 5: CV MSE = 45.44;  m = 7: CV MSE = 31759

SLIDE 97

Regularization

- Adding a penalty term to the cost function discourages the coefficients from reaching large values.
- Ridge regression (weight decay):

J(w) = Σ_{i=1}^{N} (y^(i) − wᵀφ(x^(i)))² + λ wᵀw

ŵ = (ΦᵀΦ + λI)⁻¹ Φᵀy
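The closed-form ridge solution above is a one-liner with a linear solve. The toy data are illustrative; with λ = 0 the formula reduces to ordinary least squares:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution above: w = (Phi^T Phi + lam*I)^{-1} Phi^T y."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

# Noiseless data y = 1 + 2x: lam = 0 recovers the exact coefficients,
# lam > 0 shrinks them toward zero (weight decay)
x = np.array([0.0, 1.0, 2.0, 3.0])
Phi = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x
w_ols = ridge_fit(Phi, y, lam=0.0)
w_ridge = ridge_fit(Phi, y, lam=10.0)
```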

SLIDE 98

Polynomial order

- Polynomials with larger m become increasingly tuned to the random noise on the target values.
- The magnitude of the coefficients typically gets larger with increasing m.

[Bishop]

SLIDE 99

Regularization parameter

[Bishop] Table of the coefficients ŵ₀, ŵ₁, …, ŵ₉ of the m = 9 fit, for ln λ = −∞ and ln λ = −18

SLIDE 100

Regularization parameter

- Generalization
  - λ now controls the effective complexity of the model and hence determines the degree of over-fitting

[Bishop]

SLIDE 101

Choosing the regularization parameter

- Train a set of models with different values of λ.
- Find ŵ for each model based on the training data.
- Find J_v(ŵ) (or the CV error) for each model:
  J_v(w) = (1/N_valid) Σ_{i∈valid} (y^(i) − h(x^(i); w))²
- Select the model with the best J_v(ŵ) (or CV error).

SLIDE 102

The approximation–generalization trade-off

- Small true error shows good approximation of f out of sample
- More complex ℋ ⇒ better chance of approximating f
- Less complex ℋ ⇒ better chance of generalizing out of sample

SLIDE 103

Complexity of Hypothesis Space: Example

[Three fits of housing price vs. size:]
w₀ + w₁x  (less complex ℋ);  w₀ + w₁x + w₂x²;  w₀ + w₁x + w₂x² + w₃x³ + w₄x⁴  (more complex ℋ)

This example has been adapted from: Prof. Andrew Ng's slides

SLIDE 104

Complexity of Hypothesis Space: Example

[The same three fits:]
w₀ + w₁x  (underfitting);  w₀ + w₁x + w₂x²;  w₀ + w₁x + w₂x² + w₃x³ + w₄x⁴  (overfitting)

SLIDE 105

Complexity of Hypothesis Space: Example

[Plot: error vs. degree of polynomial m, with curves J_v and J_train]

J_v(w) = (1/N_valid) Σ_{i∈valid_set} (y^(i) − h(x^(i); w))²
J_train(w) = (1/N_train) Σ_{i∈train_set} (y^(i) − h(x^(i); w))²
SLIDE 106

Complexity of Hypothesis Space

- Less complex ℋ:
  - J_train(ŵ) ≈ J_v(ŵ) and J_train(ŵ) is very high
- More complex ℋ:
  - J_train(ŵ) ≪ J_v(ŵ) and J_train(ŵ) is low

[Plot: error vs. degree of polynomial m, with curves J_v(ŵ) and J_train(ŵ)]
slide-107
SLIDE 107

Regularization: Example

107

𝑔 𝑦; 𝒙 = 𝑥E + 𝑥'𝑦 + 𝑥*𝑦* +𝑥ˆ 𝑦ˆ +𝑥• 𝑦• 𝐾 𝒙 = 1 𝑜 @ 𝑧 : − 𝑔 𝑦 : ; 𝒙

*

+ 𝜇𝒙\𝒙

A :B'

Large 𝜇x

(Prefer to more simple models)

Intermediate 𝜇

Price Size Price Size Price Size

Small 𝜇

(Prefer to more complex models)

𝑥' = 𝑥* ≈ 0 𝜇 = 0 This example has been adapted from: Prof. Andrew Ng’s slides

SLIDE 108

Lesson

Match the model complexity to the data resources, not to the complexity of the target function.

SLIDE 109

Resources

- C. Bishop, "Pattern Recognition and Machine Learning", Chapters 1.1, 1.3, 3.1, 3.2.
- Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, "Learning from Data", Chapters 2.3, 3.2, 3.4.