Learning: Linear Methods
CE417: Introduction to Artificial Intelligence
Sharif University of Technology, Spring 2019
Soleymani
Some slides are based on Klein and Abbeel, CS188, UC Berkeley.
Paradigms of ML
- Supervised learning (regression, classification)
  - predicting a target variable for which we get to see examples
- Unsupervised learning
  - revealing structure in the observed unlabeled data
- Reinforcement learning
  - partial (indirect) feedback, no explicit guidance
  - given rewards for a sequence of moves, learn a policy and utility functions
Components of (Supervised) Learning
- Unknown target function: $g: \mathcal{X} \to \mathcal{Y}$
  - Input space: $\mathcal{X}$
  - Output space: $\mathcal{Y}$
- Training data: $(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \dots, (\mathbf{x}^{(n)}, y^{(n)})$
- Pick a formula $h: \mathcal{X} \to \mathcal{Y}$ that approximates the target function $g$
  - selected from a set of hypotheses $\mathcal{H}$
Supervised Learning: Regression vs. Classification
- Regression: predict a continuous target variable
  - e.g., $y \in [0, 1]$
- Classification: predict a discrete target variable
  - e.g., $y \in \{1, 2, \dots, C\}$
Regression: Example
- Housing price prediction
[Figure: scatter plot of price ($ in 1000's) vs. size in feet²]
Figure adapted from slides of Andrew Ng
Training data: Example

| $x_1$ | $x_2$ | $y$ |
|-------|-------|-----|
| 0.9   | 2.3   |  1  |
| 3.5   | 2.6   |  1  |
| 2.6   | 3.3   |  1  |
| 2.7   | 4.1   |  1  |
| 1.8   | 3.9   |  1  |
| 6.5   | 6.8   | -1  |
| 7.2   | 7.5   | -1  |
| 7.9   | 8.3   | -1  |
| 6.9   | 8.3   | -1  |
| 8.8   | 7.9   | -1  |
| 9.1   | 6.2   | -1  |
Classification: Example
- Weight → class (Cat, Dog)
[Figure: weight on the horizontal axis; target labels 1 (Dog) and 0 (Cat)]
Linear regression
- Hypothesis: $h(x; \mathbf{w}) = w_0 + w_1 x$
- Cost function:
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left( y^{(i)} - h(x^{(i)}; \mathbf{w}) \right)^2 = \sum_{i=1}^{n} \left( y^{(i)} - w_0 - w_1 x^{(i)} \right)^2$$
[Figure: price vs. size scatter with candidate fitted lines]
Cost function
[Figure: left, price ($ in 1000's) vs. size in feet² with a candidate fit; right, surface plot of $J(\mathbf{w})$ as a function of the parameters $w_0, w_1$]
This example has been adapted from Prof. Andrew Ng's slides.
Review: Iterative optimization of cost function
- Cost function: $J(\mathbf{w})$
- Optimization problem: $\hat{\mathbf{w}} = \operatorname{argmin}_{\mathbf{w}} J(\mathbf{w})$
- Steps:
  - Start from $\mathbf{w}^0$
  - Repeat
    - Update $\mathbf{w}^t$ to $\mathbf{w}^{t+1}$ in order to reduce $J$
    - $t \leftarrow t + 1$
  - until we hopefully end up at a minimum
Review: Gradient descent
- First-order optimization algorithm to find $\mathbf{w}^* = \operatorname{argmin}_{\mathbf{w}} J(\mathbf{w})$
- Also known as "steepest descent"
- In each step, takes steps proportional to the negative of the gradient vector of the function at the current point $\mathbf{w}^t$:
$$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta_t \nabla J(\mathbf{w}^t)$$
  - $J(\mathbf{w})$ decreases fastest if one goes from $\mathbf{w}^t$ in the direction of $-\nabla J(\mathbf{w}^t)$
  - Assumption: $J(\mathbf{w})$ is defined and differentiable in a neighborhood of the point $\mathbf{w}^t$
- Gradient ascent takes steps proportional to (the positive of) the gradient to find a local maximum of the function
Review: Gradient descent
- Minimize $J(\mathbf{w})$:
$$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J(\mathbf{w}^t)$$
$$\nabla_{\mathbf{w}} J(\mathbf{w}) = \left[ \frac{\partial J(\mathbf{w})}{\partial w_1}, \frac{\partial J(\mathbf{w})}{\partial w_2}, \dots, \frac{\partial J(\mathbf{w})}{\partial w_d} \right]^T$$
- If the step size (learning rate parameter) $\eta$ is small enough, then $J(\mathbf{w}^{t+1}) \le J(\mathbf{w}^t)$.
- $\eta$ can be allowed to change at every iteration as $\eta_t$.
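The update rule above can be sketched in a few lines. This is a minimal illustration, not from the slides; the quadratic objective and step size are illustrative choices:

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, n_iters=100):
    """Iterate w_{t+1} = w_t - eta * grad(w_t)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        w = w - eta * grad(w)
    return w

# Example: minimize J(w) = (w_0 - 3)^2 + (w_1 + 1)^2,
# whose gradient is [2(w_0 - 3), 2(w_1 + 1)] and whose minimum is (3, -1).
grad = lambda w: np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])
w_star = gradient_descent(grad, w0=[0.0, 0.0])
print(w_star)  # close to [3, -1]
```

With $\eta = 0.1$ each component contracts toward the minimum by a factor of 0.8 per step, so 100 iterations are more than enough here.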
Review: Gradient descent disadvantages
- Local minima problem
- However, when $J$ is convex, all local minima are also global minima ⇒ gradient descent can converge to the global solution.
[Figure: surface plot of $J(w_0, w_1)$]
Review: Problem of gradient descent with non-convex cost functions
[Figures: a non-convex surface $J(w_0, w_1)$; starting from different initial points, gradient descent reaches different local minima]
This example has been adapted from Prof. Ng's slides (ML Online Course, Stanford).
Gradient descent for SSE cost function
- Minimize $J(\mathbf{w})$: $\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J(\mathbf{w}^t)$
- $J(\mathbf{w})$: sum of squares error
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left( y^{(i)} - h(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$$
- Weight update rule for $h(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T \mathbf{x}$:
$$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \sum_{i=1}^{n} \left( y^{(i)} - \mathbf{w}^{t\,T} \mathbf{x}^{(i)} \right) \mathbf{x}^{(i)}$$
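The weight update rule above can be sketched directly. This is a minimal illustration; the toy data and learning rate are assumptions, not from the slides:

```python
import numpy as np

def fit_linear_sse(X, y, eta=0.01, n_iters=2000):
    """Batch gradient descent on the SSE cost J(w) = sum_i (y_i - w.x_i)^2.
    X includes a leading column of ones so that w[0] acts as the intercept."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residual = y - X @ w          # y^(i) - w^T x^(i) for all i at once
        w = w + eta * X.T @ residual  # w <- w + eta * sum_i residual_i * x^(i)
    return w

# Toy data generated from y = 1 + 2x (no noise), so GD should recover w ~ [1, 2].
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x
w_hat = fit_linear_sse(X, y)
print(w_hat)  # close to [1, 2]
```

Note the constant 2 from differentiating the square is absorbed into $\eta$, as on the slide.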
Gradient descent for SSE cost function
- Weight update rule for $h(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T \mathbf{x}$:
$$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \sum_{i=1}^{n} \left( y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)} \right) \mathbf{x}^{(i)}$$
- $\eta$ too small → gradient descent can be slow.
- $\eta$ too large → gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
- Batch mode: each step considers all training data.
[A sequence of slides animates gradient descent for $h(x; w_0, w_1) = w_0 + w_1 x$: on the left, the current fitted line over the housing data; on the right, contours of $J(w_0, w_1)$ as a function of the parameters $w_0, w_1$, with the iterates stepping toward the minimum.]
This example has been adapted from Prof. Ng's slides (ML Online Course, Stanford).
Linear Classifiers

Error-Driven Classification
Errors, and What to Do
- Examples of errors:

"Dear GlobalSCAPE Customer, GlobalSCAPE has partnered with ScanSoft to offer you the latest version of OmniPage Pro, for just $99.99* - the regular list price is $499! The most common question we've received about this offer is - Is this genuine? We would like to assure you that this offer is authorized by ScanSoft, is genuine and valid. You can get the . . ."

". . . To receive your $30 Amazon.com promotional certificate, click through to http://www.amazon.com/apparel and see the prominent link for the $30 offer. All details are there. We hope you enjoyed receiving this message. However, if you'd rather not receive future e-mails announcing new store launches, please click . . ."
What to Do About Errors
- Problem: there's still spam in your inbox
- Need more features – words aren't enough!
  - Have you emailed the sender before?
  - Have 1M other people just gotten the same email?
  - Is the sending information consistent?
  - Is the email in ALL CAPS?
  - Do inline URLs point where they say they point?
  - Does the email address you by (your) name?
- Naïve Bayes models can incorporate a variety of features, but tend to do best in homogeneous cases (e.g., all features are word occurrences)
Feature Vectors
- "Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just ..." → features: # free: 2, YOUR_NAME: 0, MISSPELLED: 2, FROM_FRIEND: 0, ... → class: SPAM (+)
- Handwritten digit image → features: PIXEL-7,12: 1, PIXEL-7,13: 0, ..., NUM_LOOPS: 1, ... → class: "2"
Weights
- Binary case: compare features to a weight vector to identify the class
- Learning: figure out the weight vector from examples
[Figure: a weight vector, e.g., (# free: 4, YOUR_NAME: -1, MISSPELLED: 1, FROM_FRIEND: -3, ...), dotted with feature vectors such as (# free: 2, YOUR_NAME: 0, MISSPELLED: 2, FROM_FRIEND: 0, ...) and (# free: 0, YOUR_NAME: 1, MISSPELLED: 1, FROM_FRIEND: 1, ...)]
- A positive dot product means the positive class
Linear Classifier example
- Two-class example:
  - Decision boundary: $-\frac{3}{4} x_1 - x_2 + 3 = 0$
  - Rule: if $\mathbf{w}^T \mathbf{x} + w_0 \ge 0$ then $\mathcal{C}_1$ else $\mathcal{C}_2$
  - Here $\mathbf{w} = \left[ -\frac{3}{4}, -1 \right]^T$, $w_0 = 3$
[Figure: the line $-\frac{3}{4} x_1 - x_2 + 3 = 0$ in the $(x_1, x_2)$ plane, separating regions $\mathcal{C}_1$ and $\mathcal{C}_2$]
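A quick sketch of this decision rule; the two test points below are my own, chosen to land on opposite sides of the line:

```python
import numpy as np

# Boundary -3/4 * x1 - x2 + 3 = 0, i.e. w = [-3/4, -1], w0 = 3.
w = np.array([-0.75, -1.0])
w0 = 3.0

def classify(x):
    """Return 'C1' if w^T x + w0 >= 0, else 'C2'."""
    return "C1" if w @ x + w0 >= 0 else "C2"

print(classify(np.array([0.0, 0.0])))  # C1: -0 - 0 + 3 = 3 >= 0
print(classify(np.array([4.0, 4.0])))  # C2: -3 - 4 + 3 = -4 < 0
```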
Binary Decision Rule
- In the space of feature vectors:
  - Examples are points
  - Any weight vector is a hyperplane
  - One side corresponds to Y = +1
  - Other corresponds to Y = -1
[Figure: weights (BIAS: -3, free: 4, money: 2, ...) define a line in the (free, money) plane; +1 = SPAM on one side, -1 = HAM on the other]
Distance between a point $\mathbf{x}^{(i)}$ and the plane
$$\text{distance} = \frac{\mathbf{w}^T \mathbf{x}^{(i)} + w_0}{\lVert \mathbf{w} \rVert}$$
Square error loss function for classification!
- Square error loss is not suitable for classification:
  - Least squares penalizes 'too correct' predictions (points that lie a long way on the correct side of the decision boundary)
  - Least squares also lacks robustness to noise
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left( \mathbf{w}^T \mathbf{x}^{(i)} + w_0 - y^{(i)} \right)^2$$
[Figure: two-class data ($C = 2$) where outliers far on the correct side pull the least-squares boundary away from a good separator]
Notation
- $\mathbf{w} = [w_0, w_1, \dots, w_d]^T$
- $\mathbf{x} = [1, x_1, \dots, x_d]^T$
- $w_0 + w_1 x_1 + \dots + w_d x_d = \mathbf{w}^T \mathbf{x}$
- We denote the input by $\mathbf{x}$ or by its features $\boldsymbol{\phi}(\mathbf{x})$
SSE cost function for classification
- Is it more suitable if we set $h(\mathbf{x}; \mathbf{w}) = \operatorname{sign}(\mathbf{w}^T \mathbf{x})$?
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left( \operatorname{sign}(\mathbf{w}^T \mathbf{x}^{(i)}) - y^{(i)} \right)^2$$
$$\operatorname{sign}(a) = \begin{cases} -1, & a < 0 \\ 1, & a \ge 0 \end{cases}$$
- $J(\mathbf{w})$ is a piecewise constant function, proportional to the number of misclassifications: the training error incurred in classifying the training samples.
[Figure: for $C = 2$ and $y = 1$, $\left( \operatorname{sign}(\mathbf{w}^T \mathbf{x}) - y \right)^2$ as a step function of $\mathbf{w}^T \mathbf{x}$]
Perceptron algorithm
- Linear classifier
- Two-class: $y \in \{-1, 1\}$
  - $y = -1$ for $\mathcal{C}_2$, $y = 1$ for $\mathcal{C}_1$
- Goal:
  - $\forall i, \mathbf{x}^{(i)} \in \mathcal{C}_1 \Rightarrow \mathbf{w}^T \mathbf{x}^{(i)} > 0$
  - $\forall i, \mathbf{x}^{(i)} \in \mathcal{C}_2 \Rightarrow \mathbf{w}^T \mathbf{x}^{(i)} < 0$
- $h(\mathbf{x}; \mathbf{w}) = \operatorname{sign}(\mathbf{w}^T \mathbf{x})$
Perceptron criterion
$$J_P(\mathbf{w}) = - \sum_{i \in \mathcal{M}} \mathbf{w}^T \mathbf{x}^{(i)} y^{(i)}$$
- $\mathcal{M}$: subset of training data that are misclassified
- Many solutions? Which solution among them?
Cost function
[Figure from Duda, Hart & Stork, 2002: surfaces over $(w_0, w_1)$ of the number of misclassifications $J(\mathbf{w})$ vs. the perceptron's cost function $J_P(\mathbf{w})$; there may be many solutions in these cost functions]
Batch Perceptron
- "Gradient descent" to solve the optimization problem:
$$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J_P(\mathbf{w}^t)$$
$$\nabla_{\mathbf{w}} J_P(\mathbf{w}) = - \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$$
- Batch perceptron converges in a finite number of steps for linearly separable data:

  Initialize w
  Repeat
      w = w + eta * sum over i in M of x^(i) y^(i)
  Until convergence
Stochastic gradient descent for Perceptron
- Single-sample perceptron:
  - If $\mathbf{x}^{(i)}$ is misclassified:
$$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \mathbf{x}^{(i)} y^{(i)}$$
- Perceptron convergence theorem: if the training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps.
- Fixed-increment single-sample perceptron ($\eta$ can be set to 1 and the proof still works):

  Initialize w, t <- 0
  Repeat
      t <- t + 1
      i <- t mod n
      if x^(i) is misclassified then w = w + x^(i) y^(i)
  Until all patterns properly classified
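The fixed-increment algorithm, run on the toy training set from the earlier example. A minimal sketch; each input is augmented with a leading 1 to absorb the bias term:

```python
import numpy as np

def perceptron(X, y, max_epochs=2000):
    """Fixed-increment single-sample perceptron (eta = 1).
    X: n x d inputs (a bias column of ones is prepended); y in {-1, +1}."""
    Xa = np.column_stack([np.ones(len(X)), X])  # augment with bias feature
    w = np.zeros(Xa.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xa, y):
            if yi * (w @ xi) <= 0:   # misclassified (or on the boundary)
                w = w + xi * yi      # fixed-increment update
                errors += 1
        if errors == 0:              # all patterns properly classified
            return w
    return w

X = np.array([[0.9, 2.3], [3.5, 2.6], [2.6, 3.3], [2.7, 4.1], [1.8, 3.9],
              [6.5, 6.8], [7.2, 7.5], [7.9, 8.3], [6.9, 8.3], [8.8, 7.9], [9.1, 6.2]])
y = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1])
w = perceptron(X, y)
preds = np.sign(np.column_stack([np.ones(len(X)), X]) @ w)
print(np.array_equal(preds, y))  # True: the data are separable, so it converges
```

Since this toy set is linearly separable, the convergence theorem guarantees termination well within the epoch budget.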
Weight Updates

Learning: Binary Perceptron
- Start with weights = 0
- For each training instance:
  - Classify with current weights
  - If correct (i.e., y = y*), no change!
  - If wrong: adjust the weight vector:
$$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \mathbf{x}^{(i)} y^{(i)}$$
Perceptron: Example
- Change $\mathbf{w}$ in a direction that corrects the error [Bishop]
Learning: Binary Perceptron
- Start with weights = 0
- For each training instance:
  - Classify with current weights
  - If correct (i.e., y = y*), no change!
  - If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is -1.
Examples: Perceptron
- Separable Case
Convergence of Perceptron
- For data sets that are not linearly separable, the single-sample perceptron learning algorithm will never converge [Duda, Hart & Stork, 2002]
Multiclass Decision Rule
- If we have multiple classes:
  - A weight vector for each class: $\mathbf{w}_k$
  - Score (activation) of a class $k$: $\mathbf{w}_k^T \mathbf{x}$
  - Prediction: the class with the highest score wins
- Binary = multiclass where the negative class has weight zero
Learning: Multiclass Perceptron
- Start with all weights = 0
- Pick up training examples one by one
- Predict with current weights
- If correct, no change!
- If wrong: lower the score of the wrong answer, raise the score of the right answer
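A minimal sketch of the multiclass perceptron updates just described; the three-cluster toy data are an illustrative assumption:

```python
import numpy as np

def multiclass_perceptron(X, y, n_classes, n_epochs=10):
    """One weight vector per class; on a mistake, lower the wrong class's
    weights and raise the right class's weights by the feature vector."""
    W = np.zeros((n_classes, X.shape[1]))
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            pred = int(np.argmax(W @ xi))  # highest score wins
            if pred != yi:
                W[pred] -= xi              # lower score of the wrong answer
                W[yi] += xi                # raise score of the right answer
    return W

# Three well-separated clusters in 2D (illustrative data).
X = np.array([[2.0, 0.1], [2.2, -0.2], [0.0, 2.1],
              [-0.1, 2.3], [-2.0, -2.2], [-2.1, -1.9]])
y = np.array([0, 0, 1, 1, 2, 2])
W = multiclass_perceptron(X, y, n_classes=3)
print((np.argmax(X @ W.T, axis=1) == y).all())  # True
```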
Example: Multiclass Perceptron
[Figure: three weight vectors, one per class, over features BIAS, win, game, vote, the, ..., all initialized to 0; training sentences: "win the vote", "win the election", "win the game"]
Properties of Perceptrons
- Separability: true if some parameters get the training set perfectly classified
- Convergence: if the training set is separable, perceptron will eventually converge (binary case)
- Mistake Bound: the maximum number of mistakes (binary case) is related to the margin or degree of separability
[Figures: a separable and a non-separable point set]
Examples: Perceptron
- Non-Separable Case
Discriminative approach: logistic regression
$$h(\mathbf{x}; \mathbf{w}) = \sigma(\mathbf{w}^T \mathbf{x})$$
- $\sigma(\cdot)$ is an activation function
- Sigmoid (logistic) activation function:
$$\sigma(a) = \frac{1}{1 + e^{-a}}$$
- Two classes ($C = 2$); $\mathbf{x} = [1, x_1, \dots, x_d]^T$, $\mathbf{w} = [w_0, w_1, \dots, w_d]^T$
Logistic regression: cost function
$$\hat{\mathbf{w}} = \operatorname{argmin}_{\mathbf{w}} J(\mathbf{w})$$
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left[ -y^{(i)} \log \sigma(\mathbf{w}^T \mathbf{x}^{(i)}) - (1 - y^{(i)}) \log\left( 1 - \sigma(\mathbf{w}^T \mathbf{x}^{(i)}) \right) \right]$$
- $J(\mathbf{w})$ is convex w.r.t. the parameters.
Logistic regression: loss function
$$\operatorname{Loss}\left( y, h(\mathbf{x}; \mathbf{w}) \right) = -y \log \sigma(\mathbf{w}^T \mathbf{x}) - (1 - y) \log\left( 1 - \sigma(\mathbf{w}^T \mathbf{x}) \right), \qquad \sigma(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}$$
- Since $y = 1$ or $y = 0$:
$$\operatorname{Loss}\left( y, \sigma(\mathbf{w}^T \mathbf{x}) \right) = \begin{cases} -\log \sigma(\mathbf{w}^T \mathbf{x}) & \text{if } y = 1 \\ -\log\left( 1 - \sigma(\mathbf{w}^T \mathbf{x}) \right) & \text{if } y = 0 \end{cases}$$
- How is it related to the zero-one loss? $\operatorname{Loss}(y, \hat{y}) = \begin{cases} 1 & y \ne \hat{y} \\ 0 & y = \hat{y} \end{cases}$
Logistic regression: Gradient descent
$$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J(\mathbf{w}^t)$$
$$\nabla_{\mathbf{w}} J(\mathbf{w}) = \sum_{i=1}^{n} \left( \sigma(\mathbf{w}^T \mathbf{x}^{(i)}) - y^{(i)} \right) \mathbf{x}^{(i)}$$
- Is it similar to the gradient of SSE for linear regression?
$$\nabla_{\mathbf{w}} J(\mathbf{w}) = \sum_{i=1}^{n} \left( \mathbf{w}^T \mathbf{x}^{(i)} - y^{(i)} \right) \mathbf{x}^{(i)}$$
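The gradient update above, sketched on a toy 1-D problem; the data and learning rate are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, eta=0.1, n_iters=5000):
    """Batch gradient descent on the logistic-regression cost:
    w <- w - eta * sum_i (sigmoid(w.x_i) - y_i) x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y)
        w = w - eta * grad
    return w

# 1-D data with labels in {0, 1}, separable around x = 0.
x = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
X = np.column_stack([np.ones_like(x), x])      # prepend the bias feature
y = np.array([0, 0, 0, 1, 1, 1])
w = fit_logistic(X, y)
probs = sigmoid(X @ w)
print(np.all((probs > 0.5) == (y == 1)))  # True: all points classified correctly
```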
Multi-class logistic regression
- $\mathbf{h}(\mathbf{x}; \mathbf{W}) = \left[ P(\mathcal{C}_1 \mid \mathbf{x}, \mathbf{W}), \dots, P(\mathcal{C}_K \mid \mathbf{x}, \mathbf{W}) \right]^T$
- $\mathbf{W} = [\mathbf{w}_1 \cdots \mathbf{w}_K]$ contains one vector of parameters for each class
- Softmax:
$$P(\mathcal{C}_k \mid \mathbf{x}, \mathbf{W}) = \frac{\exp(\mathbf{w}_k^T \mathbf{x})}{\sum_{j=1}^{K} \exp(\mathbf{w}_j^T \mathbf{x})}$$
Logistic regression: multi-class
$$\hat{\mathbf{W}} = \operatorname{argmin}_{\mathbf{W}} J(\mathbf{W})$$
$$J(\mathbf{W}) = - \sum_{i=1}^{n} \sum_{k=1}^{K} y_k^{(i)} \log P(\mathcal{C}_k \mid \mathbf{x}^{(i)}, \mathbf{W})$$
- $\mathbf{W} = [\mathbf{w}_1 \cdots \mathbf{w}_K]$
- $\mathbf{y}$ is a vector of length $K$ (1-of-K coding), e.g., $\mathbf{y} = [0, 0, 1, 0]^T$ when the target class is $\mathcal{C}_3$
Logistic regression: multi-class
$$\mathbf{w}_j^{t+1} = \mathbf{w}_j^t - \eta \nabla_{\mathbf{w}_j} J(\mathbf{W}^t)$$
$$\nabla_{\mathbf{w}_j} J(\mathbf{W}) = \sum_{i=1}^{n} \left( P(\mathcal{C}_j \mid \mathbf{x}^{(i)}, \mathbf{W}) - y_j^{(i)} \right) \mathbf{x}^{(i)}$$
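The multi-class gradient above as a short sketch; the toy data and step size are illustrative assumptions:

```python
import numpy as np

def softmax(S):
    """Row-wise softmax of a score matrix S (n x K)."""
    E = np.exp(S - S.max(axis=1, keepdims=True))  # shift rows for stability
    return E / E.sum(axis=1, keepdims=True)

def fit_softmax(X, Y, eta=0.05, n_iters=2000):
    """Batch GD on the multi-class cross-entropy; Y is one-hot (n x K).
    The gradient w.r.t. w_j is sum_i (P(C_j | x_i) - Y_ij) x_i."""
    d, K = X.shape[1], Y.shape[1]
    W = np.zeros((d, K))               # one column of parameters per class
    for _ in range(n_iters):
        P = softmax(X @ W)             # n x K class probabilities
        W = W - eta * X.T @ (P - Y)    # all K gradients at once
    return W

X = np.array([[1, 2.0, 0.1], [1, 2.2, -0.2],      # class 0 (leading 1 = bias)
              [1, 0.0, 2.1], [1, -0.1, 2.3],      # class 1
              [1, -2.0, -2.2], [1, -2.1, -1.9]])  # class 2
labels = np.array([0, 0, 1, 1, 2, 2])
Y = np.eye(3)[labels]
W = fit_softmax(X, Y)
print(np.all(np.argmax(X @ W, axis=1) == labels))  # True
```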
Multi-class classifier
- $\mathbf{h}(\mathbf{x}; \mathbf{W}) = \left[ h_1(\mathbf{x}; \mathbf{W}), \dots, h_K(\mathbf{x}; \mathbf{W}) \right]^T$
- $\mathbf{W} = [\mathbf{w}_1 \cdots \mathbf{w}_K]$ contains one vector of parameters for each class
- In linear classifiers, $\mathbf{W}$ is $d \times K$, where $d$ is the number of features
- $\mathbf{W}^T \mathbf{x}$ provides us a vector
- $\mathbf{h}(\mathbf{x}; \mathbf{W})$ contains $K$ numbers giving class scores for the input $\mathbf{x}$
Example
- Output obtained from $\mathbf{W}^T \mathbf{x} + \mathbf{b}$
- $\mathbf{x} = [x_1, \dots, x_{784}]^T$ (a $28 \times 28$ image, flattened), $\mathbf{W}^T$ is $10 \times 784$ (one row $\mathbf{w}_k$ per class), $\mathbf{b} = [b_1, \dots, b_{10}]^T$
This slide has been adapted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford, 2017.
Example
- How can we tell whether this $\mathbf{W}$ and $\mathbf{b}$ is good or bad?
- The bias can also be included in the $\mathbf{W}$ matrix (by appending a constant 1 feature to $\mathbf{x}$).
This slide has been adapted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford, 2017.
Softmax classifier loss: example
$$L^{(i)} = -\log \frac{e^{s_{y^{(i)}}}}{\sum_{j} e^{s_j}}$$
$$L^{(1)} = -\log 0.13 = 0.89$$
This slide has been adapted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford, 2017.
Support Vector Machines
- Maximizing the margin: good according to intuition, theory, practice
- Support vector machines (SVMs) find the separator with max margin

Hard-margin SVM: Optimization problem
$$\max_{\mathbf{w}, w_0} \frac{2}{\lVert \mathbf{w} \rVert}$$
$$\text{s.t.} \quad \mathbf{w}^T \mathbf{x}^{(i)} + w_0 \ge 1 \quad \forall i: y^{(i)} = 1$$
$$\qquad \mathbf{w}^T \mathbf{x}^{(i)} + w_0 \le -1 \quad \forall i: y^{(i)} = -1$$
- Margin: $\frac{2}{\lVert \mathbf{w} \rVert}$
[Figure: the lines $\mathbf{w}^T \mathbf{x} + w_0 = 0, 1, -1$ in the $(x_1, x_2)$ plane; the margin $\frac{2}{\lVert \mathbf{w} \rVert}$ separates the two outer lines]
Distance between a point $\mathbf{x}^{(i)}$ and the plane
$$\text{distance} = \frac{\mathbf{w}^T \mathbf{x}^{(i)} + w_0}{\lVert \mathbf{w} \rVert}$$
Hard-margin SVM: Optimization problem
- We can equivalently optimize:
$$\min_{\mathbf{w}, w_0} \frac{1}{2} \mathbf{w}^T \mathbf{w}$$
$$\text{s.t.} \quad y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + w_0 \right) \ge 1, \quad i = 1, \dots, n$$
- It is a convex Quadratic Programming (QP) problem
  - There are computationally efficient packages to solve it.
  - It has a global minimum (if any).
Error measure
- Margin violation amount $\xi_i$ ($\xi_i \ge 0$):
$$y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + w_0 \right) \ge 1 - \xi_i$$
- Total violation: $\sum_{i=1}^{n} \xi_i$
Soft-margin SVM: Optimization problem
- SVM with slack variables: allows samples to fall within the margin, but penalizes them
$$\min_{\mathbf{w}, w_0, \{\xi_i\}_{i=1}^{n}} \frac{1}{2} \lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \xi_i$$
$$\text{s.t.} \quad y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + w_0 \right) \ge 1 - \xi_i, \quad i = 1, \dots, n; \qquad \xi_i \ge 0$$
- $\xi_i$: slack variables
  - $0 < \xi_i < 1$: $\mathbf{x}^{(i)}$ is correctly classified but inside the margin
  - $\xi_i > 1$: $\mathbf{x}^{(i)}$ is misclassified
[Figure: points inside the margin ($\xi < 1$) and misclassified points ($\xi > 1$)]
Soft-margin SVM: Cost function
$$\min_{\mathbf{w}, w_0, \{\xi_i\}_{i=1}^{n}} \frac{1}{2} \lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \xi_i$$
$$\text{s.t.} \quad y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + w_0 \right) \ge 1 - \xi_i, \quad i = 1, \dots, n; \qquad \xi_i \ge 0$$
- It is equivalent to the unconstrained optimization problem:
$$\min_{\mathbf{w}, w_0} \frac{1}{2} \lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \max\left( 0, 1 - y^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + w_0) \right)$$
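The unconstrained hinge-loss form lends itself to (sub)gradient descent. A minimal sketch; the data, C, learning rate, and iteration count are illustrative choices, not from the slides:

```python
import numpy as np

def fit_svm_subgradient(X, y, C=1.0, eta=0.01, n_iters=2000):
    """Subgradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + w0)).
    Only margin-violating samples (y_i (w.x_i + w0) < 1) contribute to the sum."""
    w = np.zeros(X.shape[1])
    w0 = 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + w0)
        viol = margins < 1                       # samples with nonzero hinge loss
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_w0 = -C * y[viol].sum()
        w, w0 = w - eta * grad_w, w0 - eta * grad_w0
    return w, w0

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, w0 = fit_svm_subgradient(X, y)
print(np.all(np.sign(X @ w + w0) == y))  # True: training points separated
```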
Multi-class SVM
$$J(\mathbf{W}) = \frac{1}{N} \sum_{i=1}^{N} L^{(i)} + \lambda R(\mathbf{W})$$
- Hinge loss:
$$L^{(i)} = \sum_{j \ne y^{(i)}} \max\left( 0, 1 + s_j - s_{y^{(i)}} \right) = \sum_{j \ne y^{(i)}} \max\left( 0, 1 + \mathbf{w}_j^T \mathbf{x}^{(i)} - \mathbf{w}_{y^{(i)}}^T \mathbf{x}^{(i)} \right)$$
  where $s_j \equiv h_j(\mathbf{x}^{(i)}; \mathbf{W}) = \mathbf{w}_j^T \mathbf{x}^{(i)}$
- L2 regularization:
$$R(\mathbf{W}) = \sum_{k} \sum_{j} W_{k,j}^2$$
Multi-class SVM loss: Example
- 3 training examples, 3 classes. With some $\mathbf{W}$, the scores $\mathbf{W}^T \mathbf{x}$ are:
  - Example 1: correct class scores 3.2; the other classes score 5.1 and -1.7
  - Example 2: correct class scores 4.9; the other classes score 1.3 and 2.0
  - Example 3: correct class scores -3.1; the other classes score 2.2 and 2.5
$$L^{(i)} = \sum_{j \ne y^{(i)}} \max\left( 0, 1 + s_j - s_{y^{(i)}} \right), \qquad s_j = \mathbf{w}_j^T \mathbf{x}^{(i)}$$
$$L^{(1)} = \max(0, 1 + 5.1 - 3.2) + \max(0, 1 - 1.7 - 3.2) = \max(0, 2.9) + \max(0, -3.9) = 2.9 + 0 = 2.9$$
$$L^{(2)} = \max(0, 1 + 1.3 - 4.9) + \max(0, 1 + 2 - 4.9) = \max(0, -2.6) + \max(0, -1.9) = 0 + 0 = 0$$
$$L^{(3)} = \max(0, 2.2 - (-3.1) + 1) + \max(0, 2.5 - (-3.1) + 1) = \max(0, 6.3) + \max(0, 6.6) = 6.3 + 6.6 = 12.9$$
$$\frac{1}{N} \sum_{i=1}^{N} L^{(i)} = \frac{1}{3} (2.9 + 0 + 12.9) \approx 5.27$$
This slide has been adapted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford, 2017.
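The worked example above can be checked numerically (scores taken from the slide; the correct-class index is given per example):

```python
import numpy as np

def multiclass_hinge(scores, correct):
    """L_i = sum over j != correct of max(0, 1 + s_j - s_correct)."""
    margins = np.maximum(0.0, 1.0 + scores - scores[correct])
    margins[correct] = 0.0            # the correct class is excluded from the sum
    return margins.sum()

# (score vector, index of the correct class) per training example.
examples = [(np.array([3.2, 5.1, -1.7]), 0),
            (np.array([1.3, 4.9, 2.0]), 1),
            (np.array([2.2, 2.5, -3.1]), 2)]
losses = [multiclass_hinge(s, c) for s, c in examples]
print(losses)                     # [2.9, 0.0, 12.9] (up to float rounding)
print(sum(losses) / len(losses))  # ~5.27
```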
Recap
- We need $\nabla_{\mathbf{W}} L$ to update the weights
- L2 regularization: $R(\mathbf{W}) = \sum_{k} \sum_{j} W_{k,j}^2$
- L1 regularization: $R(\mathbf{W}) = \sum_{k} \sum_{j} |W_{k,j}|$
This slide has been adapted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford, 2017.
Generalized linear
- Linear combination of fixed non-linear functions of the input vector:
$$f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 \phi_1(\mathbf{x}) + \dots + w_m \phi_m(\mathbf{x})$$
- $\{\phi_1(\mathbf{x}), \dots, \phi_m(\mathbf{x})\}$: set of basis functions (or features), $\phi_j: \mathbb{R}^d \to \mathbb{R}$
Basis functions: examples
- Linear
- Polynomial (univariate)

Polynomial regression: example
[Figures: polynomial fits of degree $m = 1, 3, 5, 7$]
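Polynomial regression is ordinary linear regression on the basis $\phi_j(x) = x^j$. A minimal sketch; the noisy sine data are an illustrative assumption:

```python
import numpy as np

def poly_design(x, m):
    """Design matrix with columns x^0, x^1, ..., x^m."""
    return np.vander(x, m + 1, increasing=True)

def fit_poly(x, y, m):
    """Least-squares fit of a degree-m polynomial (linear in the weights w)."""
    Phi = poly_design(x, m)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)
w3 = fit_poly(x, y, m=3)
resid = y - poly_design(x, 3) @ w3
print(resid @ resid)  # small residual: a cubic tracks one period of a sine well
```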
Generalized linear classifier
- Assume a transformation $\boldsymbol{\phi}: \mathbb{R}^d \to \mathbb{R}^m$ on the feature space: $\mathbf{x} \to \boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}), \dots, \phi_m(\mathbf{x})]^T$
- Find a hyperplane in the transformed feature space: $\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + w_0 = 0$
[Figure: data that are not linearly separable in the $(x_1, x_2)$ plane become separable in the $(\phi_1(\mathbf{x}), \phi_2(\mathbf{x}))$ plane]
Model complexity and overfitting
- With limited training data, models may achieve zero training error but a large test error.
- Over-fitting: when the training loss no longer bears any relation to the test (generalization) loss.
  - Fails to generalize to unseen examples.
- Training (empirical) loss:
$$\frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \boldsymbol{\theta}) \right)^2 \approx 0$$
- Expected (true) loss:
$$\mathbb{E}_{\mathbf{x}, y} \left[ \left( y - f(\mathbf{x}; \boldsymbol{\theta}) \right)^2 \right] \gg 0$$
Polynomial regression
[Figures from Bishop: fits of degree $m = 0, 1, 3, 9$ to the same data]
Over-fitting causes
- Model complexity
  - e.g., a model with a large number of parameters (degrees of freedom)
- Low number of training data
  - small data size compared to the complexity of the model
Model complexity
- Example: polynomials with larger $m$ are becoming increasingly tuned to the random noise on the target values.
[Figures from Bishop: $m = 0, 1, 3, 9$]
Number of training data & overfitting
- The over-fitting problem becomes less severe as the size of training data increases.
[Figures from Bishop: $m = 9$ fits with $n = 15$ vs. $n = 100$ samples]
How to evaluate the learner's performance?
- Generalization error: the true (or expected) error that we would like to optimize
- Two ways to assess the generalization error:
  - Practical: use a separate data set to test the model
  - Theoretical: Law of Large Numbers
    - statistical bounds on the difference between training and expected errors
Avoiding over-fitting
- Determine a suitable value for model complexity (model selection)
  - Simple hold-out method
  - Cross-validation
- Regularization (Occam's Razor)
  - Explicit preference towards simple models
  - Penalize model complexity in the objective function
Model Selection
- The learning algorithm defines the data-driven search over the hypothesis space (i.e., the search for good parameters)
- Hyperparameters are the tunable aspects of the model that the learning algorithm does not select
This slide has been adapted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/

Model Selection
- Model selection is the process by which we choose the "best" model from among a set of candidates
  - assumes access to a function capable of measuring the quality of a model
  - typically done "outside" the main training algorithm
- Model selection / hyperparameter optimization is just another form of learning
This slide has been adapted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/
Simple hold-out: model selection
- Steps:
  - Divide the training data into a training set and a validation set (v_set)
  - Use only the training set to train a set of models
  - Evaluate each learned model on the validation set:
$$J_v(\mathbf{w}) = \frac{1}{|v\_set|} \sum_{i \in v\_set} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$$
  - Choose the best model based on the validation set error
- Usually too wasteful of valuable training data
  - Training data may be limited.
  - On the other hand, a small validation set gives a relatively noisy estimate of performance.
Simple hold-out: training, validation, and test sets
- Simple hold-out chooses the model that minimizes error on the validation set.
- $J_v(\hat{\mathbf{w}})$ is likely to be an optimistic estimate of the generalization error.
  - An extra parameter (e.g., the degree of the polynomial) is fit to this set.
- Estimate the generalization error on the test set
  - the performance of the selected model is finally evaluated on the test set
[Figure: data split into Training | Validation | Test]
Cross-Validation (CV): Evaluation
- $k$-fold cross-validation steps:
  - Shuffle the dataset and randomly partition the training data into $k$ groups of approximately equal size
  - for $j = 1$ to $k$:
    - Choose the $j$-th group as the held-out validation group
    - Train the model on all but the $j$-th group of data
    - Evaluate the model on the held-out group
  - Performance scores of the model from the $k$ runs are averaged.
  - The average error rate can be considered as an estimate of the true performance.
[Figure: the $k$ runs (first run, second run, ..., $(k-1)$-th run, $k$-th run), each holding out a different fold]
Cross-Validation (CV): Model Selection
- For each model, we first find the average error by CV.
- The model with the best average performance is selected.
Cross-validation: polynomial regression example
- 5-fold CV, 100 runs, averaged
- $m = 1$: CV MSE = 0.30; $m = 3$: CV MSE = 1.45; $m = 5$: CV MSE = 45.44; $m = 7$: CV MSE = 31759
Regularization
- Add a penalty term to the cost function to discourage the coefficients from reaching large values.
- Ridge regression (weight decay):
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left( y^{(i)} - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}^{(i)}) \right)^2 + \lambda \mathbf{w}^T \mathbf{w}$$
$$\hat{\mathbf{w}} = \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{I} \right)^{-1} \boldsymbol{\Phi}^T \mathbf{y}$$
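The closed form above in a few lines. The toy polynomial data are illustrative; the two λ values just demonstrate the shrinkage effect:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution w = (Phi^T Phi + lam I)^(-1) Phi^T y,
    computed via solve() rather than forming an explicit inverse."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

x = np.linspace(0, 1, 15)
Phi = np.vander(x, 10, increasing=True)   # degree-9 polynomial basis
y = np.sin(2 * np.pi * x)
w_unreg = ridge_fit(Phi, y, lam=1e-12)    # lambda ~ 0: large coefficients
w_ridge = ridge_fit(Phi, y, lam=1e-3)     # weight decay shrinks them
print(np.linalg.norm(w_unreg) > np.linalg.norm(w_ridge))  # True
```

The norm of the ridge solution is monotonically non-increasing in λ, so the comparison holds for any such pair of λ values.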
Polynomial order
- Polynomials with larger $m$ are becoming increasingly tuned to the random noise on the target values.
- The magnitude of the coefficients typically gets larger as $m$ increases. [Bishop]
Regularization parameter
[Table from Bishop: coefficients $\hat{w}_0, \dots, \hat{w}_9$ of the $m = 9$ fit for $\ln \lambda = -\infty$ vs. $\ln \lambda = -18$]

Regularization parameter
- Generalization: $\lambda$ now controls the effective complexity of the model and hence determines the degree of over-fitting. [Bishop]
Choosing the regularization parameter
- Consider a set of models with different values of $\lambda$.
- Find $\hat{\mathbf{w}}$ for each model based on the training data.
- Find $J_v(\hat{\mathbf{w}})$ (or $J_{cv}(\hat{\mathbf{w}})$) for each model:
$$J_v(\mathbf{w}) = \frac{1}{n_v} \sum_{i \in v\_set} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$$
- Select the model with the best $J_v(\hat{\mathbf{w}})$ (or $J_{cv}(\hat{\mathbf{w}})$).
The approximation-generalization trade-off
- A small true error shows good approximation of $g$ out of sample
- More complex $\mathcal{H}$ ⇒ better chance of approximating $g$
- Less complex $\mathcal{H}$ ⇒ better chance of generalizing out of sample
Complexity of Hypothesis Space: Example
[Figures: price vs. size fit with $w_0 + w_1 x$ (less complex $\mathcal{H}$, underfitting), $w_0 + w_1 x + w_2 x^2$, and $w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$ (more complex $\mathcal{H}$, overfitting)]
This example has been adapted from Prof. Andrew Ng's slides.
Complexity of Hypothesis Space: Example
[Figure: training error $J_{train}$ and validation error $J_v$ vs. degree of polynomial $m$]
$$J_v(\mathbf{w}) = \frac{1}{n_v} \sum_{i \in v\_set} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$$
$$J_{train}(\mathbf{w}) = \frac{1}{n_{train}} \sum_{i \in train\_set} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$$
Complexity of Hypothesis Space
- Less complex $\mathcal{H}$: $J_{train}(\hat{\mathbf{w}}) \approx J_v(\hat{\mathbf{w}})$, and $J_{train}(\hat{\mathbf{w}})$ is very high
- More complex $\mathcal{H}$: $J_{train}(\hat{\mathbf{w}}) \ll J_v(\hat{\mathbf{w}})$, and $J_{train}(\hat{\mathbf{w}})$ is low
[Figure: $J_{train}(\hat{\mathbf{w}})$ and $J_v(\hat{\mathbf{w}})$ vs. degree of polynomial $m$]
Regularization: Example
$$f(x; \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$$
$$J(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f(x^{(i)}; \mathbf{w}) \right)^2 + \lambda \mathbf{w}^T \mathbf{w}$$
[Figures: price vs. size fits under three settings]
- Large $\lambda$ (prefer simpler models): drives the coefficients toward zero, underfitting
- Intermediate $\lambda$: a good fit
- Small $\lambda$ ($\lambda = 0$: prefer more complex models): overfitting
This example has been adapted from Prof. Andrew Ng's slides.
Lesson
- Match the model complexity to the data resources, not to the complexity of the target function.

Resources