Learning: Linear Methods
CE417: Introduction to Artificial Intelligence
Sharif University of Technology, Spring 2019
Soleymani
Some slides are based on Klein and Abbeel, CS188, UC Berkeley.
Paradigms of ML
- Supervised learning (regression, classification)
  - predicting a target variable for which we get to see examples
- Unsupervised learning
  - revealing structure in the observed unlabeled data
- Reinforcement learning
  - partial (indirect) feedback, no explicit guidance
  - given rewards for a sequence of moves, learn a policy and utility functions
Components of (Supervised) Learning
- Unknown target function: $g: \mathcal{X} \to \mathcal{Y}$
  - Input space: $\mathcal{X}$
  - Output space: $\mathcal{Y}$
- Training data: $(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \dots, (\mathbf{x}^{(n)}, y^{(n)})$
- Pick a formula $h: \mathcal{X} \to \mathcal{Y}$ that approximates the target function $g$
  - selected from a set of hypotheses $\mathcal{H}$
Supervised Learning: Regression vs. Classification
- Regression: predict a continuous target variable
  - e.g., $y \in [0, 1]$
- Classification: predict a discrete target variable
  - e.g., $y \in \{1, 2, \dots, C\}$
Regression: Example
- Housing price prediction
[Figure: scatter plot of price ($ in 1000's) vs. size in feet²]
Figure adapted from slides of Andrew Ng
Training data: Example

| $x_1$ | $x_2$ | $y$ |
|-------|-------|-----|
| 0.9   | 2.3   |  1  |
| 3.5   | 2.6   |  1  |
| 2.6   | 3.3   |  1  |
| 2.7   | 4.1   |  1  |
| 1.8   | 3.9   |  1  |
| 6.5   | 6.8   | -1  |
| 7.2   | 7.5   | -1  |
| 7.9   | 8.3   | -1  |
| 6.9   | 8.3   | -1  |
| 8.8   | 7.9   | -1  |
| 9.1   | 6.2   | -1  |
Classification: Example
- Weight → class (Cat, Dog)
[Figure: weight on the horizontal axis; target labels 1 (Dog) and 0 (Cat)]
Linear regression
- Hypothesis: $h(x; \mathbf{w}) = w_0 + w_1 x$
- Cost function:
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left( y^{(i)} - h(x^{(i)}; \mathbf{w}) \right)^2 = \sum_{i=1}^{n} \left( y^{(i)} - w_0 - w_1 x^{(i)} \right)^2$$
[Figure: price vs. size scatter with candidate fitted lines]
Cost function
[Figure: left, price ($ in 1000's) vs. size in feet² with a candidate fit; right, surface plot of $J(\mathbf{w})$ as a function of the parameters $w_0, w_1$]
This example has been adapted from Prof. Andrew Ng's slides.
Review: Iterative optimization of cost function
- Cost function: $J(\mathbf{w})$
- Optimization problem: $\hat{\mathbf{w}} = \operatorname{argmin}_{\mathbf{w}} J(\mathbf{w})$
- Steps:
  - Start from $\mathbf{w}^0$
  - Repeat
    - Update $\mathbf{w}^t$ to $\mathbf{w}^{t+1}$ in order to reduce $J$
    - $t \leftarrow t + 1$
  - until we hopefully end up at a minimum
Review: Gradient descent
- First-order optimization algorithm to find $\mathbf{w}^* = \operatorname{argmin}_{\mathbf{w}} J(\mathbf{w})$
- Also known as "steepest descent"
- In each step, takes steps proportional to the negative of the gradient vector of the function at the current point $\mathbf{w}^t$:
$$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta_t \nabla J(\mathbf{w}^t)$$
  - $J(\mathbf{w})$ decreases fastest if one goes from $\mathbf{w}^t$ in the direction of $-\nabla J(\mathbf{w}^t)$
  - Assumption: $J(\mathbf{w})$ is defined and differentiable in a neighborhood of the point $\mathbf{w}^t$
- Gradient ascent takes steps proportional to (the positive of) the gradient to find a local maximum of the function
Review: Gradient descent
- Minimize $J(\mathbf{w})$:
$$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J(\mathbf{w}^t)$$
$$\nabla_{\mathbf{w}} J(\mathbf{w}) = \left[ \frac{\partial J(\mathbf{w})}{\partial w_1}, \frac{\partial J(\mathbf{w})}{\partial w_2}, \dots, \frac{\partial J(\mathbf{w})}{\partial w_d} \right]^T$$
- If the step size (learning rate parameter) $\eta$ is small enough, then $J(\mathbf{w}^{t+1}) \le J(\mathbf{w}^t)$.
- $\eta$ can be allowed to change at every iteration as $\eta_t$.
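The update rule above can be sketched in a few lines. This is a minimal illustration, not from the slides; the quadratic objective and step size are illustrative choices:

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, n_iters=100):
    """Iterate w_{t+1} = w_t - eta * grad(w_t)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iters):
        w = w - eta * grad(w)
    return w

# Example: minimize J(w) = (w_0 - 3)^2 + (w_1 + 1)^2,
# whose gradient is [2(w_0 - 3), 2(w_1 + 1)] and whose minimum is (3, -1).
grad = lambda w: np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])
w_star = gradient_descent(grad, w0=[0.0, 0.0])
print(w_star)  # close to [3, -1]
```

With $\eta = 0.1$ each component contracts toward the minimum by a factor of 0.8 per step, so 100 iterations are more than enough here.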
Review: Gradient descent disadvantages
- Local minima problem
- However, when $J$ is convex, all local minima are also global minima ⇒ gradient descent can converge to the global solution.
[Figure: surface plot of $J(w_0, w_1)$]
Review: Problem of gradient descent with non-convex cost functions
[Figures: a non-convex surface $J(w_0, w_1)$; starting from different initial points, gradient descent reaches different local minima]
This example has been adapted from Prof. Ng's slides (ML Online Course, Stanford).
Gradient descent for SSE cost function
- Minimize $J(\mathbf{w})$: $\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J(\mathbf{w}^t)$
- $J(\mathbf{w})$: sum of squares error
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left( y^{(i)} - h(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$$
- Weight update rule for $h(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T \mathbf{x}$:
$$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \sum_{i=1}^{n} \left( y^{(i)} - \mathbf{w}^{t\,T} \mathbf{x}^{(i)} \right) \mathbf{x}^{(i)}$$
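The weight update rule above can be sketched directly. This is a minimal illustration; the toy data and learning rate are assumptions, not from the slides:

```python
import numpy as np

def fit_linear_sse(X, y, eta=0.01, n_iters=2000):
    """Batch gradient descent on the SSE cost J(w) = sum_i (y_i - w.x_i)^2.
    X includes a leading column of ones so that w[0] acts as the intercept."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        residual = y - X @ w          # y^(i) - w^T x^(i) for all i at once
        w = w + eta * X.T @ residual  # w <- w + eta * sum_i residual_i * x^(i)
    return w

# Toy data generated from y = 1 + 2x (no noise), so GD should recover w ~ [1, 2].
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x
w_hat = fit_linear_sse(X, y)
print(w_hat)  # close to [1, 2]
```

Note the constant 2 from differentiating the square is absorbed into $\eta$, as on the slide.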
Gradient descent for SSE cost function
- Weight update rule for $h(\mathbf{x}; \mathbf{w}) = \mathbf{w}^T \mathbf{x}$:
$$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \sum_{i=1}^{n} \left( y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)} \right) \mathbf{x}^{(i)}$$
- $\eta$ too small → gradient descent can be slow.
- $\eta$ too large → gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
- Batch mode: each step considers all training data.
[A sequence of slides animates gradient descent for $h(x; w_0, w_1) = w_0 + w_1 x$: on the left, the current fitted line over the housing data; on the right, contours of $J(w_0, w_1)$ as a function of the parameters $w_0, w_1$, with the iterates stepping toward the minimum.]
This example has been adapted from Prof. Ng's slides (ML Online Course, Stanford).
Linear Classifiers

Error-Driven Classification
Errors, and What to Do
- Examples of errors:

"Dear GlobalSCAPE Customer, GlobalSCAPE has partnered with ScanSoft to offer you the latest version of OmniPage Pro, for just $99.99* - the regular list price is $499! The most common question we've received about this offer is - Is this genuine? We would like to assure you that this offer is authorized by ScanSoft, is genuine and valid. You can get the . . ."

". . . To receive your $30 Amazon.com promotional certificate, click through to http://www.amazon.com/apparel and see the prominent link for the $30 offer. All details are there. We hope you enjoyed receiving this message. However, if you'd rather not receive future e-mails announcing new store launches, please click . . ."
What to Do About Errors
- Problem: there's still spam in your inbox
- Need more features – words aren't enough!
  - Have you emailed the sender before?
  - Have 1M other people just gotten the same email?
  - Is the sending information consistent?
  - Is the email in ALL CAPS?
  - Do inline URLs point where they say they point?
  - Does the email address you by (your) name?
- Naïve Bayes models can incorporate a variety of features, but tend to do best in homogeneous cases (e.g., all features are word occurrences)
Feature Vectors
- "Hello, Do you want free printr cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just ..." → features: # free: 2, YOUR_NAME: 0, MISSPELLED: 2, FROM_FRIEND: 0, ... → class: SPAM (+)
- Handwritten digit image → features: PIXEL-7,12: 1, PIXEL-7,13: 0, ..., NUM_LOOPS: 1, ... → class: "2"
Weights
- Binary case: compare features to a weight vector to identify the class
- Learning: figure out the weight vector from examples
[Figure: a weight vector, e.g., (# free: 4, YOUR_NAME: -1, MISSPELLED: 1, FROM_FRIEND: -3, ...), dotted with feature vectors such as (# free: 2, YOUR_NAME: 0, MISSPELLED: 2, FROM_FRIEND: 0, ...) and (# free: 0, YOUR_NAME: 1, MISSPELLED: 1, FROM_FRIEND: 1, ...)]
- A positive dot product means the positive class
Linear Classifier example
- Two-class example:
  - Decision boundary: $-\frac{3}{4} x_1 - x_2 + 3 = 0$
  - Rule: if $\mathbf{w}^T \mathbf{x} + w_0 \ge 0$ then $\mathcal{C}_1$ else $\mathcal{C}_2$
  - Here $\mathbf{w} = \left[ -\frac{3}{4}, -1 \right]^T$, $w_0 = 3$
[Figure: the line $-\frac{3}{4} x_1 - x_2 + 3 = 0$ in the $(x_1, x_2)$ plane, separating regions $\mathcal{C}_1$ and $\mathcal{C}_2$]
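A quick sketch of this decision rule; the two test points below are my own, chosen to land on opposite sides of the line:

```python
import numpy as np

# Boundary -3/4 * x1 - x2 + 3 = 0, i.e. w = [-3/4, -1], w0 = 3.
w = np.array([-0.75, -1.0])
w0 = 3.0

def classify(x):
    """Return 'C1' if w^T x + w0 >= 0, else 'C2'."""
    return "C1" if w @ x + w0 >= 0 else "C2"

print(classify(np.array([0.0, 0.0])))  # C1: -0 - 0 + 3 = 3 >= 0
print(classify(np.array([4.0, 4.0])))  # C2: -3 - 4 + 3 = -4 < 0
```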
Binary Decision Rule
- In the space of feature vectors:
  - Examples are points
  - Any weight vector is a hyperplane
  - One side corresponds to Y = +1
  - Other corresponds to Y = -1
[Figure: weights (BIAS: -3, free: 4, money: 2, ...) define a line in the (free, money) plane; +1 = SPAM on one side, -1 = HAM on the other]
Distance between a point $\mathbf{x}^{(i)}$ and the plane
$$\text{distance} = \frac{\mathbf{w}^T \mathbf{x}^{(i)} + w_0}{\lVert \mathbf{w} \rVert}$$
Square error loss function for classification!
- Square error loss is not suitable for classification:
  - Least squares penalizes 'too correct' predictions (points that lie a long way on the correct side of the decision boundary)
  - Least squares also lacks robustness to noise
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left( \mathbf{w}^T \mathbf{x}^{(i)} + w_0 - y^{(i)} \right)^2$$
[Figure: two-class data ($C = 2$) where outliers far on the correct side pull the least-squares boundary away from a good separator]
Notation
- $\mathbf{w} = [w_0, w_1, \dots, w_d]^T$
- $\mathbf{x} = [1, x_1, \dots, x_d]^T$
- $w_0 + w_1 x_1 + \dots + w_d x_d = \mathbf{w}^T \mathbf{x}$
- We denote the input by $\mathbf{x}$ or by its features $\boldsymbol{\phi}(\mathbf{x})$
SSE cost function for classification
- Is it more suitable if we set $h(\mathbf{x}; \mathbf{w}) = \operatorname{sign}(\mathbf{w}^T \mathbf{x})$?
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left( \operatorname{sign}(\mathbf{w}^T \mathbf{x}^{(i)}) - y^{(i)} \right)^2$$
$$\operatorname{sign}(a) = \begin{cases} -1, & a < 0 \\ 1, & a \ge 0 \end{cases}$$
- $J(\mathbf{w})$ is a piecewise constant function, proportional to the number of misclassifications: the training error incurred in classifying the training samples.
[Figure: for $C = 2$ and $y = 1$, $\left( \operatorname{sign}(\mathbf{w}^T \mathbf{x}) - y \right)^2$ as a step function of $\mathbf{w}^T \mathbf{x}$]
Perceptron algorithm
- Linear classifier
- Two-class: $y \in \{-1, 1\}$
  - $y = -1$ for $\mathcal{C}_2$, $y = 1$ for $\mathcal{C}_1$
- Goal:
  - $\forall i, \mathbf{x}^{(i)} \in \mathcal{C}_1 \Rightarrow \mathbf{w}^T \mathbf{x}^{(i)} > 0$
  - $\forall i, \mathbf{x}^{(i)} \in \mathcal{C}_2 \Rightarrow \mathbf{w}^T \mathbf{x}^{(i)} < 0$
- $h(\mathbf{x}; \mathbf{w}) = \operatorname{sign}(\mathbf{w}^T \mathbf{x})$
Perceptron criterion
$$J_P(\mathbf{w}) = - \sum_{i \in \mathcal{M}} \mathbf{w}^T \mathbf{x}^{(i)} y^{(i)}$$
- $\mathcal{M}$: subset of training data that are misclassified
- Many solutions? Which solution among them?
Cost function
[Figure from Duda, Hart & Stork, 2002: surfaces over $(w_0, w_1)$ of the number of misclassifications $J(\mathbf{w})$ vs. the perceptron's cost function $J_P(\mathbf{w})$; there may be many solutions in these cost functions]
Batch Perceptron
- "Gradient descent" to solve the optimization problem:
$$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J_P(\mathbf{w}^t)$$
$$\nabla_{\mathbf{w}} J_P(\mathbf{w}) = - \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$$
- Batch perceptron converges in a finite number of steps for linearly separable data:

  Initialize w
  Repeat
      w = w + eta * sum over i in M of x^(i) y^(i)
  Until convergence
Stochastic gradient descent for Perceptron
- Single-sample perceptron:
  - If $\mathbf{x}^{(i)}$ is misclassified:
$$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \mathbf{x}^{(i)} y^{(i)}$$
- Perceptron convergence theorem: if the training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps.
- Fixed-increment single-sample perceptron ($\eta$ can be set to 1 and the proof still works):

  Initialize w, t <- 0
  Repeat
      t <- t + 1
      i <- t mod n
      if x^(i) is misclassified then w = w + x^(i) y^(i)
  Until all patterns properly classified
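The fixed-increment algorithm, run on the toy training set from the earlier example. A minimal sketch; each input is augmented with a leading 1 to absorb the bias term:

```python
import numpy as np

def perceptron(X, y, max_epochs=2000):
    """Fixed-increment single-sample perceptron (eta = 1).
    X: n x d inputs (a bias column of ones is prepended); y in {-1, +1}."""
    Xa = np.column_stack([np.ones(len(X)), X])  # augment with bias feature
    w = np.zeros(Xa.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xa, y):
            if yi * (w @ xi) <= 0:   # misclassified (or on the boundary)
                w = w + xi * yi      # fixed-increment update
                errors += 1
        if errors == 0:              # all patterns properly classified
            return w
    return w

X = np.array([[0.9, 2.3], [3.5, 2.6], [2.6, 3.3], [2.7, 4.1], [1.8, 3.9],
              [6.5, 6.8], [7.2, 7.5], [7.9, 8.3], [6.9, 8.3], [8.8, 7.9], [9.1, 6.2]])
y = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1])
w = perceptron(X, y)
preds = np.sign(np.column_stack([np.ones(len(X)), X]) @ w)
print(np.array_equal(preds, y))  # True: the data are separable, so it converges
```

Since this toy set is linearly separable, the convergence theorem guarantees termination well within the epoch budget.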
Weight Updates

Learning: Binary Perceptron
- Start with weights = 0
- For each training instance:
  - Classify with current weights
  - If correct (i.e., y = y*), no change!
  - If wrong: adjust the weight vector:
$$\mathbf{w}^{t+1} = \mathbf{w}^t + \eta \mathbf{x}^{(i)} y^{(i)}$$
Perceptron: Example
- Change $\mathbf{w}$ in a direction that corrects the error [Bishop]
Learning: Binary Perceptron
- Start with weights = 0
- For each training instance:
  - Classify with current weights
  - If correct (i.e., y = y*), no change!
  - If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is -1.
Examples: Perceptron
- Separable Case
Convergence of Perceptron
- For data sets that are not linearly separable, the single-sample perceptron learning algorithm will never converge [Duda, Hart & Stork, 2002]
Multiclass Decision Rule
- If we have multiple classes:
  - A weight vector for each class: $\mathbf{w}_k$
  - Score (activation) of a class $k$: $\mathbf{w}_k^T \mathbf{x}$
  - Prediction: the class with the highest score wins
- Binary = multiclass where the negative class has weight zero
Learning: Multiclass Perceptron
- Start with all weights = 0
- Pick up training examples one by one
- Predict with current weights
- If correct, no change!
- If wrong: lower the score of the wrong answer, raise the score of the right answer
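A minimal sketch of the multiclass perceptron updates just described; the three-cluster toy data are an illustrative assumption:

```python
import numpy as np

def multiclass_perceptron(X, y, n_classes, n_epochs=10):
    """One weight vector per class; on a mistake, lower the wrong class's
    weights and raise the right class's weights by the feature vector."""
    W = np.zeros((n_classes, X.shape[1]))
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            pred = int(np.argmax(W @ xi))  # highest score wins
            if pred != yi:
                W[pred] -= xi              # lower score of the wrong answer
                W[yi] += xi                # raise score of the right answer
    return W

# Three well-separated clusters in 2D (illustrative data).
X = np.array([[2.0, 0.1], [2.2, -0.2], [0.0, 2.1],
              [-0.1, 2.3], [-2.0, -2.2], [-2.1, -1.9]])
y = np.array([0, 0, 1, 1, 2, 2])
W = multiclass_perceptron(X, y, n_classes=3)
print((np.argmax(X @ W.T, axis=1) == y).all())  # True
```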
Example: Multiclass Perceptron
[Figure: three weight vectors, one per class, over features BIAS, win, game, vote, the, ..., all initialized to 0; training sentences: "win the vote", "win the election", "win the game"]
Properties of Perceptrons
- Separability: true if some parameters get the training set perfectly classified
- Convergence: if the training set is separable, perceptron will eventually converge (binary case)
- Mistake Bound: the maximum number of mistakes (binary case) is related to the margin or degree of separability
[Figures: a separable and a non-separable point set]
Examples: Perceptron
- Non-Separable Case
Discriminative approach: logistic regression
$$h(\mathbf{x}; \mathbf{w}) = \sigma(\mathbf{w}^T \mathbf{x})$$
- $\sigma(\cdot)$ is an activation function
- Sigmoid (logistic) activation function:
$$\sigma(a) = \frac{1}{1 + e^{-a}}$$
- Two classes ($C = 2$); $\mathbf{x} = [1, x_1, \dots, x_d]^T$, $\mathbf{w} = [w_0, w_1, \dots, w_d]^T$
Logistic regression: cost function
$$\hat{\mathbf{w}} = \operatorname{argmin}_{\mathbf{w}} J(\mathbf{w})$$
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left[ -y^{(i)} \log \sigma(\mathbf{w}^T \mathbf{x}^{(i)}) - (1 - y^{(i)}) \log\left( 1 - \sigma(\mathbf{w}^T \mathbf{x}^{(i)}) \right) \right]$$
- $J(\mathbf{w})$ is convex w.r.t. the parameters.
Logistic regression: loss function
$$\operatorname{Loss}\left( y, h(\mathbf{x}; \mathbf{w}) \right) = -y \log \sigma(\mathbf{w}^T \mathbf{x}) - (1 - y) \log\left( 1 - \sigma(\mathbf{w}^T \mathbf{x}) \right), \qquad \sigma(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}$$
- Since $y = 1$ or $y = 0$:
$$\operatorname{Loss}\left( y, \sigma(\mathbf{w}^T \mathbf{x}) \right) = \begin{cases} -\log \sigma(\mathbf{w}^T \mathbf{x}) & \text{if } y = 1 \\ -\log\left( 1 - \sigma(\mathbf{w}^T \mathbf{x}) \right) & \text{if } y = 0 \end{cases}$$
- How is it related to the zero-one loss? $\operatorname{Loss}(y, \hat{y}) = \begin{cases} 1 & y \ne \hat{y} \\ 0 & y = \hat{y} \end{cases}$
Logistic regression: Gradient descent
$$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta \nabla_{\mathbf{w}} J(\mathbf{w}^t)$$
$$\nabla_{\mathbf{w}} J(\mathbf{w}) = \sum_{i=1}^{n} \left( \sigma(\mathbf{w}^T \mathbf{x}^{(i)}) - y^{(i)} \right) \mathbf{x}^{(i)}$$
- Is it similar to the gradient of SSE for linear regression?
$$\nabla_{\mathbf{w}} J(\mathbf{w}) = \sum_{i=1}^{n} \left( \mathbf{w}^T \mathbf{x}^{(i)} - y^{(i)} \right) \mathbf{x}^{(i)}$$
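The gradient update above, sketched on a toy 1-D problem; the data and learning rate are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, eta=0.1, n_iters=5000):
    """Batch gradient descent on the logistic-regression cost:
    w <- w - eta * sum_i (sigmoid(w.x_i) - y_i) x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y)
        w = w - eta * grad
    return w

# 1-D data with labels in {0, 1}, separable around x = 0.
x = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
X = np.column_stack([np.ones_like(x), x])      # prepend the bias feature
y = np.array([0, 0, 0, 1, 1, 1])
w = fit_logistic(X, y)
probs = sigmoid(X @ w)
print(np.all((probs > 0.5) == (y == 1)))  # True: all points classified correctly
```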
Multi-class logistic regression
- $\mathbf{h}(\mathbf{x}; \mathbf{W}) = \left[ P(\mathcal{C}_1 \mid \mathbf{x}, \mathbf{W}), \dots, P(\mathcal{C}_K \mid \mathbf{x}, \mathbf{W}) \right]^T$
- $\mathbf{W} = [\mathbf{w}_1 \cdots \mathbf{w}_K]$ contains one vector of parameters for each class
- Softmax:
$$P(\mathcal{C}_k \mid \mathbf{x}, \mathbf{W}) = \frac{\exp(\mathbf{w}_k^T \mathbf{x})}{\sum_{j=1}^{K} \exp(\mathbf{w}_j^T \mathbf{x})}$$
Logistic regression: multi-class
$$\hat{\mathbf{W}} = \operatorname{argmin}_{\mathbf{W}} J(\mathbf{W})$$
$$J(\mathbf{W}) = - \sum_{i=1}^{n} \sum_{k=1}^{K} y_k^{(i)} \log P(\mathcal{C}_k \mid \mathbf{x}^{(i)}, \mathbf{W})$$
- $\mathbf{W} = [\mathbf{w}_1 \cdots \mathbf{w}_K]$
- $\mathbf{y}$ is a vector of length $K$ (1-of-K coding), e.g., $\mathbf{y} = [0, 0, 1, 0]^T$ when the target class is $\mathcal{C}_3$
Logistic regression: multi-class
$$\mathbf{w}_j^{t+1} = \mathbf{w}_j^t - \eta \nabla_{\mathbf{w}_j} J(\mathbf{W}^t)$$
$$\nabla_{\mathbf{w}_j} J(\mathbf{W}) = \sum_{i=1}^{n} \left( P(\mathcal{C}_j \mid \mathbf{x}^{(i)}, \mathbf{W}) - y_j^{(i)} \right) \mathbf{x}^{(i)}$$
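The multi-class gradient above as a short sketch; the toy data and step size are illustrative assumptions:

```python
import numpy as np

def softmax(S):
    """Row-wise softmax of a score matrix S (n x K)."""
    E = np.exp(S - S.max(axis=1, keepdims=True))  # shift rows for stability
    return E / E.sum(axis=1, keepdims=True)

def fit_softmax(X, Y, eta=0.05, n_iters=2000):
    """Batch GD on the multi-class cross-entropy; Y is one-hot (n x K).
    The gradient w.r.t. w_j is sum_i (P(C_j | x_i) - Y_ij) x_i."""
    d, K = X.shape[1], Y.shape[1]
    W = np.zeros((d, K))               # one column of parameters per class
    for _ in range(n_iters):
        P = softmax(X @ W)             # n x K class probabilities
        W = W - eta * X.T @ (P - Y)    # all K gradients at once
    return W

X = np.array([[1, 2.0, 0.1], [1, 2.2, -0.2],      # class 0 (leading 1 = bias)
              [1, 0.0, 2.1], [1, -0.1, 2.3],      # class 1
              [1, -2.0, -2.2], [1, -2.1, -1.9]])  # class 2
labels = np.array([0, 0, 1, 1, 2, 2])
Y = np.eye(3)[labels]
W = fit_softmax(X, Y)
print(np.all(np.argmax(X @ W, axis=1) == labels))  # True
```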
Multi-class classifier
- $\mathbf{h}(\mathbf{x}; \mathbf{W}) = \left[ h_1(\mathbf{x}; \mathbf{W}), \dots, h_K(\mathbf{x}; \mathbf{W}) \right]^T$
- $\mathbf{W} = [\mathbf{w}_1 \cdots \mathbf{w}_K]$ contains one vector of parameters for each class
- In linear classifiers, $\mathbf{W}$ is $d \times K$, where $d$ is the number of features
- $\mathbf{W}^T \mathbf{x}$ provides us a vector
- $\mathbf{h}(\mathbf{x}; \mathbf{W})$ contains $K$ numbers giving class scores for the input $\mathbf{x}$
Example
- Output obtained from $\mathbf{W}^T \mathbf{x} + \mathbf{b}$
- $\mathbf{x} = [x_1, \dots, x_{784}]^T$ (a $28 \times 28$ image, flattened), $\mathbf{W}^T$ is $10 \times 784$ (one row $\mathbf{w}_k$ per class), $\mathbf{b} = [b_1, \dots, b_{10}]^T$
This slide has been adapted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford, 2017.
Example
- How can we tell whether this $\mathbf{W}$ and $\mathbf{b}$ is good or bad?
- The bias can also be included in the $\mathbf{W}$ matrix (by appending a constant 1 feature to $\mathbf{x}$).
This slide has been adapted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford, 2017.
Softmax classifier loss: example
$$L^{(i)} = -\log \frac{e^{s_{y^{(i)}}}}{\sum_{j} e^{s_j}}$$
$$L^{(1)} = -\log 0.13 = 0.89$$
This slide has been adapted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford, 2017.
Support Vector Machines
- Maximizing the margin: good according to intuition, theory, practice
- Support vector machines (SVMs) find the separator with max margin

Hard-margin SVM: Optimization problem
$$\max_{\mathbf{w}, w_0} \frac{2}{\lVert \mathbf{w} \rVert}$$
$$\text{s.t.} \quad \mathbf{w}^T \mathbf{x}^{(i)} + w_0 \ge 1 \quad \forall i: y^{(i)} = 1$$
$$\qquad \mathbf{w}^T \mathbf{x}^{(i)} + w_0 \le -1 \quad \forall i: y^{(i)} = -1$$
- Margin: $\frac{2}{\lVert \mathbf{w} \rVert}$
[Figure: the lines $\mathbf{w}^T \mathbf{x} + w_0 = 0, 1, -1$ in the $(x_1, x_2)$ plane; the margin $\frac{2}{\lVert \mathbf{w} \rVert}$ separates the two outer lines]
Distance between a point $\mathbf{x}^{(i)}$ and the plane
$$\text{distance} = \frac{\mathbf{w}^T \mathbf{x}^{(i)} + w_0}{\lVert \mathbf{w} \rVert}$$
Hard-margin SVM: Optimization problem
- We can equivalently optimize:
$$\min_{\mathbf{w}, w_0} \frac{1}{2} \mathbf{w}^T \mathbf{w}$$
$$\text{s.t.} \quad y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + w_0 \right) \ge 1, \quad i = 1, \dots, n$$
- It is a convex Quadratic Programming (QP) problem
  - There are computationally efficient packages to solve it.
  - It has a global minimum (if any).
Error measure
- Margin violation amount $\xi_i$ ($\xi_i \ge 0$):
$$y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + w_0 \right) \ge 1 - \xi_i$$
- Total violation: $\sum_{i=1}^{n} \xi_i$
Soft-margin SVM: Optimization problem
- SVM with slack variables: allows samples to fall within the margin, but penalizes them
$$\min_{\mathbf{w}, w_0, \{\xi_i\}_{i=1}^{n}} \frac{1}{2} \lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \xi_i$$
$$\text{s.t.} \quad y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + w_0 \right) \ge 1 - \xi_i, \quad i = 1, \dots, n; \qquad \xi_i \ge 0$$
- $\xi_i$: slack variables
  - $0 < \xi_i < 1$: $\mathbf{x}^{(i)}$ is correctly classified but inside the margin
  - $\xi_i > 1$: $\mathbf{x}^{(i)}$ is misclassified
[Figure: points inside the margin ($\xi < 1$) and misclassified points ($\xi > 1$)]
Soft-margin SVM: Cost function
$$\min_{\mathbf{w}, w_0, \{\xi_i\}_{i=1}^{n}} \frac{1}{2} \lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \xi_i$$
$$\text{s.t.} \quad y^{(i)} \left( \mathbf{w}^T \mathbf{x}^{(i)} + w_0 \right) \ge 1 - \xi_i, \quad i = 1, \dots, n; \qquad \xi_i \ge 0$$
- It is equivalent to the unconstrained optimization problem:
$$\min_{\mathbf{w}, w_0} \frac{1}{2} \lVert \mathbf{w} \rVert^2 + C \sum_{i=1}^{n} \max\left( 0, 1 - y^{(i)} (\mathbf{w}^T \mathbf{x}^{(i)} + w_0) \right)$$
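The unconstrained hinge-loss form lends itself to (sub)gradient descent. A minimal sketch; the data, C, learning rate, and iteration count are illustrative choices, not from the slides:

```python
import numpy as np

def fit_svm_subgradient(X, y, C=1.0, eta=0.01, n_iters=2000):
    """Subgradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + w0)).
    Only margin-violating samples (y_i (w.x_i + w0) < 1) contribute to the sum."""
    w = np.zeros(X.shape[1])
    w0 = 0.0
    for _ in range(n_iters):
        margins = y * (X @ w + w0)
        viol = margins < 1                       # samples with nonzero hinge loss
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_w0 = -C * y[viol].sum()
        w, w0 = w - eta * grad_w, w0 - eta * grad_w0
    return w, w0

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, w0 = fit_svm_subgradient(X, y)
print(np.all(np.sign(X @ w + w0) == y))  # True: training points separated
```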
Multi-class SVM
$$J(\mathbf{W}) = \frac{1}{N} \sum_{i=1}^{N} L^{(i)} + \lambda R(\mathbf{W})$$
- Hinge loss:
$$L^{(i)} = \sum_{j \ne y^{(i)}} \max\left( 0, 1 + s_j - s_{y^{(i)}} \right) = \sum_{j \ne y^{(i)}} \max\left( 0, 1 + \mathbf{w}_j^T \mathbf{x}^{(i)} - \mathbf{w}_{y^{(i)}}^T \mathbf{x}^{(i)} \right)$$
  where $s_j \equiv h_j(\mathbf{x}^{(i)}; \mathbf{W}) = \mathbf{w}_j^T \mathbf{x}^{(i)}$
- L2 regularization:
$$R(\mathbf{W}) = \sum_{k} \sum_{j} W_{k,j}^2$$
Multi-class SVM loss: Example
- 3 training examples, 3 classes. With some $\mathbf{W}$, the scores $\mathbf{W}^T \mathbf{x}$ are:
  - Example 1: correct class scores 3.2; the other classes score 5.1 and -1.7
  - Example 2: correct class scores 4.9; the other classes score 1.3 and 2.0
  - Example 3: correct class scores -3.1; the other classes score 2.2 and 2.5
$$L^{(i)} = \sum_{j \ne y^{(i)}} \max\left( 0, 1 + s_j - s_{y^{(i)}} \right), \qquad s_j = \mathbf{w}_j^T \mathbf{x}^{(i)}$$
$$L^{(1)} = \max(0, 1 + 5.1 - 3.2) + \max(0, 1 - 1.7 - 3.2) = \max(0, 2.9) + \max(0, -3.9) = 2.9 + 0 = 2.9$$
$$L^{(2)} = \max(0, 1 + 1.3 - 4.9) + \max(0, 1 + 2 - 4.9) = \max(0, -2.6) + \max(0, -1.9) = 0 + 0 = 0$$
$$L^{(3)} = \max(0, 2.2 - (-3.1) + 1) + \max(0, 2.5 - (-3.1) + 1) = \max(0, 6.3) + \max(0, 6.6) = 6.3 + 6.6 = 12.9$$
$$\frac{1}{N} \sum_{i=1}^{N} L^{(i)} = \frac{1}{3} (2.9 + 0 + 12.9) \approx 5.27$$
This slide has been adapted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford, 2017.
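The worked example above can be checked numerically (scores taken from the slide; the correct-class index is given per example):

```python
import numpy as np

def multiclass_hinge(scores, correct):
    """L_i = sum over j != correct of max(0, 1 + s_j - s_correct)."""
    margins = np.maximum(0.0, 1.0 + scores - scores[correct])
    margins[correct] = 0.0            # the correct class is excluded from the sum
    return margins.sum()

# (score vector, index of the correct class) per training example.
examples = [(np.array([3.2, 5.1, -1.7]), 0),
            (np.array([1.3, 4.9, 2.0]), 1),
            (np.array([2.2, 2.5, -3.1]), 2)]
losses = [multiclass_hinge(s, c) for s, c in examples]
print(losses)                     # [2.9, 0.0, 12.9] (up to float rounding)
print(sum(losses) / len(losses))  # ~5.27
```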
Recap
- We need $\nabla_{\mathbf{W}} L$ to update the weights
- L2 regularization: $R(\mathbf{W}) = \sum_{k} \sum_{j} W_{k,j}^2$
- L1 regularization: $R(\mathbf{W}) = \sum_{k} \sum_{j} |W_{k,j}|$
This slide has been adapted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford, 2017.
Generalized linear
- Linear combination of fixed non-linear functions of the input vector:
$$f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 \phi_1(\mathbf{x}) + \dots + w_m \phi_m(\mathbf{x})$$
- $\{\phi_1(\mathbf{x}), \dots, \phi_m(\mathbf{x})\}$: set of basis functions (or features), $\phi_j: \mathbb{R}^d \to \mathbb{R}$
Basis functions: examples
- Linear
- Polynomial (univariate)

Polynomial regression: example
[Figures: polynomial fits of degree $m = 1, 3, 5, 7$]
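Polynomial regression is ordinary linear regression on the basis $\phi_j(x) = x^j$. A minimal sketch; the noisy sine data are an illustrative assumption:

```python
import numpy as np

def poly_design(x, m):
    """Design matrix with columns x^0, x^1, ..., x^m."""
    return np.vander(x, m + 1, increasing=True)

def fit_poly(x, y, m):
    """Least-squares fit of a degree-m polynomial (linear in the weights w)."""
    Phi = poly_design(x, m)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)
w3 = fit_poly(x, y, m=3)
resid = y - poly_design(x, 3) @ w3
print(resid @ resid)  # small residual: a cubic tracks one period of a sine well
```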
Generalized linear classifier
- Assume a transformation $\boldsymbol{\phi}: \mathbb{R}^d \to \mathbb{R}^m$ on the feature space: $\mathbf{x} \to \boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}), \dots, \phi_m(\mathbf{x})]^T$
- Find a hyperplane in the transformed feature space: $\mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}) + w_0 = 0$
[Figure: data that are not linearly separable in the $(x_1, x_2)$ plane become separable in the $(\phi_1(\mathbf{x}), \phi_2(\mathbf{x}))$ plane]
Model complexity and overfitting
- With limited training data, models may achieve zero training error but a large test error.
- Over-fitting: when the training loss no longer bears any relation to the test (generalization) loss.
  - Fails to generalize to unseen examples.
- Training (empirical) loss:
$$\frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \boldsymbol{\theta}) \right)^2 \approx 0$$
- Expected (true) loss:
$$\mathbb{E}_{\mathbf{x}, y} \left[ \left( y - f(\mathbf{x}; \boldsymbol{\theta}) \right)^2 \right] \gg 0$$
Polynomial regression
[Figures from Bishop: fits of degree $m = 0, 1, 3, 9$ to the same data]
Over-fitting causes
- Model complexity
  - e.g., a model with a large number of parameters (degrees of freedom)
- Low number of training data
  - small data size compared to the complexity of the model
Model complexity
- Example: polynomials with larger $m$ are becoming increasingly tuned to the random noise on the target values.
[Figures from Bishop: $m = 0, 1, 3, 9$]
Number of training data & overfitting
- The over-fitting problem becomes less severe as the size of training data increases.
[Figures from Bishop: $m = 9$ fits with $n = 15$ vs. $n = 100$ samples]
How to evaluate the learner's performance?
- Generalization error: the true (or expected) error that we would like to optimize
- Two ways to assess the generalization error:
  - Practical: use a separate data set to test the model
  - Theoretical: Law of Large Numbers
    - statistical bounds on the difference between training and expected errors
Avoiding over-fitting
- Determine a suitable value for model complexity (model selection)
  - Simple hold-out method
  - Cross-validation
- Regularization (Occam's Razor)
  - Explicit preference towards simple models
  - Penalize model complexity in the objective function
Model Selection
- The learning algorithm defines the data-driven search over the hypothesis space (i.e., the search for good parameters)
- Hyperparameters are the tunable aspects of the model that the learning algorithm does not select
This slide has been adapted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/

Model Selection
- Model selection is the process by which we choose the "best" model from among a set of candidates
  - assumes access to a function capable of measuring the quality of a model
  - typically done "outside" the main training algorithm
- Model selection / hyperparameter optimization is just another form of learning
This slide has been adapted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/
Simple hold-out: model selection
- Steps:
  - Divide the training data into a training set and a validation set (v_set)
  - Use only the training set to train a set of models
  - Evaluate each learned model on the validation set:
$$J_v(\mathbf{w}) = \frac{1}{|v\_set|} \sum_{i \in v\_set} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$$
  - Choose the best model based on the validation set error
- Usually too wasteful of valuable training data
  - Training data may be limited.
  - On the other hand, a small validation set gives a relatively noisy estimate of performance.
Simple hold-out: training, validation, and test sets
- Simple hold-out chooses the model that minimizes error on the validation set.
- $J_v(\hat{\mathbf{w}})$ is likely to be an optimistic estimate of the generalization error.
  - An extra parameter (e.g., the degree of the polynomial) is fit to this set.
- Estimate the generalization error on the test set
  - the performance of the selected model is finally evaluated on the test set
[Figure: data split into Training | Validation | Test]
Cross-Validation (CV): Evaluation
- $k$-fold cross-validation steps:
  - Shuffle the dataset and randomly partition the training data into $k$ groups of approximately equal size
  - for $j = 1$ to $k$:
    - Choose the $j$-th group as the held-out validation group
    - Train the model on all but the $j$-th group of data
    - Evaluate the model on the held-out group
  - Performance scores of the model from the $k$ runs are averaged.
  - The average error rate can be considered as an estimate of the true performance.
[Figure: the $k$ runs (first run, second run, ..., $(k-1)$-th run, $k$-th run), each holding out a different fold]
Cross-Validation (CV): Model Selection
- For each model, we first find the average error by CV.
- The model with the best average performance is selected.
Cross-validation: polynomial regression example
- 5-fold CV, 100 runs, averaged
- $m = 1$: CV MSE = 0.30; $m = 3$: CV MSE = 1.45; $m = 5$: CV MSE = 45.44; $m = 7$: CV MSE = 31759
Regularization
- Add a penalty term to the cost function to discourage the coefficients from reaching large values.
- Ridge regression (weight decay):
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left( y^{(i)} - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}^{(i)}) \right)^2 + \lambda \mathbf{w}^T \mathbf{w}$$
$$\hat{\mathbf{w}} = \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{I} \right)^{-1} \boldsymbol{\Phi}^T \mathbf{y}$$
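The closed form above in a few lines. The toy polynomial data are illustrative; the two λ values just demonstrate the shrinkage effect:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution w = (Phi^T Phi + lam I)^(-1) Phi^T y,
    computed via solve() rather than forming an explicit inverse."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

x = np.linspace(0, 1, 15)
Phi = np.vander(x, 10, increasing=True)   # degree-9 polynomial basis
y = np.sin(2 * np.pi * x)
w_unreg = ridge_fit(Phi, y, lam=1e-12)    # lambda ~ 0: large coefficients
w_ridge = ridge_fit(Phi, y, lam=1e-3)     # weight decay shrinks them
print(np.linalg.norm(w_unreg) > np.linalg.norm(w_ridge))  # True
```

The norm of the ridge solution is monotonically non-increasing in λ, so the comparison holds for any such pair of λ values.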
Polynomial order
- Polynomials with larger $m$ are becoming increasingly tuned to the random noise on the target values.
- The magnitude of the coefficients typically gets larger as $m$ increases. [Bishop]
Regularization parameter
[Table from Bishop: coefficients $\hat{w}_0, \dots, \hat{w}_9$ of the $m = 9$ fit for $\ln \lambda = -\infty$ vs. $\ln \lambda = -18$]

Regularization parameter
- Generalization: $\lambda$ now controls the effective complexity of the model and hence determines the degree of over-fitting. [Bishop]
Choosing the regularization parameter
- Consider a set of models with different values of $\lambda$.
- Find $\hat{\mathbf{w}}$ for each model based on the training data.
- Find $J_v(\hat{\mathbf{w}})$ (or $J_{cv}(\hat{\mathbf{w}})$) for each model:
$$J_v(\mathbf{w}) = \frac{1}{n_v} \sum_{i \in v\_set} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$$
- Select the model with the best $J_v(\hat{\mathbf{w}})$ (or $J_{cv}(\hat{\mathbf{w}})$).
The approximation-generalization trade-off
- A small true error shows good approximation of $g$ out of sample
- More complex $\mathcal{H}$ ⇒ better chance of approximating $g$
- Less complex $\mathcal{H}$ ⇒ better chance of generalizing out of sample
Complexity of Hypothesis Space: Example
[Figures: price vs. size fit with $w_0 + w_1 x$ (less complex $\mathcal{H}$, underfitting), $w_0 + w_1 x + w_2 x^2$, and $w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$ (more complex $\mathcal{H}$, overfitting)]
This example has been adapted from Prof. Andrew Ng's slides.
Complexity of Hypothesis Space: Example
[Figure: training error $J_{train}$ and validation error $J_v$ vs. degree of polynomial $m$]
$$J_v(\mathbf{w}) = \frac{1}{n_v} \sum_{i \in v\_set} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$$
$$J_{train}(\mathbf{w}) = \frac{1}{n_{train}} \sum_{i \in train\_set} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$$
Complexity of Hypothesis Space
- Less complex $\mathcal{H}$: $J_{train}(\hat{\mathbf{w}}) \approx J_v(\hat{\mathbf{w}})$, and $J_{train}(\hat{\mathbf{w}})$ is very high
- More complex $\mathcal{H}$: $J_{train}(\hat{\mathbf{w}}) \ll J_v(\hat{\mathbf{w}})$, and $J_{train}(\hat{\mathbf{w}})$ is low
[Figure: $J_{train}(\hat{\mathbf{w}})$ and $J_v(\hat{\mathbf{w}})$ vs. degree of polynomial $m$]
Regularization: Example
$$f(x; \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$$
$$J(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f(x^{(i)}; \mathbf{w}) \right)^2 + \lambda \mathbf{w}^T \mathbf{w}$$
[Figures: price vs. size fits under three settings]
- Large $\lambda$ (prefer simpler models): drives the coefficients toward zero, underfitting
- Intermediate $\lambda$: a good fit
- Small $\lambda$ ($\lambda = 0$: prefer more complex models): overfitting
This example has been adapted from Prof. Andrew Ng's slides.
Lesson
- Match the model complexity to the data resources, not to the complexity of the target function.

Resources