SLIDE 1
Learning Theory

CE-717: Machine Learning
Sharif University of Technology
M. Soleymani
Fall 2016

SLIDE 2
Topics

 Feasibility of learning
 PAC learning
 VC dimension
 Structural Risk Minimization (SRM)

SLIDE 3
Feasibility of learning

 Does the training set D tell us anything outside of D?
 D does not tell us anything certain about f outside of D.
 However, it can tell us something likely about f outside of D.
 Probability is what makes a learning theory possible.

SLIDE 4
Feasibility of learning

 Learning thus splits into two questions:
 Can we make sure that E_true(g) is close to E_train(g)?
 Can we make E_train(g) small enough?

SLIDE 5
Generalizability of Learning

 Generalization error is what we really care about.
 Why should doing well on the training set tell us anything about generalization error?
 Can we relate the error on the training set to the generalization error?
 What are the conditions under which we can actually prove that learning algorithms will work well?

SLIDE 6
A related example

(Figure: a bin of red and green marbles with unknown red fraction μ, and a sample of N marbles with red fraction ν.)

 The value of μ is unknown to us.
 We pick N marbles independently.
 The fraction of red marbles in the sample = ν.
 Pr[picking a red marble] = μ,  Pr[picking a green marble] = 1 − μ.

SLIDE 7
Does ν say anything about μ?

 No:
 the sample can be mostly green while the bin is mostly red.
 Yes:
 the sample frequency ν is likely close to the bin frequency μ.

SLIDE 8
What does ν say about μ?

 In a big sample (large N), ν is probably close to μ (within ε).
 Hoeffding's inequality:

Pr[ |ν − μ| > ε ] ≤ 2e^(−2ε²N)

 Valid for all N and ε.
 The bound does not depend on μ.
 Tradeoff between N, ε, and the bound.
 In other words, "μ = ν" is Probably Approximately Correct (PAC).
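The inequality above is easy to check numerically. Below is a minimal simulation sketch (not part of the original slides); the bin frequency μ, the sample size N, and the tolerance ε are arbitrary illustrative values.

```python
import random
from math import exp

def hoeffding_demo(mu=0.6, N=100, eps=0.1, trials=100_000, seed=0):
    random.seed(seed)
    bad = 0                                    # trials where |nu - mu| > eps
    for _ in range(trials):
        # draw N marbles; each is red with probability mu
        nu = sum(random.random() < mu for _ in range(N)) / N
        if abs(nu - mu) > eps:
            bad += 1
    bound = 2 * exp(-2 * eps ** 2 * N)         # Hoeffding bound: 2 e^(-2 eps^2 N)
    print(f"empirical Pr[|nu - mu| > eps] = {bad / trials:.4f}   bound = {bound:.4f}")

hoeffding_demo()
```

The empirical frequency of large deviations stays below the bound, as the inequality promises for any μ.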

SLIDE 9
Recall: Learning diagram

(Figure: the learning diagram — an unknown target function, an unknown input distribution P(x), training examples (x^(1), y^(1)), …, (x^(N), y^(N)), a hypothesis set, and a final hypothesis g that approximates the target.)

[Y. S. Abu-Mostafa et al., "Learning From Data", 2012]

 We assume that some random process proposes instances and a teacher labels them (i.e., instances are drawn i.i.d. according to a distribution P(x)).

SLIDE 10
Learning: Problem settings

 Set of all instances X
 Set of hypotheses H
 Set of possible target functions C = {c: X → Y}
 Sequence of N training instances D = { (x^(n), c(x^(n))) } for n = 1, …, N
 x is drawn at random from the unknown distribution P(x)
 The teacher provides the noise-free label c(x) for it
 The learner observes a set of training examples D for the target function c and outputs a hypothesis h ∈ H estimating c

SLIDE 11
Connection of Hoeffding's inequality to learning

 In the bin example, the unknown is μ.
 In the learning problem, the unknown is a function c: X → Y.

SLIDE 12
Two notions of error

 Training error of h: how often h(x) ≠ c(x) on the training instances

E_train(h) ≡ E_{x~D}[ I(h(x) ≠ c(x)) ] = (1/|D|) Σ_{x∈D} I(h(x) ≠ c(x))

 Test error of h: how often h(x) ≠ c(x) over future instances drawn at random from P(X)

E_true(h) ≡ E_{x~P(X)}[ I(h(x) ≠ c(x)) ]

(D: training data; P: the underlying probability distribution)

SLIDE 13
Notation for learning

 Both μ and ν depend on which hypothesis h we consider:
 ν is the "in sample" quantity, denoted by E_train(h)
 μ is the "out of sample" quantity, denoted by E_true(h)
 The Hoeffding inequality becomes:

Pr[ |E_train(h) − E_true(h)| > ε ] ≤ 2e^(−2ε²N)

SLIDE 14
Are we done?

 We cannot use this bound for the hypothesis g learned from the data.
 In this inequality, h is assumed to be fixed in advance; for such a fixed h, E_train(h) generalizes to E_true(h).
 This is "verification" of h, not learning.
 In learning we need to choose among multiple h's: g is not fixed in advance but is selected according to the samples.

SLIDE 15
Hypothesis space as multiple bins

 Generalizing the bin model to more than one hypothesis:

(Figure: one bin per hypothesis h_1, …, h_M, each with its own out-of-sample error E_true(h_m) and in-sample frequency E_train(h_m).)

SLIDE 16
Hypothesis space: Coin example

 Question: if you toss a fair coin 10 times, what is the probability that you get 10 heads?
 Answer: ≈ 0.1%
 Question: if you toss 1000 fair coins 10 times each, what is the probability that some of them come up 10 heads?
 Answer: ≈ 63%
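For reference, a short calculation (not on the slide) that reproduces these two numbers:

```python
p_10_heads = 0.5 ** 10                       # one fair coin, 10 tosses, all heads
p_some_coin = 1 - (1 - p_10_heads) ** 1000   # at least one of 1000 coins gets 10 heads

print(f"one coin:   {p_10_heads:.5f}  (~0.1%)")
print(f"1000 coins: {p_some_coin:.3f}   (~62-63%)")
```

With many hypotheses (coins), some hypothesis is likely to look perfect on the sample purely by chance, which is why the single-hypothesis bound is not enough.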

SLIDE 17
A bound for the learning problem: Using Hoeffding's inequality

Pr[ |E_true(g) − E_train(g)| > ε ]
 ≤ Pr[ |E_true(h_1) − E_train(h_1)| > ε  or  |E_true(h_2) − E_train(h_2)| > ε  or … or  |E_true(h_M) − E_train(h_M)| > ε ]
    (since g is one of h_1, …, h_M)
 ≤ Σ_{i=1}^{M} Pr[ |E_true(h_i) − E_train(h_i)| > ε ]
 ≤ Σ_{i=1}^{M} 2e^(−2ε²N) = 2|H| e^(−2ε²N)

(|H| = M)

SLIDE 18
PAC bound: Using Hoeffding's inequality

Pr[ |E_true(h) − E_train(h)| > ε ] ≤ 2|H| e^(−2ε²N) = δ
⇒ Pr[ |E_true(h) − E_train(h)| ≤ ε ] ≥ 1 − δ

 With probability at least 1 − δ, every h satisfies

E_true(h) < E_train(h) + √( (ln(2|H|) + ln(1/δ)) / (2N) )

 Thus we can bound E_true(h) − E_train(h), which measures the amount of overfitting.
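As an illustration of how the bound behaves, here is a small sketch (not from the slides) that evaluates the gap √((ln(2|H|) + ln(1/δ))/(2N)); the values of |H|, N, and δ are arbitrary examples.

```python
from math import log, sqrt

def generalization_gap(H_size: int, N: int, delta: float) -> float:
    """Bound on E_true(h) - E_train(h), holding uniformly over a finite hypothesis space."""
    return sqrt((log(2 * H_size) + log(1 / delta)) / (2 * N))

print(generalization_gap(H_size=1000, N=500, delta=0.05))  # ~= 0.103
```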

SLIDE 19
Sample complexity

 How many training examples suffice?
 Given ε and δ, the bound yields the sample complexity:

N ≥ (1/(2ε²)) (ln(2|H|) + ln(1/δ))

 Thus, we have a theory that relates:
 the number of training examples
 the complexity of the hypothesis space
 the accuracy to which the target function is approximated
 the probability that the learner outputs a successful hypothesis
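A corresponding sketch (not from the slides) that solves the bound above for N, in the agnostic, finite-|H| case; the example values are arbitrary.

```python
from math import ceil, log

def sample_complexity(H_size: int, eps: float, delta: float) -> int:
    """Smallest N with N >= (ln(2|H|) + ln(1/delta)) / (2 eps^2)."""
    return ceil((log(2 * H_size) + log(1 / delta)) / (2 * eps ** 2))

print(sample_complexity(H_size=1000, eps=0.05, delta=0.01))  # 2442
```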

SLIDE 20
Another problem setting

 A finite number of possible hypotheses (e.g., decision trees of depth at most d0).
 A learner finds a hypothesis h that is consistent with the training data:
 E_train(h) = 0
 What is the probability that the true error of h will be more than ε?
 E_true(h) ≥ ε

SLIDE 21
True error of a hypothesis

 True error of h: the probability that it will misclassify an example drawn at random from P(x)

E_true(h) ≡ E_{x~P(X)}[ I(h(x) ≠ c(x)) ]

(Figure: instance space with hypothesis h and target c(x); the true error is the probability mass where they disagree.)

SLIDE 22
How likely is a consistent learner to pick a bad hypothesis?

 A bound on the probability that any consistent learner will output an h with E_true(h) > ε.
 Theorem [Haussler, 1988]: For target concept c and any 0 ≤ ε ≤ 1:
 if H is finite and D contains N ≥ 1 independent random samples, then

Pr[ ∃h ∈ H: E_train(h) = 0 ∧ E_true(h) > ε ] ≤ |H| e^(−εN)

SLIDE 23
Haussler bound: Proof

 What does the theorem mean?

Pr[ ∃h ∈ H: E_train(h) = 0 ∧ E_true(h) > ε ] ≤ |H| e^(−εN)

 For a fixed h, how likely is a bad hypothesis (i.e., one with E_true(h) > ε) to label N training data points correctly?
 Pr(h labels one data point correctly | E_true(h) > ε) ≤ 1 − ε
 Pr(h labels N i.i.d. data points correctly | E_true(h) > ε) ≤ (1 − ε)^N

SLIDE 24
Haussler bound: Proof (Cont'd)

 There may be many bad hypotheses h_1, …, h_k (i.e., E_true(h_1) > ε, …, E_true(h_k) > ε) that are consistent with the N training data points:
 E_train(h_1) = 0, E_train(h_2) = 0, …, E_train(h_k) = 0
 How likely is the learner to pick a bad hypothesis (E_true(h) > ε) among the consistent ones {h_1, …, h_k}?

 Pr[ ∃h ∈ H: E_true(h) > ε ∧ E_train(h) = 0 ]
 = Pr[ (E_true(h_1) > ε ∧ E_train(h_1) = 0) or … or (E_true(h_k) > ε ∧ E_train(h_k) = 0) ]
 ≤ Σ_{i=1}^{k} Pr( E_train(h_i) = 0 ∧ E_true(h_i) > ε )            [union bound: P(A ∪ B) ≤ P(A) + P(B)]
 ≤ Σ_{i=1}^{k} Pr( E_train(h_i) = 0 | E_true(h_i) > ε ) ≤ Σ_{i=1}^{k} (1 − ε)^N
 ≤ |H| (1 − ε)^N                                                    [k ≤ |H|]
 ≤ |H| e^(−εN)                                                      [(1 − ε) ≤ e^(−ε) for 0 ≤ ε ≤ 1]

SLIDE 25
Haussler PAC Bound

 Theorem [Haussler'88]: Consider a finite hypothesis space H, a training set D with N i.i.d. samples, and 0 < ε < 1:

Pr[ ∃h ∈ H: E_train(h) = 0 ∧ E_true(h) > ε ] ≤ |H| e^(−εN)

 Suppose we want this probability to be at most δ: |H| e^(−εN) ≤ δ.
 Then, for any learned hypothesis h ∈ H that is consistent on the training set D (i.e., E_train(h) = 0), with probability at least 1 − δ:

E_true(h) ≤ ε

SLIDE 26
Haussler PAC bound: Sample complexity

 How many training examples suffice?
 Given ε and δ, the sample complexity is:

N ≥ (1/ε) (ln|H| + ln(1/δ))

 Given N and δ, the error bound is:

ε ≤ (1/N) (ln|H| + ln(1/δ))

 This many training examples guarantee that any consistent hypothesis has error at most ε with probability 1 − δ. The error bound is linear in 1/N and only logarithmic in |H|.
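The two formulas above can be evaluated directly; a small illustrative sketch (not from the slides) with arbitrary example values:

```python
from math import ceil, log

def haussler_sample_complexity(H_size: int, eps: float, delta: float) -> int:
    """N >= (ln|H| + ln(1/delta)) / eps guarantees E_true <= eps for any consistent h."""
    return ceil((log(H_size) + log(1 / delta)) / eps)

def haussler_error_bound(H_size: int, N: int, delta: float) -> float:
    """eps <= (ln|H| + ln(1/delta)) / N, with probability >= 1 - delta."""
    return (log(H_size) + log(1 / delta)) / N

print(haussler_sample_complexity(H_size=10_000, eps=0.05, delta=0.01))  # 277
print(haussler_error_bound(H_size=10_000, N=1000, delta=0.01))          # ~= 0.0138
```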

SLIDE 27
Example: Conjunctions of up to d Boolean literals

 Consider a Boolean classification problem c: X → Y.
 Hypothesis space: rules that are conjunctions of up to d Boolean literals, e.g., ¬x1 ∧ x3.
 Example (d = 5 Boolean features): if x = [0 ? 1 ? ?] then y = 1, else y = 0.
 How many training examples N are needed so that "any consistent learner using H will, with probability ≥ 0.99, output a hypothesis with E_true ≤ 0.05"?
 Here δ = 0.01, ε = 0.05, and |H| = 3^d:
 d = 5 ⇒ N > 201
 d = 10 ⇒ N > 312
 d = 100 ⇒ N > 2290
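As a quick check (not on the slide), these values of N follow from the Haussler sample-complexity bound with |H| = 3^d:

```python
from math import log

eps, delta = 0.05, 0.01
for d in (5, 10, 100):
    H_size = 3 ** d                            # conjunctions of up to d literals
    N = (log(H_size) + log(1 / delta)) / eps   # Haussler sample complexity
    print(f"d = {d:3d}:  N > {N:.1f}")         # ~202.0, ~311.8, ~2289.3
```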

SLIDE 28
Example: decision trees of limited depth

 Consider a Boolean classification problem.
 Instances: vectors of d Boolean features.
 Hypothesis space: decision trees of depth 2.
 How many training examples N are needed so that "any consistent learner using H will, with probability ≥ 0.99, output a hypothesis with E_true ≤ 0.05"?
 Here δ = 0.01, ε = 0.05, and |H| = 16 × d × (d − 1)²:
 d = 4 ⇒ N > 184
 d = 4 ⇒ N > 219
 d = 10 ⇒ N > 281
 d = 100 ⇒ N > 423
 d = 1000 ⇒ N > 562

SLIDE 29
Limitations of the Haussler'88 bound

 It assumes there are consistent classifiers in the hypothesis space: some h with E_train(h) = 0.
 It depends on the size of the hypothesis space:
 what if |H| is too big, or H is continuous?

SLIDE 30
Limitation of the bounds

 So far, we have found bounds for two cases:
 Haussler's bound, under the assumption ∃h ∈ H with E_train(h) = 0
 Hoeffding's bound
 If H = {h | h: X → Y} is infinite, we need a measure of complexity other than |H|:
 the largest subset of X for which H can guarantee zero training error (regardless of the target function)
 The VC dimension of H is the size of this subset.

SLIDE 31
Definitions

 Dichotomy:
 an N-tuple of ±1 labels assigned to the samples x^(1), …, x^(N) ∈ X
 The dichotomies generated by H on the data points x^(1), …, x^(N):

H(x^(1), …, x^(N)) = { (h(x^(1)), …, h(x^(N))) | h ∈ H }

 The growth function of a hypothesis set H is defined as:

m_H(N) = max over x^(1), …, x^(N) ∈ X of |H(x^(1), …, x^(N))|

SLIDE 32
Shattering a set of instances

 In general, m_H(N) ≤ 2^N.
 A set x^(1), …, x^(N) is shattered by H iff for every labeling of these samples there exists some hypothesis in H consistent with that labeling
 (i.e., there exist hypotheses in H that can realize every labeling).
 If some set of N points is shattered, then m_H(N) = 2^N:
 H is as diverse as it can be on that particular sample.

SLIDE 33
Perceptron in a 2-dim feature space

 H = { h : if w0 + w1 x1 + w2 x2 > 0 then y = 1 }

SLIDE 34
Polynomial bound on m_H(N)

 Break point: if no data set of size k can be shattered by H, then k is said to be a break point for H: m_H(k) < 2^k.
 We can bound m_H(N) for all values of N by a simple polynomial based on this break point.
 Theorem (Sauer's lemma): if m_H(k) < 2^k for some value k, then

m_H(N) ≤ Σ_{i=0}^{k−1} C(N, i)

 The maximum power of N in this bound is N^(k−1).

SLIDE 35
Break point: Example

 Example: no set of 4 points can be shattered by the two-dimensional perceptron, so k = 4 is a break point.
 This puts a significant constraint on the number of dichotomies that can be realized by the perceptron on 5 or more points.

SLIDE 36
Growth function example: 1-D intervals

 c: X → {0, 1}. What is the growth function of:
 Positive rays — H1 (open intervals to the right): if x > a then y = 1, else y = 0

m_{H1}(N) = N + 1

 Positive intervals — H2 (inside an interval): if a < x < b then y = 1, else y = 0

m_{H2}(N) = C(N+1, 2) + 1
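These two growth functions can also be verified by brute force. The sketch below (illustrative, not from the slides) enumerates all dichotomies that positive rays and positive intervals produce on N distinct points:

```python
from math import comb

def dichotomies_positive_rays(points):
    # h_a(x) = 1 iff x > a; sliding the threshold a across the gaps yields every dichotomy
    s = sorted(points)
    thresholds = [s[0] - 1] + [(s[i] + s[i + 1]) / 2 for i in range(len(s) - 1)] + [s[-1] + 1]
    return {tuple(int(x > a) for x in points) for a in thresholds}

def dichotomies_positive_intervals(points):
    # h_{a,b}(x) = 1 iff a < x < b; try every pair of gap positions as interval ends
    s = sorted(points)
    cuts = [s[0] - 1] + [(s[i] + s[i + 1]) / 2 for i in range(len(s) - 1)] + [s[-1] + 1]
    return {tuple(int(a < x < b) for x in points) for a in cuts for b in cuts}

N = 5
pts = list(range(N))   # any N distinct points achieve the maximum for these two classes
print(len(dichotomies_positive_rays(pts)), "==", N + 1)                     # 6 == 6
print(len(dichotomies_positive_intervals(pts)), "==", comb(N + 1, 2) + 1)   # 16 == 16
```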

SLIDE 37
Generalization bound using the growth function

 Vapnik-Chervonenkis inequality:

Pr[ |E_true(h) − E_train(h)| > ε ] ≤ 4 m_H(2N) e^(−ε²N/8)

 With probability at least 1 − δ, every h ∈ H satisfies

E_true(h) ≤ E_train(h) + √( (8 ln m_H(2N) + 8 ln(4/δ)) / N )

 In many cases this bound is tighter than the previous bound, even for finite hypothesis spaces.
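A small sketch (not from the slides) of how this bound can be evaluated, combining it with the polynomial bound m_H(2N) ≤ (2N)^VC(H) + 1 derived on a later slide; the choices VC(H) = 3, N, and δ are arbitrary:

```python
from math import log, sqrt

def vc_bound(vc_dim: int, N: int, delta: float) -> float:
    """sqrt((8 ln m_H(2N) + 8 ln(4/delta)) / N), with m_H(2N) bounded by (2N)^vc_dim + 1."""
    m_H_2N = (2 * N) ** vc_dim + 1
    return sqrt((8 * log(m_H_2N) + 8 * log(4 / delta)) / N)

print(vc_bound(vc_dim=3, N=10_000, delta=0.05))  # ~= 0.165
```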

SLIDE 38
m_H(N) relates to overlaps

(Figure-only slide.)

SLIDE 39
Vapnik-Chervonenkis (VC) dimension

 The smaller the break point, the tighter the bound.
 VC dimension VC(H): the size of the largest set of samples that can be shattered by H.
 VC(H) is the largest value of N for which m_H(N) = 2^N.
 In order to prove that VC(H) = k:
 there is at least one set of k points that H can shatter, and
 no set of k + 1 points can be shattered:
 for every set of k + 1 points, there exists a labeling that H cannot realize.

SLIDE 40
VC dimension: 1-D intervals

 c: X → {0, 1}. What is the VC dimension of:
 Positive rays — H1 (open intervals to the right): if x > a then y = 1, else y = 0

VC(H1) = 1,  m_{H1}(N) = N + 1

 Positive intervals — H2 (inside an interval): if a < x < b then y = 1, else y = 0

VC(H2) = 2,  m_{H2}(N) = C(N+1, 2) + 1

SLIDE 41
Bound on m_H(N) using the VC dimension

 Since k = VC(H) + 1 is a break point for H:

m_H(N) ≤ Σ_{i=0}^{VC(H)} C(N, i)

 Moreover, Σ_{i=0}^{k} C(N, i) ≤ N^k + 1, so:

m_H(N) ≤ N^(VC(H)) + 1
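A quick numerical illustration (not from the slides) of the inequality Σ_{i=0}^{k} C(N, i) ≤ N^k + 1 used above, for a few values:

```python
from math import comb

for N in (5, 10, 20):
    for k in (1, 2, 3):
        lhs = sum(comb(N, i) for i in range(k + 1))   # sum_{i=0}^{k} C(N, i)
        rhs = N ** k + 1                              # N^k + 1
        print(f"N={N:2d} k={k}:  {lhs:5d} <= {rhs:5d}")
```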

SLIDE 42
VC dimension: Perceptron in a 2-D space

(Figure: one configuration of points can be shattered by linear boundaries; another configuration cannot.)

 However, for the VC dimension we seek the set of points with the most possible dichotomies.

SLIDE 43
VC dimension: Perceptron in a 2-D space

 Some set of 3 points can be shattered, so VC(H) ≥ 3.
 No set of 4 points in a 2-D space can be shattered by the perceptron, so VC(H) ≤ 3.
 ⇒ VC(H) = 3
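The VC(H) ≥ 3 part can be checked mechanically: for three non-collinear points, every one of the 2³ labelings is realized by some linear boundary. A brute-force sketch (not from the slides), with an arbitrary choice of three points:

```python
from itertools import product
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # three non-collinear points
X = np.hstack([np.ones((3, 1)), points])                   # rows [1, x1, x2]

all_realized = True
for labels in product([-1.0, 1.0], repeat=3):
    y = np.array(labels)
    w = np.linalg.solve(X, y)            # X is invertible for non-collinear points
    all_realized &= bool(np.all(np.sign(X @ w) == y))

print("all 8 dichotomies realized:", all_realized)          # True, hence VC >= 3
```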

SLIDE 44
VC dimension of the Perceptron

 d = 2 ⟹ VC = 3
 In general, VC = d + 1

SLIDES 45-47

(Figure-only slides: the construction of d + 1 points that the perceptron can shatter, establishing VC ≥ d + 1.)

SLIDE 48
For any d + 2 points

 Take any d + 2 points x^(1), …, x^(d+1), x^(d+2).
 Since we have more points than dimensions:

∃m:  x^(m) = Σ_{n≠m} a_n x^(n)

where not all of the a_n are zero.

SLIDE 49
For any d + 2 points, we cannot reach all dichotomies

 x^(m) = Σ_{n≠m} a_n x^(n)  ⇒  w^T x^(m) = Σ_{n≠m} a_n w^T x^(n)
 Consider the dichotomy that assigns y^(n) = sign(a_n) to the points x^(n), n ≠ m, and y^(m) = −1.
 If y^(n) = sign(w^T x^(n)) = sign(a_n), then a_n w^T x^(n) > 0.
 This forces w^T x^(m) = Σ_{n≠m} a_n w^T x^(n) > 0.
 Therefore y^(m) = sign(w^T x^(m)) = +1: the label −1 for x^(m) cannot be realized, so no set of d + 2 points can be shattered.

SLIDE 50
VC dimension of the perceptron in d-dimensional space

 We showed that VC ≥ d + 1 and VC ≤ d + 1, thus VC = d + 1.
 For the perceptron, the VC dimension equals the number of parameters (w0, w1, …, wd).

SLIDE 51
Other examples

 Positive rays (threshold a)
 Positive intervals (endpoints a, b)

SLIDE 52
VC dimension as degrees of freedom

 Parameters create degrees of freedom.
 The VC dimension acts as the effective number of degrees of freedom:
 how expressive the model is
 not just the number of parameters, but the effective number of parameters

SLIDE 53
VC(H) = ∞

 If m_H(N) = 2^N for all N, then VC(H) = ∞.
 If VC(H) = ∞, then no matter how large the data set is, we cannot draw generalization conclusions from the VC analysis.

SLIDE 54
Consistent learning

 E_true converges to E_train as N increases.

(Figure: learning curves of E_train and E_true as functions of N.)

SLIDE 55
Vapnik's main theorem

 A model is consistent if and only if H has finite VC dimension.
 A finite VC dimension not only guarantees consistency; it is the only way to build a model that generalizes.

SLIDE 56
Main result

 No break point ⟹ m_H(N) = 2^N
 Any break point ⟹ m_H(N) is polynomial in N
 Finite VC(H) ⇒ g ∈ H will generalize

SLIDE 57
VC dimension and learning

 Independent of the learning algorithm
 Independent of the target function
 Independent of the input distribution

SLIDE 58
Practical issues

 The obtained bounds are loose.
 Although a bound is loose, it can still be useful for comparing the generalization of different methods.
 In real applications, models with lower VC dimension tend to generalize better.

SLIDE 59
Practical: how many samples do I need?

 Rule of thumb: require N to be at least 10 × VC(H) to get decent generalization.

SLIDE 60
VC vs. bias-variance

(Figure: expected learning curves, with E_true = E_D[E_true(g_D)] and E_train = E_D[E_train(g_D)].)

SLIDE 61
Summary of PAC bounds

With probability ≥ 1 − δ:

 Finite hypothesis space, for all h ∈ H such that E_train(h) = 0:

E_true(h) ≤ ε = (ln|H| + ln(1/δ)) / N

 Finite hypothesis space, for all h ∈ H:

|E_true(h) − E_train(h)| ≤ ε = √( (ln(2|H|) + ln(1/δ)) / (2N) )

 Infinite hypothesis space, for all h ∈ H:

|E_true(h) − E_train(h)| ≤ ε = √( (8 ln m_H(2N) + 8 ln(4/δ)) / N )

SLIDE 62
Using PAC bounds for model selection

 Consider nested model spaces H_1, H_2, …, H_k, … in order of increasing complexity:
 Finite hypothesis spaces: |H_1| ≤ |H_2| ≤ … ≤ |H_k| ≤ …
 Infinite hypothesis spaces: VC(H_1) ≤ VC(H_2) ≤ … ≤ VC(H_k) ≤ …
 For each hypothesis space H_k we know that, with high probability (≥ 1 − δ_k), for all h ∈ H_k:

E_true(h) ≤ E_train(h) + ε(H_k)

 ε(H_k) is a capacity term that depends on |H_k| or VC(H_k).
 As the complexity k increases, E_train decreases but ε(H_k) increases (bias-variance tradeoff).

SLIDE 63
Model selection by SRM

 Trade-off between hypothesis space complexity and the quality of fitting the training data.
 SRM finds the subset of functions which minimizes the bound on the true error (risk).
 The bound on E_true(h) is E_train(h) + ε(h), where ε(h) is the capacity term; for a space with VC dimension VC(H):

E_true(h) < E_train(h) + √( ( VC(H) (ln(2N / VC(H)) + 1) + ln(4/δ) ) / N )

(Figure: error versus nested spaces H_1 ⊂ H_2 ⊂ H_3 ⊂ H_4 — E_train(h) decreases, the capacity term ε(h) grows, and their sum, the bound on E_true(h), is minimized at an intermediate complexity.)

SLIDE 64
Model selection by SRM

 Structural Risk Minimization (SRM):
 Within each model space H_k, find the best hypothesis using Empirical Risk Minimization (ERM):

ĥ_k = argmin over h ∈ H_k of E_train(h)

 Choose the model space that minimizes the upper bound on E_true:

k̂ = argmin over k ≥ 1 of [ E_train(ĥ_k) + ε(H_k) ]

 The final hypothesis is ĥ = ĥ_k̂.
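A minimal SRM sketch under stated assumptions (not from the slides): each nested space is represented only by a hypothetical VC dimension and the training error of its ERM solution (made-up numbers), and SRM selects the space that minimizes E_train(ĥ_k) + ε(H_k):

```python
from math import log, sqrt

def capacity_term(vc_dim: int, N: int, delta: float) -> float:
    # capacity term from the SRM slide: sqrt((VC (ln(2N/VC) + 1) + ln(4/delta)) / N)
    return sqrt((vc_dim * (log(2 * N / vc_dim) + 1) + log(4 / delta)) / N)

N, delta = 2000, 0.05
# hypothetical (VC dimension, ERM training error) pairs for nested spaces H_1, H_2, ...
models = [(2, 0.20), (5, 0.10), (12, 0.06), (40, 0.05), (150, 0.04)]

bounds = [(e_train + capacity_term(vc, N, delta), vc) for vc, e_train in models]
best_bound, best_vc = min(bounds)
print(f"SRM selects the space with VC = {best_vc}; bound on E_true = {best_bound:.3f}")
```

With these illustrative numbers SRM picks an intermediate complexity: richer spaces fit the training data better, but their capacity term eventually dominates the bound.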

SLIDE 65
Summary

 PAC bounds on the true error in terms of the training error and the complexity of the hypothesis space:
 bound for a perfectly consistent learner (E_train(h*) = 0)
 bound for agnostic learning (E_train(h*) > 0)
 |H| = ∞ ⇒ VC dimension
 VC analysis provides much tighter bounds in many cases
 The complexity of the classifier depends on the number of points that can be classified exactly:
 finite case: number of hypotheses
 infinite case: VC dimension
 SRM:
 bias-variance tradeoff in learning theory
 model selection using SRM
 Bounds are often too loose in practice.

SLIDE 66
References

 T. Mitchell, "Machine Learning", McGraw-Hill, 1997, Chapter 7.
 Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, "Learning From Data", AMLBook, 2012, Chapter 2.