Learning Theory
CE-717: Machine Learning
Sharif University of Technology
M. Soleymani
Fall 2016

Topics:
- Feasibility of learning
- PAC learning
- VC dimension
- Structural Risk Minimization (SRM)
$\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(N)}$ and $(\boldsymbol{x}^{(1)}, y^{(1)}), \ldots, (\boldsymbol{x}^{(N)}, y^{(N)})$
$\boldsymbol{x}$ is drawn at random from an unknown distribution $P(\boldsymbol{x})$; the teacher provides a noise-free label $c(\boldsymbol{x})$ for it.
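A minimal sketch of this data-generating protocol; the uniform distribution and the threshold concept below are illustrative stand-ins for the unknown $P(\boldsymbol{x})$ and target $c$, not anything from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def c(x):
    # Hypothetical noise-free target concept, assumed for illustration:
    # label is 1 iff the first feature exceeds 0.5.
    return int(x[0] > 0.5)

# Each x is drawn i.i.d. from an unknown P(x); here a uniform distribution
# on [0, 1]^2 stands in for P.
N = 5
X = rng.uniform(0.0, 1.0, size=(N, 2))
data = [(x, c(x)) for x in X]  # labeled sample (x^(i), y^(i)) with y = c(x)
print(data)
```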
Answer: ≈ 0.1%
Answer: ≈ 63%
$\Pr(h \text{ labels one data point correctly} \mid E_{true}(h) > \epsilon) \le 1 - \epsilon$
$\Pr(h \text{ labels } N \text{ i.i.d. data points correctly} \mid E_{true}(h) > \epsilon) \le (1 - \epsilon)^N$
$E_{train}(h_1) = 0,\ E_{train}(h_2) = 0,\ \ldots,\ E_{train}(h_k) = 0$

$\Pr(\exists h \in \mathcal{H}:\ E_{true}(h) > \epsilon \wedge E_{train}(h) = 0)$
$\quad = \Pr\big((E_{true}(h_1) > \epsilon \wedge E_{train}(h_1) = 0) \vee \cdots \vee (E_{true}(h_k) > \epsilon \wedge E_{train}(h_k) = 0)\big)$
$\quad \le \sum_{i=1}^{k} \Pr(E_{true}(h_i) > \epsilon \wedge E_{train}(h_i) = 0)$
$\quad \le \sum_{i=1}^{k} (1 - \epsilon)^N$
$\quad \le |\mathcal{H}| (1 - \epsilon)^N \le |\mathcal{H}|\, e^{-\epsilon N}$
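The bound can be checked empirically. Below is a Monte Carlo sketch, assuming (as an illustrative choice, not from the slides) a finite class of 1D threshold classifiers and a uniform $P(x)$, so that $E_{true}$ is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative finite class: thresholds h_a(x) = 1[x >= a] on a grid.
grid = np.linspace(0.0, 1.0, 101)            # |H| = 101
a_star = 0.5                                 # true concept, contained in H
N, eps, trials = 50, 0.1, 5000

# Under x ~ Uniform[0,1], E_true(h_a) = |a - a_star| in closed form.
e_true = np.abs(grid - a_star)

bad = 0
for _ in range(trials):
    x = rng.uniform(0.0, 1.0, size=N)
    y = (x >= a_star).astype(int)
    preds = (x[None, :] >= grid[:, None]).astype(int)  # |H| x N predictions
    consistent = (preds == y).all(axis=1)              # E_train(h_a) = 0
    # Bad event: some consistent hypothesis still has E_true > eps.
    bad += int(np.any(consistent & (e_true > eps)))

print("empirical Pr[bad event]:", bad / trials)
print("bound |H| e^{-eps N}   :", len(grid) * np.exp(-eps * N))
```

With these numbers the bound evaluates to $101 \cdot e^{-5} \approx 0.68$, while the empirical frequency of the bad event comes out far smaller, illustrating that the union bound holds but is loose.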
$N$ grows linearly in $1/\epsilon$ and only logarithmically in $|\mathcal{H}|$.
Example ($d = 5$ boolean features): if $\boldsymbol{x} = [0\ ?\ 1\ ?\ ?]$ then $y = 1$ else $y = 0$
$d = 5 \Rightarrow N > 201$
$d = 10 \Rightarrow N > 312$
$d = 100 \Rightarrow N > 2290$
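These figures follow from $N \ge \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)$ with $|\mathcal{H}| = 3^d$ for conjunctions (each feature appears positively, negatively, or not at all). A small calculator, assuming $\epsilon = 0.05$ and $\delta = 0.01$ (values inferred here because they reproduce the slide's numbers, up to rounding):

```python
import math

def pac_sample_size(H_size, eps, delta):
    # N >= (1/eps) * (ln|H| + ln(1/delta)): with probability >= 1 - delta,
    # every hypothesis consistent with N examples has E_true <= eps.
    return (math.log(H_size) + math.log(1.0 / delta)) / eps

# Conjunctions over d boolean features: |H| = 3^d.
for d in (5, 10, 100):
    print(d, math.ceil(pac_sample_size(3 ** d, eps=0.05, delta=0.01)))
# -> 202, 312, 2290 (the slide's N > 201, N > 312, N > 2290, up to rounding)
```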
Instances: vectors of $d$ boolean features
$d = 2 \Rightarrow N > 184$
$d = 4 \Rightarrow N > 219$
$d = 10 \Rightarrow N > 281$
$d = 100 \Rightarrow N > 423$
$d = 1000 \Rightarrow N > 562$
Haussler's bound (with the assumption $\exists h \in \mathcal{H}:\ E_{train}(h) = 0$)
Hoeffding's bound
The largest subset of $\mathcal{X}$ for which $\mathcal{H}$ can guarantee zero training error, whatever the labeling of its points.
The VC dimension of $\mathcal{H}$ is the size of this subset.
$H_1$ (open intervals to the right): if $x > a$ then $y = 1$ else $y = 0$
$H_2$ (inside intervals): if $a < x < b$ then $y = 1$ else $y = 0$
$VC(\mathcal{H})$ is the largest value of $N$ for which $m_{\mathcal{H}}(N) = 2^N$ (where $m_{\mathcal{H}}$ is the growth function).
There is at least one set of size $k$ that $\mathcal{H}$ can shatter, and there is no set of $k + 1$ points that $\mathcal{H}$ can shatter:
for every set of $k + 1$ points, there is some labeling that $\mathcal{H}$ cannot realize.
$H_1$ (open intervals to the right): if $x > a$ then $y = 1$ else $y = 0$
$H_2$ (inside intervals): if $a < x < b$ then $y = 1$ else $y = 0$
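Since both classes live on the real line, shattering can be checked by brute force on a fixed sample. A sketch with hypothetical helper functions (enumerating the finitely many dichotomies each class can realize on the sample) that recovers $VC(H_1) = 1$ and $VC(H_2) = 2$; the size of each dichotomy set is exactly the growth function $m_{\mathcal{H}}(N)$ on that sample:

```python
from itertools import combinations

def dichotomies_H1(points):
    # All labelings h_a(x) = 1[x > a] realizes on `points`: only a threshold
    # below every point, or at one of the points, gives a distinct labeling.
    cand = [min(points) - 1.0] + sorted(points)
    return {tuple(int(x > a) for x in points) for a in cand}

def dichotomies_H2(points):
    # All labelings h_{a,b}(x) = 1[a < x < b] realizes on `points`: endpoints
    # only matter up to the gaps between points (a == b gives the empty
    # interval, i.e. the all-zeros labeling).
    p = sorted(points)
    cand = [p[0] - 1.0] + [(u + v) / 2 for u, v in zip(p, p[1:])] + [p[-1] + 1.0]
    return {tuple(int(a < x < b) for x in points)
            for a in cand for b in cand if a <= b}

def shatters(dich_fn, points):
    # Shattered iff all 2^N labelings are realized; note that
    # len(dich_fn(points)) is the growth function m_H(N) on this sample.
    return len(dich_fn(points)) == 2 ** len(points)

pts = [0.1, 0.4, 0.7]
for name, fn in [("H1", dichotomies_H1), ("H2", dichotomies_H2)]:
    for k in (1, 2, 3):
        ok = any(shatters(fn, list(s)) for s in combinations(pts, k))
        print(name, k, ok)
# H1 shatters a 1-point set but no 2-point set -> VC(H1) = 1
# H2 shatters a 2-point set but no 3-point set -> VC(H2) = 2
```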
$VC(\mathcal{H}) \le 3$
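Such claims can also be checked mechanically: linear separability of a labeled point set is an LP feasibility problem. A sketch (assuming scipy is available; the point sets are illustrative choices) showing that a non-collinear triple in 2D is shattered while the four corners of a square are not, consistent with $VC(\mathcal{H}) = 3$ for linear classifiers in 2D:

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def linearly_separable(X, labels):
    # Feasibility LP: find (w, b) with y_i (w . x_i + b) >= 1 for all i,
    # where y_i in {-1, +1}; infeasible <=> labeling not realizable.
    y = np.where(np.asarray(labels) == 1, 1.0, -1.0)
    A = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    res = linprog(c=np.zeros(X.shape[1] + 1), A_ub=A, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * (X.shape[1] + 1), method="highs")
    return res.success

def shattered(X):
    # Shattered iff every one of the 2^N labelings is linearly separable.
    return all(linearly_separable(X, lab)
               for lab in product([0, 1], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # non-collinear triple
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # square
print(shattered(three))  # True: 3 points can be shattered
print(shattered(four))   # False: the XOR labeling is not separable
```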
[Figure: $E_{train}$]
Finite hypothesis spaces: $|\mathcal{H}_1| \le |\mathcal{H}_2| \le \cdots \le |\mathcal{H}_k| \le \cdots$
Infinite hypothesis spaces: $VC(\mathcal{H}_1) \le VC(\mathcal{H}_2) \le \cdots \le VC(\mathcal{H}_k) \le \cdots$
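A sketch of the resulting model-selection rule, using the common form of the VC confidence term; the nested models and their numbers below are made up for illustration:

```python
import math

def vc_confidence(vc, N, delta=0.05):
    # Standard VC confidence term:
    # sqrt((vc * (ln(2N/vc) + 1) + ln(4/delta)) / N)
    return math.sqrt((vc * (math.log(2 * N / vc) + 1)
                      + math.log(4 / delta)) / N)

def srm_select(models, N):
    # models: (name, E_train, VC dim) for nested H_1 <= H_2 <= ...
    # SRM minimizes the risk bound E_train + confidence, not E_train alone.
    return min(models, key=lambda m: m[1] + vc_confidence(m[2], N))

# Hypothetical nested models: richer classes fit the training data better
# but pay a larger capacity penalty.
models = [("H1", 0.20, 3), ("H2", 0.10, 10), ("H3", 0.05, 50), ("H4", 0.02, 200)]
N = 1000
for name, err, vc in models:
    print(name, err, round(err + vc_confidence(vc, N), 3))
print("SRM picks:", srm_select(models, N)[0])  # H2 here: the best tradeoff
```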
- Bound for a perfectly consistent learner ($E_{train}(h^*) = 0$)
- Bound for agnostic learning ($E_{train}(h^*) > 0$)
- $|\mathcal{H}| = \infty$ ⇒ VC dimension; VC provides much tighter bounds in many cases
- Finite case: number of hypotheses; infinite case: VC dimension
- Bias-variance tradeoff in learning theory
- Model selection using SRM
- Bounds are often too loose in practice