Learning From Data, Lecture 5: Training Versus Testing (PowerPoint PPT Presentation)



SLIDE 1

Learning From Data Lecture 5 Training Versus Testing

The Two Questions of Learning Theory of Generalization (Ein ≈ Eout) An Effective Number of Hypotheses A Combinatorial Puzzle

  • M. Magdon-Ismail

CSCI 4100/6100

SLIDE 2

recap: The Two Questions of Learning

  • 1. Can we make sure that Eout(g) is close enough to Ein(g)?
  • 2. Can we make Ein(g) small enough?

The Hoeffding generalization bound:

Eout(g) ≤ Ein(g) + √( (1/(2N)) ln(2|H|/δ) )

out-of-sample error ≤ in-sample error + model complexity; the square-root term is the generalization error bar.

[Figure: Error versus |H|; the best tradeoff is at some |H|∗.]

Ein: training (e.g. the practice exam); Eout: testing (e.g. the real exam)

There is a tradeoff when picking |H|.
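The tradeoff can be made concrete numerically; a small sketch (the function name and sample sizes are illustrative choices, not from the lecture) evaluating the error bar for several sizes of H:

```python
import math

def hoeffding_error_bar(N, M, delta=0.05):
    """The generalization error bar: sqrt((1/(2N)) * ln(2M/delta)) for |H| = M."""
    return math.sqrt(math.log(2 * M / delta) / (2 * N))

# With N = 1000 examples, a richer hypothesis set (larger |H| = M)
# widens the error bar, even though it may let Ein be smaller.
bars = {M: hoeffding_error_bar(1000, M) for M in (1, 100, 10**6)}
```

The bar grows only logarithmically in M, which is why a moderately larger H is often worth the price.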

© AML Creator: Malik Magdon-Ismail

Training Versus Testing: 2/18

Goal of generalization theory →

SLIDE 3

What Will The Theory of Generalization Achieve?

The finite-H bound:

Eout(g) ≤ Ein(g) + √( (1/(2N)) ln(2|H|/δ) )

The new bound:

Eout(g) ≤ Ein(g) + √( (8/N) ln(4 mH(2N)/δ) )

In both, out-of-sample error ≤ in-sample error + model complexity.

[Figure: Error versus model complexity for each bound.]

The new bound will be applicable to infinite H.
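A sketch of why the new bound is useful, assuming the growth-function argument is mH(2N) as in the VC bound and taking a polynomial growth function such as mH(N) = N + 1 (the positive ray model later in the lecture):

```python
import math

def new_error_bar(N, mH, delta=0.05):
    """The new bound's error bar: sqrt((8/N) * ln(4 * mH(2N) / delta))."""
    return math.sqrt(8 / N * math.log(4 * mH(2 * N) / delta))

mH = lambda n: n + 1          # a polynomial growth function (positive rays)
bars = [new_error_bar(N, mH) for N in (100, 10_000, 1_000_000)]
# The bar shrinks toward 0 as N grows, even though |H| is infinite.
```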

|H| is overkill →

SLIDE 4

Why is |H| an Overkill

How did |H| come in? Bad events:

Bg = {|Eout(g) − Ein(g)| > ε},  Bm = {|Eout(hm) − Ein(hm)| > ε}

We do not know which g, so use a worst case union bound.

P[Bg] ≤ P[any Bm] ≤ Σ_{m=1}^{|H|} P[Bm].

[Figure: Venn diagram of overlapping events B1, B2, B3.]

  • Bm are events (sets of outcomes); they can overlap.
  • If the Bm overlap, the union bound is loose.
  • If many hm are similar, the Bm overlap.
  • There are “effectively” fewer than |H| hypotheses.
  • We can replace |H| by something smaller.

|H| fails to account for similarity between hypotheses.
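The looseness is easy to see in simulation. A hypothetical setup of my own construction: M copies of one and the same hypothesis, whose Ein averages N fair coin flips against Eout = 0.5:

```python
import random

random.seed(0)
N, eps, trials, M = 100, 0.1, 2000, 1000

def bad_event():
    """Bm occurs when |Ein - Eout| > eps, with Eout = 0.5 and Ein from N flips."""
    ein = sum(random.random() < 0.5 for _ in range(N)) / N
    return abs(ein - 0.5) > eps

p_bad = sum(bad_event() for _ in range(trials)) / trials

# All M hypotheses are identical, so P[any Bm] = p_bad; the union
# bound charges M * p_bad instead, which here exceeds 1 -- hopelessly loose.
union_bound = M * p_bad
```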

Measuring diversity on N points →

SLIDE 5

Measuring the Diversity (Size) of H

We need a way to measure the diversity of H. A simple idea: Fix any set of N data points. If H is diverse it should be able to implement all functions . . . on these N points.

Example: large H →

SLIDE 6

A Data Set Reveals the True Colors of an H

[Figure: the hypothesis set H.]

. . . through the eyes of D →

SLIDE 7

A Data Set Reveals the True Colors of an H

[Figure: H through the eyes of D.]

Just one dichotomy →

SLIDE 8

A Data Set Reveals the True Colors of an H

From the point of view of D, the entire H is just one dichotomy.

An effective number of hypotheses →

SLIDE 9

An Effective Number of Hypotheses

If H is diverse it should be able to implement many dichotomies. |H| only captures the maximum possible diversity of H. Consider an h ∈ H and a data set x1, . . . , xN. h gives us an N-tuple of ±1’s:

(h(x1), . . . , h(xN)), a dichotomy of the inputs.

If H is diverse, we get many different dichotomies. If H contains similar functions, we only get a few dichotomies.

The growth function quantifies this.
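For instance (a toy 1-D example of my own, with hypothetical threshold hypotheses), two nearby thresholds induce the identical dichotomy while a distant one differs:

```python
xs = [1.0, 2.0, 3.0, 4.0]                # a fixed data set x1, ..., xN

def dichotomy(h, xs):
    """The N-tuple (h(x1), ..., h(xN)) of +/-1 values."""
    return tuple(1 if h(x) > 0 else -1 for x in xs)

h1 = lambda x: x - 2.5                   # threshold at 2.5
h2 = lambda x: x - 2.6                   # a *similar* hypothesis
h3 = lambda x: x - 3.5                   # a genuinely different one

d1, d2, d3 = (dichotomy(h, xs) for h in (h1, h2, h3))
# d1 == d2: similar hypotheses collapse into one dichotomy; d3 is new.
```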

Growth function →

SLIDE 10

The Growth Function mH(N)

Define the restriction of H to the inputs x1, x2, . . . , xN:

H(x1, . . . , xN) = {(h(x1), . . . , h(xN)) | h ∈ H}

(the set of dichotomies induced by H)

The Growth Function mH(N): the largest number of dichotomies induced by H on any N points,

mH(N) = max over x1, . . . , xN of |H(x1, . . . , xN)|.

mH(N) ≤ 2^N. Can we replace |H| by mH, an effective number of hypotheses?

  • Replacing |H| with 2^N is no help in the bound. (why?)
  • The error bar is √( (1/(2N)) ln(2|H|/δ) ).
  • We want mH(N) ≤ poly(N) to get a useful error bar.
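Why 2^N is no help: plugging mH(N) = 2^N into the bar leaves it bounded below by √(ln 2 / 2) ≈ 0.59, while a polynomial mH sends it to zero. A quick check (function and variable names are mine):

```python
import math

def error_bar(N, m, delta=0.05):
    """sqrt((1/(2N)) * ln(2m/delta)) with m in place of |H|."""
    return math.sqrt(math.log(2 * m / delta) / (2 * N))

exp_bars = [error_bar(N, 2 ** N) for N in (10, 100, 1000)]   # stuck above 0.59
poly_bars = [error_bar(N, N + 1) for N in (10, 100, 1000)]   # shrinks toward 0
```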

Example: 2-d perceptron →

slide-11
SLIDE 11

Example: 2-D Perceptron Model

[Figure: dichotomies of 4 points the perceptron cannot implement; all 8 dichotomies of 3 points it can implement; of the dichotomies of 4 points it can implement at most 14.]

mH(3) = 8 = 2^3.  mH(4) = 14 < 2^4.  What is mH(5)?
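A Monte Carlo check (my own sketch; strictly it only lower-bounds the dichotomy count) that three points in general position are shattered by the 2-D perceptron:

```python
import random

random.seed(0)
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]     # 3 points in general position

# Sample random perceptrons h(x) = sign(w . x + b) and record the
# dichotomies they induce on the 3 points.
dichotomies = set()
for _ in range(20000):
    w1, w2, b = random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1)
    dichotomies.add(tuple(1 if w1 * x + w2 * y + b > 0 else -1
                          for x, y in pts))
# All 8 = 2^3 labelings show up, so mH(3) = 8 for these points.
```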

Example: 1-d positive ray →

SLIDE 12

Example: 1-D Positive Ray Model

[Figure: points x1, x2, . . . , xN on a line, with h(x) = +1 to the right of the threshold w0.]

  • h(x) = sign(x − w0)
  • Consider N points.
  • There are N + 1 dichotomies depending on where you put w0.
  • mH(N) = N + 1.
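The count mH(N) = N + 1 can be verified by enumeration; a sketch (w0 need only be tried once per gap between the sorted points):

```python
def ray_dichotomies(xs):
    """All dichotomies of h(x) = sign(x - w0) over choices of w0."""
    xs = sorted(xs)
    # One candidate w0 per gap: left of all points, between neighbors, right of all.
    cuts = [xs[0] - 1] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1]
    return {tuple(1 if x > c else -1 for x in xs) for c in cuts}

counts = [len(ray_dichotomies(list(range(n)))) for n in range(1, 7)]
# counts == [2, 3, 4, 5, 6, 7], i.e. mH(N) = N + 1.
```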

Example: 2-d positive rectangle →

SLIDE 13

Example: Positive Rectangles in 2-D

[Figure: N = 4: H implements all dichotomies. N = 5: some point will be inside a rectangle defined by the others.]

mH(4) = 2^4 = 16.  mH(5) < 2^5.  We have not computed mH(5); not impossible, but tricky.
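Brute force confirms both claims; a sketch of my own (the diamond layout is my choice of 4 points, and candidate rectangle edges need only pass just beside each coordinate):

```python
from itertools import combinations

def rectangle_dichotomies(pts):
    """Dichotomies induced by positive axis-aligned rectangles (+1 inside)."""
    xs = sorted({p[0] + d for p in pts for d in (-0.5, 0.5)})
    ys = sorted({p[1] + d for p in pts for d in (-0.5, 0.5)})
    dich = set()
    for x1, x2 in combinations(xs, 2):
        for y1, y2 in combinations(ys, 2):
            dich.add(tuple(1 if x1 <= px <= x2 and y1 <= py <= y2 else -1
                           for px, py in pts))
    return dich

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]
n4 = len(rectangle_dichotomies(diamond))             # 16 = 2^4: shattered
n5 = len(rectangle_dichotomies(diamond + [(0, 0)]))  # fewer than 2^5 = 32
```

The added fifth point sits inside every rectangle that contains the four outer points, so at least one dichotomy is unachievable.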

The growth functions summarized →

SLIDE 14

Example Growth Functions

N                     1   2   3   4    5     · · ·
2-D perceptron        2   4   8   14   · · ·
1-D pos. ray          2   3   4   5    · · ·
2-D pos. rectangles   2   4   8   16   <2^5  · · ·

  • mH(N) drops below 2^N – there is hope for the generalization bound.
  • A break point is any n for which mH(n) < 2^n.
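A minimal check of the definition, using the growth-function values from the table (the helper name is mine):

```python
def break_point(mH_values):
    """Smallest n with mH(n) < 2^n, given mH(1), mH(2), ... in order."""
    for n, m in enumerate(mH_values, start=1):
        if m < 2 ** n:
            return n
    return None   # no break point among the listed values

ray_bp = break_point([n + 1 for n in range(1, 6)])   # mH(n) = n + 1
perceptron_bp = break_point([2, 4, 8, 14])           # table values
```

The positive ray already breaks at n = 2 (3 < 4); the 2-D perceptron first breaks at n = 4 (14 < 16).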

Combinatorial puzzle: dichotomies on 3 points →

SLIDE 15

A Combinatorial Puzzle

[Figure: a table of dichotomies on x1, x2, x3.]

  • A set of dichotomies

Two points shattered →

SLIDE 16

A Combinatorial Puzzle

[Figure: a set of dichotomies on x1, x2, x3.]

  • Two points are shattered

Another set of dichotomies →

SLIDE 17

A Combinatorial Puzzle

[Figure: a set of dichotomies on x1, x2, x3.]

  • No pair of points is shattered

What about N = 4? →

SLIDE 18

A Combinatorial Puzzle

[Figure: for N = 3 (x1, x2, x3), 4 dichotomies is the max with no 2 points shattered; an empty table for x1, x2, x3, x4 remains to be filled in.]

If N = 4, how many dichotomies are possible with no 2 points shattered?
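The puzzle yields to brute force; a sketch of my own that searches all subsets of the 16 possible dichotomies (and so spoils the answer for N = 4):

```python
from itertools import combinations, product

N = 4
candidates = list(product((-1, 1), repeat=N))      # all 2^4 = 16 dichotomies

def pair_shattered(dset, i, j):
    """True if points i and j take all four +/-1 patterns within dset."""
    return len({(d[i], d[j]) for d in dset}) == 4

def valid(dset):
    """No pair of the N points is shattered by dset."""
    return not any(pair_shattered(dset, i, j)
                   for i, j in combinations(range(N), 2))

# Try subset sizes from largest down; the first size with a valid set wins.
best = next(r for r in range(2 ** N, 0, -1)
            if any(valid(s) for s in combinations(candidates, r)))
```

One maximal set is the all-(−1) dichotomy plus the four dichotomies with a single +1: no pair of points ever shows the (+1, +1) pattern.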
