Computational Learning Theory: Probably Approximately Correct (PAC) Learning



SLIDE 1

Machine Learning

Computational Learning Theory: Probably Approximately Correct (PAC) Learning


Slides based on material from Dan Roth, Avrim Blum, Tom Mitchell and others

SLIDE 2

Computational Learning Theory

  • The Theory of Generalization
  • Probably Approximately Correct (PAC) learning
  • Positive and negative learnability results
  • Agnostic Learning
  • Shattering and the VC dimension

SLIDE 3

Where are we?

  • The Theory of Generalization
  • Probably Approximately Correct (PAC) learning
  • Positive and negative learnability results
  • Agnostic Learning
  • Shattering and the VC dimension

SLIDE 4

This section

  • 1. Define the PAC model of learning
  • 2. Make formal connections to the principle of Occam’s razor

SLIDE 6

Recall: The setup

  • Instance Space: X, the set of examples

  • Concept Space: C, the set of possible target functions: f ∈ C is the hidden target function

– E.g.: all n-conjunctions; all n-dimensional linear functions, …

  • Hypothesis Space: H, the set of possible hypotheses

– This is the set that the learning algorithm explores

  • Training instances: S Γ— {βˆ’1, 1}: positive and negative examples of the target concept (S is a finite subset of X)

– Training instances are generated by a fixed unknown probability distribution D over X

  • What we want: A hypothesis h ∈ H such that h(x) = f(x)

– Evaluate h on subsequent examples x ∈ X drawn according to D
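To make this setup concrete, here is a minimal sketch in Python, assuming Boolean instances, a uniform distribution D, and a hypothetical monotone-conjunction target; these specifics are illustrative and do not come from the slides.

```python
import random

# Instance space X: Boolean vectors of length n (illustrative choice).
n = 5

def f(x):
    # Hidden target concept f ∈ C: here a hypothetical conjunction x1 ∧ x3.
    return 1 if x[0] == 1 and x[2] == 1 else -1

def draw_instance():
    # The fixed, unknown distribution D over X; uniform purely for illustration.
    return tuple(random.randint(0, 1) for _ in range(n))

def draw_sample(m):
    # Training set S: m instances drawn i.i.d. from D, labeled by the target f.
    return [(x, f(x)) for x in (draw_instance() for _ in range(m))]

S = draw_sample(20)   # positive and negative examples of the target concept
```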

SLIDE 7

Formulating the theory of prediction

In the general case, we have

– π‘Œ: instance space, 𝑍: output space = {+1, -1} – 𝐸: an unknown distribution over π‘Œ – 𝑔: an unknown target function X β†’ 𝑍, taken from a concept class 𝐷 – β„Ž: a hypothesis function X β†’ 𝑍 that the learning algorithm selects from a hypothesis class 𝐼 – 𝑇: a set of m training examples drawn from 𝐸, labeled with f – err! β„Ž : The true error of any hypothesis β„Ž – err" β„Ž : The empirical error or training error or observed error of β„Ž


All the notation we have so far on one slide
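A small sketch of the two error notions, continuing the illustrative setup above: err_S(h) is computed exactly on the training set S, while err_D(h) can only be estimated, since D is unknown to the learner.

```python
def empirical_error(h, S):
    # err_S(h): fraction of training examples that h gets wrong.
    return sum(1 for x, y in S if h(x) != y) / len(S)

def estimated_true_error(h, f, draw_instance, trials=100_000):
    # err_D(h) = Pr_{x~D}[f(x) != h(x)]; D is unknown to the learner,
    # so here we only estimate it by drawing fresh instances from D.
    wrong = 0
    for _ in range(trials):
        x = draw_instance()
        if h(x) != f(x):
            wrong += 1
    return wrong / trials
```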

SLIDE 8

Theoretical questions

  • Can we describe or bound the true error (err_D) given the empirical error (err_S)?
  • Is a concept class C learnable?
  • Is it possible to learn C using only the functions in H using the supervised protocol?
  • How many examples does an algorithm need to guarantee good performance?

SLIDE 9

Expectations of learning

  • We cannot expect a learner to learn a concept exactly

– There will generally be multiple concepts consistent with the available data (which represent a small fraction of the available instance space)
– Unseen examples could potentially have any label
– Let’s β€œagree” to misclassify uncommon examples that do not show up in the training set

  • We cannot always expect to learn a close approximation to the target concept

– Sometimes (hopefully only rarely) the training set will not be representative (will contain uncommon examples)

SLIDE 12

Probably approximately correctness

The only realistic expectation of a good learner is that with high probability it will learn a close approximation to the target concept

  • In Probably Approximately Correct (PAC) learning, one requires that

– given small parameters Ξ΅ and Ξ΄,
– with probability at least 1 βˆ’ Ξ΄, a learner produces a hypothesis with error at most Ξ΅

  • The only reason we can hope for this is the consistent distribution assumption
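One way to read the requirement: across many independent draws of the training set, at most a Ξ΄ fraction of runs may produce a hypothesis whose true error exceeds Ξ΅. A rough simulation sketch of that reading; the learner, sampler, and error estimator are placeholders from the earlier sketches, not part of the definition.

```python
def pac_check(learner, draw_sample, true_error, m, eps, delta, runs=1000):
    # Empirically check the (eps, delta) reading for a fixed sample size m:
    # the learned hypothesis may have true error above eps
    # in at most a delta fraction of independent runs.
    bad_runs = 0
    for _ in range(runs):
        S = draw_sample(m)            # fresh training set S ~ D^m, labeled by f
        h = learner(S)                # the learner picks some h in H
        if true_error(h) > eps:       # "approximately correct" failed on this run
            bad_runs += 1
    return bad_runs / runs <= delta   # "probably": failure rate at most delta
```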

SLIDE 15

PAC Learnability

Consider a concept class C defined over an instance space X (containing instances of length n), and a learner L using a hypothesis space H.

The concept class C is PAC learnable by L using H if, for all f ∈ C, for all distributions D over X, and for fixed 0 < Ξ΅, Ξ΄ < 1, given m examples sampled independently according to D, with probability at least (1 βˆ’ Ξ΄) the algorithm L produces a hypothesis h ∈ H that has error at most Ξ΅, where m is polynomial in 1/Ξ΅, 1/Ξ΄, n and size(H).

The concept class C is efficiently learnable if L can produce the hypothesis in time that is polynomial in 1/Ξ΅, 1/Ξ΄, n and size(H).


recall that err_D(h) = Pr_{x∼D}[f(x) β‰  h(x)]

Given a small enough number of examples, with high probability, the learner will produce a β€œgood enough” classifier.
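As a preview of the Occam’s razor connection mentioned in the outline, and assuming a finite hypothesis class and a learner that outputs a hypothesis consistent with the training set (a standard setting, not spelled out on this slide): m β‰₯ (1/Ξ΅)(ln|H| + ln(1/Ξ΄)) examples suffice for the PAC guarantee. A small sketch of that calculation, with purely illustrative numbers:

```python
import math

def occam_sample_size(h_size, eps, delta):
    # Standard bound for a consistent learner over a finite hypothesis class H:
    # m >= (1/eps) * (ln|H| + ln(1/delta)) examples suffice.
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# Example: conjunctions over n Boolean variables, |H| = 3^n
# (each variable appears positively, negatively, or not at all).
n = 10
print(occam_sample_size(3 ** n, eps=0.1, delta=0.05))   # ~140 examples
```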

SLIDE 24

PAC Learnability

We impose two limitations

  • Polynomial sample complexity (information theoretic constraint)

– Is there enough information in the sample to distinguish a hypothesis h that approximates f?

  • Polynomial time complexity (computational complexity)

– Is there an efficient algorithm that can process the sample and produce a good hypothesis h?

To be PAC learnable, there must be a hypothesis h ∈ H with arbitrarily small error for every f ∈ C. We assume H βŠ‡ C. (Properly PAC learnable if H = C)

Worst-case definition: the algorithm must meet its accuracy

– for every distribution (the distribution-free assumption)
– for every target function f in the class C
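For a concrete example that meets both limitations (a standard positive learnability example, not taken from this slide): the elimination algorithm for monotone conjunctions examines each example once, so it runs in time polynomial in n and m, and it always outputs a hypothesis consistent with a noise-free training set labeled by a monotone conjunction.

```python
def learn_monotone_conjunction(S, n):
    # Elimination algorithm: start with the conjunction of all n variables and
    # drop any variable that is 0 in some positive example. Runs in O(n * m)
    # time and is consistent with every noise-free sample of a monotone
    # conjunction (the true target's variables are never eliminated).
    relevant = set(range(n))
    for x, y in S:
        if y == 1:
            relevant = {i for i in relevant if x[i] == 1}

    def h(x):
        return 1 if all(x[i] == 1 for i in relevant) else -1

    return h
```

Paired with a sample-size bound like the one sketched earlier, this is the usual template for a positive PAC learnability result.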
