Computational Learning Theory: Probably Approximately Correct (PAC) Learning



SLIDE 1

Machine Learning

Computational Learning Theory: Probably Approximately Correct (PAC) Learning


Slides based on material from Dan Roth, Avrim Blum, Tom Mitchell and others

SLIDE 2

Computational Learning Theory

  • The Theory of Generalization
  • Probably Approximately Correct (PAC) learning
  • Positive and negative learnability results
  • Agnostic Learning
  • Shattering and the VC dimension

SLIDE 3

Where are we?

  • The Theory of Generalization
  • Probably Approximately Correct (PAC) learning
  • Positive and negative learnability results
  • Agnostic Learning
  • Shattering and the VC dimension

SLIDE 4

This section

  • 1. Define the PAC model of learning
  • 2. Make formal connections to the principle of Occam’s razor

SLIDE 6

Recall: The setup

  • Instance Space: X, the set of examples

  • Concept Space: C, the set of possible target functions: f ∈ C is the hidden target function

– E.g.: all n-conjunctions; all n-dimensional linear functions, …

  • Hypothesis Space: H, the set of possible hypotheses

– This is the set that the learning algorithm explores

  • Training instances: S Γ— {βˆ’1, 1}: positive and negative examples of the target concept (S is a finite subset of X)

– Training instances are generated by a fixed unknown probability distribution D over X

  • What we want: A hypothesis h ∈ H such that h(x) = f(x)

– Evaluate h on subsequent examples x ∈ X drawn according to D
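To make this setup concrete, here is a minimal sketch in Python, assuming Boolean instances, a uniform distribution D, and a hypothetical monotone-conjunction target; these specifics are illustrative and do not come from the slides.

```python
import random

# Instance space X: Boolean vectors of length n (illustrative choice).
n = 5

def f(x):
    # Hidden target concept f ∈ C: here a hypothetical conjunction x1 ∧ x3.
    return 1 if x[0] == 1 and x[2] == 1 else -1

def draw_instance():
    # The fixed, unknown distribution D over X; uniform purely for illustration.
    return tuple(random.randint(0, 1) for _ in range(n))

def draw_sample(m):
    # Training set S: m instances drawn i.i.d. from D, labeled by the target f.
    return [(x, f(x)) for x in (draw_instance() for _ in range(m))]

S = draw_sample(20)   # positive and negative examples of the target concept
```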

SLIDE 7

Formulating the theory of prediction

In the general case, we have

– π‘Œ: instance space, 𝑍: output space = {+1, -1} – 𝐸: an unknown distribution over π‘Œ – 𝑔: an unknown target function X β†’ 𝑍, taken from a concept class 𝐷 – β„Ž: a hypothesis function X β†’ 𝑍 that the learning algorithm selects from a hypothesis class 𝐼 – 𝑇: a set of m training examples drawn from 𝐸, labeled with f – err! β„Ž : The true error of any hypothesis β„Ž – err" β„Ž : The empirical error or training error or observed error of β„Ž


All the notation we have so far on one slide
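A small sketch of the two error notions, continuing the illustrative setup above: err_S(h) is computed exactly on the training set S, while err_D(h) can only be estimated, since D is unknown to the learner.

```python
def empirical_error(h, S):
    # err_S(h): fraction of training examples that h gets wrong.
    return sum(1 for x, y in S if h(x) != y) / len(S)

def estimated_true_error(h, f, draw_instance, trials=100_000):
    # err_D(h) = Pr_{x~D}[f(x) != h(x)]; D is unknown to the learner,
    # so here we only estimate it by drawing fresh instances from D.
    wrong = 0
    for _ in range(trials):
        x = draw_instance()
        if h(x) != f(x):
            wrong += 1
    return wrong / trials
```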

SLIDE 8

Theoretical questions

  • Can we describe or bound the true error (err_D) given the empirical error (err_S)?
  • Is a concept class C learnable?
  • Is it possible to learn C using only the functions in H using the supervised protocol?
  • How many examples does an algorithm need to guarantee good performance?

SLIDE 9

Expectations of learning

  • We cannot expect a learner to learn a concept exactly

– There will generally be multiple concepts consistent with the available data (which represent a small fraction of the available instance space)
– Unseen examples could potentially have any label
– Let’s β€œagree” to misclassify uncommon examples that do not show up in the training set

  • We cannot always expect to learn a close approximation to the target concept

– Sometimes (hopefully only rarely) the training set will not be representative (will contain uncommon examples)

SLIDE 12

Probably approximately correctness

The only realistic expectation of a good learner is that with high probability it will learn a close approximation to the target concept

  • In Probably Approximately Correct (PAC) learning, one requires that

– given small parameters Ξ΅ and Ξ΄,
– with probability at least 1 βˆ’ Ξ΄, a learner produces a hypothesis with error at most Ξ΅

  • The only reason we can hope for this is the consistent distribution assumption
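One way to read the requirement: across many independent draws of the training set, at most a Ξ΄ fraction of runs may produce a hypothesis whose true error exceeds Ξ΅. A rough simulation sketch of that reading; the learner, sampler, and error estimator are placeholders from the earlier sketches, not part of the definition.

```python
def pac_check(learner, draw_sample, true_error, m, eps, delta, runs=1000):
    # Empirically check the (eps, delta) reading for a fixed sample size m:
    # the learned hypothesis may have true error above eps
    # in at most a delta fraction of independent runs.
    bad_runs = 0
    for _ in range(runs):
        S = draw_sample(m)            # fresh training set S ~ D^m, labeled by f
        h = learner(S)                # the learner picks some h in H
        if true_error(h) > eps:       # "approximately correct" failed on this run
            bad_runs += 1
    return bad_runs / runs <= delta   # "probably": failure rate at most delta
```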

SLIDE 15

PAC Learnability

Consider a concept class C defined over an instance space X (containing instances of length n), and a learner L using a hypothesis space H.

The concept class C is PAC learnable by L using H if, for all f ∈ C, for all distributions D over X, and for fixed 0 < Ξ΅, Ξ΄ < 1, given m examples sampled independently according to D, with probability at least (1 βˆ’ Ξ΄) the algorithm L produces a hypothesis h ∈ H that has error at most Ξ΅, where m is polynomial in 1/Ξ΅, 1/Ξ΄, n and size(H).

The concept class C is efficiently learnable if L can produce the hypothesis in time that is polynomial in 1/Ξ΅, 1/Ξ΄, n and size(H).


recall that err_D(h) = Pr_{x∼D}[f(x) β‰  h(x)]

Given a small enough number of examples, with high probability, the learner will produce a β€œgood enough” classifier.
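As a preview of the Occam’s razor connection mentioned in the outline, and assuming a finite hypothesis class and a learner that outputs a hypothesis consistent with the training set (a standard setting, not spelled out on this slide): m β‰₯ (1/Ξ΅)(ln|H| + ln(1/Ξ΄)) examples suffice for the PAC guarantee. A small sketch of that calculation, with purely illustrative numbers:

```python
import math

def occam_sample_size(h_size, eps, delta):
    # Standard bound for a consistent learner over a finite hypothesis class H:
    # m >= (1/eps) * (ln|H| + ln(1/delta)) examples suffice.
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# Example: conjunctions over n Boolean variables, |H| = 3^n
# (each variable appears positively, negatively, or not at all).
n = 10
print(occam_sample_size(3 ** n, eps=0.1, delta=0.05))   # ~140 examples
```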

SLIDE 24

PAC Learnability

We impose two limitations

  • Polynomial sample complexity (information theoretic constraint)

– Is there enough information in the sample to distinguish a hypothesis h that approximates f?

  • Polynomial time complexity (computational complexity)

– Is there an efficient algorithm that can process the sample and produce a good hypothesis h?

To be PAC learnable, there must be a hypothesis h ∈ H with arbitrarily small error for every f ∈ C. We assume H βŠ‡ C. (Properly PAC learnable if H = C)

Worst-case definition: the algorithm must meet its accuracy

– for every distribution (the distribution-free assumption)
– for every target function f in the class C
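For a concrete example that meets both limitations (a standard positive learnability example, not taken from this slide): the elimination algorithm for monotone conjunctions examines each example once, so it runs in time polynomial in n and m, and it always outputs a hypothesis consistent with a noise-free training set labeled by a monotone conjunction.

```python
def learn_monotone_conjunction(S, n):
    # Elimination algorithm: start with the conjunction of all n variables and
    # drop any variable that is 0 in some positive example. Runs in O(n * m)
    # time and is consistent with every noise-free sample of a monotone
    # conjunction (the true target's variables are never eliminated).
    relevant = set(range(n))
    for x, y in S:
        if y == 1:
            relevant = {i for i in relevant if x[i] == 1}

    def h(x):
        return 1 if all(x[i] == 1 for i in relevant) else -1

    return h
```

Paired with a sample-size bound like the one sketched earlier, this is the usual template for a positive PAC learnability result.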
