

SLIDE 1

CS 730/730W/830: Intro AI

Naive Bayes and Boosting

Wheeler Ruml (UNH) Lecture 22, CS 730 – 1 / 14

Handout: slides. Asst 5 milestone was due.

SLIDE 2

Supervised Learning: Summary So Far


■ learning as function approximation
■ k-NN: distance function (any attributes), any labels
■ Neural network: numeric attributes, numeric or binary labels
■ Regression: incremental training with LMS
■ 3-Layer ANN: train with BackProp
■ Inductive Logic Programming: logical concepts
■ Decision Trees: easier with discrete attributes and labels

SLIDE 3

Naive Bayes

Naive Bayes
■ Bayes’ Theorem
■ The NB Model
■ The NB Classifier
■ Break
Boosting


SLIDE 4

Bayes’ Theorem

P(H|D) = P(H) P(D|H) / P(D)

Example: P(H) = 0.0001, P(D|H) = 0.99, P(D) = 0.01. P(H|D) = ?

If you don’t have P(D), sometimes it helps to note that

P(D) = P(D|H) P(H) + P(D|¬H) P(¬H)
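Worked out, the example numbers give a posterior of under 1%:

```latex
P(H \mid D) = \frac{P(H)\,P(D \mid H)}{P(D)}
            = \frac{0.0001 \times 0.99}{0.01}
            = 0.0099
```

Despite the 99% hit rate P(D|H), the tiny prior P(H) keeps the posterior small: the classic base-rate lesson.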

SLIDE 5

A Naive Bayesian Model

Bayes’ Theorem: P(H|D) = P(H) P(D|H) / P(D)

naive model: P(D|H) = P(x1, . . . , xn|H) = ∏_i P(xi|H)

attributes independent, given class

P(H|x1, . . . , xn) = α P(H) ∏_i P(xi|H)
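The normalizer α is left implicit on the slide; it is just 1/P(D), and since the class values are mutually exclusive and exhaustive, it can be computed by summing the unnormalized scores over the classes:

```latex
\alpha = \frac{1}{P(D)}
       = \left( \sum_{h} P(h) \prod_i P(x_i \mid h) \right)^{-1}
```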

SLIDE 6

The ‘Naive Bayes’ Classifier


P(H|x1, . . . , xn) = α P(H) ∏_i P(xi|H)

attributes independent, given class
■ maximum a posteriori = pick highest
■ maximum likelihood = ignore prior
■ watch for sparse data when learning!
■ learning as density estimation
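A minimal sketch of the classifier in Python (my illustration, not lecture code): it trains by counting, classifies by the MAP rule in log space, and uses add-one (Laplace) smoothing as one standard guard against the sparse-data problem the slide warns about.

```python
from collections import Counter, defaultdict
import math

class NaiveBayes:
    """Naive Bayes for discrete attributes and labels."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        # log P(H): class frequency in the training data
        self.prior = {h: math.log(y.count(h) / len(y)) for h in self.classes}
        # counts[h][i] tallies values of attribute i among class-h examples
        self.counts = {h: defaultdict(Counter) for h in self.classes}
        self.values = defaultdict(set)          # values seen per attribute
        for xs, h in zip(X, y):
            for i, v in enumerate(xs):
                self.counts[h][i][v] += 1
                self.values[i].add(v)
        return self

    def log_posterior(self, h, xs):
        # log P(h) + sum_i log P(x_i | h), with add-one smoothing
        # (the extra +1 in the denominator leaves room for unseen values)
        lp = self.prior[h]
        for i, v in enumerate(xs):
            n_h = sum(self.counts[h][i].values())
            lp += math.log((self.counts[h][i][v] + 1) /
                           (n_h + len(self.values[i]) + 1))
        return lp

    def predict(self, xs):
        # MAP: pick the class with the highest posterior
        return max(self.classes, key=lambda h: self.log_posterior(h, xs))

# toy usage: classify (outlook, windy) tuples
X = [("sunny", "yes"), ("sunny", "no"), ("rain", "yes"), ("rain", "no")]
y = ["play", "play", "stay", "stay"]
print(NaiveBayes().fit(X, y).predict(("sunny", "yes")))   # -> "play"
```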

SLIDE 7

Break


asst 5

exam 2

projects

SLIDE 8

Boosting

Naive Bayes
Boosting
■ Ensembles
■ AdaBoost
■ Behavior
■ Summary
■ EOLQs

SLIDE 9

Ensemble Learning


■ committees, ensembles
■ weak vs. strong learners
■ reduce variance, expand hypothesis space (e.g., half-spaces)
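As a concrete (idealized) illustration of why committees help: if five classifiers are each correct 60% of the time and err independently, a simple majority vote is correct whenever at least three of them are:

```latex
P(\text{majority correct}) = \sum_{k=3}^{5} \binom{5}{k} (0.6)^k (0.4)^{5-k}
                           = 0.3456 + 0.2592 + 0.0778 \approx 0.683
```

The independence assumption is the catch: hypotheses trained on the same data tend to make correlated errors, which is what boosting's reweighting works against.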

SLIDE 10

AdaBoost


N examples, T rounds, L a weak learner on weighted examples:

p ← uniform distribution over the N examples
for t = 1 to T do
    h_t ← call L with weights p
    ε_t ← h_t’s weighted misclassification probability
    if ε_t = 0, return h_t
    α_t ← (1/2) ln((1 − ε_t) / ε_t)
    for each example i:
        if h_t(i) is correct, p_i ← p_i e^(−α_t)
        else, p_i ← p_i e^(α_t)
    normalize p to sum to 1
return the h_t weighted by the α_t

To classify, choose the label with the highest sum of weighted votes.
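A runnable sketch of the loop above (my code, not the lecture's), with one assumption made concrete: the weak learner L is a one-feature threshold "decision stump", and labels are +1/−1.

```python
import math

def stump_learner(X, y, p):
    """Return the stump with least weighted error, and that error."""
    best, best_err = None, float("inf")
    for f in range(len(X[0])):
        for thresh in sorted({x[f] for x in X}):
            for sign in (+1, -1):
                err = sum(w for x, label, w in zip(X, y, p)
                          if (sign if x[f] >= thresh else -sign) != label)
                if err < best_err:
                    best, best_err = (f, thresh, sign), err
    f, thresh, sign = best
    return (lambda x: sign if x[f] >= thresh else -sign), best_err

def adaboost(X, y, T):
    N = len(X)
    p = [1.0 / N] * N                      # uniform over the N examples
    ensemble = []                          # (alpha_t, h_t) pairs
    for _ in range(T):
        h, eps = stump_learner(X, y, p)    # call L with weights p
        if eps == 0:                       # perfect hypothesis: done
            return [(1.0, h)]
        alpha = 0.5 * math.log((1 - eps) / eps)
        # correct examples shrink by e^-alpha, incorrect grow by e^alpha
        p = [w * math.exp(-alpha if h(x) == label else alpha)
             for x, label, w in zip(X, y, p)]
        total = sum(p)
        p = [w / total for w in p]         # normalize p to sum to 1
        ensemble.append((alpha, h))
    return ensemble

def classify(ensemble, x):
    """Label with the highest sum of weighted votes (sign of the sum)."""
    return 1 if sum(alpha * h(x) for alpha, h in ensemble) >= 0 else -1

# toy usage: no single stump classifies this 1-D set, but a committee does
X = [(1,), (2,), (3,), (4,), (5,), (6,)]
y = [1, 1, -1, -1, 1, 1]
ens = adaboost(X, y, T=10)
print([classify(ens, x) for x in X])       # [1, 1, -1, -1, 1, 1]
```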

SLIDE 11

Boosting Function

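The figure on this slide isn't captured in the transcript; presumably it plots the voting weight α_t as a function of the weak hypothesis's weighted error ε_t. From the formula on the previous slide:

```latex
\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}:
\qquad \alpha_t \to \infty \ \text{as}\ \epsilon_t \to 0, \qquad
\alpha_t = 0 \ \text{at}\ \epsilon_t = \tfrac{1}{2}, \qquad
\alpha_t < 0 \ \text{for}\ \epsilon_t > \tfrac{1}{2}
```

So a hypothesis at chance level gets no vote, and a worse-than-chance one has its vote inverted.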

SLIDE 12

Behavior


■ doesn’t overfit (maximizes margin even when no error)
■ outliers get high weight, can be inspected

problems:
■ not enough data
■ hypothesis class too small
■ boosting: learner too weak, too strong

SLIDE 13

Supervised Learning: Summary


■ k-NN: distance function (any attributes), any labels
■ Neural network: numeric attributes, numeric or binary labels
■ Perceptron: equivalent to linear regression
■ 3-Layer ANN: BackProp learning
■ Decision Trees: easier with discrete attributes and labels
■ Inductive Logic Programming: logical concepts
■ Naive Bayes: easier with discrete attributes and labels
■ Boosting: general wrapper to improve performance

Didn’t cover: RBFs, EBL, SVMs

SLIDE 14

EOLQs


What question didn’t you get to ask today?

What’s still confusing?

What would you like to hear more about?

Please write down your most pressing question about AI and put it in the box on your way out. Thanks!