Linear Models: Naïve Bayes, Perceptron

slide-1
SLIDE 1

Linear Models: Naïve Bayes, Perceptron

CMSC 470 Marine Carpuat

Slides credit: Jacob Eisenstein

slide-2
SLIDE 2

Linear Models for Multiclass Classification

  • Feature function representation
  • Weights
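A minimal sketch of what a feature function and a weight vector could look like in code. The bag-of-words features, the `<bias>` feature, and the names `feature_function`, `score`, and `predict` are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

def feature_function(tokens, label):
    """f(x, y): word counts conjoined with the candidate label (assumed design)."""
    feats = Counter()
    for tok in tokens:
        feats[(label, tok)] += 1
    feats[(label, "<bias>")] += 1  # hypothetical offset feature, one per label
    return feats

def score(weights, tokens, label):
    """Linear score: dot product of the weights with f(x, y)."""
    feats = feature_function(tokens, label)
    return sum(weights.get(key, 0.0) * value for key, value in feats.items())

def predict(weights, tokens, labels):
    """Return the label with the highest linear score (first label wins ties)."""
    return max(labels, key=lambda y: score(weights, tokens, y))

# Usage with hand-set weights
weights = {("pos", "good"): 1.0, ("neg", "bad"): 1.0}
print(predict(weights, ["good", "movie"], ["pos", "neg"]))  # prints "pos"
```

Storing the weights in a dictionary keyed by (label, feature) keeps the vector sparse, which is usually convenient for text features.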

slide-3
SLIDE 3

Naïve Bayes recap

slide-4
SLIDE 4

Prediction with Naïve Bayes

  • Score(x, y)
  • Definition of conditional probability
  • Generative story assumptions
  • This is a linear model!
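The equations for this slide did not survive extraction. The following is a hedged reconstruction of the standard argument, assuming a multinomial naïve Bayes model in which $x_j$ is the count of word $j$, $\mu_y$ is the prior probability of label $y$, and $\phi_{y,j}$ is the probability of word $j$ given label $y$. These symbols are assumptions for illustration. Terms that do not depend on $y$ are dropped under the argmax.

```latex
\begin{align*}
\hat{y} &= \operatorname*{argmax}_{y} \log p(y \mid x) \\
        &= \operatorname*{argmax}_{y} \left[ \log p(x, y) - \log p(x) \right]
           && \text{definition of conditional probability} \\
        &= \operatorname*{argmax}_{y} \log p(x, y)
           && \text{$p(x)$ does not depend on $y$} \\
        &= \operatorname*{argmax}_{y} \Big[ \log \mu_y + \sum_{j} x_j \log \phi_{y,j} \Big]
           && \text{generative story assumptions} \\
        &= \operatorname*{argmax}_{y} \; \theta \cdot f(x, y)
           && \text{a linear model}
\end{align*}
```

Here $\theta$ collects the values $\log \mu_y$ and $\log \phi_{y,j}$, and $f(x, y)$ places a constant 1 and the counts $x_j$ in the block of features reserved for label $y$.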

slide-5
SLIDE 5
  • Naïve Bayes worked example on board
slide-6
SLIDE 6

The perceptron

  • A linear model for classification
  • Prediction rule
  • An algorithm to learn feature weights given labeled data
  • online algorithm
  • error-driven
slide-7
SLIDE 7

Multiclass perceptron
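The algorithm on this slide was not captured in the text. A minimal sketch of one common formulation of the multiclass perceptron follows: predict with the current weights and, on a mistake, add the features of the gold label and subtract the features of the predicted label. The feature design and all function names are assumptions for illustration.

```python
from collections import defaultdict

def features(tokens, label):
    """f(x, y) as a list of active (label, word) features plus a bias (assumed design)."""
    return [(label, tok) for tok in tokens] + [(label, "<bias>")]

def predict(weights, tokens, labels):
    """Prediction rule: argmax over labels of the linear score (first label wins ties)."""
    return max(labels, key=lambda y: sum(weights.get(f, 0.0) for f in features(tokens, y)))

def train_perceptron(data, labels, epochs=10):
    """Online, error-driven training: weights change only when a prediction is wrong."""
    weights = defaultdict(float)
    for _ in range(epochs):
        mistakes = 0
        for tokens, gold in data:
            guess = predict(weights, tokens, labels)
            if guess != gold:
                mistakes += 1
                for f in features(tokens, gold):
                    weights[f] += 1.0   # promote features of the correct label
                for f in features(tokens, guess):
                    weights[f] -= 1.0   # demote features of the wrong prediction
        if mistakes == 0:               # a full pass with no errors: nothing left to learn
            break
    return weights

# Usage on a tiny, linearly separable toy set
data = [(["good", "fun"], "pos"), (["bad", "dull"], "neg"),
        (["good", "plot"], "pos"), (["dull", "plot"], "neg")]
weights = train_perceptron(data, ["pos", "neg"])
print(predict(weights, ["bad", "dull"], ["pos", "neg"]))  # prints "neg"
```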

slide-8
SLIDE 8

Online vs batch learning algorithms

  • In an online algorithm, parameter values are updated after every example
  • E.g., perceptron
  • In a batch algorithm, parameter values are set after observing the entire training set
  • E.g., naïve Bayes
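To make the batch case concrete, here is a minimal sketch of naïve Bayes estimation: counts are collected over the whole training set, and the parameters are computed once at the end. The add-alpha smoothing and the function names are assumptions for illustration, not details from the slides.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(data, alpha=1.0):
    """Batch learning: one pass to collect counts, then set all parameters at once."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in data:            # observe the entire training set first
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    n = sum(label_counts.values())
    log_prior = {y: math.log(c / n) for y, c in label_counts.items()}
    log_lik = {}
    for y in label_counts:
        # Add-alpha smoothing (an assumption) keeps unseen words from getting zero probability.
        total = sum(word_counts[y].values()) + alpha * len(vocab)
        log_lik[y] = {w: math.log((word_counts[y][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict_nb(log_prior, log_lik, tokens):
    """Score each label by log p(y) + sum of log p(word | y); out-of-vocabulary words are skipped."""
    def score(y):
        return log_prior[y] + sum(log_lik[y][w] for w in tokens if w in log_lik[y])
    return max(log_prior, key=score)

# Usage
data = [(["good", "fun"], "pos"), (["bad", "dull"], "neg")]
log_prior, log_lik = train_naive_bayes(data)
print(predict_nb(log_prior, log_lik, ["good"]))  # prints "pos"
```

By contrast, the perceptron changes its weights inside the loop over examples, so its parameters after example 10 already reflect examples 1 through 10.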
slide-9
SLIDE 9

Multiclass perceptron: a simple algorithm with some theoretical guarantees

Theorem: If the data is linearly separable, then the perceptron algorithm will find a separator (Novikoff, 1962)

slide-10
SLIDE 10

Practical considerations

  • In which order should we select instances?
  • Shuffling before learning to randomize order helps
  • How do we decide when to stop?
  • When the weight values don’t change much
  • E.g., norm of the difference between previous and current weight vectors falls below some threshold

  • When the accuracy on held out data starts to decrease
  • Early stopping
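These considerations can be sketched in a single training loop. The example below assumes a binary perceptron over dense feature vectors with labels +1 and -1, which is simpler than the multiclass setting of the lecture. The function names, the tolerance `tol`, and the choice to return the weights from before held-out accuracy dropped are illustrative assumptions.

```python
import math
import random

def predict(w, x):
    """Binary perceptron prediction: +1 if the score is positive, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def accuracy(w, examples):
    return sum(predict(w, x) == y for x, y in examples) / len(examples)

def train(train_set, heldout, max_epochs=100, tol=1e-9, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(train_set[0][0])
    best_w, best_acc = list(w), accuracy(w, heldout)
    for _ in range(max_epochs):
        order = list(train_set)
        rng.shuffle(order)                  # randomize instance order every epoch
        prev = list(w)
        for x, y in order:
            if predict(w, x) != y:          # error-driven update
                w = [wi + y * xi for wi, xi in zip(w, x)]
        acc = accuracy(w, heldout)
        if acc < best_acc:                  # early stopping: held-out accuracy decreased
            return best_w
        best_w, best_acc = list(w), acc
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, prev)))
        if change < tol:                    # weights have stopped changing
            break
    return w

# Usage: the first component of each vector is a constant bias feature
train_set = [([1.0, 2.0, 1.0], 1), ([1.0, -1.0, -2.0], -1),
             ([1.0, 1.5, 0.5], 1), ([1.0, -2.0, -0.5], -1)]
heldout = [([1.0, 1.0, 1.0], 1), ([1.0, -1.0, -1.0], -1)]
w = train(train_set, heldout)
print(accuracy(w, heldout))
```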
slide-11
SLIDE 11

ML fundamentals aside:

  • Overfitting/underfitting/generalization
slide-12
SLIDE 12

Training error is not sufficient

  • We care about generalization to new examples
  • A classifier can classify training data perfectly, yet classify new examples incorrectly

  • Because training examples are only a sample of data distribution
  • a feature might correlate with class by coincidence
  • Because training examples could be noisy
  • e.g., accident in labeling
slide-13
SLIDE 13

Overfitting

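  • Consider a model $\theta$ and its:
  • Error rate over training data, $\mathrm{error}_{\mathrm{train}}(\theta)$
  • True error rate over all data, $\mathrm{error}_{\mathrm{true}}(\theta)$
  • We say $\theta$ overfits the training data if

$\mathrm{error}_{\mathrm{train}}(\theta) < \mathrm{error}_{\mathrm{true}}(\theta)$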

slide-14
SLIDE 14

Evaluating on test data

  • Problem: we don’t know $\mathrm{error}_{\mathrm{true}}(\theta)$!
  • Solution:
  • we set aside a test set
  • some examples that will be used for evaluation
  • we don’t look at them during training!
  • after learning a classifier $\theta$, we calculate $\mathrm{error}_{\mathrm{test}}(\theta)$
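A minimal sketch of setting aside a test set and computing an error rate on it. The helper names `train_test_split` and `error_rate`, the 20% test fraction, and the fixed seed are assumptions for illustration. The classifier is passed in as any function from an input to a label.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle once, then hold out a fraction of the examples for evaluation only."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

def error_rate(classify, examples):
    """Fraction of examples whose predicted label differs from the gold label."""
    wrong = sum(1 for x, y in examples if classify(x) != y)
    return wrong / len(examples)

# Usage: the test portion is never touched while fitting the classifier
examples = [(i, i % 2) for i in range(10)]
train, test = train_test_split(examples)
print(len(train), len(test))                      # prints "8 2"
print(error_rate(lambda x: x % 2, test))          # prints "0.0"
```

Comparing `error_rate` on the training portion with `error_rate` on the test portion gives an estimate of how much a classifier overfits.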

slide-15
SLIDE 15

Overfitting

  • Another way of putting it
  • A classifier $\theta$ is said to overfit the training data if there is another hypothesis $\theta'$ such that
  • $\theta$ has a smaller error than $\theta'$ on the training data
  • but $\theta$ has a larger error than $\theta'$ on the test data.
slide-16
SLIDE 16

Underfitting/Overfitting

  • Underfitting
  • Learning algorithm had the opportunity to learn more from training data, but didn’t
  • Overfitting
  • Learning algorithm paid too much attention to idiosyncrasies of the training data; the resulting classifier doesn’t generalize

slide-17
SLIDE 17

Back to the Perceptron

slide-18
SLIDE 18

Averaged Perceptron improves generalization
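The averaged perceptron algorithm on this slide was not captured in the text. A minimal sketch of the usual idea follows: run the ordinary perceptron updates, keep a running sum of the weight vector after every example, and return the average. This naive version touches every weight on every step. Efficient implementations avoid that, and the names and feature design here are assumptions for illustration.

```python
from collections import defaultdict

def features(tokens, label):
    """f(x, y) as a list of active (label, word) features plus a bias (assumed design)."""
    return [(label, tok) for tok in tokens] + [(label, "<bias>")]

def predict(weights, tokens, labels):
    """Argmax over labels of the linear score (first label wins ties)."""
    return max(labels, key=lambda y: sum(weights.get(f, 0.0) for f in features(tokens, y)))

def train_averaged_perceptron(data, labels, epochs=10):
    weights = defaultdict(float)
    totals = defaultdict(float)     # running sum of the weight vector over all steps
    steps = 0
    for _ in range(epochs):
        for tokens, gold in data:
            guess = predict(weights, tokens, labels)
            if guess != gold:       # same error-driven update as the plain perceptron
                for f in features(tokens, gold):
                    weights[f] += 1.0
                for f in features(tokens, guess):
                    weights[f] -= 1.0
            steps += 1
            for f, v in weights.items():
                totals[f] += v      # accumulate after every example, mistake or not
    return {f: total / steps for f, total in totals.items()}

# Usage on a tiny toy set
data = [(["good", "fun"], "pos"), (["bad", "dull"], "neg"),
        (["good", "plot"], "pos"), (["dull", "plot"], "neg")]
avg_weights = train_averaged_perceptron(data, ["pos", "neg"])
print(predict(avg_weights, ["bad", "dull"], ["pos", "neg"]))  # prints "neg"
```

Averaging gives more influence to weight vectors that survived many examples without a mistake, which tends to smooth out the effect of the last few updates.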

slide-19
SLIDE 19

Properties of Linear Models we’ve seen so far

Naïve Bayes

  • Batch learning
  • Generative model p(x,y)
  • Grounded in probability
  • Assumes features are independent given class
  • Learning = find parameters that maximize likelihood of training data

Perceptron

  • Online learning
  • Discriminative model score(y|x)
  • Guaranteed to converge if data is linearly separable

  • But might overfit the training set
  • Error-driven learning
slide-20
SLIDE 20

What you should know about linear models

  • Their properties, strengths and weaknesses (see previous slides)
  • How to make a prediction given a model
  • How to train a model given a dataset