ECE 4524 Artificial Intelligence and Engineering Applications



SLIDE 1

ECE 4524 Artificial Intelligence and Engineering Applications

Lecture 22: Introduction to Learning Reading: AIAMA 18.1-18.3 Today’s Schedule:

◮ Motivation for Learning
◮ Types of Learning
◮ Supervised Learning and Hypothesis spaces
◮ Example: Decision Trees

SLIDE 2

Why learning?

◮ not all information is known at design time
◮ it might be impractical to program all possibilities directly
◮ some agents need to be able to adapt over time
◮ we might not know how to solve a problem directly by design

This area in general is referred to as Machine Learning.

SLIDE 3

Learning is a very general concept.

It can be applied to all elements of an agent's design, e.g. we might

◮ learn functions mapping percepts to internal states
◮ learn functions mapping states to actions
◮ learn the agent model itself
◮ learn probabilities
◮ learn utilities of internal states or actions

Any agent component with a representation, prior knowledge of the representation, and a way to update the representation using feedback can use learning methods.

SLIDE 4

Categorization of Learning

The most basic distinction in learning is the difference between

◮ Deductive Learning
◮ Inductive Learning

Within inductive learning there is

◮ unsupervised learning
◮ reinforcement learning
◮ supervised learning

SLIDE 5

Supervised Learning

Supervised learning is conceptually very simple, but has many practical and subtle issues.

◮ Given a training set consisting of examples

D = {(x1, y1), (x2, y2), · · · , (xn, yn)}

where each example obeys yi = f(xi) for some unknown function f(·).

◮ Find a function, the hypothesis h(·), with

y = h(x)

that approximates the true f.

SLIDE 6

The quality of the approximation is measured using the Test Set.

T = {(x1, y1), (x2, y2), · · · , (xm, ym)} where m < n and T ∩ D = ∅

◮ Collecting training and testing sets is often hard and expensive.
◮ An h that performs well on the test set is said to generalize well.
◮ An h that performs well on the training set (said to be consistent) but poorly on the test set is said to be over-trained.

Note the test set is independent of the training set!
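As a concrete sketch of the split described above, with a made-up "true" function f (known here only so we can generate labeled data) and a candidate hypothesis h:

```python
import random

def f(x):
    """Hypothetical true function; in practice f is unknown."""
    return 1 if x > 0.5 else 0

random.seed(0)
examples = [(x, f(x)) for x in (random.random() for _ in range(100))]

# Disjoint training set D and test set T (T ∩ D = ∅, m < n).
D, T = examples[:80], examples[80:]

def h(x):
    """A candidate hypothesis; here it happens to coincide with f."""
    return 1 if x > 0.5 else 0

def error_rate(hyp, data):
    """Proportion of examples the hypothesis misclassifies."""
    return sum(hyp(x) != y for x, y in data) / len(data)

train_err = error_rate(h, D)  # consistency with the training set
test_err = error_rate(h, T)   # generalization to unseen examples
```

Because h equals f in this toy setup, both errors are zero; an over-trained h would instead show low training error but a high test error.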

SLIDE 7

Some Nomenclature

◮ When y is finite with a categorical interpretation, this is a classification problem.
◮ If y is binary it is a binary classification problem.
◮ If y is continuous then it is a regression problem.

SLIDE 8

Hypothesis Space

In y = h(x), h is a hypothesis in some space of functions H.

◮ Goal is to find a consistent h with the smallest testing error and the simplest representation (Ockham's Razor).
◮ If we restrict the space H then it may be that no h can be found which approximates f sufficiently (unrealizable).
◮ The complexity/expressiveness of H and the generalization of h ∈ H are related through the bias-variance dilemma.

SLIDE 9

Bayesian analysis gives us a useful framework for supervised learning

◮ Let h ∈ H be parameterized by θ, and the training data be given by D; then the posterior of the parameters is

p(θ|D, h) = p(D|θ, h) p(θ|h) / p(D|h)

◮ The posterior of the model is proportional to the evidence for h, p(D|h):

p(h|D) = p(D|h) p(h) / p(D)

where the denominator integrates over all models in H.

SLIDE 10

Bayesian analysis gives us a useful framework for supervised learning

◮ The maximum likelihood model ignores the prior over models,

argmax_h p(D|h)

and is the model with the most evidence.

◮ The maximum a-posteriori (MAP) model includes the prior over models,

argmax_h p(h|D) = argmax_h p(D|h) p(h)

where the denominator p(D) is common to all models and so irrelevant to the model selection.

We can also average models by choosing the top models rather than a single model. This is particularly useful in binary classification, where the models can simply vote on the final classifier output.
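A minimal numeric sketch of the ML-vs-MAP distinction, using a made-up model space of three biased-coin hypotheses (all names, data, and priors below are illustrative):

```python
import math

# Hypothetical model space H: each model asserts a fixed P(heads).
H = {"fair": 0.5, "biased_0.7": 0.7, "biased_0.9": 0.9}
D = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]  # observed flips: 8 heads, 2 tails

def log_likelihood(q, data):
    """log p(D | h) for a coin model with P(heads) = q."""
    return sum(math.log(q if d == 1 else 1 - q) for d in data)

# Maximum likelihood: argmax_h p(D | h), ignoring the prior p(h).
ml = max(H, key=lambda h: log_likelihood(H[h], D))

# MAP: argmax_h p(D | h) p(h); p(D) is common to all h and drops out.
prior = {"fair": 0.9, "biased_0.7": 0.05, "biased_0.9": 0.05}
map_h = max(H, key=lambda h: log_likelihood(H[h], D) + math.log(prior[h]))
```

With these numbers the ML choice is the 0.7-biased coin, while the strong prior on "fair" pulls the MAP choice back to the fair coin, illustrating how the prior over models changes the selection.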

SLIDE 11

Utility of models

◮ We assume the true f(x) is stationary and samples are IID.
◮ The error rate is the proportion of incorrect classifications.
◮ Note the error rate may be misleading since it makes no distinction about utility differences.

Example: a binary classifier has 4 cases: TP, FP, TN, FN.

◮ The cost of a FP and a FN may not be the same.
◮ This is accounted for via a utility/loss function.
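A small sketch of this point, with invented confusion-matrix counts and an assumed 10:1 cost ratio between missed positives and false alarms:

```python
# Error rate vs. expected loss for a binary classifier.
tp, fp, tn, fn = 50, 5, 40, 5   # illustrative confusion-matrix counts
total = tp + fp + tn + fn

error_rate = (fp + fn) / total  # treats all mistakes alike: 0.10

# Asymmetric loss: suppose a missed positive (FN) costs 10x a false
# alarm (FP); correct decisions cost nothing.
cost_fp, cost_fn = 1.0, 10.0
expected_loss = (cost_fp * fp + cost_fn * fn) / total  # 0.55
```

The two classifiers' mistakes are equally frequent here, but the loss function makes the false negatives dominate the assessment.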

SLIDE 12

Sources of Model Error

◮ The estimated h may differ from the true f because

1. the space H is overly restrictive (unrealizable)
2. the variance is large (high degrees of freedom)
3. f itself may be non-deterministic (noisy)
4. f is "too complex"

◮ Most of Machine Learning has been focused on 1 and 2.
◮ A large open area in machine learning now is 4, "learning in the large" (e.g. neuroscience, bioinformatics, sociology, networks).

SLIDE 13

An example learning method: Decision Trees

Consider a simple reflex agent that reasons by testing a series of attribute = value pairs.

◮ Let x be a vector of attributes.
◮ Let y be a +/− or 0/1 assignment for a Goal (a binary classifier).
◮ Given D = (xi, yi) for i = 1 · · · N, build the tree of decisions formed by testing the attributes of x individually.
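A minimal sketch of the recursive tree-building procedure, in the spirit of AIAMA's DECISION-TREE-LEARNING. The `importance` function (defined via information gain later in the lecture) is assumed here as a parameter, and examples are (attribute-dict, label) pairs:

```python
def plurality_value(examples):
    """Most common label among the examples (majority vote)."""
    ys = [y for _, y in examples]
    return max(set(ys), key=ys.count)

def learn_tree(examples, attributes, importance, parent_examples=None):
    """Recursively build a decision tree as nested dicts {attr: {value: subtree}}."""
    if not examples:
        return plurality_value(parent_examples)
    labels = {y for _, y in examples}
    if len(labels) == 1:              # all examples agree on the class
        return labels.pop()
    if not attributes:                # no tests left: majority vote
        return plurality_value(examples)
    # Test the single most informative attribute next.
    A = max(attributes, key=lambda a: importance(a, examples))
    tree = {A: {}}
    for v in {x[A] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[A] == v]
        rest = [a for a in attributes if a != A]
        tree[A][v] = learn_tree(subset, rest, importance, examples)
    return tree
```

For instance, two examples that differ only in attribute `a` yield the one-test tree `{'a': {0: 0, 1: 1}}`.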

SLIDE 14
SLIDE 15

Implementing the importance function

The idea is that we want to select the attribute that maximizes our "surprise".

◮ The entropy of a R.V. V with values vk measures its uncertainty, in bits:

H(V) = −Σ_k p(vk) log2(p(vk))

◮ For a Boolean R.V. with probability of true = q, the entropy is

B(q) = −(q log2 q + (1 − q) log2(1 − q))

where q ≈ p/(p + n) for p positive and n negative samples.
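The Boolean entropy B(q) can be sketched directly (the sample counts below are illustrative):

```python
import math

def B(q):
    """Entropy, in bits, of a Boolean R.V. with P(true) = q."""
    if q in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

# Estimate q from p positive and n negative samples: q ≈ p / (p + n).
p, n = 6, 6
q = p / (p + n)   # 0.5: an even split, maximally uncertain (1 bit)
```

B(0.5) = 1 bit is the maximum, while B(0) = B(1) = 0: a pure subset of examples needs no further tests.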

SLIDE 16

Implementing the importance function

Now suppose we choose attribute A from x.

◮ For each possible value of A we divide the training set into d subsets, where subset k has pk positive and nk negative examples.
◮ After testing A, the remaining entropy is

remainder(A) = Σ_{k=1}^{d} [(pk + nk)/(p + n)] B(pk/(pk + nk))

◮ The information gain associated with selecting A is then

gain(A) = B(p/(p + n)) − remainder(A)

We choose the attribute with the highest gain in information.
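The remainder/gain computation can be sketched as follows (the split counts are invented for illustration):

```python
import math

def B(q):
    """Entropy, in bits, of a Boolean R.V. with P(true) = q."""
    if q in (0.0, 1.0):
        return 0.0
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def remainder(splits, p, n):
    """Entropy left after the test; splits is a list of (pk, nk) per value."""
    return sum((pk + nk) / (p + n) * B(pk / (pk + nk)) for pk, nk in splits)

def gain(splits, p, n):
    """Information gain of an attribute whose test yields these splits."""
    return B(p / (p + n)) - remainder(splits, p, n)

p, n = 6, 6                 # parent set: 6 positive, 6 negative
perfect = [(6, 0), (0, 6)]  # attribute that separates the classes exactly
useless = [(3, 3), (3, 3)]  # subsets mirror the parent distribution
```

A perfectly separating attribute gains the full 1 bit here, while an attribute whose subsets mirror the parent distribution gains nothing, so the greedy rule would never pick it first.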

SLIDE 17

Next Actions

◮ Reading on Learning Theory (AIAMA 18.4-18.5)
◮ No warmup.

Reminders:

◮ Quiz 3 will be Thursday 4/12.
◮ PS 3 is due tonight.