SLIDE 1

Lecture 10

Supervised Learning Decision Trees and Linear Models

Marco Chiarandini

Department of Mathematics & Computer Science University of Southern Denmark

Slides by Stuart Russell and Peter Norvig

SLIDE 2

Course Overview

✔ Introduction
  ✔ Artificial Intelligence
  ✔ Intelligent Agents

✔ Search
  ✔ Uninformed Search
  ✔ Heuristic Search

✔ Uncertain knowledge and Reasoning
  ✔ Probability and Bayesian approach
  ✔ Bayesian Networks
  ✔ Hidden Markov Chains
  ✔ Kalman Filters

Learning
  Supervised: Decision Trees, Neural Networks, Learning Bayesian Networks
  Unsupervised: EM Algorithm
  Reinforcement Learning

Games and Adversarial Search
  Minimax search and Alpha-beta pruning
  Multiagent search

Knowledge representation and Reasoning
  Propositional logic
  First order logic
  Inference
  Planning

SLIDE 3

Machine Learning

What? Parameters, network structure, hidden concepts
What from? inductive + unsupervised, reinforcement, supervised
What for? prediction, diagnosis, summarization
How? passive vs active, online vs offline
Type of outputs: regression, classification
Details: generative, discriminative

SLIDE 4

Supervised Learning

Given a training set of N example input-output pairs {(x1, y1), (x2, y2), . . . , (xN, yN)}, where each yj was generated by an unknown function y = f(x), find a hypothesis function h from a hypothesis space H that approximates the true function f. Measure the accuracy of the hypothesis on a test set made of new examples: we aim at good generalization.

SLIDE 5

Supervised Learning

Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples). E.g., curve fitting:

[Figure: data points in the (x, f(x)) plane, fitted by candidate curves of increasing complexity]

Ockham’s razor: maximize a combination of consistency and simplicity

SLIDE 6

If we have a probability distribution over the hypotheses:
h∗ = argmax_{h∈H} Pr(h | data) = argmax_{h∈H} Pr(data | h) Pr(h)
There is a trade-off between the expressiveness of a hypothesis space and the complexity of finding a good hypothesis within that space.

SLIDE 7

Outline

  • 1. Decision Trees
  • 2. k-Nearest Neighbor
  • 3. Linear Models

SLIDE 8

Learning Decision Trees

A decision tree represents a function that takes an input attribute vector x (Boolean, discrete, or continuous) and outputs a single Boolean value y. E.g., situations where I will/won't wait for a table. Training set:

Example | Alt | Bar | Fri | Hun | Pat  | Price | Rain | Res | Type    | Est   | WillWait
X1      |  T  |  F  |  F  |  T  | Some | $$$   |  F   |  T  | French  | 0–10  | T
X2      |  T  |  F  |  F  |  T  | Full | $     |  F   |  F  | Thai    | 30–60 | F
X3      |  F  |  T  |  F  |  F  | Some | $     |  F   |  F  | Burger  | 0–10  | T
X4      |  T  |  F  |  T  |  T  | Full | $     |  F   |  F  | Thai    | 10–30 | T
X5      |  T  |  F  |  T  |  F  | Full | $$$   |  F   |  T  | French  | >60   | F
X6      |  F  |  T  |  F  |  T  | Some | $$    |  T   |  T  | Italian | 0–10  | T
X7      |  F  |  T  |  F  |  F  | None | $     |  T   |  F  | Burger  | 0–10  | F
X8      |  F  |  F  |  F  |  T  | Some | $$    |  T   |  T  | Thai    | 0–10  | T
X9      |  F  |  T  |  T  |  F  | Full | $     |  T   |  F  | Burger  | >60   | F
X10     |  T  |  T  |  T  |  T  | Full | $$$   |  F   |  T  | Italian | 10–30 | F
X11     |  F  |  F  |  F  |  F  | None | $     |  F   |  F  | Thai    | 0–10  | F
X12     |  T  |  T  |  T  |  T  | Full | $     |  F   |  F  | Burger  | 30–60 | T

Classification of examples positive (T) or negative (F)

SLIDE 9

Decision trees

One possible representation for hypotheses E.g., here is the “true” tree for deciding whether to wait:

[Figure: the "true" tree. Root Patrons? (None → No, Some → Yes, Full → WaitEstimate?); WaitEstimate? branches on >60, 30−60, 10−30, 0−10, with further tests on Alternate?, Hungry?, Reservation?, Bar?, Fri/Sat?, and Raining? leading to Yes/No leaves.]

SLIDE 10

Example

SLIDE 11

Example

SLIDE 12

Expressiveness

Decision trees can express any function of the input attributes. E.g., for Boolean functions, truth table row → path to leaf:

A | B | A xor B
--|---|--------
F | F |   F
F | T |   T
T | F |   T
T | T |   F

[Figure: decision tree for A xor B: test A at the root, then B on each branch, with the matching truth-table value at each leaf]

Trivially, there is a consistent decision tree for any training set with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples. Prefer to find more compact decision trees.

SLIDE 13

Hypothesis spaces

How many distinct decision trees with n Boolean attributes?
= number of Boolean functions
= number of distinct truth tables with 2^n rows
= 2^(2^n) functions
E.g., with 6 Boolean attributes, there are 2^64 = 18,446,744,073,709,551,616 trees.
A more expressive hypothesis space
– increases the chance that the target function can be expressed
– increases the number of hypotheses consistent with the training set
⇒ may get worse predictions
There is no way to search for the smallest consistent tree among 2^(2^n) of them.

SLIDE 14

Heuristic approach

Greedy divide-and-conquer: test the most important attribute first, then divide the problem into smaller subproblems that can be solved recursively.

function DTL(examples, attributes, default) returns a decision tree
    if examples is empty then return default
    else if all examples have the same classification then return the classification
    else if attributes is empty then return Plurality-Value(examples)
    else
        best ← Choose-Attribute(attributes, examples)
        tree ← a new decision tree with root test best
        for each value vi of best do
            examplesi ← {elements of examples with best = vi}
            subtree ← DTL(examplesi, attributes − best, Plurality-Value(examples))
            add a branch to tree with label vi and subtree subtree
        return tree
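As a concrete illustration, here is a minimal Python sketch of this greedy scheme, assuming discrete attributes; the names (dtl, choose_attribute, plurality_value) and the (attribute-dict, label) example encoding are illustrative, not from any library.

```python
# Minimal sketch of greedy decision-tree learning; examples are
# (attribute-dict, label) pairs and attributes is a set of keys.
from collections import Counter
import math

def plurality_value(examples):
    """Most common label among the examples."""
    return Counter(y for _, y in examples).most_common(1)[0][0]

def entropy(examples):
    n = len(examples)
    return -sum(c / n * math.log2(c / n)
                for c in Counter(y for _, y in examples).values())

def choose_attribute(attributes, examples):
    """Attribute with minimum expected remaining entropy, i.e. max gain."""
    def remainder(a):
        total = 0.0
        for v in {x[a] for x, _ in examples}:
            sub = [(x, y) for x, y in examples if x[a] == v]
            total += len(sub) / len(examples) * entropy(sub)
        return total
    return min(attributes, key=remainder)

def dtl(examples, attributes, default):
    if not examples:
        return default
    labels = {y for _, y in examples}
    if len(labels) == 1:
        return labels.pop()                 # all same classification
    if not attributes:
        return plurality_value(examples)
    best = choose_attribute(attributes, examples)
    branches = {}
    for v in {x[best] for x, _ in examples}:
        sub = [(x, y) for x, y in examples if x[best] == v]
        branches[v] = dtl(sub, attributes - {best}, plurality_value(examples))
    return (best, branches)                 # tree as (test, {value: subtree})
```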

SLIDE 15

Choosing an attribute

Idea: a good attribute splits the examples into subsets that are (ideally) “all positive” or “all negative”

[Figure: splitting the 12 examples by Patrons? (None/Some/Full) versus Type? (French/Italian/Thai/Burger)]

Patrons? is a better choice—gives information about the classification

SLIDE 16

Information

The more clueless I am about the answer initially, the more information is contained in the answer:
• 0 bits to answer a query on a coin with only heads
• 1 bit to answer a query on a Boolean question with prior ⟨0.5, 0.5⟩
• 2 bits to answer a query on a fair die with 4 faces
• a query on a coin with 99% probability of returning heads brings less information than the query on a fair coin
Shannon formalized this with the concept of entropy: a random variable X with values xk, each with probability Pr(xk), has entropy
H(X) = − Σ_k Pr(xk) log2 Pr(xk)
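A quick numeric check of these examples (a small sketch of the formula above, not library code):

```python
import math

def entropy(probs):
    """H in bits: -sum p log2 p, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0]))         # coin with only heads: 0.0 bits
print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit
print(entropy([0.25] * 4))    # fair 4-faced die: 2.0 bits
print(entropy([0.99, 0.01]))  # 99%-heads coin: ~0.081 bits
```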

SLIDE 17

Suppose we have p positive and n negative examples in a training set; then the entropy is H(p/(p + n), n/(p + n)).
E.g., for the 12 restaurant examples, p = n = 6, so we need 1 bit to classify a new example: this is the information content of the table.
An attribute A splits the training set E into subsets E1, . . . , Ed, each of which (we hope) needs less information to complete the classification.
Let Ei have pi positive and ni negative examples: H(pi/(pi + ni), ni/(pi + ni)) bits are needed to classify a new example on that branch.
The expected entropy after branching is
Remainder(A) = Σ_i (pi + ni)/(p + n) · H(pi/(pi + ni), ni/(pi + ni))
The information gain from attribute A is
Gain(A) = H(p/(p + n), n/(p + n)) − Remainder(A)
⇒ choose the attribute that maximizes the gain
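A hedged worked example, reading the (pi, ni) counts for Patrons? and Type? off the 12-example table above:

```python
import math

def entropy(probs):
    return -sum(q * math.log2(q) for q in probs if q > 0)

def H(p, n):
    return entropy([p / (p + n), n / (p + n)])

def gain(splits, p=6, n=6):
    """splits: one (pi, ni) pair per attribute value."""
    remainder = sum((pi + ni) / (p + n) * H(pi, ni) for pi, ni in splits)
    return H(p, n) - remainder

# Patrons?: None -> (0, 2), Some -> (4, 0), Full -> (2, 4)
print(gain([(0, 2), (4, 0), (2, 4)]))          # ~0.541 bits
# Type?: French (1, 1), Italian (1, 1), Thai (2, 2), Burger (2, 2)
print(gain([(1, 1), (1, 1), (2, 2), (2, 2)]))  # 0.0 bits
```

Patrons? is chosen at the root, matching the learned tree shown on the next slide.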

SLIDE 18

Example contd.

Decision tree learned from the 12 examples:

[Figure: the learned tree. Root Patrons? (None → No, Some → Yes, Full → Hungry?); Hungry? (No → No, Yes → Type?); Type? (French → Yes, Italian → No, Thai → Fri/Sat?, Burger → Yes); Fri/Sat? (No → No, Yes → Yes).]

Substantially simpler than the "true" tree: a more complex hypothesis isn't justified by a small amount of data.

SLIDE 19

Performance measurement

Learning curve = % correct on the test set as a function of training set size. Restaurant data; graph averaged over 20 trials.

SLIDE 20

Overfitting and Pruning

Pruning by statistical testing: under the null hypothesis that the attribute is irrelevant, the expected numbers p̂k and n̂k of positives and negatives in the k-th child are
p̂k = p · (pk + nk)/(p + n)    n̂k = n · (pk + nk)/(p + n)
Δ = Σ_{k=1}^{d} [ (pk − p̂k)²/p̂k + (nk − n̂k)²/n̂k ]
Under the null hypothesis, Δ follows a χ² distribution with d − 1 degrees of freedom.
Early stopping misses combinations of attributes that are informative.
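A hedged sketch of the test in Python, using scipy only for the χ² tail probability; the function names are illustrative.

```python
# Chi-squared significance test for a split: does the distribution of
# positives/negatives across children deviate from pure chance?
from scipy.stats import chi2

def chi2_delta(splits):
    """splits: list of (pk, nk) per child; returns (Delta, dof)."""
    p = sum(pk for pk, _ in splits)
    n = sum(nk for _, nk in splits)
    delta = 0.0
    for pk, nk in splits:
        ek_p = p * (pk + nk) / (p + n)  # expected positives p̂k
        ek_n = n * (pk + nk) / (p + n)  # expected negatives n̂k
        delta += (pk - ek_p) ** 2 / ek_p + (nk - ek_n) ** 2 / ek_n
    return delta, len(splits) - 1

delta, dof = chi2_delta([(0, 2), (4, 0), (2, 4)])  # the Patrons? split
p_value = chi2.sf(delta, dof)
print(delta, p_value)  # prune the split if p_value is large (looks like chance)
```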

SLIDE 21

Further Issues

• Missing data
• Multivalued attributes
• Continuous input attributes
• Continuous-valued output attributes

SLIDE 22

Decision Trees

SLIDE 23

Decision Tree Types

• Classification tree analysis: the predicted outcome is the class to which the data belongs. Iterative Dichotomiser 3 (ID3), C4.5 (Quinlan, 1986).
• Regression tree analysis: the predicted outcome can be considered a real number (e.g., the price of a house, or a patient's length of stay in a hospital).
• Classification And Regression Tree (CART) analysis refers to both of the above procedures, first introduced by Breiman et al. (1984).
• CHi-squared Automatic Interaction Detector (CHAID): performs multi-level splits when computing classification trees (Kass, 1980).
• A Random Forest classifier uses a number of decision trees in order to improve the classification rate.
• Boosting Trees can be used for regression-type and classification-type problems.
Used in data mining; most are included in R (see the rpart and party packages) and in Weka, the Waikato Environment for Knowledge Analysis.

SLIDE 24

Outline

  • 1. Decision Trees
  • 2. k-Nearest Neighbor
  • 3. Linear Models

SLIDE 25

Non-parametric learning

With little data available: parametric learning (the hypothesis is restricted by the model selected).
With massive data: we can let the hypothesis grow from the data; this is non-parametric learning.
Instance-based learning: construct the hypothesis directly from the training instances.

SLIDE 26

Predicting Bankruptcy

SLIDE 27

Nearest Neighbor

Basic idea:
• Remember all your data.
• When someone asks a question:
  – find the nearest old data point
  – return the answer associated with it

SLIDE 28

Find the k observations closest to x and average the response:
Ŷ = (1/k) Σ_{xi ∈ Nk(x)} yi
For qualitative outputs, use the majority rule. A distance measure is needed:
• Euclidean
• Standardization: x′ = (x − x̄)/σx (Mahalanobis, scale invariant)
• Hamming
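A minimal k-nearest-neighbor sketch in Python, assuming numeric features that have already been standardized; the names and toy data are purely illustrative.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, x, k=3, classify=True):
    """train: list of (feature tuple, label or response) pairs."""
    neighbors = sorted(train, key=lambda ex: euclidean(ex[0], x))[:k]
    values = [y for _, y in neighbors]
    if classify:
        return Counter(values).most_common(1)[0][0]  # majority rule
    return sum(values) / k                           # average response

train = [((1.0, 2.0), 'T'), ((1.5, 1.8), 'T'), ((5.0, 8.0), 'F'),
         ((6.0, 9.0), 'F'), ((1.2, 0.5), 'T')]
print(knn_predict(train, (1.1, 1.9), k=3))  # 'T'
```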

SLIDE 29

Predicting Bankruptcy

SLIDE 30

Predicting Bankruptcy

SLIDE 31

Learning is fast. Lookup takes about n computations; with k-d trees it can be faster. Memory can fill up with all that data.
Problem: the curse of dimensionality. The side length b of a neighborhood that must contain k of N uniformly distributed points in d dimensions satisfies
b^d = k/N ⇒ b = (k/N)^(1/d)
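To make the formula concrete, a tiny hedged computation of b = (k/N)^(1/d):

```python
# Side length of a hypercube that must contain k of N uniform points:
# it grows toward 1 (the whole space) as the dimension d increases.
k, N = 10, 1_000_000
for d in (1, 2, 3, 10, 100):
    print(d, (k / N) ** (1 / d))
# d=1: 0.00001, d=2: ~0.003, d=3: ~0.02, d=10: ~0.32, d=100: ~0.89
```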

SLIDE 32

k-Nearest Neighbor

SLIDE 33

Bankruptcy Example

SLIDE 34

1-Nearest Neighbor

SLIDE 35

Outline

  • 1. Decision Trees
  • 2. k-Nearest Neighbor
  • 3. Linear Models

SLIDE 36

Linear Models

Univariate case: hypothesis space made of linear functions hw(x) = w1 x + w0.
Find w by minimizing the squared loss function:
L(hw) = Σ_{j=1}^{N} L2(yj, hw(xj)) = Σ_{j=1}^{N} (yj − hw(xj))²
w∗ = argmin_w L(hw)
Setting the partial derivatives to zero,
∂L/∂w0 = −2 Σ_j (yj − hw(xj)) = 0
∂L/∂w1 = −2 Σ_j (yj − hw(xj)) xj = 0
gives w0 and w1 in closed form.
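A hedged sketch of the resulting closed form, derived from the two zero-gradient conditions above:

```python
# Univariate least squares in closed form: y ≈ w1*x + w0.
def fit_line(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    w1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    w0 = (sy - w1 * sx) / n
    return w0, w1

print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # (1.0, 2.0): y = 2x + 1
```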

SLIDE 37

Multivariate case: hw(x) = w0 + w1 x1 + . . . + wn xn = w · x
w∗ = argmin_w Σ_j L2(yj, w · xj)
In closed form: w∗ = (XᵀX)⁻¹ Xᵀ y
Basis functions: fixed non-linear functions φj(x), giving hw(x) = w0 + Σ_j wj φj(x)
To avoid overfitting, regularization: minimize EmpLoss(h) + λ · Complexity(h), where
Complexity(h) = Lq(w) = Σ_i |wi|^q
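A hedged numpy sketch of the closed-form solution; an L2-regularized (ridge) variant is shown for comparison. For numerical stability one would normally use np.linalg.lstsq rather than forming the normal equations explicitly.

```python
import numpy as np

def fit_linear(X, y, lam=0.0):
    """X: (N, n) inputs; a column of ones is prepended for w0."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])  # lam=0: plain least squares
    return np.linalg.solve(A, Xb.T @ y)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(fit_linear(X, y))           # ~[1., 2.]
print(fit_linear(X, y, lam=1.0))  # shrunk coefficients
```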

SLIDE 38

Non-Parametric Regression

Instance-based methods, with a similar idea to k-nearest neighbor: for a query point xq, solve the following regression problem:
w∗ = argmin_w Σ_j K(||xq − xj||) (yj − w · xj)²
where K is a kernel function (e.g., a radial kernel).
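A hedged sketch of this locally weighted regression: a weighted least-squares line is refit for every query point. The quadratic kernel and its width are my own illustrative choices; any decaying K would do.

```python
import numpy as np

def kernel(d, width=2.0):
    u = d / width
    return np.maximum(0.0, 1.0 - u * u)  # quadratic kernel, 0 beyond width

def lw_predict(X, y, xq):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    W = np.diag(kernel(np.linalg.norm(X - xq, axis=1)))
    w = np.linalg.pinv(Xb.T @ W @ Xb) @ Xb.T @ W @ y  # weighted least squares
    return np.append(1.0, xq) @ w

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 1.0, 4.0, 9.0, 16.0])   # y = x^2
print(lw_predict(X, y, np.array([2.0])))   # ~4.6: local linear fit near x=2
```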

SLIDE 39

Linear Classification

[Plots: two classes of points in the (x1, x2) plane, shown with and without a linear decision boundary]

Decision boundary described by a x1 + b x2 = 0
hw(x) = 1 if w · x ≥ 0, 0 otherwise
Step function: the gradient is not defined.

SLIDE 40

Logistic Regression

[Plot: the logistic function 1/(1 + exp(−x)), rising from 0 to 1]

hw(x) = 1 / (1 + exp(−w · x))
g′(z) = g(z)(1 − g(z))
g′(w · x) = g(w · x)(1 − g(w · x)) = hw(x)(1 − hw(x))
∂L/∂wi = −2 (y − hw(x)) · g′(w · x) · xi

SLIDE 41

Gradient Descent

Finding local minima of differentiable continuous functions:

w ← any initial value
repeat
    for each wi in w do
        wi ← wi − α ∂L/∂wi
until convergence

Batch gradient descent: L is the sum of the contributions of each example. Guaranteed to converge (for a suitable step size α).
Stochastic gradient descent: one example at a time, in random order. Online. Not guaranteed to converge.
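A hedged sketch of batch gradient descent for the logistic model, using the squared-loss gradient derived on the previous slide; the tiny AND-like dataset is my own illustration.

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(w, x):
    return g(sum(wi * xi for wi, xi in zip(w, x)))

def train_logistic(data, alpha=0.5, epochs=2000):
    """Batch gradient descent; each x carries a leading 1 for the bias w0."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        grads = [0.0] * len(w)
        for x, y in data:
            p = h(w, x)
            for i, xi in enumerate(x):
                # dL/dwi = -2 (y - h) h (1 - h) xi, summed over examples
                grads[i] += -2.0 * (y - p) * p * (1.0 - p) * xi
        w = [wi - alpha * gi for wi, gi in zip(w, grads)]
    return w

# Tiny AND-like dataset, x = (1, x1, x2)
data = [((1, 0, 0), 0), ((1, 0, 1), 0), ((1, 1, 0), 0), ((1, 1, 1), 1)]
w = train_logistic(data)
print([round(h(w, x), 2) for x, _ in data])  # probabilities approach the labels
```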

SLIDE 42

Gradient Descent for Step Function

For the step function the gradient is not defined. However, the perceptron update rule
wi ← wi + α (y − hw(x)) xi
ensures convergence when the data are linearly separable; otherwise convergence is not guaranteed.
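A hedged perceptron sketch implementing this update rule; each x includes a leading 1 so that w0 acts as the threshold, and the OR dataset is an illustrative linearly separable case.

```python
def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def train_perceptron(data, alpha=0.1, epochs=100):
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        for x, y in data:
            err = y - predict(w, x)  # update only on misclassification
            w = [wi + alpha * err * xi for wi, xi in zip(w, x)]
    return w

# Linearly separable OR function: the rule converges to a separating w
data = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
w = train_perceptron(data)
print([predict(w, x) for x, _ in data])  # [0, 1, 1, 1]
```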
