Decision Trees

Aarti Singh
Machine Learning 10-701/15-781
Oct 6, 2010
Learning a good prediction rule
- Learn a mapping f: X → Y
- Best prediction rule: the mapping with the smallest expected prediction error
- Hypothesis space/Function class:
– Parametric classes (Gaussian, binomial, etc.)
– Conditionally independent class densities (Naïve Bayes)
– Linear decision boundary (Logistic regression)
– Nonparametric classes (histograms, nearest neighbor, kernel estimators, Decision Trees – today)
- Given training data, find a hypothesis/function in the class that is close to the best prediction rule.
First …
- What does a decision tree represent?
- Given a decision tree, how do we assign a label to a test point?
Decision Tree for Tax Fraud Detection

    Refund?
    ├─ Yes → NO
    └─ No → MarSt?
            ├─ Married → NO
            └─ Single, Divorced → TaxInc?
                    ├─ < 80K → NO
                    └─ > 80K → YES

- Each internal node: tests one feature Xi
- Each branch from a node: selects one value for Xi
- Each leaf node: predicts Y

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Walking the query through the tree: Refund = No, so take the No branch to MarSt. MarSt = Married, so take the Married branch, which is a leaf. Assign Cheat = "No".
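A minimal sketch of this traversal in Python (the dict-based query encoding and the handling of the exact 80K boundary are illustrative assumptions, not from the slides):

```python
# Follow one branch per node test until a leaf is reached.
def classify(query):
    if query["Refund"] == "Yes":
        return "No"                                   # leaf: NO
    if query["Marital Status"] == "Married":
        return "No"                                   # leaf: NO
    # Single or Divorced: test taxable income against the 80K threshold
    return "No" if query["Taxable Income"] < 80 else "Yes"

query = {"Refund": "No", "Marital Status": "Married", "Taxable Income": 80}
print(classify(query))  # -> "No": Refund = No, then MarSt = Married, leaf NO
```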
Decision Tree more generally…

[Figure: axis-parallel partition of the feature space, with a 0/1 label in each cell.]

- Features can be discrete, continuous or categorical
- Each internal node: tests some set of features {Xi}
- Each branch from a node: selects a set of values for {Xi}
- Each leaf node: predicts Y
So far…
- What does a decision tree represent?
- Given a decision tree, how do we assign a label to a test point?

Now …
- How do we learn a decision tree from training data?
- What is the decision on each leaf?
How to learn a decision tree

- Top-down induction [ID3, C4.5, CART, …]: greedily grow the tree from the root, picking a feature to split on at each node (a sketch follows below).
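A minimal sketch of top-down induction, assuming discrete features and the information-gain criterion developed on the next slides; the function and variable names are illustrative, not the original algorithms' notation:

```python
# ID3-style induction: recursively pick the feature with maximum
# information gain, split on it, and recurse until nodes are pure.
from collections import Counter
import math

def entropy(labels):
    """H(Y) = -sum_y p(y) log2 p(y), over the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    """IG(X) = H(Y) - sum_x P(X = x) H(Y | X = x)."""
    n = len(labels)
    cond = 0.0
    for value in set(r[feature] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[feature] == value]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def build_tree(rows, labels, features):
    # Stop: pure node or no features left -> leaf predicting the majority label
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(rows, labels, f))
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        keep = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = build_tree([rows[i] for i in keep],
                                       [labels[i] for i in keep],
                                       [f for f in features if f != best])
    return tree

# The X1/X2 example from the next slide: the learned tree splits on X1 first.
rows = [{"X1": a, "X2": b} for a, b in
        [("T","T"), ("T","F"), ("T","T"), ("T","F"),
         ("F","T"), ("F","F"), ("F","T"), ("F","F")]]
print(build_tree(rows, list("TTTTTFFF"), ["X1", "X2"]))
```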
Which feature is best to split?

X1  X2  Y
T   T   T
T   F   T
T   T   T
T   F   T
F   T   T
F   F   F
F   T   F
F   F   F

Split on X1: X1 = T → Y: 4 Ts, 0 Fs (absolutely sure); X1 = F → Y: 1 T, 3 Fs (kind of sure).
Split on X2: X2 = T → Y: 3 Ts, 1 F (kind of sure); X2 = F → Y: 2 Ts, 2 Fs (absolutely unsure).

A split is good if we are more certain about the classification after it – a uniform distribution of labels at a child node is bad.
Which feature is best to split?

Pick the attribute/feature which yields the maximum information gain:

  arg max_i [ H(Y) – H(Y|Xi) ]

where H(Y) is the entropy of Y and H(Y|Xi) is the conditional entropy of Y given Xi.
Entropy

- Entropy of a random variable Y:

  H(Y) = – Σ_y P(Y = y) log2 P(Y = y)

- More uncertainty, more entropy!
- Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).

[Figure: entropy H(Y) vs. p for Y ~ Bernoulli(p) – maximum entropy at the uniform p = 1/2, zero entropy at the deterministic p = 0 or p = 1.]
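A small numeric check of this curve (the function name is illustrative):

```python
# Entropy of Y ~ Bernoulli(p) in bits; by convention 0 * log2(0) = 0.
import math

def bernoulli_entropy(p):
    if p in (0.0, 1.0):       # deterministic -> zero entropy
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(bernoulli_entropy(p), 3))
# p = 0.5 (uniform) gives the maximum, 1 bit; p = 0 or 1 gives 0 bits.
```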
Andrew Moore's Entropy in a Nutshell

High entropy: the values (locations of soup) are unpredictable – almost uniformly sampled throughout the dining room.
Low entropy: the values (locations of soup) are sampled entirely from within the soup bowl.
Information Gain

- Advantage of an attribute = decrease in uncertainty:
  – entropy of Y before the split: H(Y)
  – entropy of Y after splitting based on Xi, weighting each branch by the probability of following it:
    H(Y|Xi) = Σ_x P(Xi = x) H(Y | Xi = x)
- Information gain is the difference: IG(Xi) = H(Y) – H(Y|Xi)
- Max information gain = min conditional entropy
Information Gain

For the X1/X2 table above: IG(X1) – IG(X2) = H(Y|X2) – H(Y|X1) > 0, so splitting on X1 yields the larger information gain.
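A quick check of these numbers, assuming base-2 entropy (the helper name is illustrative):

```python
# Verify the split comparison: H(Y), H(Y|X1), H(Y|X2) for the table above.
import math

def H(ts, fs):
    """Entropy (bits) of a binary label set with `ts` Ts and `fs` Fs."""
    total, h = ts + fs, 0.0
    for c in (ts, fs):
        if c:
            h -= (c / total) * math.log2(c / total)
    return h

H_Y = H(5, 3)                                  # ~0.954 bits
H_Y_given_X1 = 0.5 * H(4, 0) + 0.5 * H(1, 3)   # ~0.406 bits
H_Y_given_X2 = 0.5 * H(3, 1) + 0.5 * H(2, 2)   # ~0.906 bits
print(H_Y - H_Y_given_X1)  # IG(X1) ~ 0.549
print(H_Y - H_Y_given_X2)  # IG(X2) ~ 0.049  -> split on X1
```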
The feature which yields the maximum reduction in entropy provides the maximum information about Y.
Expressiveness of Decision Trees

- Decision trees can express any function of the input features.
  – E.g., for Boolean functions, each truth-table row maps to a path from root to leaf.
- There is a decision tree which perfectly classifies any training set, with one path to a leaf for each example.
- But such a tree won't generalize well to new examples – prefer to find more compact decision trees.
Decision Trees - Overfitting

One training example per leaf overfits; we need a compact/pruned decision tree.
Bias-Variance Tradeoff

A fine partition has small bias but large variance; a coarse partition has small variance but large bias (a small experiment below illustrates this).

[Figure: the ideal classifier, the average classifier, and classifiers trained on different training data sets.]
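A hedged sketch of this tradeoff: train trees of two depths on bootstrap resamples and measure how much their predictions vary across resamples. The synthetic dataset, the depths, and the use of scikit-learn are illustrative assumptions, not from the slides:

```python
# Deeper trees (finer partitions) show higher prediction variance
# across resampled training sets; shallow trees vary less but fit worse.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 1))
y = np.sin(6 * X[:, 0]) + 0.3 * rng.standard_normal(300)
grid = np.linspace(0, 1, 50).reshape(-1, 1)

for depth in (2, 12):            # coarse vs. fine partition
    preds = []
    for _ in range(30):          # bootstrap resamples -> different train sets
        idx = rng.integers(0, len(X), len(X))
        preds.append(DecisionTreeRegressor(max_depth=depth)
                     .fit(X[idx], y[idx]).predict(grid))
    print(depth, np.mean(np.var(preds, axis=0)))  # deeper => higher variance
```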
When to Stop?

- Many strategies for picking simpler trees (a pruning sketch follows below):
  – Pre-pruning
    - Fixed depth
    - Fixed number of leaves
  – Post-pruning
    - Chi-square test
      – Convert the decision tree to a set of rules
      – Eliminate variable values in rules which are independent of the label (using a chi-square test for independence)
      – Simplify the rule set by eliminating unnecessary rules
  – Information Criteria: MDL (Minimum Description Length)

[Figure: the pruned tax-fraud tree – Refund? Yes → NO; No → MarSt (Married / Single, Divorced) – with the TaxInc subtree removed.]
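A hedged sketch of pre- and post-pruning, using scikit-learn as one concrete implementation; the dataset and hyperparameter values are illustrative, and sklearn's cost-complexity pruning stands in for the chi-square and MDL variants above, which it does not implement:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Pre-pruning: cap the depth / number of leaves while growing the tree.
pre = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=8).fit(X, y)

# Post-pruning: grow a full tree, then prune by cost-complexity;
# ccp_alpha plays the role of the complexity penalty.
post = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)

print(pre.get_n_leaves(), post.get_n_leaves())
```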
Information Criteria - MDL

- Penalize complex models by introducing a cost:

  min over trees T:  (– log likelihood of the training data under T)  +  pen(T)

  – the log-likelihood term measures fit on the training data (regression or classification)
  – pen(T) penalizes trees with more leaves

- MDL (Minimum Description Length): penalize complex models based on their information content, taking pen(T) to be the number of bits needed to describe f (the description length).
- Example: binary decision trees
  – k leaves ⇒ 2k – 1 nodes, so 2k – 1 bits to encode the tree structure, plus k bits to encode the 0/1 label of each leaf
  – e.g., 5 leaves ⇒ 9 bits to encode the structure
  (a small bookkeeping sketch follows below)
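A tiny sketch of this bit-counting, assuming a full binary tree with k leaves (the function name is illustrative):

```python
# Description length of a binary decision tree with k leaves:
# 2k - 1 bits for the structure (one bit per node: internal vs. leaf),
# plus k bits for the 0/1 label at each leaf.
def description_length_bits(k_leaves):
    structure_bits = 2 * k_leaves - 1
    label_bits = k_leaves
    return structure_bits + label_bits

print(description_length_bits(5))  # 9 structure bits + 5 label bits = 14
```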
So far…
- What does a decision tree represent?
- Given a decision tree, how do we assign a label to a test point?
- How do we learn a decision tree from training data?

Now …
- What is the decision on each leaf?
How to assign a label to each leaf

- Classification – majority vote over the training examples at the leaf
- Regression – constant/linear/polynomial fit to the training examples at the leaf
Regression trees

Average (i.e., fit a constant) using the training data at the leaves.

[Figure: a regression tree splitting on "Num Children?" (≥ 2 vs. < 2), with a constant fit at each leaf.]
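A hedged sketch of a regression tree fitting a constant (the mean) at each leaf, using scikit-learn as one concrete implementation; the synthetic data is illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

reg = DecisionTreeRegressor(max_leaf_nodes=8).fit(X, y)
# Each leaf predicts the average y of the training points it contains.
print(reg.predict([[2.0], [7.5]]))
```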
Connection between nearest neighbor/histogram classifiers and decision trees

Local prediction
- Histogram, kernel density estimation, k-nearest neighbor classifier, kernel regression
[Figure: a histogram classifier – a fixed, uniform partition of the domain D.]

Local adaptive prediction
- Let the neighborhood size adapt to the data – small neighborhoods near the decision boundary (small bias), large neighborhoods elsewhere (small variance)
- Decision tree classifier: majority vote at each leaf
[Figure: a decision tree classifier – an adaptive partition of the domain.]

Histogram Classifier vs Decision Trees
[Figure: ideal classifier vs. decision tree vs. histogram classifier, with 256 cells in each partition.]
Application to Image Coding

[Figure: non-adaptive vs. adaptive partitioning of an image, with 1024 cells in each partition.]
[Figure: JPEG at 0.125 bpp (non-adaptive partitioning) vs. JPEG 2000 at 0.125 bpp (adaptive partitioning).]
What you should know

- Decision trees are one of the most popular data mining tools
  – Simplicity of design
  – Interpretability
  – Ease of implementation
  – Good performance in practice (for small dimensions)
- Information gain to select attributes (ID3, C4.5, …)
- Can be used for classification, regression and density estimation too
- Decision trees will overfit!!!
  – Must use tricks to find “simple trees”, e.g.,
    - Pre-pruning: fixed depth / fixed number of leaves
    - Post-pruning: chi-square test of independence
    - Complexity-penalized / MDL model selection