SLIDE 1

Decision Trees

Aarti Singh

Machine Learning 10-701/15-781, Oct 6, 2010

SLIDE 2

Learning a good prediction rule

  • Learn a mapping from inputs X to labels Y
  • Best prediction rule
  • Hypothesis space/Function class
    – Parametric classes (Gaussian, binomial etc.)
    – Conditionally independent class densities (Naïve Bayes)
    – Linear decision boundary (Logistic regression)
    – Nonparametric class (Histograms, nearest neighbor, kernel estimators, Decision Trees – Today)
  • Given training data, find a hypothesis/function in that class that is close to the best prediction rule.

SLIDE 3

First …

  • What does a decision tree represent?
  • Given a decision tree, how do we assign a label to a test point?

SLIDE 4

Decision Tree for Tax Fraud Detection

[Figure: decision tree. Root node Refund: Yes → NO; No → MarSt. MarSt: Married → NO; Single, Divorced → TaxInc. TaxInc: < 80K → NO; > 80K → YES.]

  • Each internal node: test one feature Xi
  • Each branch from a node: selects one value for Xi
  • Each leaf node: predict Y

Query Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

SLIDE 5

Decision Tree for Tax Fraud Detection

[The tree and the query data from the previous slide are repeated; the query is traced down the tree over the next few slides.]

SLIDE 6

Decision Tree for Tax Fraud Detection

[Tree and query data repeated.]

SLIDE 7

Decision Tree for Tax Fraud Detection

Refund = No for the query, so follow the No branch from the root to the MarSt node.
SLIDE 8

Decision Tree for Tax Fraud Detection

[Repeats the previous step: Refund = No, traversal continues at the MarSt node.]
SLIDE 9

Decision Tree for Tax Fraud Detection

Marital Status = Married for the query, so follow the Married branch from MarSt.
SLIDE 10

Decision Tree for Tax Fraud Detection

The Married branch leads to a leaf labeled NO.

Assign Cheat to “No”
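
The traversal on slides 4 through 10 can be written as a handful of nested feature tests. Below is a minimal Python sketch of that walk (my illustration, not code from the lecture); the feature names, thresholds, and query record come from the slides, while the dictionary keys are just an assumed encoding.

```python
# Minimal sketch of the tax-fraud tree as nested feature tests.
# Each "if" is an internal node testing one feature; each return is a leaf predicting Y.

def predict_cheat(record):
    if record["Refund"] == "Yes":
        return "No"                       # leaf: NO
    # Refund = No -> test marital status
    if record["MaritalStatus"] == "Married":
        return "No"                       # leaf: NO
    # Single or Divorced -> test taxable income
    if record["TaxableIncome"] < 80_000:
        return "No"                       # leaf: NO  (< 80K)
    return "Yes"                          # leaf: YES (> 80K)

# Query data from the slides: Refund = No, Married, 80K taxable income.
query = {"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 80_000}
print(predict_cheat(query))               # -> "No": assign Cheat to "No"
```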

SLIDE 11

Decision Tree more generally…

  • Features can be discrete, continuous or categorical
  • Each internal node: test some set of features {Xi}
  • Each branch from a node: selects a set of values for {Xi}
  • Each leaf node: predict Y

SLIDE 12

So far…

  • What does a decision tree represent?
  • Given a decision tree, how do we assign a label to a test point?

Now …

  • How do we learn a decision tree from training data?
  • What is the decision on each leaf?

SLIDE 13

(Same as the previous slide.)

SLIDE 14

How to learn a decision tree

  • Top-down induction [ID3, C4.5, CART, …]

[Figure: the tax-fraud decision tree from the earlier slides.]

SLIDE 15

Which feature is best to split?

  X1  X2  Y
  T   T   T
  T   F   T
  T   T   T
  T   F   T
  F   T   T
  F   F   F
  F   T   F
  F   F   F

  Split on X1:  X1 = T → Y: 4 Ts, 0 Fs (absolutely sure)    X1 = F → Y: 1 T, 3 Fs (kind of sure)
  Split on X2:  X2 = T → Y: 3 Ts, 1 F (kind of sure)        X2 = F → Y: 2 Ts, 2 Fs (absolutely unsure)

Good split if we are more certain about classification after split – uniform distribution of labels is bad

SLIDE 16

Which feature is best to split?

Pick the attribute/feature which yields maximum information gain:

    IG(Xi) = H(Y) – H(Y|Xi)   (pick the Xi that maximizes this)

  H(Y): entropy of Y.  H(Y|Xi): conditional entropy of Y given Xi.

SLIDE 17

Entropy

  • Entropy of a random variable Y:

      H(Y) = – Σy P(Y = y) log2 P(Y = y)

  • More uncertainty, more entropy!  For Y ~ Bernoulli(p):

      H(Y) = – p log2 p – (1 – p) log2 (1 – p)

  • Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code)

[Figure: H(Y) as a function of p. Uniform (p = 1/2): max entropy. Deterministic (p = 0 or 1): zero entropy.]
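
As a quick check of the definition above, here is a small sketch (mine, not the lecture's) that computes H(Y) from a list of class probabilities; for Y ~ Bernoulli(p) it reproduces the curve in the figure, with 1 bit at p = 1/2 and 0 bits at p = 0 or 1.

```python
import math

def entropy(probs):
    """H(Y) = - sum_y P(Y = y) * log2 P(Y = y), with 0 * log 0 taken as 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Entropy of Bernoulli(p) at a few values of p.
for p in [0.0, 0.1, 0.5, 0.9, 1.0]:
    print(p, round(entropy([p, 1 - p]), 3))
# p = 0.5 (uniform) gives the maximum, 1.0 bit; p = 0 or 1 (deterministic) gives 0 bits.
```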

SLIDE 18

Andrew Moore’s Entropy in a Nutshell

[Figure: two sampling examples, labeled “Low Entropy” and “High Entropy”.]

  • High Entropy: the values (locations of soup) are unpredictable, almost uniformly sampled throughout our dining room
  • Low Entropy: the values (locations of soup) are sampled entirely from within the soup bowl

SLIDE 19

Information Gain

  • Advantage of attribute = decrease in uncertainty
    – Entropy of Y before split: H(Y)
    – Entropy of Y after splitting based on Xi: H(Y|Xi)
  • Weight by probability of following each branch:
      H(Y|Xi) = Σx P(Xi = x) H(Y | Xi = x)
  • Information gain is the difference:
      IG(Xi) = H(Y) – H(Y|Xi)

  Max Information gain = min conditional entropy

SLIDE 20

Information Gain

  X1  X2  Y
  T   T   T
  T   F   T
  T   T   T
  T   F   T
  F   T   T
  F   F   F
  F   T   F
  F   F   F

  Split on X1:  X1 = T → Y: 4 Ts, 0 Fs    X1 = F → Y: 1 T, 3 Fs
  Split on X2:  X2 = T → Y: 3 Ts, 1 F     X2 = F → Y: 2 Ts, 2 Fs

  Information gain H(Y) – H(Y|Xi) of these splits is > 0
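
To make the comparison concrete, the sketch below (my own, not from the lecture) computes H(Y), the conditional entropies, and the information gain of splitting on X1 and on X2 for the eight rows above; splitting on X1 gives the larger gain, matching the intuition from slide 15.

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H(Y) of a list of labels, in bits."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, feature):
    """IG(Xi) = H(Y) - sum_v P(Xi = v) * H(Y | Xi = v)."""
    n = len(rows)
    gain = entropy([r["Y"] for r in rows])
    for v in set(r[feature] for r in rows):
        subset = [r["Y"] for r in rows if r[feature] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# The eight training examples from the table above.
data = [
    {"X1": "T", "X2": "T", "Y": "T"}, {"X1": "T", "X2": "F", "Y": "T"},
    {"X1": "T", "X2": "T", "Y": "T"}, {"X1": "T", "X2": "F", "Y": "T"},
    {"X1": "F", "X2": "T", "Y": "T"}, {"X1": "F", "X2": "F", "Y": "F"},
    {"X1": "F", "X2": "T", "Y": "F"}, {"X1": "F", "X2": "F", "Y": "F"},
]
print(information_gain(data, "X1"))   # about 0.55 bits
print(information_gain(data, "X2"))   # about 0.05 bits, so X1 is the better split
```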

SLIDE 21

Which feature is best to split?

Pick the attribute/feature which yields maximum information gain:

    IG(Xi) = H(Y) – H(Y|Xi)

  H(Y): entropy of Y.  H(Y|Xi): conditional entropy of Y given Xi.

Feature which yields maximum reduction in entropy provides maximum information about Y
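
Putting slides 14 through 21 together, a greedy top-down learner repeatedly picks the feature with maximum information gain, splits the data, and recurses until a node is pure or no features remain. The sketch below is my own ID3-flavoured illustration of that loop, not the lecture's code; it repeats the small entropy and gain helpers so the block runs on its own.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, feat):
    n = len(rows)
    gain = entropy([r["Y"] for r in rows])
    for v in set(r[feat] for r in rows):
        sub = [r["Y"] for r in rows if r[feat] == v]
        gain -= len(sub) / n * entropy(sub)
    return gain

def build_tree(rows, features):
    """Greedy top-down induction: split on the max-information-gain feature."""
    labels = [r["Y"] for r in rows]
    if len(set(labels)) == 1 or not features:
        # Leaf: node is pure or no features left; predict the majority label.
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(rows, f))
    children = {}
    for v in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == v]
        children[v] = build_tree(subset, [f for f in features if f != best])
    return (best, children)   # internal node: (feature, {value: subtree})

def predict(tree, record):
    while isinstance(tree, tuple):          # walk down until a leaf label is reached
        feature, children = tree
        tree = children[record[feature]]
    return tree

# With the eight examples from the previous sketch:
#   tree = build_tree(data, ["X1", "X2"])   # splits on X1 first (largest gain)
#   predict(tree, {"X1": "T", "X2": "F"})   # -> "T"
```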

SLIDE 22

Expressiveness of Decision Trees

  • Decision trees can express any function of the input features.
  • E.g., for Boolean functions, truth table row → path to leaf.
  • There is a decision tree which perfectly classifies a training set, with one path to leaf for each example.
  • But it won't generalize well to new examples - prefer to find more compact decision trees.

SLIDE 23

Decision Trees - Overfitting

One training example per leaf – overfits, need compact/pruned decision tree

SLIDE 24

Bias-Variance Tradeoff

[Figure: coarse partition: variance small, bias large. Fine partition: variance large, bias small. Shown: ideal classifier, average classifier, and classifiers based on different training data.]

SLIDE 25

When to Stop?

  • Many strategies for picking simpler trees:
    – Pre-pruning
      • Fixed depth
      • Fixed number of leaves
    – Post-pruning
      • Chi-square test (a small sketch follows below)
        – Convert decision tree to a set of rules
        – Eliminate variable values in rules which are independent of label (using chi-square test for independence)
        – Simplify rule set by eliminating unnecessary rules
    – Information Criteria: MDL (Minimum Description Length)

[Figure: the tax-fraud tree with the TaxInc subtree pruned; the Refund and MarSt nodes remain.]
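
As a rough illustration of the chi-square idea in the post-pruning bullet (my sketch, not the procedure from the lecture), one can tabulate how the labels are distributed across the branches of a split and keep the split only when the label actually depends on the branch; `scipy.stats.chi2_contingency` is one readily available test, and the 0.05 threshold below is an arbitrary choice.

```python
# Sketch: chi-square test of independence between "which branch an example
# follows" and its label, as a criterion for keeping or pruning a split.
from scipy.stats import chi2_contingency

def split_is_significant(branch_label_counts, alpha=0.05):
    """branch_label_counts: contingency table with one row per branch and one
    column per class label, holding training-example counts."""
    chi2, p_value, dof, expected = chi2_contingency(branch_label_counts)
    return p_value < alpha    # prune the split if labels look independent of the branch

# Hypothetical counts: branch A sees 30 "Yes" / 10 "No", branch B sees 10 / 30.
print(split_is_significant([[30, 10], [10, 30]]))   # -> True, keep this split
```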

SLIDE 26

Information Criteria

  • Penalize complex models by introducing cost

  Minimize: negative log likelihood (data-fit term, for regression or classification) + a cost that penalizes trees with more leaves

SLIDE 27

Information Criteria - MDL

Penalize complex models based on their information content: MDL (Minimum Description Length)

Example: binary decision trees

  k leaves => 2k – 1 nodes
  2k – 1 bits to encode tree structure + k bits to encode the label of each leaf (0/1)
  = # bits needed to describe f (description length)

  e.g., 5 leaves => 9 bits to encode structure
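
The bit counting on this slide is easy to verify mechanically; the short sketch below (illustrative only) computes the description length of a binary decision tree with k leaves as structure bits plus leaf-label bits.

```python
def description_length(num_leaves):
    """Bits to describe a binary decision tree f with k leaves:
    2k - 1 bits for the structure plus k bits for the 0/1 label at each leaf."""
    structure_bits = 2 * num_leaves - 1
    label_bits = num_leaves
    return structure_bits + label_bits

print(description_length(5))   # 9 structure bits + 5 label bits = 14 bits total
```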

SLIDE 28

So far…

  • What does a decision tree represent?
  • Given a decision tree, how do we assign a label to a test point?

Now …

  • How do we learn a decision tree from training data?
  • What is the decision on each leaf?

SLIDE 29

How to assign label to each leaf

  Classification – Majority vote
  Regression – ?

SLIDE 30

How to assign label to each leaf

  Classification – Majority vote
  Regression – Constant / Linear / Poly fit
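
In code, the two leaf rules on this slide are one line each; the sketch below (my illustration) gives a classification leaf the majority vote of the training labels that reach it, and a regression leaf the mean (a constant fit) of the training targets.

```python
from collections import Counter

def classification_leaf(labels):
    """Majority vote over the training labels that fall in the leaf."""
    return Counter(labels).most_common(1)[0][0]

def regression_leaf(targets):
    """Constant fit: the average of the training targets in the leaf."""
    return sum(targets) / len(targets)

print(classification_leaf(["No", "No", "Yes", "No"]))   # -> "No"
print(regression_leaf([3.0, 5.0, 4.0]))                 # -> 4.0
```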

SLIDE 31

Regression trees

  Average (fit a constant) using training data at the leaves

[Figure: a regression tree splitting on “Num Children?” (< 2 vs ≥ 2).]

SLIDE 32

Connection between nearest neighbor/histogram classifiers and decision trees


SLIDE 33

Local prediction

  Histogram, kernel density estimation, k-nearest neighbor classifier, kernel regression

[Figure: histogram classifier, with a cell/neighborhood labeled D.]

SLIDE 34

Local Adaptive prediction

  Let neighborhood size adapt to data – small neighborhoods near decision boundary (small bias), large neighborhoods elsewhere (small variance)

  Decision Tree Classifier: majority vote at each leaf

[Figure: decision-tree partition, with an adaptive neighborhood labeled Dx.]

SLIDE 35

Histogram Classifier vs Decision Trees

[Figure: ideal classifier vs decision tree vs histogram, 256 cells in each partition.]

SLIDE 36

Application to Image Coding

[Figure: partitions with 1024 cells in each partition.]

SLIDE 37

Application to Image Coding

[Figure: JPEG at 0.125 bpp (non-adaptive partitioning) vs JPEG 2000 at 0.125 bpp (adaptive partitioning).]

SLIDE 38

What you should know

  • Decision trees are one of the most popular data mining tools
    • Simplicity of design
    • Interpretability
    • Ease of implementation
    • Good performance in practice (for small dimensions)
  • Information gain to select attributes (ID3, C4.5, …)
  • Can be used for classification, regression and density estimation too
  • Decision trees will overfit!!!
    – Must use tricks to find “simple trees”, e.g.,
      • Pre-Pruning: Fixed depth/Fixed number of leaves
      • Post-Pruning: Chi-square test of independence
      • Complexity Penalized/MDL model selection