Decision Trees

Aarti Singh
Machine Learning 10-701/15-781
Oct 6, 2010
Learning a good prediction rule
- Learn a mapping f: X → Y
- Best prediction rule: the mapping with the smallest expected prediction error
- Hypothesis space/Function class:
– Parametric classes (Gaussian, binomial, etc.)
– Conditionally independent class densities (Naïve Bayes)
– Linear decision boundary (Logistic regression)
– Nonparametric classes (histograms, nearest neighbor, kernel estimators, Decision Trees – today)
- Given training data, find a hypothesis/function in the class that is close to the best prediction rule.
First …
- What does a decision tree represent?
- Given a decision tree, how do we assign a label to a test point?
Decision Tree for Tax Fraud Detection

    Refund?
    ├─ Yes → NO
    └─ No → MarSt?
            ├─ Married → NO
            └─ Single, Divorced → TaxInc?
                    ├─ < 80K → NO
                    └─ > 80K → YES

- Each internal node: tests one feature Xi
- Each branch from a node: selects one value for Xi
- Each leaf node: predicts Y

Query data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Walking the query through the tree: Refund = No, so take the No branch to MarSt. MarSt = Married, so take the Married branch, which is a leaf. Assign Cheat = "No".
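A minimal sketch of this traversal in Python (the dict-based query encoding and the handling of the exact 80K boundary are illustrative assumptions, not from the slides):

```python
# Follow one branch per node test until a leaf is reached.
def classify(query):
    if query["Refund"] == "Yes":
        return "No"                                   # leaf: NO
    if query["Marital Status"] == "Married":
        return "No"                                   # leaf: NO
    # Single or Divorced: test taxable income against the 80K threshold
    return "No" if query["Taxable Income"] < 80 else "Yes"

query = {"Refund": "No", "Marital Status": "Married", "Taxable Income": 80}
print(classify(query))  # -> "No": Refund = No, then MarSt = Married, leaf NO
```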
Decision Tree more generally…

[Figure: axis-parallel partition of the feature space, with a 0/1 label in each cell.]

- Features can be discrete, continuous or categorical
- Each internal node: tests some set of features {Xi}
- Each branch from a node: selects a set of values for {Xi}
- Each leaf node: predicts Y
So far…
- What does a decision tree represent?
- Given a decision tree, how do we assign a label to a test point?

Now …
- How do we learn a decision tree from training data?
- What is the decision on each leaf?
How to learn a decision tree

- Top-down induction [ID3, C4.5, CART, …]: greedily grow the tree from the root, picking a feature to split on at each node (a sketch follows below).
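A minimal sketch of top-down induction, assuming discrete features and the information-gain criterion developed on the next slides; the function and variable names are illustrative, not the original algorithms' notation:

```python
# ID3-style induction: recursively pick the feature with maximum
# information gain, split on it, and recurse until nodes are pure.
from collections import Counter
import math

def entropy(labels):
    """H(Y) = -sum_y p(y) log2 p(y), over the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    """IG(X) = H(Y) - sum_x P(X = x) H(Y | X = x)."""
    n = len(labels)
    cond = 0.0
    for value in set(r[feature] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[feature] == value]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def build_tree(rows, labels, features):
    # Stop: pure node or no features left -> leaf predicting the majority label
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(rows, labels, f))
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        keep = [i for i, r in enumerate(rows) if r[best] == value]
        tree[best][value] = build_tree([rows[i] for i in keep],
                                       [labels[i] for i in keep],
                                       [f for f in features if f != best])
    return tree

# The X1/X2 example from the next slide: the learned tree splits on X1 first.
rows = [{"X1": a, "X2": b} for a, b in
        [("T","T"), ("T","F"), ("T","T"), ("T","F"),
         ("F","T"), ("F","F"), ("F","T"), ("F","F")]]
print(build_tree(rows, list("TTTTTFFF"), ["X1", "X2"]))
```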
Which feature is best to split?

X1  X2  Y
T   T   T
T   F   T
T   T   T
T   F   T
F   T   T
F   F   F
F   T   F
F   F   F

Split on X1: X1 = T → Y: 4 Ts, 0 Fs (absolutely sure); X1 = F → Y: 1 T, 3 Fs (kind of sure).
Split on X2: X2 = T → Y: 3 Ts, 1 F (kind of sure); X2 = F → Y: 2 Ts, 2 Fs (absolutely unsure).

A split is good if we are more certain about the classification after it – a uniform distribution of labels at a child node is bad.
Which feature is best to split?

Pick the attribute/feature which yields the maximum information gain:

  arg max_i [ H(Y) – H(Y|Xi) ]

where H(Y) is the entropy of Y and H(Y|Xi) is the conditional entropy of Y given Xi.
Entropy

- Entropy of a random variable Y:

  H(Y) = – Σ_y P(Y = y) log2 P(Y = y)

- More uncertainty, more entropy!
- Information Theory interpretation: H(Y) is the expected number of bits needed to encode a randomly drawn value of Y (under the most efficient code).

[Figure: entropy H(Y) vs. p for Y ~ Bernoulli(p) – maximum entropy at the uniform p = 1/2, zero entropy at the deterministic p = 0 or p = 1.]
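A small numeric check of this curve (the function name is illustrative):

```python
# Entropy of Y ~ Bernoulli(p) in bits; by convention 0 * log2(0) = 0.
import math

def bernoulli_entropy(p):
    if p in (0.0, 1.0):       # deterministic -> zero entropy
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(bernoulli_entropy(p), 3))
# p = 0.5 (uniform) gives the maximum, 1 bit; p = 0 or 1 gives 0 bits.
```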
Andrew Moore's Entropy in a Nutshell

High entropy: the values (locations of soup) are unpredictable – almost uniformly sampled throughout the dining room.
Low entropy: the values (locations of soup) are sampled entirely from within the soup bowl.
Information Gain

- Advantage of an attribute = decrease in uncertainty:
  – entropy of Y before the split: H(Y)
  – entropy of Y after splitting based on Xi, weighting each branch by the probability of following it:
    H(Y|Xi) = Σ_x P(Xi = x) H(Y | Xi = x)
- Information gain is the difference: IG(Xi) = H(Y) – H(Y|Xi)
- Max information gain = min conditional entropy
Information Gain

For the X1/X2 table above: IG(X1) – IG(X2) = H(Y|X2) – H(Y|X1) > 0, so splitting on X1 yields the larger information gain.
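A quick check of these numbers, assuming base-2 entropy (the helper name is illustrative):

```python
# Verify the split comparison: H(Y), H(Y|X1), H(Y|X2) for the table above.
import math

def H(ts, fs):
    """Entropy (bits) of a binary label set with `ts` Ts and `fs` Fs."""
    total, h = ts + fs, 0.0
    for c in (ts, fs):
        if c:
            h -= (c / total) * math.log2(c / total)
    return h

H_Y = H(5, 3)                                  # ~0.954 bits
H_Y_given_X1 = 0.5 * H(4, 0) + 0.5 * H(1, 3)   # ~0.406 bits
H_Y_given_X2 = 0.5 * H(3, 1) + 0.5 * H(2, 2)   # ~0.906 bits
print(H_Y - H_Y_given_X1)  # IG(X1) ~ 0.549
print(H_Y - H_Y_given_X2)  # IG(X2) ~ 0.049  -> split on X1
```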
The feature which yields the maximum reduction in entropy provides the maximum information about Y.
Expressiveness of Decision Trees

- Decision trees can express any function of the input features.
  – E.g., for Boolean functions, each truth-table row maps to a path from root to leaf.
- There is a decision tree which perfectly classifies any training set, with one path to a leaf for each example.
- But such a tree won't generalize well to new examples – prefer to find more compact decision trees.
Decision Trees - Overfitting

One training example per leaf overfits; we need a compact/pruned decision tree.
Bias-Variance Tradeoff

A fine partition has small bias but large variance; a coarse partition has small variance but large bias (a small experiment below illustrates this).

[Figure: the ideal classifier, the average classifier, and classifiers trained on different training data sets.]
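A hedged sketch of this tradeoff: train trees of two depths on bootstrap resamples and measure how much their predictions vary across resamples. The synthetic dataset, the depths, and the use of scikit-learn are illustrative assumptions, not from the slides:

```python
# Deeper trees (finer partitions) show higher prediction variance
# across resampled training sets; shallow trees vary less but fit worse.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 1))
y = np.sin(6 * X[:, 0]) + 0.3 * rng.standard_normal(300)
grid = np.linspace(0, 1, 50).reshape(-1, 1)

for depth in (2, 12):            # coarse vs. fine partition
    preds = []
    for _ in range(30):          # bootstrap resamples -> different train sets
        idx = rng.integers(0, len(X), len(X))
        preds.append(DecisionTreeRegressor(max_depth=depth)
                     .fit(X[idx], y[idx]).predict(grid))
    print(depth, np.mean(np.var(preds, axis=0)))  # deeper => higher variance
```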
When to Stop?

- Many strategies for picking simpler trees (a pruning sketch follows below):
  – Pre-pruning
    - Fixed depth
    - Fixed number of leaves
  – Post-pruning
    - Chi-square test
      – Convert the decision tree to a set of rules
      – Eliminate variable values in rules which are independent of the label (using a chi-square test for independence)
      – Simplify the rule set by eliminating unnecessary rules
  – Information Criteria: MDL (Minimum Description Length)

[Figure: the pruned tax-fraud tree – Refund? Yes → NO; No → MarSt (Married / Single, Divorced) – with the TaxInc subtree removed.]
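A hedged sketch of pre- and post-pruning, using scikit-learn as one concrete implementation; the dataset and hyperparameter values are illustrative, and sklearn's cost-complexity pruning stands in for the chi-square and MDL variants above, which it does not implement:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Pre-pruning: cap the depth / number of leaves while growing the tree.
pre = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=8).fit(X, y)

# Post-pruning: grow a full tree, then prune by cost-complexity;
# ccp_alpha plays the role of the complexity penalty.
post = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)

print(pre.get_n_leaves(), post.get_n_leaves())
```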
Information Criteria - MDL

- Penalize complex models by introducing a cost:

  min over trees T:  (– log likelihood of the training data under T)  +  pen(T)

  – the log-likelihood term measures fit on the training data (regression or classification)
  – pen(T) penalizes trees with more leaves

- MDL (Minimum Description Length): penalize complex models based on their information content, taking pen(T) to be the number of bits needed to describe f (the description length).
- Example: binary decision trees
  – k leaves ⇒ 2k – 1 nodes, so 2k – 1 bits to encode the tree structure, plus k bits to encode the 0/1 label of each leaf
  – e.g., 5 leaves ⇒ 9 bits to encode the structure
  (a small bookkeeping sketch follows below)
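A tiny sketch of this bit-counting, assuming a full binary tree with k leaves (the function name is illustrative):

```python
# Description length of a binary decision tree with k leaves:
# 2k - 1 bits for the structure (one bit per node: internal vs. leaf),
# plus k bits for the 0/1 label at each leaf.
def description_length_bits(k_leaves):
    structure_bits = 2 * k_leaves - 1
    label_bits = k_leaves
    return structure_bits + label_bits

print(description_length_bits(5))  # 9 structure bits + 5 label bits = 14
```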
So far…
- What does a decision tree represent?
- Given a decision tree, how do we assign a label to a test point?
- How do we learn a decision tree from training data?

Now …
- What is the decision on each leaf?
How to assign a label to each leaf

- Classification – majority vote over the training examples at the leaf
- Regression – constant/linear/polynomial fit to the training examples at the leaf
Regression trees

Average (i.e., fit a constant) using the training data at the leaves.

[Figure: a regression tree splitting on "Num Children?" (≥ 2 vs. < 2), with a constant fit at each leaf.]
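A hedged sketch of a regression tree fitting a constant (the mean) at each leaf, using scikit-learn as one concrete implementation; the synthetic data is illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

reg = DecisionTreeRegressor(max_leaf_nodes=8).fit(X, y)
# Each leaf predicts the average y of the training points it contains.
print(reg.predict([[2.0], [7.5]]))
```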
Connection between nearest neighbor/histogram classifiers and decision trees

Local prediction
- Histogram, kernel density estimation, k-nearest neighbor classifier, kernel regression
[Figure: a histogram classifier – a fixed, uniform partition of the domain D.]

Local adaptive prediction
- Let the neighborhood size adapt to the data – small neighborhoods near the decision boundary (small bias), large neighborhoods elsewhere (small variance)
- Decision tree classifier: majority vote at each leaf
[Figure: a decision tree classifier – an adaptive partition of the domain.]

Histogram Classifier vs Decision Trees
[Figure: ideal classifier vs. decision tree vs. histogram classifier, with 256 cells in each partition.]
Application to Image Coding

[Figure: non-adaptive vs. adaptive partitioning of an image, with 1024 cells in each partition.]
[Figure: JPEG at 0.125 bpp (non-adaptive partitioning) vs. JPEG 2000 at 0.125 bpp (adaptive partitioning).]
What you should know

- Decision trees are one of the most popular data mining tools
  – Simplicity of design
  – Interpretability
  – Ease of implementation
  – Good performance in practice (for small dimensions)
- Information gain to select attributes (ID3, C4.5, …)
- Can be used for classification, regression and density estimation too
- Decision trees will overfit!!!
  – Must use tricks to find “simple trees”, e.g.,
    - Pre-pruning: fixed depth / fixed number of leaves
    - Post-pruning: chi-square test of independence
    - Complexity-penalized / MDL model selection