Decision Tree Learning
Based on “Machine Learning”, T. Mitchell, McGraw-Hill, 1997, ch. 3. Acknowledgement: the present slides are an adaptation of slides drawn by T. Mitchell.
PLAN

DT Learning: Basic Issues
1. Concept learning:
   the ID3 algorithm;
   hypothesis space search by ID3;
   statistical measures in decision tree learning: entropy, information gain.
2. Dealing with:
   continuous-valued attributes;
   attributes with many values;
   attributes with different costs;
   training examples with missing attribute values;
   overfitting: reduced-error pruning, and rule post-pruning.
3. Ensemble Learning using DTs: boosting, bagging, Random Forests.
Given the data:

Day  Outlook   Temperature  Humidity  Wind    EnjoyTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

predict the value of EnjoyTennis for: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong.
Example: Decision Tree for EnjoyTennis
Outlook
├─ Sunny → Humidity
│    ├─ High → No
│    └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
     ├─ Strong → No
     └─ Weak → Yes
A tree learned from the medical records of 1000 women; negative examples are C-sections.

[833+,167−] .83+ .17−
Fetal_Presentation = 1: [822+,116−] .88+ .12−
| Previous_Csection = 0: [767+,81−] .90+ .10−
| | Primiparous = 0: [399+,13−] .97+ .03−
| | Primiparous = 1: [368+,68−] .84+ .16−
| | | Fetal_Distress = 0: [334+,47−] .88+ .12−
| | | | Birth_Weight < 3349: [201+,10.6−] .95+ .05−
| | | | Birth_Weight >= 3349: [133+,36.4−] .78+ .22−
| | | Fetal_Distress = 1: [34+,21−] .62+ .38−
| Previous_Csection = 1: [55+,35−] .61+ .39−
Fetal_Presentation = 2: [3+,29−] .11+ .89−
Fetal_Presentation = 3: [8+,22−] .27+ .73−
START: create the root node; assign all examples to the root.
Main loop:
1. A ← the “best” decision attribute for the next node;
2. assign A as the decision attribute for that node;
3. for each value of A, create a new descendant of the node;
4. sort the training examples to the leaf nodes;
5. if the training examples are perfectly classified, then STOP,
   else iterate over the new leaf nodes.
ID3 Algorithm: basic version
ID3(Examples, Target attribute, Attributes)
create a Root node for the tree;
if all Examples are positive, return the single-node tree Root with label = +;
if all Examples are negative, return the single-node tree Root with label = −;
if Attributes is empty, return the single-node tree Root
   with label = the most common value of Target attribute in Examples;
otherwise:
   A ← the attribute from Attributes that best∗ classifies Examples;
   the decision attribute for Root ← A;
   for each possible value vi of A:
      add a new tree branch below Root, corresponding to the test A = vi;
      let Examplesvi be the subset of Examples that have the value vi for A;
      if Examplesvi is empty,
         then below this new branch add a leaf node with label = the most common value of Target attribute in Examples;
         else below this new branch add the subtree ID3(Examplesvi, Target attribute, Attributes \ {A});
   return Root.
∗ The best attribute is the one with the highest information gain.
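A minimal Python sketch of this basic ID3 procedure (illustrative only: the function names, the dictionary-based representation of the examples, and the tuple-based representation of the tree are assumptions, not part of the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, target, attr):
    """Information gain obtained by splitting `examples` on attribute `attr`."""
    labels = [ex[target] for ex in examples]
    gain = entropy(labels)
    for value in set(ex[attr] for ex in examples):
        subset = [ex[target] for ex in examples if ex[attr] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

def id3(examples, target, attributes):
    """Return a tree: either a class label (leaf), or a pair
    (attribute, {value: subtree}) for an internal node."""
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:            # all examples have the same class
        return labels[0]
    if not attributes:                   # no attributes left to test
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, target, a))
    branches = {}
    # branches are created only for the attribute values that actually occur
    for value in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == value]
        branches[value] = id3(subset, target, [a for a in attributes if a != best])
    return (best, branches)

# Hypothetical usage, with the EnjoyTennis data stored as a list of dictionaries:
# tree = id3(day_records, "EnjoyTennis", ["Outlook", "Temperature", "Humidity", "Wind"])
```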
Hypothesis space search by ID3:
− the hypothesis space is complete: the target function surely is in there...
− ID3 outputs a single hypothesis. (Which one?)
− no backtracking: local minima...
− statistically-based search choices: robust to noisy data...
− inductive bias: approximately “prefer the shortest tree”.
[Figure: ID3’s search through the space of decision trees, from simple to complex: the empty tree, then trees testing a single attribute (e.g. A1), then trees that add further tests (A2, A3, A4) below its branches, and so on.]
Gain(S, A) = the expected reduction in entropy caused by partitioning the examples of S according to attribute A:

Gain(S, A) = Entropy(S) − Σv∈Values(A) (|Sv| / |S|) · Entropy(Sv),

where Sv is the subset of S for which A has value v.
S is a sample of training examples; p⊕ is the proportion of positive examples in S, and p⊖ is the proportion of negative examples in S.

Entropy(S) = the expected number of bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of S (under the optimal, shortest-length code).
The optimal-length code for a message having probability p is −log2 p bits. So:
Entropy(S) = p⊕(−log2 p⊕) + p⊖(−log2 p⊖) = −p⊕ log2 p⊕ − p⊖ log2 p⊖
[Plot: Entropy(S) as a function of p⊕ ∈ [0, 1]; the entropy is 0 for p⊕ ∈ {0, 1} and reaches its maximum, 1.0, at p⊕ = 0.5.]
Which attribute is the best classifier?
S: [9+,5−], Entropy(S) = 0.940

Humidity: High → [3+,4−] (E = 0.985), Normal → [6+,1−] (E = 0.592)
Gain(S, Humidity) = 0.940 − (7/14)·0.985 − (7/14)·0.592 = 0.151

Wind: Weak → [6+,2−] (E = 0.811), Strong → [3+,3−] (E = 1.00)
Gain(S, Wind) = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048

Similarly, Gain(S, Outlook) = 0.246 and Gain(S, Temperature) = 0.029.
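These values are easy to check numerically; a small sketch, where H is an assumed convenience wrapper around the entropy formula and the counts are taken from the EnjoyTennis table:

```python
import math

def H(pos, neg):
    """Entropy (bits) of a node with `pos` positive and `neg` negative examples."""
    total = pos + neg
    return -sum(p * math.log2(p) for p in (pos / total, neg / total) if p > 0)

S = H(9, 5)                                                    # 0.940
gain_humidity = S - 7/14 * H(3, 4) - 7/14 * H(6, 1)            # ≈ 0.151
gain_wind     = S - 8/14 * H(6, 2) - 6/14 * H(3, 3)            # ≈ 0.048
gain_outlook  = S - 5/14 * H(2, 3) - 4/14 * H(4, 0) - 5/14 * H(3, 2)  # ≈ 0.246
gain_temp     = S - 4/14 * H(2, 2) - 6/14 * H(4, 2) - 4/14 * H(3, 1)  # ≈ 0.029
print(round(gain_humidity, 3), round(gain_wind, 3),
      round(gain_outlook, 3), round(gain_temp, 3))
```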
{D1, D2, ..., D14}: [9+,5−]
Outlook
├─ Sunny → {D1,D2,D8,D9,D11}: [2+,3−] → ?
├─ Overcast → {D3,D7,D12,D13}: [4+,0−] → Yes
└─ Rain → {D4,D5,D6,D10,D14}: [3+,2−] → ?

Which attribute should be tested at the “?” nodes?
Ssunny = {D1,D2,D8,D9,D11}
Gain(Ssunny, Humidity) = .970 − (3/5)·0.0 − (2/5)·0.0 = .970
Gain(Ssunny, Temperature) = .970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = .570
Gain(Ssunny, Wind) = .970 − (2/5)·1.0 − (3/5)·.918 = .019
Outlook
├─ Sunny → Humidity
│    ├─ High → No
│    └─ Normal → Yes
├─ Overcast → Yes
└─ Rain → Wind
     ├─ Strong → No
     └─ Weak → Yes
IF (Outlook = Sunny) ∧ (Humidity = High) THEN EnjoyTennis = No
IF (Outlook = Sunny) ∧ (Humidity = Normal) THEN EnjoyTennis = Yes
. . .
ID3’s inductive bias is a preference for short trees, and for trees that place attributes with high information gain near the root.
This is a preference (search) bias, i.e., an ordering over hypotheses; in contrast, there are learning algorithms (e.g. Candidate-Elimination, ch. 2) whose bias is a restriction of the hypothesis space H (i.e., a language bias).
Occam’s razor: prefer the shortest hypothesis that fits the data.
Argument in favor: there are fewer short hypotheses than long ones,
→ a short hypothesis that fits the data is unlikely to be a coincidence,
→ a long hypothesis that fits the data might be a coincidence.
Argument against: there are many ways to define small sets of hypotheses (e.g., all trees with a prime number of nodes that use attributes beginning with “Z”).
From “Data Mining: Practical Machine Learning Tools and Techniques”, Witten et al., 3rd ed., 2011, pp. 199-200.
Assumptions:
(A1): the depth of the ID3 tree is O(log m), i.e., it remains “bushy” and doesn’t degenerate into long, stringy branches;
(A2): [most] instances differ from each other;
(A2’): the d attributes provide enough tests to allow the instances to be differentiated.
Impurity measures used in decision tree learning (at a node n, over classes c1, ..., ck):

Entropy impurity: i(n) = −Σi=1..k P(ci) log2 P(ci)
Gini impurity: i(n) = 1 − Σi=1..k P²(ci)
Misclassification impurity: i(n) = 1 − maxk P(ck)

The drop of impurity produced by a (binary) split at node n:
Δi(n) = i(n) − P(nl) i(nl) − P(nr) i(nr),
where nl and nr are the left and right children of node n after splitting.

For a Bernoulli variable of parameter p:
Entropy(p) = −p log2 p − (1 − p) log2(1 − p)
Gini(p) = 1 − p² − (1 − p)² = 2p(1 − p)
MisClassif(p) = min(p, 1 − p), i.e., p if p ∈ [0, 1/2), and 1 − p if p ∈ [1/2, 1]
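A small sketch comparing the three impurity measures for a Bernoulli variable of parameter p (the function names are illustrative, not from the slides):

```python
import math

def entropy_impurity(p):
    """Entropy impurity of a Bernoulli(p) node, in bits."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def gini_impurity(p):
    """Gini impurity: 1 - p^2 - (1-p)^2 = 2p(1-p)."""
    return 2 * p * (1 - p)

def misclassification_impurity(p):
    """Misclassification impurity: min(p, 1-p)."""
    return min(p, 1 - p)

for p in (0.0, 0.1, 0.25, 0.5):
    print(p, round(entropy_impurity(p), 3), round(gini_impurity(p), 3),
          misclassification_impurity(p))
# All three vanish for pure nodes (p = 0 or 1) and peak at p = 0.5.
```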
Handling continuous-valued attributes: create a discrete attribute that tests the continuous value. For instance, for Temperature = 82.5, define (Temperature > 72.3) ∈ {t, f}.

Sort the examples according to the values of the continuous attribute, then identify adjacent examples that differ in their target classification; candidate thresholds lie between them. For EnjoyTennis:
Temperature:  40  48  60  72  80  90
EnjoyTennis:  No  No  Yes Yes Yes No
Candidate thresholds: Temperature > 54 and Temperature > 85.
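A short sketch of this threshold-candidate selection for the Temperature example (the function name is an assumption):

```python
def candidate_thresholds(values, labels):
    """Candidate split thresholds for a continuous attribute: midpoints between
    consecutive (sorted) values whose target labels differ."""
    pairs = sorted(zip(values, labels))
    return [(x1 + x2) / 2
            for (x1, y1), (x2, y2) in zip(pairs, pairs[1:]) if y1 != y2]

temperature = [40, 48, 60, 72, 80, 90]
enjoy       = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temperature, enjoy))   # [54.0, 85.0]
```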
Attributes with many values: the information gain measure favors attributes with many values (which split the data into many small, pure subsets). One solution is to use the gain ratio instead:
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A), where
SplitInformation(S, A) = −Σi=1..c (|Si|/|S|) log2 (|Si|/|S|),
and Si is the subset of S for which the c-valued attribute A has its i-th value.
Attributes with different costs: favor, other things being equal, attributes with low cost, e.g. by replacing the information gain with
Gain²(S, A) / Cost(A), or
(2^Gain(S, A) − 1) / (Cost(A) + 1)^w, where w ∈ [0, 1] controls the importance of the cost.
Question: What if an example is missing the value of an attribute A?
Answer: Use the training example anyway, sort it through the tree, and if node n tests A:
− assign to A the most common value among the other examples sorted to node n, or
− assign to A the most common value among the examples at node n that have the same target value, or
− assign a probability pi to each possible value vi of A, and assign the fraction pi of the example to each descendant in the tree (a sketch of this last strategy is given below).
Classify the test instances in the same fashion.
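A sketch of the fractional strategy, assuming the examples are represented as dictionaries (the function name and representation are assumptions, not from the slides):

```python
from collections import Counter

def fractional_split(examples, attr):
    """Distribute examples over the values of `attr` at a node: an example that is
    missing `attr` is sent down every branch with a fractional weight p_i,
    where p_i is estimated from the examples whose value of `attr` is known."""
    known = [ex[attr] for ex in examples if ex.get(attr) is not None]
    freq = Counter(known)
    branches = {v: [] for v in freq}
    for ex in examples:
        if ex.get(attr) is not None:
            branches[ex[attr]].append((ex, 1.0))          # full weight
        else:
            for v, c in freq.items():
                branches[v].append((ex, c / len(known)))  # fractional weight p_i
    return branches
```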
Consider adding the noisy training example #15: (Sunny, Hot, Normal, Strong, EnjoyTennis = No). What effect does it produce on the earlier tree?

[Figure: the EnjoyTennis decision tree learned earlier, with Outlook at the root and Humidity and Wind below.]
[Plot: accuracy on the training data and on the test data, as a function of the size of the tree (number of nodes, 10–100); accuracy shown in the range 0.5–0.9.]
[Plot: accuracy on the training data, on the test data, and on the test data during pruning, as a function of the size of the tree (number of nodes, 10–100).]
Note: A validation set (distinct from both the training and test sets) was used for pruning.
There exist several well-known meta-learning techniques that aggregate decision trees:
Boosting: when constructing a new tree, the data points that have been incorrectly predicted by the earlier trees are given extra weight, thus forcing the learner to concentrate successively on harder and harder cases. In the end, a weighted vote is taken for prediction.

Bagging: new trees do not depend on earlier trees; each tree is independently constructed using a bootstrap sample (i.e., sampling with replacement) of the data set. The final classification is done by simple majority voting.
[pseudo-code from Statistical Pattern Recognition, Andrew Webb, Keith Copsey, 2011]
Input:
{(xi, yi) | i = 1, . . . , n} — a set of labeled instances, with yi ∈ {−1, +1};
T ∈ N∗ — the number of boosting rounds.

Training:
initialization: wi = 1/n, for i = 1, . . . , n;
for t = 1, . . . , T
   train the classifier ηt on the given instances, with weights wi, i = 1, . . . , n;
   et ← Σi wi, where i indexes all instances misclassified by ηt;
   for i = 1, . . . , n
      if the instance xi is correctly classified by ηt, leave wi unchanged,
      else wi ← wi · (1/et − 1);
   renormalize the weights wi, i = 1, . . . , n, so that they sum to 1.

Prediction:
given a test instance x, and assuming that the classifiers ηt have two classes, −1 and +1, compute
η̂ = Σt=1..T ln(1/et − 1) · ηt(x);
assign x the label sign(η̂).
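A minimal Python sketch of this boosting scheme, assuming labels in {−1, +1} and a user-supplied, weight-aware weak learner; all names are illustrative, and the early stop for et ≥ 1/2 is an added safeguard, not part of the pseudo-code above:

```python
import math

def train_adaboost(data, weak_learner, T):
    """`data` is a list of (x, y) pairs with y in {-1, +1};
    `weak_learner(data, weights)` returns a classifier h with h(x) in {-1, +1}."""
    n = len(data)
    w = [1.0 / n] * n
    ensemble = []                              # list of (alpha_t, h_t) pairs
    for _ in range(T):
        h = weak_learner(data, w)
        err = sum(wi for wi, (x, y) in zip(w, data) if h(x) != y)
        if err == 0 or err >= 0.5:             # perfect or too weak: stop early
            break
        alpha = math.log(1 / err - 1)          # ln((1 - err) / err)
        # increase the weights of the misclassified instances, then renormalize
        w = [wi * (1 / err - 1) if h(x) != y else wi
             for wi, (x, y) in zip(w, data)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, h))
    return ensemble

def predict_adaboost(ensemble, x):
    """Weighted vote of the weak classifiers: sign(sum_t alpha_t * h_t(x))."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1
```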
[pseudo-code from Statistical Pattern Recognition, Andrew Webb, Keith Copsey, 2011]
Input:
{(xi, yi) | i = 1, . . . , n} — a set of labeled instances;
B ∈ N∗ — the number of samples/(sub)classifiers to be produced.

Training:
for b = 1, . . . , B
   draw a bootstrap sample of size n, by sampling with replacement from the training set;
   (Note: some instances will be replicated, others will be omitted.)
   train the classifier ηb using this bootstrap sample as training data.

Prediction:
given a test instance x, assign it the most common label in the set {ηb(x) | b = 1, . . . , B}.
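A corresponding Python sketch of bagging, assuming a `learner(sample)` function that trains one classifier on a list of (x, y) pairs (the names are illustrative):

```python
import random
from collections import Counter

def train_bagging(data, learner, B, seed=0):
    """Train B classifiers, each on a bootstrap sample (drawn with replacement)
    of the same size as the original training set."""
    rng = random.Random(seed)
    n = len(data)
    return [learner([data[rng.randrange(n)] for _ in range(n)])
            for _ in range(B)]

def predict_bagging(classifiers, x):
    """Simple majority vote over the B classifiers."""
    votes = Counter(h(x) for h in classifiers)
    return votes.most_common(1)[0][0]
```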
RF extends bagging with an additional layer of randomness: random feature selection. While in standard classification trees each node is split using the best split among all variables, in RF each node is split using the best among a subset of features randomly chosen at that node.
RF uses only two parameters:
− the number of variables in the random subset at each node;
− the number of trees in the forest.
This somewhat counter-intuitive strategy is robust against overfitting, and it compares well with other machine learning techniques (SVMs, neural networks, discriminant analysis, etc.).
[pseudo-code from Statistical Pattern Recognition, Andrew Webb, Keith Copsey, 2011]
Input:
{(xi, yi) | i = 1, . . . , n} — a set of labeled instances;
B ∈ N∗ — the number of samples to be produced / trees in the forest;
m — the number of features to be selected at each node.

Training:
for b = 1, . . . , B
   draw a bootstrap sample of size n, by sampling with replacement from the training set;
   (Note: some instances will be replicated, others will be omitted.)
   train the tree classifier ηb on this bootstrap sample, choosing at each node the “best” among m randomly selected attributes.

Computation of the out-of-bag error:
a training instance xi is misclassified by the RF if its label yi differs from zi, the most common label in the set {ηb′(xi) | b′ ∈ {1, . . . , B} such that xi does not belong to the bootstrap sample used to train ηb′}; the out-of-bag error is the proportion of such training instances.
Prediction:
given a test instance x, assign it the most common label in the set {ηb(x) | b = 1, . . . , B}.
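A self-contained Python sketch of this procedure for categorical attributes, combining bootstrap sampling with the per-node random selection of m attributes (m is assumed not to exceed the number of attributes; all names and the tuple-based tree representation are assumptions):

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def grow_tree(rows, labels, m, rng):
    """Grow one unpruned tree; at each node the split is chosen among
    m randomly selected attributes. Rows are tuples of categorical values."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:
        return labels[0]
    candidates = rng.sample(range(len(rows[0])), m)
    def gain(a):
        g = entropy(labels)
        for v in set(r[a] for r in rows):
            sub = [y for r, y in zip(rows, labels) if r[a] == v]
            g -= len(sub) / len(labels) * entropy(sub)
        return g
    best = max(candidates, key=gain)
    if gain(best) == 0:                  # none of the m candidates is useful here
        return majority
    branches = {}
    for v in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        branches[v] = grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], m, rng)
    return (best, branches, majority)

def predict_tree(tree, row):
    while isinstance(tree, tuple):
        attr, branches, default = tree
        tree = branches.get(row[attr], default)
    return tree

def random_forest(rows, labels, B, m, seed=0):
    """Train B trees, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(B):
        idx = [rng.randrange(len(rows)) for _ in range(len(rows))]
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], m, rng))
    return forest

def predict_forest(forest, row):
    """Majority vote over the trees in the forest."""
    return Counter(predict_tree(t, row) for t in forest).most_common(1)[0][0]
```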
Exemplifying: the application of the ID3 algorithm to continuous attributes; decision surfaces / decision boundaries; the computation of the CVLOO (leave-one-out cross-validation) error.
[CMU, 2002 fall, Andrew Moore, midterm, pr. 3]
[Figure: ten one-dimensional training instances plotted on the X axis: negative examples at X = 1, 2, 3, 4 and 8.5, positive examples at X = 6, 7, 8, 9, 10, together with the split thresholds 5, 8.25 and 8.75.]
ID3 tree (root: [5−,5+]):

X < 5?
├─ yes: [4−,0+] → No
└─ no: [1−,5+] → X < 8.25?
     ├─ yes: [0−,3+] → Yes
     └─ no: [1−,2+] → X < 8.75?
          ├─ yes: [1−,0+] → No
          └─ no: [0−,2+] → Yes
ID3: IG computations
Level 1: splitting the node [1−,5+] (the instances with X ≥ 5):
X < 8.75: yes → [1−,3+], no → [0−,2+]; IG = 0.109
X < 8.25: yes → [0−,3+], no → [1−,2+]; IG = 0.191
Level 0: candidate splits of the root node [5−,5+], and the corresponding decision “surfaces” (thresholds at 5, 8.25, 8.75):
X < 5: yes → [4−,0+], no → [1−,5+]
X < 8.25: yes → [4−,3+], no → [1−,2+]
X < 8.75: yes → [5−,3+], no → [0−,2+]
ID3, CVLOO: decision surfaces. CVLOO error: 3/10.

[Figures: the decision surfaces of the trees obtained when each training instance is left out in turn (X = 1, 2, 3; X = 4; X = 6; X = 7; X = 8; X = 8.5; X = 9; X = 10), with the corresponding split thresholds (among 4.5, 5, 7.75, 8.25, 8.75, 9.25).]
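The 3/10 result can be reproduced programmatically; below is a sketch of a 1-D ID3 learner (thresholds taken halfway between consecutive distinct values) together with the leave-one-out loop, using the data set reconstructed above (all names are illustrative):

```python
import math

def H(labels):
    """Entropy (bits) of a list of ±1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(1 for y in labels if y == +1) / n
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def grow(points):
    """Recursively grow an ID3 tree over 1-D points [(x, y), ...], y in {-1, +1}."""
    ys = [y for _, y in points]
    if len(set(ys)) == 1:
        return ys[0]                        # pure leaf
    xs = sorted(set(x for x, _ in points))
    thresholds = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    def gain(t):
        left  = [y for x, y in points if x < t]
        right = [y for x, y in points if x >= t]
        return H(ys) - len(left)/len(ys)*H(left) - len(right)/len(ys)*H(right)
    t = max(thresholds, key=gain)
    left  = [(x, y) for x, y in points if x < t]
    right = [(x, y) for x, y in points if x >= t]
    return (t, grow(left), grow(right))

def predict(tree, x):
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x < t else right
    return tree

data = [(1, -1), (2, -1), (3, -1), (4, -1), (6, +1),
        (7, +1), (8, +1), (8.5, -1), (9, +1), (10, +1)]

errors = sum(predict(grow(data[:i] + data[i+1:]), x) != y
             for i, (x, y) in enumerate(data))
print(errors, "/", len(data))   # expected: 3 / 10 leave-one-out errors
```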
DT2 (the tree restricted to a single split, i.e., two leaves):

X < 5?
├─ yes: [4−,0+] → No
└─ no: [1−,5+] → Yes

Decision “surface”: a single threshold, at X = 5 (root: [5−,5+]).
DT2, CVLOO IG computations
Case 1: the left-out instance is X = 1, 2, 3 or 4 (remaining data: [4−,5+]); candidate splits:
X < 5 (threshold 4.5 when X = 4 is left out): yes → [3−,0+], no → [1−,5+]
X < 8.25: yes → [3−,3+], no → [1−,2+]
X < 8.75: yes → [4−,3+], no → [0−,2+]

Case 2: the left-out instance is X = 6, 7 or 8 (remaining data: [5−,4+]); candidate splits:
X < 5 (threshold 5.5 when X = 6 is left out): yes → [4−,0+], no → [1−,4+]
X < 8.25 (threshold 7.75 when X = 8 is left out): yes → [4−,2+], no → [1−,2+]
X < 8.75: yes → [5−,2+], no → [0−,2+]
DT2, CVLOO IG computations (cont’d)
Case 3: the left-out instance is X = 8.5 (remaining data: [4−,5+]):
X < 5: yes → [4−,0+], no → [0−,5+] (a perfect split)

Case 4: the left-out instance is X = 9 or 10 (remaining data: [5−,4+]); candidate splits:
X < 5: yes → [4−,0+], no → [1−,4+]
X < 8.25: yes → [4−,3+], no → [1−,1+]
X < 8.75 (threshold 9.25 when X = 9 is left out): yes → [5−,3+], no → [0−,1+]

CVLOO error for DT2: 1/10.
CMU, 2010 fall, Ziv Bar-Joseph, HW2, pr. 2.1

Input: [Table: 12 training instances over the binary attributes X1, X2, X3, X4, with a binary Class label.]
The ID3 tree (root: [4−,8+], where − stands for Class = 0 and + for Class = 1):

X4?
├─ 1: [0−,6+] → 1
└─ 0: [4−,2+] → X1?
     ├─ 0: [3−,0+] → 0
     └─ 1: [1−,2+] → X2?
          ├─ 0: [0−,2+] → 1
          └─ 1: [1−,0+] → 0
Chi-squared (χ²) based pruning: while traversing the ID3 tree [usually in a bottom-up manner], remove the nodes for which there is not enough (“significant”) statistical evidence of a dependence between the values of the input attribute tested in that node and the values of the class attribute, as supported by the set of instances assigned to that node.
Observed counts:

O_X4 (N = 12):
              X4 = 0   X4 = 1
  Class = 0      4        0
  Class = 1      2        6
⇒ P(Class = 0) = 4/12 = 1/3, P(Class = 1) = 2/3;
  P(X4 = 0) = 6/12 = 1/2, P(X4 = 1) = 1/2

O_X1|X4=0 (N = 6):
              X1 = 0   X1 = 1
  Class = 0      3        1
  Class = 1      0        2
⇒ P(Class = 0 | X4 = 0) = 4/6 = 2/3, P(Class = 1 | X4 = 0) = 1/3;
  P(X1 = 0 | X4 = 0) = 3/6 = 1/2, P(X1 = 1 | X4 = 0) = 1/2

O_X2|X4=0,X1=1 (N = 3):
              X2 = 0   X2 = 1
  Class = 0      0        1
  Class = 1      2        0
⇒ P(Class = 0 | X4 = 0, X1 = 1) = 1/3, P(Class = 1 | X4 = 0, X1 = 1) = 2/3;
  P(X2 = 0 | X4 = 0, X1 = 1) = 2/3, P(X2 = 1 | X4 = 0, X1 = 1) = 1/3
For a contingency table O with r rows (the values of the Class attribute) and c columns (the values of the tested attribute X), the expected counts under the independence assumption are

E(i, j) = N · P(Class = i) · P(X = j) = (Σk=1..c O(i, k)) · (Σk=1..r O(k, j)) / N,

i.e., (row total × column total) / N.
Expected counts:

E_X4:
              X4 = 0   X4 = 1
  Class = 0      2        2
  Class = 1      4        4

E_X1|X4=0:
              X1 = 0   X1 = 1
  Class = 0      2        2
  Class = 1      1        1

E_X2|X4=0,X1=1:
              X2 = 0   X2 = 1
  Class = 0     2/3      1/3
  Class = 1     4/3      2/3

For example, E_X4(Class = 0, X4 = 0): N = 12, P(Class = 0) = 1/3 and P(X4 = 0) = 1/2, so N · P(Class = 0, X4 = 0) = N · P(Class = 0) · P(X4 = 0) = 12 · (1/3) · (1/2) = 2.
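A small sketch for computing such expected counts from an observed contingency table (the function name is an assumption):

```python
def expected_counts(observed):
    """Expected counts under independence: E[i][j] = (row_i total)(col_j total)/N."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    return [[r * c / n for c in col_totals] for r in row_totals]

O_X4 = [[4, 0], [2, 6]]          # rows: Class = 0 / 1; columns: X4 = 0 / 1
print(expected_counts(O_X4))     # [[2.0, 2.0], [4.0, 4.0]]
```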
The χ² statistic, χ² = Σi,j (O(i, j) − E(i, j))² / E(i, j), for the three tests:

χ²_X4 = (4 − 2)²/2 + (0 − 2)²/2 + (2 − 4)²/4 + (6 − 4)²/4 = 2 + 2 + 1 + 1 = 6
χ²_X1|X4=0 = (3 − 2)²/2 + (1 − 2)²/2 + (0 − 1)²/1 + (2 − 1)²/1 = 3
χ²_X2|X4=0,X1=1 = (0 − 2/3)²/(2/3) + (1 − 1/3)²/(1/3) + (2 − 4/3)²/(4/3) + (0 − 2/3)²/(2/3) = (4/9) · (27/4) = 3

p-values (each table has 1 degree of freedom): 0.0143, 0.0833, and 0.0833, respectively.
The first of these p-values is smaller than ε, therefore the root node (X4) cannot be pruned.
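A short sketch that reproduces these χ² values and p-values; for 2×2 tables there is 1 degree of freedom, so the upper-tail probability can be obtained from the complementary error function (the function names are illustrative):

```python
import math

def chi2_stat(observed, expected):
    """Chi-squared statistic for matching lists of observed/expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_1df(x):
    """Upper-tail p-value of the chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

# Contingency tables from above, flattened row by row (Class = 0 row, then Class = 1 row).
tests = {
    "X4":              ([4, 0, 2, 6], [2, 2, 4, 4]),
    "X1 | X4=0":       ([3, 1, 0, 2], [2, 2, 1, 1]),
    "X2 | X4=0, X1=1": ([0, 1, 2, 0], [2/3, 1/3, 4/3, 2/3]),
}
for name, (obs, exp) in tests.items():
    x = chi2_stat(obs, exp)
    print(f"{name}: chi2 = {x:.3f}, p-value = {p_value_1df(x):.4f}")
# Expected: chi2 values 6, 3, 3 and p-values ≈ 0.0143, 0.0833, 0.0833
```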
The tree obtained after pruning (the X1 and X2 nodes are removed, since their p-values exceed ε):

X4?
├─ 0: → 0
└─ 1: → 1