SLIDE 1

Decision Trees


Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2020f/

Many slides attributable to: Erik Sudderth (UCI) Finale Doshi-Velez (Harvard) James, Witten, Hastie, Tibshirani (ISL/ESL books)

  • Prof. Mike Hughes
SLIDE 2

Objectives for day 14: Decision Trees

  • Decision Tree Regression
    • How to predict
    • How to train
      • Greedy recursive algorithm
      • Possible cost functions
  • Decision Tree Classification
    • Possible cost functions
  • Comparisons to other methods

SLIDE 3

What will we learn?


[Diagram: the three paradigms of machine learning: Supervised Learning, Unsupervised Learning, Reinforcement Learning. Supervised learning starts from data-label pairs $\{x_n, y_n\}_{n=1}^{N}$, plus a task and a performance measure, and proceeds through three stages: training, prediction, evaluation.]

SLIDE 4


Task: Regression

[Diagram: regression as a supervised learning task.] Given input features x, predict an output y, where y is a numeric variable (e.g., sales in $$).

SLIDE 5

Salary prediction for Hitters data


SLIDE 6


Salary Prediction by “Region”

Divide x space into regions, predict constant within each region

Regions are rectangular (“axis aligned”)

SLIDE 7

From Regions to Decision Tree


SLIDE 8


Decision Tree Regression

Parameters:

  • tree architecture (list of nodes, list of parent-child pairs)
  • at each internal node: x variable id and threshold value
  • at each leaf: scalar y value to predict

Hyperparameters:

  • max_depth, min_samples_split

Prediction procedure:

  • Determine which leaf (region) the input features belong to
  • Guess the y value associated with that leaf

Training procedure:

  • minimize error on training set
  • use greedy heuristics (build from root to leaf)

Classification and Regression Trees, Breiman et al. (1984)
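As a concrete complement (not from the slides), here is a minimal sketch of the same model via sklearn's DecisionTreeRegressor, which exposes the max_depth and min_samples_split hyperparameters named above; the toy data is invented:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: N=100 examples, F=2 features, roughly piecewise-constant target
rng = np.random.RandomState(0)
x_NF = rng.rand(100, 2)
y_N = 3.0 * (x_NF[:, 0] > 0.5) + 0.1 * x_NF[:, 1]

tree = DecisionTreeRegressor(max_depth=3, min_samples_split=10)
tree.fit(x_NF, y_N)
yhat_N = tree.predict(x_NF)  # each prediction is the constant stored at a leaf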

SLIDE 9

Ideal Training for Decision Tree


$$\min_{R_1, \ldots, R_J} \; \sum_{j=1}^{J} \sum_{n :\, x_n \in R_j} \left( y_n - \hat{y}_{R_j} \right)^2$$

The search space is too big (there are far too many possible sets of regions), so this is hard to solve exactly… let's break it down into subproblems.
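To make the objective concrete, here is a minimal sketch (function and variable names are mine, not the slides') that evaluates this cost for one fixed partition of the training examples into regions:

import numpy as np

def tree_cost(y_N, region_N):
    ''' Sum of squared errors when each region predicts its own mean.
    region_N : integer region id for each of the N training examples. '''
    total = 0.0
    for j in np.unique(region_N):
        y_Rj = y_N[region_N == j]
        total += np.sum((y_Rj - y_Rj.mean()) ** 2)
    return total

print(tree_cost(np.array([1.0, 2.0, 10.0, 11.0]), np.array([0, 0, 1, 1])))  # 1.0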

SLIDE 10

Key subproblem: within one region, how do we find the best binary split?


Given a big region R, find the best possible binary split into two subregions (best means minimize mean squared error)

$$\min_{j,\, s,\, \hat{y}_{R_1},\, \hat{y}_{R_2}} \;\; \sum_{n :\, x_{nj} \le s} \left( y_n - \hat{y}_{R_1} \right)^2 + \sum_{n :\, x_{nj} > s} \left( y_n - \hat{y}_{R_2} \right)^2$$

Let binary_split denote this procedure. We can solve this subproblem efficiently!

for each feature index j in 1, 2, ..., F:
    find its best possible cut point s[j] and its cost cost[j]
j <- argmin(cost[1], ..., cost[F])
return best feature index j and its cut point s[j]
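A minimal runnable sketch of binary_split under the mean squared error cost; the helper name matches the slide's pseudocode, but the midpoint-between-unique-values scan and the None return for impossible splits are my assumptions:

import numpy as np

def binary_split(x_NF, y_N):
    ''' Find feature j and threshold s minimizing total squared error.
    Returns None if no split is possible (all feature values identical). '''
    best = None  # tuple (cost, j, s)
    for j in range(x_NF.shape[1]):
        vals = np.unique(x_NF[:, j])
        # Candidate cut points: midpoints between adjacent unique values
        for s in (vals[:-1] + vals[1:]) / 2.0:
            in_left = x_NF[:, j] <= s
            y_L, y_R = y_N[in_left], y_N[~in_left]
            # Best constant guess in each region is its mean (MSE cost)
            cost = np.sum((y_L - y_L.mean()) ** 2) + np.sum((y_R - y_R.mean()) ** 2)
            if best is None or cost < best[0]:
                best = (cost, j, s)
    if best is None:
        return None
    _, j, s = best
    in_left = x_NF[:, j] <= s
    return j, s, x_NF[in_left], x_NF[~in_left], y_N[in_left], y_N[~in_left]

As written the scan costs O(F N^2); sorting each feature once and updating the two regions' sufficient statistics incrementally brings it down to roughly O(F N log N).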

SLIDE 11

Greedy top-down training

def train_tree_greedy(x_NF, y_N, d):
    ''' Recursively grow a tree; returns the root node of a subtree. '''
    N = y_N.shape[0]
    if d >= MAX_DEPTH:
        return LeafNode(x_NF, y_N)
    elif N < MIN_INTERNAL_NODE_SIZE:
        return LeafNode(x_NF, y_N)
    else:
        # j : integer index indicating feature to split
        # s : real value used as threshold to split
        # L / R : number of examples in left / right region
        result = binary_split(x_NF, y_N)
        if result is None:  # no split possible
            return LeafNode(x_NF, y_N)
        j, s, x_LF, x_RF, y_L, y_R = result
        left_child = train_tree_greedy(x_LF, y_L, d + 1)
        right_child = train_tree_greedy(x_RF, y_R, d + 1)
        return InternalNode(x_NF, y_N, j, s, left_child, right_child)

Hyperparameters controlling complexity:

  • MAX_DEPTH
  • MIN_INTERNAL_NODE_SIZE
  • MIN_LEAF_NODE_SIZE

Training is a recursive process; it returns a tree (by reference to its root node).
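For completeness, here is a minimal sketch of the node classes and the leaf-lookup prediction described earlier; the slides do not define LeafNode or InternalNode, so these bodies are assumptions:

class LeafNode:
    def __init__(self, x_NF, y_N):
        self.yhat = y_N.mean()  # best constant guess under squared error

class InternalNode:
    def __init__(self, x_NF, y_N, j, s, left_child, right_child):
        self.j, self.s = j, s
        self.left_child, self.right_child = left_child, right_child

def predict(node, x_F):
    ''' Descend from root to a leaf, then return that leaf's stored guess. '''
    while isinstance(node, InternalNode):
        node = node.left_child if x_F[node.j] <= node.s else node.right_child
    return node.yhat

# Example: grow a tree, then predict for one feature vector x_F
# root = train_tree_greedy(x_NF, y_N, 0); yhat = predict(root, x_F)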

SLIDE 12

Greedy Tree for Hitters Data


SLIDE 13

Cost functions for regression trees

  • Mean squared error
    • Assumed on previous slides; very common
    • How to solve for the region's best guess?
  • Mean absolute error
    • Possible! (supported in sklearn)
    • How to solve for the region's best guess?


For mean squared error, the optimal solution is to guess the mean output value of the region:

$$\hat{y}_R = \frac{1}{|R|} \sum_{i :\, x_i \in R} y_i$$

For mean absolute error, the optimal solution is to guess the median output value of the region:

$$\hat{y}_R = \operatorname{median}\left( \{ y_i : x_i \in R \} \right)$$
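A quick numerical check on invented data that the mean minimizes squared error while the median minimizes absolute error:

import numpy as np

y_R = np.array([1.0, 2.0, 3.0, 8.0, 10.0])  # outputs in one region
guesses = np.linspace(0.0, 12.0, 1201)      # candidate constant guesses

sq_err = [np.sum((y_R - g) ** 2) for g in guesses]
abs_err = [np.sum(np.abs(y_R - g)) for g in guesses]

print(guesses[np.argmin(sq_err)], y_R.mean())       # 4.8 and 4.8
print(guesses[np.argmin(abs_err)], np.median(y_R))  # 3.0 and 3.0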
SLIDE 14


Task: Binary Classification

[Diagram: binary classification as a supervised learning task.] Given input features (here x1 and x2), predict a label y, where y is a binary variable (e.g., red or blue).

SLIDE 15

Decision Tree Classifier


Goal: does the patient have heart disease? Leaves make binary predictions!

SLIDE 16

Decision Tree Probabilistic Classifier


Goal: with what probability does the patient have heart disease? Leaves count the samples in each class, then return fractions!

[Figure: a tree whose leaves hold mixes of + and - training examples, yielding predicted probabilities 0.5, 0.667, 0.25, and 0.80.]
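A minimal sketch (mine, not the slides') of how one leaf turns class counts into a predicted probability:

import numpy as np

def leaf_proba(y_N):
    ''' Fraction of positive (label 1) examples among those in the leaf. '''
    return np.mean(y_N == 1)

print(leaf_proba(np.array([1, 0, 1, 0])))        # 0.5
print(leaf_proba(np.array([1, 1, 0, 1, 0, 1])))  # 0.667 (4 of 6)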

SLIDE 17


Decision Tree Classifier

Parameters:

  • tree architecture (list of nodes, list of parent-child pairs)
  • at each internal node: x variable id and threshold
  • at each leaf: number of examples in each class

Hyperparameters:

  • max_depth, min_samples_split

Prediction:

  • identify rectangular region for input x
  • predict: most common label value in region
  • predict_proba: fraction of each label in region

Training:

  • minimize cost on training set
  • greedy construction from root to leaf

Classification and Regression Trees, Breiman et al. (1984)
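As with regression, sklearn implements this model directly; a minimal sketch on invented toy data, where predict returns the most common label in an example's leaf and predict_proba returns the leaf's class fractions:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
x_NF = rng.rand(200, 2)
y_N = (x_NF[:, 0] + x_NF[:, 1] > 1.0).astype(int)  # toy binary labels

clf = DecisionTreeClassifier(max_depth=2, min_samples_split=20)
clf.fit(x_NF, y_N)
print(clf.predict(x_NF[:3]))        # most common label in each example's leaf
print(clf.predict_proba(x_NF[:3]))  # fraction of each label in that leaf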

SLIDE 18

Cost functions for classification trees

Cost for a region with N examples spanning C classes. Two common choices:

  • Information gain, i.e. minimizing "entropy":

$$\mathrm{cost}(x_{1:N}, y_{1:N}) = - \sum_{c=1}^{C} p_c \log p_c, \qquad p_c = \frac{1}{N} \sum_{n} \delta_c(y_n)$$

  • Gini impurity:

$$\mathrm{cost}(x_{1:N}, y_{1:N}) = \sum_{c=1}^{C} p_c (1 - p_c)$$

Here $p_c$ is the fraction of the region's examples with label $c$ ($\delta_c(y_n)$ is 1 if $y_n = c$ and 0 otherwise).
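Both costs are simple functions of the class fractions; a minimal sketch, assuming integer labels 0 through C-1:

import numpy as np

def class_fractions(y_N, C):
    ''' p_c for c = 0..C-1: fraction of examples with each label. '''
    return np.bincount(y_N, minlength=C) / y_N.shape[0]

def entropy_cost(y_N, C):
    p = class_fractions(y_N, C)
    p = p[p > 0]  # treat 0 log 0 as 0
    return -np.sum(p * np.log(p))

def gini_cost(y_N, C):
    p = class_fractions(y_N, C)
    return np.sum(p * (1.0 - p))

y_N = np.array([0, 0, 1, 1, 1])
print(entropy_cost(y_N, 2))  # about 0.673
print(gini_cost(y_N, 2))     # 0.48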
SLIDE 19

Advantages of Decision Trees

+ Can handle heterogeneous datasets (some features numerical, some categorical) easily, without requiring the standardized feature scales that penalized linear models do
+ Flexible non-linear decision boundaries
+ Relatively few hyperparameters to select

SLIDE 20

Limitations of Decision Trees

The axis-aligned assumption is not always a good idea: a decision boundary that runs diagonally in feature space must be approximated by many stair-step splits, where a single rotated linear boundary would suffice.

SLIDE 21

Summary of Classifiers


Logistic Regression
  • Knobs to tune: L2/L1 penalty on weights
  • Function class flexibility: linear
  • Interpret? Inspect weights

MLPClassifier
  • Knobs to tune: L2/L1 penalty on weights; num layers, num units; activation functions; GD method (SGD or LBFGS or …); step size, batch size
  • Function class flexibility: universal (with enough units)
  • Interpret? Challenging

K Nearest Neighbors Classifier
  • Knobs to tune: number of neighbors; distance metric
  • Function class flexibility: piecewise constant
  • Interpret? Inspect neighbors

Decision Tree Classifier
  • Knobs to tune: max depth; min leaf size
  • Function class flexibility: axis-aligned piecewise constant
  • Interpret? Inspect tree