SLIDE 1

Decision Trees


Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2020f/

Many slides attributable to: Erik Sudderth (UCI) Finale Doshi-Velez (Harvard) James, Witten, Hastie, Tibshirani (ISL/ESL books)

  • Prof. Mike Hughes
SLIDE 2

Objectives for day 14: Decision Trees

  • Decision Tree Regression
    • How to predict
    • How to train
      • Greedy recursive algorithm
      • Possible cost functions
  • Decision Tree Classification
    • Possible cost functions
  • Comparisons to other methods

SLIDE 3

What will we learn?


[Diagram: the three paradigms of machine learning: Supervised Learning, Unsupervised Learning, Reinforcement Learning. Supervised learning starts from data-label pairs $\{x_n, y_n\}_{n=1}^{N}$, plus a task and a performance measure, and proceeds through three stages: training, prediction, evaluation.]

SLIDE 4


Task: Regression

[Diagram: regression as a supervised learning task.] Given input features x, predict an output y, where y is a numeric variable (e.g., sales in $$).

SLIDE 5

Salary prediction for Hitters data


SLIDE 6


Salary Prediction by “Region”

Divide x space into regions, predict constant within each region

Regions are rectangular (“axis aligned”)

SLIDE 7

From Regions to Decision Tree


SLIDE 8


Decision Tree Regression

Parameters:

  • tree architecture (list of nodes, list of parent-child pairs)
  • at each internal node: x variable id and threshold value
  • at each leaf: scalar y value to predict

Hyperparameters:

  • max_depth, min_samples_split

Prediction procedure:

  • Determine which leaf (region) the input features belong to
  • Guess the y value associated with that leaf

Training procedure:

  • minimize error on training set
  • use greedy heuristics (build from root to leaf)

Classification and Regression Trees, Breiman et al. (1984)
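As a concrete complement (not from the slides), here is a minimal sketch of the same model via sklearn's DecisionTreeRegressor, which exposes the max_depth and min_samples_split hyperparameters named above; the toy data is invented:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: N=100 examples, F=2 features, roughly piecewise-constant target
rng = np.random.RandomState(0)
x_NF = rng.rand(100, 2)
y_N = 3.0 * (x_NF[:, 0] > 0.5) + 0.1 * x_NF[:, 1]

tree = DecisionTreeRegressor(max_depth=3, min_samples_split=10)
tree.fit(x_NF, y_N)
yhat_N = tree.predict(x_NF)  # each prediction is the constant stored at a leaf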

SLIDE 9

Ideal Training for Decision Tree


$$\min_{R_1, \ldots, R_J} \; \sum_{j=1}^{J} \sum_{n :\, x_n \in R_j} \left( y_n - \hat{y}_{R_j} \right)^2$$

The search space is too big (there are far too many possible sets of regions), so this is hard to solve exactly… let's break it down into subproblems.
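To make the objective concrete, here is a minimal sketch (function and variable names are mine, not the slides') that evaluates this cost for one fixed partition of the training examples into regions:

import numpy as np

def tree_cost(y_N, region_N):
    ''' Sum of squared errors when each region predicts its own mean.
    region_N : integer region id for each of the N training examples. '''
    total = 0.0
    for j in np.unique(region_N):
        y_Rj = y_N[region_N == j]
        total += np.sum((y_Rj - y_Rj.mean()) ** 2)
    return total

print(tree_cost(np.array([1.0, 2.0, 10.0, 11.0]), np.array([0, 0, 1, 1])))  # 1.0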

SLIDE 10

Key subproblem: within one region, how do we find the best binary split?


Given a big region R, find the best possible binary split into two subregions (best means minimize mean squared error)

$$\min_{j,\, s,\, \hat{y}_{R_1},\, \hat{y}_{R_2}} \;\; \sum_{n :\, x_{nj} \le s} \left( y_n - \hat{y}_{R_1} \right)^2 + \sum_{n :\, x_{nj} > s} \left( y_n - \hat{y}_{R_2} \right)^2$$

Let binary_split denote this procedure. We can solve this subproblem efficiently!

for each feature index j in 1, 2, ..., F:
    find its best possible cut point s[j] and its cost cost[j]
j <- argmin(cost[1], ..., cost[F])
return best feature index j and its cut point s[j]
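A minimal runnable sketch of binary_split under the mean squared error cost; the helper name matches the slide's pseudocode, but the midpoint-between-unique-values scan and the None return for impossible splits are my assumptions:

import numpy as np

def binary_split(x_NF, y_N):
    ''' Find feature j and threshold s minimizing total squared error.
    Returns None if no split is possible (all feature values identical). '''
    best = None  # tuple (cost, j, s)
    for j in range(x_NF.shape[1]):
        vals = np.unique(x_NF[:, j])
        # Candidate cut points: midpoints between adjacent unique values
        for s in (vals[:-1] + vals[1:]) / 2.0:
            in_left = x_NF[:, j] <= s
            y_L, y_R = y_N[in_left], y_N[~in_left]
            # Best constant guess in each region is its mean (MSE cost)
            cost = np.sum((y_L - y_L.mean()) ** 2) + np.sum((y_R - y_R.mean()) ** 2)
            if best is None or cost < best[0]:
                best = (cost, j, s)
    if best is None:
        return None
    _, j, s = best
    in_left = x_NF[:, j] <= s
    return j, s, x_NF[in_left], x_NF[~in_left], y_N[in_left], y_N[~in_left]

As written the scan costs O(F N^2); sorting each feature once and updating the two regions' sufficient statistics incrementally brings it down to roughly O(F N log N).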

SLIDE 11

Greedy top-down training

def train_tree_greedy(x_NF, y_N, d):
    ''' Recursively grow a tree; returns the root node of a subtree. '''
    N = y_N.shape[0]
    if d >= MAX_DEPTH:
        return LeafNode(x_NF, y_N)
    elif N < MIN_INTERNAL_NODE_SIZE:
        return LeafNode(x_NF, y_N)
    else:
        # j : integer index indicating feature to split
        # s : real value used as threshold to split
        # L / R : number of examples in left / right region
        result = binary_split(x_NF, y_N)
        if result is None:  # no split possible
            return LeafNode(x_NF, y_N)
        j, s, x_LF, x_RF, y_L, y_R = result
        left_child = train_tree_greedy(x_LF, y_L, d + 1)
        right_child = train_tree_greedy(x_RF, y_R, d + 1)
        return InternalNode(x_NF, y_N, j, s, left_child, right_child)

Hyperparameters controlling complexity:

  • MAX_DEPTH
  • MIN_INTERNAL_NODE_SIZE
  • MIN_LEAF_NODE_SIZE

Training is a recursive process; it returns a tree (by reference to its root node).
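For completeness, here is a minimal sketch of the node classes and the leaf-lookup prediction described earlier; the slides do not define LeafNode or InternalNode, so these bodies are assumptions:

class LeafNode:
    def __init__(self, x_NF, y_N):
        self.yhat = y_N.mean()  # best constant guess under squared error

class InternalNode:
    def __init__(self, x_NF, y_N, j, s, left_child, right_child):
        self.j, self.s = j, s
        self.left_child, self.right_child = left_child, right_child

def predict(node, x_F):
    ''' Descend from root to a leaf, then return that leaf's stored guess. '''
    while isinstance(node, InternalNode):
        node = node.left_child if x_F[node.j] <= node.s else node.right_child
    return node.yhat

# Example: grow a tree, then predict for one feature vector x_F
# root = train_tree_greedy(x_NF, y_N, 0); yhat = predict(root, x_F)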

SLIDE 12

Greedy Tree for Hitters Data


SLIDE 13

Cost functions for regression trees

  • Mean squared error
    • Assumed on previous slides; very common
    • How to solve for the region's best guess?
  • Mean absolute error
    • Possible! (supported in sklearn)
    • How to solve for the region's best guess?


For mean squared error, the optimal solution is to guess the mean output value of the region:

$$\hat{y}_R = \frac{1}{|R|} \sum_{i :\, x_i \in R} y_i$$

For mean absolute error, the optimal solution is to guess the median output value of the region:

$$\hat{y}_R = \operatorname{median}\left( \{ y_i : x_i \in R \} \right)$$
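A quick numerical check on invented data that the mean minimizes squared error while the median minimizes absolute error:

import numpy as np

y_R = np.array([1.0, 2.0, 3.0, 8.0, 10.0])  # outputs in one region
guesses = np.linspace(0.0, 12.0, 1201)      # candidate constant guesses

sq_err = [np.sum((y_R - g) ** 2) for g in guesses]
abs_err = [np.sum(np.abs(y_R - g)) for g in guesses]

print(guesses[np.argmin(sq_err)], y_R.mean())       # 4.8 and 4.8
print(guesses[np.argmin(abs_err)], np.median(y_R))  # 3.0 and 3.0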
SLIDE 14


Task: Binary Classification

[Diagram: binary classification as a supervised learning task.] Given input features (here x1 and x2), predict a label y, where y is a binary variable (e.g., red or blue).

SLIDE 15

Decision Tree Classifier


Goal: does the patient have heart disease? Leaves make binary predictions!

SLIDE 16

Decision Tree Probabilistic Classifier


Goal: with what probability does the patient have heart disease? Leaves count the samples in each class, then return fractions!

[Figure: a tree whose leaves hold mixes of + and - training examples, yielding predicted probabilities 0.5, 0.667, 0.25, and 0.80.]
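A minimal sketch (mine, not the slides') of how one leaf turns class counts into a predicted probability:

import numpy as np

def leaf_proba(y_N):
    ''' Fraction of positive (label 1) examples among those in the leaf. '''
    return np.mean(y_N == 1)

print(leaf_proba(np.array([1, 0, 1, 0])))        # 0.5
print(leaf_proba(np.array([1, 1, 0, 1, 0, 1])))  # 0.667 (4 of 6)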

SLIDE 17


Decision Tree Classifier

Parameters:

  • tree architecture (list of nodes, list of parent-child pairs)
  • at each internal node: x variable id and threshold
  • at each leaf: number of examples in each class

Hyperparameters:

  • max_depth, min_samples_split

Prediction:

  • identify rectangular region for input x
  • predict: most common label value in region
  • predict_proba: fraction of each label in region

Training:

  • minimize cost on training set
  • greedy construction from root to leaf

Classification and Regression Trees, Breiman et al. (1984)
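As with regression, sklearn implements this model directly; a minimal sketch on invented toy data, where predict returns the most common label in an example's leaf and predict_proba returns the leaf's class fractions:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
x_NF = rng.rand(200, 2)
y_N = (x_NF[:, 0] + x_NF[:, 1] > 1.0).astype(int)  # toy binary labels

clf = DecisionTreeClassifier(max_depth=2, min_samples_split=20)
clf.fit(x_NF, y_N)
print(clf.predict(x_NF[:3]))        # most common label in each example's leaf
print(clf.predict_proba(x_NF[:3]))  # fraction of each label in that leaf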

SLIDE 18

Cost functions for classification trees

Cost for a region with N examples spanning C classes. Two common choices:

  • Information gain, i.e. minimizing "entropy":

$$\mathrm{cost}(x_{1:N}, y_{1:N}) = - \sum_{c=1}^{C} p_c \log p_c, \qquad p_c = \frac{1}{N} \sum_{n} \delta_c(y_n)$$

  • Gini impurity:

$$\mathrm{cost}(x_{1:N}, y_{1:N}) = \sum_{c=1}^{C} p_c (1 - p_c)$$

Here $p_c$ is the fraction of the region's examples with label $c$ ($\delta_c(y_n)$ is 1 if $y_n = c$ and 0 otherwise).
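Both costs are simple functions of the class fractions; a minimal sketch, assuming integer labels 0 through C-1:

import numpy as np

def class_fractions(y_N, C):
    ''' p_c for c = 0..C-1: fraction of examples with each label. '''
    return np.bincount(y_N, minlength=C) / y_N.shape[0]

def entropy_cost(y_N, C):
    p = class_fractions(y_N, C)
    p = p[p > 0]  # treat 0 log 0 as 0
    return -np.sum(p * np.log(p))

def gini_cost(y_N, C):
    p = class_fractions(y_N, C)
    return np.sum(p * (1.0 - p))

y_N = np.array([0, 0, 1, 1, 1])
print(entropy_cost(y_N, 2))  # about 0.673
print(gini_cost(y_N, 2))     # 0.48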
SLIDE 19

Advantages of Decision Trees

+ Can handle heterogeneous datasets (some features numerical, some categorical) easily, without requiring the standardized feature scales that penalized linear models do
+ Flexible non-linear decision boundaries
+ Relatively few hyperparameters to select

SLIDE 20

Limitations of Decision Trees

The axis-aligned assumption is not always a good idea: a decision boundary that runs diagonally in feature space must be approximated by many stair-step splits, where a single rotated linear boundary would suffice.

SLIDE 21

Summary of Classifiers


Logistic Regression
  • Knobs to tune: L2/L1 penalty on weights
  • Function class flexibility: linear
  • Interpret? Inspect weights

MLPClassifier
  • Knobs to tune: L2/L1 penalty on weights; num layers, num units; activation functions; GD method (SGD or LBFGS or …); step size, batch size
  • Function class flexibility: universal (with enough units)
  • Interpret? Challenging

K Nearest Neighbors Classifier
  • Knobs to tune: number of neighbors; distance metric
  • Function class flexibility: piecewise constant
  • Interpret? Inspect neighbors

Decision Tree Classifier
  • Knobs to tune: max depth; min leaf size
  • Function class flexibility: axis-aligned piecewise constant
  • Interpret? Inspect tree