Advanced Analytics in Business [D0S07a]
Big Data Platforms & Technologies [D0S06a]
Supervised Learning

Overview
Regression
Logistic regression
K-NN
Decision and regression trees

The analytics process
Recall
Supervised learning
You have a labelled data set at your disposal Correlate features to target Common case: predict the future based on patterns observed now (predictive) Classification (categorical) versus regression (continuous)
Unsupervised learning
Describe patterns in data Clustering, association rules, sequence rules No labelling required Common case: descriptive, explanatory
For supervised learning, our data set will contain a label
Most classification use cases use a binary categorical variable
Churn prediction: churn yes/no Credit scoring: default yes/no Fraud detection: suspicious yes/no Response modeling: customer buys yes/no Predictive maintenance: needs check yes/no
Regression: continuous label
Classification: categorical label
For classification:
Binary classification (positive/negative outcome)
Multiclass classification (more than two possible outcomes)
Ordinal classification (target is ordinal)
Multilabel classification (multiple outcomes are possible)
For regression:
Absolute values Delta values Quantile regression
Recall
Single versus multi-output models are possible as well (definitions in literature and documentation can differ a bit)
Defining your target
Recommender system: a form of multi-class? Multi-label? Survival analysis: instead of yes/no predict the “time until yes occurs”
Oftentimes, different approaches are possible
Regression, quantile regression, mean residuals regression? Or: predicting the absolute value or the change? Or: convert manually to a number of bins and perform classification? Or: reduce the groups to two outcomes? Or: sequential binary classification (“classifier chaining”)? Or: perform segmentation first and build a model per segment?
Regression
Regression
https://xkcd.com/605/
Linear regression
Not much new here…

y = β0 + β1x1 + β2x2 + … + ϵ, with ϵ ∼ N(0, σ)

Price = 100000 + 100000 × number of bedrooms

β0: mean response when x = 0 (y-intercept)
β1: change in mean response when x1 increases by one unit

How to determine the parameters?

argmin_β ∑i=1..n (yi − ŷi)²: minimize the sum of squared errors (SSE), with standard error σ = √(SSE/n)

OLS: “Ordinary Least Squares”. Why SSE though?
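As an illustration, a minimal sketch of fitting an OLS regression with scikit-learn; the bedroom/price numbers below are made up for demonstration only:

```python
# Illustrative sketch: OLS on a toy "price vs. number of bedrooms" example.
import numpy as np
from sklearn.linear_model import LinearRegression

bedrooms = np.array([[1], [2], [3], [4], [5]])                    # n x 1 feature matrix
price = np.array([210_000, 290_000, 410_000, 500_000, 610_000])   # made-up target values

ols = LinearRegression().fit(bedrooms, price)
sse = np.sum((price - ols.predict(bedrooms)) ** 2)                # sum of squared errors
sigma = np.sqrt(sse / len(price))                                 # standard error as on the slide
print(ols.intercept_, ols.coef_[0], sigma)
```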
Logistic Regression
Logistic regression
Classification is solved as well?
Customer  Age  Income  Gender  …  Response
John      30   1200    M          No → 0
Sophie    24   2200    F          No → 0
Sarah     53   1400    F          Yes → 1
David     48   1900    M          No → 0
Seppe     35   800     M          Yes → 1
But no guarantee that output is 0 or 1 Okay fine, a probability then, but no guarantee that outcome is between 0 and 1 either Target and errors also not normally distributed (assumption of OLS violated)
ŷ = β0 + β1·age + β2·income + β3·gender
Logistic regression
We use a bounding function to limit the outcome between 0 and 1:

f(z) = 1 / (1 + e^−z)   (logistic, sigmoid)

Same basic formula, but now with the goal of binary classification:
Two possible outcomes: either 0 or 1, no or yes – a categorical, binary label, not continuous
Logistic regression is thus a technique for classification rather than regression
Though the predictions are still continuous: within [0, 1]
Logistic regression
Linear regression with a transformation such that the output is always between 0 and 1, and can thus be interpreted as a probability (e.g. response or churn probability):

P(response = yes | age, income, gender) = 1 − P(response = no | age, income, gender)
= 1 / (1 + e^−(β0 + β1·age + β2·income + β3·gender))

Or (“logit” – natural logarithm of the odds):

ln( P(response = yes | age, income, gender) / P(response = no | age, income, gender) ) = β0 + β1·age + β2·income + β3·gender
Our first predictive model: a formula. Not very spectacular, but note:
Easy to understand Easy to construct Easy to implement
In some settings, the end result will be a logistic model “extracted” from more complex approaches
Customer  Age  Income  Gender  …  Response
John      30   1200    M          No → 0
Sophie    24   2200    F          No → 0
Sarah     53   1400    F          Yes → 1
David     48   1900    M          No → 0
Seppe     35   800     M          Yes → 1

↓ ↓

Customer  Age  Income  Gender  …  Response  Score
Will      44   1500    M                    0.76
Emma      28   1000    F                    0.44
Logistic regression
1 / (1 + e^−(0.10 + 0.22·age + 0.05·income − 0.80·gender))
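As an aside, a hedged sketch of how such a model could be fit with scikit-learn; the toy data mirrors the table above and the column names are illustrative, not the course data set:

```python
# Minimal sketch: logistic regression on a toy customer table (illustrative data only).
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "age":      [30, 24, 53, 48, 35],
    "income":   [1200, 2200, 1400, 1900, 800],
    "gender":   ["M", "F", "F", "M", "M"],
    "response": [0, 0, 1, 0, 1],
})
X = pd.get_dummies(df[["age", "income", "gender"]], drop_first=True)  # dummy-encode gender
y = df["response"]

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X)[:, 1])   # predicted probability of response = yes per customer
```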
Logistic regression
P(response = yes | age, income, gender) = 1 / (1 + e^−(β0 + β1·age + β2·income + β3·gender))

If Xi increases by 1:
logit|Xi+1 = logit|Xi + βi
odds|Xi+1 = odds|Xi · e^βi

e^βi: “odds-ratio”: multiplicative increase in the odds when Xi increases by 1 (other variables constant)
βi > 0 → e^βi > 1 → odds/probability increase with Xi
βi < 0 → e^βi < 1 → odds/probability decrease with Xi

Doubling amount: amount of change in Xi required to double the odds of the primary outcome
Doubling amount for Xi = log(2)/βi
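Continuing the sketch from the logistic regression example above (it assumes the fitted `clf` and dummy-encoded `X` from that illustrative block), odds ratios and doubling amounts follow directly from the coefficients:

```python
# Odds ratios and doubling amounts per feature, derived from fitted coefficients.
import numpy as np

for name, beta in zip(X.columns, clf.coef_[0]):
    odds_ratio = np.exp(beta)                                   # multiplicative change in odds per +1 unit
    doubling = np.log(2) / beta if beta != 0 else float("inf")  # change in X_i needed to double the odds
    print(f"{name}: odds ratio {odds_ratio:.3f}, doubling amount {doubling:.2f}")
```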
Logistic regression
Easy to interpret and understand
Statistical rigor, a “well-calibrated” classifier
Linear decision boundary, though interaction effects can be taken into the model (and explicitly; allows for investigation)
Sensitive to outliers
Categorical variables need to be converted (e.g. using dummy encoding as the most common approach, though recall ways to reduce a large number of dummies)
Somewhat sensitive to the curse of dimensionality…
Regularization
Stepwise approaches
Statisticians love “parsimonious” models:
If a “smaller” model works just as well as a “larger” one, prefer the smaller one Also: “curse of dimensionality”
Makes sense: most statistical techniques don’t like dumping in your whole feature set all at once Selection based approaches (build up final model step-by-step):
Forward selection Backward selection Hybrid (stepwise) selection
See MASS::stepAIC , leaps::regsubsets , caret , or simply step in R. Not implemented by default in Python (neither in scikit-learn nor statsmodels)… What’s going on?
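For illustration only, a sketch of what such a forward-selection loop could look like with statsmodels; the helper name and stopping rule are our own, and note the caveats on the following slides:

```python
# Illustrative forward selection minimizing AIC; not a built-in scikit-learn/statsmodels routine.
import numpy as np
import statsmodels.api as sm

def forward_selection(X, y):
    """Greedily add the feature that lowers AIC the most; stop when nothing improves."""
    remaining, selected, best_aic = list(X.columns), [], np.inf
    while remaining:
        aic, candidate = min(
            (sm.OLS(y, sm.add_constant(X[selected + [c]])).fit().aic, c)
            for c in remaining
        )
        if aic >= best_aic:
            break
        best_aic = aic
        selected.append(candidate)
        remaining.remove(candidate)
    return selected
```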
Stepwise approaches
Trying to get the best, smallest model given some information about a large number of variables is reasonable
Many sources cover stepwise selection methods. However, this is not really a legitimate approach
Frank Harrell (1996):
It yields R-squared values that are badly biased to be high, the F and chi-squared tests quoted next to each variable on the printout do not have the claimed distribution. The method yields confidence intervals for effects and predicted values that are falsely narrow (Altman and Andersen, 1989). It yields p-values that do not have the proper meaning, and the proper correction for them is a difficult problem. It gives biased regression coefficients that need shrinkage (the coefficients for remaining variables are too large (Tibshirani, 1996). It has severe problems in the presence of collinearity. It is based on methods (e.g., F tests for nested models) that were intended to be used to test prespecified hypotheses. Increasing the sample size does not help very much (Derksen and Keselman, 1992). It uses a lot of paper. (https://www.stata.com/support/faqs/statistics/stepwise-regression-problems/)
Stepwise approaches
Some of these issues have been or can be fixed (e.g. using proper tests), but… This actually already reveals something we’ll visit again when talking about evaluation! Take-away: use a proper train-test setup!
Developing and confirming a model based on the same dataset is called data dredging. Although there is some underlying relationship amongst the variables, and stronger relationships are expected to yield stronger scores, these are random variables and the realized values contain error. Thus, when you select variables based on having better realized values, they may be such because of their underlying true value, error, or both. True, using the AIC is better than using p-values, because it penalizes the model for complexity, but the AIC is itself a random variable (if you run a study several times and fit the same model, the AIC will bounce around just like everything else).
Regularization
SSE(Model 1) = (1 − 1)² + (2 − 2)² + (3 − 3)² + (8 − 4)² = 16
SSE(Model 2) = (1 − (−1))² + (2 − 2)² + (3 − 5)² + (8 − 8)² = 8
Regularization
Key insight: introduce a penalty on the size of the weights
Constrained, instead of fewer parameters! Makes the model less sensitive to outliers, improves generalization
Lasso and ridge regression:
Standard: y = β0 + β1x1 + … + βpxp + ϵ, with argmin_β ∑i=1..n (yi − ŷi)²
Lasso regression (L1 regularization): argmin_β ∑i=1..n (yi − ŷi)² + λ ∑j=1..p |βj|
Ridge regression (L2 regularization): argmin_β ∑i=1..n (yi − ŷi)² + λ ∑j=1..p βj²

No penalization on the intercept! Obviously: standardization/normalization required!
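A hedged sketch of how this looks in scikit-learn (note that sklearn calls the penalty weight `alpha` rather than λ; `X_train`/`y_train` are assumed to be available):

```python
# Sketch: lasso and ridge with standardization in a pipeline, so the penalty
# affects all (scaled) coefficients in a comparable way.
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1))   # alpha plays the role of lambda
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
# lasso.fit(X_train, y_train); ridge.fit(X_train, y_train)   # data assumed given
```

In practice, LassoCV and RidgeCV pick the penalty weight by cross-validation (see later).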
Lasso and ridge regression
https://newonlinecourses.science.psu.edu/stat508/lesson/5/5.1
Lasso and ridge regression
Lasso will force coefficients to become zero, ridge only to keep them within bounds
Variable selection for free with lasso
Why ridge, then? Easier to implement (slightly) and faster to compute (slightly), or when you have a limited number of variables to begin with
Lasso will also not consider grouping effects (e.g. it picks a variable at random when variables are correlated), and selects at most n variables when the number of instances is less than the number of features
In practice, however, lasso is preferred and tends to work well even with small sample sizes
How to pick a good value for lambda: cross-validation! (See later)
Works both for linear and logistic regression; the concept of L1 and L2 regularization also pops up with other model types (e.g. SVMs, neural networks) and fields:
Tikhonov regularization (Andrey Tikhonov), ridge regression (statistics), weight decay (machine learning), the Tikhonov–Miller method, the Phillips–Twomey method, the constrained linear inversion method, the method of linear regularization
Need to normalize variables beforehand to ensure that the regularization term λ regularizes/affects each variable in a similar manner!
MATLAB always uses the centred and scaled variables for the computations within ridge. It just back-transforms them before returning them.
Lasso and ridge regression
Elastic net
Every time you have two similar approaches, there’s an easy paper opportunity by proposing to combine them (and giving it a new name)…
Combine L1 and L2 penalties
Retains benefit of introducing sparsity
Good at getting grouping effects
Implemented in R and Python (check the documentation: everybody disagrees on how to call λ1 and λ2)
Grid search on two parameters necessary
The lasso parameter will be the most pronounced in most practical settings
argmin_β ∑i=1..n (yi − ŷi)² + λ1 ∑j=1..p |βj| + λ2 ∑j=1..p βj²
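As a sketch in scikit-learn (which, per the naming caveat above, uses `alpha` for the overall penalty strength and `l1_ratio` for the L1/L2 mix instead of two separate λ's; `X_train`/`y_train` assumed given):

```python
# Sketch: elastic net with cross-validated selection of both parameters.
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

enet = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5),   # grid over the L1/L2 mix
)
# enet.fit(X_train, y_train)
```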
Some Other Forms of Regression
Non-parametric regression
“Non-parametric” being a fancy name for “no underlying distribution assumed, purely data-driven”
“Smoothers” such as LOESS (locally weighted scatterplot smoothing)
Does not require specification of a function to fit a model, only a “smoothing” parameter
Very flexible, but requires large data samples (because LOESS relies on local data structure to provide local fitting), and does not produce a regression function
Take care when using this as a “model”; it is more an exploratory means!
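A minimal LOWESS/LOESS sketch with statsmodels on made-up data, where `frac` is the smoothing parameter:

```python
# Sketch: LOWESS smoothing; returns (sorted x, fitted value) pairs, not a regression function.
import numpy as np
import statsmodels.api as sm

x = np.linspace(0, 10, 200)
y = np.sin(x) + np.random.normal(scale=0.3, size=x.size)   # noisy toy data
smoothed = sm.nonparametric.lowess(y, x, frac=0.3)          # frac = smoothing parameter
```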
Generalized additive models
(GAMs): similar concept to normal regression, but uses splines and other smoothing functions in a linear combination
Benefit: capture non-linearities by smooth functions
Functions can be parametric, non-parametric, polynomial, locally weighted means, …
Very flexible, best-of-both-worlds approach
Danger of overfitting: stringent validation required
Theoretical relation to boosting (which we’ll discuss later)
Very nice technique but not that well known

y = β0 + f1(x1) + … + fp(xp) + ϵ
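As an illustration, using the third-party pyGAM library (not part of scikit-learn; `X` and `y` are assumed to be a feature matrix with two continuous columns and a continuous target):

```python
# Sketch: a GAM as a sum of smooth spline terms, one per feature.
from pygam import LinearGAM, s

gam = LinearGAM(s(0) + s(1)).fit(X, y)   # s(i) adds a spline term for feature i
gam.summary()                            # inspect the smooth terms and fit statistics
```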
aerosolve - Machine learning for humans A machine learning library designed from the ground up to be human friendly. A general additive linear piecewise spline model. The training is done at a higher resolution specified by num_buckets between the min and max of a feature’s range. At the end of each iteration we attempt to project the linear piecewise spline into a lower dimensional function such as a polynomial spline with Dirac delta endpoints.
Generalized additive models
Henckaerts, Antonio, et al., 2017:
log(E(nclaims)) = log(exp) + β0 + β1coveragePO + β2coverageFO + β3fueldiesel+ f1(ageph) + f2(power) + f3(bm) + f4(ageph, power) + f5(long, lat)
Multinomial and ordinal logistic regression
Extension for non-binary categorical outcomes
K possible outcomes, M features
For K outcomes, construct K − 1 binary logistic regression models:

ln( P(yi = 1|Xi) / P(yi = K|Xi) ) = β∙,1 ⋅ Xi
ln( P(yi = 2|Xi) / P(yi = K|Xi) ) = β∙,2 ⋅ Xi
…
ln( P(yi = K−1|Xi) / P(yi = K|Xi) ) = β∙,K−1 ⋅ Xi

with β∙,k ⋅ Xi = β0,k + β1,k x1,i + ⋯ + βM,k xM,i, and thus:

P(yi = K|Xi) = 1 − ∑k=1..K−1 P(yi = k|Xi) = 1 − ∑k=1..K−1 P(yi = K|Xi) e^(β∙,k ⋅ Xi)

P(yi = K|Xi) = 1 / (1 + ∑k=1..K−1 e^(β∙,k ⋅ Xi))
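In scikit-learn this corresponds to a softmax formulation rather than the K−1 reference-category models above, but the idea is the same; an ordinal (proportional odds) model would need e.g. statsmodels or the mord package. A hedged sketch, with `X_train`/`y_train` assumed given:

```python
# Sketch: multinomial logistic regression; y_train may contain K > 2 classes.
from sklearn.linear_model import LogisticRegression

multi = LogisticRegression(multi_class="multinomial", max_iter=1000)
# multi.fit(X_train, y_train)
# multi.predict_proba(X_test)   # one probability per class, summing to 1
```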
Multinomial and ordinal logistic regression
Extension for ordered categorical outcomes (e.g. ratings D < C < B < A < AA < AAA):

ln( P(yi ≤ R|Xi) / (1 − P(yi ≤ R|Xi)) ) = −θR + β1x1 + ⋯ + βnxn

P(yi = D|Xi) = P(yi ≤ D|Xi)
P(yi = C|Xi) = P(yi ≤ C|Xi) − P(yi ≤ D|Xi)
P(yi = B|Xi) = P(yi ≤ B|Xi) − P(yi ≤ C|Xi)
P(yi = A|Xi) = P(yi ≤ A|Xi) − P(yi ≤ B|Xi)
P(yi = AA|Xi) = P(yi ≤ AA|Xi) − P(yi ≤ A|Xi)
P(yi = AAA|Xi) = 1 − P(yi ≤ AA|Xi), since P(yi ≤ AAA|Xi) = 1 and θAAA = ∞

The logit functions for all ratings are parallel since they only differ in the intercept (proportional odds model)
PCR and PLS
Principal Component Regression (PCR):
Key idea: perform PCA first on the features and then perform normal regression Number of components to be tuned using cross-validation (see later) Standardization required as PCA is scaling-sensitive
Partial Least Squares (PLS) regression:
PCR does not take the response into account
PLS performs a PCA-like decomposition but now includes the target as well
The variance aspect often dominates, so PLS will behave similarly to PCR in many settings
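A small sketch of both, assuming `X_train`/`y_train` exist and with an arbitrary number of components (to be tuned via cross-validation):

```python
# Sketch: PCR as a pipeline (scale -> PCA -> linear regression) versus PLS regression.
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pcr = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
pls = PLSRegression(n_components=5, scale=True)   # PLS also uses the target to build components
# pcr.fit(X_train, y_train); pls.fit(X_train, y_train)
```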
Decision Trees
Decision trees
Both for classification and regression
We’ll discuss classification first
Based on recursively partitioning the data
Splitting decision: How to split a node?
Age < 30, income < 1000, status = married?
Stopping decision: When to stop splitting?
When to stop growing the tree?
Assignment decision: How to assign a label outcome in the leaf nodes?
Which class to assign to the leaf node?
Terminology
ID3
ID3 (Iterative Dichotomiser 3)
Most basic decision tree algorithm, by Ross Quinlan (1986)
Begin with the original set S as the root node
On each iteration of the algorithm, iterate through every unused attribute of the set S and calculate a measure for that attribute, e.g. Entropy H(S) and Information Gain IG(A, S)
Select the best attribute, split on the selected attribute to produce subsets
Continue to recurse on each subset, considering only attributes not selected before (for this particular branch of the tree)
Recursion stops when every element in a subset belongs to the same class label, or there are no more attributes to be selected, or there are no instances left in the subset
ID3
ID3
ID3
Impurity measures
Which measure? Based on impurity
- Minimal impurity
- Also minimal impurity
- Maximal impurity
Impurity measures
Intuitively, it’s easy to see that:
- ••••• → ••• + •••
Is better than:
- ••••• → ••• + •••
But what about:
- ••••• → •••• + ••
We need a measure…
Entropy
Entropy is a measure of the amount of uncertainty in a data set (information theory):

H(S) = − ∑x∈X p(x) log2(p(x))

with S the data (sub)set, X the classes (e.g. {yes, no}), and p(x) the proportion of elements with class x over |S|

When H(S) = 0, the set is completely pure (all elements belong to the same class)
Entropy
For the original data set, we get:
#yes  #no  x = yes  x = no  Entropy
9     5    0.41     0.53    0.94

(the “x = yes” and “x = no” columns are the per-class terms −p(x)·log2(p(x)); H(S) = 0.41 + 0.53 = 0.94)
Information gain
- ••••• → •••• + ••
We can calculate the entropy of the original set and all the subsets, but how to measure the improvement? Information gain: measure of the difference in impurity before and after the split
How much uncertainty was reduced by a particular split?
IG(A, S) = H(S) − ∑t∈T p(t) H(t)

with T the set of subsets obtained by splitting the original set S on attribute A, and p(t) = |t|/|S|
Information gain
Original set S:

#yes  #no  x = yes  x = no  Entropy
9     5    0.41     0.53    0.94
Calculate the entropy for all subsets created by all candidate splitting features:
Attribute    Subset    #yes  #no  x = yes  x = no  Entropy
Outlook      Sunny     2     3    0.53     0.44    0.97
Outlook      Overcast  4     0                     0.00
Outlook      Rain      3     2    0.44     0.53    0.97
Temperature  Hot       2     2    0.50     0.50    1
Temperature  Mild      4     2    0.39     0.53    0.92
Temperature  Cool      3     1    0.31     0.5     0.81
Humidity     High      3     4    0.52     0.46    0.99
Humidity     Normal    6     1    0.19     0.4     0.59
Wind         Strong    3     3    0.50     0.50    1
Wind         Weak      6     2    0.31     0.5     0.81
Information gain
IG(A, S) = H(S) − ∑t∈T p(t) H(t)

IG(outlook, S) = 0.94 − (5/14 · 0.97 + 4/14 · 0.00 + 5/14 · 0.97) = 0.94 − 0.69 = 0.25 ← highest IG
IG(temperature, S) = 0.03
IG(humidity, S) = 0.15
IG(wind, S) = 0.05
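The same numbers can be reproduced with a few lines of Python (a didactic sketch, not how tree libraries implement this internally):

```python
# Sketch: entropy and information gain for the play-tennis style example above.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """IG(A, S) = H(S) - sum over subsets t of |t|/|S| * H(t)."""
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    return entropy(labels) - sum(len(t) / n * entropy(t) for t in subsets.values())

# Outlook example, counts taken from the table above:
# sunny (2 yes / 3 no), overcast (4 yes / 0 no), rain (3 yes / 2 no)
rows = [{"outlook": o} for o in ["sunny"] * 5 + ["overcast"] * 4 + ["rain"] * 5]
labels = ["yes", "yes", "no", "no", "no"] + ["yes"] * 4 + ["yes", "yes", "yes", "no", "no"]
print(round(information_gain(rows, labels, "outlook"), 2))   # ≈ 0.25
```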
ID3: after one split
ID3: continue
Recursion stops when every element in a subset belongs to the same class label, or there are no more attributes to be selected, or there are no instances left in the subset
ID3: final tree
Assign labels to the leaf nodes: easy – just pick the most common class
Impurity measures
Entropy (Shannon index) is not the only measure of impurity that can be used
Entropy: H(S) = − ∑x∈X p(x) log2(p(x))
Gini diversity index: Gini(S) = 1 − ∑x∈X p(x)²
Not very different: Gini works a bit better for continuous variables (see later) and is a little faster; most implementations default to this approach
Classification error: ClassErr(S) = 1 − max x∈X p(x)
Something to think about: why not use accuracy directly? Or another metric of interest such as AUC, precision, F1, …?
Summary so far
Using the tree to predict new labels is easy: just follow the questions in the tree and look at the outcome
Easy to understand, easy (for a computer) to construct
Can be easily expressed as simple IF…THEN rules as well, and can hence be easily implemented in existing programs (even as a SQL procedure)
Fun as background research: algorithms exist which directly try to induce prediction models in the form of a rule base, with RIPPER (Repeated Incremental Pruning to Produce Error Reduction) being the most well known. It’s just as old and leads to very similar models, but has some interesting differences and is (nowadays) not widely implemented or known anymore
https://christophm.github.io/interpretable-ml-book/rules.html
See also RuleFit (https://github.com/christophM/rulefit) and Skope-Rules (https://github.com/scikit-learn-contrib/skope-rules) for interesting, newer approaches
Problems still to solve
ID3 is greedy: it never backtracks (i.e. retraces on previous steps) during the construction of the tree, it only moves forward
This means that global optimality of the tree is not guaranteed
Algorithms exist which overcome this (see e.g. the evtree package for R), though they’re often slow and do not give much better results (so greedy is good enough)
A bigger problem, however, is the fact that we do not have a way to tackle continuous variables, such as temperature = 21, 24, 26, 27, 30, …
Another big problem is that the “grow for as long as you can” strategy leads to trees which will horribly overfit!
Spotting overfitting
(Note that this is a good motivating case to illustrate the difference between “supervised methods for predictive analytics” and “for descriptive analytics”)
C4.5
Also by Ross Quinlan
Extension of ID3: still uses information gain
Main contribution: dealing with continuous variables
Can also deal with missing values
The original paper describes that you just ignore them when calculating the impurity measure and information gain, though most implementations do not implement this!
Also allows setting importance weights on attributes (biasing the information gain, basically)
Describes methods to prune trees
C4.5: continuous variables
Say we want to split on temperature = 21, 24, 26, 27, 30, …
Obviously, using the values as-is would not be a good idea
It would lead to a lot of subsets, many of which potentially have only a few instances
When applying the tree to new data, the chance of encountering a value which was unseen during training is much higher than for categoricals, e.g. what if temperature is 22?
Instead, enforce binary splits by:
Splitting on temperature <= 21 → two subsets (yes, no) – and calculate the information gain Splitting on temperature <= 24 → two subsets (yes, no) – and calculate the information gain Splitting on temperature <= 26 → two subsets (yes, no) – and calculate the information gain Splitting on temperature <= 27 → two subsets (yes, no) – and calculate the information gain And so on… Important: only the distinct set of values seen in the training set are considered (others wouldn’t change the information gain), though some papers propose changes to make this a little more stable
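A small sketch of this threshold scan, reusing the entropy() helper from the earlier information-gain example (didactic only):

```python
# Sketch: find the binary split "value <= t" on a continuous attribute with the highest
# information gain; only distinct observed values are tried as candidate thresholds.
def best_threshold(values, labels):
    base, n = entropy(labels), len(labels)
    best_t, best_gain = None, 0.0
    for t in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        if not left or not right:
            continue
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```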
C4.5: continuous variables
Note that each of our “temperature <= …” splits leads to a yes/no outcome
Couldn’t we do the same for categorical features as well?
Humidity = “high” → two subsets (yes, no) – and calculate the information gain
Humidity = “medium” → two subsets (yes, no) – and calculate the information gain
Humidity = “low” → two subsets (yes, no) – and calculate the information gain
Turns out that constructing such a binary tree is better
The information gain measure is biased towards preferring splits that lead to more subsets (avoided now: always two subsets)
Obviously: we can now re-use attributes throughout the tree as each attribute leads to multiple binary subsets
The tree can be deeper, but less wide
Weka is a weird exception
C4.5: preventing overfitting
Early stopping: stop based on a stop criterion
E.g. when number of instances in subset goes below a threshold Or when depth gets too high Or: set aside a validation set during training and stop when performance on this set starts to decrease (if you have enough training data to begin with)
C4.5: preventing overfitting
Pruning: grow the full tree, but then reduce it
Merge back leaf nodes if they add little power to the classification accuracy “Inside every big tree is a small, perfect tree waiting to come out.” –Dan Steinberg Many forms exist: weakest-link pruning, cost-complexity pruning (common), etc… Oftentimes governed by a “complexity” parameter in most implementations
Only recently in sklearn: https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html#sphx-glr-auto-examples-tree-plot-cost-complexity-pruning-py
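Following the linked example, a hedged sketch of how this looks in scikit-learn (`X_train`/`y_train` assumed given; the best alpha would be chosen on a validation set or via cross-validation):

```python
# Sketch: cost-complexity pruning via ccp_alpha in scikit-learn.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=0)
path = tree.cost_complexity_pruning_path(X_train, y_train)   # candidate alphas for pruning
pruned_trees = [
    DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train)
    for a in path.ccp_alphas
]
# Evaluate each pruned tree on held-out data and keep the alpha that generalizes best
```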
Scoring of new instances can now return a “probability”
C5 (See5)
Also by Ross Quinlan
Commercial offering
Faster, memory efficient, smaller trees, weighting of cases and supplying misclassification costs
Not widely adopted… only recently made open source, better open source implementations available
Support for boosting (see next course)
Outdated
A few final insights…
Conditional inference trees:
Another method which instead of using information gain uses a significance test procedure in order to select variables
Preprocessing:
Decision trees are robust to outliers Only missing value treatment needed Some implementations have proposed three-way splits (yes / no / NA)
Multiclass:
Concept of decision trees easily extended to multiclass setting
A few final insights…
Categorization:
Recall preprocessing: possible to run a decision tree on one continuous variable only to suggest a good binning based on the leaf nodes
Interaction effects and nonlinearities:
Considered by default by decision trees CHAID (Chi-square automatic interaction detection): chi-square based test to split trees
Variable selection:
Based on features that pop up earlier in the tree
Not a well-calibrated classifier, and unstable:
Sensitive to changes in training data: a small change can cause your tree to look different
A few final insights
Remember to prevent overfitting trees
But a deep tree does not necessarily mean that you have a problem And a short tree does not necessarily mean that it’s not overfitting
However, note that it is now very likely that your leaf nodes will not be completely pure (i.e. not containing 100% yes cases or no cases)
A few final insights
Together with logistic regression, decision trees are in the top-3 of predictive techniques on tabular data. They’re simple to understand and present, require very little data preparation, and can learn non-linear relationships
In fact, many ways exist to take your favorite implementation and extend it
E.g. when domain experts have their favorite set of features, you can easily constrain the tree to only consider a subset of the features in the first n levels
You might even consider playing with more candidates to generate binary split rules, e.g. “feature X between A and B?”, “euclidean dist(X, Y) <= t”, …
A few final insights
Regression trees:
As made popular by CART: Classification And Regression Trees “Tree structured regression offers an interesting alternative for looking at regression type problems. It has sometimes given clues to data structure not apparent from a linear regression analysis” Instead of calculating the #yes’s vs. #no’s to get the predicted class label, take the mean of the continuous label and use that as the outcome
But how to select the splitting criterion?
Squared residuals minimization: the expected sum of variances for the two resulting nodes should be minimized
Based on the sum of squared errors: find the split that produces the greatest reduction in SSE across the subsets
Find nodes with minimal within-variance… and therefore greatest between-variance (a little bit like k-means, a clustering technique we’ll visit later)
Important: comments regarding pruning still apply, though a regression tree will typically need to be deeper than a classification one
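A minimal regression-tree sketch in scikit-learn (recent versions name the criterion "squared_error"; `X_train`/`y_train`/`X_test` assumed given):

```python
# Sketch: a regression tree; splits minimize squared error, leaves predict the mean target value.
from sklearn.tree import DecisionTreeRegressor

reg_tree = DecisionTreeRegressor(criterion="squared_error", max_depth=6)
# reg_tree.fit(X_train, y_train)
# reg_tree.predict(X_test)   # prediction = mean of the training targets in the reached leaf
```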
A few final insights
Visualizing:
R: use fancyRpartPlot() from the rattle package for nicer visualizations Python: use dtreeviz (https://github.com/parrt/dtreeviz) for nicer visualizations
Decision trees versus (logistic) regression
Decision boundary of decision trees: rectangular regions orthogonal to the feature axes
Decision trees versus (logistic) regression
Decision trees can struggle to capture linear relationships
E.g. the best it can do is a step function approximation of a linear relationship
This is strictly related to how decision trees work: they split the input features into several “orthogonal” regions and assign a prediction value to each region
Here, a deeper tree would be necessary to approximate the linear relationship
Or apply a transformation first (e.g. PCA)
K-nearest Neighbors
K-nearest neighbors (K-NN)
A non-parametric method used for classification and regression “The data is the model” Trivially easy Based on the concept of distances (so normalization/standardization required)
K-nearest neighbors (K-NN)
Has some appealing properties: easy to understand
Fun to tweak with custom distances, different values for k, dynamic k-values, custom distance measures… – even a recommender system can be built using this approach
Provides surprisingly good results, given enough data
Regression: e.g. use a (weighted) average of the k nearest neighbors, weighted by the inverse of their distance
The main disadvantage of k-NN is that it is a “lazy learner”: it does not learn anything from the training data and simply uses the training data itself as a model
This means you don’t get a formula or a tree as a summarizing model
And that you need to keep the training data around
Relatively slow
Not as stable for noisy data, might be hard to generalize, unstable with regard to the k setting
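A short sketch of a distance-weighted k-NN classifier in scikit-learn, with scaling up front since k-NN is distance based (`X_train`/`y_train`/`X_test` assumed given):

```python
# Sketch: k-NN with standardization; weights="distance" weighs neighbors by 1/distance.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

knn = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, weights="distance"),
)
# knn.fit(X_train, y_train); knn.predict_proba(X_test)
```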
K-nearest neighbors (K-NN)
https://towardsdatascience.com/scanned-digits-recognition-using-k-nearest-neighbor-k-nn-d1a1528f0dea
Wrap up
Linear regression, logistic regression and decision trees still amongst most widely used techniques
White box, statistical rigor, easy to construct and interpret
K-NN is used less but can provide a quick baseline and is very extensible
The decision between regression and (which type of) classification is not always a clear-cut choice!
Neither is the decision between unsupervised and supervised
Iteration and domain expert involvement required