

SLIDE 1

Pattern Recognition

Bertrand Thirion and John Ashburner


SLIDE 2

Outline

- Introduction
  - Definitions
  - Classification and Regression
  - Curse of Dimensionality
- Generalization
- Overview of the main methods
- Resources

SLIDE 3

Some key concepts

- Supervised learning: the data comes with additional attributes that we want to predict ⇒ classification and regression.
- Unsupervised learning: no target values. Discover groups of similar examples within the data (clustering), determine the distribution of data within the input space (density estimation), or project the data down to two or three dimensions for visualization.

SLIDE 4

General supervised learning setting

We have a training dataset of n observations, each consisting of an input xi and a target yi. Each input xi consists of a vector of p features:

D = {(xi, yi) | i = 1, ..., n}

The aim is to predict the target for a new input x∗.

SLIDE 5

Classification

Targets (y) are categorical labels. Train with D and use the result to make a best guess of y∗ given x∗.

[Figure: classification example, two labelled classes plotted over Feature 1 and Feature 2.]
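As a rough illustration of this setting (not the slide's own example), here is a minimal scikit-learn sketch on synthetic two-feature data, producing a best guess of y∗ for a new x∗:

```python
# Train a classifier on D = {(x_i, y_i)} and guess y* for a new input x*.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)),    # class 0 inputs (n = 100, p = 2)
               rng.normal(+2, 1, (50, 2))])   # class 1 inputs
y = np.repeat([0, 1], 50)                     # categorical targets

clf = LogisticRegression().fit(X, y)
x_star = np.array([[0.5, 1.0]])               # a new input x*
print(clf.predict(x_star))                    # best guess of y*
print(clf.predict_proba(x_star))              # P(y* = k | x*, D), cf. the next slide
```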

SLIDE 6

Probabilistic classification

Targets (y) are categorical labels. Train with D and compute P(y∗ = k|x∗, D).

[Figure: probabilistic classification example, class membership probabilities plotted over Feature 1 and Feature 2.]

SLIDE 7

Regression

Targets (y) are continuous real variables. Train with D and compute p(y∗|x∗, D).

[Figure: regression example, continuous target values plotted over Feature 1 and Feature 2.]

SLIDE 8

Many other settings

- Multi-class classification: when there are more than two possible categories.
- Ordinal regression: for classification when there is some ordering of the categories.
  Chu, Wei, and Zoubin Ghahramani. “Gaussian processes for ordinal regression.” Journal of Machine Learning Research, pp. 1019-1041, 2005.
- Multi-task learning: when there are multiple targets to predict, which may be related.
- etc.

SLIDE 9

Multi-Class classification

- Multinomial logistic regression: theoretically optimal, but expensive optimization.
- One-versus-all classification [SVMs]: among several hyperplanes, choose the one with maximal margin. ⇒ recommended.
- One-versus-one classification: vote across each pair of classes. Expensive, not optimal.
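A hedged sketch of the one-versus-all and one-versus-one strategies, assuming scikit-learn's OneVsRestClassifier and OneVsOneClassifier wrappers around a linear SVM (the dataset and parameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)                          # three classes
ovr = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=10000)).fit(X, y)
ovo = OneVsOneClassifier(LinearSVC(C=1.0, max_iter=10000)).fit(X, y)
print(ovr.predict(X[:5]))                                  # one hyperplane per class
print(ovo.predict(X[:5]))                                  # one vote per pair of classes
```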

SLIDE 10

Curse of dimensionality

Large p, small n.


SLIDE 11

Nearest-neighbour classification

[Figure: nearest-neighbour decision regions over Feature 1 and Feature 2.]

The decision boundaries are not nice smooth separations: lots of sharp corners. May be improved with K-nearest neighbours.
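A small sketch of how the choice of K changes the fit, assuming scikit-learn and a synthetic two-moons dataset (illustrative only):

```python
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)
for k in (1, 15):                   # K = 1 gives jagged boundaries; larger K smooths them
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, knn.score(X, y))       # training accuracy (K = 1 fits the training data perfectly)
```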

SLIDE 12

Behaviour changes in high-dimensions


SLIDE 13

Behaviour changes in high-dimensions

[Figure: volume of the hyper-sphere with r = 1/2 as a function of the number of dimensions (circle area = πr^2, sphere volume = 4/3 πr^3); the volume falls rapidly as the number of dimensions grows.]
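The curve in the figure can be reproduced with the standard formula for the volume of a d-dimensional ball, V_d(r) = π^(d/2) r^d / Γ(d/2 + 1). A short sketch (the chosen dimensions are arbitrary):

```python
# Volume of the d-dimensional hyper-sphere of radius r = 1/2.
# With r = 1/2 this is also the fraction of the enclosing unit cube occupied by the sphere.
from math import gamma, pi

for d in (1, 2, 3, 5, 10, 20):
    v = pi ** (d / 2) * 0.5 ** d / gamma(d / 2 + 1)
    print(f"d={d:2d}  volume={v:.6f}")
```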

SLIDE 14

Occam’s razor

“Everything should be kept as simple as possible, but no simpler.”

— Einstein (allegedly)

Complex models (with many estimated parameters) usually explain training data better than simpler models. Simpler models often generalise better to new data than more complex models. Need to find the model with the optimal bias/variance tradeoff.

SLIDE 15

Bayesian model selection

Real Bayesians don’t cross-validate (except when they need to).

P(M|D) = p(D|M) P(M) / p(D)

The Bayes factor allows the plausibility of two models (M1 and M2) to be compared:

K = p(D|M1) / p(D|M2) = ∫ p(D|θ_M1, M1) p(θ_M1|M1) dθ_M1 / ∫ p(D|θ_M2, M2) p(θ_M2|M2) dθ_M2

This is usually too costly in practice, so approximations are used.

SLIDE 16

Model selection

Some approximations/alternatives to the Bayesian approach:

- Laplace approximations: find the MAP/ML solution and use a Gaussian approximation to the parameter uncertainty.
- Minimum Message Length (MML): an information theoretic approach.
- Minimum Description Length (MDL): an information theoretic approach based on how well the model compresses the data.
- Akaike Information Criterion (AIC): −2 log p(D|θ) + 2k, where k is the number of estimated parameters.
- Bayesian Information Criterion (BIC): −2 log p(D|θ) + k log q, where q is the number of observations.
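A hedged sketch of computing AIC and BIC for a Gaussian linear model fitted by maximum likelihood; the data, the parameter count k, and the use of NumPy's lstsq are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
q, p = 100, 3                                    # q observations, p weights
X = rng.normal(size=(q, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=q)

w, *_ = np.linalg.lstsq(X, y, rcond=None)        # ML estimate of the weights
resid = y - X @ w
sigma2 = resid.var()                             # ML estimate of the noise variance
loglik = -0.5 * q * (np.log(2 * np.pi * sigma2) + 1)

k = p + 1                                        # estimated parameters: weights + noise variance
print("AIC:", -2 * loglik + 2 * k)
print("BIC:", -2 * loglik + k * np.log(q))
```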

SLIDE 17

Model selection by nested cross-validation

- Inner cross-validation loop used to evaluate the model's performance on a pre-defined grid of parameters and retain the best one.
- Safe, but costly. Supported by some libraries (e.g. scikit-learn).
- Some estimators have a path model, hence allow faster evaluation (e.g. LASSO).
- Randomized techniques also exist, sometimes more efficient.
- Caveat: the inner cross-validation loop ≠ the outer cross-validation loop used for parameter evaluation.
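A minimal nested cross-validation sketch, assuming scikit-learn's GridSearchCV for the inner loop and cross_val_score for the outer loop (the grid and dataset are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=30, random_state=0)

# Inner loop: pick C from a pre-defined grid; outer loop: assess the whole procedure.
inner = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean(), outer_scores.std())
```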

SLIDE 18

Accuracy measures for regression

- Root-mean-squared error for point predictions.
- Correlation coefficient for point predictions.
- Log predictive probability can be used for probabilistic predictions.
- Expected loss/risk for point predictions for decision making.
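A small sketch of the first two point-prediction measures (the numbers are invented for illustration):

```python
import numpy as np

y_true = np.array([3.1, 1.4, 2.3, 3.1, 6.3])
y_pred = np.array([2.8, 1.9, 2.0, 3.5, 5.9])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # root-mean-squared error
corr = np.corrcoef(y_true, y_pred)[0, 1]          # correlation coefficient
print(rmse, corr)
```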

SLIDE 19

Accuracy measures for binary classification

Wikipedia contributors, “Sensitivity and specificity,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Sensitivity_and_specificity&oldid=655245669 (accessed April 9, 2015).
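A minimal sketch of computing sensitivity and specificity from a confusion matrix, assuming scikit-learn's confusion_matrix and invented labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                 # hypothetical test labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]                 # hypothetical predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("sensitivity (TPR):", tp / (tp + fn))
print("specificity (TNR):", tn / (tn + fp))
```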

SLIDE 20

Accuracy measures from ROC curve

The Receiver operating characteristic (ROC) curve is a plot of true-positive rate (sensitivity) versus false-positive rate (1-specificity) over the full range of possible thresholds. The area under the curve (AUC) is the integral under the ROC curve.

[Figure: ROC curve (AUC = 0.9769), sensitivity plotted against 1 − specificity.]
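A hedged sketch of producing an ROC curve and its AUC from predicted class probabilities, assuming scikit-learn (the dataset and classifier are illustrative, so the AUC will not match the figure's 0.9769):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
scores = LogisticRegression(max_iter=1000).fit(Xtr, ytr).predict_proba(Xte)[:, 1]

fpr, tpr, thresholds = roc_curve(yte, scores)     # 1-specificity and sensitivity per threshold
print("AUC:", roc_auc_score(yte, scores))
```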

SLIDE 21

Log predictive probability

Some data are more easily classified than others. Probabilistic classifiers provide a level of confidence for each prediction: p(y∗|x∗, y, X, θ). Quality of predictions can be assessed using the test log predictive probability:

(1/m) Σ_{i=1..m} log2 p(y∗i = ti | x∗i, y, X, θ)

After subtracting the baseline measure, this shows the average bits of information given by the model.
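A small sketch of the measure, assuming hypothetical predicted probabilities for the true class of each test case and a chance-level binary baseline of log2(0.5):

```python
import numpy as np

# p_hat[i] = probability the classifier assigned to the *true* class of test case i (hypothetical values).
p_hat = np.array([0.9, 0.7, 0.95, 0.4, 0.8])
avg_bits = np.mean(np.log2(p_hat))                # (1/m) sum_i log2 p(y*_i = t_i | ...)
baseline = np.log2(0.5)                           # uninformed binary classifier
print("information gained over baseline (bits):", avg_bits - baseline)
```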

Rasmussen & Williams. “Gaussian Processes for Machine Learning”, MIT Press (2006). http://www.gaussianprocess.org/gpml/

SLIDE 22

Overview of classification tools

Only one rule: No tool wins in all situations.


SLIDE 23

Generative models for classification

P(y=k|x) = P(y=k) p(x|y=k) / Σ_j P(y=j) p(x|y=j)

[Figure: ground-truth class labels plotted over Feature 1 and Feature 2.]

SLIDE 24

Linear discriminant analysis

P(y=k|x) = P(y=k) p(x|y=k) / Σ_j P(y=j) p(x|y=j)

Assumes: p(x|y=k) = N(x|µk, Σ)

[Figure: four panels over Feature 1 and Feature 2: p(x,y=0) = p(x|y=0) p(y=0); p(x,y=1) = p(x|y=1) p(y=1); p(x) = p(x,y=0) + p(x,y=1); p(y=0|x) = p(x,y=0)/p(x).]

Model has 2p + p(p − 1) parameters to estimate (two means and a single covariance). Number of observations is pn (size of inputs).
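A minimal sketch of LDA (and, for comparison with the next slide, QDA), assuming scikit-learn and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

X, y = make_classification(n_samples=200, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)
lda = LinearDiscriminantAnalysis().fit(X, y)        # shared covariance (this slide)
qda = QuadraticDiscriminantAnalysis().fit(X, y)     # per-class covariances (next slide)
print(lda.predict_proba(X[:3]))                     # P(y = k | x)
print(qda.predict_proba(X[:3]))
```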

SLIDE 25

Quadratic discriminant analysis

P(y=k|x) = P(y=k) p(x|y=k) / Σ_j P(y=j) p(x|y=j)

Assumes different covariances: p(x|y=k) = N(x|µk, Σk)

[Figure: four panels over Feature 1 and Feature 2: p(x,y=0) = p(x|y=0) p(y=0); p(x,y=1) = p(x|y=1) p(y=1); p(x) = p(x,y=0) + p(x,y=1); p(y=0|x) = p(x,y=0)/p(x).]

Model has 2p + 2p(p − 1) parameters to estimate (two means and two covariances). Number of observations is pn.

SLIDE 26

Naive Bayes

P(y=k|x) = P(y=k) p(x|y=k) / Σ_j P(y=j) p(x|y=j)

Assumes that features are independent: p(x|y=k) = Π_i p(xi|y=k)

[Figure: four panels over Feature 1 and Feature 2: p(x,y=0) = p(x|y=0) p(y=0); p(x,y=1) = p(x|y=1) p(y=1); p(x) = p(x,y=0) + p(x,y=1); p(y=0|x) = p(x,y=0)/p(x).]

Model has variable number of parameters to estimate, but the above example has 3p. Number of observations is pn.
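A minimal Gaussian naive Bayes sketch, assuming scikit-learn and synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
nb = GaussianNB().fit(X, y)            # one univariate Gaussian per feature and class
print(nb.predict_proba(X[:3]))         # P(y = k | x) under the independence assumption
```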

SLIDE 27

Linear regression: maximum likelihood

A simple way to do regression is by f(x∗) = w^T x∗.

Assuming Gaussian noise on y, the ML estimate of w is:

ŵ = (X^T X)^-1 X^T y, where X = [x1 x2 ... xn]^T and y = [y1 y2 ... yn]^T

Model has p parameters to estimate. Number of observations is n (number of targets). Usually needs dimensionality reduction, with (e.g.) SVD.
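A small NumPy sketch of the ML estimate on synthetic data; the explicit normal-equation solve and the SVD-based lstsq should agree here:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=n)

w_hat = np.linalg.solve(X.T @ X, X.T @ y)      # (X^T X)^-1 X^T y
w_svd, *_ = np.linalg.lstsq(X, y, rcond=None)  # SVD-based solution, safer when X^T X is ill-conditioned
print(w_hat, w_svd)
```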

SLIDE 28

Linear regression: maximum posterior

We may have prior knowledge about various distributions:

p(y∗|x∗, w) = N(w^T x∗, σ^2)
p(w) = N(0, Σ0)

Therefore p(w|y, X) = N(σ^-2 B^-1 X^T y, B^-1), where B = σ^-2 X^T X + Σ0^-1.

The maximum a posteriori (MAP) estimate of w is ŵ = σ^-2 B^-1 X^T y.

SLIDE 29

Linear regression: Bayesian

We may have prior knowledge about various distributions:

p(y∗|x∗, w) = N(w^T x∗, σ^2)
p(w) = N(0, Σ0)

Therefore p(w|y, X) = N(σ^-2 B^-1 X^T y, B^-1), where B = σ^-2 X^T X + Σ0^-1.

Predictions are made by integrating out the uncertainty of the weights, rather than estimating them:

p(y∗|x∗, y, X) = ∫ p(y∗|x∗, w) p(w|y, X) dw = N(σ^-2 x∗^T B^-1 X^T y, x∗^T B^-1 x∗)

Estimated parameters may be σ^2, and parameters encoding Σ0.
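A small NumPy sketch of these formulas on synthetic data, assuming an identity prior covariance Σ0 and a known noise variance σ^2 (both illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 30, 4, 0.1 ** 2
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=np.sqrt(sigma2), size=n)

Sigma0 = np.eye(p)                                   # prior covariance of w (assumed)
B = X.T @ X / sigma2 + np.linalg.inv(Sigma0)
w_map = np.linalg.solve(B, X.T @ y) / sigma2         # MAP estimate (previous slide)

x_star = rng.normal(size=p)                          # new input x*
mean = x_star @ w_map                                # sigma^-2 x*^T B^-1 X^T y
var = x_star @ np.linalg.solve(B, x_star)            # x*^T B^-1 x*
print(mean, var)
```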

SLIDE 30

Kernel methods: Woodbury matrix identity

B^-1 = (σ^-2 X^T X + Σ0^-1)^-1      (invert a p × p matrix)
     = Σ0 − Σ0 X^T (Iσ^2 + X Σ0 X^T)^-1 X Σ0      (invert an n × n matrix)

Wikipedia contributors, “Woodbury matrix identity,” Wikipedia, The Free Encyclopedia, http://en.wikipedia.org/w/index.php?title=Woodbury_matrix_identity&oldid=638370219 (accessed April 1, 2015).

(A + UCV)^-1 = A^-1 − A^-1 U (C^-1 + V A^-1 U)^-1 V A^-1
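A quick numerical check of the identity in the large p, small n regime (NumPy, illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 20, 500, 0.5                   # large p, small n
X = rng.normal(size=(n, p))
Sigma0 = np.eye(p)

lhs = np.linalg.inv(X.T @ X / sigma2 + np.linalg.inv(Sigma0))   # invert a p x p matrix
rhs = Sigma0 - Sigma0 @ X.T @ np.linalg.inv(
    np.eye(n) * sigma2 + X @ Sigma0 @ X.T) @ X @ Sigma0         # invert an n x n matrix
print(np.allclose(lhs, rhs))                  # True
```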

SLIDE 31

Kernel methods: Gaussian process regression

The predicted distribution is:

p(y∗|x∗, y, X) = N(k^T C^-1 y, c − k^T C^-1 k)

where:

C = X Σ0 X^T + Iσ^2
k = X Σ0 x∗
c = x∗^T Σ0 x∗ + σ^2
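A minimal NumPy sketch of these prediction equations, assuming an identity Σ0 (i.e. a linear kernel) and synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 30, 2, 0.05
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)
Sigma0 = np.eye(p)                                   # prior covariance (assumed)

C = X @ Sigma0 @ X.T + sigma2 * np.eye(n)
x_star = rng.normal(size=p)                          # new input x*
k = X @ Sigma0 @ x_star
c = x_star @ Sigma0 @ x_star + sigma2

mean = k @ np.linalg.solve(C, y)                     # k^T C^-1 y
var = c - k @ np.linalg.solve(C, k)                  # c - k^T C^-1 k
print(mean, var)
```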

SLIDE 32

Kernel methods: nonlinear methods

Sometimes, we want alternatives to C = X Σ0 X^T + Iσ^2. Nonlinearity is achieved by replacing the matrix K = X Σ0 X^T with some function of the data that gives a positive definite matrix encoding similarities, e.g.

k(xi, xj) = θ1 + θ2 xi · xj + θ3 exp(−||xi − xj||^2 / (2θ4^2))

Hyper-parameters θ1 to θ4 can be optimised in a number of ways.
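A small sketch of this example covariance function and the Gram matrix it produces (NumPy; the hyper-parameter values are arbitrary):

```python
import numpy as np

def kernel(xi, xj, theta1=1.0, theta2=1.0, theta3=1.0, theta4=1.0):
    """k(xi, xj) = theta1 + theta2 xi.xj + theta3 exp(-||xi - xj||^2 / (2 theta4^2))"""
    return (theta1 + theta2 * np.dot(xi, xj)
            + theta3 * np.exp(-np.sum((xi - xj) ** 2) / (2 * theta4 ** 2)))

X = np.random.default_rng(0).normal(size=(5, 3))
K = np.array([[kernel(a, b) for b in X] for a in X])   # positive (semi-)definite Gram matrix
print(np.linalg.eigvalsh(K))                            # eigenvalues should be non-negative
```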

SLIDE 33

Kernel methods: nonlinear methods

Non-linear methods are useful in low dimensions to adapt the shape of decision boundaries. For large p, small n problems, nonlinear methods do not seem to help much. Nonlinearity also reduces interpretability.

SLIDE 34

Probabilistic discriminative models

Regression: continuous targets, y ∈ R. Usually assume a Gaussian distribution: p(y|x, w) = N(f(x, w), σ^2), where σ^2 is a variance.

Binary classification: categorical targets, y ∈ {0, 1}. Usually assume a binomial distribution: p(y|x, w) = σ(f(x, w))^y (1 − σ(f(x, w)))^(1−y), where σ is a squashing function.

SLIDE 35

Probabilistic discriminative models

For binary classification: p(y∗ = 1|x∗, w) = σ(f (x∗, w)) where σ is some squashing function, eg: Logistic sigmoid function (inverse of Logit). Normal CDF (inverse of Probit).

[Figure: the logistic function and the inverse probit function (Normal CDF), σ(f∗) plotted against f∗.]

SLIDE 36

Probabilistic discriminative models

Integrating over the uncertainty of the separating hyperplane allows probabilistic predictions further from the training data. This is not usually done for methods such as the relevance-vector machine (RVM).

Rasmussen, Carl Edward, and Joaquin Quinonero-Candela. “Healing the relevance vector machine through augmentation.” In Proceedings of the 22nd international conference on Machine learning, pp. 689-696. ACM, 2005.

[Figure: panels comparing simple logistic regression, the hyperplane uncertainty, and Bayesian logistic regression over Feature 1 and Feature 2.]

SLIDE 37

Probabilistic discriminative models

Making probabilistic predictions involves:

1. Computing the distribution of a latent variable corresponding to the test data (cf. regression):

   p(f∗|x∗, y, X) = ∫ p(f∗|x∗, f) p(f|y, X) df

2. Using this distribution to give a probabilistic prediction:

   P(y∗ = 1|x∗, y, X) = ∫ σ(f∗) p(f∗|x∗, y, X) df∗

Unfortunately, these integrals are analytically intractable, so approximations are needed.

SLIDE 38

Probabilistic discriminative models

Approximate methods for probabilistic classification include:

- The Laplace Approximation (LA): fastest, but less accurate.
- Expectation Propagation (EP): more accurate than the Laplace approximation, but slightly slower.
- MCMC methods: the “gold standard”, but very slow because lots of random samples must be drawn.

Nickisch, Hannes, and Carl Edward Rasmussen. “Approximations for Binary Gaussian Process Classification.” Journal of Machine Learning Research 9 (2008): 2035-2078.

SLIDE 39

Discriminative models for classification

t = σ(f(x∗)), where σ is some squashing function, e.g.:

- Logistic function (inverse of logit).
- Normal CDF (inverse of probit).
- Hinge loss (support vector machines).

[Figure: the logistic function and the inverse probit function (Normal CDF), σ(f∗) plotted against f∗.]

SLIDE 40

Discriminative models for classification: convexity

In practice, the hinge and logistic losses yield a convex estimation problem and are preferred.

min_w Σ_{i=1..n} L(yi, Xi, w) + λ R(w)      (M-estimators framework)

- L is the loss function (hinge, logistic, quadratic, ...)
- R is the regularizer (typically a norm on w)
- λ > 0 balances the two terms L and R
- Convex → unique minimizer (SVMs, ℓ2-logistic, ℓ1-logistic).
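A minimal sketch of the ℓ2- and ℓ1-regularized logistic losses, assuming scikit-learn's LogisticRegression, whose C parameter plays the role of 1/λ (dataset and values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
# Logistic loss L plus an l2 or l1 regularizer R; C is approximately 1/lambda.
l2 = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)
print((l1.coef_ != 0).sum(), "non-zero weights with the l1 penalty")
```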

SLIDE 41

Support vector classification

SVMs are reasonably fast, accurate and easy to tune (C = 10^3 is a good default, with no dramatic failure). Multi-class: one-versus-one, one-versus-all.
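A minimal sketch, assuming scikit-learn's SVC and the C = 10^3 value suggested above (the synthetic dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=200, random_state=0)   # large p, small n
svm = SVC(kernel="linear", C=1e3)
print(cross_val_score(svm, X, y, cv=5).mean())
```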

SLIDE 42

Ensemble learning

Combining predictions from weak learners.

- Bootstrap aggregating (bagging): train several weak classifiers, with different models or randomly drawn subsets of the data, and average their predictions with equal weight.
- Boosting: a family of approaches, where models are weighted according to their accuracy. AdaBoost is popular, but has problems with target noise.
- Bayesian model averaging: really a model selection method. Relatively ineffective for combining models.
- Bayesian model combination: shows promise.

Monteith, et al. “Turning Bayesian model averaging into Bayesian model combination.” Neural Networks (IJCNN), The 2011 International Joint Conference on. IEEE, 2011.

SLIDE 43

Boosting

Boosting sequentially reduces the bias of the combined estimator. Examples: AdaBoost, Gradient Tree Boosting, ...
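A minimal boosting sketch, assuming scikit-learn's AdaBoostClassifier and GradientBoostingClassifier on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
for model in (AdaBoostClassifier(), GradientBoostingClassifier()):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())
```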

SLIDE 44

Bagging

Build several estimators independently and average their predictions; this reduces the variance. Examples: bagging methods, forests of randomized trees, ...
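A minimal bagging sketch, assuming scikit-learn's BaggingClassifier over decision trees and RandomForestClassifier on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0)
for model in (bag, forest):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())
```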

SLIDE 45

Free Books

- The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani, Jerome Friedman (2009). http://statweb.stanford.edu/~tibs/ElemStatLearn/
- An Introduction to Statistical Learning with Applications in R. Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani (2013). http://www-bcf.usc.edu/%7Egareth/ISL/
- Introduction to Machine Learning. Amnon Shashua (2008). http://arxiv.org/pdf/0904.3664.pdf

SLIDE 46

Free Books

- Bayesian Reasoning and Machine Learning. David Barber (2014). http://www.cs.ucl.ac.uk/staff/d.barber/brml/
- Gaussian Processes for Machine Learning. Carl Edward Rasmussen and Christopher K. I. Williams (2006). http://www.gaussianprocess.org/gpml/chapters/
- Information Theory, Inference, and Learning Algorithms. David J.C. MacKay (2003). http://www.inference.phy.cam.ac.uk/itila/book.html

SLIDE 47

Web sites

- Kernel Machines: http://www.kernel-machines.org/
- The Gaussian Processes Web Site (includes links to software): http://www.gaussianprocess.org/
- SVM - Support Vector Machines (includes links to software): http://www.support-vector-machines.org/
- Pascal Video Lectures: http://videolectures.net/pascal

SLIDE 48

MATLAB tools

- Spider: object-oriented environment for machine learning in MATLAB.
- GPML: Gaussian processes for supervised learning.
- PRoNTo: MATLAB machine-learning toolbox for neuroimaging. GUI. Implements many ML concepts. Continuity with SPM.

SLIDE 49

Python tools

- Scikit-learn: generic ML in Python. Complete, high-quality, well-documented reference implementations.
- Nilearn: Python interface to scikit-learn for neuroimaging. Easy to use and install. Good visualization.
- PyMVPA: Python tool for ML. Advanced features (pipelines, hyperalignment).