PRLab TUDelft NL | PATTERN RECOGNITION & MACHINE LEARNING: An Introduction (PowerPoint PPT Presentation)
Marco Loog, Pattern Recognition Laboratory, Delft University of Technology


SLIDE 1

SLIDE 2

PATTERN RECOGNITION & MACHINE LEARNING

An Introduction Marco Loog Pattern Recognition Laboratory Delft University of Technology

SLIDE 3

What These Lectures Will Cover

Intro to supervised learning and classification
Semi-supervised learning
Multiple instance learning
Active learning
Transfer learning, domain adaptation, etc.

General theme is partially supervised learning
General focus is on methods and concepts

SLIDE 4

Lab & Final Assignment

Last day : computer lab

You can work on your “final assignment”

Like the rest of this course, attending is not mandatory
It is eight [8!] hours… which I find somewhat long
Most of the work you can also do, say, at home
I guess we still have to find the right way to go about this…

More about the actual assignment on a later slide…

SLIDE 5

Supervised Learning

Aims to find solutions to difficult decision, assignment, classification, and prediction problems
Automation is often an issue, possibly implicitly

Have computers and robots do tasks that are dangerous, tedious, boring, etc.

Part of the point : humans are often severely biased in judgment and very inaccurate

You cannot even trust your own eyes!

SLIDE 6

SLIDE 7

Comparing and …

SLIDE 8

Supervised Learning = Not Modeling

Fruits orange, reddish-green to yellowish-green, round, 4–12 cm, consist of a leathery peel, 6 mm thick, tightly adherent, protecting the juicy inner pulp, which is divided into segments that may not contain seeds, depending on the cultivar…

SLIDE 9

Orange Modeling

Difficult, a hassle, overly ambitious, inaccurate,…
Captures typicalities

SLIDE 10

Supervised Learning

…is learning by example
Given input and associated output, determine the input–output mapping

Mapping should be able to generalize to new and previously unseen examples

SLIDE 11

Remark

In the end, one never relies solely on the extremes of pure modeling or pure model-free learning
Learning does use models, though they are weak and nonspecific

SLIDE 12

Restricted Setting : Classification

Sought-after mapping outputs a discrete label, category, or class membership

A, B, C,… Orange, apple, banana… Benign, malignant,… Present, absent,…

Many relevant decision problems can be formulated as such

SLIDE 13

Standard Approach

Training phase

Collect example objects
Measure features of choice and represent them in a vector space
Chop up the feature space and assign every part a class label

Test phase

Extract the same features from the new object
Look in which part of feature space it ends up
Assign the label of the corresponding part to the object
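As a concrete sketch of these two phases, here is a minimal nearest-mean classifier in Python; the feature values for the two fruit classes are invented for illustration, not taken from the slides:

```python
import numpy as np

# Toy training set: objects represented by two features [redness, weight in grams].
# Class 0 = apple-like, class 1 = orange-like (invented values, for illustration only).
X_train = np.array([[0.9, 150.0], [0.8, 140.0],   # red and lighter
                    [0.3, 200.0], [0.2, 210.0]])  # less red and heavier
y_train = np.array([0, 0, 1, 1])

def fit_nearest_mean(X, y):
    """Training phase: summarize each class region by its mean feature vector."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_nearest_mean(means, x):
    """Test phase: extract the same features and assign the label of the closest mean."""
    return min(means, key=lambda c: np.linalg.norm(x - means[c]))

means = fit_nearest_mean(X_train, y_train)
print(predict_nearest_mean(means, np.array([0.85, 145.0])))  # prints 0 (apple-like)
```

The class means implicitly chop the feature space into two half-planes, the simplest instance of the "assign every part a class label" idea.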

SLIDE 14

[Figure : examples in the redness vs. weight feature space]

SLIDE 15

[Figure : examples in the redness vs. weight feature space]

SLIDE 16

[Figure : examples in the redness vs. weight feature space]

SLIDE 17

[Figure : new object in the redness vs. weight feature space, label?]

SLIDE 18

More Realistic Problems…

Manual construction becomes difficult when the dimensionality is > 3
Formulate classifier building as a “fitting” problem that can be automated

Learning algorithm

Ingredients :

What functions / mappings to fit : the hypothesis class?
What defines a good fit : the loss / risk function?
How do we find the optimal fit?
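To make the three ingredients concrete, a sketch in Python: hypothesis class = linear functions, loss = logistic loss, optimizer = plain gradient descent. The synthetic data and all parameter choices here are mine, not the slides':

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian classes in 2-D, labels in {-1, +1} (synthetic data for illustration).
n = 100
X = np.vstack([rng.normal(-1.0, 1.0, (n, 2)), rng.normal(1.0, 1.0, (n, 2))])
y = np.hstack([-np.ones(n), np.ones(n)])

# Ingredient 1, hypothesis class: linear functions h(x) = w.x + b
w, b = np.zeros(2), 0.0

# Ingredients 2 and 3: logistic loss log(1 + exp(-y h(x))), fit by gradient descent.
lr = 0.1
for _ in range(200):
    margin = y * (X @ w + b)
    g = -y / (1.0 + np.exp(margin))          # derivative of the loss w.r.t. h(x)
    w -= lr * (g[:, None] * X).mean(axis=0)
    b -= lr * g.mean()

train_err = np.mean(np.sign(X @ w + b) != y)
print(train_err < 0.2)  # well-separated classes, so the training error is small
```

Swapping any one ingredient (hinge loss, a quadratic hypothesis class, a different optimizer) gives a different learning algorithm from the same template.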

SLIDE 19

Also : Learning is Ill-Posed

SLIDE 20

A General Challenge

One of the challenges [in research and applications] is how to pick out the classifier that generalizes best to unseen data

How to do accurate induction
Important tradeoff : complexity of the decision boundary versus accuracy on the training examples

Key issue : how to tell how well a classifier works on infinite amounts of unseen data based on a finite sample?

SLIDE 21

A Note on Research

The purpose of PR [and ML] research is not only to construct classification routines but, in addition, to understand these routines and to obtain insight into their behavior, pros and cons, etc.
Ultimately, it should lead to understanding the learning problem as such
And no, it is not about getting the best classification performance or achieving “state of the art”!

SLIDE 22

Mathematics versus Empiricism

Can’t we just all solve it mathematically?

After all, we can write down our objective function :

Some major problems

Finite sample : we do not know P(x, y)
One's math skills might be limited

h* = argmin_h E_(x,y)~P [ I( h(x) ≠ y ) ]
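The finite-sample problem is easy to see on synthetic data, where, unlike in practice, we do know the distribution. The two Gaussian classes and the fixed threshold rule below are my choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two 1-D Gaussian classes; for this construction the rule sign(x) has a
# known true error of Phi(-1), roughly 0.159.
def sample(n_per_class):
    X = np.hstack([rng.normal(-1.0, 1.0, n_per_class),
                   rng.normal(1.0, 1.0, n_per_class)])
    y = np.hstack([-np.ones(n_per_class), np.ones(n_per_class)])
    return X, y

# Each small sample gives a different estimate of that one true risk.
estimates = []
for _ in range(5):
    X, y = sample(25)
    estimates.append(np.mean(np.sign(X) != y))
print(estimates)  # the estimates scatter around the true risk of about 0.159
```

In a real application only one such finite sample is available, and the true P(x, y) behind it is unknown.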

SLIDE 23

Mathematics versus Empiricism

Luckily [applied?] computer science is an empirical discipline with programs as its experiments

We can just build classifiers and see what happens
Use of artificial and real-world data

So we can ditch the math?

SLIDE 24

Still : Insight Please?

Yes, this remains, all in all, difficult…
Firstly : hold on to current knowledge

Generally, there is no such thing as the overall best classifier
Classifiers should be studied relative to one or more [families of] data sets / examples

Parameters of major influence : sample size and dimensionality

SLIDE 25

Still : Insight Please?

Secondly : ask yourself “obvious” questions [and try to answer them]

Why does my approach work better / worse?
Can I come up with examples in which one approach is always better than the other?
Do I understand why that happens in this case?
Can I say more than “experiment X gives outcome Y”?

Trivial? Sure…

SLIDE 26

Lab & Final Assignment

The idea : implement [or take] two or three methods and do a basic comparison

To each other
To the standard benchmark [e.g. supervised classifier or random sampling]

In particular

Find data sets [artificial or real world] on which the methods outperform the standard benchmark

Find a data set for which one method outperforms Method Ω and vice versa
Explain your reasoning, constructions, and findings!

SLIDE 27

How To?

Empirical risk ⇒ smooth ⇒ make convex

h* = argmin_h (1/N) Σ_i I( h(x_i) ≠ y_i )
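A small numerical sketch of what "make it smooth and convex" means: compare the 0-1 loss with two standard surrogates as a function of the margin y·h(x). The margin values are arbitrary, chosen for illustration:

```python
import numpy as np

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # margin = y * h(x)

# The loss we actually care about: 1 for a misclassification, 0 otherwise.
zero_one = (margins <= 0).astype(float)

# Convex surrogates that are easier to optimize
# (the hinge also upper-bounds the 0-1 loss):
hinge    = np.maximum(0.0, 1.0 - margins)  # SVM
logistic = np.log1p(np.exp(-margins))      # logistic regression (smooth as well)

print(zero_one)  # values: 1, 1, 1, 0, 0
print(hinge)     # values: 3, 1.5, 1, 0.5, 0
```

Unlike the step-shaped 0-1 loss, both surrogates decrease smoothly with the margin, so gradient-based optimization has something to work with.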

SLIDE 28

Surrogate Losses…

SLIDE 29

Regularization

Regularized empirical risk : typically controls the complexity / smoothness of the solution

argmin_h (1/N) Σ_i ℓ( y_i, h(x_i) ) + λ Ω(h)

Ubiquitous example

Take h(x) = w·x, the class of linear classifiers
Take Ω(h) = ‖w‖²
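The ubiquitous example, a linear hypothesis class with a squared-norm regularizer, has a closed-form solution when combined with the squared loss. A sketch with invented data and an arbitrary λ (including, for simplicity, the bias weight in the regularizer):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Gaussian classes, labels in {-1, +1}; bias absorbed via a constant feature.
n = 50
X = np.vstack([rng.normal(-1.0, 1.0, (n, 2)), rng.normal(1.0, 1.0, (n, 2))])
X = np.hstack([X, np.ones((2 * n, 1))])
y = np.hstack([-np.ones(n), np.ones(n)])

# Regularized least squares: argmin_w sum_i (y_i - w.x_i)^2 + lam * ||w||^2
# has the closed form w = (X'X + lam * I)^{-1} X'y (lam chosen arbitrarily here).
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

train_err = np.mean(np.sign(X @ w) != y)
print(train_err < 0.2)  # easy problem, so the regularized fit classifies well
```

Larger λ shrinks ‖w‖ and hence the "steepness" of the linear function, which is exactly the complexity control the slide describes.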

SLIDE 30

Loss, Hypothesis Class, Regularizer?

Yes

LDA, QDA, NMC, SVM, least squares classifier [a.k.a. least squares SVM, Fisher classifier], logistic regression, neural nets, lasso

No?

kNN, random forest, AdaBoost, Parzen classifier
Some parameters are tuned differently than others

N.B. no classifier really minimizes the empirical 0-1 loss directly…

SLIDE 31

The Dipping Phenomenon

A consequence of the use of surrogate losses

Meant as a warning : be aware of what you optimize!
Meant as an example of PR research

But first we need learning curves…

SLIDE 32

Learning Curves

Tool to study the behavior of classifiers over a varying number of training examples and to compare two or more classifiers

[Figure : learning curve, error rate vs. number of training examples]
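A learning curve can be estimated by repeatedly training at each sample size and averaging the test error. A sketch using the nearest-mean classifier on synthetic Gaussian data (the problem, sample sizes, and repetition count are all my choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n_per_class):
    """Two Gaussian classes in 2-D (synthetic problem for illustration)."""
    X = np.vstack([rng.normal(-1.0, 1.0, (n_per_class, 2)),
                   rng.normal(1.0, 1.0, (n_per_class, 2))])
    y = np.hstack([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y

X_test, y_test = sample(2000)  # large test set approximates the true error

def nearest_mean_error(n_train, reps=30):
    """Average test error of the nearest-mean classifier at one sample size."""
    errs = []
    for _ in range(reps):
        X, y = sample(n_train)
        m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
        pred = (np.linalg.norm(X_test - m1, axis=1)
                < np.linalg.norm(X_test - m0, axis=1)).astype(float)
        errs.append(np.mean(pred != y_test))
    return float(np.mean(errs))

curve = [nearest_mean_error(n) for n in (2, 8, 32)]
print(curve[0] > curve[-1])  # error rate drops as training size grows
```

Averaging over repetitions matters: a single run at each size gives a noisy curve from which little can be concluded.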

SLIDE 33

Expected Behavior

Monotonic decrease of the learning curve [at least on average]

[Figure : monotonically decreasing learning curve, error rate vs. number of training examples]

SLIDE 34

Well… There is Peaking

Independently described in 1995 by both Opper and Duin

[Figure : peaking learning curve, error rate vs. number of training examples]

SLIDE 35

New Hypothesis?

Can we guarantee that, in expectation, the best performance, for a particular classifier on a particular problem, is achieved when the sample size is infinite?

SLIDE 36

The Dipping Phenomenon

Can we guarantee that, in expectation, the best performance, for a particular classifier on a particular problem, is achieved when the sample size is infinite?
No, we cannot…
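The following is not the construction from the slides or from Loog & Duin, but a minimal 1-D illustration of the underlying mechanism: minimizing a surrogate (here squared) loss can pick a worse linear decision rule than is available on the very same data.

```python
import numpy as np

# 1-D data, labels in {-1, +1}. The point at x = 20 is correctly labeled and
# linearly separable, yet it drags the least-squares fit (invented example).
x = np.array([-1.0, 0.0, 1.0, 20.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])

# Fit h(x) = w x + b by minimizing the squared surrogate loss.
A = np.vstack([x, np.ones_like(x)]).T
w, b = np.linalg.lstsq(A, y, rcond=None)[0]

surrogate_err = np.mean(np.sign(w * x + b) != y)  # 0-1 error of the LS fit
threshold_err = np.mean(np.sign(x - 0.5) != y)    # a simple threshold at 0.5

print(surrogate_err, threshold_err)  # 0.25 0.0: the surrogate minimizer loses
```

Because the surrogate minimizer need not converge to the best 0-1 rule, more data can pull it further toward the wrong solution, which is how dipping learning curves arise.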

SLIDE 37

Basic Dipping : Linear Classifiers

SLIDE 38

Basic Dipping : Linear Classifiers

SLIDE 39

Basic Dipping : Linear Classifiers

SLIDE 40

Basic Dipping : Linear Classifiers

SLIDE 41

SLIDE 42

Reading Material and References

  • Bartlett, Jordan, McAuliffe, “Convexity, classification, and risk bounds,” JASA, 2006
  • Ben-David, Loker, Srebro, Sridharan, “Minimizing the misclassification error rate using a surrogate convex loss,” ICML, 2012
  • Bishop, “Pattern recognition and machine learning”, Springer, 2006
  • Duda, Hart, Stork, “Pattern Classification”, John Wiley & Sons, 2000
  • Hastie, Tibshirani, Friedman, “The Elements of Statistical Learning: Data Mining, Inference, and Prediction”, Springer Verlag, 2001
  • Jain, Duin, Mao, “Statistical pattern recognition: A review”, IEEE TPAMI, 2000
  • Loog, Duin, “The dipping phenomenon,” Structural, Syntactic, and Statistical Pattern Recognition (S+SSPR), LNCS 7626, 2012
  • McLachlan, “Discriminant analysis and statistical pattern recognition”, John Wiley & Sons, 1992
  • Poggio, Smale, “The mathematics of learning: Dealing with data”, Notices of the AMS, 2003
  • Reid, Williamson, “Composite binary losses,” JMLR, 2010
  • Reid, Williamson, “Information, divergence and risk for binary experiments,” JMLR, 2011
  • Rifkin, Yeo, Poggio, “Regularized least-squares classification”, Nato Science Series 190, 2003
  • Ripley, “Pattern recognition and neural networks”, Cambridge University Press, 1996
  • Zhang, “Statistical behavior and consistency of classification methods based on convex risk minimization,” Annals of Statistics, 2004