MACHINE LEARNING Overview 1 1 APPLIED MACHINE LEARNING 2011-2012 - - PowerPoint PPT Presentation



SLIDE 1

APPLIED MACHINE LEARNING – 2011-2012


MACHINE LEARNING Overview

SLIDE 2


Exam Format

The exam lasts a total of 3 hours:

  • Upon entering the room, you must leave your bag, cell phone, etc., in a corner of the room; you are allowed to keep a couple of pens/pencils/erasers and a few blank sheets of paper.
  • The exam will be graded anonymously; make sure to have your Camipro card with you to write your SCIPER number on your exam sheet, as we will check your card.
  • The exam is closed book, but you can bring one A4 page with personal handwritten notes written recto-verso.

SLIDE 3


What to know for the exam

Formalism / Taxonomy:

  • You should be capable of giving formal definitions of a pdf, a marginal, and a likelihood.
  • You should know the difference between supervised and unsupervised learning and be able to give examples of algorithms in each case.

Principles of evaluation:

  • You should know the basic principles of evaluation of ML techniques: training vs. testing sets, cross-validation, ground truth.
  • You should know the principle of each method of evaluation seen in class and know which method of evaluation to apply where (F-measure in clustering vs. classification, BIC, etc.).

SLIDE 4


What to know for the exam

  • For each algorithm, be able to explain:
    – what it can do: classification, regression, structure discovery / reduction of dimensionality;
    – what one should be careful about (limitations of the algorithm, choice of hyperparameters) and how this choice influences the results;
    – the key steps of the algorithm, its hyperparameters, the variables it takes as input, and the variables it outputs.

SLIDE 5


What to know for the exam

  • For each algorithm, be able to explain:
    – what it can do. Example for SVM: performs binary classification; can be extended to multi-class classification; can be extended to regression (SVR);
    – what one should be careful about (limitations of the algorithm, choice of hyperparameters). Example for SVM: the choice of kernel; too small a kernel width in Gaussian kernels may lead to over-fitting;
    – the key steps of the algorithm, its hyperparameters, the variables it takes as input, and the variables it outputs.

SLIDE 6


Class Overview

This overview is meant to highlight similarities and differences across the different methods presented in class. To be well prepared for the exam, read carefully the slides, the exercises, and their solutions.

SLIDE 7


Class Overview

This class has presented groups of methods for structure discovery, classification and non-linear regression.

  Classification:       SVM, GMM + Bayes
  Regression:           SVR, GMR
  Structure Discovery:  PCA & clustering techniques (K-means, soft K-means, GMM)

SLIDE 8


Overview: Finding Structure in Data

Techniques for finding structure in data proceed by projecting or grouping the data from the original space into another space of lower dimension. The projected space is chosen so as to highlight particular features common to subsets of datapoints. This is often a pre-processing step: the structure found may be exploited in a second stage by another algorithm for regression, classification, etc.

SLIDE 9


Overview: Finding Structure in Data

  • Determines what is most common across datapoints.
  • Projects onto the axes that maximize variance (the eigenvectors of the covariance matrix).
  • The lower dimensions may allow one to discriminate across subgroups of datapoints.
  • Discards the dimensions with the smallest eigenvalues.

Principal Component Analysis (PCA)

Projection:  y = A x,  with x ∈ R^N, y ∈ R^q, q ≤ N
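The projection above can be sketched in a few lines of numpy. This is a minimal illustration under the slide's recipe (center, eigendecompose the covariance, keep the q largest-eigenvalue axes); the function name `pca_project` and the toy data are made up for the example.

```python
import numpy as np

def pca_project(X, q):
    """Project N-dimensional datapoints onto the q principal axes.

    X: (n_samples, N) data matrix. Returns (Y, A) with Y = X_centered @ A.T,
    where the rows of A are the eigenvectors of the covariance matrix with
    the q largest eigenvalues (smallest-eigenvalue dimensions discarded).
    """
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:q]   # indices of the q largest
    A = eigvecs[:, order].T                 # q x N projection matrix
    return Xc @ A.T, A

# Toy data: 3-D points that mostly vary along one direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1)) @ np.array([[2.0, 1.0, 0.5]]) \
    + 0.05 * rng.normal(size=(100, 3))
Y, A = pca_project(X, q=2)
print(Y.shape)   # (100, 2)
```

The first projected coordinate carries the largest variance, which is exactly the "most common across datapoints" direction of the slide.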

SLIDE 10


Clustering Methods

All three clustering methods we have seen in class (K-means, soft K-means, GMM) are solved through E-M (expectation-maximization). You should be able to spell out the similarities and differences across K-means, soft K-means and GMM:

  • They are similar in their representation of the problem and in their optimization method.
  • They differ in the number of parameters to estimate, the number of hyper-parameters, etc.

Overview: Finding Structure in Data
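The E-M alternation common to these methods can be sketched in plain Python with the simplest of the three, K-means. This is an illustrative toy on 1-D data, not the course implementation; the helper name `kmeans_1d` and the data are assumptions of the example.

```python
def kmeans_1d(points, k, centers, n_iter=20):
    """Plain K-means on 1-D data via alternating E and M steps.

    E-step: assign each point to its nearest center (hard responsibility).
    M-step: move each center to the mean of its assigned points.
    `centers` is the initial guess; returns the refined centers, sorted.
    """
    for _ in range(n_iter):
        # E-step: hard assignment to the closest center.
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[j].append(x)
        # M-step: recompute each center as the mean of its cluster
        # (keep the old center if a cluster ends up empty).
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)

data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
print(kmeans_1d(data, k=2, centers=[0.0, 6.0]))   # centers near 1.0 and 5.0
```

Soft K-means replaces the hard E-step with graded responsibilities, and GMM additionally re-estimates covariances and priors in the M-step; the alternating structure is the same.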

SLIDE 11


Clustering Methods and Metric of Similarity

All clustering methods depend on choosing a good metric of similarity to measure how similar subgroups of datapoints are. You should be able to list which metric of similarity can be used in each case and how this choice may impact the clustering.

Overview: Finding Structure in Data

  • K-means: Lp-norm.
  • Soft K-means: exponentially decreasing function modulated by the stiffness (≈ an isotropic RBF, i.e. an unnormalized Gaussian function).
  • GMM: likelihood of each Gaussian function; can use isotropic, diagonal, or full covariance matrices.
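The first two choices can be contrasted in a short sketch (illustrative names and stiffness value are assumptions of the example; the GMM case would replace these weights by the likelihood of each Gaussian, with learned covariances).

```python
import math

def kmeans_resp(x, centers):
    """Hard responsibility: 1 for the closest center (L2 norm), else 0."""
    j = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
    return [1.0 if i == j else 0.0 for i in range(len(centers))]

def soft_kmeans_resp(x, centers, beta):
    """Soft responsibility: exponentially decreasing in the squared
    distance, modulated by the stiffness beta (an unnormalized isotropic
    Gaussian, normalized across centers)."""
    w = [math.exp(-beta * (x - c) ** 2) for c in centers]
    s = sum(w)
    return [wi / s for wi in w]

centers = [0.0, 4.0]
print(kmeans_resp(1.0, centers))                 # [1.0, 0.0]
print(soft_kmeans_resp(1.0, centers, beta=0.5))  # closer center gets most weight
```

As beta grows, the soft responsibilities sharpen toward the hard K-means assignment; this is one concrete way the metric choice impacts the clustering.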

SLIDE 12


Clustering versus Classification

Fundamental difference between clustering and classification:

  • Clustering is unsupervised classification
  • Classification is supervised classification

Both use the F-measure, but not in the same way. The clustering F-measure assumes a semi-supervised setting, in which only a subset of the points is labelled.

SLIDE 13


Semi-Supervised Learning

Clustering F1-Measure:

(careful: similar but not the same F-measure as the F-measure we will see for classification!)

Tradeoff between clustering correctly all datapoints of the same class in the same cluster and making sure that each cluster contains points of only one class.

Picks, for each class, the cluster with the maximal F1 measure.
Recall: proportion of datapoints correctly clustered.
Precision: proportion of datapoints of the same class in the cluster.

  F(C, K) = Σ_{c_i ∈ C} (|c_i| / M) · max_k F(c_i, k)
  F(c_i, k) = 2 · R(c_i, k) · P(c_i, k) / (R(c_i, k) + P(c_i, k))
  R(c_i, k) = n_ik / |c_i|
  P(c_i, k) = n_ik / n_k

with M the number of labeled datapoints, C the set of classes, K the number of clusters, n_ik the number of datapoints belonging to both class c_i and cluster k, and n_k the size of cluster k. The weight |c_i| / M penalizes each class by its fraction of the labeled points.
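The formula above can be turned into a small sketch (the (class, cluster) pair-list representation and the helper name `clustering_f1` are choices of this example, not course code).

```python
def clustering_f1(pairs, classes, ks):
    """Clustering F1: for each class, pick the cluster with maximal F1,
    then average, weighting each class by its fraction of labeled points.

    pairs: list of (class, cluster) tuples for the labeled datapoints.
    """
    M = len(pairs)
    total = 0.0
    for c in classes:
        in_class = [p for p in pairs if p[0] == c]        # class c_i
        best = 0.0
        for k in ks:
            n_ik = sum(1 for p in in_class if p[1] == k)  # c_i in cluster k
            n_k = sum(1 for p in pairs if p[1] == k)      # size of cluster k
            if n_ik == 0:
                continue
            recall = n_ik / len(in_class)     # R(c_i, k) = n_ik / |c_i|
            precision = n_ik / n_k            # P(c_i, k) = n_ik / n_k
            best = max(best, 2 * recall * precision / (recall + precision))
        total += (len(in_class) / M) * best   # weight by |c_i| / M
    return total

# Perfect clustering of two labeled classes -> F1 = 1.0
pairs = [("a", 0), ("a", 0), ("b", 1), ("b", 1)]
print(clustering_f1(pairs, classes=["a", "b"], ks=[0, 1]))
```

Merging both classes into one cluster would drop precision and pull the score below 1, illustrating the tradeoff stated above.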

SLIDE 14


True Positives (TP): number of datapoints of class 1 that are correctly classified.
False Negatives (FN): number of datapoints of class 1 that are incorrectly classified.
False Positives (FP): number of datapoints of class 2 that are incorrectly classified.

  Recall = TP / (TP + FN)
  Precision = TP / (TP + FP)
  F = 2 · Precision · Recall / (Precision + Recall)

Classification F-Measure:

(careful: similar but not the same F-measure as the F-measure we saw for clustering!)

Tradeoff between classifying correctly all datapoints of the same class and making sure that each class contains points of only one class. Recall: Proportion of datapoints correctly classified in Class 1 Precision: proportion of datapoints of class 1 correctly classified over all datapoints classified in class 1

Performance Measures
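As a quick sketch of these formulas (the counts are illustrative, not from any dataset in the course):

```python
def f_measure(tp, fn, fp):
    """Classification F-measure from TP/FN/FP counts.

    tp: class-1 points correctly classified; fn: class-1 points
    misclassified; fp: class-2 points misclassified as class 1.
    """
    recall = tp / (tp + fn)          # Recall = TP / (TP + FN)
    precision = tp / (tp + fp)       # Precision = TP / (TP + FP)
    return 2 * precision * recall / (precision + recall)

print(f_measure(tp=6, fn=2, fp=2))   # 0.75
```

With tp=6, fn=2, fp=2, both recall and precision are 0.75, so their harmonic mean is also 0.75.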

SLIDE 15


Overview: Classification

GMM + Bayes vs. SVM: a non-linear boundary in both cases. Compare the number of parameters required for the same fit on the original two-class data: GMM + Bayes uses 1 Gaussian function per class (but with a full covariance matrix), while the SVM uses 7 support vectors.

SLIDE 16


Kernel Methods

We have seen two examples of kernel methods with SVM/SVR. Kernel methods implicitly search for structure in the data prior to performing another computation (classification or regression).

  • The kernel allows one to extract non-linear types of correlations.
  • These methods exploit the Kernel Trick: all linear methods for finding structure in data are based on computing an inner product across variables. This inner product can be replaced by the kernel function, if known. The problem then becomes linear in feature space.

  k : X × X → R,   k(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩

Metric of similarity across datapoints
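A minimal sketch of a kernel as a metric of similarity, using the Gaussian (RBF) kernel; the helper names and the toy points are assumptions of the example.

```python
import math

def rbf_kernel(x, y, width):
    """Gaussian (RBF) kernel: k(x, y) = exp(-||x - y||^2 / (2 * width^2)).
    Plays the role of the inner product <phi(x), phi(y)> in an implicit
    feature space."""
    sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq / (2 * width ** 2))

def gram_matrix(points, width):
    """Gram matrix K[i][j] = k(x_i, x_j): the only quantity a kernel
    method needs -- the feature map phi is never computed explicitly."""
    return [[rbf_kernel(a, b, width) for b in points] for a in points]

X = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
K = gram_matrix(X, width=1.0)
print(round(K[0][1], 4))   # exp(-0.5) ~ 0.6065
print(K[0][0])             # 1.0: every point is maximally similar to itself
```

Note how the kernel width controls how quickly similarity decays with distance: a very small width makes every point similar only to itself, which is the over-fitting regime mentioned earlier for Gaussian-kernel SVMs.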

SLIDE 17


Overview: Regression Techniques

SVR and GMR lead to a regressive model that computes a weighted combination of local predictors.

For a query point x, predict the associated output y:

  • In SVR, the computation reduces to summing only over the support vectors (a subset of the datapoints).
  • In GMR, the sum is over the set of Gaussians; the centers of the Gaussians are usually not located on any particular datapoint. The models m_i(x) are local!

  SVR solution:  y = Σ_{i=1}^{M} α*_i · k(x, x_i) + b

  GMR solution:  y = Σ_{i=1}^{K} β_i(x) · m_i(x)
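The SVR solution is just a weighted sum of kernel evaluations over the support vectors plus a bias. A sketch with hypothetical (not learned) alphas and bias, assuming a Gaussian kernel:

```python
import math

def svr_predict(x, support_vectors, alphas, b, width=1.0):
    """SVR-style prediction: y = sum_i alpha_i * k(x, x_i) + b,
    summing only over the support vectors. The alphas and b here are
    illustrative placeholders, not the result of an optimization."""
    k = lambda a, c: math.exp(-((a - c) ** 2) / (2 * width ** 2))
    return sum(a_i * k(x, x_i)
               for a_i, x_i in zip(alphas, support_vectors)) + b

# Hypothetical solution with two support vectors.
y = svr_predict(x=0.0, support_vectors=[0.0, 2.0],
                alphas=[1.0, -0.5], b=0.1)
print(round(y, 4))
```

The GMR solution has the same weighted-combination shape, but the weights β_i(x) and local models m_i(x) come from the Gaussians rather than from datapoints.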

SLIDE 18


Overview: Regression Techniques

SVR and GMR lead to the following regressive model (here: a GMR fit with 8 Gaussian functions, full covariance matrices):

  GMR solution:  y = Σ_{i=1}^{K} β_i(x) · m_i(x)

SLIDE 19


Overview: Regression Techniques

SVR and GMR lead to the following regressive model (here: an SVR fit with 27 support vectors):

  SVR solution:  y = Σ_{i=1}^{M} α*_i · k(x, x_i) + b

SLIDE 20


SVR and GMR are based on the same probabilistic regressive model, but do not optimize the same objective function.

  • SVR:
    – minimizes the reconstruction error through convex optimization;
    – finds a number of local models ≤ the number of datapoints (the support vectors).
  • GMR:
    – learns p(x, y) through likelihood maximization and then computes p(y|x);
    – starts with a low number of models ≪ the number of datapoints.

Overview: Regression Techniques
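The GMR step "learn p(x, y), then compute p(y|x)" can be made concrete for a single Gaussian: the local model m(x) is the conditional mean of the joint Gaussian. A scalar sketch with illustrative (hypothetical) parameters:

```python
def conditional_mean(x, mu_x, mu_y, s_xx, s_xy):
    """Conditional mean of y given x for one joint Gaussian p(x, y):
    E[y|x] = mu_y + (s_xy / s_xx) * (x - mu_x).
    In GMR, each Gaussian contributes one such local model m_i(x),
    mixed with the weights beta_i(x)."""
    return mu_y + (s_xy / s_xx) * (x - mu_x)

# Joint Gaussian fitted (hypothetically) to data lying near y = 2x + 1.
print(conditional_mean(x=3.0, mu_x=0.0, mu_y=1.0, s_xx=1.0, s_xy=2.0))  # 7.0
```

Each local model is thus a linear regressor valid around its Gaussian's center, which is why the overall GMR prediction is a locally weighted combination.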

SLIDE 21


Good Bye!

This course covered a variety of topics that are core to Machine Learning. It gives you the basis to go and read about recent advances in each of these topics. We hope that you will find this material useful and that you will use some of these algorithms in the future.

If you do so, drop us a note, and we would be glad to include your application in future lectures as an example!

Merry Christmas & Happy New Year!