SLIDE 1

Introduction to Machine Learning

Yifeng Tao, School of Computer Science, Carnegie Mellon University

SLIDE 2

Logistics

  • Course website: http://www.cs.cmu.edu/~yifengt/courses/machine-learning
  • Slides uploaded after lecture

  • Time: Mon-Fri, 9:50-11:30 am lecture; 11:30 am-12:00 pm discussion
  • Contact: yifengt@cs.cmu.edu

SLIDE 3

What is machine learning?

  • What are we talking about when we talk about AI and ML?

[Diagram: nested fields, deep learning ⊂ machine learning ⊂ artificial intelligence]

SLIDE 4

What is machine learning?

[Diagram: foundations (probability, statistics, calculus, linear algebra) feed into machine learning, which powers applications (computer vision, natural language processing, computational biology)]

SLIDE 5

Where are we?

  • Supervised learning: linear models
  • Kernel machines: SVMs and duality
  • Unsupervised learning: latent space analysis and clustering
  • Supervised learning: decision trees, kNN and model selection
  • Learning theory: generalization and VC dimension
  • Neural network (basics)
  • Deep learning in CV and NLP
  • Probabilistic graphical models
  • Reinforcement learning and its application in clinical text mining
  • Attention mechanism and transfer learning in precision medicine

SLIDE 6

What’s more after the introduction?

[Diagram: machine learning branches into follow-up areas: probabilistic graphical models, deep learning, optimization, learning theory]

SLIDE 7

What’s more after the introduction?

  • Supervised learning: linear models
  • Kernel machines: SVMs and duality
  • → Optimization
  • Unsupervised learning: latent space analysis and clustering
  • Supervised learning: decision trees, kNN and model selection
  • Learning theory: generalization and VC dimension
  • → Statistical machine learning
  • Neural network (basics)
  • Deep learning in CV and NLP
  • → Deep learning
  • Probabilistic graphical models

SLIDE 8

Curriculum for an ML Master’s/Ph.D. student at CMU

  • 10701 Introduction to Machine Learning:
  • http://www.cs.cmu.edu/~epxing/Class/10701/
  • 36705 Intermediate Statistics:
  • http://www.stat.cmu.edu/~larry/=stat705/
  • 36708 Statistical Machine Learning:
  • http://www.stat.cmu.edu/~larry/=sml/
  • 10725 Convex Optimization:
  • http://www.stat.cmu.edu/~ryantibs/convexopt/
  • 10708 Probabilistic Graphical Models:
  • http://www.cs.cmu.edu/~epxing/Class/10708-17/
  • 10707 Deep Learning:
  • https://deeplearning-cmu-10707.github.io/
  • Books:
  • Bishop. Pattern Recognition and Machine Learning
  • Goodfellow et al. Deep Learning

SLIDE 9

Neural network (basics)

Yifeng Tao, School of Computer Science, Carnegie Mellon University
Slides adapted from Eric Xing, Maria-Florina Balcan, Russ Salakhutdinov, Matt Gormley

SLIDE 10

A Recipe for Supervised Learning

  • 1. Given training data (input-label pairs)
  • 2. Choose each of these:
  • Decision function
  • Loss function
  • 3. Define the goal and train with SGD (take small steps opposite the gradient), as sketched below
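The equations on this slide are images that do not survive extraction; a standard way to write the recipe (our reconstruction, in notation close to Gormley’s course, not necessarily the slide’s exact formulas) is:

    % 1. training data
    \mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}
    % 2. decision function and loss function
    \hat{y} = f_\theta(x), \qquad \ell(\hat{y}, y) \in \mathbb{R}
    % 3. goal, trained with SGD
    \theta^* = \arg\min_\theta \sum_{i=1}^{N} \ell\big(f_\theta(x^{(i)}), y^{(i)}\big), \qquad
    \theta \leftarrow \theta - \eta \, \nabla_\theta \, \ell\big(f_\theta(x^{(i)}), y^{(i)}\big)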

[Slide from Matt Gormley et al.]

SLIDE 11

Logistic Regression

  • The prediction rule (see the sketch below)
  • In this case, learning P(y|x) amounts to learning the conditional probability over two Gaussian distributions.
  • Limitation: it captures only simple data distributions.
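The rule itself is an image on the slide; for standard binary logistic regression it reads:

    P(y = 1 \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}},
    \qquad \text{predict } y = 1 \text{ iff } P(y = 1 \mid x) \ge 1/2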

[Slide from Eric Xing et al.]

SLIDE 12

Learning highly non-linear functions

  • f: X → y
  • f might be a non-linear function
  • X: continuous or discrete variables
  • y: continuous or discrete variables

[Slide from Eric Xing et al.]

SLIDE 13

From biological neuron networks to artificial neural networks

  • Signals propagate through neurons in the brain.
  • Signals propagate through perceptrons in an artificial neural network.

[Slide from Eric Xing et al.]

SLIDE 14

Perceptron Algorithm and SVM

  • Perceptron: a simple learning algorithm for supervised classification, analyzed via geometric margins in the 50’s [Rosenblatt ’57].
  • Similar to the SVM, it is a linear classifier based on an analysis of margins.
  • Originally introduced in the online learning scenario.
  • The online learning model
  • Its guarantees under large margins

[Slide from Maria-Florina Balcan et al.]

SLIDE 15

The Online Learning Algorithm

  • Examples arrive sequentially.
  • We need to make a prediction.
  • Afterwards, we observe the outcome.
  • For i = 1, 2, ...:
  • Applications:
  • Email classification
  • Recommendation systems
  • Ad placement in a new market

[Slide from Maria-Florina Balcan et al.]

SLIDE 16

Linear Separators: Perceptron Algorithm

  • h(x) = wᵀx + w₀: if h(x) ≥ 0, label x as +, otherwise label it as −.
  • Set t = 1, start with the all-zero vector w_1.
  • Given example x, predict positive iff w_tᵀx ≥ 0.
  • On a mistake, update as follows:
  • Mistake on a positive example: w_{t+1} ← w_t + x
  • Mistake on a negative example: w_{t+1} ← w_t − x
  • A natural greedy procedure: if the true label of x is +1 and w_t is incorrect on x, we have w_tᵀx < 0; after the update, w_{t+1}ᵀx = w_tᵀx + xᵀx = w_tᵀx + ‖x‖², so there is more chance that w_{t+1} classifies x correctly (a Python sketch follows below).
  • Similarly for mistakes on negative examples.
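A minimal runnable sketch of this algorithm (our own Python, assuming labels in {+1, −1} and the bias w₀ folded into w via a constant feature; the function name is ours, not the slide’s):

    import numpy as np

    def perceptron_train(X, y, epochs=10):
        """Online perceptron: predict sign(w^T x), update only on mistakes."""
        w = np.zeros(X.shape[1])                    # start with the all-zero vector w_1
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                pred = 1 if w @ x_i >= 0 else -1    # predict positive iff w^T x >= 0
                if pred != y_i:                     # on a mistake:
                    w = w + y_i * x_i               # w <- w + x (positive) or w <- w - x (negative)
        return w

For example, perceptron_train(np.array([[1.0, 1.0], [-2.0, 1.0]]), np.array([1, -1])) learns a separator for two points, with the constant second feature playing the role of the bias term.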

[Slide from Maria-Florina Balcan et al.]

SLIDE 17

Perceptron: Example and Guarantee

  • Example:
  • Guarantee: if the data has margin δ and all points lie inside a ball of radius S, then the Perceptron makes ≤ (S/δ)² mistakes.
  • Normalized margin: multiplying all points by 100, or dividing all points by 100, does not change the number of mistakes; the algorithm is invariant to scaling.

[Slide from Maria-Florina Balcan et al.]

SLIDE 18

Perceptron: Proof of Mistake Bound

  • Guarantee: if the data has margin δ and all points lie inside a ball of radius S, then the Perceptron makes ≤ (S/δ)² mistakes.
  • Proof:
  • Idea: analyze w_tᵀw∗ and ‖w_t‖, where w∗ is the max-margin separator, ‖w∗‖ = 1.
  • Claim 1: w_{t+1}ᵀw∗ ≥ w_tᵀw∗ + δ. (because y·xᵀw∗ ≥ δ for every labeled example (x, y))
  • Claim 2: ‖w_{t+1}‖² ≤ ‖w_t‖² + S². (by the Pythagorean theorem)
  • After N mistakes:
  • w_{N+1}ᵀw∗ ≥ δN (by Claim 1)
  • ‖w_{N+1}‖ ≤ S√N (by Claim 2)
  • w_{N+1}ᵀw∗ ≤ ‖w_{N+1}‖ (since w∗ is unit length)
  • So δN ≤ S√N, hence N ≤ (S/δ)².

[Slide from Maria-Florina Balcan et al.]

SLIDE 19

Multilayer perceptron (MLP)

  • A simple and basic type of feedforward neural network
  • Contains many perceptrons organized into layers
  • MLP “perceptrons” are not perceptrons in the strict sense

[Slide from Russ Salakhutdinov et al.]

SLIDE 20

Artificial Neuron (Perceptron)
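The slide body is a figure that is lost in extraction; the standard equations for an artificial neuron (a reconstruction in common course notation, not necessarily the slide’s symbols) are:

    a(x) = b + \sum_i w_i x_i = b + w^\top x   % pre-activation: weights w, bias b
    h(x) = g(a(x))                             % output, with activation function g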

[Slide from Russ Salakhutdinov et al.]

SLIDE 21

Artificial Neuron (Perceptron)

[Slide from Russ Salakhutdinov et al.]

SLIDE 22

Activation Function

  • Sigmoid activation function:
  • Squashes the neuron’s output between 0 and 1
  • Always positive
  • Bounded
  • Strictly increasing
  • Used in the output layer for classification
  • tanh activation function:
  • Squashes the neuron’s output between −1 and 1
  • Bounded
  • Strictly increasing
  • A linear transformation of the sigmoid function (see the identity below)
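Their definitions, and the identity behind the last point (standard formulas, not present in the extracted text):

    \sigma(a) = \frac{1}{1 + e^{-a}}, \qquad
    \tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}} = 2\,\sigma(2a) - 1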

[Slide from Russ Salakhutdinov et al.]

SLIDE 23

Activation Function

  • Rectified linear (ReLU) activation (formula below):
  • Bounded below by 0 (always non-negative)
  • Tends to produce units with sparse activities
  • Not upper bounded
  • Monotonically increasing (strictly so for positive inputs)
  • Most widely used activation function
  • Advantages:
  • Biological plausibility
  • Sparse activation
  • Better gradient propagation: avoids the vanishing gradients that sigmoidal activations suffer from
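The formula (standard definition):

    \mathrm{relu}(a) = \max(0, a)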

[Slide from Russ Salakhutdinov et al.]

SLIDE 24

Activation Function in Alexnet

  • A four-layer convolutional neural network
  • ReLU: solid line
  • Tanh: dashed line

[Slide from https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf]

SLIDE 25

Single Hidden Layer MLP
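The slide’s formula is an image; in common notation (our assumption of the intended content), a single-hidden-layer MLP computes

    h(x) = g\big(b^{(1)} + W^{(1)} x\big), \qquad
    f(x) = o\big(b^{(2)} + (w^{(2)})^\top h(x)\big)

where g is the hidden activation (e.g. sigmoid, tanh, ReLU) and o the output non-linearity (e.g. sigmoid or softmax).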

[Slide from Russ Salakhutdinov et al.]

SLIDE 26

Capacity of MLP

  • Consider a single layer neural network

[Slide from Russ Salakhutdinov et al.]

SLIDE 27

Capacity of Neural Nets

  • Consider a single layer neural network

[Slide from Russ Salakhutdinov et al.]

SLIDE 28

MLP with Multiple Hidden Layers

[Slide from Russ Salakhutdinov et al.]

SLIDE 29

Capacity of Neural Nets

  • Deep learning playground

[Slide from https://playground.tensorflow.org]

SLIDE 30

Training a Neural Network

[Slide from Russ Salakhutdinov et al.]

SLIDE 31

Stochastic Gradient Descent

[Slide from Russ Salakhutdinov et al.]

SLIDE 32

Mini-batch SGD

  • Make updates based on a mini-batch of examples (instead of a single example)
  • The gradient is that of the average regularized loss over the mini-batch
  • Can give a more accurate estimate of the gradient
  • Can leverage matrix/matrix operations, which are more efficient (a sketch follows below)
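A minimal sketch of the loop (our own Python; grad_fn is a hypothetical helper that returns the gradient of the average regularized loss over a batch):

    import numpy as np

    def minibatch_sgd(params, grad_fn, X, y, lr=0.1, batch_size=32, epochs=10):
        """Mini-batch SGD: one update per batch instead of per example."""
        n = len(X)
        for _ in range(epochs):
            idx = np.random.permutation(n)             # reshuffle every epoch
            for start in range(0, n, batch_size):
                b = idx[start:start + batch_size]
                g = grad_fn(params, X[b], y[b])        # gradient averaged over the mini-batch
                params = params - lr * g               # one descent step
        return params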

[Slide from Russ Salakhutdinov et al.]

SLIDE 33

Backpropagation

  • A method for training neural networks by gradient descent
  • Essentially an implementation of the chain rule plus dynamic programming
  • The derivative of the last two terms (see below):
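In the Wikipedia notation this slide follows, the chain rule factors the derivative of the error E with respect to a weight w_{ij} (from neuron i to neuron j), and the last two factors have closed forms:

    \frac{\partial E}{\partial w_{ij}}
      = \frac{\partial E}{\partial o_j}\,
        \frac{\partial o_j}{\partial \mathrm{net}_j}\,
        \frac{\partial \mathrm{net}_j}{\partial w_{ij}},
    \qquad
    \frac{\partial \mathrm{net}_j}{\partial w_{ij}} = o_i,
    \qquad
    \frac{\partial o_j}{\partial \mathrm{net}_j} = \varphi'(\mathrm{net}_j)

where net_j is neuron j’s weighted input, o_j = φ(net_j) its output, and φ its activation function.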

[Slide from https://en.wikipedia.org/wiki/Backpropagation]

SLIDE 34

Backpropagation

  • If o_j is an output → the derivative is straightforward
  • Otherwise (see below):
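Spelled out in the same Wikipedia notation, with δ_j = ∂E/∂o_j · ∂o_j/∂net_j:

    \delta_j =
      \begin{cases}
        \dfrac{\partial E}{\partial o_j}\,\varphi'(\mathrm{net}_j) & \text{if } j \text{ is an output neuron} \\[1ex]
        \Big(\sum_{k} w_{jk}\,\delta_k\Big)\,\varphi'(\mathrm{net}_j) & \text{otherwise, summing over the next layer}
      \end{cases}
    \qquad
    \frac{\partial E}{\partial w_{ij}} = \delta_j\, o_i

Dynamic programming enters here: the δ_k of the layer above are computed once and reused for every neuron feeding into it.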

[Slide from https://en.wikipedia.org/wiki/Backpropagation]

SLIDE 35

Weight Decay
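This slide’s content is a figure; the usual L2 weight-decay formulation (a standard statement, not necessarily the slide’s exact one) adds a penalty on the weights to the training loss:

    \tilde{J}(\theta) = J(\theta) + \lambda \sum_{i,j} w_{ij}^2,
    \qquad
    \nabla_w \tilde{J} = \nabla_w J + 2\lambda\, w

so every gradient step also shrinks the weights toward zero, discouraging overly complex fits.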

[Slide from Russ Salakhutdinov et al.]

SLIDE 36

Optimization: Momentum

  • Momentum: use an exponential average of previous gradients (update rule below)
  • Can get past plateaus more quickly, by “gaining momentum”
  • Works well in regions where the Hessian is badly conditioned
  • SGD without momentum vs. SGD with momentum
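In the notation of the cited ruder.io post, the momentum update is:

    v_t = \gamma\, v_{t-1} + \eta\, \nabla_\theta J(\theta), \qquad
    \theta = \theta - v_t, \qquad \gamma \approx 0.9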

[Slide from http://ruder.io/optimizing-gradient-descent/]

SLIDE 37

Momentum-based Optimization

  • Nesterov accelerated gradient (NAG)
  • Adagrad:
  • smaller updates for parameters associated with frequently occurring features
  • larger updates for parameters associated with infrequent features
  • RMSprop and Adadelta:
  • reduce Adagrad’s aggressive, monotonically decreasing learning rate
  • Adam (see the formulas below)
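Compactly, following the same ruder.io post (g_t is the gradient at step t; operations are element-wise):

    \text{Adagrad:}\; \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t
      \quad (G_t \text{ accumulates the sum of squared past gradients})
    \text{RMSprop:}\; E[g^2]_t = 0.9\, E[g^2]_{t-1} + 0.1\, g_t^2, \qquad
      \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t
    \text{Adam:}\; \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad
      \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \quad
      \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t

where m_t and v_t are exponential moving averages of the gradient and its square.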

[Slide from http://ruder.io/optimizing-gradient-descent/]

SLIDE 38

Demo of Optimization Methods

[Slide from http://ruder.io/optimizing-gradient-descent/]

SLIDE 39

Take home message

  • The Perceptron is an online linear classifier
  • A multilayer perceptron consists of perceptron-like units with various activations
  • Backpropagation computes the gradients of a neural network by the chain rule, in a backward, dynamic-programming fashion
  • Momentum-based mini-batch gradient descent methods are used for optimizing neural networks
  • What’s next?
  • Regularization in neural networks
  • Widely used NN architectures in practice

SLIDE 40

References

  • Eric Xing, Tom Mitchell. 10701 Introduction to Machine Learning: http://www.cs.cmu.edu/~epxing/Class/10701-06f/
  • Barnabás Póczos, Maria-Florina Balcan, Russ Salakhutdinov. 10715 Advanced Introduction to Machine Learning: https://sites.google.com/site/10715advancedmlintro2017f/lectures
  • Matt Gormley. 10601 Introduction to Machine Learning: http://www.cs.cmu.edu/~mgormley/courses/10601/index.html
  • Wikipedia
