SLIDE 1

Kernel Methods

CMSC 422 MARINE CARPUAT

marine@cs.umd.edu

Slides credit: Piyush Rai

SLIDE 2

Beyond linear classification

  • Problem: linear classifiers
    – Easy to implement and easy to optimize
    – But limited to linear decision boundaries

  • What can we do about it?
    – Last week: Neural networks
      • Very expressive, but harder to optimize (non-convex objective)
    – Today: Kernels

SLIDE 3

Kernel Methods

  • Goal: keep advantages of linear models, but make them capture non-linear patterns in data!

  • How?
    – By mapping data to higher dimensions where it exhibits linear patterns

SLIDE 4

Classifying non-linearly separable data with a linear classifier: examples

Non-linearly separable data in 1D becomes linearly separable in a new 2D space defined by the following mapping:
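The mapping itself appears as a figure on the original slide and is not captured in this text; a standard choice for this 1D example (a reconstruction, not a verbatim copy) is

$$ x \;\mapsto\; \phi(x) = (x,\; x^2) $$

A dataset with, say, the negative class clustered around the origin and the positive class on both sides cannot be separated by a threshold on x, but becomes separable by a horizontal line (a threshold on x²) in the new 2D space.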

SLIDE 5

Classifying non-linearly separable data with a linear classifier: examples

Non-linearly separable data in 2D becomes linearly separable in the 3D space defined by the following transformation:
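The transformation on the slide is shown as a figure; a standard reconstruction for this 2D example is

$$ (x_1, x_2) \;\mapsto\; \phi(x_1, x_2) = (x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2) $$

Under this map, a circular decision boundary $x_1^2 + x_2^2 = r^2$ in the original space becomes a plane in the new 3D space, so a linear classifier can separate the classes.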

SLIDE 6

Defining feature mappings

  • Map an original feature vector x to an expanded version φ(x)

  • Example: a quadratic feature mapping represents feature combinations (see the sketch below)
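As a concrete illustration of a quadratic feature mapping, here is a minimal NumPy sketch; the function name and the √2 scaling are choices made here, not taken from the slide:

```python
import numpy as np

def quadratic_feature_map(x):
    """Map a d-dimensional vector x to an expanded feature vector containing
    a constant, the (scaled) original features, and all pairwise products.
    With the sqrt(2) scaling, phi(x) . phi(z) == (1 + x . z) ** 2."""
    x = np.asarray(x, dtype=float)
    pairwise = np.outer(x, x).ravel()               # all products x_i * x_j
    return np.concatenate(([1.0], np.sqrt(2) * x, pairwise))

# A 3-dimensional input already expands to 1 + 3 + 9 = 13 features.
phi = quadratic_feature_map([0.5, -1.0, 2.0])
print(phi.shape)   # (13,)
```

The scaling constant only matters later, when we want the dot product in the expanded space to match a simple closed-form kernel.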

SLIDE 7

Feature Mappings

  • Pros: can help turn a non-linear classification problem into a linear one

  • Cons: “feature explosion” creates issues when training a linear classifier in the new feature space
    – More computationally expensive to train
    – More training examples needed to avoid overfitting
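To make the cost concrete: for the quadratic mapping sketched above, a d-dimensional input expands to 1 + d + d² features, so both the training cost and the number of parameters to estimate grow quadratically with the input dimension.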
SLIDE 8

Kernel Methods

  • Goal: keep advantages of linear models, but make them capture non-linear patterns in data!

  • How?
    – By mapping data to higher dimensions where it exhibits linear patterns
    – By rewriting linear models so that the mapping never needs to be explicitly computed

SLIDE 9

The Kernel Trick

  • Rewrite learning algorithms so they only depend on dot products between pairs of examples
  • Replace the dot product with a kernel function, which computes the dot product implicitly

SLIDE 10

Example of Kernel function
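The example on the original slide is an equation not captured in this text; a standard one, consistent with the 2D-to-3D mapping above, is the quadratic kernel:

$$ K(x, z) = (x \cdot z)^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2\, x_1 x_2\, z_1 z_2 + x_2^2 z_2^2 = \phi(x) \cdot \phi(z), \qquad \phi(x) = (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2) $$

Evaluating K(x, z) costs one dot product in the original 2D space, yet it equals the dot product in the expanded 3D space.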

SLIDE 11

Another example of Kernel Function (see CIML 9.1)

What is the function K(x, z) that can implicitly compute the dot product φ(x) · φ(z)?
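For reference, the derivation in CIML 9.1 works out to the following (check the book for the exact form and constants): the quadratic feature map containing all pairwise feature products can be computed implicitly by

$$ K(x, z) = (1 + x \cdot z)^2, $$

which takes O(d) time to evaluate, versus O(d²) time to build φ(x) and φ(z) explicitly and then take their dot product.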

SLIDE 12

Kernels: Formally defined
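The definition on the original slide is an image; a standard statement of it is: a function K is a valid kernel if there exists a feature mapping φ such that

$$ K(x, z) = \phi(x) \cdot \phi(z) \quad \text{for all inputs } x, z. $$

Equivalently, K must be symmetric and positive semi-definite; this is made precise by Mercer’s condition on the next slide.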

SLIDE 13

Kernels: Mercer’s condition

  • Can any function be used as a kernel function?
  • No! It must satisfy Mercer’s condition, which involves all square-integrable functions f (see below).
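The condition itself appears as an equation on the original slide; the standard statement (a reconstruction, not a verbatim copy) is that K must be symmetric and satisfy

$$ \int\!\!\int f(x)\, K(x, z)\, f(z)\; dx\, dz \;\geq\; 0 $$

for every square-integrable function f.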
SLIDE 14

Kernels: Constructing combinations of kernels
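The combination rules appear as equations on the original slide and are not captured in this text; the standard closure properties (a reconstruction) are: if K1 and K2 are valid kernels and c ≥ 0, then each of the following is also a valid kernel:

  • c K1(x, z)
  • K1(x, z) + K2(x, z)
  • K1(x, z) · K2(x, z)
  • f(x) K1(x, z) f(z), for any function f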

SLIDE 15

Commonly Used Kernel Functions
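The table on the original slide is not captured in this text; the kernels most commonly used in this setting (a standard list, not a verbatim copy) are:

  • Linear: K(x, z) = x · z
  • Polynomial of degree d: K(x, z) = (1 + x · z)^d
  • Gaussian / RBF with bandwidth parameter γ > 0: K(x, z) = exp(−γ ‖x − z‖²)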

SLIDE 16

The Kernel Trick

  • Rewrite learning algorithms so they only depend on dot products between pairs of examples
  • Replace the dot product with a kernel function, which computes the dot product implicitly

SLIDE 17

“Kernelizing” the perceptron

  • Naïve approach: let’s explicitly train a perceptron in the new feature space

  • Can we apply the kernel trick? Not yet: we first need to rewrite the algorithm using only dot products between examples

SLIDE 18

“Kernelizing” the perceptron

  • Perceptron Representer Theorem

“During a run of the perceptron algorithm, the weight vector w can always be represented as a linear combination of the expanded training data”

Proof by induction

(on board + see CIML 9.2)
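In symbols (notation introduced here, following CIML’s presentation rather than copied from the slide): at any point during training there exist coefficients α_1, …, α_N, one per training example, such that

$$ w \;=\; \sum_{n=1}^{N} \alpha_n\, \phi(x_n), $$

where α_n is updated (e.g., by adding the label y_n) each time the algorithm makes a mistake on example n.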

SLIDE 19

“Kernelizing” the perceptron

  • We can use the perceptron representer theorem to compute activations as a dot product between examples
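Concretely, substituting the expansion of w from the previous slide, the activation on an example x becomes

$$ a \;=\; w \cdot \phi(x) \;=\; \sum_{n=1}^{N} \alpha_n\, \phi(x_n) \cdot \phi(x) \;=\; \sum_{n=1}^{N} \alpha_n\, K(x_n, x), $$

so the algorithm never needs φ explicitly, only kernel evaluations.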

SLIDE 20

“Kernelizing” the perceptron

  • Same training algorithm, but it no longer explicitly refers to the weight vector w

  • It only depends on dot products between examples
  • We can apply the kernel trick! (A sketch follows below.)
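Below is a minimal sketch of a kernelized perceptron in NumPy. It assumes labels in {−1, +1} and a polynomial kernel; the function names are chosen here for illustration, and this is a sketch of the idea rather than the reference implementation from CIML:

```python
import numpy as np

def polynomial_kernel(x, z, degree=2):
    """K(x, z) = (1 + x . z) ** degree; degree=2 is an implicit quadratic feature map."""
    return (1.0 + np.dot(x, z)) ** degree

def train_kernel_perceptron(X, y, kernel=polynomial_kernel, max_epochs=500):
    """Kernelized perceptron: learns one coefficient alpha_n per training example
    instead of an explicit weight vector w. Assumes y holds labels in {-1, +1}."""
    n = X.shape[0]
    alpha = np.zeros(n)
    # Precompute the Gram matrix K[i, j] = kernel(x_i, x_j).
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            # Activation a = sum_n alpha_n * y_n * K(x_n, x_i): no w, no phi needed.
            a = np.sum(alpha * y * K[:, i])
            if y[i] * a <= 0:       # mistake on example i: "add" it to the implicit w
                alpha[i] += 1.0
                mistakes += 1
        if mistakes == 0:           # every training example classified correctly
            break
    return alpha

def kernel_predict(alpha, X_train, y_train, x, kernel=polynomial_kernel):
    """Predict the label of a new point x using only kernel evaluations."""
    a = sum(alpha[n] * y_train[n] * kernel(X_train[n], x) for n in range(len(alpha)))
    return 1 if a > 0 else -1

# Tiny usage example: XOR-style data, not linearly separable in the original 2D space,
# but separable in the implicit quadratic feature space.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])
alpha = train_kernel_perceptron(X, y)
print([kernel_predict(alpha, X, y, x) for x in X])   # recovers [-1, 1, 1, -1]
```

Note that both training and prediction touch the data only through kernel(·, ·), so swapping in a different kernel changes the implicit feature space without changing the algorithm.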
SLIDE 21

Kernel Methods

  • Goal: keep advantages of linear models, but make them capture non-linear patterns in data!

  • How?
    – By mapping data to higher dimensions where it exhibits linear patterns
    – By rewriting linear models so that the mapping never needs to be explicitly computed

SLIDE 22

Discussion

  • Other algorithms can be kernelized:
    – See CIML for K-means
    – We’ll talk about Support Vector Machines next

  • Do kernels address all the downsides of “feature explosion”?
    – Helps reduce computation cost during training
    – But overfitting remains an issue

SLIDE 23

What you should know

  • Kernel functions
    – What they are, why they are useful, how they relate to feature combination

  • Kernelized perceptron
    – You should be able to derive it and implement it