

SLIDE 1

A Neural Network View of Kernel Methods

Shuiwang Ji Department of Computer Science & Engineering Texas A&M University

1 / 24

SLIDE 2

Linear Models

1. In a binary classification problem, we have training data $\{\tilde{\mathbf{x}}_i, y_i\}_{i=1}^m$, where $\tilde{\mathbf{x}}_i \in \mathbb{R}^{n-1}$ represents the input feature vector and $y_i \in \{-1, 1\}$ is the corresponding label.

2. In logistic regression, for each sample $\tilde{\mathbf{x}}_i$, a linear classifier, parameterized by $\tilde{\mathbf{w}} \in \mathbb{R}^{n-1}$ and $b \in \mathbb{R}$, computes the classification score as
$$h(\tilde{\mathbf{x}}_i) = \sigma(\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}_i + b) = \frac{1}{1 + \exp\left[-(\tilde{\mathbf{w}}^T\tilde{\mathbf{x}}_i + b)\right]}, \quad (1)$$
where $\sigma(\cdot)$ is the sigmoid function.

3. Note that the classification score $h(\tilde{\mathbf{x}}_i)$ can be interpreted as the probability of $\tilde{\mathbf{x}}_i$ having label 1.

4. We let $\mathbf{x}_i = \begin{bmatrix} 1 \\ \tilde{\mathbf{x}}_i \end{bmatrix} \in \mathbb{R}^n$, $i = 1, 2, \ldots, m$, and $\mathbf{w} = \begin{bmatrix} b \\ \tilde{\mathbf{w}} \end{bmatrix} \in \mathbb{R}^n$. Then we can re-write Eqn. (1) as
$$h(\mathbf{x}_i) = \sigma(\mathbf{w}^T\mathbf{x}_i) = \frac{1}{1 + \exp\left[-\mathbf{w}^T\mathbf{x}_i\right]}. \quad (2)$$

2 / 24
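The absorbed-bias form of Eqn. (2) can be sketched in a few lines of NumPy. This is a minimal illustration with made-up numbers and function names of my choosing, not part of the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(w, x_tilde):
    # Eqn. (2): prepend 1 to the input so the bias b (stored in w[0])
    # is absorbed into the single parameter vector w = [b, w~].
    x = np.concatenate(([1.0], x_tilde))
    return sigmoid(w @ x)

# w = [b, w~] with b = 0.5 and w~ = [1, -2]
w = np.array([0.5, 1.0, -2.0])
p = score(w, np.array([2.0, 1.0]))  # sigma(0.5 + 1*2 + (-2)*1) = sigma(0.5)
```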

SLIDE 3

Linearly Inseparable Data

1. If the training data $\{\mathbf{x}_i, y_i\}_{i=1}^m$ are linearly separable, there exists a $\mathbf{w}^* \in \mathbb{R}^n$ such that $y_i\mathbf{w}^{*T}\mathbf{x}_i \ge 0$, $i = 1, 2, \ldots, m$. In this case, a linear model like logistic regression can perfectly fit the original training data $\{\mathbf{x}_i, y_i\}_{i=1}^m$.

2. However, this is not possible for linearly inseparable cases.

[Figure: scatter plot of a linearly inseparable dataset.]

3 / 24

SLIDE 4

Feature Mapping

1. A typical method to handle such linearly inseparable cases is feature mapping. That is, instead of using the original $\{\mathbf{x}_i\}_{i=1}^m$, we apply a feature mapping function $\phi: \mathbb{R}^n \to \mathbb{R}^N$ to $\{\mathbf{x}_i\}_{i=1}^m$, so that the mapped feature vectors $\{\phi(\mathbf{x}_i)\}_{i=1}^m$ are linearly separable.

2. For example, we can map the linearly inseparable data with
$$\phi(\mathbf{x}) = \phi\left(\begin{bmatrix} 1 \\ \tilde{x} \end{bmatrix}\right) = \begin{bmatrix} 1 \\ \tilde{x} \\ \tilde{x}^2 \end{bmatrix}. \quad (3)$$

[Figure: the data plotted in the $(\tilde{x}, \tilde{x}^2)$ plane, where the two classes become linearly separable.]

4 / 24
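As a small check of this mapping (on hypothetical data of my own, not from the slides), 1-D points labeled by whether $|\tilde{x}| > 0.5$ are inseparable in $\tilde{x}$ but become separable through the $\tilde{x}^2$ coordinate:

```python
import numpy as np

def phi(x):
    # Eqn. (3): map [1, x~]^T to [1, x~, x~^2]^T
    x_tilde = x[1]
    return np.array([1.0, x_tilde, x_tilde**2])

# labels +1 for |x~| > 0.5, -1 otherwise: not linearly separable in x~
xs = np.array([-1.0, -0.2, 0.2, 1.0])
ys = np.array([1, -1, -1, 1])
mapped = np.stack([phi(np.array([1.0, v])) for v in xs])
# in the mapped space, the hyperplane x~^2 = 0.25 separates the classes
preds = np.sign(mapped[:, 2] - 0.25).astype(int)
```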

SLIDE 5

Logistic Regression with Feature Mapping

In the context of logistic regression, the whole process can be described as
$$h(\mathbf{x}_i) = \sigma(\mathbf{w}^T\phi(\mathbf{x}_i)) = \frac{1}{1 + \exp\left[-\mathbf{w}^T\phi(\mathbf{x}_i)\right]}, \quad (4)$$
where the dimension of the parameter vector $\mathbf{w}$ becomes $N$ accordingly.

[Figure: diagram of the pipeline $\mathbf{x} \in \mathbb{R}^n \to \phi(\mathbf{x}) \in \mathbb{R}^N \to h$.]

5 / 24
SLIDE 6

Computation of Feature Mapping

1. In order to achieve strong enough representation power, it is common in practice that $\phi(\mathbf{x})$ has a much higher dimension than $\mathbf{x}$, i.e., $N \gg n$.

2. However, this dramatically increases the cost of computing either $\phi(\mathbf{x})$ or $\mathbf{w}^T\phi(\mathbf{x})$.

3. In the following, we introduce an efficient way to implicitly compute $\mathbf{w}^T\phi(\mathbf{x})$.

4. Specifically, we use the representer theorem to show that computing $\mathbf{w}^T\phi(\mathbf{x})$ can be transformed into computing $\sum_{i=1}^m \alpha_i\phi(\mathbf{x}_i)^T\phi(\mathbf{x})$, where $\{\alpha_i\}_{i=1}^m$ are learnable parameters.

5. Then we introduce kernel methods, which significantly reduce the cost of computing $\phi(\mathbf{x}_i)^T\phi(\mathbf{x})$.

6 / 24

SLIDE 7

Summary of Models with Feature Mapping

Given the training data $\{\mathbf{x}_i, y_i\}_{i=1}^m$ ($\mathbf{x}_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}$) and a feature mapping $\phi: \mathbb{R}^n \to \mathbb{R}^N$, to solve a supervised learning task (regression or classification), we need to do the following steps:

1. Compute the feature vectors $\{\phi(\mathbf{x}_i)\}_{i=1}^m$ of all training samples;
2. Initialize a linear model with a parameter vector $\mathbf{w} \in \mathbb{R}^N$;
3. Minimize the task-specific loss function $L$ on $\{\phi(\mathbf{x}_i), y_i\}_{i=1}^m$ with respect to $\mathbf{w}$.

7 / 24

SLIDE 8

Regularization

1. Since the loss $L$ is a function of $z = \mathbf{w}^T\phi(\mathbf{x})$ and $y$, we can write it as $L(\mathbf{w}^T\phi(\mathbf{x}), y)$. Minimizing $L(\mathbf{w}^T\phi(\mathbf{x}), y)$ on $\{\phi(\mathbf{x}_i), y_i\}_{i=1}^m$ is an optimization problem:
$$\min_{\mathbf{w}} \frac{1}{m}\sum_{i=1}^m L\left(\mathbf{w}^T\phi(\mathbf{x}_i), y_i\right). \quad (5)$$

2. However, in many situations, minimizing $L$ alone may cause the problem of over-fitting. A common method to address over-fitting is to apply $\ell_2$-regularization, changing Equation (5) into Equation (6):
$$\min_{\mathbf{w}} \frac{1}{m}\sum_{i=1}^m L\left(\mathbf{w}^T\phi(\mathbf{x}_i), y_i\right) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2, \quad \lambda \ge 0, \quad (6)$$
where $\lambda \ge 0$ is a hyper-parameter, known as the regularization parameter, controlling the extent to which we penalize large $\ell_2$-norms of $\mathbf{w}$.

8 / 24
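The regularized objective of Eqn. (6) is straightforward to write down directly. The sketch below instantiates it for the logistic loss (which uses labels in $\{0, 1\}$ in this form); the function and variable names are mine, not the slides':

```python
import numpy as np

def objective(w, Phi, y, lam):
    # Eqn. (6): mean loss over samples plus (lambda/2) * ||w||_2^2.
    # Phi is the m x N matrix whose rows are phi(x_i); y is in {0, 1}.
    z = Phi @ w  # z_i = w^T phi(x_i)
    # logistic loss: -y log(sigma(z)) - (1 - y) log(1 - sigma(z))
    loss = np.mean(y * np.log1p(np.exp(-z)) + (1 - y) * np.log1p(np.exp(z)))
    return loss + 0.5 * lam * (w @ w)

# at w = 0 every prediction is 0.5, so the loss is log 2 per sample
val = objective(np.zeros(2), np.ones((3, 2)), np.array([1, 0, 1]), 0.0)
```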

SLIDE 9

Representer Theorem

$$\min_{\mathbf{w}} \frac{1}{m}\sum_{i=1}^m L\left(\mathbf{w}^T\phi(\mathbf{x}_i), y_i\right) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2, \quad \lambda \ge 0. \quad (7)$$

In order to derive a solution to this optimization problem, we introduce the following theorem, which is a special case of the well-known Representer Theorem. The Representer Theorem is the theoretical foundation of kernel methods.

Theorem. If the optimization problem in Equation (6) (copied above) has optimal solutions, there must exist an optimal solution of the form $\mathbf{w}^* = \sum_{i=1}^m \alpha_i\phi(\mathbf{x}_i)$.

9 / 24

SLIDE 10

Proof I

Since the elements of $\{\phi(\mathbf{x}_i)\}_{i=1}^m$ are all in $\mathbb{R}^N$, they span a subspace $V \subseteq \mathbb{R}^N$ such that $V = \{\mathbf{x} : \mathbf{x} = \sum_{i=1}^m c_i\phi(\mathbf{x}_i)\}$. Assume $\{\mathbf{v}_1, \ldots, \mathbf{v}_{n'}\}$ is an orthonormal basis of $V$, where $n' \le N$. $V$ also has an orthogonal complement subspace $V^\perp$, which has an orthonormal basis $\{\mathbf{u}_1, \ldots, \mathbf{u}_{N-n'}\}$. Clearly, $\mathbf{v}_k^T\mathbf{u}_j = 0$ for any $1 \le k \le n'$ and $1 \le j \le N - n'$.

For an arbitrary vector $\mathbf{w} \in \mathbb{R}^N$, we can decompose it into a linear combination of the orthonormal basis vectors of $V$ and $V^\perp$. That is, we can write $\mathbf{w}$ as
$$\mathbf{w} = \mathbf{w}_V + \mathbf{w}_{V^\perp} = \sum_{k=1}^{n'} s_k\mathbf{v}_k + \sum_{j=1}^{N-n'} t_j\mathbf{u}_j. \quad (8)$$

10 / 24

SLIDE 11

Proof II

First, we can show that
$$\begin{aligned}
\|\mathbf{w}\|_2^2 &= \left\|\sum_{k=1}^{n'} s_k\mathbf{v}_k + \sum_{j=1}^{N-n'} t_j\mathbf{u}_j\right\|_2^2 \\
&= \left(\sum_{k=1}^{n'} s_k\mathbf{v}_k^T + \sum_{j=1}^{N-n'} t_j\mathbf{u}_j^T\right)\left(\sum_{k=1}^{n'} s_k\mathbf{v}_k + \sum_{j=1}^{N-n'} t_j\mathbf{u}_j\right) \\
&= \sum_{k=1}^{n'} s_k^2\|\mathbf{v}_k\|_2^2 + \sum_{j=1}^{N-n'} t_j^2\|\mathbf{u}_j\|_2^2 + 2\sum_{k=1}^{n'}\sum_{j=1}^{N-n'} s_kt_j\mathbf{v}_k^T\mathbf{u}_j \\
&= \sum_{k=1}^{n'} s_k^2\|\mathbf{v}_k\|_2^2 + \sum_{j=1}^{N-n'} t_j^2\|\mathbf{u}_j\|_2^2 \\
&\ge \sum_{k=1}^{n'} s_k^2\|\mathbf{v}_k\|_2^2 = \|\mathbf{w}_V\|_2^2. \quad (9)
\end{aligned}$$

11 / 24

SLIDE 12

Proof III

Second, because each $\phi(\mathbf{x}_i)$, $1 \le i \le m$, is a vector in $V$ and $\{\mathbf{v}_1, \ldots, \mathbf{v}_{n'}\}$ is an orthonormal basis of $V$, we have $\phi(\mathbf{x}_i) = \sum_{k=1}^{n'} \beta_{ik}\mathbf{v}_k$. This leads to the following equalities:
$$\mathbf{w}^T\phi(\mathbf{x}_i) = \left(\sum_{k=1}^{n'} s_k\mathbf{v}_k^T + \sum_{j=1}^{N-n'} t_j\mathbf{u}_j^T\right)\left(\sum_{k=1}^{n'} \beta_{ik}\mathbf{v}_k\right) = \sum_{k=1}^{n'} s_k\beta_{ik}\|\mathbf{v}_k\|_2^2 = \mathbf{w}_V^T\phi(\mathbf{x}_i). \quad (10)$$

12 / 24

SLIDE 13

Proof IV

Based on the results in Eqn. (9) and Eqn. (10), and the fact that $\mathbf{w}_V$ is a vector in $V = \{\mathbf{x} : \mathbf{x} = \sum_{i=1}^m c_i\phi(\mathbf{x}_i)\}$, we can derive that, for an arbitrary $\mathbf{w}$, there always exists a $\mathbf{w}_V = \sum_{i=1}^m \alpha_i\phi(\mathbf{x}_i)$ satisfying
$$\frac{1}{m}\sum_{i=1}^m L\left(\mathbf{w}^T\phi(\mathbf{x}_i), y_i\right) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2 \ge \frac{1}{m}\sum_{i=1}^m L\left(\mathbf{w}_V^T\phi(\mathbf{x}_i), y_i\right) + \frac{\lambda}{2}\|\mathbf{w}_V\|_2^2. \quad (11)$$

In other words, if a vector $\mathbf{w}^*$ minimizes $\frac{1}{m}\sum_{i=1}^m L(\mathbf{w}^T\phi(\mathbf{x}_i), y_i) + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$, the corresponding $\mathbf{w}_V^*$ must also minimize it, and there exist some $\{\alpha_i\}_{i=1}^m$ such that $\mathbf{w}_V^* = \sum_{i=1}^m \alpha_i\phi(\mathbf{x}_i)$.

13 / 24

SLIDE 14

Use of the Representer Theorem in Training

1. According to Theorem 1, we only need to consider $\mathbf{w} \in \{\sum_{i=1}^m \alpha_i\phi(\mathbf{x}_i)\}$ when solving the optimization problem in Equation (6).

2. Therefore, we have the following transformed optimization problem, obtained by replacing $\mathbf{w}$ in Equation (6) with $\sum_{i=1}^m \alpha_i\phi(\mathbf{x}_i)$:
$$\min_{\alpha_1, \ldots, \alpha_m} \frac{1}{m}\sum_{j=1}^m L\left(\sum_{i=1}^m \alpha_i\phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j), y_j\right) + \frac{\lambda}{2}\sum_{j=1}^m\sum_{i=1}^m \alpha_i\alpha_j\phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j), \quad \lambda \ge 0. \quad (12)$$

3. As a result, if we know $\phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$ for all $1 \le i, j \le m$, we can compute the optimization objective in Equation (12) without explicitly knowing $\{\phi(\mathbf{x}_i)\}_{i=1}^m$.

14 / 24
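Equation (12) depends on the data only through the Gram matrix $K_{ij} = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$, which the sketch below makes explicit. This is an illustration under my own naming; the hinge loss in the example is one possible choice of $L$:

```python
import numpy as np

def kernel_objective(alpha, K, y, lam, loss):
    # Eqn. (12): K[i, j] = phi(x_i)^T phi(x_j); phi itself never appears.
    z = K @ alpha                              # z_j = sum_i alpha_i K_ij
    data_term = np.mean(loss(z, y))
    reg_term = 0.5 * lam * (alpha @ K @ alpha) # sum_ij alpha_i alpha_j K_ij
    return data_term + reg_term

hinge = lambda z, y: np.maximum(1 - y * z, 0)
val = kernel_objective(np.array([1.0, 1.0]), np.eye(2),
                       np.array([1.0, -1.0]), 2.0, hinge)
# data term: mean of [0, 2] = 1; reg term: 0.5 * 2 * 2 = 2; total 3
```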

SLIDE 15

Use of the Representer Theorem in Prediction

1. In addition, the output of the linear model with parameter $\mathbf{w}$ for any input $\phi(\mathbf{x})$ only depends on $\mathbf{w}^T\phi(\mathbf{x})$, i.e., $\sum_{i=1}^m \alpha_i\phi(\mathbf{x}_i)^T\phi(\mathbf{x})$.

2. So for any unseen $\mathbf{x}$ that is not in the training set, we can make predictions directly without computing $\phi(\mathbf{x})$ first if we know $\phi(\mathbf{x}_i)^T\phi(\mathbf{x})$ for all $1 \le i \le m$.

3. In summary, in both training and prediction, what we really need is the inner product of two feature vectors, not the feature vectors themselves.

15 / 24

SLIDE 16

Kernel Methods I

1. In the following discussion, we call $k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$ the kernel function of $\mathbf{x}_i, \mathbf{x}_j$.

2. Note that, although $k(\mathbf{x}_i, \mathbf{x}_j)$ is the inner product of $\phi(\mathbf{x}_i)$ and $\phi(\mathbf{x}_j)$, we compute it directly from $\mathbf{x}_i$ and $\mathbf{x}_j$.

3. The advantage of using a kernel function is that, in many cases, computing $\phi(\cdot)$ may be very expensive or even infeasible, while computing $k(\mathbf{x}_i, \mathbf{x}_j)$ is much easier.

4. For example, consider $\phi(\mathbf{x}) = [1, \sqrt{2}x_1, \ldots, \sqrt{2}x_{n-1}, x_1^2, \ldots, x_{n-1}^2, \sqrt{2}x_1x_2, \ldots, \sqrt{2}x_ix_j\,(i < j), \ldots, \sqrt{2}x_{n-2}x_{n-1}]^T$, where $\mathbf{x} = [1, x_1, \ldots, x_{n-1}]^T$.

5. For any two $n$-dimensional feature vectors $\mathbf{a}$ and $\mathbf{b}$, $\phi(\mathbf{a})$ and $\phi(\mathbf{b})$ are both $O(n^2)$-dimensional vectors.

6. So, the time complexity of computing $\phi(\mathbf{a})$ and $\phi(\mathbf{b})$ first and then their inner product is $O(n^2)$.

16 / 24

SLIDE 17

Kernel Methods II

1. However, we show that the result can be computed from $\mathbf{a}^T\mathbf{b}$ more efficiently as
$$\begin{aligned}
\phi(\mathbf{a})^T\phi(\mathbf{b}) &= 1 + \sum_{i=1}^{n-1} 2a_ib_i + \sum_{i=1}^{n-1} a_i^2b_i^2 + \sum_{i=1}^{n-1}\sum_{j=i+1}^{n-1} 2a_ia_jb_ib_j \\
&= \sum_{i=0}^{n-1} a_i^2b_i^2 + \sum_{i=0}^{n-1}\sum_{j=i+1}^{n-1} 2a_ia_jb_ib_j \quad (a_0 = b_0 = 1) \\
&= \left(\sum_{i=0}^{n-1} a_ib_i\right)^2 = (\mathbf{a}^T\mathbf{b})^2, \quad (13)
\end{aligned}$$
where the computational complexity is decreased from $O(n^2)$ to $O(n)$.

2. We will introduce some concrete kernel functions like the radial basis function. Some of them even correspond to an infinite-dimensional $\phi(\mathbf{x})$. In this case, using the kernel function is necessary.

17 / 24
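The identity in Eqn. (13) can be verified numerically by comparing the explicit $O(n^2)$ feature map against $(\mathbf{a}^T\mathbf{b})^2$. A small sketch, where `phi` follows the quadratic map defined on the previous slide:

```python
import numpy as np
from itertools import combinations

def phi(x):
    # explicit quadratic feature map for x = [1, x_1, ..., x_{n-1}]^T:
    # [1, sqrt(2)x_i ..., x_i^2 ..., sqrt(2)x_i x_j (i < j) ...]
    t = x[1:]
    cross = np.array([np.sqrt(2) * t[i] * t[j]
                      for i, j in combinations(range(len(t)), 2)])
    return np.concatenate(([1.0], np.sqrt(2) * t, t**2, cross))

rng = np.random.default_rng(0)
a = np.concatenate(([1.0], rng.standard_normal(4)))
b = np.concatenate(([1.0], rng.standard_normal(4)))
explicit = phi(a) @ phi(b)  # O(n^2) route
implicit = (a @ b) ** 2     # O(n) route, Eqn. (13)
```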

SLIDE 18

Kernel Methods III

1. With the kernel function, the optimization problem in Equation (12) can be re-written as
$$\min_{\alpha_1, \ldots, \alpha_m} \frac{1}{m}\sum_{j=1}^m L\left(\sum_{i=1}^m \alpha_ik(\mathbf{x}_i, \mathbf{x}_j), y_j\right) + \frac{\lambda}{2}\sum_{j=1}^m\sum_{i=1}^m \alpha_i\alpha_jk(\mathbf{x}_i, \mathbf{x}_j), \quad \lambda \ge 0. \quad (14)$$

2. We can initialize $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_m)^T$, select the task-specific loss function $L$, and minimize the objective in Equation (14) on the training data via gradient descent. This method is known as the kernel method.

18 / 24

SLIDE 19

Kernel Logistic Regression

1. Concretely, using the example of logistic regression for binary classification problems, we now have
$$h(\mathbf{x}) = \sigma(\mathbf{w}^T\phi(\mathbf{x})) = \sigma\left(\sum_{i=1}^m \alpha_ik(\mathbf{x}_i, \mathbf{x})\right) = \frac{1}{1 + \exp\left(-\sum_{i=1}^m \alpha_ik(\mathbf{x}_i, \mathbf{x})\right)}.$$

2. When applying the binary cross-entropy loss
$$L(z, y) = -y\log\frac{1}{1 + \exp(-z)} - (1 - y)\log\left(1 - \frac{1}{1 + \exp(-z)}\right)$$
in Equation (14), this model is known as kernel logistic regression. In addition, if we apply the hinge loss $L(z, y) = \max(1 - yz, 0)$, we obtain the support vector machine (SVM).

[Figure: diagram of kernel logistic regression mapping $\mathbf{x} \in \mathbb{R}^n$ through kernel units $k(\mathbf{x}_i, \mathbf{x})$ to the output $h$.]

19 / 24
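Prediction in kernel logistic regression needs only the kernel values between the test point and the training samples. A minimal sketch with names of my choosing; the RBF kernel here is one concrete choice (introduced on a later slide):

```python
import numpy as np

def rbf(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def predict(alpha, X_train, x, kernel=rbf):
    # h(x) = sigma(sum_i alpha_i k(x_i, x)); X_train stores x_i row-wise
    k = np.array([kernel(xi, x) for xi in X_train])
    return 1.0 / (1.0 + np.exp(-alpha @ k))

# with alpha = 0 the score is sigma(0) = 0.5, as expected
p = predict(np.zeros(3), np.zeros((3, 2)), np.ones(2))
```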
SLIDE 20

Kernel Methods and Neural Networks I

We now move one step further from the kernel methods by comparing them with a two-layer feed-forward neural network.

In the kernel logistic regression model for binary classification with training data $\{\mathbf{x}_i, y_i\}_{i=1}^m$, given input $\mathbf{x}$, the output is
$$\sigma\left(\sum_{i=1}^m \alpha_ik(\mathbf{x}_i, \mathbf{x})\right) = \sigma(\boldsymbol{\alpha}^T\mathbf{k}),$$
where $\boldsymbol{\alpha} = [\alpha_1, \alpha_2, \ldots, \alpha_m]^T$ is the learnable parameter vector and $\mathbf{k} = [k(\mathbf{x}_1, \mathbf{x}), k(\mathbf{x}_2, \mathbf{x}), \ldots, k(\mathbf{x}_m, \mathbf{x})]^T$ is an $m$-dimensional vector computed by a sequence of kernel functions. Basically, we compute the kernel functions first and obtain an $m$-dimensional vector $\mathbf{k}$. We then perform the original logistic regression with $\mathbf{k}$ as input.

In a two-layer feed-forward neural network with a hidden size of $t$ for the same task, we need to compute a hidden $t$-dimensional vector first. The second layer can then be considered as the original logistic regression in Eqn. (2). Note that the hidden size $t$ here is fixed.

20 / 24

SLIDE 21

Kernel Methods and Neural Networks II

1. If we treat the computation of $\mathbf{k}$ as the first "hidden layer", the kernel logistic regression model can be thought of as a two-layer feed-forward neural network. But here the hidden size and the number of parameters equal the number of training samples.

2. This leads to large computation and storage costs when there are millions of training samples. Borrowing the idea of a fixed hidden size from feed-forward neural networks, we can instead build the first hidden layer from $t$ representatives of all training samples, where $t$ is a fixed number.

3. A simple way to obtain $t$ representatives is to use the $t$ centroids $\{\mathbf{c}_i\}_{i=1}^t$ obtained by running the k-means clustering algorithm on all training samples $\{\mathbf{x}_i\}_{i=1}^m$.

4. When the hidden size is fixed to $t$, the dimension of the parameter vector $\boldsymbol{\alpha}$ is also fixed to $t$.
21 / 24
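Picking the $t$ representatives by k-means can be sketched with a bare-bones Lloyd's algorithm. This is an illustrative implementation with made-up toy data; in practice a library routine would normally be used:

```python
import numpy as np

def kmeans_centroids(X, t, iters=20, seed=0):
    # Lloyd's algorithm: alternate nearest-centroid assignment and
    # centroid recomputation; the t centroids become the hidden layer.
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=t, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # keep the old centroid if a cluster happens to empty out
        C = np.stack([X[labels == i].mean(0) if (labels == i).any() else C[i]
                      for i in range(t)])
    return C

# two well-separated groups of identical points
X = np.vstack([np.zeros((5, 2)), 10 * np.ones((5, 2))])
C = kmeans_centroids(X, t=2)  # centroids converge to (0,0) and (10,10)
```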

SLIDE 22

Radial Basis Function Networks

1. In practice, a very frequently used kernel function is the radial basis function
$$k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|_2^2}{2\sigma^2}\right).$$

2. If we build the hidden layer from the $t$ clustering centroids $\{\mathbf{c}_i\}_{i=1}^t$ and use the radial basis function in the kernel logistic regression model, we obtain the radial basis function (RBF) network.

[Figure: diagram of an RBF network mapping $\mathbf{x} \in \mathbb{R}^n$ through $t$ radial basis units $k(\mathbf{c}_i, \mathbf{x})$ to the output $h$.]

22 / 24
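A forward pass of the resulting RBF network is just the radial basis hidden layer around the fixed centroids followed by logistic regression. A minimal sketch; the variable names are mine:

```python
import numpy as np

def rbf_forward(x, centroids, alpha, sigma=1.0):
    # hidden layer: t radial basis units around the fixed k-means
    # centroids; output layer: logistic regression on the hidden vector
    h = np.exp(-((centroids - x) ** 2).sum(1) / (2 * sigma ** 2))
    return 1.0 / (1.0 + np.exp(-alpha @ h))

# with alpha = 0 the output is sigma(0) = 0.5 regardless of x
p = rbf_forward(np.ones(2), np.zeros((3, 2)), np.zeros(3))
```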
SLIDE 23

Radial Basis Function and Feed-Forward Networks

1. We now discuss the difference between the radial basis function network and the two-layer feed-forward neural network.

2. Basically, in the radial basis function network, the first hidden layer is not trained in an end-to-end fashion.

3. In other words, we fix the feature mapping $\phi(\cdot)$ when training the network. As a result, it has only the same representation power as a linear model with kernel methods.

4. However, in a regular two-layer feed-forward neural network, the feature mapping $\phi(\cdot)$ is trained, and it is potentially more powerful as it is more data- and task-specific.

23 / 24

SLIDE 24

THANKS!

24 / 24