A Neural Network View of Kernel Methods
Shuiwang Ji Department of Computer Science & Engineering Texas A&M University
1 / 24
Linear Models
1. In a binary classification problem, we have training data $\{\tilde{\mathbf{x}}_i, y_i\}_{i=1}^m$, where $\tilde{\mathbf{x}}_i \in \mathbb{R}^{n-1}$ represents the input feature vector and $y_i \in \{-1, 1\}$ is the corresponding label.
2. In logistic regression, for each sample $\tilde{\mathbf{x}}_i$, a linear classifier, parameterized by $\tilde{\mathbf{w}} \in \mathbb{R}^{n-1}$ and $b \in \mathbb{R}$, computes the classification score as
$$h(\tilde{\mathbf{x}}_i) = \sigma(\tilde{\mathbf{w}}^T \tilde{\mathbf{x}}_i + b) = \frac{1}{1 + \exp\left[-(\tilde{\mathbf{w}}^T \tilde{\mathbf{x}}_i + b)\right]}, \qquad (1)$$
where $\sigma(\cdot)$ is the sigmoid function.
3. Note that the classification score $h(\tilde{\mathbf{x}}_i)$ can be interpreted as the probability of $\tilde{\mathbf{x}}_i$ having label 1.
4. If we let $\mathbf{x}_i = \begin{bmatrix} 1 \\ \tilde{\mathbf{x}}_i \end{bmatrix}$ and $\mathbf{w} = \begin{bmatrix} b \\ \tilde{\mathbf{w}} \end{bmatrix}$, we can re-write Eqn. (1) as
$$h(\mathbf{x}_i) = \sigma(\mathbf{w}^T \mathbf{x}_i) = \frac{1}{1 + \exp\left[-\mathbf{w}^T \mathbf{x}_i\right]}. \qquad (2)$$
A code sketch of this score is given below.
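As a minimal sketch (assuming NumPy; the weights and inputs below are made-up numbers), the score in Eqn. (2) can be computed by absorbing the bias into the weight vector:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def logistic_score(w, x_tilde):
    # Prepend the constant 1 so that the bias b is absorbed into w, as in Eqn. (2).
    x = np.concatenate(([1.0], x_tilde))
    return sigmoid(w @ x)

# Example with a 2-dimensional input, so w = [b, w~] has 3 entries.
w = np.array([0.5, -1.0, 2.0])
x_tilde = np.array([0.3, 0.7])
print(logistic_score(w, x_tilde))   # interpreted as the probability of label 1
```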
2 / 24
1. If the training data $\{\mathbf{x}_i, y_i\}_{i=1}^m$ are linearly separable, there exists a $\mathbf{w}^* \in \mathbb{R}^n$ such that $y_i \mathbf{w}^{*T} \mathbf{x}_i \geq 0$, $i = 1, 2, \ldots, m$. In this case, a linear model like logistic regression can perfectly fit the original training data $\{\mathbf{x}_i, y_i\}_{i=1}^m$.
2. However, this is not possible for linearly inseparable cases.

[Figures: examples of linearly separable and linearly inseparable data.]
3 / 24
1. A typical method to handle such linearly inseparable cases is feature mapping: given $\{\mathbf{x}_i\}_{i=1}^m$, we use a feature mapping function $\phi : \mathbb{R}^n \to \mathbb{R}^N$ on $\{\mathbf{x}_i\}_{i=1}^m$, so that the mapped feature vectors $\{\phi(\mathbf{x}_i)\}_{i=1}^m$ are linearly separable.
2. For example, we can map the linearly inseparable data with
$$\phi(\mathbf{x}) = \phi\left(\begin{bmatrix} 1 \\ \tilde{x} \end{bmatrix}\right) = \begin{bmatrix} 1 \\ \tilde{x} \\ \tilde{x}^2 \end{bmatrix}. \qquad (3)$$
A code sketch of this mapping is given below.

[Figure: the data plotted against $x$ and $x^2$; in the $(x, x^2)$ plane the two classes become linearly separable.]
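The following sketch (assuming NumPy; the synthetic data and the separating direction are illustrative choices, not from the slides) applies the mapping in Eqn. (3) to 1-D data that is inseparable on the real line and checks that a linear classifier separates it after the mapping:

```python
import numpy as np

def phi(x_tilde):
    # Map a scalar input to [1, x, x^2], as in Eqn. (3).
    return np.array([1.0, x_tilde, x_tilde ** 2])

# Synthetic 1-D data: class 1 near the origin, class -1 elsewhere,
# which is not linearly separable on the real line.
x = np.linspace(-1.0, 1.0, 20)
y = np.where(np.abs(x) < 0.5, 1, -1)

features = np.stack([phi(v) for v in x])          # shape (20, 3)
# In the mapped space, the hyperplane 0.25 - x^2 = 0 separates the classes,
# i.e. w = [0.25, 0, -1] gives sign(w^T phi(x)) = y for every sample.
w = np.array([0.25, 0.0, -1.0])
print(np.all(np.sign(features @ w) == y))         # True
```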
4 / 24
In the context of logistic regression, the whole process can be described as
$$h(\mathbf{x}_i) = \sigma(\mathbf{w}^T \phi(\mathbf{x}_i)) = \frac{1}{1 + \exp\left[-\mathbf{w}^T \phi(\mathbf{x}_i)\right]}, \qquad (4)$$
where the dimension of the parameter vector $\mathbf{w}$ becomes $N$ accordingly.
1. In order to achieve strong enough representation power, it is common in practice that $\phi(\mathbf{x})$ has a much higher dimension than $\mathbf{x}$, i.e., $N \gg n$.
2. However, this dramatically increases the cost of computing either $\phi(\mathbf{x})$ or $\mathbf{w}^T \phi(\mathbf{x})$.
3. In the following, we introduce an efficient way to implicitly compute $\mathbf{w}^T \phi(\mathbf{x})$.
4. Specifically, we use the representer theorem to show that computing $\mathbf{w}^T \phi(\mathbf{x})$ can be transformed into computing $\sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)^T \phi(\mathbf{x})$, where $\{\alpha_i\}_{i=1}^m$ are learnable parameters (see the sketch after this list).
5. Then we introduce kernel methods, which significantly reduce the cost of computing $\phi(\mathbf{x}_i)^T \phi(\mathbf{x})$.
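The identity previewed in point 4 is easy to verify numerically. The sketch below (assuming NumPy, with random made-up values) checks that if $\mathbf{w} = \sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)$, then $\mathbf{w}^T \phi(\mathbf{x}) = \sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)^T \phi(\mathbf{x})$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 5, 8                       # number of training samples, feature dimension
Phi = rng.normal(size=(m, N))     # rows play the role of phi(x_i)
alpha = rng.normal(size=m)        # learnable coefficients
phi_x = rng.normal(size=N)        # phi(x) for a new input x

w = Phi.T @ alpha                 # w = sum_i alpha_i phi(x_i)
lhs = w @ phi_x                   # w^T phi(x)
rhs = alpha @ (Phi @ phi_x)       # sum_i alpha_i phi(x_i)^T phi(x)
print(np.allclose(lhs, rhs))      # True
```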
6 / 24
Given the training data $\{\mathbf{x}_i, y_i\}_{i=1}^m$ ($\mathbf{x}_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}$) and a feature mapping $\phi : \mathbb{R}^n \to \mathbb{R}^N$, to solve a supervised learning task (regression or classification), we need to do the following steps:
Compute the feature vectors $\{\phi(\mathbf{x}_i)\}_{i=1}^m$ of all training samples;
Initialize a linear model with a parameter vector $\mathbf{w} \in \mathbb{R}^N$;
Minimize the task-specific loss function $L$ on $\{\phi(\mathbf{x}_i), y_i\}_{i=1}^m$ with respect to $\mathbf{w}$.
7 / 24
1. Since the loss $L$ is a function of $z = \mathbf{w}^T \phi(\mathbf{x})$ and $y$, we can write it as $L(\mathbf{w}^T \phi(\mathbf{x}), y)$. Minimizing $L(\mathbf{w}^T \phi(\mathbf{x}), y)$ on $\{\phi(\mathbf{x}_i), y_i\}_{i=1}^m$ is the optimization problem
$$\min_{\mathbf{w}} \; \frac{1}{m} \sum_{i=1}^m L(\mathbf{w}^T \phi(\mathbf{x}_i), y_i). \qquad (5)$$
2. However, in many situations, minimizing $L$ alone may cause the problem of over-fitting. A common method to address over-fitting is to apply $\ell_2$-regularization, changing Equation (5) into Equation (6):
$$\min_{\mathbf{w}} \; \frac{1}{m} \sum_{i=1}^m L(\mathbf{w}^T \phi(\mathbf{x}_i), y_i) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2, \qquad (6)$$
where $\lambda \geq 0$ is a hyper-parameter, known as the regularization parameter, controlling the extent to which we penalize large $\ell_2$-norms of $\mathbf{w}$. A sketch of evaluating this objective is given below.
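As a minimal sketch (assuming NumPy, a squared loss chosen only for concreteness, and synthetic data), the regularized objective in Equation (6) can be evaluated as follows:

```python
import numpy as np

def regularized_objective(w, Phi, y, lam, loss=lambda z, y: 0.5 * (z - y) ** 2):
    # (1/m) * sum_i L(w^T phi(x_i), y_i) + (lam/2) * ||w||_2^2
    z = Phi @ w                       # z_i = w^T phi(x_i)
    return np.mean(loss(z, y)) + 0.5 * lam * np.dot(w, w)

# Synthetic example: m = 4 samples with N = 3 features phi(x_i).
Phi = np.array([[1.0,  0.2, 0.04],
                [1.0, -0.5, 0.25],
                [1.0,  0.9, 0.81],
                [1.0, -0.1, 0.01]])
y = np.array([1.0, -1.0, -1.0, 1.0])
w = np.zeros(3)
print(regularized_objective(w, Phi, y, lam=0.1))
```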
8 / 24
$$\min_{\mathbf{w}} \; \frac{1}{m} \sum_{i=1}^m L(\mathbf{w}^T \phi(\mathbf{x}_i), y_i) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2, \quad \lambda \geq 0. \qquad (7)$$
In order to derive a solution to this optimization problem, we introduce the following theorem, which is a special case of the well-known Representer Theorem. The Representer Theorem is the theoretical foundation of kernel methods.

Theorem 1. If the optimization problem in Equation (6) (copied above) has optimal solutions, there must exist an optimal solution of the form $\mathbf{w}^* = \sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)$.
9 / 24
Since the elements of $\{\phi(\mathbf{x}_i)\}_{i=1}^m$ are all in $\mathbb{R}^N$, they form a subspace $V \subseteq \mathbb{R}^N$ such that $V = \{\mathbf{x} : \mathbf{x} = \sum_{i=1}^m c_i \phi(\mathbf{x}_i)\}$. Assume $\{\mathbf{v}_1, \ldots, \mathbf{v}_{n'}\}$ is an orthonormal basis of $V$, where $n' \leq N$. $V$ also has an orthogonal complement subspace $V^{\perp}$, which has an orthonormal basis $\{\mathbf{u}_1, \ldots, \mathbf{u}_{N-n'}\}$. Clearly, $\mathbf{v}_k^T \mathbf{u}_j = 0$ for any $1 \leq k \leq n'$ and $1 \leq j \leq N - n'$.

For an arbitrary vector $\mathbf{w} \in \mathbb{R}^N$, we can decompose it into a linear combination of the orthonormal basis vectors of $V$ and $V^{\perp}$. That is, we can write $\mathbf{w}$ as
$$\mathbf{w} = \mathbf{w}_V + \mathbf{w}_{V^{\perp}} = \sum_{k=1}^{n'} s_k \mathbf{v}_k + \sum_{j=1}^{N-n'} t_j \mathbf{u}_j. \qquad (8)$$
A numerical check of this decomposition is given below.
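The sketch below (assuming NumPy, with made-up data) projects an arbitrary $\mathbf{w}$ onto $V = \mathrm{span}\{\phi(\mathbf{x}_i)\}$ and its orthogonal complement, checking the decomposition in Eqn. (8):

```python
import numpy as np

rng = np.random.default_rng(0)
m, N = 3, 6
Phi = rng.normal(size=(m, N))        # rows play the role of phi(x_i); V is their span
w = rng.normal(size=N)

Q, _ = np.linalg.qr(Phi.T)           # columns of Q: an orthonormal basis of V
w_V = Q @ (Q.T @ w)                  # projection of w onto V
w_perp = w - w_V                     # component in the orthogonal complement V_perp

print(np.allclose(w, w_V + w_perp))          # Eqn. (8): w = w_V + w_{V_perp}
print(np.allclose(Phi @ w_perp, 0.0))        # w_perp is orthogonal to every phi(x_i)
print(np.dot(w, w) >= np.dot(w_V, w_V))      # ||w||_2^2 >= ||w_V||_2^2, used next
```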
10 / 24
First, we can show that
$$\begin{aligned}
\|\mathbf{w}\|_2^2 &= \left\| \sum_{k=1}^{n'} s_k \mathbf{v}_k + \sum_{j=1}^{N-n'} t_j \mathbf{u}_j \right\|_2^2 \\
&= \left( \sum_{k=1}^{n'} s_k \mathbf{v}_k^T + \sum_{j=1}^{N-n'} t_j \mathbf{u}_j^T \right) \left( \sum_{k=1}^{n'} s_k \mathbf{v}_k + \sum_{j=1}^{N-n'} t_j \mathbf{u}_j \right) \\
&= \sum_{k=1}^{n'} s_k^2 \|\mathbf{v}_k\|_2^2 + \sum_{j=1}^{N-n'} t_j^2 \|\mathbf{u}_j\|_2^2 + 2 \sum_{k=1}^{n'} \sum_{j=1}^{N-n'} s_k t_j \mathbf{v}_k^T \mathbf{u}_j \\
&= \sum_{k=1}^{n'} s_k^2 \|\mathbf{v}_k\|_2^2 + \sum_{j=1}^{N-n'} t_j^2 \|\mathbf{u}_j\|_2^2 \\
&\geq \sum_{k=1}^{n'} s_k^2 \|\mathbf{v}_k\|_2^2 = \|\mathbf{w}_V\|_2^2.
\end{aligned} \qquad (9)$$
11 / 24
Second, because each $\phi(\mathbf{x}_i)$, $1 \leq i \leq m$, is a vector in $V$ and $\{\mathbf{v}_1, \ldots, \mathbf{v}_{n'}\}$ is an orthonormal basis of $V$, we have $\phi(\mathbf{x}_i) = \sum_{k=1}^{n'} \beta_{ik} \mathbf{v}_k$. This leads to the following equalities:
$$\mathbf{w}^T \phi(\mathbf{x}_i) = \left( \sum_{k=1}^{n'} s_k \mathbf{v}_k^T + \sum_{j=1}^{N-n'} t_j \mathbf{u}_j^T \right) \left( \sum_{k=1}^{n'} \beta_{ik} \mathbf{v}_k \right) = \sum_{k=1}^{n'} s_k \beta_{ik} \|\mathbf{v}_k\|_2^2 = \mathbf{w}_V^T \phi(\mathbf{x}_i). \qquad (10)$$
12 / 24
Based on the results in Eqn. (9) and Eqn. (10), and the fact that $\mathbf{w}_V$ is a vector in $V = \{\mathbf{x} : \mathbf{x} = \sum_{i=1}^m c_i \phi(\mathbf{x}_i)\}$, we can derive that, for an arbitrary $\mathbf{w}$, there always exists a $\mathbf{w}_V = \sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)$ satisfying
$$\frac{1}{m} \sum_{i=1}^m L(\mathbf{w}^T \phi(\mathbf{x}_i), y_i) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2 \;\geq\; \frac{1}{m} \sum_{i=1}^m L(\mathbf{w}_V^T \phi(\mathbf{x}_i), y_i) + \frac{\lambda}{2} \|\mathbf{w}_V\|_2^2. \qquad (11)$$
In other words, if a vector $\mathbf{w}^*$ minimizes $\frac{1}{m} \sum_{i=1}^m L(\mathbf{w}^T \phi(\mathbf{x}_i), y_i) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2$, the corresponding $\mathbf{w}^*_V$ must also minimize it, and there exist some $\{\alpha_i\}_{i=1}^m$ such that $\mathbf{w}^*_V = \sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)$.
13 / 24
1. According to Theorem 1, we only need to consider $\mathbf{w} \in \{\sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)\}$ when solving the optimization problem in Equation (6).
2. Therefore, we have the following transformed optimization problem by replacing $\mathbf{w}$ in Equation (6) with $\sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)$:
$$\min_{\alpha_1, \ldots, \alpha_m} \; \frac{1}{m} \sum_{j=1}^m L\left( \sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j), y_j \right) + \frac{\lambda}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j), \quad \lambda \geq 0. \qquad (12)$$
3. As a result, if we know $\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$ for all $1 \leq i, j \leq m$, we can compute the optimization objective in Equation (12) without explicitly knowing $\{\phi(\mathbf{x}_i)\}_{i=1}^m$; see the sketch below.
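A minimal sketch (assuming NumPy and, for concreteness, the logistic loss; all data are synthetic) of evaluating the objective in Equation (12) from the Gram matrix $K_{ij} = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$ alone:

```python
import numpy as np

def logistic_loss(z, y):
    # L(z, y) = log(1 + exp(-y z)) for y in {-1, +1}
    return np.log1p(np.exp(-y * z))

def kernel_objective(alpha, K, y, lam):
    # (1/m) sum_j L(sum_i alpha_i K_ij, y_j) + (lam/2) sum_ij alpha_i alpha_j K_ij
    z = K @ alpha
    return np.mean(logistic_loss(z, y)) + 0.5 * lam * alpha @ K @ alpha

# Synthetic Gram matrix, built from explicit features only to create the example.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(6, 4))          # 6 samples, 4-dimensional phi
K = Phi @ Phi.T                        # K_ij = phi(x_i)^T phi(x_j)
y = np.array([1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
alpha = np.zeros(6)
print(kernel_objective(alpha, K, y, lam=0.1))
```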
14 / 24
1. In addition, the output of the linear model with parameter $\mathbf{w}$ for any input $\phi(\mathbf{x})$ only depends on $\mathbf{w}^T \phi(\mathbf{x})$, i.e., $\sum_{i=1}^m \alpha_i \phi(\mathbf{x}_i)^T \phi(\mathbf{x})$.
2. So for any unseen $\mathbf{x}$ that is not in the training set, we can make predictions directly, without computing $\phi(\mathbf{x})$ first, if we know $\phi(\mathbf{x}_i)^T \phi(\mathbf{x})$ for all $1 \leq i \leq m$.
3. In summary, in both training and prediction, what we really need is the inner product of two feature vectors, not the feature vectors themselves.
15 / 24
1. In the following discussion, we call $k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$ the kernel function of $\mathbf{x}_i, \mathbf{x}_j$.
2. Note that, although $k(\mathbf{x}_i, \mathbf{x}_j)$ is the inner product of $\phi(\mathbf{x}_i)$ and $\phi(\mathbf{x}_j)$, we compute it directly from $\mathbf{x}_i$ and $\mathbf{x}_j$.
3. The advantage of using a kernel function is that, in many cases, computing $\phi(\cdot)$ may be very expensive or even infeasible, while computing $k(\mathbf{x}_i, \mathbf{x}_j)$ is much easier.
4. For example, consider $\phi(\mathbf{x}) = [1, \sqrt{2}x_1, \ldots, \sqrt{2}x_{n-1}, x_1^2, \ldots, x_{n-1}^2, \sqrt{2}x_1 x_2, \ldots, \sqrt{2}x_i x_j \,(i < j), \ldots, \sqrt{2}x_{n-2}x_{n-1}]^T$, where $\mathbf{x} = [1, x_1, \ldots, x_{n-1}]^T$.
5. For any two $n$-dimensional feature vectors $\mathbf{a}$ and $\mathbf{b}$, $\phi(\mathbf{a})$ and $\phi(\mathbf{b})$ are both $O(n^2)$-dimensional vectors.
6. So, the time complexity of computing $\phi(\mathbf{a})$ and $\phi(\mathbf{b})$ first and then their inner product is $O(n^2)$.
16 / 24
1. However, we show that the result can be computed from $\mathbf{a}^T \mathbf{b}$ more efficiently as
$$\begin{aligned}
\phi(\mathbf{a})^T \phi(\mathbf{b}) &= 1 + \sum_{i=1}^{n-1} 2 a_i b_i + \sum_{i=1}^{n-1} a_i^2 b_i^2 + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n-1} 2 a_i a_j b_i b_j \\
&= \sum_{i=0}^{n-1} a_i^2 b_i^2 + \sum_{i=0}^{n-1} \sum_{j=i+1}^{n-1} 2 a_i a_j b_i b_j \quad (a_0 = b_0 = 1) \\
&= \left( \sum_{i=0}^{n-1} a_i b_i \right)^2 = (\mathbf{a}^T \mathbf{b})^2,
\end{aligned} \qquad (13)$$
where the computational complexity is decreased from $O(n^2)$ to $O(n)$; a numerical check is given below.
2. We will introduce some concrete kernel functions like the radial basis function, whose corresponding feature map is infinite-dimensional and thus cannot be computed explicitly. In this case, using the kernel function is necessary.
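The identity in Equation (13) is easy to verify numerically. The sketch below (assuming NumPy) builds $\phi(\cdot)$ explicitly and compares $\phi(\mathbf{a})^T \phi(\mathbf{b})$ with $(\mathbf{a}^T \mathbf{b})^2$:

```python
import numpy as np
from itertools import combinations

def phi(x):
    # x = [1, x_1, ..., x_{n-1}]; the explicit O(n^2)-dimensional feature map above.
    t = x[1:]
    quad = [np.sqrt(2) * t[i] * t[j] for i, j in combinations(range(len(t)), 2)]
    return np.concatenate(([1.0], np.sqrt(2) * t, t ** 2, quad))

rng = np.random.default_rng(0)
a = np.concatenate(([1.0], rng.normal(size=4)))   # n = 5, leading entry fixed to 1
b = np.concatenate(([1.0], rng.normal(size=4)))

lhs = phi(a) @ phi(b)          # O(n^2) route through explicit features
rhs = (a @ b) ** 2             # O(n) route using the kernel function
print(np.isclose(lhs, rhs))    # True
```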
17 / 24
1. With the kernel function, the optimization problem in Equation (12) can be re-written as
$$\min_{\alpha_1, \ldots, \alpha_m} \; \frac{1}{m} \sum_{j=1}^m L\left( \sum_{i=1}^m \alpha_i k(\mathbf{x}_i, \mathbf{x}_j), y_j \right) + \frac{\lambda}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j k(\mathbf{x}_i, \mathbf{x}_j), \quad \lambda \geq 0. \qquad (14)$$
2. We can initialize $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_m)^T$, select the task-specific loss function $L$, and minimize the objective in Equation (14) on the training data via gradient descent; a sketch is given below. This approach is known as the kernel method.
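A minimal sketch of this procedure (assuming NumPy, the polynomial kernel $k(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^T \mathbf{x}_j)^2$ from Equation (13), the logistic loss, and plain full-batch gradient descent; the step size, iteration count, and toy data are made-up choices):

```python
import numpy as np

def poly_kernel(X1, X2):
    # k(x_i, x_j) = (x_i^T x_j)^2; rows of X1 and X2 are samples.
    return (X1 @ X2.T) ** 2

def fit_kernel_model(X, y, lam=0.01, lr=0.1, steps=5000):
    m = X.shape[0]
    K = poly_kernel(X, X)                      # Gram matrix, shape (m, m)
    alpha = np.zeros(m)
    for _ in range(steps):
        z = K @ alpha                          # z_j = sum_i alpha_i k(x_i, x_j)
        # Gradient of (1/m) sum_j log(1 + exp(-y_j z_j)) + (lam/2) alpha^T K alpha
        p = 1.0 / (1.0 + np.exp(y * z))        # sigma(-y_j z_j)
        grad = -(K @ (y * p)) / m + lam * (K @ alpha)
        alpha -= lr * grad
    return alpha

# Toy 1-D data embedded as x = [1, x~]: inseparable on the line, separable with this kernel.
x_tilde = np.linspace(-1.0, 1.0, 20)
X = np.stack([np.ones_like(x_tilde), x_tilde], axis=1)
y = np.where(np.abs(x_tilde) < 0.5, 1.0, -1.0)

alpha = fit_kernel_model(X, y)
pred = np.sign(poly_kernel(X, X) @ alpha)
print(np.mean(pred == y))                      # training accuracy on the toy data
```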
18 / 24
1. Concretely, using the example of logistic regression for binary classification problems, we now have
$$h(\mathbf{x}) = \sigma(\mathbf{w}^T \phi(\mathbf{x})) = \sigma\left( \sum_{i=1}^m \alpha_i k(\mathbf{x}_i, \mathbf{x}) \right) = \frac{1}{1 + \exp\left( -\sum_{i=1}^m \alpha_i k(\mathbf{x}_i, \mathbf{x}) \right)}.$$
2. When applying the binary cross entropy loss $L(z, y) = -y \log \frac{1}{1 + \exp(-z)} - (1 - y) \log \left( 1 - \frac{1}{1 + \exp(-z)} \right)$, we obtain the kernel logistic regression model. In addition, if we apply the hinge loss $L(z, y) = \max(1 - yz, 0)$, we obtain a kernel support vector machine. Both losses are written out as code below.
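For reference, the two losses mentioned above can be written as plain functions of the score $z$ and the label; a small sketch (assuming NumPy, with $y \in \{0, 1\}$ for the cross entropy and $y \in \{-1, +1\}$ for the hinge loss):

```python
import numpy as np

def bce_loss(z, y01):
    # Binary cross entropy with y in {0, 1}: -y log sigma(z) - (1 - y) log(1 - sigma(z)),
    # written with logaddexp for numerical stability.
    return (1.0 - y01) * z + np.logaddexp(0.0, -z)

def hinge_loss(z, y):
    # Hinge loss with y in {-1, +1}: max(1 - y z, 0); this choice yields a kernel SVM.
    return np.maximum(1.0 - y * z, 0.0)

z = np.array([2.0, -0.5, 0.1])
print(bce_loss(z, np.array([1.0, 0.0, 1.0])))
print(hinge_loss(z, np.array([1.0, -1.0, 1.0])))
```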
We now move one step further from the kernel methods by comparing them with a two-layer feed-forward neural network. In the kernel logistic regression model for binary classification with training data $\{\mathbf{x}_i, y_i\}_{i=1}^m$, given input $\mathbf{x}$, the output is
$$\sigma\left( \sum_{i=1}^m \alpha_i k(\mathbf{x}_i, \mathbf{x}) \right) = \sigma(\boldsymbol{\alpha}^T \mathbf{k}),$$
where $\boldsymbol{\alpha} = [\alpha_1, \alpha_2, \ldots, \alpha_m]^T$ is the learnable parameter and $\mathbf{k} = [k(\mathbf{x}_1, \mathbf{x}), k(\mathbf{x}_2, \mathbf{x}), \ldots, k(\mathbf{x}_m, \mathbf{x})]^T$ is an $m$-dimensional vector computed by a sequence of kernel functions. Basically, we compute the kernel functions first and obtain an $m$-dimensional vector $\mathbf{k}$. We then perform the original logistic regression with $\mathbf{k}$ as input. In a two-layer feed-forward neural network with a hidden size of $t$ for the same task, we first compute a hidden $t$-dimensional vector $\mathbf{h}$ and then perform the original logistic regression in Eqn. (2) with $\mathbf{h}$ as input. Note that the hidden size $t$ here is fixed. A side-by-side sketch of the two forward passes is given below.
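To make the analogy concrete, here is a side-by-side sketch (assuming NumPy; the kernel choice and all weights are random placeholders, not trained values) of the two forward passes: kernel logistic regression with an $m$-dimensional "hidden" vector $\mathbf{k}$, and a two-layer network with a fixed $t$-dimensional hidden vector $\mathbf{h}$:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
rng = np.random.default_rng(0)

m, n, t = 50, 5, 8
X_train = rng.normal(size=(m, n))     # training samples x_1, ..., x_m
x = rng.normal(size=n)                # a new input

# Kernel logistic regression: the "hidden" vector k has one entry per training sample.
alpha = rng.normal(size=m)                    # learnable
k_vec = (X_train @ x) ** 2                    # k(x_i, x) = (x_i^T x)^2, size m
out_kernel = sigmoid(alpha @ k_vec)

# Two-layer feed-forward network: the hidden size t is fixed and independent of m.
W1 = rng.normal(size=(t, n))                  # learnable first layer
w2 = rng.normal(size=t)                       # learnable second layer
h = sigmoid(W1 @ x)                           # hidden t-dimensional vector
out_nn = sigmoid(w2 @ h)

print(out_kernel, out_nn)
```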
20 / 24
1. If we treat the computation of $\mathbf{k}$ as the first "hidden layer", the kernel logistic regression model can be thought of as a two-layer feed-forward neural network. But here the hidden size and the number of parameters both grow with the number of training samples $m$.
2. This will lead to large computation and storage costs when there are millions of training samples. Borrowing the idea of a fixed hidden size from feed-forward neural networks, we can build the first hidden layer from $t$ representatives of all training samples, where $t$ is a fixed number.
3. A simple way to obtain $t$ representatives is to use the $t$ centroids $\{\mathbf{c}_i\}_{i=1}^t$ obtained from running the k-means clustering algorithm on all training samples $\{\mathbf{x}_i\}_{i=1}^m$.
4. When the hidden size is fixed to $t$, the dimension of the parameter vector $\boldsymbol{\alpha}$ is also fixed to $t$.
21 / 24
1. In practice, a very frequently used kernel function is the radial basis function
$$k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\frac{\|\mathbf{x}_i - \mathbf{x}_j\|_2^2}{2\sigma^2} \right).$$
2. If we build the hidden layer from the $t$ clustering centroids $\{\mathbf{c}_i\}_{i=1}^t$ and use the radial basis function in the kernel logistic regression model, we obtain a radial basis function (RBF) network; a sketch is given below.
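A minimal sketch of such an RBF network (assuming NumPy and scikit-learn's KMeans for the centroids; $\sigma$, $t$, and the data are made-up choices), where only $\boldsymbol{\alpha}$ would be trained while the centroids stay fixed:

```python
import numpy as np
from sklearn.cluster import KMeans

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def rbf(X, C, sigma=0.5):
    # k(x, c) = exp(-||x - c||_2^2 / (2 sigma^2)) for every sample/centroid pair.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Synthetic training data and a fixed hidden size t.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = 10
centroids = KMeans(n_clusters=t, n_init=10, random_state=0).fit(X).cluster_centers_

# Forward pass of the RBF network: hidden layer from the centroids, then logistic regression.
alpha = rng.normal(size=t)                    # learnable parameters (training not shown)
x_new = np.array([0.3, -0.7])
k_vec = rbf(x_new[None, :], centroids)[0]     # t-dimensional hidden vector
print(sigmoid(alpha @ k_vec))
```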
1. We now discuss the difference between the radial basis function network and the two-layer feed-forward neural network.
2. Basically, in the radial basis function network, the first hidden layer is not trained in an end-to-end fashion.
3. In other words, we fix the feature mapping $\phi(\cdot)$ when training the linear model with kernel methods.
4. However, in a regular two-layer feed-forward neural network, the feature mapping $\phi(\cdot)$ is trained and is potentially more powerful as it is more data and task specific.
23 / 24
24 / 24