Data Mining Techniques
CS 6220 - Section 2 - Spring 2017
Lecture 3
Jan-Willem van de Meent (credit: Yijun Zhao, Arthur Gretton Rasmussen & Williams)
Linear Regression
Assume f is a linear combination of D features:
y = f(x) + ε = wᵀx + ε,   ε ∼ Norm(0, σ²)
Learning task: estimate w. For N points we write:
Mean Squared Error (MSE):
E(w) = (1/N) Σ_{n=1}^{N} (wᵀxₙ − yₙ)² = (1/N) ‖Xw − y‖²
where X is the N × D matrix with rows x₁ᵀ, x₂ᵀ, …, x_Nᵀ, and y = (y₁, y₂, …, y_N)ᵀ.
Setting the gradient to zero:
E(w) = (1/N) ‖Xw − y‖²
∇E(w) = (2/N) Xᵀ(Xw − y) = 0
XᵀXw = Xᵀy  ⇒  w = X†y, where X† = (XᵀX)⁻¹Xᵀ is the 'pseudo-inverse' of X
Matrix Cookbook (on course website)
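As a quick sanity check of the normal equations above, here is a minimal NumPy sketch (the synthetic data and variable names are my own, not from the slides):

```python
import numpy as np

# Synthetic data: y = w_true^T x + noise
rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

# Normal equations: X^T X w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent solution via the pseudo-inverse w = X† y
w_pinv = np.linalg.pinv(X) @ y

print(w_hat, w_pinv)  # both are close to w_true
```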
Basis Function Regression / Polynomial Regression: replace the raw features x by basis functions φ(x), so y = wᵀφ(x) + ε. For N samples, E(w) = (1/N) ‖Φw − y‖², where Φ is the matrix with rows φ(xₙ)ᵀ.
[Figure: polynomial fits of degree M = 0, 1, 3, 9 to data (x, t); the M = 0 and M = 1 fits underfit, while the M = 9 fit overfits.]
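A minimal sketch of polynomial basis regression that reproduces the underfitting/overfitting behaviour in the figure (my own synthetic sine data; degrees M = 0, 1, 3, 9 as on the slide):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=N)

def poly_features(x, M):
    """Design matrix with columns x^0, x^1, ..., x^M."""
    return np.vander(x, M + 1, increasing=True)

for M in [0, 1, 3, 9]:
    Phi = poly_features(x, M)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    mse = np.mean((Phi @ w - t) ** 2)
    print(f"M = {M}: training MSE = {mse:.4f}")
# Training error keeps decreasing with M, but large M fits the noise.
```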
Probability warm-up:
What is the probability of rolling the sequence 1, 2, 3, 4, 5, 6 if we roll a die six times?
Suppose a given fraction of students like pizza. If three students are chosen at random with replacement, what is the probability that all three students like pizza?
[Figure: two bins of fruit. The red bin contains 2 apples and 6 oranges; the blue bin contains 3 apples and 1 orange (12 fruit in total).]
If I take a fruit from the red bin, what is the probability that I get an apple?
Conditional Probability: P(fruit = apple | bin = red) = 2 / 8
Joint Probability: P(fruit = apple, bin = red) = 2 / 12
Joint Probability: P(fruit = apple, bin = blue) = ?
Joint Probability: P(fruit = apple, bin = blue) = 3 / 12
Joint Probability: P(fruit = orange, bin = blue) = ?
Joint Probability: P(fruit = orange, bin = blue) = 1 / 12
Marginal Probability (sum rule): P(fruit = apple) = P(fruit = apple, bin = blue) + P(fruit = apple, bin = red) = ?
P(fruit = apple) = P(fruit = apple, bin = blue) + P(fruit = apple, bin = red) = 3/12 + 2/12 = 5/12
Product rule: P(fruit = apple, bin = red) = P(fruit = apple | bin = red) P(bin = red) = ?
P(fruit = apple, bin = red) = P(fruit = apple | bin = red) P(bin = red) = 2/8 × 8/12 = 2/12
Equivalently: P(fruit = apple, bin = red) = P(bin = red | fruit = apple) P(fruit = apple) = ?
P(fruit = apple, bin = red) = P(bin = red | fruit = apple) P(fruit = apple) = 2/5 × 5/12 = 2/12
Sum Rule: P(X) = Σ_Y P(X, Y)
Product Rule: P(X, Y) = P(Y | X) P(X)
Bayes' Rule: P(Y | X) = P(X | Y) P(Y) / P(X)   (posterior ∝ likelihood × prior)
Example: the probability of a rare disease is 0.005. The probability of detection (testing positive given the disease) is 0.98, and the probability of a false positive is 0.05. What is the probability of having the disease when the test is positive?
P(positive) = 0.98 × 0.005 + 0.05 × 0.995 ≈ 0.0547
P(positive, disease) = 0.98 × 0.005 = 0.0049
P(disease | positive) = 0.0049 / 0.0547 ≈ 0.09
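The same calculation, written out as a short Python check (variable names are mine):

```python
# Bayes' rule for the rare-disease example
p_disease = 0.005           # prior P(disease)
p_pos_given_disease = 0.98  # likelihood P(positive | disease)
p_pos_given_healthy = 0.05  # false-positive rate P(positive | no disease)

# Sum rule: P(positive) = sum of the joint probabilities over disease status
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' rule: P(disease | positive) = P(positive | disease) P(disease) / P(positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # ~0.09
```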
The Gaussian Distribution
Density (univariate): Norm(x | μ, σ²) = (2πσ²)^(−1/2) exp( −(x − μ)² / 2σ² )
Density (multivariate): Norm(x | μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )
Parameters: mean μᵢ = E[xᵢ] and covariance Σᵢⱼ = E[(xᵢ − μᵢ)(xⱼ − μⱼ)]
Quiz: which covariance matrix corresponds to which plot? Candidates include the identity [[1.0, 0.0], [0.0, 1.0]], a diagonal matrix with unequal variances, and the correlated matrices [[1.0, 0.5], [0.5, 1.0]] and [[1.0, −0.5], [−0.5, 1.0]].
Suppose that x and y are jointly Gaussian:
z = [x; y] ∼ N( [a; b], [[A, C], [Cᵀ, B]] )
Question: What are the marginal distributions p(x) and p(y)?
Answer: x ∼ N(a, A) and y ∼ N(b, B)
Suppose that x and y are jointly Gaussian:
z = [x; y] ∼ N( [a; b], [[A, C], [Cᵀ, B]] )
Question: What are the conditional distributions p(x | y) and p(y | x)?
Answer:
x | y ∼ N( a + C B⁻¹ (y − b),  A − C B⁻¹ Cᵀ )
y | x ∼ N( b + Cᵀ A⁻¹ (x − a),  B − Cᵀ A⁻¹ C )
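A small NumPy check of the conditioning formula for p(x | y) (block sizes and numbers are arbitrary toy values):

```python
import numpy as np

# Joint Gaussian z = [x; y] ~ N([a; b], [[A, C], [C^T, B]])
a, b = np.array([0.0]), np.array([1.0])
A = np.array([[2.0]])
B = np.array([[1.0]])
C = np.array([[0.8]])    # cross-covariance between x and y

y_obs = np.array([2.0])  # observed value of y

# Conditional: x | y ~ N(a + C B^{-1} (y - b), A - C B^{-1} C^T)
B_inv = np.linalg.inv(B)
cond_mean = a + C @ B_inv @ (y_obs - b)
cond_cov = A - C @ B_inv @ C.T
print(cond_mean, cond_cov)  # [0.8], [[1.36]]
```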
What is the probability … ?
Least Squares Objective: E(w) = (1/N) ‖Xw − y‖²
Likelihood: p(y | X, w) = ∏_{n=1}^{N} Norm(yₙ | wᵀxₙ, σ²)
Log-Likelihood: log p(y | X, w) = −(1/2σ²) Σ_{n=1}^{N} (yₙ − wᵀxₙ)² − (N/2) log(2πσ²)
Maximizing the likelihood minimizes the sum of squares.
Can we maximize p(w | X, y)? (This is known as maximum a posteriori (MAP) estimation.)
From Bayes' Rule: p(w | X, y) ∝ p(y | X, w) p(w)
Maximum a Posteriori is Equivalent to Ridge Regression: with a Gaussian prior w ∼ Norm(0, τ²I), maximizing the posterior is equivalent to minimizing ‖Xw − y‖² + λ‖w‖² with λ = σ²/τ².
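A minimal sketch of the resulting ridge/MAP estimator (synthetic data; here lam plays the role of λ = σ²/τ² under the Gaussian prior assumed above):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 50, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.5 * rng.normal(size=N)

lam = 1.0  # regularization strength (ratio of noise variance to prior variance)

# MAP / ridge solution: w = (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Compare to the unregularized maximum-likelihood solution
w_ml = np.linalg.solve(X.T @ X, X.T @ y)
print(np.linalg.norm(w_ridge), np.linalg.norm(w_ml))  # ridge solution is shrunk toward 0
```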
[Figure: polynomial fit with M = 3 on data (x, t).]
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
[Figure: polynomial fits of degree M = 0, 1, 2, 3, 5, 17 on the interval [−1, 2]; the vertical scale grows rapidly with M (up to roughly 10⁵ for M = 17), illustrating how high-degree fits behave wildly.]
Idea: sampling w ∼ p(w) defines a function wᵀφ(x), so p(w) is equivalent to a prior on functions.
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
[Figure: functions wᵀφ(x) with w drawn from the prior p(w).]
Can we reason about the posterior on functions? Idea: sample w ∼ p(w | X, y) and plot the resulting functions, for increasing regularization λ (a sketch of this procedure is given below).
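Here is a minimal sketch of that idea for Bayesian linear regression with a polynomial basis: draw weight vectors from the prior and from the Gaussian posterior over w, and evaluate wᵀφ(x) on a grid (data, basis, and hyperparameter values are my own choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Polynomial basis phi(x) = [1, x, x^2, x^3]
def phi(x, M=3):
    return np.vander(x, M + 1, increasing=True)

# Synthetic observations
x_train = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
y_train = np.sin(2 * x_train) + 0.1 * rng.normal(size=x_train.size)

sigma2 = 0.1**2   # noise variance
tau2 = 1.0        # prior variance on w:  w ~ N(0, tau2 * I)

Phi = phi(x_train)
D = Phi.shape[1]

# Posterior over w is Gaussian: S = (Phi^T Phi / sigma2 + I / tau2)^{-1}, m = S Phi^T y / sigma2
S = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(D) / tau2)
m = S @ Phi.T @ y_train / sigma2

x_grid = np.linspace(-1.5, 1.5, 7)
Phi_grid = phi(x_grid)

# Functions sampled from the prior ...
w_prior = rng.multivariate_normal(np.zeros(D), tau2 * np.eye(D), size=3)
# ... and from the posterior
w_post = rng.multivariate_normal(m, S, size=3)

print("prior samples:\n", Phi_grid @ w_prior.T)
print("posterior samples:\n", Phi_grid @ w_post.T)
```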
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
Weight-space view (with Φ = Φ(X)): the predictive distribution for f* = f(x*) under a Gaussian prior on the weights is
f* | x*, X, y ∼ N( σₙ⁻² φ(x*)ᵀ A⁻¹ Φ y,  φ(x*)ᵀ A⁻¹ φ(x*) ),  where A = σₙ⁻² Φ Φᵀ + Σₚ⁻¹.
[Figure panels: prior; predictive distribution on the function value; predictive distribution on observations.]
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
Idea: average over all possible values of w, rather than committing to a single estimate (shown for increasing λ).
Example: mapping φ(x) with linear and quadratic terms (roughly 1 + d + d²/2 terms).
Cost of explicit feature maps φ(x) (d input features, N data points; last column is for d = 100):
Quadratic (terms up to degree 2): > d²/2 terms, cost d²N²/4 → 2,500 N²
Cubic (terms up to degree 3): > d³/6 terms, cost d³N²/12 → 83,000 N²
Quartic (terms up to degree 4): > d⁴/24 terms, cost d⁴N²/48 → 1,960,000 N²
Idea: define a kernel function k(x, x′) = ⟨φ(x), φ(x′)⟩ such that k can be cheaper to evaluate than φ!
Kernel for polynomials up to degree q: k(x, x′) = (xᵀx′ + 1)^q, which costs only O(d) per evaluation regardless of q.
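A quick numerical illustration of why the kernel is cheaper: for the degree-2 polynomial kernel, the explicit feature map needs O(d²) features, while (xᵀx′ + 1)² needs only O(d) work, and both give the same value (toy vectors; the feature-map construction is the standard one, written out by hand):

```python
import numpy as np
from itertools import combinations

def phi_quadratic(x):
    """Explicit feature map whose inner product equals (x . x' + 1)^2."""
    d = x.shape[0]
    feats = [np.array([1.0]),
             np.sqrt(2.0) * x,                      # linear terms
             x ** 2,                                # squared terms
             np.array([np.sqrt(2.0) * x[i] * x[j]   # cross terms
                       for i, j in combinations(range(d), 2)])]
    return np.concatenate(feats)

def k_poly(x, xp, q=2):
    """Polynomial kernel (x . x' + 1)^q -- O(d) per evaluation."""
    return (x @ xp + 1.0) ** q

rng = np.random.default_rng(4)
x, xp = rng.normal(size=5), rng.normal(size=5)

print(phi_quadratic(x) @ phi_quadratic(xp))  # explicit map: O(d^2) features
print(k_poly(x, xp, q=2))                    # kernel: same value, O(d) work
```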
Borrowing from: Arthur Gretton (Gatsby, UCL)
Definition (Inner product). Let H be a vector space over R. A function ⟨·, ·⟩_H : H × H → R is an inner product on H if
1. Linear: ⟨α₁f₁ + α₂f₂, g⟩_H = α₁⟨f₁, g⟩_H + α₂⟨f₂, g⟩_H
2. Symmetric: ⟨f, g⟩_H = ⟨g, f⟩_H
3. ⟨f, f⟩_H ≥ 0, and ⟨f, f⟩_H = 0 if and only if f = 0.
Norm induced by the inner product: ‖f‖_H := √⟨f, f⟩_H
Fourier modes define a vector space
Definition (Kernel). Let X be a non-empty set. A function k : X × X → R is a kernel if there exists an R-Hilbert space H and a map φ : X → H such that ∀x, x′ ∈ X, k(x, x′) := ⟨φ(x), φ(x′)⟩_H.
Almost no conditions on X (e.g., X itself doesn't need an inner product; e.g., documents). A single kernel can correspond to several possible feature maps. A trivial example for X := R: φ₁(x) = x and φ₂(x) = [x/√2, x/√2].
Theorem (Sums of kernels are kernels). Given α > 0 and k, k₁, k₂ all kernels on X, then αk and k₁ + k₂ are kernels on X. (Proof via positive definiteness: later!) A difference of kernels may not be a kernel (why?)
Theorem (Mappings between spaces). Let X and X̃ be sets, and let A : X → X̃ be a map. Given a kernel k on X̃, then k(A(x), A(x′)) is a kernel on X. Example: k(x, x′) = x² (x′)².
Theorem (Products of kernels are kernels). Given k₁ on X₁ and k₂ on X₂, then k₁ × k₂ is a kernel on X₁ × X₂. If X₁ = X₂ = X, then k := k₁ × k₂ is a kernel on X. Proof: main idea only!
Theorem (Polynomial kernels). Let x, x′ ∈ Rᵈ for d ≥ 1, let m ≥ 1 be an integer and c > 0 a positive real. Then k(x, x′) := (⟨x, x′⟩ + c)ᵐ is a valid kernel. To prove: expand into a sum (with non-negative scalars) of kernels ⟨x, x′⟩ raised to integer powers. These individual terms are valid kernels by the product rule.
Definition. The space ℓ² (square-summable sequences) comprises all sequences a := (aᵢ)_{i≥1} for which ‖a‖²_{ℓ²} = Σ_{i=1}^{∞} aᵢ² < ∞.
Definition. Given a sequence of functions (φᵢ(x))_{i≥1} in ℓ², where φᵢ : X → R is the i-th coordinate of φ(x), then
k(x, x′) := Σ_{i=1}^{∞} φᵢ(x) φᵢ(x′)   (1)
Why square-summable? By Cauchy-Schwarz,
|Σ_{i=1}^{∞} φᵢ(x) φᵢ(x′)| ≤ ‖φ(x)‖_{ℓ²} ‖φ(x′)‖_{ℓ²},
so the sequence defining the inner product converges for all x, x′ ∈ X.
Definition (Taylor series kernel). For r ∈ (0, ∞], with aₙ ≥ 0 for all n ≥ 0, let
f(z) = Σ_{n=0}^{∞} aₙ zⁿ,  |z| < r, z ∈ R.
Define X to be the √r-ball in Rᵈ, so ‖x‖ < √r. Then
k(x, x′) = f(⟨x, x′⟩) = Σ_{n=0}^{∞} aₙ ⟨x, x′⟩ⁿ
is a kernel.
Example (Exponential kernel): k(x, x′) := exp(⟨x, x′⟩).
Example (Gaussian kernel). The Gaussian kernel on Rᵈ is defined as k(x, x′) := exp(−γ² ‖x − x′‖²). Proof: an exercise! Use the product rule, mapping rule, and exponential kernel.
(Also known as the Radial Basis Function (RBF) or Squared Exponential (SE) kernel; with a separate lengthscale per input dimension it is called Automatic Relevance Determination (ARD).)
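A minimal implementation of the Gaussian (RBF) kernel and its Gram matrix, using the γ parameterization from the slide (the vectorized distance computation is my own):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    """Gaussian / RBF kernel k(x, x') = exp(-gamma^2 * ||x - x'||^2).

    X1: (n, d) array, X2: (m, d) array; returns the (n, m) Gram matrix.
    """
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return np.exp(-gamma**2 * np.maximum(sq_dists, 0.0))

X = np.random.default_rng(5).normal(size=(4, 3))
K = rbf_kernel(X, X)
print(np.allclose(K, K.T), np.all(np.linalg.eigvalsh(K) > -1e-10))  # symmetric, PSD
```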
Base kernels:
Squared-exp (SE): k(x, x′) = σf² exp( −(x − x′)² / (2ℓ²) )
Periodic (Per): k(x, x′) = σf² exp( −(2/ℓ²) sin²( π(x − x′)/p ) )
Linear (Lin): k(x, x′) = σf² (x − c)(x′ − c)
New kernels can be built as products, e.g. Lin × Lin, SE × Per, Lin × SE, Lin × Per.
[Figure: each kernel plotted as a function of x − x′ (or of x with x′ = 1 for Lin), together with functions drawn from the corresponding GP prior.]
source: David Duvenaud (PhD Thesis)
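A sketch of the three base kernels and one product kernel as plain Python functions (hyperparameter names σf, ℓ, p, c follow the table; the default values are arbitrary):

```python
import numpy as np

def k_se(x, xp, sigma_f=1.0, ell=1.0):
    """Squared-exponential: sigma_f^2 * exp(-(x - x')^2 / (2 ell^2))."""
    return sigma_f**2 * np.exp(-(x - xp)**2 / (2.0 * ell**2))

def k_per(x, xp, sigma_f=1.0, ell=1.0, p=1.0):
    """Periodic: sigma_f^2 * exp(-(2 / ell^2) * sin^2(pi (x - x') / p))."""
    return sigma_f**2 * np.exp(-2.0 / ell**2 * np.sin(np.pi * (x - xp) / p)**2)

def k_lin(x, xp, sigma_f=1.0, c=0.0):
    """Linear: sigma_f^2 * (x - c) * (x' - c)."""
    return sigma_f**2 * (x - c) * (xp - c)

# Products of kernels are kernels, e.g. a locally periodic kernel SE x Per
def k_se_times_per(x, xp):
    return k_se(x, xp) * k_per(x, xp)

print(k_se(0.3, 0.7), k_per(0.3, 0.7), k_lin(0.3, 0.7), k_se_times_per(0.3, 0.7))
```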
Definition (Positive definite functions). A symmetric function k : X × X → R is positive definite if ∀n ≥ 1, ∀(a₁, …, aₙ) ∈ Rⁿ, ∀(x₁, …, xₙ) ∈ Xⁿ,
Σ_{i=1}^{n} Σ_{j=1}^{n} aᵢ aⱼ k(xᵢ, xⱼ) ≥ 0.
The function k(·, ·) is strictly positive definite if, for mutually distinct xᵢ, the equality holds only when all the aᵢ are zero.
Theorem. Let H be a Hilbert space, X a non-empty set, and φ : X → H. Then ⟨φ(x), φ(y)⟩_H =: k(x, y) is positive definite.
Proof.
Σ_{i=1}^{n} Σ_{j=1}^{n} aᵢ aⱼ k(xᵢ, xⱼ) = Σ_{i=1}^{n} Σ_{j=1}^{n} ⟨aᵢ φ(xᵢ), aⱼ φ(xⱼ)⟩_H = ‖ Σ_{i=1}^{n} aᵢ φ(xᵢ) ‖²_H ≥ 0.
The reverse also holds: a positive definite k(x, x′) is the inner product in a unique H (Moore-Aronszajn theorem: coming later!).
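The definition can be checked empirically on any finite set of points: the Gram matrix Kᵢⱼ = k(xᵢ, xⱼ) must have no negative eigenvalues. A small sketch, also illustrating why a difference of kernels may fail to be a kernel (point set and lengthscales are arbitrary):

```python
import numpy as np

def k_se(x, xp, ell=1.0):
    return np.exp(-(x - xp)**2 / (2.0 * ell**2))

x = np.linspace(-3, 3, 20)
K = k_se(x[:, None], x[None, :])          # Gram matrix K_ij = k(x_i, x_j)

eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)            # True: K is positive semi-definite

# A difference of kernels need not be a kernel:
K_diff = k_se(x[:, None], x[None, :], ell=1.0) - 2.0 * k_se(x[:, None], x[None, :], ell=0.5)
print(np.linalg.eigvalsh(K_diff).min())   # negative eigenvalue -> not positive definite
```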
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
Idea: define a prior on functions directly, using a kernel function to define the covariance:
m(x) = E[f(x)],  k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))],
f(x) ∼ GP( m(x), k(x, x′) ).
Any finite set of function values is jointly Gaussian distributed.
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
Observations y and test function values f* are jointly Gaussian:
[y; f*] ∼ N( 0, [[K(X, X) + σₙ²I, K(X, X*)], [K(X*, X), K(X*, X*)]] )
so we can fill in the standard relations for Gaussians.
Recall: for z = [x; y] ∼ N( [a; b], [[A, C], [Cᵀ, B]] ), the conditional is x | y ∼ N( a + C B⁻¹ (y − b), A − C B⁻¹ Cᵀ ).
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
[y; f*] ∼ N( 0, [[K(X, X) + σₙ²I, K(X, X*)], [K(X*, X), K(X*, X*)]] )
Conditioning on the observations gives the GP posterior predictive:
f* | X, y, X* ∼ N( f̄*, cov(f*) ),
f̄* = E[f* | X, y, X*] = K(X*, X) [K(X, X) + σₙ²I]⁻¹ y,
cov(f*) = K(X*, X*) − K(X*, X) [K(X, X) + σₙ²I]⁻¹ K(X, X*).
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
[Figure: GP regression on inputs x ∈ [−5, 5]: samples from the prior (left) and from the posterior after conditioning on observations (right).]
For a single test input x*, the predictive distribution on the observation y* is
p(y* | x*, x, y) = N( k(x*, x)ᵀ [K + σ²_noise I]⁻¹ y,  k(x*, x*) + σ²_noise − k(x*, x)ᵀ [K + σ²_noise I]⁻¹ k(x*, x) ).
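Putting the pieces together, a minimal GP regression sketch that computes the predictive mean and variance from the formulas above (squared-exponential covariance, synthetic data, my own variable names):

```python
import numpy as np

def k_se(A, B, ell=1.0, sigma_f=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays A and B."""
    sq = (A[:, None] - B[None, :])**2
    return sigma_f**2 * np.exp(-sq / (2.0 * ell**2))

rng = np.random.default_rng(6)
x = np.linspace(-4, 4, 8)                      # training inputs
y = np.sin(x) + 0.1 * rng.normal(size=x.size)  # noisy training targets
x_star = np.linspace(-5, 5, 100)               # test inputs

sigma_n2 = 0.1**2                              # noise variance

K = k_se(x, x) + sigma_n2 * np.eye(x.size)     # K(X, X) + sigma_n^2 I
K_s = k_se(x_star, x)                          # K(X*, X)
K_ss = k_se(x_star, x_star)                    # K(X*, X*)

alpha = np.linalg.solve(K, y)
mean = K_s @ alpha                                          # predictive mean
cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)                # predictive covariance
std = np.sqrt(np.maximum(np.diag(cov), 0.0) + sigma_n2)     # predictive std on y*

print(mean[:5], std[:5])
```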
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
[Figure: a function drawn at random from a Gaussian process with Gaussian (squared exponential) covariance.]
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
Characteristic Lengthscales
[Figure: the mean posterior predictive function plotted for 3 different lengthscales ('too long', 'about right', 'too short'), fit to the same noisy observations; input x on the horizontal axis, function value y on the vertical axis.]