Data Mining Techniques
CS 6220 - Section 2 - Spring 2017
Lecture 3
Jan-Willem van de Meent (credit: Yijun Zhao, Arthur Gretton Rasmussen & Williams)
Linear Regression
Assume f is a linear combination of D features:
y = f(x) + ε = wᵀx + ε,   ε ∼ Norm(0, σ²)
Learning task: estimate w. For N points we write:
Mean Squared Error (MSE):
E(w) = (1/N) Σ_{n=1}^{N} (wᵀxₙ − yₙ)² = (1/N) ‖Xw − y‖²
where X is the N × D matrix with rows x₁ᵀ, x₂ᵀ, …, x_Nᵀ, and y = (y₁, y₂, …, y_N)ᵀ.
Setting the gradient to zero:
E(w) = (1/N) ‖Xw − y‖²
∇E(w) = (2/N) Xᵀ(Xw − y) = 0
XᵀXw = Xᵀy  ⇒  w = X†y, where X† = (XᵀX)⁻¹Xᵀ is the 'pseudo-inverse' of X
Matrix Cookbook (on course website)
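As a quick sanity check of the normal equations above, here is a minimal NumPy sketch (the synthetic data and variable names are my own, not from the slides):

```python
import numpy as np

# Synthetic data: y = w_true^T x + noise
rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=N)

# Normal equations: X^T X w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent solution via the pseudo-inverse w = X† y
w_pinv = np.linalg.pinv(X) @ y

print(w_hat, w_pinv)  # both are close to w_true
```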
Basis Function Regression / Polynomial Regression: replace the raw features x by basis functions φ(x), so y = wᵀφ(x) + ε. For N samples, E(w) = (1/N) ‖Φw − y‖², where Φ is the matrix with rows φ(xₙ)ᵀ.
[Figure: polynomial fits of degree M = 0, 1, 3, 9 to data (x, t); the M = 0 and M = 1 fits underfit, while the M = 9 fit overfits.]
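A minimal sketch of polynomial basis regression that reproduces the underfitting/overfitting behaviour in the figure (my own synthetic sine data; degrees M = 0, 1, 3, 9 as on the slide):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=N)

def poly_features(x, M):
    """Design matrix with columns x^0, x^1, ..., x^M."""
    return np.vander(x, M + 1, increasing=True)

for M in [0, 1, 3, 9]:
    Phi = poly_features(x, M)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    mse = np.mean((Phi @ w - t) ** 2)
    print(f"M = {M}: training MSE = {mse:.4f}")
# Training error keeps decreasing with M, but large M fits the noise.
```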
Probability warm-up:
What is the probability of rolling the sequence 1, 2, 3, 4, 5, 6 if we roll a die six times?
Suppose a given fraction of students like pizza. If three students are chosen at random with replacement, what is the probability that all three students like pizza?
[Figure: two bins of fruit. The red bin contains 2 apples and 6 oranges; the blue bin contains 3 apples and 1 orange (12 fruit in total).]
If I take a fruit from the red bin, what is the probability that I get an apple?
Conditional Probability: P(fruit = apple | bin = red) = 2 / 8
Joint Probability: P(fruit = apple, bin = red) = 2 / 12
Joint Probability: P(fruit = apple, bin = blue) = ?
Joint Probability: P(fruit = apple, bin = blue) = 3 / 12
Joint Probability: P(fruit = orange, bin = blue) = ?
Joint Probability: P(fruit = orange, bin = blue) = 1 / 12
Marginal Probability (sum rule): P(fruit = apple) = P(fruit = apple, bin = blue) + P(fruit = apple, bin = red) = ?
P(fruit = apple) = P(fruit = apple, bin = blue) + P(fruit = apple, bin = red) = 3/12 + 2/12 = 5/12
Product rule: P(fruit = apple, bin = red) = P(fruit = apple | bin = red) P(bin = red) = ?
P(fruit = apple, bin = red) = P(fruit = apple | bin = red) P(bin = red) = 2/8 × 8/12 = 2/12
Equivalently: P(fruit = apple, bin = red) = P(bin = red | fruit = apple) P(fruit = apple) = ?
P(fruit = apple, bin = red) = P(bin = red | fruit = apple) P(fruit = apple) = 2/5 × 5/12 = 2/12
Sum Rule: P(X) = Σ_Y P(X, Y)
Product Rule: P(X, Y) = P(Y | X) P(X)
Bayes' Rule: P(Y | X) = P(X | Y) P(Y) / P(X)   (posterior ∝ likelihood × prior)
Example: the probability of a rare disease is 0.005. The probability of detection (testing positive given the disease) is 0.98, and the probability of a false positive is 0.05. What is the probability of having the disease when the test is positive?
P(positive) = 0.98 × 0.005 + 0.05 × 0.995 ≈ 0.0547
P(positive, disease) = 0.98 × 0.005 = 0.0049
P(disease | positive) = 0.0049 / 0.0547 ≈ 0.09
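The same calculation, written out as a short Python check (variable names are mine):

```python
# Bayes' rule for the rare-disease example
p_disease = 0.005           # prior P(disease)
p_pos_given_disease = 0.98  # likelihood P(positive | disease)
p_pos_given_healthy = 0.05  # false-positive rate P(positive | no disease)

# Sum rule: P(positive) = sum of the joint probabilities over disease status
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' rule: P(disease | positive) = P(positive | disease) P(disease) / P(positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # ~0.09
```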
The Gaussian Distribution
Density (univariate): Norm(x | μ, σ²) = (2πσ²)^(−1/2) exp( −(x − μ)² / 2σ² )
Density (multivariate): Norm(x | μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp( −½ (x − μ)ᵀ Σ⁻¹ (x − μ) )
Parameters: mean μᵢ = E[xᵢ] and covariance Σᵢⱼ = E[(xᵢ − μᵢ)(xⱼ − μⱼ)]
Quiz: which covariance matrix corresponds to which plot? Candidates include the identity [[1.0, 0.0], [0.0, 1.0]], a diagonal matrix with unequal variances, and the correlated matrices [[1.0, 0.5], [0.5, 1.0]] and [[1.0, −0.5], [−0.5, 1.0]].
Suppose that x and y are jointly Gaussian:
z = [x; y] ∼ N( [a; b], [[A, C], [Cᵀ, B]] )
Question: What are the marginal distributions p(x) and p(y)?
Answer: x ∼ N(a, A) and y ∼ N(b, B)
Suppose that x and y are jointly Gaussian:
z = [x; y] ∼ N( [a; b], [[A, C], [Cᵀ, B]] )
Question: What are the conditional distributions p(x | y) and p(y | x)?
Answer:
x | y ∼ N( a + C B⁻¹ (y − b),  A − C B⁻¹ Cᵀ )
y | x ∼ N( b + Cᵀ A⁻¹ (x − a),  B − Cᵀ A⁻¹ C )
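A small NumPy check of the conditioning formula for p(x | y) (block sizes and numbers are arbitrary toy values):

```python
import numpy as np

# Joint Gaussian z = [x; y] ~ N([a; b], [[A, C], [C^T, B]])
a, b = np.array([0.0]), np.array([1.0])
A = np.array([[2.0]])
B = np.array([[1.0]])
C = np.array([[0.8]])    # cross-covariance between x and y

y_obs = np.array([2.0])  # observed value of y

# Conditional: x | y ~ N(a + C B^{-1} (y - b), A - C B^{-1} C^T)
B_inv = np.linalg.inv(B)
cond_mean = a + C @ B_inv @ (y_obs - b)
cond_cov = A - C @ B_inv @ C.T
print(cond_mean, cond_cov)  # [0.8], [[1.36]]
```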
What is the probability … ?
Least Squares Objective: E(w) = (1/N) ‖Xw − y‖²
Likelihood: p(y | X, w) = ∏_{n=1}^{N} Norm(yₙ | wᵀxₙ, σ²)
Log-Likelihood: log p(y | X, w) = −(1/2σ²) Σ_{n=1}^{N} (yₙ − wᵀxₙ)² − (N/2) log(2πσ²)
Maximizing the likelihood minimizes the sum of squares.
Can we maximize p(w | X, y)? (This is known as maximum a posteriori (MAP) estimation.)
From Bayes' Rule: p(w | X, y) ∝ p(y | X, w) p(w)
Maximum a Posteriori is Equivalent to Ridge Regression: with a Gaussian prior w ∼ Norm(0, τ²I), maximizing the posterior is equivalent to minimizing ‖Xw − y‖² + λ‖w‖² with λ = σ²/τ².
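A minimal sketch of the resulting ridge/MAP estimator (synthetic data; here lam plays the role of λ = σ²/τ² under the Gaussian prior assumed above):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 50, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.5 * rng.normal(size=N)

lam = 1.0  # regularization strength (ratio of noise variance to prior variance)

# MAP / ridge solution: w = (X^T X + lam I)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Compare to the unregularized maximum-likelihood solution
w_ml = np.linalg.solve(X.T @ X, X.T @ y)
print(np.linalg.norm(w_ridge), np.linalg.norm(w_ml))  # ridge solution is shrunk toward 0
```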
[Figure: polynomial fit with M = 3 on data (x, t).]
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
[Figure: polynomial fits of degree M = 0, 1, 2, 3, 5, 17 on the interval [−1, 2]; the vertical scale grows rapidly with M (up to roughly 10⁵ for M = 17), illustrating how high-degree fits behave wildly.]
Idea: sampling w ∼ p(w) defines a function wᵀφ(x), so p(w) is equivalent to a prior on functions.
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
[Figure: functions wᵀφ(x) with w drawn from the prior p(w).]
Can we reason about the posterior on functions? Idea: sample w ∼ p(w | X, y) and plot the resulting functions, for increasing regularization λ (a sketch of this procedure is given below).
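Here is a minimal sketch of that idea for Bayesian linear regression with a polynomial basis: draw weight vectors from the prior and from the Gaussian posterior over w, and evaluate wᵀφ(x) on a grid (data, basis, and hyperparameter values are my own choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Polynomial basis phi(x) = [1, x, x^2, x^3]
def phi(x, M=3):
    return np.vander(x, M + 1, increasing=True)

# Synthetic observations
x_train = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
y_train = np.sin(2 * x_train) + 0.1 * rng.normal(size=x_train.size)

sigma2 = 0.1**2   # noise variance
tau2 = 1.0        # prior variance on w:  w ~ N(0, tau2 * I)

Phi = phi(x_train)
D = Phi.shape[1]

# Posterior over w is Gaussian: S = (Phi^T Phi / sigma2 + I / tau2)^{-1}, m = S Phi^T y / sigma2
S = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(D) / tau2)
m = S @ Phi.T @ y_train / sigma2

x_grid = np.linspace(-1.5, 1.5, 7)
Phi_grid = phi(x_grid)

# Functions sampled from the prior ...
w_prior = rng.multivariate_normal(np.zeros(D), tau2 * np.eye(D), size=3)
# ... and from the posterior
w_post = rng.multivariate_normal(m, S, size=3)

print("prior samples:\n", Phi_grid @ w_prior.T)
print("posterior samples:\n", Phi_grid @ w_post.T)
```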
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
Weight-space view (with Φ = Φ(X)): the predictive distribution for f* = f(x*) under a Gaussian prior on the weights is
f* | x*, X, y ∼ N( σₙ⁻² φ(x*)ᵀ A⁻¹ Φ y,  φ(x*)ᵀ A⁻¹ φ(x*) ),  where A = σₙ⁻² Φ Φᵀ + Σₚ⁻¹.
[Figure panels: prior; predictive distribution on the function value; predictive distribution on observations.]
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
Idea: average over all possible values of w, rather than committing to a single estimate (shown for increasing λ).
Example: mapping φ(x) with linear and quadratic terms (roughly 1 + d + d²/2 terms).
Cost of explicit feature maps φ(x) (d input features, N data points; last column is for d = 100):
Quadratic (terms up to degree 2): > d²/2 terms, cost d²N²/4 → 2,500 N²
Cubic (terms up to degree 3): > d³/6 terms, cost d³N²/12 → 83,000 N²
Quartic (terms up to degree 4): > d⁴/24 terms, cost d⁴N²/48 → 1,960,000 N²
Idea: define a kernel function k(x, x′) = ⟨φ(x), φ(x′)⟩ such that k can be cheaper to evaluate than φ!
Kernel for polynomials up to degree q: k(x, x′) = (xᵀx′ + 1)^q, which costs only O(d) per evaluation regardless of q.
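A quick numerical illustration of why the kernel is cheaper: for the degree-2 polynomial kernel, the explicit feature map needs O(d²) features, while (xᵀx′ + 1)² needs only O(d) work, and both give the same value (toy vectors; the feature-map construction is the standard one, written out by hand):

```python
import numpy as np
from itertools import combinations

def phi_quadratic(x):
    """Explicit feature map whose inner product equals (x . x' + 1)^2."""
    d = x.shape[0]
    feats = [np.array([1.0]),
             np.sqrt(2.0) * x,                      # linear terms
             x ** 2,                                # squared terms
             np.array([np.sqrt(2.0) * x[i] * x[j]   # cross terms
                       for i, j in combinations(range(d), 2)])]
    return np.concatenate(feats)

def k_poly(x, xp, q=2):
    """Polynomial kernel (x . x' + 1)^q -- O(d) per evaluation."""
    return (x @ xp + 1.0) ** q

rng = np.random.default_rng(4)
x, xp = rng.normal(size=5), rng.normal(size=5)

print(phi_quadratic(x) @ phi_quadratic(xp))  # explicit map: O(d^2) features
print(k_poly(x, xp, q=2))                    # kernel: same value, O(d) work
```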
Borrowing from: Arthur Gretton (Gatsby, UCL)
Definition (Inner product). Let H be a vector space over R. A function ⟨·, ·⟩_H : H × H → R is an inner product on H if
1. Linear: ⟨α₁f₁ + α₂f₂, g⟩_H = α₁⟨f₁, g⟩_H + α₂⟨f₂, g⟩_H
2. Symmetric: ⟨f, g⟩_H = ⟨g, f⟩_H
3. ⟨f, f⟩_H ≥ 0, and ⟨f, f⟩_H = 0 if and only if f = 0.
Norm induced by the inner product: ‖f‖_H := √⟨f, f⟩_H
Fourier modes define a vector space
Definition (Kernel). Let X be a non-empty set. A function k : X × X → R is a kernel if there exists an R-Hilbert space H and a map φ : X → H such that ∀x, x′ ∈ X, k(x, x′) := ⟨φ(x), φ(x′)⟩_H.
Almost no conditions on X (e.g., X itself doesn't need an inner product; e.g., documents). A single kernel can correspond to several possible feature maps. A trivial example for X := R: φ₁(x) = x and φ₂(x) = [x/√2, x/√2].
Theorem (Sums of kernels are kernels). Given α > 0 and k, k₁, k₂ all kernels on X, then αk and k₁ + k₂ are kernels on X. (Proof via positive definiteness: later!) A difference of kernels may not be a kernel (why?)
Theorem (Mappings between spaces). Let X and X̃ be sets, and let A : X → X̃ be a map. Given a kernel k on X̃, then k(A(x), A(x′)) is a kernel on X. Example: k(x, x′) = x² (x′)².
Theorem (Products of kernels are kernels). Given k₁ on X₁ and k₂ on X₂, then k₁ × k₂ is a kernel on X₁ × X₂. If X₁ = X₂ = X, then k := k₁ × k₂ is a kernel on X. Proof: main idea only!
Theorem (Polynomial kernels). Let x, x′ ∈ Rᵈ for d ≥ 1, let m ≥ 1 be an integer and c > 0 a positive real. Then k(x, x′) := (⟨x, x′⟩ + c)ᵐ is a valid kernel. To prove: expand into a sum (with non-negative scalars) of kernels ⟨x, x′⟩ raised to integer powers. These individual terms are valid kernels by the product rule.
Definition. The space ℓ² (square-summable sequences) comprises all sequences a := (aᵢ)_{i≥1} for which ‖a‖²_{ℓ²} = Σ_{i=1}^{∞} aᵢ² < ∞.
Definition. Given a sequence of functions (φᵢ(x))_{i≥1} in ℓ², where φᵢ : X → R is the i-th coordinate of φ(x), then
k(x, x′) := Σ_{i=1}^{∞} φᵢ(x) φᵢ(x′)   (1)
Why square-summable? By Cauchy-Schwarz,
|Σ_{i=1}^{∞} φᵢ(x) φᵢ(x′)| ≤ ‖φ(x)‖_{ℓ²} ‖φ(x′)‖_{ℓ²},
so the sequence defining the inner product converges for all x, x′ ∈ X.
Definition (Taylor series kernel). For r ∈ (0, ∞], with aₙ ≥ 0 for all n ≥ 0, let
f(z) = Σ_{n=0}^{∞} aₙ zⁿ,  |z| < r, z ∈ R.
Define X to be the √r-ball in Rᵈ, so ‖x‖ < √r. Then
k(x, x′) = f(⟨x, x′⟩) = Σ_{n=0}^{∞} aₙ ⟨x, x′⟩ⁿ
is a kernel.
Example (Exponential kernel): k(x, x′) := exp(⟨x, x′⟩).
Example (Gaussian kernel). The Gaussian kernel on Rᵈ is defined as k(x, x′) := exp(−γ² ‖x − x′‖²). Proof: an exercise! Use the product rule, mapping rule, and exponential kernel.
(Also known as the Radial Basis Function (RBF) or Squared Exponential (SE) kernel; with a separate lengthscale per input dimension it is called Automatic Relevance Determination (ARD).)
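A minimal implementation of the Gaussian (RBF) kernel and its Gram matrix, using the γ parameterization from the slide (the vectorized distance computation is my own):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    """Gaussian / RBF kernel k(x, x') = exp(-gamma^2 * ||x - x'||^2).

    X1: (n, d) array, X2: (m, d) array; returns the (n, m) Gram matrix.
    """
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return np.exp(-gamma**2 * np.maximum(sq_dists, 0.0))

X = np.random.default_rng(5).normal(size=(4, 3))
K = rbf_kernel(X, X)
print(np.allclose(K, K.T), np.all(np.linalg.eigvalsh(K) > -1e-10))  # symmetric, PSD
```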
Base kernels:
Squared-exp (SE): k(x, x′) = σf² exp( −(x − x′)² / (2ℓ²) )
Periodic (Per): k(x, x′) = σf² exp( −(2/ℓ²) sin²( π(x − x′)/p ) )
Linear (Lin): k(x, x′) = σf² (x − c)(x′ − c)
New kernels can be built as products, e.g. Lin × Lin, SE × Per, Lin × SE, Lin × Per.
[Figure: each kernel plotted as a function of x − x′ (or of x with x′ = 1 for Lin), together with functions drawn from the corresponding GP prior.]
source: David Duvenaud (PhD Thesis)
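A sketch of the three base kernels and one product kernel as plain Python functions (hyperparameter names σf, ℓ, p, c follow the table; the default values are arbitrary):

```python
import numpy as np

def k_se(x, xp, sigma_f=1.0, ell=1.0):
    """Squared-exponential: sigma_f^2 * exp(-(x - x')^2 / (2 ell^2))."""
    return sigma_f**2 * np.exp(-(x - xp)**2 / (2.0 * ell**2))

def k_per(x, xp, sigma_f=1.0, ell=1.0, p=1.0):
    """Periodic: sigma_f^2 * exp(-(2 / ell^2) * sin^2(pi (x - x') / p))."""
    return sigma_f**2 * np.exp(-2.0 / ell**2 * np.sin(np.pi * (x - xp) / p)**2)

def k_lin(x, xp, sigma_f=1.0, c=0.0):
    """Linear: sigma_f^2 * (x - c) * (x' - c)."""
    return sigma_f**2 * (x - c) * (xp - c)

# Products of kernels are kernels, e.g. a locally periodic kernel SE x Per
def k_se_times_per(x, xp):
    return k_se(x, xp) * k_per(x, xp)

print(k_se(0.3, 0.7), k_per(0.3, 0.7), k_lin(0.3, 0.7), k_se_times_per(0.3, 0.7))
```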
Definition (Positive definite functions). A symmetric function k : X × X → R is positive definite if ∀n ≥ 1, ∀(a₁, …, aₙ) ∈ Rⁿ, ∀(x₁, …, xₙ) ∈ Xⁿ,
Σ_{i=1}^{n} Σ_{j=1}^{n} aᵢ aⱼ k(xᵢ, xⱼ) ≥ 0.
The function k(·, ·) is strictly positive definite if, for mutually distinct xᵢ, the equality holds only when all the aᵢ are zero.
Theorem. Let H be a Hilbert space, X a non-empty set, and φ : X → H. Then ⟨φ(x), φ(y)⟩_H =: k(x, y) is positive definite.
Proof.
Σ_{i=1}^{n} Σ_{j=1}^{n} aᵢ aⱼ k(xᵢ, xⱼ) = Σ_{i=1}^{n} Σ_{j=1}^{n} ⟨aᵢ φ(xᵢ), aⱼ φ(xⱼ)⟩_H = ‖ Σ_{i=1}^{n} aᵢ φ(xᵢ) ‖²_H ≥ 0.
The reverse also holds: a positive definite k(x, x′) is the inner product in a unique H (Moore-Aronszajn theorem: coming later!).
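The definition can be checked empirically on any finite set of points: the Gram matrix Kᵢⱼ = k(xᵢ, xⱼ) must have no negative eigenvalues. A small sketch, also illustrating why a difference of kernels may fail to be a kernel (point set and lengthscales are arbitrary):

```python
import numpy as np

def k_se(x, xp, ell=1.0):
    return np.exp(-(x - xp)**2 / (2.0 * ell**2))

x = np.linspace(-3, 3, 20)
K = k_se(x[:, None], x[None, :])          # Gram matrix K_ij = k(x_i, x_j)

eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)            # True: K is positive semi-definite

# A difference of kernels need not be a kernel:
K_diff = k_se(x[:, None], x[None, :], ell=1.0) - 2.0 * k_se(x[:, None], x[None, :], ell=0.5)
print(np.linalg.eigvalsh(K_diff).min())   # negative eigenvalue -> not positive definite
```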
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
Idea: define a prior on functions directly, using a kernel function to define the covariance:
m(x) = E[f(x)],  k(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))],
f(x) ∼ GP( m(x), k(x, x′) ).
Any finite set of function values is jointly Gaussian distributed.
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
Observations y and test function values f* are jointly Gaussian:
[y; f*] ∼ N( 0, [[K(X, X) + σₙ²I, K(X, X*)], [K(X*, X), K(X*, X*)]] )
so we can fill in the standard relations for Gaussians.
Recall: for z = [x; y] ∼ N( [a; b], [[A, C], [Cᵀ, B]] ), the conditional is x | y ∼ N( a + C B⁻¹ (y − b), A − C B⁻¹ Cᵀ ).
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
[y; f*] ∼ N( 0, [[K(X, X) + σₙ²I, K(X, X*)], [K(X*, X), K(X*, X*)]] )
Conditioning on the observations gives the GP posterior predictive:
f* | X, y, X* ∼ N( f̄*, cov(f*) ),
f̄* = E[f* | X, y, X*] = K(X*, X) [K(X, X) + σₙ²I]⁻¹ y,
cov(f*) = K(X*, X*) − K(X*, X) [K(X, X) + σₙ²I]⁻¹ K(X, X*).
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
[Figure: GP regression on inputs x ∈ [−5, 5]: samples from the prior (left) and from the posterior after conditioning on observations (right).]
For a single test input x*, the predictive distribution on the observation y* is
p(y* | x*, x, y) = N( k(x*, x)ᵀ [K + σ²_noise I]⁻¹ y,  k(x*, x*) + σ²_noise − k(x*, x)ᵀ [K + σ²_noise I]⁻¹ k(x*, x) ).
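Putting the pieces together, a minimal GP regression sketch that computes the predictive mean and variance from the formulas above (squared-exponential covariance, synthetic data, my own variable names):

```python
import numpy as np

def k_se(A, B, ell=1.0, sigma_f=1.0):
    """Squared-exponential kernel matrix between 1-D input arrays A and B."""
    sq = (A[:, None] - B[None, :])**2
    return sigma_f**2 * np.exp(-sq / (2.0 * ell**2))

rng = np.random.default_rng(6)
x = np.linspace(-4, 4, 8)                      # training inputs
y = np.sin(x) + 0.1 * rng.normal(size=x.size)  # noisy training targets
x_star = np.linspace(-5, 5, 100)               # test inputs

sigma_n2 = 0.1**2                              # noise variance

K = k_se(x, x) + sigma_n2 * np.eye(x.size)     # K(X, X) + sigma_n^2 I
K_s = k_se(x_star, x)                          # K(X*, X)
K_ss = k_se(x_star, x_star)                    # K(X*, X*)

alpha = np.linalg.solve(K, y)
mean = K_s @ alpha                                          # predictive mean
cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)                # predictive covariance
std = np.sqrt(np.maximum(np.diag(cov), 0.0) + sigma_n2)     # predictive std on y*

print(mean[:5], std[:5])
```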
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
[Figure: a function drawn at random from a Gaussian process with Gaussian (squared exponential) covariance.]
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
Characteristic Lengthscales
[Figure: the mean posterior predictive function plotted for 3 different lengthscales ('too long', 'about right', 'too short'), fit to the same noisy observations; input x on the horizontal axis, function value y on the vertical axis.]