Data Mining Techniques
CS 6220 - Section 2 - Spring 2017 - Lecture 3


slide-1
SLIDE 1

Data Mining Techniques

CS 6220 - Section 2 - Spring 2017

Lecture 3

Jan-Willem van de Meent (credit: Yijun Zhao, Arthur Gretton, Rasmussen & Williams)

slide-2
SLIDE 2

Linear Regression

slide-3
SLIDE 3

Linear Regression

Assume f is a linear combination of D features:

y = f(x) + \varepsilon = w^T x + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)

Learning task: estimate w. For N points we write the model in matrix form.

slide-4
SLIDE 4

Linear Regression

slide-5
SLIDE 5

Error Measure: Sum of Squares

Mean Squared Error (MSE):

E(w) = \frac{1}{N} \sum_{n=1}^{N} (w^T x_n - y_n)^2 = \frac{1}{N} \| Xw - y \|^2

where

X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}

slide-6
SLIDE 6

Minimizing the Error

E(w) = \frac{1}{N} \| Xw - y \|^2

\nabla E(w) = \frac{2}{N} X^T (Xw - y) = 0

X^T X w = X^T y \;\Rightarrow\; w = X^\dagger y, \qquad \text{where } X^\dagger = (X^T X)^{-1} X^T \text{ is the 'pseudo-inverse' of } X

slide-7
SLIDE 7

Minimizing the Error

E(w) = \frac{1}{N} \| Xw - y \|^2

\nabla E(w) = \frac{2}{N} X^T (Xw - y) = 0

X^T X w = X^T y \;\Rightarrow\; w = X^\dagger y, \qquad \text{where } X^\dagger = (X^T X)^{-1} X^T \text{ is the 'pseudo-inverse' of } X

Matrix Cookbook (on course website)

slide-8
SLIDE 8

Ordinary Least Squares

Construct the matrix X and the vector y from the dataset {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} (each x includes the bias feature x_0 = 1) as follows:

X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}

Compute X^\dagger = (X^T X)^{-1} X^T. Return w = X^\dagger y.
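Not part of the slides: a minimal NumPy sketch of this procedure. In practice `np.linalg.lstsq` or a pseudo-inverse of X itself is numerically preferable to forming (X^T X)^{-1} explicitly.

```python
import numpy as np

def ols_fit(X, y):
    """Ordinary least squares via the pseudo-inverse: w = (X^T X)^{-1} X^T y."""
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])      # prepend the bias feature x_0 = 1
    w = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y  # pseudo-inverse solution
    return w

# example usage with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.1 * rng.normal(size=100)
w = ols_fit(X, y)   # w[0] ≈ 3.0 (bias), w[1:] ≈ [1.0, -2.0, 0.5]
```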

slide-9
SLIDE 9

Basis function regression

Linear regression: f(x) = w^T x
Basis function regression: f(x) = w^T \phi(x)
Polynomial regression: \phi(x) = (1, x, x^2, ..., x^M)
For N samples, the design matrix \Phi has rows \phi(x_n)^T.
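Not on the slide: a small sketch of polynomial basis function regression for one-dimensional inputs, reusing the least-squares solution above.

```python
import numpy as np

def poly_design_matrix(x, M):
    """Design matrix Phi with rows phi(x_n) = (1, x_n, x_n^2, ..., x_n^M)."""
    return np.vander(x, N=M + 1, increasing=True)

def fit_polynomial(x, y, M):
    Phi = poly_design_matrix(x, M)
    return np.linalg.pinv(Phi) @ y   # least-squares weights

x = np.linspace(-1, 1, 10)
y = np.sin(np.pi * x) + 0.1 * np.random.randn(10)
w3 = fit_polynomial(x, y, M=3)   # reasonable fit
w9 = fit_polynomial(x, y, M=9)   # interpolates the noise (overfits)
```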

slide-10
SLIDE 10

Polynomial Regression

[Figure: polynomial fits to data for M = 0, 1, 3, 9.]

slide-11
SLIDE 11

Polynomial Regression

[Figure: polynomial fits to data for M = 0, 1, 3, 9.]

Underfit

slide-12
SLIDE 12

Polynomial Regression

[Figure: polynomial fits to data for M = 0, 1, 3, 9.]

Overfit

slide-13
SLIDE 13

Regularization

L2 regularization (ridge regression) minimizes:

E(w) = \frac{1}{N} \| Xw - y \|^2 + \lambda \| w \|^2, \qquad \text{where } \lambda \ge 0 \text{ and } \| w \|^2 = w^T w

L1 regularization (LASSO) minimizes:

E(w) = \frac{1}{N} \| Xw - y \|^2 + \lambda |w|_1, \qquad \text{where } \lambda \ge 0 \text{ and } |w|_1 = \sum_{i=1}^{D} |w_i|

slide-14
SLIDE 14

Regularization

slide-15
SLIDE 15

Regularization

L2: closed-form solution w = (X^T X + \lambda I)^{-1} X^T y

L1: no closed-form solution. Use quadratic programming: minimize \| Xw - y \|^2 subject to \| w \|_1 \le s
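A minimal NumPy sketch (not from the slides) of the L2 closed-form solution; the L1 problem would instead be handed to an iterative solver.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lambda I)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```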

slide-16
SLIDE 16

Review: Probability

slide-17
SLIDE 17

Examples: Independent Events

1. What is the probability of getting the sequence 1, 2, 3, 4, 5, 6 if we roll a die six times?

2. A school survey found that 9 out of 10 students like pizza. If three students are chosen at random with replacement, what is the probability that all three like pizza?
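The answers are not shown in this transcript; because the events are independent, the probabilities simply multiply, e.g. in Python:

```python
p_sequence = (1 / 6) ** 6    # one specific sequence of six die rolls, ≈ 2.1e-05
p_all_pizza = 0.9 ** 3       # three independent students all like pizza, = 0.729
```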

slide-18
SLIDE 18

Dependent Events

[Figure: a red bin and a blue bin containing apples and oranges.]

If I take a fruit from the red bin, what is the probability that I get an apple?

slide-19
SLIDE 19

Dependent Events

Conditional Probability: P(fruit = apple | bin = red) = 2/8

slide-20
SLIDE 20

Dependent Events

Joint Probability: P(fruit = apple, bin = red) = 2/12

slide-21
SLIDE 21

Dependent Events

Joint Probability: P(fruit = apple, bin = blue) = ?

slide-22
SLIDE 22

Dependent Events

Joint Probability: P(fruit = apple, bin = blue) = 3/12

slide-23
SLIDE 23

Dependent Events

Joint Probability: P(fruit = orange, bin = blue) = ?

slide-24
SLIDE 24

Dependent Events

Joint Probability: P(fruit = orange, bin = blue) = 1/12

slide-25
SLIDE 25

Two rules of Probability

  • 1. Sum Rule (Marginal Probabilities)

P(fruit = apple) = P(fruit = apple , bin = blue) + P(fruit = apple , bin = red) = ?


slide-26
SLIDE 26

Two rules of Probability

  • 1. Sum Rule (Marginal Probabilities)

P(fruit = apple) = P(fruit = apple , bin = blue) + P(fruit = apple , bin = red) = 3 / 12 + 2 / 12 = 5 / 12


slide-27
SLIDE 27

Two rules of Probability

  • 2. Product Rule

P(fruit = apple, bin = red) = P(fruit = apple | bin = red) P(bin = red) = ?

slide-28
SLIDE 28

Two rules of Probability

  • 2. Product Rule

P(fruit = apple, bin = red) = P(fruit = apple | bin = red) P(bin = red) = 2/8 × 8/12 = 2/12

slide-29
SLIDE 29

Two rules of Probability

  • 2. Product Rule (reversed)

P(fruit = apple, bin = red) = P(bin = red | fruit = apple) P(fruit = apple) = ?

slide-30
SLIDE 30

Two rules of Probability

  • 2. Product Rule (reversed)

P(fruit = apple, bin = red) = P(bin = red | fruit = apple) P(fruit = apple) = 2/5 × 5/12 = 2/12
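A small numerical check of both rules (my own illustration), using the bin contents implied by the slides: 2 apples and 6 oranges in the red bin, 3 apples and 1 orange in the blue bin.

```python
# joint counts out of 12 fruits total
counts = {("apple", "red"): 2, ("orange", "red"): 6,
          ("apple", "blue"): 3, ("orange", "blue"): 1}
total = sum(counts.values())                       # 12
p_joint = {k: v / total for k, v in counts.items()}

# sum rule: marginal P(fruit = apple)
p_apple = p_joint[("apple", "red")] + p_joint[("apple", "blue")]   # 5/12

# product rule: P(apple, red) = P(apple | red) * P(red)
red_total = counts[("apple", "red")] + counts[("orange", "red")]   # 8
p_red = red_total / total                                          # 8/12
p_apple_given_red = counts[("apple", "red")] / red_total           # 2/8
assert abs(p_apple_given_red * p_red - p_joint[("apple", "red")]) < 1e-12
```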

slide-31
SLIDE 31

Bayes' Rule

Bayes' rule: P(y | x) = P(x | y) P(y) / P(x)   (posterior ∝ likelihood × prior)

Sum Rule: P(x) = \sum_y P(x, y)
Product Rule: P(x, y) = P(x | y) P(y)

slide-32
SLIDE 32

Bayes' Rule

Bayes' rule: posterior ∝ likelihood × prior

Probability of rare disease: 0.005
Probability of detection (true positive): 0.98
Probability of false positive: 0.05
What is the probability of disease when the test is positive?

slide-33
SLIDE 33

Bayes' Rule

Bayes' rule: posterior ∝ likelihood × prior

P(positive) = 0.98 × 0.005 + 0.05 × 0.995 = 0.05465
P(positive, disease) = 0.98 × 0.005 = 0.0049
P(disease | positive) = 0.0049 / 0.05465 ≈ 0.09
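The same computation in Python, using the 0.98 detection probability stated on the previous slide:

```python
p_disease = 0.005
p_pos_given_disease = 0.98      # detection (true positive) probability
p_pos_given_healthy = 0.05      # false positive probability

p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos   # ≈ 0.09
```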

slide-34
SLIDE 34

Normal Distribution

x \sim N(\mu, \sigma^2) \;\Rightarrow\;

Density: p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

slide-35
SLIDE 35

Multivariate Normal

Parameters: mean \mu and covariance \Sigma, with \Sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)]

Density: p(x) = (2\pi)^{-D/2} |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)

slide-36
SLIDE 36

Covariance Matrices

\Sigma = \begin{pmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \end{pmatrix}, \quad \begin{pmatrix} 2.0 & 0.0 \\ 0.0 & 0.5 \end{pmatrix}, \quad \begin{pmatrix} 1.0 & 0.5 \\ 0.5 & 1.0 \end{pmatrix}, \quad \begin{pmatrix} 1.0 & -0.5 \\ -0.5 & 1.0 \end{pmatrix}

Question: Which covariance matrix \Sigma corresponds to which plot?

slide-37
SLIDE 37

Marginals and Conditionals

Suppose that x and y are jointly Gaussian:

z = \begin{bmatrix} x \\ y \end{bmatrix} \sim N\left( \begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & C \\ C^T & B \end{bmatrix} \right)

Question: What are the marginal distributions p(x) and p(y)?

x \sim N(a, A), \qquad y \sim N(b, B)

slide-38
SLIDE 38

Marginals and Conditionals

Suppose that x and y are jointly Gaussian:

z = \begin{bmatrix} x \\ y \end{bmatrix} \sim N\left( \begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & C \\ C^T & B \end{bmatrix} \right)

Question: What are the conditional distributions p(x | y) and p(y | x)?

x \mid y \sim N\left( a + C B^{-1}(y - b), \; A - C B^{-1} C^T \right)

y \mid x \sim N\left( b + C^T A^{-1}(x - a), \; B - C^T A^{-1} C \right)
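A quick numerical sanity check of the conditioning formula (my own sketch, with made-up block values):

```python
import numpy as np

# joint Gaussian over (x, y) with block mean (a, b) and covariance [[A, C], [C^T, B]]
a, b = np.array([0.0]), np.array([1.0])
A, B, C = np.array([[2.0]]), np.array([[1.0]]), np.array([[0.5]])

y_obs = np.array([2.0])
mean_x_given_y = a + C @ np.linalg.inv(B) @ (y_obs - b)   # a + C B^{-1} (y - b) = [0.5]
cov_x_given_y = A - C @ np.linalg.inv(B) @ C.T            # A - C B^{-1} C^T = [[1.75]]
```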
slide-39
SLIDE 39

Maximum Likelihood

slide-40
SLIDE 40

Regression: Probabilistic Interpretation

What is the probability p(y | x, w) of an observation under this model?

slide-41
SLIDE 41

Regression: Probabilistic Interpretation

Least Squares Objective ↔ Likelihood: p(y \mid X, w) = \prod_{n=1}^{N} N(y_n \mid w^T x_n, \sigma^2)

slide-42
SLIDE 42

Maximum Likelihood

Least Squares Objective ↔ Log-Likelihood: maximizing the likelihood minimizes the sum of squares.
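The equations themselves did not survive the transcript; under the model y_n = w^T x_n + \varepsilon_n with \varepsilon_n \sim N(0, \sigma^2), the log-likelihood is

\log p(y \mid X, w) = \sum_{n=1}^{N} \log N(y_n \mid w^T x_n, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^T x_n)^2 - \frac{N}{2}\log(2\pi\sigma^2),

so maximizing it over w is exactly minimizing the sum of squared errors.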

slide-43
SLIDE 43

Maximum a Posteriori

slide-44
SLIDE 44

Regression with Priors

Can we maximize p(w | X, y)? (This is known as maximum a posteriori estimation.)

slide-45
SLIDE 45

Regression with Priors

From Bayes' rule: p(w | X, y) ∝ p(y | X, w) p(w)

slide-46
SLIDE 46

Maximum a Posteriori

Maximum a Posteriori is Equivalent to Ridge Regression

slide-47
SLIDE 47

Maximum a Posteriori

Maximum a Posteriori is Equivalent to Ridge Regression
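The correspondence is stated but not derived on the slide; assuming a Gaussian likelihood with noise variance \sigma^2 and a prior w \sim N(0, \tau^2 I), the negative log-posterior is

-\log p(w \mid X, y) = \frac{1}{2\sigma^2}\|Xw - y\|^2 + \frac{1}{2\tau^2}\|w\|^2 + \text{const} \;\propto\; \frac{1}{N}\|Xw - y\|^2 + \lambda\|w\|^2, \qquad \lambda = \frac{\sigma^2}{N\tau^2},

so the MAP estimate coincides with the ridge regression solution.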

slide-48
SLIDE 48

Basis Function Regression

[Figure: basis function fit with M = 3.]

slide-49
SLIDE 49

Basis Function Regression

[Figure: basis function fit with M = 3.]

slide-50
SLIDE 50

Predictive Posterior

slide-51
SLIDE 51

adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/

Priors on Functions

[Figure: functions sampled from the prior w^T φ(x) for polynomial degrees M = 0, 1, 2, 3, 5, 17.]

Idea: sampling w ~ p(w) defines a function wTφ(x), so p(w) is equivalent to a prior on functions.
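Not on the slides: a minimal sketch of this idea, drawing weight vectors from a standard normal prior and plotting the induced polynomial functions.

```python
import numpy as np
import matplotlib.pyplot as plt

M = 5                                            # polynomial degree
xs = np.linspace(-1, 1, 200)
Phi = np.vander(xs, N=M + 1, increasing=True)    # rows phi(x) = (1, x, ..., x^M)

rng = np.random.default_rng(0)
for _ in range(5):
    w = rng.normal(size=M + 1)                   # w ~ N(0, I): prior over weights
    plt.plot(xs, Phi @ w)                        # the induced function w^T phi(x)
plt.show()
```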

slide-52
SLIDE 52

adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/

Posterior Uncertainty

[Figure: functions sampled from the posterior, for three increasing values of λ.]

Can we reason about the posterior on functions? Idea: sample w ~ p(w | X, y) and plot the resulting functions. (The panels show increasing λ.)

slide-53
SLIDE 53

adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/

The Predictive Distribution

f_* \mid x_*, X, y \sim N\left( \tfrac{1}{\sigma_n^2}\, \phi(x_*)^T A^{-1} \Phi y, \;\; \phi(x_*)^T A^{-1} \phi(x_*) \right)

where A = \sigma_n^{-2} \Phi \Phi^T + \Sigma_p^{-1} and \Phi = \Phi(X). To make predictions with this equation we need to invert the A matrix.

(Figure labels: prior, predictive distribution on the function value, predictive distribution on observations.)

slide-54
SLIDE 54

[Figure: predictive distributions for three increasing values of λ.]

adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/

The Predictive Distribution

Idea: average over all possible values of w. (The panels show increasing λ.)

slide-55
SLIDE 55

The Kernel Trick

slide-56
SLIDE 56

Cost of Feature Computation

Example: Mapping with linear and quadratic terms

slide-57
SLIDE 57

Cost of Feature Computation

Example: a mapping with linear and quadratic terms has 1 + d + d²/2 terms

slide-58
SLIDE 58

Cost of Feature Computation

Example: Mapping with linear and quadratic terms

Polynomial | φ(x)                         | Cost       | 100 features
Quadratic  | > d²/2 terms up to degree 2  | d² N²/4    | 2,500 N²
Cubic      | > d³/6 terms up to degree 3  | d³ N²/12   | 83,000 N²
Quartic    | > d⁴/24 terms up to degree 4 | d⁴ N²/48   | 1,960,000 N²

slide-59
SLIDE 59

The Kernel Trick

Define a kernel function k(x, x') = ⟨φ(x), φ(x')⟩ such that k can be cheaper to evaluate than φ!

slide-60
SLIDE 60

Define a kernel function k(x, x') = ⟨φ(x), φ(x')⟩ such that k can be cheaper to evaluate than φ!

The Kernel Trick

slide-61
SLIDE 61

The Kernel Trick

Kernel for polynomials up to degree q: k(x, x') = (1 + x^T x')^q
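As a concrete check (my own illustration, not from the slides): for degree q = 2 in d = 2 dimensions, the kernel (1 + xᵀx')² evaluates the same inner product as the explicit feature map φ(x) = (1, √2·x₁, √2·x₂, x₁², x₂², √2·x₁x₂).

```python
import numpy as np

def phi(x):
    """Explicit features for the degree-2 polynomial kernel in 2 dimensions."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
k_explicit = phi(x) @ phi(xp)          # inner product in feature space
k_kernel = (1.0 + x @ xp) ** 2         # kernel evaluation, cost O(d)
assert np.isclose(k_explicit, k_kernel)
```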

slide-62
SLIDE 62

Computational Cost

Kernel for polynomials up to degree q

Polynomial | φ(x)                         | Cost       | 100 features
Quadratic  | > d²/2 terms up to degree 2  | d² N²/4    | 2,500 N²
Cubic      | > d³/6 terms up to degree 3  | d³ N²/12   | 83,000 N²
Quartic    | > d⁴/24 terms up to degree 4 | d⁴ N²/48   | 1,960,000 N²

slide-63
SLIDE 63

Computational Cost

Kernel for polynomials up to degree q

Polynomial | φ(x)                         | Cost       | 100 features | Kernel cost
Quadratic  | > d²/2 terms up to degree 2  | d² N²/4    | 2,500 N²     | 100 N²
Cubic      | > d³/6 terms up to degree 3  | d³ N²/12   | 83,000 N²    | 100 N²
Quartic    | > d⁴/24 terms up to degree 4 | d⁴ N²/48   | 1,960,000 N² | 100 N²

slide-64
SLIDE 64

Computational Cost

Kernel for polynomials up to degree q

slide-65
SLIDE 65

Intermezzo: Kernels

Borrowing from:
 Arthur Gretton 
 (Gatsby, UCL)

slide-66
SLIDE 66

Hilbert Spaces

Definition (Inner product). Let H be a vector space over R. A function ⟨·, ·⟩_H : H × H → R is an inner product on H if

1. Linear: ⟨α₁f₁ + α₂f₂, g⟩_H = α₁⟨f₁, g⟩_H + α₂⟨f₂, g⟩_H
2. Symmetric: ⟨f, g⟩_H = ⟨g, f⟩_H
3. ⟨f, f⟩_H ≥ 0 and ⟨f, f⟩_H = 0 if and only if f = 0.

Norm induced by the inner product: ‖f‖_H := √⟨f, f⟩_H

slide-67
SLIDE 67

Example: Fourier Bases

slide-68
SLIDE 68

Example: Fourier Bases

slide-69
SLIDE 69

Example: Fourier Bases

slide-70
SLIDE 70

Example: Fourier Bases

Fourier modes define a vector space

slide-71
SLIDE 71

Kernels

Definition. Let X be a non-empty set. A function k : X × X → R is a kernel if there exists an R-Hilbert space H and a map φ : X → H such that for all x, x' ∈ X,

k(x, x') := ⟨φ(x), φ(x')⟩_H.

Almost no conditions on X (e.g. X itself doesn't need an inner product; X may be a set of documents). A single kernel can correspond to several possible feature maps. A trivial example for X := R: φ₁(x) = x and φ₂(x) = (x/√2, x/√2).

slide-72
SLIDE 72

Sums, Transformations, Products

Theorem (Sums of kernels are kernels). Given α > 0 and k, k₁ and k₂ all kernels on X, then αk and k₁ + k₂ are kernels on X. (Proof via positive definiteness: later!) A difference of kernels may not be a kernel (why?)

Theorem (Mappings between spaces). Let X and X̃ be sets, and define a map A : X → X̃. Define the kernel k on X̃. Then k(A(x), A(x')) is a kernel on X. Example: k(x, x') = x²(x')².

Theorem (Products of kernels are kernels). Given k₁ on X₁ and k₂ on X₂, then k₁ × k₂ is a kernel on X₁ × X₂. If X₁ = X₂ = X, then k := k₁ × k₂ is a kernel on X. Proof: main idea only!

slide-73
SLIDE 73

Polynomial Kernels

Theorem (Polynomial kernels). Let x, x' ∈ R^d for d ≥ 1, let m ≥ 1 be an integer and c > 0 be a positive real. Then

k(x, x') := (⟨x, x'⟩ + c)^m

is a valid kernel. To prove: expand into a sum (with non-negative scalars) of kernels ⟨x, x'⟩ raised to integer powers. These individual terms are valid kernels by the product rule.

slide-74
SLIDE 74

Infinite Sequences

Definition. The space ℓ² (square summable sequences) comprises all sequences a := (a_i)_{i ≥ 1} for which

\|a\|_{\ell^2}^2 = \sum_{i=1}^{\infty} a_i^2 < \infty.

Definition. Given a sequence of functions (\phi_i(x))_{i \ge 1} in ℓ², where \phi_i : X → R is the i-th coordinate of \phi(x), define

k(x, x') := \sum_{i=1}^{\infty} \phi_i(x) \phi_i(x').   (1)

slide-75
SLIDE 75

Infinite Sequences

Why square summable? By Cauchy-Schwarz,

\left| \sum_{i=1}^{\infty} \phi_i(x) \phi_i(x') \right| \le \|\phi(x)\|_{\ell^2} \, \|\phi(x')\|_{\ell^2},

so the sequence defining the inner product converges for all x, x' ∈ X.

slide-76
SLIDE 76

Taylor Series Kernels

Definition (Taylor series kernel). For r ∈ (0, ∞], with a_n ≥ 0 for all n ≥ 0,

f(z) = \sum_{n=0}^{\infty} a_n z^n, \qquad |z| < r, \; z ∈ R.

Define X to be the √r-ball in R^d, so that ‖x‖ < √r, and

k(x, x') = f(⟨x, x'⟩) = \sum_{n=0}^{\infty} a_n ⟨x, x'⟩^n.

Example (Exponential kernel): k(x, x') := \exp(⟨x, x'⟩).

slide-77
SLIDE 77

Gaussian Kernel

Example (Gaussian kernel). The Gaussian kernel on R^d is defined as

k(x, x') := \exp\left(-\gamma^2 \, \|x - x'\|^2\right).

Proof: an exercise! Use the product rule, mapping rule, and exponential kernel.

(also known as the Radial Basis Function (RBF) kernel)

slide-78
SLIDE 78

Gaussian Kernel

Example (Gaussian kernel). The Gaussian kernel on R^d is defined as

k(x, x') := \exp\left(-\gamma^2 \, \|x - x'\|^2\right).

Proof: an exercise! Use the product rule, mapping rule, and exponential kernel.

(also known as the Radial Basis Function (RBF) kernel)

Squared Exponential (SE), Automatic Relevance Determination (ARD)
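A minimal NumPy sketch (mine, not from the slides) of the Gaussian/RBF Gram matrix; the parametrization exp(−γ‖x − x'‖²) is one common convention and the name `rbf_kernel` is just illustrative.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x1_i - x2_j||^2)."""
    sq_dists = (np.sum(X1 ** 2, axis=1)[:, None]
                + np.sum(X2 ** 2, axis=1)[None, :]
                - 2 * X1 @ X2.T)
    return np.exp(-gamma * sq_dists)
```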

slide-79
SLIDE 79

Products of Kernels

Base kernels (one-dimensional):

Squared-exp (SE): k(x, x') = \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right)
Periodic (Per): k(x, x') = \sigma_f^2 \exp\left(-\frac{2}{\ell^2} \sin^2\left(\pi \frac{x - x'}{p}\right)\right)
Linear (Lin): k(x, x') = \sigma_f^2 (x - c)(x' - c)

[Figure: draws from the product kernels Lin × Lin, SE × Per, Lin × SE, Lin × Per.]

source: David Duvenaud (PhD Thesis)

slide-80
SLIDE 80

Positive Definiteness

Definition (Positive definite functions). A symmetric function k : X × X → R is positive definite if for all n ≥ 1, all (a_1, ..., a_n) ∈ R^n and all (x_1, ..., x_n) ∈ X^n,

\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j) \ge 0.

The function k(·, ·) is strictly positive definite if, for mutually distinct x_i, the equality holds only when all the a_i are zero.

slide-81
SLIDE 81

Mercer’s Theorem

Theorem. Let H be a Hilbert space, X a non-empty set and φ : X → H. Then ⟨φ(x), φ(y)⟩_H =: k(x, y) is positive definite.

Proof.

\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j) = \sum_{i=1}^{n} \sum_{j=1}^{n} \langle a_i \phi(x_i), a_j \phi(x_j) \rangle_H = \left\| \sum_{i=1}^{n} a_i \phi(x_i) \right\|_H^2 \ge 0.

The reverse also holds: a positive definite k(x, x') is an inner product in a unique H (Moore-Aronszajn: coming later!).

slide-82
SLIDE 82

Gaussian Processes

slide-83
SLIDE 83

adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/

Gaussian Processes

Idea: define a prior on functions, using a kernel function to define the covariance

m(x) = E[f(x)], \qquad k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))]

f(x) \sim GP\left(m(x), k(x, x')\right)

If the mean is 0, then for any set of points X_* the function values are Gaussian distributed:

f_* \sim N\left(0, K(X_*, X_*)\right)
slide-84
SLIDE 84

adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/

Sequential Generation

\begin{bmatrix} y \\ f_* \end{bmatrix} \sim N\left( 0, \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix} \right)

Observations and function values are jointly Gaussian, so we can fill in the standard relations for Gaussians:

z = \begin{bmatrix} x \\ y \end{bmatrix} \sim N\left( \begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & C \\ C^T & B \end{bmatrix} \right), \qquad x \mid y \sim N\left( a + C B^{-1}(y - b), \; A - C B^{-1} C^T \right)
slide-85
SLIDE 85

adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/

Sequential Generation

\begin{bmatrix} y \\ f_* \end{bmatrix} \sim N\left( 0, \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix} \right)

Observations and function values are jointly Gaussian, so we can fill in the standard relations for Gaussians:

f_* \mid X, y, X_* \sim N\left(\bar{f}_*, \mathrm{cov}(f_*)\right), \quad where

\bar{f}_* := E[f_* \mid X, y, X_*] = K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1} y,

\mathrm{cov}(f_*) = K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1} K(X, X_*).
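A minimal NumPy sketch (my own, not from the slides) of these two equations; `kernel` is any Gram-matrix function such as the RBF sketch earlier, and `sigma_n` is the noise standard deviation.

```python
import numpy as np

def gp_predict(X_train, y_train, X_test, kernel, sigma_n=0.1):
    """GP posterior mean and covariance at the test inputs."""
    K = kernel(X_train, X_train) + sigma_n ** 2 * np.eye(len(X_train))
    K_s = kernel(X_test, X_train)                  # K(X_*, X)
    K_ss = kernel(X_test, X_test)                  # K(X_*, X_*)
    mean = K_s @ np.linalg.solve(K, y_train)       # K(X_*,X) [K + sigma_n^2 I]^{-1} y
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)   # posterior covariance
    return mean, cov
```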

slide-86
SLIDE 86

adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/

GP Prior and Posterior

[Figure: samples from the GP prior (left) and GP posterior (right); input x vs. output f(x).]

p(y_* \mid x_*, x, y) = N\left( k(x_*, x)^T [K + \sigma_{\text{noise}}^2 I]^{-1} y, \;\; k(x_*, x_*) + \sigma_{\text{noise}}^2 - k(x_*, x)^T [K + \sigma_{\text{noise}}^2 I]^{-1} k(x_*, x) \right)

slide-87
SLIDE 87

adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/

Gaussian Process Sample


Function drawn at random from a Gaussian Process with Gaussian covariance

slide-88
SLIDE 88

adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/

Choosing Kernel Hyperparameters

[Figure: mean posterior predictive function for three different length scales: too long, about right, too short (input x vs. function value y).]

The mean posterior predictive function is plotted for 3 different length scales, using the covariance function

k(x, x') = v^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right) + \sigma_{\text{noise}}^2 \, \delta_{x x'}

Characteristic Lengthscales