Assignment 3

Zahra Sheikhbahaee Zeou Hu & Colin Vandenhof February 2020

1 [2 points] Mixture of Bernoullis

A mixture of Bernoullis model is like the Gaussian mixture model which we’ve discussed in this course. Each of the mixture components consists of a collection of independent Bernoulli random variables. In general, a mixture model assumes the data are generated by the following process: first we sample $z$, and then we sample the observables $x$ from a distribution which depends on $z$, i.e.
$$
p(x, z) = p(x \mid z)\, p(z) \tag{1}
$$
In mixture models, $p(z)$ is always a multinomial distribution with parameters $\pi = \{\pi_1, \dots, \pi_K\}$, the mixture weights, which satisfy
$$
\sum_{k=1}^{K} \pi_k = 1, \qquad \pi_k \ge 0 \tag{2}
$$
Consider a set of $N$ binary random vectors $x_i$ in a $D$-dimensional space, where $i = 1, \dots, N$, each of which is governed by a Bernoulli distribution with parameters $\theta_{jk}$:
$$
p(x_i \mid z_i = k, \theta) = \prod_{j=1}^{D} \theta_{jk}^{x_{ij}} (1 - \theta_{jk})^{1 - x_{ij}} \tag{3}
$$

We can write the generative model of a mixture model as
$$
p(z \mid \pi) \sim \mathrm{Multinomial}(\pi) = \prod_{k=1}^{K} \pi_k^{z_k}
$$
$$
p(x \mid z, \theta) \sim \mathrm{Bernoulli}(\theta) = \prod_{k=1}^{K} \left[\theta_k^{x} (1 - \theta_k)^{1 - x}\right]^{z_k} \tag{4}
$$
The distribution $p(z \mid \pi)$ gives the mixture proportions, and $\pi_k$ is the weight of the $k$-th component. So the Bernoulli mixture model is given as

$$
p(x) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \prod_{j=1}^{D} \theta_{jk}^{x_{ij}} (1 - \theta_{jk})^{1 - x_{ij}} \tag{5}
$$
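To make the generative process above concrete, here is a minimal NumPy sketch that samples a dataset from a Bernoulli mixture. The sizes N, D, K and the values of π and θ are illustrative assumptions, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and parameters (assumed for this sketch only).
N, D, K = 500, 10, 3
pi = np.array([0.5, 0.3, 0.2])               # mixture weights, sum to 1
theta = rng.uniform(0.1, 0.9, size=(D, K))   # theta[j, k] = p(x_ij = 1 | z_i = k)

# Generative process: z_i ~ Multinomial(pi), then x_ij ~ Bernoulli(theta_jk).
z = rng.choice(K, size=N, p=pi)
X = (rng.uniform(size=(N, D)) < theta[:, z].T).astype(int)
```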

  • Show the associated directed graphical model and write down the incomplete-data log likelihood. The complete-data log likelihood for this model can be written as
$$
\ln p(x, z \mid \pi, \theta) = \sum_{i=1}^{N} \ln \prod_{k=1}^{K} \left[ \pi_k \prod_{j=1}^{D} p(x_{ij} \mid \theta_{jk}) \right]^{z_{ik}} \tag{6}
$$
In order to derive the EM algorithm, we take the expectation of the complete-data log-likelihood with respect to the posterior distribution of the latent variable $z$. Write down $Q(\xi; \xi^{(\mathrm{old})})$, where $\xi = \{\theta, \pi\}$, and write down the posterior distribution of the latent variable $z$.

  • Derive the updates for $\pi$ and $\theta$ in the M-step for ML estimation in terms of $\mathbb{E}[z_{ik}]$, and write down $\mathbb{E}[z_{ik}]$ (a code sketch of these E- and M-step quantities appears after this problem's task list).
  • Consider a mixture distribution $p(x)$ and show that
$$
\mathbb{E}[x] = \sum_{k=1}^{K} \pi_k \theta_k, \qquad
\mathrm{cov}[x] = \sum_{k=1}^{K} \pi_k \left\{ \Sigma_k + \theta_k \theta_k^{T} \right\} - \mathbb{E}[x]\,\mathbb{E}[x]^{T} \tag{7}
$$
where $\Sigma_k = \mathrm{diag}[\theta_{ki}(1 - \theta_{ki})]$. Hint: solve the second equation in the general case by adding and subtracting a term which is a function of $\mathbb{E}[x \mid k] = \theta_k$.

  • We now consider a Bayesian model in which we impose priors on the parameters. We impose the natural conjugate priors, i.e., a Beta prior for each $\theta_{jk}$ and a Dirichlet prior for $\pi$:
$$
p(\pi \mid \alpha) \sim \mathrm{Dir}(\alpha), \qquad p(\theta_{jk} \mid a, b) \sim \mathrm{Beta}(a, b) \tag{8}
$$
Show that the M-step for MAP estimation of a mixture of Bernoullis is given by
$$
\theta_{kj} = \frac{\sum_i \mathbb{E}[z_{ik}]\, x_{ij} + a - 1}{\sum_i \mathbb{E}[z_{ik}] + a + b - 2}, \qquad
\pi_k = \frac{\sum_i \mathbb{E}[z_{ik}] + \alpha_k - 1}{N + \sum_k \alpha_k - K} \tag{9}
$$
Hint: for the maximization w.r.t. $\pi_k$ in the M-step, you need to use a Lagrange multiplier to enforce the constraint on $\pi$.
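As a reference point for the E- and M-step bullets above, here is a minimal NumPy sketch of one EM iteration for the Bernoulli mixture with ML estimation. The update formulas used here are the standard textbook ones (responsibilities in the E-step, weighted counts in the M-step); the exercise asks for their derivation, and the array shapes and variable names are illustrative. Note that θ is stored as a (K, D) array, i.e. transposed relative to the θ_jk indexing in the text.

```python
import numpy as np

def em_step(X, pi, theta, eps=1e-10):
    """One EM iteration for a Bernoulli mixture (ML estimation).

    X:     (N, D) binary data
    pi:    (K,)   mixture weights
    theta: (K, D) Bernoulli parameters, theta[k, j]
    """
    # E-step: responsibilities r[i, k] = E[z_ik], the posterior of the latent variable.
    log_px = X @ np.log(theta + eps).T + (1 - X) @ np.log(1 - theta + eps).T  # (N, K)
    log_r = np.log(pi + eps) + log_px
    log_r -= log_r.max(axis=1, keepdims=True)   # stabilize before exponentiating
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)

    # M-step (ML): maximize the expected complete-data log-likelihood Q.
    Nk = r.sum(axis=0)                  # effective number of points per component
    pi_new = Nk / X.shape[0]
    theta_new = (r.T @ X) / Nk[:, None]
    return pi_new, theta_new, r
```

For the MAP version in Equation (9), the M-step would simply add the pseudo-counts a − 1, b − 1 and α_k − 1 from the Beta and Dirichlet priors to the corresponding numerators and denominators above.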


2 [2 points] Variational Lower Bound for the mixture of Bernoullis

In the mixture of Bernoullis, the multinomial distribution selects the mixture component. One can assume the conditional probability of each observed component follows a Bernoulli distribution as given in Equation (3). We have priors over $\pi$ and $\theta$ as
$$
p(\pi \mid \alpha) = \frac{\Gamma\!\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1}, \qquad
p(\theta_{jk} \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, \theta_{jk}^{a - 1} (1 - \theta_{jk})^{b - 1} \tag{10}
$$
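As a quick illustration of these conjugate priors (the hyperparameter values below are assumed, not given by the assignment), one can sample π and θ directly with SciPy:

```python
import numpy as np
from scipy.stats import beta, dirichlet

K, D = 3, 10
alpha, a, b = np.full(K, 2.0), 2.0, 2.0   # assumed hyperparameters

pi = dirichlet.rvs(alpha, random_state=0)[0]          # pi ~ Dir(alpha), shape (K,)
theta = beta.rvs(a, b, size=(D, K), random_state=1)   # theta_jk ~ Beta(a, b), shape (D, K)
```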

  • Consider a variational distribution which factorizes between the latent variables and the parameters, and show that the lower bound has the following form:
$$
\mathcal{L} = \mathbb{E}[\ln p(x \mid z, \theta)] + \mathbb{E}[\ln p(z \mid \pi)] + \mathbb{E}[\ln p(\pi \mid \alpha)] + \mathbb{E}[\ln p(\theta \mid a, b)] - \mathbb{E}[\ln q(z)] - \mathbb{E}[\ln q(\pi)] - \mathbb{E}[\ln q(\theta)] \tag{11}
$$

  • Let’s assume that the approximate distribution of the parameters of the model has the following form:
$$
q(\theta \mid \eta, \nu) \sim \mathrm{Beta}(\eta, \nu), \qquad q(\pi \mid \rho) \sim \mathrm{Dir}(\rho), \qquad q(z_k \mid \tau_k) \sim \mathrm{Cat}(\tau_k) \tag{12}
$$
Here $\rho$, $\tau$, $\eta$ and $\nu$ are variational parameters. Derive the variational update equations for the three variational distributions using the mean-field approximation, which should yield
$$
\rho_k = \alpha_k + \sum_{i=1}^{N} \tau_{ik}, \qquad
\eta_{jk} = a + \sum_{i=1}^{N} \tau_{ik}\, x_{ij}, \qquad
\nu_{jk} = b + \sum_{i=1}^{N} \tau_{ik}\, (1 - x_{ij}),
$$
$$
\tau_{ik} \propto \exp\!\left\{ \psi(\rho_k) - \psi\!\left(\sum_{k'=1}^{K} \rho_{k'}\right) + \sum_{j=1}^{D} x_{ij}\left[\psi(\eta_{jk}) - \psi(\eta_{jk} + \nu_{jk})\right] + \sum_{j=1}^{D} (1 - x_{ij})\left[\psi(\nu_{jk}) - \psi(\eta_{jk} + \nu_{jk})\right] \right\} \tag{13}
$$
(A code sketch of these coordinate updates follows the hint below.)



Hint: use the following properties,
$$
\mathbb{E}_{q(\theta)}[\ln \theta_{jk}] = \psi(\eta_{jk}) - \psi(\eta_{jk} + \nu_{jk}), \qquad
\mathbb{E}_{q(\theta)}[\ln(1 - \theta_{jk})] = \psi(\nu_{jk}) - \psi(\eta_{jk} + \nu_{jk}),
$$
$$
\mathbb{E}_{q(\pi)}[\ln \pi_k] = \psi(\rho_k) - \psi\!\left(\sum_{k'=1}^{K} \rho_{k'}\right), \qquad
\mathbb{E}_{q(z_k)}[z_k] = \tau_k \tag{14}
$$
where $\psi(\cdot)$ is the digamma function.
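A minimal NumPy/SciPy sketch of one round of the coordinate updates in Equation (13), using the digamma identities in Equation (14). Variable names and shapes are illustrative; alpha, a and b are the hyperparameters from Equation (10), and tau holds the current q(z).

```python
import numpy as np
from scipy.special import digamma

def cavi_step(X, alpha, a, b, tau):
    """One mean-field update round for the Bernoulli mixture.

    X:   (N, D) binary data
    tau: (N, K) current responsibilities, tau[i, k] = q(z_i = k)
    """
    # Updates of q(pi) ~ Dir(rho) and q(theta_jk) ~ Beta(eta_jk, nu_jk), Eq. (13).
    rho = alpha + tau.sum(axis=0)        # (K,)
    eta = a + X.T @ tau                  # (D, K): sum_i tau_ik x_ij
    nu = b + (1 - X).T @ tau             # (D, K): sum_i tau_ik (1 - x_ij)

    # Update of q(z) via the expected log-probabilities, Eq. (14).
    E_log_pi = digamma(rho) - digamma(rho.sum())             # (K,)
    E_log_th = digamma(eta) - digamma(eta + nu)              # (D, K)
    E_log_1m = digamma(nu) - digamma(eta + nu)               # (D, K)
    log_tau = E_log_pi + X @ E_log_th + (1 - X) @ E_log_1m   # (N, K)
    log_tau -= log_tau.max(axis=1, keepdims=True)
    tau_new = np.exp(log_tau)
    tau_new /= tau_new.sum(axis=1, keepdims=True)
    return rho, eta, nu, tau_new
```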

3 [2 points] Kernel methods

1. The k-nearest neighbors classifier assigns a point $x$ to the majority class of its $k$ nearest neighbors in the training set. Assume that we use the squared Euclidean distance to measure the distance to some point $x_n$ in the training set, $\|x - x_n\|^2$. Reformulate this classifier for a nonlinear kernel $k$ using the kernel trick.

2. The file circles.csv contains a toy dataset. Each example has two features that represent its coordinates $(x_1, x_2)$ in 2D space. Points belong to one of 5 classes which correspond to different circles centered at the origin. We would like to perform classification with an additional feature for the squared Euclidean distance to the origin. Write out the appropriate feature map $\phi((x_1, x_2))$ and kernel function $k(x, x')$.

3. Perform k-nearest neighbors classification with $k = 15$ using the kernel from (2) and the standard linear kernel. Compare accuracies over 10-fold cross-validation. Which version gives better results?
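For part 3, the sketch below runs kernelized k-nearest neighbors with 10-fold cross-validation, computing squared distances in feature space through the kernel via the expansion $k(a, a) - 2k(a, b) + k(b, b)$. The column names in circles.csv and the assumption of integer class labels 0–4 are illustrative; the kernel from part 2 would be plugged in alongside the linear one.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Assumed file layout: feature columns "x1", "x2" and an integer class column "label".
data = pd.read_csv("circles.csv")
X = data[["x1", "x2"]].to_numpy()
y = data["label"].to_numpy()

def linear_kernel(A, B):
    return A @ B.T

def knn_predict(K_te_tr, K_te_diag, K_tr_diag, y_train, k=15):
    # Kernel trick: ||phi(a) - phi(b)||^2 = k(a, a) - 2 k(a, b) + k(b, b).
    d2 = K_te_diag[:, None] - 2 * K_te_tr + K_tr_diag[None, :]
    nearest = np.argsort(d2, axis=1)[:, :k]               # indices of k nearest training points
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])  # majority vote

def cv_accuracy(kernel, X, y, k=15, n_splits=10, seed=0):
    accs = []
    for tr, te in StratifiedKFold(n_splits, shuffle=True, random_state=seed).split(X, y):
        pred = knn_predict(kernel(X[te], X[tr]),
                           np.diag(kernel(X[te], X[te])),
                           np.diag(kernel(X[tr], X[tr])),
                           y[tr], k=k)
        accs.append(np.mean(pred == y[te]))
    return float(np.mean(accs))

print("linear kernel, 10-fold CV accuracy:", cv_accuracy(linear_kernel, X, y))
```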


4 [2 points] Support Vector Machine

1. Recall the formulation of the soft-margin (linear) SVM:
$$
\begin{aligned}
\operatorname*{argmin}_{w, b, \xi} \quad & \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & y^{(i)}\!\left(w^T x^{(i)} + b\right) \ge 1 - \xi_i, \quad i = 1, \dots, n \\
& \xi_i \ge 0, \quad i = 1, \dots, n
\end{aligned} \tag{15}
$$
In lecture, the support vector machine was introduced geometrically as finding the max-margin classifier. While this geometric interpretation provides useful intuition about how the SVM works, it is hard to relate to other machine learning algorithms such as logistic regression. In this exercise, we show that the soft-margin SVM is equivalent to minimizing a loss function (specifically, the hinge loss) with L2 regularization, and thus connect it to logistic regression and the goal of binary classification. The hinge loss is defined as $V(y, f(x)) = (1 - yf(x))_+$, where $(s)_+ = \max(s, 0)$. Show that
$$
\operatorname*{argmin}_{w, b} \; \frac{1}{n} \sum_{i=1}^{n} V(y_i, f(x_i)) + \lambda \|w\|_2^2 \tag{16}
$$
is equivalent to formulation (15) for some $C$, where $f(x) = w^T x + b$. What is the corresponding $C$ (in terms of $n$ and $\lambda$)?

2. In the previous question, we chose $V(y, f(x)) = (1 - yf(x))_+$ (the hinge loss) as our loss function; however, there are other reasonable loss functions that we could choose. For example, we can choose
$$
V(y, f(x)) = \frac{1}{\log 2} \log\!\left(1 + e^{-yf(x)}\right),
$$
which is usually called the logistic loss, or
$$
V(y, f(x)) = \begin{cases} 0, & yf(x) \ge 0 \\ 1, & yf(x) < 0 \end{cases}
$$
which is called the 0-1 loss. Please plot the above three loss functions in one figure, with $yf(x)$ as the horizontal axis and $V(y, f(x))$ as the vertical axis (a plotting sketch appears after the last question). Explain your observation.

3. [Bonus] Answer the following questions as precisely as you can. What is (16) if we choose $V(y, f(x)) = \frac{1}{\log 2} \log\!\left(1 + e^{-yf(x)}\right)$? What is (16) if we choose $V(y, f(x)) = \begin{cases} 0, & yf(x) \ge 0 \\ 1, & yf(x) < 0 \end{cases}$? (Long answers receive no score.)
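For part 2, a minimal matplotlib sketch (referenced above) that plots the three losses against the margin $yf(x)$:

```python
import numpy as np
import matplotlib.pyplot as plt

m = np.linspace(-3, 3, 601)                    # horizontal axis: the margin y * f(x)

hinge = np.maximum(1 - m, 0)                   # (1 - yf(x))_+
logistic = np.log1p(np.exp(-m)) / np.log(2)    # (1 / log 2) * log(1 + exp(-yf(x)))
zero_one = (m < 0).astype(float)               # 1 if yf(x) < 0, else 0

plt.plot(m, hinge, label="hinge loss")
plt.plot(m, logistic, label="logistic loss")
plt.plot(m, zero_one, label="0-1 loss")
plt.xlabel("y f(x)")
plt.ylabel("V(y, f(x))")
plt.legend()
plt.show()
```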