Assignment 3

Zahra Sheikhbahaee Zeou Hu & Colin Vandenhof February 2020

1 [2 points] Mixture of Bernoullis

A mixture of Bernoullis model is like the Gaussian mixture model which we’ve discussed in this course. Each of the mixture components consists of a collection of independent Bernoulli random variables. In general, a mixture model assumes the data are generated by the following process: first we sample $z$, and then we sample the observables $x$ from a distribution which depends on $z$, i.e.
$$
p(x, z) = p(x \mid z)\, p(z) \tag{1}
$$
In mixture models, $p(z)$ is always a multinomial distribution with parameters $\pi = \{\pi_1, \dots, \pi_K\}$, the mixture weights, which satisfy
$$
\sum_{k=1}^{K} \pi_k = 1, \qquad \pi_k \ge 0 \tag{2}
$$
Consider a set of $N$ binary random vectors $x_i$ in a $D$-dimensional space, where $i = 1, \dots, N$, each of which is governed by a Bernoulli distribution with parameters $\theta_{jk}$:
$$
p(x_i \mid z_i = k, \theta) = \prod_{j=1}^{D} \theta_{jk}^{x_{ij}} (1 - \theta_{jk})^{1 - x_{ij}} \tag{3}
$$

We can write the generative model of a mixture model as
$$
p(z \mid \pi) \sim \mathrm{Multinomial}(\pi) = \prod_{k=1}^{K} \pi_k^{z_k}
$$
$$
p(x \mid z, \theta) \sim \mathrm{Bernoulli}(\theta) = \prod_{k=1}^{K} \left[\theta_k^{x} (1 - \theta_k)^{1 - x}\right]^{z_k} \tag{4}
$$
The distribution $p(z \mid \pi)$ gives the mixture proportions, and $\pi_k$ is the weight of the $k$-th component. So the Bernoulli mixture model is given as

$$
p(x) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \prod_{j=1}^{D} \theta_{jk}^{x_{ij}} (1 - \theta_{jk})^{1 - x_{ij}} \tag{5}
$$
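To make the generative process above concrete, here is a minimal NumPy sketch that samples a dataset from a Bernoulli mixture. The sizes N, D, K and the values of π and θ are illustrative assumptions, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and parameters (assumed for this sketch only).
N, D, K = 500, 10, 3
pi = np.array([0.5, 0.3, 0.2])               # mixture weights, sum to 1
theta = rng.uniform(0.1, 0.9, size=(D, K))   # theta[j, k] = p(x_ij = 1 | z_i = k)

# Generative process: z_i ~ Multinomial(pi), then x_ij ~ Bernoulli(theta_jk).
z = rng.choice(K, size=N, p=pi)
X = (rng.uniform(size=(N, D)) < theta[:, z].T).astype(int)
```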

  • Show the associated directed graphical model and write down the incomplete-data log likelihood. The complete-data log likelihood for this model can be written as
$$
\ln p(x, z \mid \pi, \theta) = \sum_{i=1}^{N} \ln \prod_{k=1}^{K} \left[ \pi_k \prod_{j=1}^{D} p(x_{ij} \mid \theta_{jk}) \right]^{z_{ik}} \tag{6}
$$
In order to derive the EM algorithm, we take the expectation of the complete-data log-likelihood with respect to the posterior distribution of the latent variable $z$. Write down $Q(\xi; \xi^{(\mathrm{old})})$, where $\xi = \{\theta, \pi\}$, and write down the posterior distribution of the latent variable $z$.

  • Derive the updates for $\pi$ and $\theta$ in the M-step for ML estimation in terms of $\mathbb{E}[z_{ik}]$, and write down $\mathbb{E}[z_{ik}]$ (a code sketch of these E- and M-step quantities appears after this problem's task list).
  • Consider a mixture distribution $p(x)$ and show that
$$
\mathbb{E}[x] = \sum_{k=1}^{K} \pi_k \theta_k, \qquad
\mathrm{cov}[x] = \sum_{k=1}^{K} \pi_k \left\{ \Sigma_k + \theta_k \theta_k^{T} \right\} - \mathbb{E}[x]\,\mathbb{E}[x]^{T} \tag{7}
$$
where $\Sigma_k = \mathrm{diag}[\theta_{ki}(1 - \theta_{ki})]$. Hint: solve the second equation in the general case by adding and subtracting a term which is a function of $\mathbb{E}[x \mid k] = \theta_k$.

  • We now consider a Bayesian model in which we impose priors on the parameters. We impose the natural conjugate priors, i.e., a Beta prior for each $\theta_{jk}$ and a Dirichlet prior for $\pi$:
$$
p(\pi \mid \alpha) \sim \mathrm{Dir}(\alpha), \qquad p(\theta_{jk} \mid a, b) \sim \mathrm{Beta}(a, b) \tag{8}
$$
Show that the M-step for MAP estimation of a mixture of Bernoullis is given by
$$
\theta_{kj} = \frac{\sum_i \mathbb{E}[z_{ik}]\, x_{ij} + a - 1}{\sum_i \mathbb{E}[z_{ik}] + a + b - 2}, \qquad
\pi_k = \frac{\sum_i \mathbb{E}[z_{ik}] + \alpha_k - 1}{N + \sum_k \alpha_k - K} \tag{9}
$$
Hint: for the maximization w.r.t. $\pi_k$ in the M-step, you need to use a Lagrange multiplier to enforce the constraint on $\pi$.
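As a reference point for the E- and M-step bullets above, here is a minimal NumPy sketch of one EM iteration for the Bernoulli mixture with ML estimation. The update formulas used here are the standard textbook ones (responsibilities in the E-step, weighted counts in the M-step); the exercise asks for their derivation, and the array shapes and variable names are illustrative. Note that θ is stored as a (K, D) array, i.e. transposed relative to the θ_jk indexing in the text.

```python
import numpy as np

def em_step(X, pi, theta, eps=1e-10):
    """One EM iteration for a Bernoulli mixture (ML estimation).

    X:     (N, D) binary data
    pi:    (K,)   mixture weights
    theta: (K, D) Bernoulli parameters, theta[k, j]
    """
    # E-step: responsibilities r[i, k] = E[z_ik], the posterior of the latent variable.
    log_px = X @ np.log(theta + eps).T + (1 - X) @ np.log(1 - theta + eps).T  # (N, K)
    log_r = np.log(pi + eps) + log_px
    log_r -= log_r.max(axis=1, keepdims=True)   # stabilize before exponentiating
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)

    # M-step (ML): maximize the expected complete-data log-likelihood Q.
    Nk = r.sum(axis=0)                  # effective number of points per component
    pi_new = Nk / X.shape[0]
    theta_new = (r.T @ X) / Nk[:, None]
    return pi_new, theta_new, r
```

For the MAP version in Equation (9), the M-step would simply add the pseudo-counts a − 1, b − 1 and α_k − 1 from the Beta and Dirichlet priors to the corresponding numerators and denominators above.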


2 [2 points] Variational Lower Bound for the mixture of Bernoullis

In the mixture of Bernoullis, the multinomial distribution selects the mixture component. One can assume the conditional probability of each observed component follows a Bernoulli distribution as given in Equation (3). We have priors over $\pi$ and $\theta$ as
$$
p(\pi \mid \alpha) = \frac{\Gamma\!\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1}, \qquad
p(\theta_{jk} \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, \theta_{jk}^{a - 1} (1 - \theta_{jk})^{b - 1} \tag{10}
$$
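As a quick illustration of these conjugate priors (the hyperparameter values below are assumed, not given by the assignment), one can sample π and θ directly with SciPy:

```python
import numpy as np
from scipy.stats import beta, dirichlet

K, D = 3, 10
alpha, a, b = np.full(K, 2.0), 2.0, 2.0   # assumed hyperparameters

pi = dirichlet.rvs(alpha, random_state=0)[0]          # pi ~ Dir(alpha), shape (K,)
theta = beta.rvs(a, b, size=(D, K), random_state=1)   # theta_jk ~ Beta(a, b), shape (D, K)
```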

  • Consider a variational distribution which factorizes between the latent variables and the parameters, and show that the lower bound has the following form:
$$
\mathcal{L} = \mathbb{E}[\ln p(x \mid z, \theta)] + \mathbb{E}[\ln p(z \mid \pi)] + \mathbb{E}[\ln p(\pi \mid \alpha)] + \mathbb{E}[\ln p(\theta \mid a, b)] - \mathbb{E}[\ln q(z)] - \mathbb{E}[\ln q(\pi)] - \mathbb{E}[\ln q(\theta)] \tag{11}
$$

  • Let’s assume that the approximate distribution of the parameters of the model has the following form:
$$
q(\theta \mid \eta, \nu) \sim \mathrm{Beta}(\eta, \nu), \qquad q(\pi \mid \rho) \sim \mathrm{Dir}(\rho), \qquad q(z_k \mid \tau_k) \sim \mathrm{Cat}(\tau_k) \tag{12}
$$
Here $\rho$, $\tau$, $\eta$ and $\nu$ are variational parameters. Derive the variational update equations for the three variational distributions using the mean-field approximation, which should yield
$$
\rho_k = \alpha_k + \sum_{i=1}^{N} \tau_{ik}, \qquad
\eta_{jk} = a + \sum_{i=1}^{N} \tau_{ik}\, x_{ij}, \qquad
\nu_{jk} = b + \sum_{i=1}^{N} \tau_{ik}\, (1 - x_{ij}),
$$
$$
\tau_{ik} \propto \exp\!\left\{ \psi(\rho_k) - \psi\!\left(\sum_{k'=1}^{K} \rho_{k'}\right) + \sum_{j=1}^{D} x_{ij}\left[\psi(\eta_{jk}) - \psi(\eta_{jk} + \nu_{jk})\right] + \sum_{j=1}^{D} (1 - x_{ij})\left[\psi(\nu_{jk}) - \psi(\eta_{jk} + \nu_{jk})\right] \right\} \tag{13}
$$
(A code sketch of these coordinate updates follows the hint below.)



Hint: use the following properties,
$$
\mathbb{E}_{q(\theta)}[\ln \theta_{jk}] = \psi(\eta_{jk}) - \psi(\eta_{jk} + \nu_{jk}), \qquad
\mathbb{E}_{q(\theta)}[\ln(1 - \theta_{jk})] = \psi(\nu_{jk}) - \psi(\eta_{jk} + \nu_{jk}),
$$
$$
\mathbb{E}_{q(\pi)}[\ln \pi_k] = \psi(\rho_k) - \psi\!\left(\sum_{k'=1}^{K} \rho_{k'}\right), \qquad
\mathbb{E}_{q(z_k)}[z_k] = \tau_k \tag{14}
$$
where $\psi(\cdot)$ is the digamma function.
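A minimal NumPy/SciPy sketch of one round of the coordinate updates in Equation (13), using the digamma identities in Equation (14). Variable names and shapes are illustrative; alpha, a and b are the hyperparameters from Equation (10), and tau holds the current q(z).

```python
import numpy as np
from scipy.special import digamma

def cavi_step(X, alpha, a, b, tau):
    """One mean-field update round for the Bernoulli mixture.

    X:   (N, D) binary data
    tau: (N, K) current responsibilities, tau[i, k] = q(z_i = k)
    """
    # Updates of q(pi) ~ Dir(rho) and q(theta_jk) ~ Beta(eta_jk, nu_jk), Eq. (13).
    rho = alpha + tau.sum(axis=0)        # (K,)
    eta = a + X.T @ tau                  # (D, K): sum_i tau_ik x_ij
    nu = b + (1 - X).T @ tau             # (D, K): sum_i tau_ik (1 - x_ij)

    # Update of q(z) via the expected log-probabilities, Eq. (14).
    E_log_pi = digamma(rho) - digamma(rho.sum())             # (K,)
    E_log_th = digamma(eta) - digamma(eta + nu)              # (D, K)
    E_log_1m = digamma(nu) - digamma(eta + nu)               # (D, K)
    log_tau = E_log_pi + X @ E_log_th + (1 - X) @ E_log_1m   # (N, K)
    log_tau -= log_tau.max(axis=1, keepdims=True)
    tau_new = np.exp(log_tau)
    tau_new /= tau_new.sum(axis=1, keepdims=True)
    return rho, eta, nu, tau_new
```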

3 [2 points] Kernel methods

1. The k-nearest neighbors classifier assigns a point $x$ to the majority class of its $k$ nearest neighbors in the training set. Assume that we use the squared Euclidean distance to measure the distance to some point $x_n$ in the training set, $\|x - x_n\|^2$. Reformulate this classifier for a nonlinear kernel $k$ using the kernel trick.

2. The file circles.csv contains a toy dataset. Each example has two features that represent its coordinates $(x_1, x_2)$ in 2D space. Points belong to one of 5 classes which correspond to different circles centered at the origin. We would like to perform classification with an additional feature for the squared Euclidean distance to the origin. Write out the appropriate feature map $\phi((x_1, x_2))$ and kernel function $k(x, x')$.

3. Perform k-nearest neighbors classification with $k = 15$ using the kernel from (2) and the standard linear kernel. Compare accuracies over 10-fold cross-validation. Which version gives better results?
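For part 3, the sketch below runs kernelized k-nearest neighbors with 10-fold cross-validation, computing squared distances in feature space through the kernel via the expansion $k(a, a) - 2k(a, b) + k(b, b)$. The column names in circles.csv and the assumption of integer class labels 0–4 are illustrative; the kernel from part 2 would be plugged in alongside the linear one.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Assumed file layout: feature columns "x1", "x2" and an integer class column "label".
data = pd.read_csv("circles.csv")
X = data[["x1", "x2"]].to_numpy()
y = data["label"].to_numpy()

def linear_kernel(A, B):
    return A @ B.T

def knn_predict(K_te_tr, K_te_diag, K_tr_diag, y_train, k=15):
    # Kernel trick: ||phi(a) - phi(b)||^2 = k(a, a) - 2 k(a, b) + k(b, b).
    d2 = K_te_diag[:, None] - 2 * K_te_tr + K_tr_diag[None, :]
    nearest = np.argsort(d2, axis=1)[:, :k]               # indices of k nearest training points
    votes = y_train[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])  # majority vote

def cv_accuracy(kernel, X, y, k=15, n_splits=10, seed=0):
    accs = []
    for tr, te in StratifiedKFold(n_splits, shuffle=True, random_state=seed).split(X, y):
        pred = knn_predict(kernel(X[te], X[tr]),
                           np.diag(kernel(X[te], X[te])),
                           np.diag(kernel(X[tr], X[tr])),
                           y[tr], k=k)
        accs.append(np.mean(pred == y[te]))
    return float(np.mean(accs))

print("linear kernel, 10-fold CV accuracy:", cv_accuracy(linear_kernel, X, y))
```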


4 [2 points] Support Vector Machine

1. Recall the formulation of the soft-margin (linear) SVM:
$$
\begin{aligned}
\operatorname*{argmin}_{w, b, \xi} \quad & \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \\
\text{s.t.} \quad & y^{(i)}\!\left(w^T x^{(i)} + b\right) \ge 1 - \xi_i, \quad i = 1, \dots, n \\
& \xi_i \ge 0, \quad i = 1, \dots, n
\end{aligned} \tag{15}
$$
In lecture, the support vector machine was introduced geometrically as finding the max-margin classifier. While this geometric interpretation provides useful intuition about how the SVM works, it is hard to relate to other machine learning algorithms such as logistic regression. In this exercise, we show that the soft-margin SVM is equivalent to minimizing a loss function (specifically, the hinge loss) with L2 regularization, and thus connect it to logistic regression and the goal of binary classification. The hinge loss is defined as $V(y, f(x)) = (1 - yf(x))_+$, where $(s)_+ = \max(s, 0)$. Show that
$$
\operatorname*{argmin}_{w, b} \; \frac{1}{n} \sum_{i=1}^{n} V(y_i, f(x_i)) + \lambda \|w\|_2^2 \tag{16}
$$
is equivalent to formulation (15) for some $C$, where $f(x) = w^T x + b$. What is the corresponding $C$ (in terms of $n$ and $\lambda$)?

2. In the previous question, we chose $V(y, f(x)) = (1 - yf(x))_+$ (the hinge loss) as our loss function; however, there are other reasonable loss functions that we could choose. For example, we can choose
$$
V(y, f(x)) = \frac{1}{\log 2} \log\!\left(1 + e^{-yf(x)}\right),
$$
which is usually called the logistic loss, or
$$
V(y, f(x)) = \begin{cases} 0, & yf(x) \ge 0 \\ 1, & yf(x) < 0 \end{cases}
$$
which is called the 0-1 loss. Please plot the above three loss functions in one figure, with $yf(x)$ as the horizontal axis and $V(y, f(x))$ as the vertical axis (a plotting sketch appears after the last question). Explain your observation.

3. [Bonus] Answer the following questions as precisely as you can. What is (16) if we choose $V(y, f(x)) = \frac{1}{\log 2} \log\!\left(1 + e^{-yf(x)}\right)$? What is (16) if we choose $V(y, f(x)) = \begin{cases} 0, & yf(x) \ge 0 \\ 1, & yf(x) < 0 \end{cases}$? (Long answers receive no score.)
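For part 2, a minimal matplotlib sketch (referenced above) that plots the three losses against the margin $yf(x)$:

```python
import numpy as np
import matplotlib.pyplot as plt

m = np.linspace(-3, 3, 601)                    # horizontal axis: the margin y * f(x)

hinge = np.maximum(1 - m, 0)                   # (1 - yf(x))_+
logistic = np.log1p(np.exp(-m)) / np.log(2)    # (1 / log 2) * log(1 + exp(-yf(x)))
zero_one = (m < 0).astype(float)               # 1 if yf(x) < 0, else 0

plt.plot(m, hinge, label="hinge loss")
plt.plot(m, logistic, label="logistic loss")
plt.plot(m, zero_one, label="0-1 loss")
plt.xlabel("y f(x)")
plt.ylabel("V(y, f(x))")
plt.legend()
plt.show()
```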