SLIDE 1

Review

  • Gibbs sampling
  • MH with proposal
  • Q(X | X′) = P(XB(i) | X¬B(i)) I(X¬B(i) = X′¬B(i)) / #B, i.e., from the current state X′, pick one of #B blocks uniformly and resample it from its conditional (sketch below)
  • failure mode: “lock-down”
  • Relational learning (properties of sets of entities)
  • document clustering, recommender systems, eigenfaces
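A minimal runnable sketch of the Gibbs-as-MH view, on a hypothetical two-variable binary distribution (the weight table is invented for illustration). Because the proposal resamples one block from its exact conditional, the MH acceptance probability works out to 1, so every proposal is accepted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target over two binary variables, given by unnormalized weights.
weights = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 4.0}

def conditional(i, x):
    """P(x_i = 1 | x_{-i}), computed from the joint table."""
    x0, x1 = list(x), list(x)
    x0[i], x1[i] = 0, 1
    return weights[tuple(x1)] / (weights[tuple(x0)] + weights[tuple(x1)])

def gibbs_as_mh_step(x):
    """One MH step with Q(X | X') = P(XB(i) | X¬B(i)) I(X¬B(i) = X'¬B(i)) / #B.
    For this proposal the acceptance ratio is exactly 1, so we always accept."""
    i = rng.integers(2)                            # pick a block uniformly (#B = 2)
    x = list(x)
    x[i] = int(rng.random() < conditional(i, x))   # resample it from its conditional
    return tuple(x)

x, counts = (0, 0), {}
for _ in range(20000):
    x = gibbs_as_mh_step(x)
    counts[x] = counts.get(x, 0) + 1
print({k: v / 20000 for k, v in sorted(counts.items())})   # ~ (1, 2, 2, 4)/9
```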

SLIDE 2

Review

  • Latent-variable models
  • PCA, pPCA, Bayesian PCA
  • everything Gaussian
  • E(X | U, V) = UVᵀ
  • MLE: use SVD (sketch below)
  • Mean subtraction, example weights
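A small sketch of the MLE-by-SVD recipe (function and variable names are mine): subtract the column means, truncate the SVD at rank k, and read off the factors.

```python
import numpy as np

# Sketch: rank-k MLE for E(X | U, V) = U V^T, everything Gaussian.
# Mean-subtract, then keep the top k singular values/vectors (Eckart-Young).
def pca_mle(X, k):
    mu = X.mean(axis=0)                          # per-column means
    P, s, Qt = np.linalg.svd(X - mu, full_matrices=False)
    U = P[:, :k] * s[:k]                         # n x k row factors
    V = Qt[:k].T                                 # m x k column factors
    return mu, U, V                              # E(X) ~ mu + U V^T

X = np.random.default_rng(0).normal(size=(100, 20))
mu, U, V = pca_mle(X, k=3)
print(np.linalg.norm(X - (mu + U @ V.T)))        # rank-3 reconstruction error
```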

SLIDE 3

PageRank

  • SVD is pretty useful: it turns out to be the main computational step in other models too
  • A famous one: PageRank
  • Given: web graph (V, E)
  • Predict: which pages are important

SLIDE 4

PageRank: adjacency matrix

SLIDE 5

Random surfer model

  • W. p. α: follow a random outgoing link
  • W. p. (1–α): jump to a uniformly random page
  • Intuition: a page is important if a random surfer is likely to land there
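A sketch of the random-surfer computation by power iteration. The slide leaves the two cases blank, so assigning α to link-following (the usual damping convention, e.g., α ≈ 0.85) is my assumption, as is the handling of dangling pages:

```python
import numpy as np

def pagerank(A, alpha=0.85, iters=100):
    """Random-surfer importance by power iteration.
    A[i, j] = 1 if page i links to page j. alpha = prob. of following a
    link; 1 - alpha = prob. of teleporting (assumed convention).
    Dangling pages (no out-links) teleport uniformly (my choice)."""
    n = A.shape[0]
    out = A.sum(axis=1, keepdims=True)
    T = np.where(out > 0, A / np.maximum(out, 1), 1.0 / n)
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = alpha * (pi @ T) + (1 - alpha) / n     # surf, or jump anywhere
    return pi

A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(pagerank(A))    # importance scores, sum to ~1
```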

SLIDE 6

Stationary distribution

[Bar chart: stationary distribution over pages A, B, C, D]

SLIDE 7

Thought experiment

  • What if A is symmetric?
  • note: we’re going to stop distinguishing A, A’
  • So, stationary dist’n for symmetric A is proportional to node degree: not very informative
  • What do people do instead?
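A quick numeric check of the degree claim on a made-up 4-node graph: power-iterating the row-normalized walk on a symmetric adjacency matrix converges to the degree distribution.

```python
import numpy as np

# Made-up symmetric graph: node degrees are 3, 2, 2, 1.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
T = A / A.sum(axis=1, keepdims=True)   # row-normalized transition matrix
pi = np.full(4, 0.25)
for _ in range(200):                   # power-iterate to the stationary dist'n
    pi = pi @ T
print(pi)                              # [0.375 0.25  0.25  0.125]
print(A.sum(axis=1) / A.sum())         # degree / total degree: identical
```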

SLIDE 8

Spectral embedding

  • Another famous model: spectral embedding (and its cousin, spectral clustering)
  • Embedding: assign low-D coordinates to vertices (e.g., web pages) so that similar nodes in the graph ⇒ nearby coordinates
  • A, B similar = random surfer tends to reach the same places when starting from A or B

SLIDE 9

Where does random surfer reach?

  • Given graph: adjacency matrix A, row-normalized so each row sums to 1
  • Start from distribution π
  • after 1 step: P(k | π, 1-step) = (πᵀA)ₖ
  • after 2 steps: P(k | π, 2-step) = (πᵀA²)ₖ
  • after t steps: P(k | π, t-step) = (πᵀAᵗ)ₖ

SLIDE 10

Similarity

  • A, B similar = random surfer tends to reach the same places when starting from A or B
  • P(k | π, t-step) = (πᵀAᵗ)ₖ
  • If π has all mass on i: row i of Aᵗ
  • Compare i & j: compare rows i and j of Aᵗ = UΣᵗUᵀ
  • Role of Σᵗ: raising to the power t shrinks the small values fast, so a few top components carry the comparison, giving low-D coordinates (sketch below)
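A sketch of one common spectral-embedding construction (spectral methods differ in details; the symmetric normalization and the toy graph are my choices): eigendecompose the normalized adjacency, drop the trivial top eigenvector, and scale coordinates by λᵗ.

```python
import numpy as np

def spectral_embedding(A, k=2, t=3):
    """Embed nodes of a symmetric adjacency matrix A in k dimensions.
    Uses the symmetrized transition matrix D^{-1/2} A D^{-1/2}; coordinates
    are its non-trivial top eigenvectors scaled by lambda^t, so comparing
    embedded points compares rows of the t-step walk matrix."""
    d = A.sum(axis=1)
    S = A / np.sqrt(np.outer(d, d))            # D^{-1/2} A D^{-1/2}
    lam, U = np.linalg.eigh(S)
    idx = np.argsort(-np.abs(lam))[1:k + 1]    # skip the trivial top eigenvector
    return U[:, idx] * lam[idx] ** t           # role of Sigma^t: damp small |lambda|

# toy graph: two triangles joined by a single edge
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
print(spectral_embedding(A, k=2, t=3))
```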

SLIDE 11

Role of Σᵗ (real data)

[Plot: values of Σᵗ on real data for t = 1, 3, 5, 10; larger t leaves only the leading components]

SLIDE 12

Example: dolphins

  • 62-dolphin social network near Doubtful Sound, New Zealand
  • Aij = 1 if dolphin i is friends with dolphin j (Lusseau et al., 2003)

[Figure: the 62×62 adjacency matrix of the dolphin network]

SLIDE 13

Dolphin network

!!"# !!"$ ! !"$ !"# !"% !!"% !!"# !!"$ ! !"$ !"# !"%

SLIDE 14

Comparisons

!!"# !!"$ !!"% ! !"% !"$ !"# !!"# !!"$ !!"% ! !"% !"$ !"# !"& !!"# !!"$ ! !"$ !"# !!"# !!"% !!"$ !!"& ! !"& !"$ !"% !"#

random embedding of dolphin data spectral embedding of random data

SLIDE 15

Spectral clustering

  • Use your favorite clustering algorithm on the coordinates from the spectral embedding (usage sketch below)

[Figure: spectral clustering of the dolphin network]

SLIDE 16

PCA: the good, the bad, and the ugly

  • The good: simple, successful
  • The bad: linear, Gaussian
  • E(X) = UVᵀ
  • X, U, V ~ Gaussian
  • The ugly: failure to generalize to new entities

SLIDE 17

Consistency

  • Linear & logistic regression are consistent
  • What would consistency mean for PCA?
  • forget about row/col means for now
  • Consistency:
  • #users, #movies, #ratings (= nnz(W))
  • numel(U), numel(V)
  • consistency =
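A one-line version of that count (my notation):

```latex
% My notation: n users, m movies, rank k; suppose each user rates at most c movies.
% parameters: numel(U) + numel(V) = k(n + m);   data: nnz(W) <= c n.
% Data per parameter stays bounded as n, m grow, so estimates cannot converge:
\frac{\mathrm{nnz}(W)}{k(n+m)} \;\le\; \frac{c\,n}{k(n+m)} \;\le\; \frac{c}{k}
```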

SLIDE 18

Failure to generalize

  • What does this mean for generalization?
  • new user’s rating of movie j: only info is Vj (we know nothing about the new Ui)
  • new movie rated by user i: only info is Ui (nothing about the new Vj)
  • all our carefully-learned factors give us: no way to predict the new entity’s ratings
  • Generalization is: broken for new users and movies

SLIDE 19

Hierarchical model

  • old, non-hierarchical model

SLIDE 20

Benefit of hierarchy

  • Now: only k μU latents, k μV latents (and corresponding σs)
  • can get consistency for these if we observe more and more Xij
  • For a new user or movie: fall back on the learned prior, N(μU, σU²) or N(μV, σV²) (sketch below)
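A generative sketch of the hierarchical model under the simplifying assumptions above (σs treated as known; all sizes and values here are invented for illustration):

```python
import numpy as np

# Generative sketch of the hierarchical model.
rng = np.random.default_rng(0)
n, m, k = 50, 40, 3
mu_U, mu_V = rng.normal(size=k), rng.normal(size=k)   # the only latents needing
sig_U = sig_V = 0.5                                   # consistency (plus sigmas)
sig = 0.1

U = mu_U + sig_U * rng.normal(size=(n, k))            # Ui ~ N(mu_U, sig_U^2 I)
V = mu_V + sig_V * rng.normal(size=(m, k))            # Vj ~ N(mu_V, sig_V^2 I)
X = U @ V.T + sig * rng.normal(size=(n, m))           # Xij ~ N(Ui . Vj, sig^2)

# A brand-new user falls back on the prior mean instead of on nothing:
print(mu_U @ V.T)                                     # predicted ratings, all movies
```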

SLIDE 21

Mean subtraction

  • Can now see that mean subtraction is a special case of our hierarchical model
  • Fix Vj1 = 1 for all j; then Ui1 = user i’s mean (a per-user offset)
  • Fix Ui2 = 1 for all i; then Vj2 = movie j’s mean (a per-movie offset)
  • global mean: absorbed by these two offset terms

SLIDE 22

What about the second rating for a new user?

  • Estimating Ui from one rating: k unknowns, one observation, badly underdetermined
  • knowing μU: the prior fills in the unobserved directions
  • result: a single point estimate is overconfident about the second rating
  • How should we fix? represent the uncertainty in Ui, e.g., by sampling from its posterior (next slide)
  • Note: often we have only a few ratings per user

SLIDE 23

MCMC for PCA

  • Can do Bayesian inference by Gibbs sampling; for simplicity, assume σs known

SLIDE 24

Recognizing a Gaussian

  • Suppose X ~ N(μ, σ²)
  • L = –log P(X=x | μ, σ²) = (x – μ)² / (2σ²) + const
  • dL/dx = (x – μ) / σ²
  • d²L/dx² = 1/σ²
  • So: if we see d²L/dx² = a, dL/dx = a(x – b), the distribution is Gaussian with
  • μ = b, σ² = 1/a

SLIDE 25

Gibbs step for an element of μU
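The derivation was presumably worked on the board; here is a hedged sketch of the resulting update, assuming U[i, d] ~ N(μU[d], σU²) and a N(0, s0²) hyperprior on μU[d] (the hyperprior is my assumption, not from the slides). It uses the "recognize a Gaussian" trick from the previous slide: collect the curvature a and the point b where the gradient vanishes.

```python
import numpy as np

# Sketch of one Gibbs step for a single element mu_U[d].
# Model assumed: U[i, d] ~ N(mu_U[d], sig_U^2), hyperprior mu_U[d] ~ N(0, s0^2).
def gibbs_mu_U(U, d, sig_U, s0=10.0, rng=np.random.default_rng()):
    n = U.shape[0]
    a = n / sig_U**2 + 1.0 / s0**2           # d2L/dx2 = a: posterior precision
    b = (U[:, d].sum() / sig_U**2) / a       # dL/dx = a(x - b): posterior mean
    return rng.normal(b, 1.0 / np.sqrt(a))   # mu = b, sigma^2 = 1/a
```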

SLIDE 26

Gibbs step for an element of U
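Similarly, a sketch of the single-element update for U[i, d], conditioning on everything else; same trick, with the prior N(μU[d], σU²) supplying one extra quadratic term:

```python
import numpy as np

# Sketch of one Gibbs step for U[i, d], holding all other variables fixed.
# Likelihood: X[i, j] ~ N(U[i] . V[j], sig^2); prior: U[i, d] ~ N(mu_U[d], sig_U^2).
def gibbs_U_element(X, U, V, i, d, mu_U, sig, sig_U, rng=np.random.default_rng()):
    # residuals of row i, with U[i, d]'s own contribution added back
    r = X[i] - U[i] @ V.T + U[i, d] * V[:, d]
    a = (V[:, d] ** 2).sum() / sig**2 + 1.0 / sig_U**2            # precision
    b = ((V[:, d] * r).sum() / sig**2 + mu_U[d] / sig_U**2) / a   # mean
    U[i, d] = rng.normal(b, 1.0 / np.sqrt(a))
```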

SLIDE 27

In reality

  • We’d do blocked Gibbs instead
  • Blocks contain entire rows of U or V
  • take gradient, Hessian to get mean, covariance
  • formulas look a lot like linear regression (normal equations); see the sketch below
  • And, we’d fit σU, σV too
  • sample 1/σ² from a Gamma (or Σ⁻¹ from a Wishart) distribution
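A sketch of the blocked row update described above, assuming every entry of row i is observed (missing entries would simply drop out of the sums). The precision matrix is exactly the ridge-regression normal-equations matrix:

```python
import numpy as np

# Blocked Gibbs for an entire row U[i]: a k-dimensional Gaussian whose
# precision and mean come from ridge-like normal equations.
# Assumes all m entries of X[i] are observed (my simplification).
def gibbs_U_row(X, U, V, i, mu_U, sig, sig_U, rng=np.random.default_rng()):
    k = V.shape[1]
    Lam = V.T @ V / sig**2 + np.eye(k) / sig_U**2     # Hessian = precision
    rhs = V.T @ X[i] / sig**2 + mu_U / sig_U**2
    mean = np.linalg.solve(Lam, rhs)                  # normal equations
    U[i] = rng.multivariate_normal(mean, np.linalg.inv(Lam))
```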

SLIDE 28

Nonlinearity: conjunctive features

[Figure: P(rent) as a conjunctive (AND-like) function of Comedy and Foreign]

SLIDE 29

Disjunctive features

[Figure: P(rent) as a disjunctive (OR-like) function of Comedy and Foreign]

SLIDE 30

“Other”

[Figure: P(rent) for an “other” pattern over Comedy and Foreign]

SLIDE 31

Non-Gaussian

  • X, U, and V could each be non-Gaussian
  • e.g., binary!
  • rents(U, M), comedy(M), female(U)
  • For X: predicting –0.1 instead of 0 is only as bad as predicting +0.1 instead of 0
  • For U, V: might infer –17% comedy or 32% female

SLIDE 32

Logistic PCA

  • Regular PCA: Xij ~ N(Ui ⋅ Vj, σ²)
  • Logistic PCA: Xij ~ Bernoulli(σ(Ui ⋅ Vj))

SLIDE 33

More generally…

  • Can have
  • Xij ∼ Poisson(μij), μij = exp(Ui ⋅ Vj)
  • Xij ∼ Bernoulli(μij), μij = σ(Ui ⋅ Vj)
  • Called exponential family PCA
  • Might expect optimization to be difficult (simple gradient sketch below)
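A minimal logistic-PCA sketch fitting U and V by alternating gradient ascent on the Bernoulli log-likelihood (the optimizer, step size, and toy data are my choices; the slides only warn that optimization might be difficult):

```python
import numpy as np

# Logistic PCA: Xij ~ Bernoulli(sigmoid(Ui . Vj)); fit by alternating
# gradient ascent on the log-likelihood (step size and iterations are guesses).
def logistic_pca(X, k=2, lr=0.1, iters=500, seed=0):
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = 0.1 * rng.normal(size=(n, k))
    V = 0.1 * rng.normal(size=(m, k))
    for _ in range(iters):
        P = 1.0 / (1.0 + np.exp(-(U @ V.T)))     # mu_ij = sigma(Ui . Vj)
        G = X - P                                # d(log-lik)/d(Ui . Vj)
        U, V = U + lr * G @ V, V + lr * G.T @ U  # chain rule for U and V
    return U, V

X = (np.random.default_rng(1).random((30, 20)) < 0.5).astype(float)
U, V = logistic_pca(X)
```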

SLIDE 34

Application: fMRI

:-) :-)) ;->

fMRI fMRI fMRI Brain activity

Stimulus Voxels

Y

stimulus: “dog” stimulus: “cat” stimulus: “hammer” credit: Ajit Singh

SLIDE 35

Results (logistic PCA)

credit: Ajit Singh

[Bar chart: mean squared error for fold-in prediction of Y (the fMRI data); lower is better. Methods: HBCMF, HCMF, CMF; maximum a posteriori with fixed hyperparameters; just using fMRI data vs. augmenting fMRI data with word co-occurrence]
