Mutual Angular Regularization of Latent Variable Models: Theory, Algorithm and Applications
Pengtao Xie
Joint work with Yuntian Deng and Eric Xing, Carnegie Mellon University
Latent Variable Models (LVMs)
Examples: Hidden Markov Model, Kalman Filtering, Restricted Boltzmann Machine, Deep Belief Network, Factor Analysis, etc.; Neural Network, Sparse Coding, Matrix Factorization, Distance Metric Learning, Principal Component Analysis, etc.
[Figure: Topic Models discover topics in documents (e.g., politics: Obama, Constitution, Government, Politics; economics: GDP, Bank, Marketing, Economics; education: University, Knowledge, Student, Education). Gaussian Mixture Models discover groups in images (e.g., tiger, car, food).]
Popularity of latent factors follows a power-law distribution

[Figure: topics in news and groups in Flickr photos. Dominant topics (e.g., politics: Obama, Constitution, Government, Politics; economics: GDP, Bank, Marketing, Economics) coexist with long-tail topics; dominant groups (e.g., furniture: sofa, closet, curtain; flowers: rose, tulip, lily) coexist with long-tail groups (e.g., car, food, painting, diamond).]
Latent Dirichlet Allocation (LDA)
- "Extremely common words tend to dominate all topics" (Wallach, 2009)
- Tencent Peacock LDA: "When learning ≥ 10^5 topics, around 20%–40% of topics have duplicates in practice" (Wang, 2015)

Restricted Boltzmann Machine
- Ran on the 20-Newsgroups dataset: many duplicate topics (e.g., the three exemplar topics below are all about politics)
- Common words occur repeatedly across topics, such as iraq, clinton, united, weapons

Topic 1: president, clinton, iraq, united, spkr, house, people, lewinsky, government, white
Topic 2: iraq, united, un, weapons, iraqi, nuclear, india, minister, saddam, military
Topic 3: iraq, un, iraqi, lewinsky, saddam, clinton, baghdad, inspectors, weapons, white
[Figure: latent factors behind data are captured by the components of an LVM.]

- The number of long-tail factors is large
- Long-tail factors can be more important than dominant factors
- Example: Tencent applied topic models to advertising and showed that long-tail topics such as "lose weight" and "nursing" improve click-through rate by 40% (Jin, 2015)
- Capturing long-tail factors therefore requires a large number of components
Tradeoff between Expressiveness and Complexity
- Small K: low expressiveness, low complexity
- Large K: high expressiveness, high complexity
- Can we achieve the best of both worlds? Small K: high expressiveness, low complexity
[Figure: data samples and LVM components, without diversification vs. with diversification.]

- Use components to capture the principal directions of the data point cloud
- Goal: encourage the components to diversely spread out
- Approach:
  - Define a score based on mutual angles to measure the diversity of components
  - Use the score to regularize latent variable models and control the geometry of the latent space during learning
Outline: Mutual Angular Regularizer, Algorithm, Applications, Theory
Components are parametrized by vectors
- In Latent Dirichlet Allocation, each topic has a multinomial vector
- In Sparse Coding, each dictionary item has a real vector

Two ingredients: measure the dissimilarity between two vectors, then measure the diversity of a vector set
Desired property: invariance to scale, translation, rotation and orientation of the vectors

- Euclidean distance, L1 distance: the distance d is not invariant to scale
- Negative cosine similarity: a is not invariant to orientation (e.g., flipping one vector changes a = 0.6 to a = −0.6)
- Non-obtuse angle θ: invariant to scale, translation, rotation and orientation of the vectors

Definition:

$$\theta(\mathbf{x}, \mathbf{y}) = \arccos\left(\frac{|\mathbf{x}^\top \mathbf{y}|}{\|\mathbf{x}\|_2\,\|\mathbf{y}\|_2}\right)$$
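As a quick illustration (ours, not from the slides; assumes numpy), the non-obtuse angle can be computed as:

```python
import numpy as np

def nonobtuse_angle(x, y):
    """Non-obtuse angle theta(x, y) in [0, pi/2].

    Invariant to scale (the norms cancel) and to orientation
    (the absolute value folds angles beyond pi/2 back).
    """
    cos = abs(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(cos, 0.0, 1.0))  # clip guards rounding error

# Scaling or flipping a vector leaves the angle unchanged
x, y = np.array([1.0, 0.0]), np.array([1.0, 1.0])
assert np.isclose(nonobtuse_angle(x, y), nonobtuse_angle(3 * x, -y))
```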
Based on the pairwise dissimilarity measure, the diversity of a set of vectors A = {a_i}_{i=1}^K is defined via the non-obtuse angles

$$\theta_{ij} = \arccos\left(\frac{|\mathbf{a}_i^\top \mathbf{a}_j|}{\|\mathbf{a}_i\|_2\,\|\mathbf{a}_j\|_2}\right)$$

- Mean of angles: summarizes how these vectors differ from each other
- Variance of angles: encourages the vectors to evenly spread out
Mutual Angular Regularizer: mean of angles minus variance of angles

$$\Omega(A) = \underbrace{\frac{1}{K(K-1)}\sum_{i=1}^{K}\sum_{j\neq i}\theta_{ij}}_{\text{mean of angles}} \;-\; \underbrace{\frac{1}{K(K-1)}\sum_{i=1}^{K}\sum_{j\neq i}\Big(\theta_{ij}-\frac{1}{K(K-1)}\sum_{p=1}^{K}\sum_{q\neq p}\theta_{pq}\Big)^{2}}_{\text{variance of angles}}$$
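A direct numpy sketch of this regularizer (our illustration; the function name is ours):

```python
import numpy as np

def mutual_angular_regularizer(A):
    """Mean minus variance of the pairwise non-obtuse angles.

    A: (K, d) array whose rows are the component vectors a_1..a_K.
    """
    K = A.shape[0]
    norms = np.linalg.norm(A, axis=1)
    cos = np.abs(A @ A.T) / np.outer(norms, norms)
    theta = np.arccos(np.clip(cos, 0.0, 1.0))
    angles = theta[~np.eye(K, dtype=bool)]  # off-diagonal pairs only
    return angles.mean() - angles.var()

# Orthogonal components maximize the score (mean pi/2, variance 0)
print(mutual_angular_regularizer(np.eye(4)))  # ~1.5708
```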
Challenge: the mutual angular regularizer Ω(A) over the components A = {a_i}_{i=1}^K is non-smooth and difficult to optimize directly
- Derive a smooth lower bound
- The lower bound is easier to derive if the parameter vectors lie on a sphere
- Decompose the parameter vectors into magnitudes and directions
- Proved that optimizing the lower bound with projected gradient ascent can increase the regularizer in each iteration
Reparametrize each component as a_i = g_i ã_i, where g_i > 0 is the magnitude and ã_i, with ‖ã_i‖₂ = 1, is the direction; then

$$\Omega(A) = \Omega\big(\mathrm{diag}(\mathbf{g})\,\tilde{A}\big)$$

Alternating Optimization:
- Fix the magnitudes g, optimize the directions Ã
- Fix the directions Ã, optimize the magnitudes g
Fixing the magnitudes, the direction subproblem is

$$\max_{\tilde{A}}\; L(\tilde{A}; \mathbf{g}, D) + \lambda\,\Omega(\tilde{A}) \quad \text{s.t. } \|\tilde{\mathbf{a}}_i\|_2 = 1,\; i = 1,\dots,K$$

Lower bound:
$$\Gamma(\tilde{A}) = \arcsin\!\big(\det(\tilde{A}^{\top}\tilde{A})\big) - \Big(\frac{\pi}{2} - \arcsin\!\big(\det(\tilde{A}^{\top}\tilde{A})\big)\Big)^{2}$$
Intuition of the lower bound: det(ÃᵀÃ) is the squared volume of the parallelepiped formed by the vectors in Ã. The larger det(ÃᵀÃ) is, the more likely it is that the vectors in Ã have larger angles (not surely). Γ(Ã) is an increasing function w.r.t. det(ÃᵀÃ), hence a larger Γ(Ã) is likely to yield a larger Ω(Ã). Optimize the lower bound, which is smooth and much more amenable to gradient-based optimization.
If the lower bound is optimized with projected gradient ascent (PGA):
- Optimizing the lower bound with PGA can increase the mean of the angles, $\frac{1}{K(K-1)}\sum_{i=1}^{K}\sum_{j\neq i}\theta_{ij}$, in each iteration
- Optimizing the lower bound with PGA can decrease the variance of the angles, $\frac{1}{K(K-1)}\sum_{i=1}^{K}\sum_{j\neq i}\big(\theta_{ij}-\bar{\theta}\big)^{2}$ (with $\bar{\theta}$ the mean above), in each iteration

Why: the gradient of the lower bound w.r.t. ã_i is orthogonal to all the other vectors {ã_1, …, ã_K} \ {ã_i}, so moving ã_i along its gradient direction enlarges its angles with them.
[Figure: ã_1, ã_2, ã_3 are parameter vectors; the gradient w.r.t. ã_1 is orthogonal to ã_2 and ã_3, and after moving ã_1 along it to ã_1', the angle between ã_1' and ã_3 is greater than the angle between ã_1 and ã_3.]
Overall algorithm (with a_i = g_i ã_i):

While not converged:
- Fix the magnitudes g; update the directions by solving the following subproblem with projected gradient ascent or other methods:
  max_Ã L(Ã; g, D) + λ Γ(Ã)  s.t. ‖ã_i‖₂ = 1, i = 1, …, K
- Fix the directions Ã; update the magnitudes by solving the following subproblem with projected gradient ascent:
  max_g L(g; Ã, D)
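A schematic numpy sketch of one direction update (ours). For numerical convenience we substitute a log-det surrogate, log det(ÃÃᵀ), for the arcsin-based Γ above; `grad_data_ll` stands in for the gradient of the data term and is assumed given:

```python
import numpy as np

def log_det_bound(A_tilde):
    """Smooth diversity surrogate: log det of the Gram matrix of the directions."""
    gram = A_tilde @ A_tilde.T
    return np.linalg.slogdet(gram)[1]

def pga_direction_step(A_tilde, grad_data_ll, lam=0.1, lr=0.01):
    """One projected gradient ascent step on the direction subproblem.

    A_tilde: (K, d) matrix of unit-norm direction vectors (assumes K <= d).
    grad_data_ll: (K, d) gradient of the data term w.r.t. A_tilde (assumed given).
    """
    # d/dA log det(A A^T) = 2 (A A^T)^{-1} A = 2 pinv(A)^T for full-row-rank A
    grad_div = 2.0 * np.linalg.pinv(A_tilde).T
    A_new = A_tilde + lr * (grad_data_ll + lam * grad_div)
    # Project back onto the unit sphere (the ||a_i|| = 1 constraint)
    return A_new / np.linalg.norm(A_new, axis=1, keepdims=True)
```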
Task: learn representations for documents

Baselines: Bag-of-Words (BOW); Latent Dirichlet Allocation (LDA); LDA regularized with a Determinantal Point Process prior (DPP-LDA); Pitman-Yor Process Topic Model (PYTM); Latent IBP Compound Dirichlet Allocation (LIDA); Neural Autoregressive Topic Model (DocNADE); Paragraph Vector (PV); Restricted Boltzmann Machine (RBM)

Evaluation: retrieval (precision@100) and clustering (accuracy)

Datasets:

          #categories  #samples  vocab. size
TDT       30           9394      5000
20-News   20           18846     5000
Reuters   9            7195      5000
[Figure: retrieval precision@100 (%) vs. number of hidden units K (25, 50, 100, 200, 500) on TDT, 20-News and Reuters, comparing RBM and MAR-RBM.]
Retrieval precision@100 (%):

          TDT   20-News  Reuters
BOW       40.9  7.4      69.3
LDA       79.4  19.6     68.5
DPP-LDA   81.9  18.2     69.9
PYTM      78.7  20.1     70.6
LIDA      77.9  21.8     71.4
DocNADE   80.3  16.8     72.6
PV        81.7  19.1     76.9
RBM       47.4  22.3     70.1
MAR-RBM   84.2  24.9     75.9
[Figure: clustering accuracy (%) vs. number of hidden units K (25, 50, 100, 200, 500) on TDT, 20-News and Reuters, comparing RBM and MAR-RBM.]
Clustering accuracy:

$$\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}\big(t_i = \mathrm{map}(c_i)\big)$$

where t_i is the true label of document i, c_i its cluster label, map(·) the Kuhn-Munkres permutation mapping, and 𝕀(·) the indicator function.
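A small sketch of this metric (ours; assumes numpy and scipy):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_labels):
    """Accuracy under the best cluster-to-class permutation (Kuhn-Munkres)."""
    t = np.asarray(true_labels)
    c = np.asarray(cluster_labels)
    n_classes = max(t.max(), c.max()) + 1
    # Contingency table: count[i, j] = #points in cluster i with true class j
    count = np.zeros((n_classes, n_classes), dtype=int)
    for ci, ti in zip(c, t):
        count[ci, ti] += 1
    # Maximize matched counts = minimize negated counts
    rows, cols = linear_sum_assignment(-count)
    return count[rows, cols].sum() / len(t)

print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```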
Clustering accuracy (%):

          TDT   20-News  Reuters
BOW       51.3  21.3     49.7
LDA       45.2  21.9     51.2
DPP-LDA   46.3  10.9     49.3
PYTM      46.9  21.5     51.7
LIDA      47.3  17.4     53.1
DocNADE   45.7  18.7     48.7
PV        48.2  24.3     52.8
RBM       23.3  22.7     47.6
MAR-RBM   52.4  29.4     60.9
Per-category retrieval precision@100 on Reuters (categories ordered by size); the gains of MAR-RBM are largest on the long-tail categories:

Category ID              1     2     3     4     5     6     7     8     9
#Documents               3713  2055  321   298   245   197   142   114   110
Precision@100, RBM       0.69  0.44  0.09  0.10  0.06  0.04  0.04  0.03  0.03
Precision@100, MAR-RBM   0.90  0.80  0.31  0.40  0.27  0.23  0.09  0.14  0.13
Relative improvement     31%   81%   245%  289%  324%  421%  148%  366%  397%
Distance Metric Learning
- Wide applications in retrieval, clustering and classification
- Learn a distance under which similar pairs are close and dissimilar pairs are far apart
- The metric is parametrized by a d × K projection matrix A that maps the original feature vector x to a latent representation z = Aᵀx
- The number of components K incurs a tradeoff between model complexity and expressiveness
- Distance Metric Learning with Mutual Angular Regularization (MAR-DML): apply the regularizer to the K projection vectors
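To make the parametrization concrete, a minimal sketch (ours; A is the d × K projection matrix from the slide):

```python
import numpy as np

def metric_distance(A, x1, x2):
    """Squared distance after projecting into the latent space.

    A: (d, K) projection matrix whose K columns are the components
       that the mutual angular regularizer diversifies.
    """
    z1, z2 = A.T @ x1, A.T @ x2   # latent representations
    return float(np.sum((z1 - z2) ** 2))

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))
x1, x2 = rng.standard_normal(5), rng.standard_normal(5)
print(metric_distance(A, x1, x2))
```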
Baselines: Euclidean distance (EUC); Distance Metric Learning (DML); Large Margin Nearest Neighbor (LMNN); Information Theoretical Metric Learning (ITML); Distance Metric Learning with Eigenvalue Optimization (DML-eig); Information-theoretic Semi-supervised Metric Learning via Entropy Regularization (Seraph)

Evaluation: retrieval (precision) and clustering (accuracy)

Datasets:

              Feature dim.  #training data  #data pairs
20-News       5000          11.3K           200K
15-Scenes     1000          3.2K            200K
6-Activities  561           7.4K            200K
[Figure: retrieval precision (%) vs. number of components K on 20-News (K = 10–900), 15-Scenes and 6-Activities (K = 10–200), comparing DML and MAR-DML.]
Retrieval precision (%):

          20-News  15-Scenes  6-Activities
EUC       62.8     65.3       85.0
DML       76.2     80.8       94.5
LMNN      67.0     70.3       71.5
ITML      74.7     79.1       94.2
DML-eig   71.2     71.3       86.7
Seraph    75.8     82.0       89.2
MAR-DML   81.1     83.6       96.2
[Figure: clustering accuracy (%) vs. number of components K on 20-News, 15-Scenes and 6-Activities, comparing DML and MAR-DML.]
Clustering accuracy (%):

          20-News  15-Scenes  6-Activities
EUC       36.5     29.0       61.6
DML       28.4     40.1       76.1
LMNN      32.9     33.6       56.9
ITML      34.5     38.2       93.4
DML-eig   27.3     26.6       63.3
Seraph    48.1     48.2       74.8
MAR-DML   44.6     51.3       96.6
Theory: study how the mutual angular regularizer affects generalization
- Use the multi-layer perceptron (MLP) as a specific instance
  - The MLP is a widely used supervised latent space model
  - Rich PAC-based analysis exists for MLPs with one hidden layer and can be leveraged for our study
- Major results:
  - Increasing the mutual angles can reduce estimation error
  - Choosing proper mutual angles can reduce approximation error
Setup
- Predict an output y ∈ 𝒴 given an input x ∈ 𝒳
- Let ℱ be a set of hypotheses
- Let ℓ: 𝒳 × 𝒴 × ℱ → ℝ be a loss function
- Let p* be the distribution over 𝒳 × 𝒴

Definitions
- Generalization error: L(f) = 𝔼_{(x,y)∼p*}[ℓ(x, y, f)]
- Expected risk minimizer: f* ∈ argmin_{f∈ℱ} L(f)
- Empirical risk: L̂(f) = (1/n) Σ_{i=1}^n ℓ(x^{(i)}, y^{(i)}, f)
- Empirical risk minimizer: f̂ ∈ argmin_{f∈ℱ} L̂(f)
- The generalization error of f̂ decomposes as L(f̂) = [L(f̂) − L(f*)] + L(f*) = estimation error + approximation error
- Estimation error, L(f̂) − L(f*): the difference between the generalization errors of f̂ and f*
- Approximation error, L(f*): the best generalization error achievable by the hypothesis set
Setup for the estimation error analysis
- Task: univariate regression
- Network structure: input layer, one hidden layer, output layer
- Activation function h(t): Lipschitz continuous with constant L, e.g., sigmoid, tanh, rectified linear
- Let x ∈ R^d be the input feature vector, with ‖x‖₂ ≤ C₁
- Let y be the response value, with |y| ≤ C₂
- Let w_j ∈ R^d be the weight vector of the jth hidden unit, j = 1, …, m, with ‖w_j‖₂ ≤ C₃; further, assume the angle θ(w_i, w_j) between w_i and w_j is lower bounded by a constant θ for all i ≠ j
- Let β ∈ R^m be the weights connecting the hidden units to the output, with ‖β‖₂ ≤ C₄
- Hypothesis: f(x) = Σ_{j=1}^m β_j h(w_jᵀ x); let ℱ denote the hypothesis set
- Loss function: ℓ(x, y, f) = (f(x) − y)²
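For concreteness, the hypothesis and loss in code (our sketch of the stated setup):

```python
import numpy as np

def mlp_predict(x, W, beta, h=np.tanh):
    """One-hidden-layer MLP: f(x) = sum_j beta_j * h(w_j^T x).

    W: (m, d) matrix whose rows w_j are the hidden-unit weight vectors
       (the vectors whose pairwise angles MAR keeps large).
    """
    return beta @ h(W @ x)

def squared_loss(x, y, W, beta):
    """The loss l(x, y, f) = (f(x) - y)^2 used in the analysis."""
    return (mlp_predict(x, W, beta) - y) ** 2
```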
Theorem 1. With probability at least 1 − δ,

$$L(\hat{f}) - L(f^*) \le \frac{8\left(\sqrt{J} + C_2\right)\left(2LC_1C_3C_4 + C_4|h(0)|\right)\sqrt{m}}{\sqrt{n}} + \left(\sqrt{J} + C_2\right)^{2}\sqrt{\frac{2\ln(2/\delta)}{n}}$$

where

$$J = m\,C_4^{2}\left(|h(0)| + L\,C_1C_3\sqrt{(m-1)\cos\theta + 1}\right)^{2}$$
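A quick numeric check of the θ-dependence (ours, using the J reconstructed above; the constants are arbitrary):

```python
import numpy as np

# J from Theorem 1 (as reconstructed above); arbitrary constants
m, L, C1, C3, C4 = 50, 1.0, 1.0, 1.0, 1.0
h0 = np.tanh(0.0)  # h(0) for the tanh activation

def J(theta):
    return m * C4**2 * (abs(h0) + L * C1 * C3 * np.sqrt((m - 1) * np.cos(theta) + 1)) ** 2

for theta in (0.1, 0.5, 1.0, np.pi / 2):
    print(f"theta={theta:.2f}  J={J(theta):.1f}")  # J shrinks as theta grows
```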
- J is a decreasing function w.r.t. θ, hence the estimation error bound decreases as θ increases
- Increasing the mutual angular regularizer increases the mean and decreases the variance of the pairwise angles, hence increases the lower bound θ of these angles
- Takeaway: larger angles induce a lower estimation error bound

The analysis has been extended to:
- Multiple outputs
- Multiple hidden layers
- Other losses: hinge loss, logistic loss, cross-entropy loss
Additional Setup
- Let G = { g : g(x) = βh(wᵀx) } be the set of single-hidden-unit functions, with ‖w‖₂ and |β| bounded as before
- The hypothesis f_m is constructed iteratively: f₁ ∈ G, and f_k = α f_{k−1} + (1 − α) g_k with 0 < α ≤ c < 1
- Let G_k = { g ∈ G : ∀ j < k, θ(w, w_j) ≥ θ }, where w and w_j are the weight vectors of g and g_j respectively, and g_j is the function selected at step j when constructing f_j = α f_{j−1} + (1 − α) g_j
- The target function f* satisfies ‖f*‖ ≤ b_{f*} < ∞, with ⟨f*, g⟩ < ∞ for all g ∈ G_θ
- Approximation error: ‖f_m − f*‖²
Theorem 2. Let e denote a constant (characterized below) in which V, the volume of the input L2 ball, and s(θ), a non-decreasing function w.r.t. θ, appear. Suppose f₁ is chosen to satisfy

$$\|f_1 - f^*\|^2 \le \inf_{g \in G} \|g - f^*\|^2 + \epsilon_1$$

and iteratively f_k is chosen to satisfy

$$\|f_k - f^*\|^2 \le \inf_{0 < \alpha \le c}\; \inf_{g \in G_k} \|\alpha f_{k-1} + (1-\alpha)\, g - f^*\|^2 + \epsilon_k$$

where $\epsilon_k \le \frac{\rho(\rho+1)\,e}{k(k+\rho)}$ and ρ is a small constant. Then for every m ≥ 1,

$$\|f_m - f^*\|^2 \le \frac{(\rho+1)\,e}{m}$$
The constant e is a sum of two terms built from b_{f*}, C₁, C₃, C₄, the volume V, s(θ), cos θ and c: one term decreases w.r.t. θ while the other is non-increasing w.r.t. θ. Hence a properly chosen θ can reduce the approximation error bound.
[Figure: generalization error of the MLP on a speech dataset under different numbers of hidden units and different regularization strengths λ (larger λ induces larger angles). The error first decreases and then increases with λ, consistent with the theory that a proper lower bound of the angles yields the lowest generalization error.]
Summary: Mutual Angular Regularization (MAR) of Latent Variable Models
- MAR reduces model complexity while preserving expressiveness by encouraging components to diversely spread out
- Theory: larger mutual angles reduce estimation error; properly chosen angles reduce approximation error
- Algorithm: optimizing a smooth lower bound with projected gradient ascent can increase the regularizer in each iteration
- Applications: MAR-RBM for document modeling and MAR-DML for distance metric learning improve retrieval and clustering, especially on long-tail categories
Papers/slides/code/documents are available at http://www.cs.cmu.edu/~pengtaox/projects/dlvm.html