Contrastive learning, multi-view redundancy, and linear models - PowerPoint PPT Presentation



slide-1
SLIDE 1

Contrastive learning, multi-view redundancy, and linear models

Daniel Hsu Columbia University Joint work with: Akshay Krishnamurthy (Microsoft Research) Christopher Tosh (Columbia University)

Johns Hopkins MINDS & CIS Seminar - October 6th, 2020

slide-2
SLIDE 2

Learning representations of data

[Panels: probabilistic modeling; deep learning]

Image credit: stats.stackexchange.com; bdtechtalks.com

slide-3
SLIDE 3

Goal of representation learning

Image credit: towardsdatascience.com

Learned from data

slide-4
SLIDE 4

Deep neural networks: Already doing it?

[Figure: a multi-task network with a shared representation; tasks A, B, C produce outputs y1, y2, y3 from shared subsets of factors computed from a common input]

Image credit: [Bengio, Courville, Vincent, 2014]

slide-5
SLIDE 5

Unsupervised / semi-supervised learning

[Pipeline diagram: unlabeled data → self-supervised learning → feature map φ; labeled data + φ(x) → down-stream supervised task → predictor]

slide-6
SLIDE 6

"Self-supervised learning"

  • Idea: Learn to solve self-derived prediction problems, then introspect.
  • Example: Images
  • Predict color channel from grayscale channel

[Zhang, Isola, Efros, 2017]

  • Example: Text documents
  • Predict missing word in a sentence from context

[Mikolov, Sutskever, Chen, Corrado, Dean, 2013; Dhillon, Foster, Ungar, 2011]

  • Example: Dynamical systems
  • Predict future observations from past observations

[Yule, 1927; Langford, Salakhutdinov, Zhang, 2009]

slide-7
SLIDE 7

Self-supervised learning problem with text documents

[Figure: 2 positive examples and 2 negative examples of paired document halves]

Positive examples: documents from a natural corpus. Negative examples: first half of a document, randomly paired with the second half of another document. Can create training data from unlabeled documents!
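For concreteness, here is a minimal Python sketch (not from the talk) of this pair construction; the function name, the even half-split, and the resampling loop are assumptions for illustration.

```python
import random

def make_contrastive_pairs(documents, seed=0):
    """Build (x, x', label) triples from unlabeled tokenized documents.

    Positive pair: the two halves of the same document (label 1).
    Negative pair: a first half re-paired with the second half of a
    different, randomly chosen document (label 0).
    """
    rng = random.Random(seed)
    halves = [(doc[:len(doc) // 2], doc[len(doc) // 2:]) for doc in documents]
    pairs = [(x, x_prime, 1) for (x, x_prime) in halves]   # positives
    for i, (x, _) in enumerate(halves):
        j = rng.randrange(len(halves))
        while j == i:                                       # avoid re-pairing a doc with itself
            j = rng.randrange(len(halves))
        pairs.append((x, halves[j][1], 0))                  # negatives
    return pairs
```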

slide-8
SLIDE 8

Representations from self-supervised learning

Improves down-stream supervised learning performance in many cases

[Mikolov, Sutskever, Chen, Corrado, Dean, 2013; Logeswaran & Lee, 2018; Oord, Li, Vinyals, 2018; Arora, Khandeparkar, Khodak, Plevrakis, Saunshi, 2019]

!("The S&P 500 fell more than 3.3 percent") !("European markets recorded their worst session since 2016") !("The new mascot appears to have bushier eyebrows")

Q: For what problems can we prove these representations are useful?

slide-9
SLIDE 9

What's in the representation?

To understand the representations, we look to probabilistic modeling…

Our focus: representations φ derived from "contrastive learning"

slide-10
SLIDE 10

Our theoretical results (informally)

  • 1. Assume unlabeled data follow a topic model (e.g., LDA). Then:

representation φ(x) = linear transform of topic posterior moments (of orders up to the document length).

[Figure: a document's topic mixture over {sports, science, politics, business}, e.g. (1/5, 2/5, 2/5, 0), with words drawn i.i.d. from the induced mixture]

slide-11
SLIDE 11

Our theoretical results (informally)

  • 2. Assume unlabeled data has two views x and x′, each with near-optimal MSE for predicting a target variable y (possibly using non-linear functions). Then: a linear function of φ(x) can achieve near-optimal MSE.

slide-12
SLIDE 12

Our theoretical results (informally)

  • 3. Error transform theorem: the excess error in the down-stream supervised learning task (with linear functions of φ̂(x)) is bounded in terms of the excess error in the self-supervised learning problem;

i.e., better solutions to the self-supervised learning problem yield better representations for the down-stream supervised learning task.

slide-13
SLIDE 13

Rest of the talk

  • 1. Representation learning method & topic model analysis
  • 2. Multi-view redundancy analysis
  • 3. Experimental study
slide-14
SLIDE 14
  • 1. Representation learning method & topic model analysis

slide-15
SLIDE 15

The plan

  • a. Formalize the contrastive learning problem and representation
  • b. Interpret the representation in context of topic models

[Pipeline diagram: unlabeled data → feature map φ; labeled data + φ(x) → predictor]

slide-16
SLIDE 16

Self-supervised learning problem with text documents

[Figure: 2 positive examples and 2 negative examples of paired document halves]

Positive examples: documents from a natural corpus. Negative examples: first half of a document, randomly paired with the second half of another document. Can create training data from unlabeled documents!

slide-17
SLIDE 17

"Contrastive learning"

  • Learn a predictor to discriminate between (x, x′) ∼ P_{X,X′} [positive example] and (x, x′) ∼ P_X ⊗ P_{X′} [negative example]
  • Specifically, estimate the odds ratio

r*(x, x′) = Pr[positive ∣ (x, x′)] / Pr[negative ∣ (x, x′)]

by training a neural network (or whatever) using a loss function like logistic loss on random positive & negative examples (which are, WLOG, evenly balanced: 0.5 P_{X,X′} + 0.5 P_X ⊗ P_{X′}).

[ Steinwart, Hush, Scovel, 2005; Abe, Zadrozny, Langford, 2006; Gutmann & Hyvärinen, 2010; Oord, Li, Vinyals, 2018; Arora, Khandeparkar, Khodak, Plevrakis, Saunshi, 2019; … ]
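A minimal PyTorch sketch of this estimation step is below; the bag-of-words featurization, the two-layer scoring network, and all hyperparameters are illustrative assumptions, not the talk's architecture. With evenly balanced positive/negative examples and logistic loss, exp of the learned logit estimates the odds ratio r*(x, x′).

```python
import torch
import torch.nn as nn

class PairDiscriminator(nn.Module):
    """Scores a (x, x') pair with a single logit."""
    def __init__(self, vocab_size, hidden=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * vocab_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, x_prime):
        # x, x_prime: bag-of-words count vectors of shape (batch, vocab_size)
        return self.net(torch.cat([x, x_prime], dim=-1)).squeeze(-1)

def training_step(model, optimizer, x, x_prime, labels):
    # labels: 1.0 for positive pairs, 0.0 for negative pairs (evenly balanced)
    logits = model(x, x_prime)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def odds_ratio(model, x, x_prime):
    # With balanced classes, exp(logit) = Pr[positive | (x,x')] / Pr[negative | (x,x')]
    with torch.no_grad():
        return torch.exp(model(x, x_prime))
```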

slide-18
SLIDE 18

Constructing the representation

  • Given an estimate r̂ of r*, construct an embedding function for document halves:

φ̂(x) := ( r̂(x, l_i) : i = 1, …, m ) ∈ ℝ^m

where l_1, …, l_m are "landmark documents".

slide-19
SLIDE 19

Topic model

  • k topics, each specifying a distribution over the vocabulary
  • A document is associated with its own distribution θ over the k topics
  • Words in the document (bag of words): i.i.d. from the induced mixture distribution (a generative sketch follows below)
  • Assume they are arbitrarily partitioned into two halves, x and x′

[Figure: a document's topic mixture over {sports, science, politics, business}, e.g. (1/5, 2/5, 2/5, 0), with words drawn i.i.d. from the induced mixture] E.g., [Hofmann, 1999; Blei, Ng, Jordan, 2003; …]
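For concreteness, a small NumPy sketch of this generative process (an LDA-style model with a symmetric Dirichlet prior, which is an assumed choice) is given below.

```python
import numpy as np

def sample_document(topic_word, alpha, doc_len, rng):
    """Sample one document from an LDA-style topic model and split it in two.

    topic_word: (k, vocab_size) matrix; row j is topic j's word distribution.
    alpha: symmetric Dirichlet concentration over the k topics.
    Returns the document's topic mixture theta and the two halves x, x'.
    """
    k, vocab_size = topic_word.shape
    theta = rng.dirichlet(alpha * np.ones(k))                   # document's topic mixture
    word_dist = theta @ topic_word                              # induced mixture over words
    words = rng.choice(vocab_size, size=doc_len, p=word_dist)   # i.i.d. bag of words
    return theta, words[: doc_len // 2], words[doc_len // 2:]   # arbitrary split into halves

rng = np.random.default_rng(0)
topic_word = rng.dirichlet(0.1 * np.ones(5000), size=4)         # 4 toy topics over 5,000 words
theta, x, x_prime = sample_document(topic_word, alpha=0.5, doc_len=40, rng=rng)
```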

slide-20
SLIDE 20

Simple case: One topic per document

  • Suppose θ ∈ {e_1, …, e_k} (i.e., the document is about only one topic)
  • Fact: odds ratio = density ratio:

r*(x, x′) := Pr[positive ∣ (x, x′)] / Pr[negative ∣ (x, x′)] = p_{X,X′}(x, x′) / (p_X(x) p_{X′}(x′))

(left-hand side: estimated using contrastive learning; right-hand side: interpreted via the data-generating distribution)

slide-21
SLIDE 21

Interpreting the density ratio…

!","$(&, &') !

" & !"$(&') = * +,- . Pr 1+ Pr & ∣ 1+ Pr &′ ∣ 1+

!

" & !"$(&′)

= *

+,- . Pr 1+ ∣ & Pr &′ ∣ 1+

!"$(&′) = 4 & 56 &′ !"$ &′

Posterior over topics given & Likelihood of topics given &′ Density ra7o Using BoW assump7on

slide-22
SLIDE 22

Inside the embedding

  • Embedding: φ*(x) = ( r*(x, l_i) : i = 1, …, m ), where

r*(x, x′) = π(x)^⊤ λ(x′) / p_{X′}(x′)

  • Therefore

φ*(x) = A π(x), where the i-th row of A is λ(l_i)^⊤ / p_{X′}(l_i)

(π(x): posterior over topics given x; rows of A: scaled likelihoods of the topics given the landmarks l_i)

slide-23
SLIDE 23

Upshot in the simple case

  • In the "one topic per document" case, document embedding is a

linear transforma7on of the posterior over topics !∗ # = % & #

  • Theorem: If % is full-rank, every linear func7on of topic posterior can

be expressed as a linear func7on of !∗(⋅)

slide-24
SLIDE 24

General case: Exploit bag-of-word structure

  • In general, the posterior distribution over θ (the topic distribution) given x is not summarized by just a k-dimensional vector.
  • If x and x′ each have n words:
  • Let ν_n(v) := ( v^α : |α| ≤ n ), where v^α = ∏_{j∈[k]} v_j^{α_j} for v ∈ ℝ^k (a small enumeration sketch follows after this list)
  • Let π(x) := E[ ν_n(θ) ∣ x ] (the order-n multivariate conditional moments of θ)
  • There is a corresponding λ(⋅) (depending on the topic model parameters) such that r*(x, x′) = π(x)^⊤ λ(x′) / p_{X′}(x′)
  • Theorem: There is a choice of landmark documents such that φ*(x) yields a (linear transform of) the conditional moments of θ of orders ≤ n.
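As referenced above, here is a small sketch of ν_n, enumerating the monomials v^α with total degree at most n; the helper name is an assumption for illustration.

```python
import itertools
import numpy as np

def nu(v, n):
    """All monomials v^alpha with |alpha| <= n, for v in R^k.

    Each multiset of coordinate indices of size d corresponds to one
    monomial of total degree d; degree 0 contributes the constant 1.
    """
    feats = []
    for degree in range(n + 1):
        for alpha in itertools.combinations_with_replacement(range(len(v)), degree):
            feats.append(np.prod([v[j] for j in alpha]))
    return np.array(feats)

# e.g. nu([a, b], 2) = [1, a, b, a*a, a*b, b*b]
```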

slide-25
SLIDE 25
  • 2. Multi-view redundancy analysis
slide-26
SLIDE 26

The plan

  • a. Recap multi-view prediction setting
  • b. How contrastive learning fares in the multi-view setting

[Pipeline diagram: unlabeled data → feature map φ; labeled data + φ(x) → predictor]

slide-27
SLIDE 27
Setting for multi-view prediction

  • Assume (unlabeled) data provides two "views" x and x′, each equally good at predicting a target y
  • Example: topic identification
  • y = topic of article
  • x = text of abstract
  • x′ = text of article
  • Example: web page classification
  • y = web page type
  • x = text of web page
  • x′ = text of hyper-links pointing to the page

slide-28
SLIDE 28

Multi-view learning methods

  • Co-training [Blum & Mitchell, 1998]:
  • If x ⊥ x′ ∣ y, then bootstrapping methods "work"
  • Canonical Correlation Analysis [Kakade & Foster, 2007]:
  • Suppose there is redundancy of views via linear predictors: for each v ∈ {x, x′}, R²(v, y) ≥ R²((x, x′), y) − ε
  • Then CCA-based (linear) dimension reduction doesn't hurt much (a minimal sketch follows after this list)
  • (No assumption of conditional independence!)

Q: What if views are redundant only via non-linear predictors?
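As a point of reference for the CCA route mentioned above, here is a minimal scikit-learn sketch; the component count, the least-squares head, and the placeholder arrays X, X_prime, y are illustrative assumptions, not part of the talk.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_reduce_and_predict(X, X_prime, y, n_components=10):
    """Project view X onto canonical directions shared with X', then fit a
    linear predictor of y on the reduced features."""
    cca = CCA(n_components=n_components)
    Z, _ = cca.fit_transform(X, X_prime)        # canonical projections of view X
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)   # linear predictor on the reduced view
    return cca, w

# Prediction on new data from view X alone:
#   Z_new = cca.transform(X_new); y_hat = Z_new @ w
```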

slide-29
SLIDE 29

! " ≔ $ $ % ∣ '( ∣ ' = "

Surrogate predictor via multi-view redundancy

Best (possibly non-linear) prediction of % using '′

Our strategy: Learn a representa9on + " such that ! " ≈ linear func9on of + " Lemma: If $ $ %

  • − $ %

', '(

0 ≤ 2 for each - ∈ ', '′ ,

then $ ! ' − $ % ', '(

0 ≤ 42.

slide-30
SLIDE 30

! " = $ $ % &' & = " = $ $ % ∣ &' )∗ ", &' ≈ 1 . /

012 3

$ % &' = 40 )∗ ", 40 = 567∗(")

Linearization of the surrogate predictor

since )∗ ", "' :;< ="' = :;<|;1?(="') using 7∗ " ≔ )∗ ", 42 , … , )∗ ", 43 with 42, … , 43 ∼00C :;<

Theorem: Under D-multi-view redundancy assumption, min

H $ 567∗ & − $ %

&, &'

J ≤ 4D + N

1 .
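A minimal sketch of the regression step implied by this linearization: fit the weight vector w by least squares of the labels on the landmark embeddings. The ridge term and array names are assumptions for illustration.

```python
import numpy as np

def fit_linear_head(embeddings, labels, ridge=1e-3):
    """Fit w so that w^T phi_hat(x) approximates the target.

    embeddings: (n, m) array whose rows are phi_hat(x_i) on labeled data.
    labels:     (n,) array of targets y_i.
    """
    m = embeddings.shape[1]
    gram = embeddings.T @ embeddings + ridge * np.eye(m)   # small ridge for numerical stability
    w = np.linalg.solve(gram, embeddings.T @ labels)
    return w  # predict on new data with new_embeddings @ w
```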

slide-31
SLIDE 31

Odds and ends

  • Similar results for "bivariate" odds-ratio estimators of the form r*(x, x′) = η(x)^⊤ ψ(x′), where now we use φ*(x) = η(x), assuming a "low-dimensional" h such that x ⊥ x′ ∣ h
  • Error transformation: can analyze φ̂ based on an odds-ratio estimator r̂ that only approximately solves the contrastive learning problem:

min_w E[ ( φ̂(x)^⊤ w − E[y ∣ x, x′] )² ] = O(error(r̂)) + 4ε + O(1/m)

(left-hand side: error in the down-stream supervised task; error(r̂): contrastive learning error)

slide-32
SLIDE 32
  • 3. Experimental study
slide-33
SLIDE 33

Study dataset and comparisons

  • AG News [Del Corso, Gulli, Romani, 2005; Zhang, Zhao, LeCun, 2015]: four categories (world, sports, business, sci/tech) of news articles
  • 16,700 words in vocabulary after removing rare words; avg. ~45 words/document
  • Use 4 x 29,000 unlabeled examples for contrastive learning to get φ̂
  • Use (up to) 4 x 1,000 labeled examples to train linear classifier (multi-class logreg)
  • Use 4 x 1,900 labeled examples for test set
  • Our embedding φ̂ (called "NCE" for Noise Contrastive Embedding); an architecture sketch follows after this list:
  • Three-layer ReLU networks with ~300 nodes/layer
  • Dropout regularization, batch normalization, PyTorch initialization
  • Trained using RMSProp
  • Baseline embeddings φ̂: word2vec [Mikolov et al, 2013], Latent Dirichlet Allocation [Blei et al, 2003], BoW
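As referenced above, a rough PyTorch sketch of the NCE network: three ReLU layers of ~300 units with batch normalization and dropout, trained with RMSProp. The dropout rate, learning rate, and the concatenated bag-of-words pair input are assumptions, not the talk's exact settings.

```python
import torch
import torch.nn as nn

def make_nce_network(vocab_size, hidden=300, dropout=0.5):
    """Discriminator for the contrastive task on concatenated (x, x') features."""
    return nn.Sequential(
        nn.Linear(2 * vocab_size, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(hidden, 1),   # logit for positive vs. negative pair
    )

model = make_nce_network(vocab_size=16700)
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)
```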
slide-34
SLIDE 34

Accuracy on supervised task vs. labeled sample size

[Plot: test accuracy as a function of the number of labeled training examples, using φ̂(x) ∈ ℝ^m for m = 100]

slide-35
SLIDE 35

Varying the number of layers

slide-36
SLIDE 36

Performance on contrastive task vs. accuracy

slide-37
SLIDE 37

t-SNE embeddings of test documents

[Panels: NCE; word2vec]

[Van der Maaten & Hinton, 2008]

slide-38
SLIDE 38

Summary

Broader theme: Study self-supervised representation learning in the context of probabilistic models

  • Topic models (and other multi-view mixture models)
  • Multi-view redundancy (à la CCA)

slide-39
SLIDE 39

Acknowledgements

  • Thanks to Miro Dudík for initial discussions & suggestions
  • Support from NSF CCF-1740833 & JP Morgan Faculty Award

Thanks!

slide-40
SLIDE 40

Related / complementary work

  • Steinwart, Hush, Scovel (2005), Abe, Zadrozny, Langford (2006)
  • "Noise contrastive estimation" used to estimate density level sets / outlier detection
  • Gutmann & Hyvärinen (2010)
  • "Noise contrastive estimation" used to fit statistical models
  • Arora, Khandeparkar, Khodak, Plevrakis, Saunshi (2019)
  • Analysis of a similar contrastive learning method
  • If x, x′ are conditionally independent given the class label, then contrastive learning gives linearly useful representations
  • Lee, Lei, Saunshi, Zhuo (2020)
  • Different representation obtained from a predictor of x′ using x
  • Analysis under approximate conditional independence given a hidden state
slide-41
SLIDE 41

Synthetic topic model recovery experiments

slide-42
SLIDE 42

Domain adaptation in multi-view setting

  • Suppose at test time, P_X → Q_X and P_{X′} → Q_{X′}.
  • If P_{X′∣X} remains unchanged, then can adapt the representation by choosing landmarks from Q_{X′} instead of from P_{X′}.
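A small sketch of this adaptation step: keep the trained odds-ratio model but redraw the landmark halves from target-domain unlabeled data. The tensor name and landmark count are placeholders for illustration.

```python
import torch

def adapted_embedding(model, x, target_second_halves, m=100):
    """Re-embed x with landmarks drawn from the target distribution Q_{X'}.

    model: already-trained pair discriminator (exp(logit) estimates the odds ratio).
    target_second_halves: (N, vocab_size) tensor of second halves from the new domain.
    """
    idx = torch.randperm(target_second_halves.shape[0])[:m]
    landmarks = target_second_halves[idx]                  # m landmarks sampled from Q_{X'}
    x_rep = x.unsqueeze(0).expand(landmarks.shape[0], -1)
    with torch.no_grad():
        return torch.exp(model(x_rep, landmarks))          # adapted phi_hat(x)
```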