Contrastive learning, multi-view redundancy, and linear models - PowerPoint PPT Presentation



slide-1
SLIDE 1

Contrastive learning, multi-view redundancy, and linear models

Daniel Hsu Columbia University Joint work with: Akshay Krishnamurthy (Microsoft Research) Christopher Tosh (Columbia University)

Johns Hopkins MINDS & CIS Seminar - October 6th, 2020

slide-2
SLIDE 2

Learning representations of data

[Panels: probabilistic modeling; deep learning]

Image credit: stats.stackexchange.com; bdtechtalks.com

slide-3
SLIDE 3

Goal of representation learning

Image credit: towardsdatascience.com

Learned from data

slide-4
SLIDE 4

Deep neural networks: Already doing it?

[Figure: a multi-task network with a shared representation; tasks A, B, C produce outputs y1, y2, y3 from shared subsets of factors computed from a common input]

Image credit: [Bengio, Courville, Vincent, 2014]

slide-5
SLIDE 5

Unsupervised / semi-supervised learning

[Pipeline diagram: unlabeled data → self-supervised learning → feature map φ; labeled data + φ(x) → down-stream supervised task → predictor]

slide-6
SLIDE 6

"Self-supervised learning"

  • Idea: Learn to solve self-derived prediction problems, then introspect.
  • Example: Images
  • Predict color channel from grayscale channel

[Zhang, Isola, Efros, 2017]

  • Example: Text documents
  • Predict missing word in a sentence from context

[Mikolov, Sutskever, Chen, Corrado, Dean, 2013; Dhillon, Foster, Ungar, 2011]

  • Example: Dynamical systems
  • Predict future observations from past observations

[Yule, 1927; Langford, Salakhutdinov, Zhang, 2009]

slide-7
SLIDE 7

Self-supervised learning problem with text documents

[Figure: 2 positive examples and 2 negative examples of paired document halves]

Positive examples: documents from a natural corpus. Negative examples: first half of a document, randomly paired with the second half of another document. Can create training data from unlabeled documents!
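For concreteness, here is a minimal Python sketch (not from the talk) of this pair construction; the function name, the even half-split, and the resampling loop are assumptions for illustration.

```python
import random

def make_contrastive_pairs(documents, seed=0):
    """Build (x, x', label) triples from unlabeled tokenized documents.

    Positive pair: the two halves of the same document (label 1).
    Negative pair: a first half re-paired with the second half of a
    different, randomly chosen document (label 0).
    """
    rng = random.Random(seed)
    halves = [(doc[:len(doc) // 2], doc[len(doc) // 2:]) for doc in documents]
    pairs = [(x, x_prime, 1) for (x, x_prime) in halves]   # positives
    for i, (x, _) in enumerate(halves):
        j = rng.randrange(len(halves))
        while j == i:                                       # avoid re-pairing a doc with itself
            j = rng.randrange(len(halves))
        pairs.append((x, halves[j][1], 0))                  # negatives
    return pairs
```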

slide-8
SLIDE 8

Representations from self-supervised learning

Improves down-stream supervised learning performance in many cases

[Mikolov, Sutskever, Chen, Corrado, Dean, 2013; Logeswaran & Lee, 2018; Oord, Li, Vinyals, 2018; Arora, Khandeparkar, Khodak, Plevrakis, Saunshi, 2019]

!("The S&P 500 fell more than 3.3 percent") !("European markets recorded their worst session since 2016") !("The new mascot appears to have bushier eyebrows")

Q: For what problems can we prove these representations are useful?

slide-9
SLIDE 9

What's in the representation?

To understand the representations, we look to probabilistic modeling…

Our focus: representations φ derived from "contrastive learning"

slide-10
SLIDE 10

Our theoretical results (informally)

  • 1. Assume unlabeled data follow a topic model (e.g., LDA). Then:

representation φ(x) = linear transform of topic posterior moments (of orders up to the document length).

[Figure: a document's topic mixture over {sports, science, politics, business}, e.g. (1/5, 2/5, 2/5, 0), with words drawn i.i.d. from the induced mixture]

slide-11
SLIDE 11

Our theoretical results (informally)

  • 2. Assume unlabeled data has two views x and x′, each with near-optimal MSE for predicting a target variable y (possibly using non-linear functions). Then: a linear function of φ(x) can achieve near-optimal MSE.

slide-12
SLIDE 12

Our theoretical results (informally)

  • 3. Error transform theorem: the excess error in the down-stream supervised learning task (with linear functions of φ̂(x)) is bounded in terms of the excess error in the self-supervised learning problem;

i.e., better solutions to the self-supervised learning problem yield better representations for the down-stream supervised learning task.

slide-13
SLIDE 13

Rest of the talk

  • 1. Representation learning method & topic model analysis
  • 2. Multi-view redundancy analysis
  • 3. Experimental study
slide-14
SLIDE 14
  • 1. Representation learning method & topic model analysis

slide-15
SLIDE 15

The plan

  • a. Formalize the contrastive learning problem and representation
  • b. Interpret the representation in context of topic models

[Pipeline diagram: unlabeled data → feature map φ; labeled data + φ(x) → predictor]

slide-16
SLIDE 16

Self-supervised learning problem with text documents

[Figure: 2 positive examples and 2 negative examples of paired document halves]

Positive examples: documents from a natural corpus. Negative examples: first half of a document, randomly paired with the second half of another document. Can create training data from unlabeled documents!

slide-17
SLIDE 17

"Contrastive learning"

  • Learn a predictor to discriminate between (x, x′) ∼ P_{X,X′} [positive example] and (x, x′) ∼ P_X ⊗ P_{X′} [negative example]
  • Specifically, estimate the odds ratio

r*(x, x′) = Pr[positive ∣ (x, x′)] / Pr[negative ∣ (x, x′)]

by training a neural network (or whatever) using a loss function like logistic loss on random positive & negative examples (which are, WLOG, evenly balanced: 0.5 P_{X,X′} + 0.5 P_X ⊗ P_{X′}).

[ Steinwart, Hush, Scovel, 2005; Abe, Zadrozny, Langford, 2006; Gutmann & Hyvärinen, 2010; Oord, Li, Vinyals, 2018; Arora, Khandeparkar, Khodak, Plevrakis, Saunshi, 2019; … ]
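A minimal PyTorch sketch of this estimation step is below; the bag-of-words featurization, the two-layer scoring network, and all hyperparameters are illustrative assumptions, not the talk's architecture. With evenly balanced positive/negative examples and logistic loss, exp of the learned logit estimates the odds ratio r*(x, x′).

```python
import torch
import torch.nn as nn

class PairDiscriminator(nn.Module):
    """Scores a (x, x') pair with a single logit."""
    def __init__(self, vocab_size, hidden=300):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * vocab_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, x_prime):
        # x, x_prime: bag-of-words count vectors of shape (batch, vocab_size)
        return self.net(torch.cat([x, x_prime], dim=-1)).squeeze(-1)

def training_step(model, optimizer, x, x_prime, labels):
    # labels: 1.0 for positive pairs, 0.0 for negative pairs (evenly balanced)
    logits = model(x, x_prime)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def odds_ratio(model, x, x_prime):
    # With balanced classes, exp(logit) = Pr[positive | (x,x')] / Pr[negative | (x,x')]
    with torch.no_grad():
        return torch.exp(model(x, x_prime))
```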

slide-18
SLIDE 18

Constructing the representation

  • Given an estimate r̂ of r*, construct an embedding function for document halves:

φ̂(x) := ( r̂(x, l_i) : i = 1, …, m ) ∈ ℝ^m

where l_1, …, l_m are "landmark documents".

slide-19
SLIDE 19

Topic model

  • k topics, each specifying a distribution over the vocabulary
  • A document is associated with its own distribution θ over the k topics
  • Words in the document (bag of words): i.i.d. from the induced mixture distribution (a generative sketch follows below)
  • Assume they are arbitrarily partitioned into two halves, x and x′

[Figure: a document's topic mixture over {sports, science, politics, business}, e.g. (1/5, 2/5, 2/5, 0), with words drawn i.i.d. from the induced mixture] E.g., [Hofmann, 1999; Blei, Ng, Jordan, 2003; …]
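For concreteness, a small NumPy sketch of this generative process (an LDA-style model with a symmetric Dirichlet prior, which is an assumed choice) is given below.

```python
import numpy as np

def sample_document(topic_word, alpha, doc_len, rng):
    """Sample one document from an LDA-style topic model and split it in two.

    topic_word: (k, vocab_size) matrix; row j is topic j's word distribution.
    alpha: symmetric Dirichlet concentration over the k topics.
    Returns the document's topic mixture theta and the two halves x, x'.
    """
    k, vocab_size = topic_word.shape
    theta = rng.dirichlet(alpha * np.ones(k))                   # document's topic mixture
    word_dist = theta @ topic_word                              # induced mixture over words
    words = rng.choice(vocab_size, size=doc_len, p=word_dist)   # i.i.d. bag of words
    return theta, words[: doc_len // 2], words[doc_len // 2:]   # arbitrary split into halves

rng = np.random.default_rng(0)
topic_word = rng.dirichlet(0.1 * np.ones(5000), size=4)         # 4 toy topics over 5,000 words
theta, x, x_prime = sample_document(topic_word, alpha=0.5, doc_len=40, rng=rng)
```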

slide-20
SLIDE 20

Simple case: One topic per document

  • Suppose θ ∈ {e_1, …, e_k} (i.e., the document is about only one topic)
  • Fact: odds ratio = density ratio:

r*(x, x′) := Pr[positive ∣ (x, x′)] / Pr[negative ∣ (x, x′)] = p_{X,X′}(x, x′) / (p_X(x) p_{X′}(x′))

(left-hand side: estimated using contrastive learning; right-hand side: interpreted via the data-generating distribution)

slide-21
SLIDE 21

Interpreting the density ratio…

!","$(&, &') !

" & !"$(&') = * +,- . Pr 1+ Pr & ∣ 1+ Pr &′ ∣ 1+

!

" & !"$(&′)

= *

+,- . Pr 1+ ∣ & Pr &′ ∣ 1+

!"$(&′) = 4 & 56 &′ !"$ &′

Posterior over topics given & Likelihood of topics given &′ Density ra7o Using BoW assump7on

slide-22
SLIDE 22

Inside the embedding

  • Embedding: φ*(x) = ( r*(x, l_i) : i = 1, …, m ), where

r*(x, x′) = π(x)^⊤ λ(x′) / p_{X′}(x′)

  • Therefore

φ*(x) = A π(x), where the i-th row of A is λ(l_i)^⊤ / p_{X′}(l_i)

(π(x): posterior over topics given x; rows of A: scaled likelihoods of the topics given the landmarks l_i)

slide-23
SLIDE 23

Upshot in the simple case

  • In the "one topic per document" case, document embedding is a

linear transforma7on of the posterior over topics !∗ # = % & #

  • Theorem: If % is full-rank, every linear func7on of topic posterior can

be expressed as a linear func7on of !∗(⋅)

slide-24
SLIDE 24

General case: Exploit bag-of-word structure

  • In general, the posterior distribution over θ (the topic distribution) given x is not summarized by just a k-dimensional vector.
  • If x and x′ each have n words:
  • Let ν_n(v) := ( v^α : |α| ≤ n ), where v^α = ∏_{j∈[k]} v_j^{α_j} for v ∈ ℝ^k (a small enumeration sketch follows after this list)
  • Let π(x) := E[ ν_n(θ) ∣ x ] (the order-n multivariate conditional moments of θ)
  • There is a corresponding λ(⋅) (depending on the topic model parameters) such that r*(x, x′) = π(x)^⊤ λ(x′) / p_{X′}(x′)
  • Theorem: There is a choice of landmark documents such that φ*(x) yields a (linear transform of) the conditional moments of θ of orders ≤ n.
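As referenced above, here is a small sketch of ν_n, enumerating the monomials v^α with total degree at most n; the helper name is an assumption for illustration.

```python
import itertools
import numpy as np

def nu(v, n):
    """All monomials v^alpha with |alpha| <= n, for v in R^k.

    Each multiset of coordinate indices of size d corresponds to one
    monomial of total degree d; degree 0 contributes the constant 1.
    """
    feats = []
    for degree in range(n + 1):
        for alpha in itertools.combinations_with_replacement(range(len(v)), degree):
            feats.append(np.prod([v[j] for j in alpha]))
    return np.array(feats)

# e.g. nu([a, b], 2) = [1, a, b, a*a, a*b, b*b]
```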

slide-25
SLIDE 25
  • 2. Multi-view redundancy analysis
slide-26
SLIDE 26

The plan

  • a. Recap multi-view prediction setting
  • b. How contrastive learning fares in the multi-view setting

[Pipeline diagram: unlabeled data → feature map φ; labeled data + φ(x) → predictor]

slide-27
SLIDE 27
Setting for multi-view prediction

  • Assume (unlabeled) data provides two "views" x and x′, each equally good at predicting a target y
  • Example: topic identification
  • y = topic of article
  • x = text of abstract
  • x′ = text of article
  • Example: web page classification
  • y = web page type
  • x = text of web page
  • x′ = text of hyper-links pointing to the page

slide-28
SLIDE 28

Multi-view learning methods

  • Co-training [Blum & Mitchell, 1998]:
  • If x ⊥ x′ ∣ y, then bootstrapping methods "work"
  • Canonical Correlation Analysis [Kakade & Foster, 2007]:
  • Suppose there is redundancy of views via linear predictors: for each v ∈ {x, x′}, R²(v, y) ≥ R²((x, x′), y) − ε
  • Then CCA-based (linear) dimension reduction doesn't hurt much (a minimal sketch follows after this list)
  • (No assumption of conditional independence!)

Q: What if views are redundant only via non-linear predictors?
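As a point of reference for the CCA route mentioned above, here is a minimal scikit-learn sketch; the component count, the least-squares head, and the placeholder arrays X, X_prime, y are illustrative assumptions, not part of the talk.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_reduce_and_predict(X, X_prime, y, n_components=10):
    """Project view X onto canonical directions shared with X', then fit a
    linear predictor of y on the reduced features."""
    cca = CCA(n_components=n_components)
    Z, _ = cca.fit_transform(X, X_prime)        # canonical projections of view X
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)   # linear predictor on the reduced view
    return cca, w

# Prediction on new data from view X alone:
#   Z_new = cca.transform(X_new); y_hat = Z_new @ w
```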

slide-29
SLIDE 29

! " ≔ $ $ % ∣ '( ∣ ' = "

Surrogate predictor via multi-view redundancy

Best (possibly non-linear) prediction of % using '′

Our strategy: Learn a representa9on + " such that ! " ≈ linear func9on of + " Lemma: If $ $ %

  • − $ %

', '(

0 ≤ 2 for each - ∈ ', '′ ,

then $ ! ' − $ % ', '(

0 ≤ 42.

slide-30
SLIDE 30

! " = $ $ % &' & = " = $ $ % ∣ &' )∗ ", &' ≈ 1 . /

012 3

$ % &' = 40 )∗ ", 40 = 567∗(")

Linearization of the surrogate predictor

since )∗ ", "' :;< ="' = :;<|;1?(="') using 7∗ " ≔ )∗ ", 42 , … , )∗ ", 43 with 42, … , 43 ∼00C :;<

Theorem: Under D-multi-view redundancy assumption, min

H $ 567∗ & − $ %

&, &'

J ≤ 4D + N

1 .
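A minimal sketch of the regression step implied by this linearization: fit the weight vector w by least squares of the labels on the landmark embeddings. The ridge term and array names are assumptions for illustration.

```python
import numpy as np

def fit_linear_head(embeddings, labels, ridge=1e-3):
    """Fit w so that w^T phi_hat(x) approximates the target.

    embeddings: (n, m) array whose rows are phi_hat(x_i) on labeled data.
    labels:     (n,) array of targets y_i.
    """
    m = embeddings.shape[1]
    gram = embeddings.T @ embeddings + ridge * np.eye(m)   # small ridge for numerical stability
    w = np.linalg.solve(gram, embeddings.T @ labels)
    return w  # predict on new data with new_embeddings @ w
```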

slide-31
SLIDE 31

Odds and ends

  • Similar results for "bivariate" odds-ratio estimators of the form r*(x, x′) = η(x)^⊤ ψ(x′), where now we use φ*(x) = η(x), assuming a "low-dimensional" h such that x ⊥ x′ ∣ h
  • Error transformation: can analyze φ̂ based on an odds-ratio estimator r̂ that only approximately solves the contrastive learning problem:

min_w E[ ( φ̂(x)^⊤ w − E[y ∣ x, x′] )² ] = O(error(r̂)) + 4ε + O(1/m)

(left-hand side: error in the down-stream supervised task; error(r̂): contrastive learning error)

slide-32
SLIDE 32
  • 3. Experimental study
slide-33
SLIDE 33

Study dataset and comparisons

  • AG News [Del Corso, Gulli, Romani, 2005; Zhang, Zhao, LeCun, 2015]: four categories (world, sports, business, sci/tech) of news articles
  • 16,700 words in vocabulary after removing rare words; avg. ~45 words/document
  • Use 4 x 29,000 unlabeled examples for contrastive learning to get φ̂
  • Use (up to) 4 x 1,000 labeled examples to train linear classifier (multi-class logreg)
  • Use 4 x 1,900 labeled examples for test set
  • Our embedding φ̂ (called "NCE" for Noise Contrastive Embedding); an architecture sketch follows after this list:
  • Three-layer ReLU networks with ~300 nodes/layer
  • Dropout regularization, batch normalization, PyTorch initialization
  • Trained using RMSProp
  • Baseline embeddings φ̂: word2vec [Mikolov et al, 2013], Latent Dirichlet Allocation [Blei et al, 2003], BoW
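As referenced above, a rough PyTorch sketch of the NCE network: three ReLU layers of ~300 units with batch normalization and dropout, trained with RMSProp. The dropout rate, learning rate, and the concatenated bag-of-words pair input are assumptions, not the talk's exact settings.

```python
import torch
import torch.nn as nn

def make_nce_network(vocab_size, hidden=300, dropout=0.5):
    """Discriminator for the contrastive task on concatenated (x, x') features."""
    return nn.Sequential(
        nn.Linear(2 * vocab_size, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(dropout),
        nn.Linear(hidden, 1),   # logit for positive vs. negative pair
    )

model = make_nce_network(vocab_size=16700)
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)
```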
slide-34
SLIDE 34

Accuracy on supervised task vs. labeled sample size

[Plot: test accuracy as a function of the number of labeled training examples, using φ̂(x) ∈ ℝ^m for m = 100]

slide-35
SLIDE 35

Varying the number of layers

slide-36
SLIDE 36

Performance on contrastive task vs. accuracy

slide-37
SLIDE 37

t-SNE embeddings of test documents

[Panels: NCE; word2vec]

[Van der Maaten & Hinton, 2008]

slide-38
SLIDE 38

Summary

Broader theme: Study self-supervised representation learning in the context of probabilistic models

  • Topic models (and other multi-view mixture models)
  • Multi-view redundancy (à la CCA)

slide-39
SLIDE 39

Acknowledgements

  • Thanks to Miro Dudík for initial discussions & suggestions
  • Support from NSF CCF-1740833 & JP Morgan Faculty Award

Thanks!

slide-40
SLIDE 40

Related / complementary work

  • Steinwart, Hush, Scovel (2005), Abe, Zadrozny, Langford (2006)
  • "Noise contrastive estimation" used to estimate density level sets / outlier detection
  • Gutmann & Hyvärinen (2010)
  • "Noise contrastive estimation" used to fit statistical models
  • Arora, Khandeparkar, Khodak, Plevrakis, Saunshi (2019)
  • Analysis of a similar contrastive learning method
  • If x, x′ are conditionally independent given the class label, then contrastive learning gives linearly useful representations
  • Lee, Lei, Saunshi, Zhuo (2020)
  • Different representation obtained from a predictor of x′ using x
  • Analysis under approximate conditional independence given a hidden state
slide-41
SLIDE 41

Synthetic topic model recovery experiments

slide-42
SLIDE 42

Domain adaptation in multi-view setting

  • Suppose at test time, P_X → Q_X and P_{X′} → Q_{X′}.
  • If P_{X′∣X} remains unchanged, then can adapt the representation by choosing landmarks from Q_{X′} instead of from P_{X′}.
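A small sketch of this adaptation step: keep the trained odds-ratio model but redraw the landmark halves from target-domain unlabeled data. The tensor name and landmark count are placeholders for illustration.

```python
import torch

def adapted_embedding(model, x, target_second_halves, m=100):
    """Re-embed x with landmarks drawn from the target distribution Q_{X'}.

    model: already-trained pair discriminator (exp(logit) estimates the odds ratio).
    target_second_halves: (N, vocab_size) tensor of second halves from the new domain.
    """
    idx = torch.randperm(target_second_halves.shape[0])[:m]
    landmarks = target_second_halves[idx]                  # m landmarks sampled from Q_{X'}
    x_rep = x.unsqueeze(0).expand(landmarks.shape[0], -1)
    with torch.no_grad():
        return torch.exp(model(x_rep, landmarks))          # adapted phi_hat(x)
```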