Contrastive learning, multi-view redundancy, and linear models
Daniel Hsu Columbia University Joint work with: Akshay Krishnamurthy (Microsoft Research) Christopher Tosh (Columbia University)
Johns Hopkins MINDS & CIS Seminar - October 6th, 2020
Probabilistic modeling / Deep learning
Image credit: stats.stackexchange.com; bdtechtalks.com
Image credit: towardsdatascience.com
Learned from data
[Figure: multi-task learning — tasks A, B, and C (task 1, task 2, task 3) share subsets of factors between a common input and their outputs]
Image credit: [Bengio, Courville, Vincent, 2014]
Unlabeled data
Feature map
Labeled data
Predictor
Self-supervised learning
Down-stream supervised task
[Zhang, Isola, Efros, 2017]
[Mikolov, Sutskever, Chen, Corrado, Dean, 2013; Dhillon, Foster, Ungar, 2011]
[Yule, 1927; Langford, Salakhutdinov, Zhang, 2009]
2 positive examples 2 negative examples
Self-supervised learning problem with text documents
Positive examples: documents from a natural corpus
Negative examples: first half of a document, randomly paired with the second half of another document
Can create training data from unlabeled documents!
Improves down-stream supervised learning performance in many cases
[Mikolov, Sutskever, Chen, Corrado, Dean, 2013; Logeswaran & Lee, 2018; Oord, Li, Vinyals, 2018; Arora, Khandeparkar, Khodak, Plevrakis, Saunshi, 2019]
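As a concrete sketch of this data-construction step (Python, with hypothetical names like `make_contrastive_pairs`; not the speakers' code), one might write:

```python
import random

def make_contrastive_pairs(documents, num_pairs, seed=0):
    """Build (x, x_prime, label) triples from an unlabeled corpus.

    Positive (label 1): the two halves of the same document.
    Negative (label 0): first half of one document paired with the
    second half of a different, randomly chosen document."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        doc = rng.choice(documents)
        x = doc[: len(doc) // 2]
        if rng.random() < 0.5:                      # evenly balanced mixture
            pairs.append((x, doc[len(doc) // 2 :], 1))
        else:
            other = rng.choice(documents)           # independent draw
            pairs.append((x, other[len(other) // 2 :], 0))
    return pairs
```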
φ("The S&P 500 fell more than 3.3 percent")
φ("European markets recorded their worst session since 2016")
φ("The new mascot appears to have bushier eyebrows")
Q: For what problems can we prove these representations are useful?
To understand the representations, we look to probabilistic modeling…
Our focus: representations φ derived from "contrastive learning"
representation φ(x) = linear transform of topic posterior moments (of order up to document length)
Topic model: words in a document drawn ∼iid from a mixture of topic word-distributions, e.g. (1/5)·sports + (2/5)·science + (2/5)·politics + 0·business
… linear functions). Then: a linear function of φ(x) can achieve near-optimal MSE
i.e., better solutions to the self-supervised learning problem yield better representations for the down-stream supervised learning task
Excess error in the down-stream supervised learning task with linear functions of φ̂(x)
  ≤  excess error in the self-supervised learning problem
Unlabeled data
Feature map
Labeled data
Predictor
2 positive examples 2 negative examples
Self-supervised learning problem with text documents
Positive examples: documents from a natural corpus
Negative examples: first half of a document, randomly paired with the second half of another document
Can create training data from unlabeled documents!
Positive example: (x, x′) ∼ P_{X,X′}.  Negative example: (x, x′) ∼ P_X ⊗ P_{X′}.
Estimate f*(x, x′) = Pr[positive ∣ (x, x′)] / Pr[negative ∣ (x, x′)] by training a neural network (or whatever) using a loss function like logistic loss on random positive & negative examples (which are, WLOG, evenly balanced: 0.5·P_{X,X′} + 0.5·P_X ⊗ P_{X′}).
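One way to carry out this estimation step is sketched below, using scikit-learn's logistic regression as a stand-in for the "neural network (or whatever)" and a hypothetical `featurize` function; the trained classifier's odds give the estimate f̂:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_odds_ratio(pairs, featurize):
    """Estimate f*(x, x') = Pr[pos | (x, x')] / Pr[neg | (x, x')].

    `pairs` are (x, x', label) triples with label 1 = positive, 0 = negative;
    `featurize(x, x_prime)` is a stand-in for whatever model class is used
    (a neural network in the talk)."""
    X = np.array([featurize(x, xp) for x, xp, _ in pairs])
    y = np.array([label for _, _, label in pairs])
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    def f_hat(x, x_prime):
        p = clf.predict_proba([featurize(x, x_prime)])[0, 1]
        return p / (1.0 - p)   # odds = density ratio under the balanced mixture
    return f_hat
```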
[Steinwart, Hush, Scovel, 2005; Abe, Zadrozny, Langford, 2006; Gutmann & Hyvärinen, 2010; Oord, Li, Vinyals, 2018; Arora, Khandeparkar, Khodak, Plevrakis, Saunshi, 2019; …]
Given an estimate f̂ of f*, construct an embedding function for document halves: φ(x) := (f̂(x, z_i) : i = 1, …, m) ∈ ℝ^m, where z_1, …, z_m are "landmark documents".
Topic model: words in a document drawn ∼iid from a mixture of topic word-distributions, e.g. (1/5)·sports + (2/5)·science + (2/5)·politics + 0·business
E.g., [Hofmann, 1999; Blei, Ng, Jordan, 2003; …]
(i.e., document is about only one topic)
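For concreteness, here is a small sketch of the generative process being assumed, with made-up topic word-distributions; the "pure" case restricts the mixture weights to a single topic:

```python
import numpy as np

rng = np.random.default_rng(0)
topics = ["sports", "science", "politics", "business"]
vocab_size = 1000
# each topic has its own word distribution over the vocabulary (made up here)
beta = rng.dirichlet(np.ones(vocab_size), size=len(topics))

def sample_document(w, length=50):
    """Draw `length` words iid from the mixture sum_t w[t] * beta[t]."""
    word_dist = w @ beta                  # convex combination of topic word dists
    return rng.choice(vocab_size, size=length, p=word_dist)

mixed_doc = sample_document(np.array([1/5, 2/5, 2/5, 0.0]))  # slide's example mixture
pure_doc = sample_document(np.array([1.0, 0.0, 0.0, 0.0]))   # "pure" document: one topic only
```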
f*(x, x′) := Pr[positive ∣ (x, x′)] / Pr[negative ∣ (x, x′)] = p_{X,X′}(x, x′) / (p_X(x) · p_{X′}(x′))
(left side: estimated using contrastive learning; right side: interpreted via the data-generating distribution)
Under the topic model (using the bag-of-words assumption):
p_{X,X′}(x, x′) / (p_X(x) · p_{X′}(x′))
  = Σ_{t=1}^{k} Pr[t] · Pr[x ∣ t] · Pr[x′ ∣ t] / (p_X(x) · p_{X′}(x′))
  = Σ_{t=1}^{k} Pr[t ∣ x] · Pr[x′ ∣ t] / p_{X′}(x′)
  = θ(x)ᵀ η(x′) / p_{X′}(x′)
where θ(x) is the posterior over topics given x and η(x′) collects the likelihoods of x′ given each topic, and the left side is the density ratio.
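A quick numerical sanity check of this identity on a toy pure-topic model (each "document half" is a single word here, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
k, V = 3, 5                                   # topics, vocabulary size
prior = rng.dirichlet(np.ones(k))             # Pr[t]
beta = rng.dirichlet(np.ones(V), size=k)      # Pr[word | t]

x, xp = 2, 4                                  # the two "document halves" (single words)

# joint and marginals under the pure-topic model
p_joint = sum(prior[t] * beta[t, x] * beta[t, xp] for t in range(k))
p_x  = sum(prior[t] * beta[t, x]  for t in range(k))
p_xp = sum(prior[t] * beta[t, xp] for t in range(k))

f_star = p_joint / (p_x * p_xp)               # density ratio

theta = prior * beta[:, x] / p_x              # posterior over topics given x
eta = beta[:, xp]                             # likelihood of x' given each topic
assert np.isclose(f_star, theta @ eta / p_xp) # matches theta(x)^T eta(x') / p(x')
```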
In particular, f*(x, z_i) = θ(x)ᵀ η(z_i) / p_{X′}(z_i), so stacking the landmarks gives
φ*(x) = [ η(z_1)ᵀ / p_{X′}(z_1) ; ⋯ ; η(z_m)ᵀ / p_{X′}(z_m) ] · θ(x)
i.e., the posterior over topics given x, multiplied by a matrix whose rows are the (scaled) likelihoods of the topics given the z_i's.
So φ*(x) is a linear transformation of the posterior over topics: φ*(x) = M·θ(x)
⟹ (with suitably diverse landmarks) any linear function of the topic posterior θ(⋅) can be expressed as a linear function of φ*(⋅)
For general documents that mix topics, the topic posterior is a distribution over topic proportions, not summarized by just a k-dimensional vector.
In that case f*(x, x′) still factors as θ(x)ᵀ η(x′) / p_{X′}(x′) for suitable θ and η, and the landmark embedding
yields (a linear transform of) the conditional moments of the topic proportions of orders ≤ L, the document length.
Unlabeled data
Feature map
Labeled data
Predictor
Multi-view redundancy: each of the two views X, X′ is individually good at predicting a target Y (for each Z ∈ {X, X′})
Q: What if views are redundant only via non-linear predictors?
μ(x) := E[ E[Y ∣ X′] ∣ X = x ]
(the inner expectation E[Y ∣ X′] is the best, possibly non-linear, prediction of Y using X′)
Our strategy: learn a representation φ(x) such that μ(x) ≈ a linear function of φ(x).
Lemma: If E[ (E[Y ∣ Z] − E[Y ∣ X, X′])² ] ≤ ε for each Z ∈ {X, X′}, then E[ (μ(X) − E[Y ∣ X, X′])² ] ≤ 4ε.
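A sketch of the standard argument behind this lemma (my reconstruction, not taken from the talk):

```latex
% Split mu(X) - E[Y | X, X'] into two terms, each with mean square <= eps,
% then use (a + b)^2 <= 2a^2 + 2b^2.
\begin{align*}
\mu(X) - \mathbb{E}[Y \mid X, X']
  &= \underbrace{\mathbb{E}\bigl[\,\mathbb{E}[Y \mid X'] - \mathbb{E}[Y \mid X, X'] \,\bigm|\, X \bigr]}_{
      \text{mean square} \le \varepsilon \text{ (redundancy for } Z = X' \text{ + conditional Jensen)}}
   + \underbrace{\mathbb{E}[Y \mid X] - \mathbb{E}[Y \mid X, X']}_{
      \text{mean square} \le \varepsilon \text{ (redundancy for } Z = X)} \\
\mathbb{E}\bigl[(\mu(X) - \mathbb{E}[Y \mid X, X'])^2\bigr]
  &\le 2\varepsilon + 2\varepsilon = 4\varepsilon .
\end{align*}
```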
μ(x) = E[ E[Y ∣ X′] ∣ X = x ] = E[ E[Y ∣ X′] · f*(x, X′) ]     (since f*(x, x′) · P_{X′}(dx′) = P_{X′ ∣ X = x}(dx′))
     ≈ (1/m) Σ_{i=1}^{m} E[Y ∣ X′ = z_i] · f*(x, z_i) = wᵀ φ*(x)
using φ*(x) := ( f*(x, z_1), …, f*(x, z_m) ) with z_1, …, z_m ∼iid P_{X′}.
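Read as code, the approximation above is just an average over landmarks; a sketch (with `f_hat` the learned odds-ratio estimator, `landmarks` drawn from P_{X′}, and `y_given_landmark` standing in for the values E[Y ∣ X′ = z_i], which in practice get absorbed into the learned linear weights):

```python
import numpy as np

def landmark_embedding(x, f_hat, landmarks):
    """phi(x) = (f_hat(x, z_1), ..., f_hat(x, z_m)) in R^m."""
    return np.array([f_hat(x, z) for z in landmarks])

def mu_estimate(x, f_hat, landmarks, y_given_landmark):
    """Monte Carlo version of the identity above:
    mu(x) ~ (1/m) * sum_i E[Y | X' = z_i] * f_hat(x, z_i) = w^T phi(x),
    with weights w_i = E[Y | X' = z_i] / m (passed in as `y_given_landmark`)."""
    m = len(landmarks)
    w = np.asarray(y_given_landmark) / m
    return float(w @ landmark_embedding(x, f_hat, landmarks))
```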
Theorem: Under the ε-multi-view redundancy assumption,
min_w E[ (wᵀ φ*(X) − E[Y ∣ X, X′])² ] ≤ 4ε + O(1/m).
Alternatively: f*(x, x′) = g(x)ᵀ h(x′), where now we use φ*(x) = g(x), assuming a "low-dimensional" hidden variable H such that X ⊥ X′ ∣ H.
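A minimal sketch of this factorized ("two-tower") parametrization, with placeholder encoders g and h standing in for whatever networks are used (not the speakers' architecture):

```python
import numpy as np

def pair_probability(x, x_prime, g, h):
    """Model the odds ratio in the factorized form f(x, x') = <g(x), h(x')>,
    so under the balanced mixture Pr[positive | (x, x')] = f / (1 + f).
    Assumes g and h output nonnegative vectors (as in the topic-model case)."""
    f = float(np.dot(g(x), h(x_prime)))
    return f / (1.0 + f)

def contrastive_nll(pairs, g, h, eps=1e-12):
    """Average negative log-likelihood over (x, x', label) triples.
    Minimizing this over g and h drives <g(x), h(x')> toward f*(x, x');
    the learned representation is then phi(x) = g(x)."""
    total = 0.0
    for x, xp, label in pairs:
        p = pair_probability(x, xp, g, h)
        total -= np.log(p + eps) if label == 1 else np.log(1.0 - p + eps)
    return total / len(pairs)
```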
Now use a representation φ̂ based on an odds-ratio estimator f̂ that only approximately solves the contrastive learning problem. Then:
min_w E[ (wᵀ φ̂(X) − E[Y ∣ X, X′])² ] = O(error(f̂)) + 4ε + O(1/m)
(error in the down-stream supervised task on the left; contrastive learning error on the right)
Experiments: four categories (world, sports, business, sci/tech) of news articles
Embedding φ̂ (called "NCE" for Noise Contrastive Embedding): φ̂(x) ∈ ℝ^d with d = 100
[Van der Maaten & Hinton, 2008]
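To illustrate the down-stream step on such a task, a sketch with hypothetical data loaders, where `embed` stands for the learned 100-dimensional NCE embedding φ̂:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def downstream_accuracy(embed, train_docs, train_labels, test_docs, test_labels):
    """Fit a linear model on top of the frozen embedding phi-hat and
    report accuracy on the 4-way news-category task."""
    X_train = np.array([embed(d) for d in train_docs])
    X_test = np.array([embed(d) for d in test_docs])
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    return clf.score(X_test, test_labels)
```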
Broader theme: study self-supervised representation learning in the context of probabilistic models
φ̂ gives linearly useful representations
choosing landmarks from P_X instead of from P_{X′}
Acknowledgements