[PPT] - An old Artificial Intelligence dream that comes true: PowerPoint Presentation

SLIDE 1

An old Artificial Intelligence dream that comes true: Merging language and vision modalities

Raffaella Bernardi, University of Trento

SLIDE 2

An old AI dream

A. Turing, Computing machinery and intelligence, Mind 59, pp. 433-460, 1950

Need of:

Natural Language Processing (NLP)
Knowledge Representation
Reasoning
…

SLIDE 3

AI

Knowledge Representation, Planning, Machine Learning, Natural Language Processing, Computer Vision, Reasoning, Robotics, Social Intelligence, Creativity

SLIDE 4

Natural Language Processing (NLP):

Part of Speech Tagging (PoS)
Syntax
Semantics
Discourse
Dialogue

SLIDE 5

Distributional Semantics

The meaning of a word is given by its context

SLIDE 6

Distributional Semantics: counting word distributions

Words are represented by vectors harvested from a corpus of texts by counting word co-occurrences.
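The counting idea can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the cited work: the toy corpus, the window size of 2 and the helper names `cooccurrence_vectors` and `cosine` are all assumptions made for the example.

```python
from collections import Counter, defaultdict
import math

def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[target][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus, invented for illustration only.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
    "stocks fell on the market",
]
vectors = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_market = cosine(vectors["cat"], vectors["market"])
```

On this toy corpus "cat" ends up closer to "dog" than to "market", because the first two share more contexts. Real systems typically reweight the raw counts (for example with PMI) and reduce dimensionality, which this sketch leaves out.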

SLIDE 7

Distributional Semantics: predict the context

The vector representing a word is obtained by learning to predict its nearby words (Mikolov et al., 2013).
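A rough sketch of the prediction setup, under simplifying assumptions: training examples are (target, context) pairs taken from a window, and the model scores each candidate context word with a dot product followed by a softmax. The full softmax, the hand-set 2-dimensional vectors and the function names are illustrative assumptions; the original model relies on cheaper approximations such as negative sampling.

```python
import math

def skipgram_pairs(tokens, window=2):
    """List the (target, context) pairs a skip-gram model is trained on."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def context_probabilities(target_vec, context_vecs):
    """Softmax over dot products: p(context word | target word)."""
    scores = {w: sum(a * b for a, b in zip(target_vec, v))
              for w, v in context_vecs.items()}
    top = max(scores.values())
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}

pairs = skipgram_pairs("the cat sat on the mat".split(), window=1)

# Hypothetical vectors, set by hand; in training they would be learned.
target_cat = [1.0, 0.5]
contexts = {"sat": [0.9, 0.4], "mat": [0.1, 0.2], "stocks": [-1.0, -0.5]}
probs = context_probabilities(target_cat, contexts)
```

Training would adjust the vectors so that the observed pairs receive high probability; the word vector is what remains after training.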

SLIDE 8

Semantic Relationship (Mikolov et al., NIPS 2013)

SLIDE 9

Pause: Neural Network

A neural network is a composition of functions (neurons) that maps an n-dimensional input vector to class scores. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The last (fully-connected) layer is followed by a loss function (e.g., Softmax).
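The description above can be sketched as a forward pass through one hidden layer and one output layer. The weights, the layer sizes and the choice of ReLU as non-linearity are assumptions made for the example; the loss shown is the cross-entropy on the softmax output.

```python
import math

def dense(inputs, weights, biases):
    """Fully-connected layer: each neuron is a dot product plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Non-linearity applied after the dot product."""
    return [max(0.0, v) for v in values]

def softmax(scores):
    """Turn class scores into probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Loss: negative log-probability of the correct class."""
    return -math.log(probs[true_class])

# Hypothetical weights, chosen by hand for illustration.
W1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]    # 2 hidden neurons, 3 inputs
b1 = [0.0, 0.1]
W2 = [[1.0, -1.0], [-0.5, 0.7], [0.2, 0.2]]  # 3 classes, 2 hidden units
b2 = [0.0, 0.0, 0.0]

x = [1.0, 2.0, 3.0]
hidden = relu(dense(x, W1, b1))
scores = dense(hidden, W2, b2)
probs = softmax(scores)
loss = cross_entropy(probs, true_class=0)
```

Training consists of adjusting `W1`, `b1`, `W2` and `b2` to lower this loss over many examples, which is not shown here.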

SLIDE 10

Pause: Recurrent NN

Traditional neural networks cannot use information about previous inputs to inform later ones.

Recurrent neural networks (RNNs) address this issue: they are networks with loops in them, allowing information to persist. They work well with short dependencies.

Long Short-Term Memory networks (LSTMs) are a special kind of RNN, capable of learning long-term dependencies.
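The "loop" can be made concrete with a plain (Elman-style) recurrent step, where the hidden state computed for one input is fed back in with the next input. The weights and the two toy sequences are assumptions for illustration; the gates that let an LSTM keep long-term information are deliberately not implemented here.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One recurrent step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    h = []
    for i in range(len(b)):
        total = b[i]
        total += sum(w * xi for w, xi in zip(W_xh[i], x))
        total += sum(w * hi for w, hi in zip(W_hh[i], h_prev))
        h.append(math.tanh(total))
    return h

def encode(sequence, W_xh, W_hh, b):
    """Run the loop over a sequence and return the final hidden state."""
    h = [0.0] * len(b)
    for x in sequence:
        h = rnn_step(x, h, W_xh, W_hh, b)
    return h

# Hypothetical weights, chosen by hand.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.9, 0.0], [0.0, 0.9]]
b = [0.0, 0.0]

seq_a = [[1.0, 0.0], [0.0, 1.0]]
seq_b = [[0.0, 1.0], [1.0, 0.0]]  # same inputs, reversed order
h_a = encode(seq_a, W_xh, W_hh, b)
h_b = encode(seq_b, W_xh, W_hh, b)
```

The two sequences contain the same inputs in a different order and end in different hidden states, which is exactly the information a feed-forward network applied to each input separately would lose.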

SLIDE 11

LSTM: Sentence representation

Starting from word2vec word representations or from the plain words, obtain the sentence representation via an LSTM.

SLIDE 12

Distributional Semantics: A success story...

Lexical meaning

Synonyms
Concept categorization (e.g. car ISA vehicle)
Selectional preferences (e.g. eat chocolate vs. *eat sympathy)
Relation classification (exam-anxiety: CAUSE-EFFECT relation)
Salient properties (car-wheels)

Compositionality: Phrase and Sentence

Similarity
Entailment

SLIDE 13

Distributional Semantics: ...but the Grounding Problem

Grounding language representation into the world: point to the reference of our mental representation.

SLIDE 14

Computer Vision: From pixels to Meaning

SLIDE 15

Computer Vision: Abstract Features

SLIDE 16

CV traditional tasks: Objects

Image classification

Object localization

From objects to scene classification

SLIDE 17

CV first important revolution: ImageNet

ImageNet:

Stanford Vision Lab, Stanford University & Princeton University.
Image database organized according to the WordNet hierarchy.
Challenges: 2007-present
AMT: 48,940 annotators from 167 countries
15M images
22K categories of objects

SLIDE 18

CV second important revolution: Convolutional Neural Networks

ImageNet Classification with Deep Convolutional Neural Networks, Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton, 2012

2012: Krizhevsky outperformed the other systems using a CNN
2013: half of the systems used CNNs
2014: all of the systems used CNNs

SLIDE 19

CNN: Hierarchy of features

SLIDE 20

CNN: Off-the-shelf vector representation

Train a CNN on a vision task (e.g. AlexNet on ImageNet)
Do a forward pass given an image input
Transfer one or more layers (e.g. FC7 or C5)

SLIDE 21

Language and Vision

Language and Visual Spaces can be combined!
Cognitive Angle: Language and Vision Representations must be combined!
Applied Angle: Combining Language and Vision Representations gives very useful …

SLIDE 22

Language and Vision

Multimodal Tasks:

- Exploit language to improve on traditional CV tasks
- Exploit vision to improve on traditional NLP tasks
- New Multimodal Tasks

Multimodal Representations:

- learned separately and translated one into the other
- learned separately and concatenated
- learned jointly
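As an illustration of the "learned separately and concatenated" option, the sketch below normalizes a text vector and an image vector and joins them into one multimodal vector. The weighting parameter `alpha`, the function names and the toy vectors are assumptions for the example and are not taken from a specific model in these slides.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so neither modality dominates by magnitude."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)

def concatenate(text_vec, image_vec, alpha=0.5):
    """Weighted concatenation: alpha for the text part, (1 - alpha) for the image part."""
    text_part = [alpha * v for v in l2_normalize(text_vec)]
    image_part = [(1 - alpha) * v for v in l2_normalize(image_vec)]
    return text_part + image_part

# Toy vectors; in practice these would come from a text model and a CNN.
text_vec = [3.0, 4.0]
image_vec = [0.0, 2.0, 0.0]
fused = concatenate(text_vec, image_vec)
```

The two spaces keep their own dimensions, so the fused vector has length 2 + 3 = 5. The other two options in the list need a learned mapping or joint training and are not covered by this sketch.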


SLIDE 23

Multimodal Tasks: Improve traditional CV tasks

Not a lemon: it is more probably a tennis ball. The information comes from a KB (word similarity list, extracted from the internet via Google Sets).

A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, S. Belongie (ICCV 2007) Objects in Context.

Use of Corpora for Action Recognition.

Thu Le Dieu, Jasper Uijlings and R. Bernardi (2010, 2011)

SLIDE 24

Multimodal Tasks: Improve traditional NLP tasks

E. Bruni, G.B. Tran and M. Baroni (GEMS 2011, ACL 2012, Journal of AI 2014),
E. Bruni, G. Boleda, M. Baroni and N. Tran (ACL 2012)

SLIDE 25

Multimodal Vector Spaces

Kiros et al. 2014

SLIDE 26

New Multimodal Tasks: Cross-Modal Mapping

Lazaridou, Bruni and Baroni, ACL 2014

SLIDE 27

New Multimodal Tasks: Image Captioning (IC)

Datasets: Flickr, Pascal, MS-COCO (164K images, 5 captions each)
Survey: Automatic Description Generation from Images: A Survey of Models, Datasets, and Evaluation Measures, Bernardi et al., JAIR 2016
Very good talk by Karpathy (2015)

Limitations:

Evaluation measures: BLEU, ROUGE, etc., but they are not precise.
No reasoning
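Why n-gram overlap measures are imprecise can be seen with the clipped unigram precision at the core of BLEU. This is a simplified sketch: the full metric also combines higher-order n-grams and a brevity penalty, and the captions below are invented for the example.

```python
from collections import Counter

def clipped_precision(candidate, references, n=1):
    """Fraction of candidate n-grams found in a reference, with clipped counts."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate.lower().split())
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split()).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())

# Invented captions for one image of a man on a horse.
references = ["a man is riding a horse"]
good = clipped_precision("a man rides a horse", references)
wrong = clipped_precision("a horse is riding a man", references)
```

With unigrams, the caption that swaps who rides whom reuses every reference word and scores 1.0, while the correct paraphrase scores 0.8 because "rides" is not in the reference. Higher-order n-grams reduce this effect but do not remove it.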

SLIDE 28

New Multimodal Tasks: Visual Question Answering (VQA)

Limitations:

Language prior problem: blind models perform pretty well (50% accuracy on COCO-VQA!).
→ But see the development of new real-image datasets: VQA2, TDIUC

Datasets: DAQUAR 2014, COCO-QA, VQA, Visual7W, Visual Genome, VizWiz
Survey: Visual Question Answering: A Survey of Methods and Datasets, Wu et al. (2016)
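A "blind" model can be sketched as a baseline that never receives the image and simply returns the most frequent training answer for questions starting with the same words. The question/answer pairs, the two-word prefix and the function names are assumptions made for illustration; this is not the baseline behind the 50% figure.

```python
from collections import Counter, defaultdict

def question_prefix(question, prefix_len=2):
    """Use the first words of the question as a crude question type."""
    return " ".join(question.lower().split()[:prefix_len])

def train_blind_baseline(qa_pairs, prefix_len=2):
    """Store the most frequent answer for each question prefix."""
    by_prefix = defaultdict(Counter)
    for question, answer in qa_pairs:
        by_prefix[question_prefix(question, prefix_len)][answer] += 1
    return {p: counts.most_common(1)[0][0] for p, counts in by_prefix.items()}

def answer_blind(model, question, prefix_len=2, default="yes"):
    """Answer from the question text alone; the image is never used."""
    return model.get(question_prefix(question, prefix_len), default)

# Toy training data, invented for illustration only.
train = [
    ("Is there a dog?", "yes"),
    ("Is there a cat?", "yes"),
    ("Is there a car?", "no"),
    ("What color is the banana?", "yellow"),
    ("What color is the sky?", "blue"),
    ("What color is the lemon?", "yellow"),
    ("How many people are there?", "2"),
]
model = train_blind_baseline(train)
guess = answer_blind(model, "What color is the taxi?")
```

When answers are skewed per question type, such guesses are right often enough to look competitive, which is the language prior problem noted above.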

SLIDE 29

New Multimodal Tasks

Image-Text Alignment
Datasets: Faces in the Wild, Flickr30k Entities, VRD, Visual Genome
Duygulu et al. 2002, Barnard et al. 2003, Berg et al. 2004, Plummer et al. 2015, Karpathy and Fei-Fei 2015, Zhu et al. 2015, Krishna et al. 2016, Lu et al. 2016

Referring Expressions
Datasets: D-TUNA Corpus, Referit Game Dataset, Referit Game MS-COCO
Mitchell et al. 2013, Fitzgerald et al. 2013, Kazemzadeh et al. 2014, Mao et al. 2015, Yu et al. 2016, Hu et al. 2016, Yu et al. 2017, Nagaraja et al. 2016, Fang et al. 2015

Credits: Vicente Ordóñez-Román

SLIDE 30

New Multimodal Tasks: Diagnostic Datasets: FOIL

Shekhar et al., ACL 2017: https://foilunitn.github.io/

SLIDE 31

New Multimodal Tasks: Diagnostic Datasets: CLEVR

Johnson et al., CVPR 2017: https://cs.stanford.edu/people/jcjohns/clevr/

SLIDE 32

New Multimodal Tasks: Diagnostic Datasets: NLVR

Suhr et al., ACL 2017: https://github.com/clic-lab/nlvr

SLIDE 33

Other more recent New Multimodal Tasks:

Spoken VQA
Multimodal Machine Translation
Image Generation
Visual Dialogue
Visual Story Telling (Huang et al. 2016)
Question Generation (Mostafazadeh et al. 2016, Jain et al. 2017)
Explanation (Park et al. 2018), Counter-factual (Hendricks et al. 2018), Inferences (Iyyer et al. 2017), Entailment (Vu et al. 2018)
Emotion recognition (You et al. 2016)
Learning to quantify (vague quantifiers, exact numbers) (Pezzelle et al. 2016, 2017, 2018)
…

SLIDE 34

Visual Dialogue: GuessWhat?! game

Collected by de Vries et al. 2017 via AMT
Two participants see an image (from MS-COCO).
155K dialogues about 66K different images
Average number of QA pairs per game: 5.2
84.6% of the games are completed successfully

See also: Visual Dialog https://visualdialog.org

SLIDE 35

Multimodal Representation

Multimodal Distributional Semantics, Bruni, Tran and Baroni (2014)
Combining Language and Vision with a Multimodal Skipgram Model, Lazaridou, Pham and Baroni (2015)

SLIDE 36

Basic Multimodal Models: Point-wise multiplication
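Point-wise multiplication fuses a language vector and an image vector by multiplying them dimension by dimension. The sketch assumes both vectors have already been projected to the same size; the function name and the toy values are illustrative only.

```python
def pointwise_fusion(question_vec, image_vec):
    """Element-wise product of two vectors of equal length."""
    if len(question_vec) != len(image_vec):
        raise ValueError("both vectors must have the same dimensionality")
    return [q * v for q, v in zip(question_vec, image_vec)]

# Toy vectors; in a VQA model these would come from an LSTM and a CNN.
fused = pointwise_fusion([0.5, -1.0, 2.0], [2.0, 3.0, 0.0])
```

A dimension stays active in the fused vector only if it is non-zero in both modalities, so each modality acts as a gate on the other. The fused vector is then usually passed to a classifier over candidate answers.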

SLIDE 37

What has the community gained?

Attention Networks
Hierarchical Co-attention
Bottom-up Top-down attention
Compositionality
Multi-modal Pooling

Bottom-Up and Top-Down Attention, Anderson et al., CVPR 18
Multimodal Compact Bilinear Pooling, Fukui et al., EMNLP 16
Neural Module Networks, Andreas et al., CVPR 16
Hierarchical Question-Image Co-Attention, Lu et al., NIPS 16
Stacked Attention Networks, Yang et al., CVPR 16

Credits: Aishwarya Agrawal
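The shared idea behind the attention models listed above can be sketched as soft attention: the question vector scores each image region, the scores are normalized with a softmax, and the regions are averaged with those weights. Dot-product scoring, the region features and the names used are assumptions for the example; the cited models learn their scoring functions.

```python
import math

def soft_attention(query, region_features):
    """Return attention weights over regions and the weighted average feature."""
    scores = [sum(q * r for q, r in zip(query, region))
              for region in region_features]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(region_features[0])
    attended = [sum(w * region[d] for w, region in zip(weights, region_features))
                for d in range(dim)]
    return weights, attended

# Toy features for three image regions and one question vector.
regions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
question = [2.0, 0.0]
weights, attended = soft_attention(question, regions)
```

The region most similar to the question receives the largest weight, so the attended vector is dominated by the part of the image that is relevant to the question.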

SLIDE 38

Cutting-edge fancy models: Learning Paradigms

Adversarial learning
Reinforcement Learning
Cooperative Learning
…

SLIDE 39

Surveys

ACL 2017: https://www.cs.cmu.edu/~morency/MMML-Tutorial-ACL2017.pdf

COLING 2018: https://arxiv.org/abs/1806.06371

SLIDE 40

Some research groups

Stanford Vision Lab, Fei-Fei Li: http://vision.stanford.edu/
MIT, Antonio Torralba: http://web.mit.edu/torralba/www/
University of North Carolina, Tamara Berg: http://www.tamaraberg.com/
Virginia Tech, Devi Parikh: https://filebox.ece.vt.edu/~parikh/CVL.html
CLIC: http://clic.cimec.unitn.it/lavi/
Edinburgh University (M. Lapata, F. Keller)
University of Sheffield, Lucia Specia: http://staffwww.dcs.shef.ac.uk/people/L.Specia/
Universitat Pompeu Fabra, COLT group, Gemma Boleda: http://gboleda.utcompling.com/
Facebook FAIR
Google DeepMind
More on the iV&L Net COST Action: http://www.cost.eu/COST_Actions/ict/Actions/IC1307

SLIDE 41

LaVi @ UniTn

Learning the meaning of Quantifiers from Language and Vision: https://quantit-clic.github.io/
Visually Grounded Talking Agents (in collaboration with UvA): https://vista-unitn-uva.github.io/
Grounded TE (in collaboration with Malta): https://github.com/claudiogreco/coling18-gte

Ongoing work:

Be Different to Be Better (with SAP)
Continual learning

SLIDE 42

UniTN LaVi People

Ionut (->Barcelona), Sandro, Ravi, Aliia, Claudio, Alberto, Aurelie, me

SLIDE 43

References on tasks

Jain, U., Zhang, Z., Schwing, A.G.: Creativity: Generating Diverse Questions using Variational Autoencoders. In: CVPR. (2017) 5415-5424
Li, Y., Huang, C., Tang, X., Change Loy, C.: Learning to Disambiguate by Asking Discriminative Questions. In: Proceedings of the IEEE International Conference on Computer Vision. (2017) 3419-3428
Vondrick, C., Oktay, D., Pirsiavash, H., Torralba, A.: Predicting motivations of actions by leveraging text. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 2997-3005
Park, D.H., Hendricks, L.A., Akata, Z., Rohrbach, A., Schiele, B., Darrell, T., Rohrbach, M.: Multimodal Explanations: Justifying Decisions and Pointing to the Evidence. In: 31st IEEE Conference on Computer Vision and Pattern Recognition. (2018)
Hendricks, L.A., Hu, R., Darrell, T., Akata, Z.: Generating Counterfactual Explanations with Natural Language. arXiv preprint arXiv:1806.09809 (2018)
You, Q., Luo, J., Jin, H., Yang, J.: Building a Large Scale Dataset for Image Emotion Recognition: The Fine Print and The Benchmark. In: AAAI. (2016) 308-314
Yu, L., Park, E., Berg, A.C., Berg, T.L.: Visual madlibs: Fill in the blank description generation and question answering. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 2461-2469
Iyyer, M., Manjunatha, V., Guha, A., Vyas, Y., Boyd-Graber, J.L., Daume III, H., Davis, L.S.: The Amazing Mysteries of the Gutter: Drawing Inferences Between Panels in Comic Book Narratives. In: CVPR. (2017) 6478-6487

→ For a rather extensive overview see Pezzelle et al., SiVL 2018