MAttNet: Modular Attention Network for Referring Expression Comprehension



SLIDE 1

Tong Gao

March 2020

MAttNet: Modular Attention Network for Referring Expression Comprehension

SLIDE 2

Background

  • Referring expressions are natural language utterances that indicate particular objects within a scene
  • Most prior work uses a concatenation of all features as input and an LSTM to encode/decode the whole expression, ignoring the compositional structure of referring expressions

SLIDE 3

Introduction

  • MAttNet is the first modular network for the general referring expression comprehension task
  • It decomposes the referring expression into three phrase embeddings, which are used to trigger visual modules for:

Subject

Location

Relationship

SLIDE 4

Our Model - Workflow

Given a candidate object o_i and referring expression r:

1. Language Attention Network -> 3 phrase embeddings (q^subj, q^loc, q^rel)
2. Three visual modules -> matching scores between o_i and the phrase embeddings
3. Weighted combination of these scores -> overall matching score S(o_i, r)
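
As a sketch of step 3, the weighted combination is just a dot product between the per-module scores and the module weights produced by the language attention network. The function and variable names here are illustrative, not from the authors' code:

```python
def overall_score(scores, weights):
    """Weighted combination of per-module matching scores.

    scores, weights: dicts keyed by module name ("subj", "loc", "rel");
    the weights come from the language attention network and sum to 1.
    Returns the overall matching score S(o_i, r).
    """
    return sum(weights[m] * scores[m] for m in ("subj", "loc", "rel"))

# Example: for "woman in red", the subject module dominates
s = overall_score({"subj": 0.9, "loc": 0.2, "rel": 0.1},
                  {"subj": 0.6, "loc": 0.3, "rel": 0.1})
```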

SLIDE 5

Language Attention Network

SLIDE 6

Language Attention Network

Constructed on word embeddings, with 3 individual trainable embeddings f_m (one per module)
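
The attention behind these phrase embeddings can be sketched as follows: each f_m scores every word, the scores are softmax-normalized, and the phrase embedding is the weighted sum of word vectors. The paper applies this attention over bi-LSTM hidden states; this sketch attends over the word embeddings directly for brevity, and all names are illustrative:

```python
import math

def phrase_embedding(word_vecs, f_m):
    """Attention-pool word vectors into one module-specific phrase embedding.

    word_vecs: list of word embeddings (lists of floats)
    f_m: one of the three module attention vectors (f_m in the slides)
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    logits = [dot(f_m, w) for w in word_vecs]
    peak = max(logits)                      # subtract max for stability
    exps = [math.exp(l - peak) for l in logits]
    z = sum(exps)
    weights = [e / z for e in exps]         # softmax over words
    dim = len(word_vecs[0])
    return [sum(weights[t] * word_vecs[t][d] for t in range(len(word_vecs)))
            for d in range(dim)]
```

Running this once per f_m yields the three phrase embeddings that trigger the subject, location, and relationship modules.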

SLIDE 7

Visual Modules

  • Backbone: Faster R-CNN
  • ResNet as feature extractor
  • Crop the C3 feature for each o_i, and further compute the C4 feature
  • In the end, compute the matching scores:

Subject: S(o_i | q^subj)

Location: S(o_i | q^loc)

Relationship: S(o_i | q^rel)

SLIDE 8

Visual Modules – Subject Module – “woman in red”

SLIDE 9

Visual Modules – Subject Module – “woman in red”

1. Compute attention scores based on the feature map V and q^subj
2. Get the subject visual representation ṽ_i^subj

SLIDE 10

Visual Modules – Location Module - “cat on the right”

  • 5-d vector l_i, encoding the top-left and bottom-right positions and the area relative to the image
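
This absolute-location encoding can be sketched directly; the normalization (corners by image size, area by image area) follows the l_i definition in the MAttNet paper, and the function name is illustrative:

```python
def location_vector(box, img_w, img_h):
    """5-d absolute location encoding l_i of a box within the image.

    box: (x_tl, y_tl, x_br, y_br) in pixels.
    Returns [x_tl/W, y_tl/H, x_br/W, y_br/H, box area / image area].
    """
    x_tl, y_tl, x_br, y_br = box
    w, h = x_br - x_tl, y_br - y_tl
    return [x_tl / img_w, y_tl / img_h,
            x_br / img_w, y_br / img_h,
            (w * h) / (img_w * img_h)]
```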

SLIDE 11

Visual Modules – Location Module - “second left person”

  • 5-d vector δl_ij = [Δx_tl/w_i, Δy_tl/h_i, Δx_br/w_i, Δy_br/h_i, (w_j·h_j)/(w_i·h_i)]
  • Encoding the relative location to same-category neighbors (up to five)
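
These relative offsets can be sketched as follows. Selecting the surrounding same-category neighbors is assumed to happen upstream; the names are illustrative:

```python
def relative_location_vectors(box, neighbor_boxes, max_neighbors=5):
    """δl-style 5-d offsets from a reference box to up to five neighbors.

    box, neighbor_boxes: (x_tl, y_tl, x_br, y_br) tuples in pixels.
    Corner offsets are normalized by the reference box's width/height;
    the last entry is the neighbor's area relative to the reference box.
    """
    x_tl, y_tl, x_br, y_br = box
    w, h = x_br - x_tl, y_br - y_tl
    vecs = []
    for nx_tl, ny_tl, nx_br, ny_br in neighbor_boxes[:max_neighbors]:
        nw, nh = nx_br - nx_tl, ny_br - ny_tl
        vecs.append([(nx_tl - x_tl) / w, (ny_tl - y_tl) / h,
                     (nx_br - x_br) / w, (ny_br - y_br) / h,
                     (nw * nh) / (w * h)])
    return vecs
```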

SLIDE 12

Visual Modules – Relationship Module - “cat on chaise lounge”

  • 5-d vector δm_ij, of the same form as the location module's relative offsets
  • Look for surrounding objects regardless of their categories
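
The relationship module can be sketched as scoring each surrounding object against the relation phrase embedding q^rel and keeping the best match. Cosine similarity stands in here for the paper's learned matching function, and the names are illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def relationship_score(neighbor_feats, q_rel):
    """Max-pool matching scores over surrounding objects.

    neighbor_feats: one feature per surrounding object (built from its
    visual feature and δm offset); the best-matching neighbor's score
    becomes the relationship score for the candidate object.
    """
    return max(cosine(f, q_rel) for f in neighbor_feats)
```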
SLIDE 13

Loss Function

  • Randomly sample two negative pairs (o_i, r_j) and (o_k, r_i): the true object with a mismatched expression, and a mismatched object with the true expression
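
A sketch of the hinge-style ranking loss these negative pairs feed, in the standard margin formulation: each negative pair's score is pushed below the positive pair's score by at least a margin. The margin and the weights on the two negative terms are illustrative defaults:

```python
def ranking_loss(s_pos, s_neg_expr, s_neg_obj, margin=0.1,
                 lam1=1.0, lam2=1.0):
    """Hinge ranking loss over one positive pair and two negatives.

    s_pos:      score S(o_i, r_i) of the true (object, expression) pair
    s_neg_expr: score S(o_i, r_j), true object with a sampled wrong expression
    s_neg_obj:  score S(o_k, r_i), sampled wrong object with the expression
    The loss is zero once both negatives trail the positive by `margin`.
    """
    return (lam1 * max(0.0, margin + s_neg_expr - s_pos)
            + lam2 * max(0.0, margin + s_neg_obj - s_pos))
```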

SLIDE 14

Datasets

Comparison of RefCOCO, RefCOCO+ vs. RefCOCOg:

  • Collected in: interactive game interface vs. non-interactive setting
  • Average length of expressions: 3.5 vs. 8.4
  • Same-type objects: 3.9 vs. 1.63
  • Absolute location words: yes vs. no

SLIDE 15

Datasets

Splitting:

RefCOCO, RefCOCO+:
  • For evaluation: Test A (persons), Test B (objects)
  • No overlap between training, validation, and testing sets

RefCOCOg, first partition:
  • Split by objects
  • Same images could appear in training and validation sets
  • No testing set (not released)

RefCOCOg, second partition:
  • Randomly split into training, validation, and test sets

SLIDE 16

Evaluation

SLIDE 17

Ablation Study

SLIDE 18

Ablation Study

SLIDE 19

SLIDE 20

SLIDE 21

Incorrect examples

SLIDE 22

Critique

  • Focuses on a specific domain – referring expressions – and carefully designs the model with prior knowledge
  • Compared to similar works, they utilize more visual hidden features – the C3 & C4 features from ResNet
  • Takes unbalanced-data issues into account (in the loss function of attribute prediction)
  • Good comparison and ablation study
SLIDE 23

Critique

  • The location module & relationship module may double-count the same object – should this case be considered?
  • In the relationship module, they use an unusual expression of relative object locations, dependent on the width & height of the given object o_i – why not use the image width and height W and H?
  • They could add pairs of a ground-truth expression and an object of the same type as negative examples

SLIDE 24

Critique

  • Can the model skip synonyms when selecting the top-5 attributes, so as to capture more attribute information?

SLIDE 25

Thank you!

SLIDE 26