MAttNet: Modular Attention Network for Referring Expression Comprehension



SLIDE 1

Tong Gao

March 2020

MAttNet: Modular Attention Network for Referring Expression Comprehension

SLIDE 2

Background

  • Referring expressions are natural language utterances that indicate particular objects within a scene
  • Most prior work uses a concatenation of all features as input and an LSTM to encode/decode the whole expression, ignoring the compositional structure of referring expressions

SLIDE 3

Introduction

  • MAttNet is the first modular network for the general referring expression comprehension task
  • It decomposes the referring expression into three phrase embeddings, which are used to trigger visual modules for:

Subject

Location

Relationship

SLIDE 4

Our Model - Workflow

Given a candidate object o_i and referring expression r:

1. Language Attention Network -> 3 phrase embeddings (q^subj, q^loc, q^rel)
2. Three visual modules -> matching scores between o_i and the phrase embeddings
3. Weighted combination of these scores -> overall matching score S(o_i, r)
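
As a sketch of step 3, the weighted combination is just a dot product between the per-module scores and the module weights produced by the language attention network. The function and variable names here are illustrative, not from the authors' code:

```python
def overall_score(scores, weights):
    """Weighted combination of per-module matching scores.

    scores, weights: dicts keyed by module name ("subj", "loc", "rel");
    the weights come from the language attention network and sum to 1.
    Returns the overall matching score S(o_i, r).
    """
    return sum(weights[m] * scores[m] for m in ("subj", "loc", "rel"))

# Example: for "woman in red", the subject module dominates
s = overall_score({"subj": 0.9, "loc": 0.2, "rel": 0.1},
                  {"subj": 0.6, "loc": 0.3, "rel": 0.1})
```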

SLIDE 5

Language Attention Network

SLIDE 6

Language Attention Network

Constructed on word embeddings, with 3 individual trainable embeddings f_m (one per module)
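
The attention behind these phrase embeddings can be sketched as follows: each f_m scores every word, the scores are softmax-normalized, and the phrase embedding is the weighted sum of word vectors. The paper applies this attention over bi-LSTM hidden states; this sketch attends over the word embeddings directly for brevity, and all names are illustrative:

```python
import math

def phrase_embedding(word_vecs, f_m):
    """Attention-pool word vectors into one module-specific phrase embedding.

    word_vecs: list of word embeddings (lists of floats)
    f_m: one of the three module attention vectors (f_m in the slides)
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    logits = [dot(f_m, w) for w in word_vecs]
    peak = max(logits)                      # subtract max for stability
    exps = [math.exp(l - peak) for l in logits]
    z = sum(exps)
    weights = [e / z for e in exps]         # softmax over words
    dim = len(word_vecs[0])
    return [sum(weights[t] * word_vecs[t][d] for t in range(len(word_vecs)))
            for d in range(dim)]
```

Running this once per f_m yields the three phrase embeddings that trigger the subject, location, and relationship modules.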

SLIDE 7

Visual Modules

  • Backbone: Faster R-CNN
  • ResNet as feature extractor
  • Crop the C3 feature for each o_i, and further compute the C4 feature
  • In the end, compute the matching scores:

Subject: S(o_i | q^subj)

Location: S(o_i | q^loc)

Relationship: S(o_i | q^rel)

SLIDE 8

Visual Modules – Subject Module – “woman in red”

SLIDE 9

Visual Modules – Subject Module – “woman in red”

1. Compute attention scores based on the feature map V and q^subj
2. Get the subject visual representation ṽ_i^subj

SLIDE 10

Visual Modules – Location Module - “cat on the right”

  • 5-d vector l_i, encoding the top-left and bottom-right positions and the area relative to the image
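
This absolute-location encoding can be sketched directly; the normalization (corners by image size, area by image area) follows the l_i definition in the MAttNet paper, and the function name is illustrative:

```python
def location_vector(box, img_w, img_h):
    """5-d absolute location encoding l_i of a box within the image.

    box: (x_tl, y_tl, x_br, y_br) in pixels.
    Returns [x_tl/W, y_tl/H, x_br/W, y_br/H, box area / image area].
    """
    x_tl, y_tl, x_br, y_br = box
    w, h = x_br - x_tl, y_br - y_tl
    return [x_tl / img_w, y_tl / img_h,
            x_br / img_w, y_br / img_h,
            (w * h) / (img_w * img_h)]
```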

SLIDE 11

Visual Modules – Location Module - “second left person”

  • 5-d vector δl_ij = [Δx_tl/w_i, Δy_tl/h_i, Δx_br/w_i, Δy_br/h_i, (w_j·h_j)/(w_i·h_i)]
  • Encoding the relative location to same-category neighbors (up to five)
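
These relative offsets can be sketched as follows. Selecting the surrounding same-category neighbors is assumed to happen upstream; the names are illustrative:

```python
def relative_location_vectors(box, neighbor_boxes, max_neighbors=5):
    """δl-style 5-d offsets from a reference box to up to five neighbors.

    box, neighbor_boxes: (x_tl, y_tl, x_br, y_br) tuples in pixels.
    Corner offsets are normalized by the reference box's width/height;
    the last entry is the neighbor's area relative to the reference box.
    """
    x_tl, y_tl, x_br, y_br = box
    w, h = x_br - x_tl, y_br - y_tl
    vecs = []
    for nx_tl, ny_tl, nx_br, ny_br in neighbor_boxes[:max_neighbors]:
        nw, nh = nx_br - nx_tl, ny_br - ny_tl
        vecs.append([(nx_tl - x_tl) / w, (ny_tl - y_tl) / h,
                     (nx_br - x_br) / w, (ny_br - y_br) / h,
                     (nw * nh) / (w * h)])
    return vecs
```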

SLIDE 12

Visual Modules – Relationship Module - “cat on chaise lounge”

  • 5-d vector δm_ij, of the same form as the location module's relative offsets
  • Look for surrounding objects regardless of their categories
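
The relationship module can be sketched as scoring each surrounding object against the relation phrase embedding q^rel and keeping the best match. Cosine similarity stands in here for the paper's learned matching function, and the names are illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def relationship_score(neighbor_feats, q_rel):
    """Max-pool matching scores over surrounding objects.

    neighbor_feats: one feature per surrounding object (built from its
    visual feature and δm offset); the best-matching neighbor's score
    becomes the relationship score for the candidate object.
    """
    return max(cosine(f, q_rel) for f in neighbor_feats)
```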
SLIDE 13

Loss Function

  • Randomly sample two negative pairs (o_i, r_j) and (o_k, r_i): the true object with a mismatched expression, and a mismatched object with the true expression
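
A sketch of the hinge-style ranking loss these negative pairs feed, in the standard margin formulation: each negative pair's score is pushed below the positive pair's score by at least a margin. The margin and the weights on the two negative terms are illustrative defaults:

```python
def ranking_loss(s_pos, s_neg_expr, s_neg_obj, margin=0.1,
                 lam1=1.0, lam2=1.0):
    """Hinge ranking loss over one positive pair and two negatives.

    s_pos:      score S(o_i, r_i) of the true (object, expression) pair
    s_neg_expr: score S(o_i, r_j), true object with a sampled wrong expression
    s_neg_obj:  score S(o_k, r_i), sampled wrong object with the expression
    The loss is zero once both negatives trail the positive by `margin`.
    """
    return (lam1 * max(0.0, margin + s_neg_expr - s_pos)
            + lam2 * max(0.0, margin + s_neg_obj - s_pos))
```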

SLIDE 14

Datasets

Comparison of RefCOCO, RefCOCO+ vs. RefCOCOg:

  • Collected in: interactive game interface vs. non-interactive setting
  • Average length of expressions: 3.5 vs. 8.4
  • Same-type objects: 3.9 vs. 1.63
  • Absolute location words: yes vs. no

SLIDE 15

Datasets

Splitting:

RefCOCO, RefCOCO+:
  • For evaluation: Test A (persons), Test B (objects)
  • No overlap between training, validation, and testing sets

RefCOCOg, first partition:
  • Split by objects
  • Same images could appear in training and validation sets
  • No testing set (not released)

RefCOCOg, second partition:
  • Randomly split into training, validation, and test sets

SLIDE 16

Evaluation

SLIDE 17

Ablation Study

SLIDE 18

Ablation Study

SLIDE 19

SLIDE 20

SLIDE 21

Incorrect examples

SLIDE 22

Critique

  • Focuses on a specific domain – referring expressions – and carefully designs the model with prior knowledge
  • Compared to similar works, they utilize more visual hidden features – the C3 & C4 features from ResNet
  • Takes unbalanced-data issues into account (in the loss function of attribute prediction)
  • Good comparison and ablation study
SLIDE 23

Critique

  • The location module & relationship module may double-count the same object – should this case be considered?
  • In the relationship module, they use an unusual expression of relative object locations, dependent on the width & height of the given object o_i – why not use the image width and height W and H?
  • They could add pairs of a ground-truth expression and an object of the same type as negative examples

SLIDE 24

Critique

  • Can the model skip synonyms when selecting the top-5 attributes, so as to capture more attribute information?

SLIDE 25

Thank you!

SLIDE 26