Generating Visual Explanations
Lisa Anne Hendricks et al. (Mar 2016), UC Berkeley. Presented by Anurag Patil.
Outline
1. Motivation
2. The Problem and Importance
3. The Approach
   a. The Relevance Loss
   b. The Discriminative Loss
4. Dataset
5. Experiments and Results
6. Critique
Explainable AI: why should we care about it? Explainability is about trust: it is important to know why our self-driving car decided to slam on the brakes. Explanations are required for regulatory compliance in certain industries, e.g. medical diagnosis, or the Equal Credit Opportunity Act in the US. Explanations can also facilitate model validation and debugging: models may learn predictive but not necessarily causal patterns in the training data, and explanations can reveal such spurious associations. But there is a tradeoff between performance and explainability.
Two broad ideas:
1. Introspection explanation systems, which explain how a model determines its final output (e.g. "This is a Western Grebe because filter 2 has a high activation...").
2. Justification explanation systems, which produce sentences detailing how the visual evidence is compatible with the system's output (e.g. "This is a Western Grebe because it has red eyes...").
Here, we look at justification explanation systems because they are better suited for non-experts, applying the idea of explainability to classification by visual systems.
Description: a sentence based only on visual information (image captioning systems). Visual Explanation: a sentence that details why a certain category is appropriate for a given image, while mentioning only image-relevant features.
Condition language generation on both the image and the predicted class label; other captioning models condition only on visual features. To do this, combine a fine-grained recognition pipeline with a novel loss function that injects class-discriminative information. Challenge: class specificity is a global sentence property. The words "black" or "red eye" are not very class-discriminative on their own, but the entire sentence "This is an all black bird with a bright red eye" is specific to the Bronzed Cowbird. Typical loss functions instead optimize word-level alignment between the generated and ground-truth sentences.
Inputs: [image, category label, ground-truth sentence]
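To make the conditioning concrete, here is a minimal PyTorch sketch (all names, such as `ExplanationLSTM`, are hypothetical and not from the paper's released code) of a word LSTM that sees the image features and a class embedding at every time step:

```python
import torch
import torch.nn as nn

class ExplanationLSTM(nn.Module):
    """Hypothetical sketch: condition word prediction on image I and class C."""
    def __init__(self, vocab_size, img_dim, num_classes, emb=256, hid=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb)
        self.cls_emb = nn.Embedding(num_classes, emb)
        self.img_proj = nn.Linear(img_dim, emb)
        self.lstm = nn.LSTM(3 * emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, words, img_feat, label):
        # words: (B, T) token ids; img_feat: (B, img_dim); label: (B,)
        T = words.size(1)
        cond = torch.cat([self.img_proj(img_feat), self.cls_emb(label)], dim=-1)
        cond = cond.unsqueeze(1).expand(-1, T, -1)   # repeat I and C at every step
        h, _ = self.lstm(torch.cat([self.word_emb(words), cond], dim=-1))
        return self.out(h)                           # next-word logits per step
```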
Proposed loss: a combination of a Relevance Loss and a Discriminative Loss, so that generated sentences are both image relevant and category specific.
Relevance Loss: the log-likelihood of ground-truth word sequences conditioned on the image and category, averaged across all sequences for all classes in the train set:

L_R = (1/N) Σ_n Σ_t log p(w_{t+1} | w_0, …, w_t, I, C)

where N = the batch size, w_t = ground-truth word, I = image, C = category.
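Under the hypothetical model sketch above, this is an ordinary teacher-forced cross-entropy (the negative of the average log-likelihood):

```python
import torch.nn.functional as F

def relevance_loss(model, words, img_feat, label):
    # Negative log-likelihood of the ground-truth next word, conditioned
    # on the image and category (sketch, not the paper's code).
    logits = model(words[:, :-1], img_feat, label)   # predicts w_{t+1}
    targets = words[:, 1:]                           # shifted ground truth
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```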
Discriminative Loss: computed on sentences w̃ sampled from the LSTM p(w | I, C), the model's estimate of the conditional distribution. Each sampled description w̃ receives a reward R_D(w̃) = p(C | w̃), the probability a sentence classifier assigns to the correct class given only the text; training maximizes E[R_D(w̃)], the expected reward over sampled descriptions.
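A sketch of the reward, assuming a separately trained sentence classifier that maps a token sequence to class logits (a hypothetical interface):

```python
import torch

@torch.no_grad()  # the reward is a constant w.r.t. the generator's gradients
def discriminative_reward(sentence_classifier, sampled_words, label):
    # R_D(w~) = p(C | w~): probability the classifier assigns the
    # correct class given only the sampled text.
    probs = sentence_classifier(sampled_words).softmax(dim=-1)
    return probs.gather(1, label.unsqueeze(1)).squeeze(1)  # (B,)
```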
Reinforcement-learning view:
Agent = the LSTM
Environment = previously generated words
Action = predict the next word, based on the policy and the environment
Policy = defined by the network weights W
The expected reward is intractable to compute exactly, so it is estimated with Monte Carlo sampling from the LSTM p(w | I, C).
REINFORCE property: the gradient of the expected reward can be rewritten as an expectation over sampled descriptions,

∇_W E[R_D(w̃)] = E[R_D(w̃) ∇_W log p(w̃)]

where log p(w̃) is the log-likelihood of the sampled description.
L_R = the log-likelihood of the ground-truth description (the relevance term in the final objective).
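Putting the pieces together, a sketch of one training step on the combined objective; `sample_with_log_prob` is an assumed helper (not shown) that samples a description w̃ from p(w | I, C) and returns the summed log-probability of its tokens:

```python
def training_step(model, sentence_classifier, words, img_feat, label, lam=0.5):
    # Minimize NLL(ground truth) - lam * E[R_D(w~)], with the expectation
    # estimated by Monte Carlo samples from p(w | I, C).
    l_rel = relevance_loss(model, words, img_feat, label)

    # Sample w~ and keep log p(w~ | I, C); hypothetical helper.
    sampled, log_prob = sample_with_log_prob(model, img_feat, label)
    reward = discriminative_reward(sentence_classifier, sampled, label)

    # REINFORCE surrogate: the gradient of -(reward * log_prob) is
    # -R_D(w~) * grad log p(w~), an unbiased estimate of -grad E[R_D].
    l_disc = -(reward.detach() * log_prob).mean()
    return l_rel + lam * l_disc
```

Because `reward` is detached, gradients flow only through log p(w̃), matching the REINFORCE identity above.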
Caltech-UCSD Birds (CUB):
200 classes of North American bird species | 11,788 images | 5 captions per image.
Every image is associated with a single class label.
Baseline and ablation models: the description baseline conditions only on the image (equivalent to LRCN).
Metrics: METEOR and CIDEr for image relevance; class similarity and class rank for class relevance.
Small gains in the automatic image-relevance metrics, but large gains in the class-relevance metrics.
Comparison of Explanations, Baselines, and Ablations.
Comparison of Explanations and Definitions.
Comparison of Explanations and Descriptions.
(Qualitative example: generated explanations mention class-relevant features, e.g. for distinguishing vireo types.)
○ Novel motivation of making models more explainable to non-experts.
○ Novel loss function that incorporates a global sentence property; the loss function is also broadly applicable beyond this task.
○ Ablation study of all the important model components, with reasoning behind the model design decisions.
○ If the underlying network feature were not actually identifying the red eye, but instead identifying that a bird is flying over water, there would be no way to know.
○ Every image belongs to exactly one class, so each sentence and image is associated with a single label.
○ Could the variance of the gradient estimate in REINFORCE be reduced by including a baseline?
○ What about other reward functions, e.g. based on class similarity or class rank?
○ Could attention layers be used to combine text and image features?
○ Why didn't the accuracy of the LSTM classifier matter?
○ Comparison with other SOTA image captioning models?
○ The authors should include a reason for why one sentence was ranked higher than another.
L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, T. Darrell. Generating Visual Explanations. European Conference on Computer Vision (ECCV), 2016.