

SLIDE 1

Generating Visual Explanations

Lisa Anne Hendricks et al. (Mar 2016), UC Berkeley | Presented by Anurag Patil

SLIDE 2

Outline

1. Motivation
2. The Problem and Importance
3. The Approach
   a. The Relevance Loss
   b. The Discriminative Loss
4. Dataset
5. Experiments and Results
6. Critique

SLIDE 3

Motivation

Explainable AI: why should we care about it?

  • Explainability is about trust. It is important to know why our self-driving car decided to slam on the brakes.
  • Explanations are required for regulatory compliance in certain industries, e.g. medical diagnosis, or the Equal Credit Opportunity Act in the US.
  • Explanations can facilitate model validation and debugging. Models learn associative (not necessarily causal) patterns in the training data, and explanations can reveal spurious associations.
  • But there is a tradeoff between performance and explainability.

SLIDE 4

Motivation : Explainable Models

Two broad ideas:

1. Introspection explanation systems: explain how a model determines its final output (e.g. "This is a Western Grebe because filter 2 has a high activation...").
2. Justification explanation systems: produce sentences detailing how the visual evidence is compatible with the system's output (e.g. "This is a Western Grebe because it has red eyes...").

Here we look at justification explanation systems because they are better suited to non-experts, and we apply this idea of explainability to classification by visual systems.

SLIDE 5

The Problem and Importance

  • Description: a sentence based only on visual information (as in image captioning systems).
  • Visual explanation: a sentence that details why a certain category is appropriate for a given image, while mentioning only image-relevant features.

SLIDE 6

The Approach

  • Condition language generation on the image and the predicted class label; other captioning models condition only on visual features.
  • To do this, use a fine-grained recognition pipeline plus a novel loss function that includes class-discriminative information.
  • Challenge: class specificity is a global sentence property, i.e. words like "black" or "red eye" are not very class-discriminative on their own, but the entire sentence "This is an all black bird with a bright red eye" is specific to the Bronzed Cowbird.
  • Typical loss functions only optimize sentence alignment between the generated sentence and the ground truth.

SLIDE 7

Note on LRCN (Long-term Recurrent Convolutional Networks, Donahue et al.; the captioning architecture the explanation model builds on)

SLIDE 8

Model

Inputs : [image, category label, ground truth sentence]
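A minimal sketch of how these inputs might be wired into an LRCN-style decoder; the module names, dimensions, and the way the class label is embedded are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ExplanationDecoder(nn.Module):
    """Illustrative decoder: conditions word generation on image features and
    the category label, and is trained to reproduce the ground-truth sentence."""

    def __init__(self, vocab_size, num_classes=200, img_dim=1024,
                 embed_dim=256, hidden_dim=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.class_embed = nn.Embedding(num_classes, embed_dim)  # assumed class encoding
        self.img_proj = nn.Linear(img_dim, embed_dim)
        # Each step sees the previous ground-truth word plus the image/class conditioning.
        self.lstm = nn.LSTM(3 * embed_dim, hidden_dim, batch_first=True)
        self.vocab_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, class_ids, gt_words):
        # img_feats: (B, img_dim), class_ids: (B,), gt_words: (B, T)
        T = gt_words.size(1)
        cond = torch.cat([self.img_proj(img_feats),
                          self.class_embed(class_ids)], dim=-1)         # (B, 2*embed_dim)
        cond = cond.unsqueeze(1).expand(-1, T, -1)                      # repeat per time step
        step_in = torch.cat([self.word_embed(gt_words), cond], dim=-1)  # (B, T, 3*embed_dim)
        hidden, _ = self.lstm(step_in)
        return self.vocab_out(hidden)                                   # (B, T, vocab_size)
```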

SLIDE 9

Proposed Loss

  • Relevance loss (LR) captures "image relevance".
  • Discriminative loss (the expected reward E[RD(w̃)]) captures "class relevance".
  • The proposed loss combines the two terms, trading off image relevance against class relevance with a weight λ on the discriminative term.
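A minimal sketch of how the two terms combine, assuming a negative log-likelihood relevance term that is minimized and an expected reward that is maximized (the name lambda_ for the trade-off weight is an assumption, not from the slides):

```python
def total_loss(relevance_nll, expected_reward, lambda_=1.0):
    """Combined objective: keep the sentence faithful to the image (low NLL on
    the ground-truth description) while making sampled sentences
    class-discriminative (high expected reward). lambda_ trades off the terms."""
    return relevance_nll - lambda_ * expected_reward
```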

SLIDE 10

Relevance Loss :

  • Produces sentences that correspond to the image content.
  • Does not explicitly encourage generated sentences that are both image-relevant and category-specific.
  • Class labels: represented by the average hidden state of a separate LSTM trained to generate word sequences conditioned only on images (averaged across all sequences for all classes in the train set).
  • LR = (1/N) Σn Σt log p(w(t+1) | w(0:t), I, C), where N = batch size | wt = ground-truth word | I = image | C = category.
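A minimal sketch of the relevance term, assuming the decoder above emits per-step vocabulary logits; the padding mask and the negative sign (so the term can be minimized as a loss) are conventions chosen here, not taken from the slides:

```python
import torch.nn.functional as F

def relevance_loss(logits, gt_words, pad_id=0):
    """Average negative log-likelihood of the ground-truth words, i.e. the
    image-relevance term. logits: (B, T, V) per-step vocabulary scores from
    the decoder; gt_words: (B, T) ground-truth next-word token ids."""
    log_probs = F.log_softmax(logits, dim=-1)
    token_ll = log_probs.gather(-1, gt_words.unsqueeze(-1)).squeeze(-1)  # (B, T)
    mask = (gt_words != pad_id).float()  # ignore padding positions
    # Minimizing this pushes up p(w_t | w_<t, I, C) for every ground-truth word.
    return -(token_ll * mask).sum() / mask.sum()
```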

SLIDE 11

Discriminative Loss :

  • Based on a reinforcement learning paradigm.
  • Reward: RD(w̃) = pD(C|w̃), where pD(C|w̃) is a pretrained sentence classifier applied to descriptions sampled from the LSTM.
  • The accuracy of this pretrained classifier turns out not to be important (it is only 22%).

Legend: p(w | I, C) = model's estimated conditional distribution over descriptions | w̃ = description sampled from the LSTM p(w | I, C) | RD(w̃) = reward for the sampled description | E[RD(w̃)] = expected reward.

RL view: Agent = the LSTM | Environment = the previously generated words | Action = predict the next word based on the policy and the environment | Policy = defined by the weights W.
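A minimal sketch of the reward computation, assuming sentence_classifier is the pretrained sentence classifier (its architecture is not specified here) and sampled_words are token ids sampled from the LSTM:

```python
import torch
import torch.nn.functional as F

def discriminative_reward(sampled_words, class_ids, sentence_classifier):
    """R_D(w~) = p_D(C | w~): the probability the pretrained sentence classifier
    assigns to the target class C for a description sampled from the LSTM.
    sampled_words: (B, T) token ids, class_ids: (B,) target class indices."""
    with torch.no_grad():  # the classifier is fixed; only the generator is trained
        class_logits = sentence_classifier(sampled_words)             # (B, num_classes)
        class_probs = F.softmax(class_logits, dim=-1)
    return class_probs.gather(-1, class_ids.unsqueeze(-1)).squeeze(-1)  # (B,)
```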

SLIDE 12

Minimizing the loss

  • Since the expectation over descriptions, E[RD(w̃)], is intractable, estimate it with Monte Carlo samples drawn from the LSTM p(w | I, C).
  • p(w | I, C) is a discrete distribution, so the sampling step cannot simply be backpropagated through.
  • To avoid differentiating RD(w̃) with respect to W, use the REINFORCE property: ∇W E[RD(w̃)] = E[RD(w̃) ∇W log p(w̃)].
  • The final gradient used to update the weights W combines the gradient of the relevance term with the REINFORCE term RD(w̃) ∇W log p(w̃).

Legend: log p(w̃) = log likelihood of the sampled description | LR = log likelihood of the ground truth description.
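A hedged sketch of the REINFORCE-style update written as a surrogate loss; sample_sentence is a hypothetical helper that draws a description from the LSTM and returns its total log probability, and the sign conventions follow the sketches above rather than the paper's exact notation:

```python
def reinforce_objective(relevance_nll, sample_log_prob, reward, lambda_=1.0):
    """Surrogate loss whose gradient matches the REINFORCE estimate:
    grad = grad(relevance_nll) - lambda_ * R_D(w~) * grad(log p(w~)).
    The reward is detached (treated as a constant), per the score-function trick."""
    return relevance_nll - lambda_ * reward.detach() * sample_log_prob

# Usage sketch, one Monte Carlo sample per image:
#   logits = decoder(img_feats, class_ids, gt_words)
#   nll = relevance_loss(logits, gt_words)
#   w_tilde, log_p = sample_sentence(decoder, img_feats, class_ids)  # hypothetical helper
#   r = discriminative_reward(w_tilde, class_ids, sentence_classifier)
#   reinforce_objective(nll, log_p, r).mean().backward()
```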

SLIDE 13

Dataset

Caltech UCSD Birds:

200 classes of North American bird species | 11,788 images | 5 captions per image

  • Every image belongs to a class, so each sentence and image is associated with a single label.
  • The captions provide descriptive details about each bird class, but do not explain why an image belongs to a certain class.
SLIDE 14

Experiments

Baseline and ablation models:

  • Description model: generates sentences conditioned only on images (equivalent to LRCN).
  • Definition model: generates sentences using only the image label as input.
  • Explanation-label: trained without the discriminative loss.
  • Explanation-discriminative: not conditioned on the predicted class.

Metrics:

  • Image relevance: METEOR, CIDEr
  • Class relevance: class similarity score, class rank
SLIDE 15

Results

Small gains in the automatic image-relevance metrics, but large gains in the class-relevance metrics.

SLIDE 16

Results

Comparison of Explanations, Baselines, and Ablations.

  • Green: correct, Yellow: mostly correct, Red: incorrect
  • "Red eye" is a class-relevant attribute
SLIDE 17

Results

Comparison of Explanations and Definitions.

  • The definition model can produce sentences that are not image-relevant.
SLIDE 18

Results

Comparison of Explanations and Descriptions.

  • Both models generate visually correct sentences.
  • "Black head" is one of the most prominent distinguishing properties of this vireo type.

SLIDE 19

Critique – The Good

  • Motivation:
    ○ Novel motivation of making models more explainable to non-experts.
  • Explanation model:
    ○ Novel loss function that incorporates a global sentence property.
    ○ The loss function also has wide, generic applicability.
  • Ablation study:
    ○ Ablation of all the important model components, giving the reasoning behind the model design decisions.

SLIDE 20

Critique – The not so good

  • Motivation:
    ○ If the underlying feature in the network was not identifying the red eye, but was instead identifying that there is a bird flying over water, there would be no way to know.
  • Dataset:
    ○ Every image belongs to a class, so sentence and image are associated with a single label.
  • Explanation model:
    ○ Could the variance of the REINFORCE gradient estimate be reduced by including a baseline?
    ○ Could other reward functions, based on class similarity or class rank, be used?
    ○ Could attention layers be used to combine text and image features?
  • Missing details:
    ○ Why didn't the accuracy of the LSTM sentence classifier matter?
  • Evaluation methodology:
    ○ Comparison with other SOTA image captioning models?
  • Human evaluation improvements:
    ○ Include the reason why one sentence was ranked higher than another.

SLIDE 21

References

  • Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, Trevor Darrell. Generating Visual Explanations. European Conference on Computer Vision (ECCV), 2016.

SLIDE 22

Additional Examples