Generating Visual Explanations
Lisa Anne Hendricks et al. (Mar 2016), UC Berkeley. Presented by Anurag Patil.
Outline
1. Motivation
2. The Problem and Importance
3. The Approach
   a. The Relevance Loss
   b. The Discriminative Loss
4. Dataset
5. Experiments and Results
6. Critique
Explainable AI: why should we care about it? Explainability is about trust: it is important to know why our self-driving car decided to slam on the brakes. Explanations are required for regulatory compliance in certain industries, e.g. medical diagnosis, or the Equal Credit Opportunity Act in the US. Explanations can also facilitate model validation and debugging: models may learn predictive but not necessarily causal patterns in the training data, and explanations can reveal such spurious associations. But there is a tradeoff between performance and explainability.
Two broad ideas:
1. Introspection explanation systems, which explain how a model determines its final output (e.g. "This is a Western Grebe because filter 2 has a high activation...").
2. Justification explanation systems, which produce sentences detailing how the visual evidence is compatible with the system's output (e.g. "This is a Western Grebe because it has red eyes...").
Here, we look at justification explanation systems because they are better suited for non-experts, applying the idea of explainability to classification by visual systems.
Description: a sentence based only on visual information (image captioning systems). Visual Explanation: a sentence that details why a certain category is appropriate for a given image, while mentioning only image-relevant features.
Condition language generation on both the image and the predicted class label; other captioning models condition only on visual features. To do this, combine a fine-grained recognition pipeline with a novel loss function that injects class-discriminative information. Challenge: class specificity is a global sentence property. The words "black" or "red eye" are not very class-discriminative on their own, but the entire sentence "This is an all black bird with a bright red eye" is specific to the Bronzed Cowbird. Typical loss functions instead optimize word-level alignment between the generated and ground-truth sentences.
Inputs: [image, category label, ground-truth sentence]
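To make the conditioning concrete, here is a minimal PyTorch sketch (all names, such as `ExplanationLSTM`, are hypothetical and not from the paper's released code) of a word LSTM that sees the image features and a class embedding at every time step:

```python
import torch
import torch.nn as nn

class ExplanationLSTM(nn.Module):
    """Hypothetical sketch: condition word prediction on image I and class C."""
    def __init__(self, vocab_size, img_dim, num_classes, emb=256, hid=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb)
        self.cls_emb = nn.Embedding(num_classes, emb)
        self.img_proj = nn.Linear(img_dim, emb)
        self.lstm = nn.LSTM(3 * emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, words, img_feat, label):
        # words: (B, T) token ids; img_feat: (B, img_dim); label: (B,)
        T = words.size(1)
        cond = torch.cat([self.img_proj(img_feat), self.cls_emb(label)], dim=-1)
        cond = cond.unsqueeze(1).expand(-1, T, -1)   # repeat I and C at every step
        h, _ = self.lstm(torch.cat([self.word_emb(words), cond], dim=-1))
        return self.out(h)                           # next-word logits per step
```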
Proposed loss: a combination of a Relevance Loss and a Discriminative Loss, so that generated sentences are both image relevant and category specific.
Relevance Loss: the log-likelihood of ground-truth word sequences conditioned on the image and category, averaged across all sequences for all classes in the train set:

L_R = (1/N) Σ_n Σ_t log p(w_{t+1} | w_0, …, w_t, I, C)

where N = the batch size, w_t = ground-truth word, I = image, C = category.
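Under the hypothetical model sketch above, this is an ordinary teacher-forced cross-entropy (the negative of the average log-likelihood):

```python
import torch.nn.functional as F

def relevance_loss(model, words, img_feat, label):
    # Negative log-likelihood of the ground-truth next word, conditioned
    # on the image and category (sketch, not the paper's code).
    logits = model(words[:, :-1], img_feat, label)   # predicts w_{t+1}
    targets = words[:, 1:]                           # shifted ground truth
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```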
Discriminative Loss: computed on sentences w̃ sampled from the LSTM p(w | I, C), the model's estimate of the conditional distribution. Each sampled description w̃ receives a reward R_D(w̃) = p(C | w̃), the probability a sentence classifier assigns to the correct class given only the text; training maximizes E[R_D(w̃)], the expected reward over sampled descriptions.
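A sketch of the reward, assuming a separately trained sentence classifier that maps a token sequence to class logits (a hypothetical interface):

```python
import torch

@torch.no_grad()  # the reward is a constant w.r.t. the generator's gradients
def discriminative_reward(sentence_classifier, sampled_words, label):
    # R_D(w~) = p(C | w~): probability the classifier assigns the
    # correct class given only the sampled text.
    probs = sentence_classifier(sampled_words).softmax(dim=-1)
    return probs.gather(1, label.unsqueeze(1)).squeeze(1)  # (B,)
```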
Reinforcement-learning view:
Agent = the LSTM
Environment = previously generated words
Action = predict the next word, based on the policy and the environment
Policy = defined by the network weights W
The expected reward is intractable to compute exactly, so it is estimated with Monte Carlo sampling from the LSTM p(w | I, C).
REINFORCE property: the gradient of the expected reward can be rewritten as an expectation over sampled descriptions,

∇_W E[R_D(w̃)] = E[R_D(w̃) ∇_W log p(w̃)]

where log p(w̃) is the log-likelihood of the sampled description.
L_R = the log-likelihood of the ground-truth description (the relevance term in the final objective).
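Putting the pieces together, a sketch of one training step on the combined objective; `sample_with_log_prob` is an assumed helper (not shown) that samples a description w̃ from p(w | I, C) and returns the summed log-probability of its tokens:

```python
def training_step(model, sentence_classifier, words, img_feat, label, lam=0.5):
    # Minimize NLL(ground truth) - lam * E[R_D(w~)], with the expectation
    # estimated by Monte Carlo samples from p(w | I, C).
    l_rel = relevance_loss(model, words, img_feat, label)

    # Sample w~ and keep log p(w~ | I, C); hypothetical helper.
    sampled, log_prob = sample_with_log_prob(model, img_feat, label)
    reward = discriminative_reward(sentence_classifier, sampled, label)

    # REINFORCE surrogate: the gradient of -(reward * log_prob) is
    # -R_D(w~) * grad log p(w~), an unbiased estimate of -grad E[R_D].
    l_disc = -(reward.detach() * log_prob).mean()
    return l_rel + lam * l_disc
```

Because `reward` is detached, gradients flow only through log p(w̃), matching the REINFORCE identity above.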
Caltech-UCSD Birds (CUB):
200 classes of North American bird species | 11,788 images | 5 captions per image.
Every image is associated with a single class label.
Baseline and ablation models: the description baseline conditions only on the image (equivalent to LRCN).
Metrics: METEOR and CIDEr for image relevance; class similarity and class rank for class relevance.
Small gains in the automatic image-relevance metrics, but large gains in the class-relevance metrics.
Comparison of Explanations, Baselines, and Ablations.
Comparison of Explanations and Definitions.
Comparison of Explanations and Descriptions.
(Qualitative example: generated explanations mention class-relevant features, e.g. for distinguishing vireo types.)
○ Novel motivation of making models more explainable to non-experts.
○ Novel loss function that incorporates a global sentence property; the loss function is also broadly applicable beyond this task.
○ Ablation study of all the important model components, with reasoning behind the model design decisions.
○ If the underlying network feature were not actually identifying the red eye, but instead identifying that a bird is flying over water, there would be no way to know.
○ Every image belongs to exactly one class, so each sentence and image is associated with a single label.
○ Could the variance of the gradient estimate in REINFORCE be reduced by including a baseline?
○ What about other reward functions, e.g. based on class similarity or class rank?
○ Could attention layers be used to combine text and image features?
○ Why didn't the accuracy of the LSTM classifier matter?
○ Comparison with other SOTA image captioning models?
○ The authors should include a reason for why one sentence was ranked higher than another.
L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, T. Darrell. Generating Visual Explanations. European Conference on Computer Vision (ECCV), 2016.