SLIDE 1 In Search of a Unifying Theory for Image Interpretation
Donald Geman
Department of Applied Mathematics and Statistics and Center for Imaging Science, Whitaker Institute, Johns Hopkins University
SLIDE 2
Outline
Semantic Scene Interpretation
Frameworks, Theories
Hierarchical Testing
The Efficiency of Abstraction
SLIDE 3
Orientation within Imaging
Sensors to Images
Images to Images
Images to Words
SLIDE 4
Images to Words
Computational vision remains a major challenge (and natural vision a mystery).
Assume one grey-level image.
Generally massive local ambiguity, but less so globally, e.g., at the semantic level of keywords.
SLIDE 5
Tasks
Object identification (find my car) and categorization (find all cars)
Recognition of multiple objects, activities, contexts, etc.
Ideally, a description machine from images to rich scene interpretations.
SLIDE 6
Scene
Slide credit: Li Fei-Fei
SLIDE 7
Is that a picture of Mao?
SLIDE 8
Are there cars?
SLIDE 9 Multiple Object Categorization
(Image labels: sky, building, flag, wall, face, banner, street lamp, bus, bus, cars)
SLIDE 10
Scene Categorization
SLIDE 11
Confounding Factors
Clutter
Invariance to viewpoint and photometric variation
Invariance vs. selectivity
SLIDE 12
Clutter
SLIDE 13 Clutter
Klimt, 1913
SLIDE 14 Viewpoint Variation
Michelangelo 1475-1564
Slide credit: Li Fei-Fei
SLIDE 15
Lighting Variation
Slide credit: Shimon Ullman
SLIDE 16 Occlusion
Magritte, 1957
SLIDE 17 Occlusion
Xu Beihong, 1943
SLIDE 18
Within-Class Variation
SLIDE 19
Within-Class Variation
SLIDE 20
How Many Samples are Needed?
SLIDE 21
Where Things Stand
Reasonable performance for several classes of semi-rigid objects.
Even for face detection, a large “ROC gap” with human vision.
Full scene parsing is currently beyond reach.
SLIDE 22
Where Are the Faces?
SLIDE 23 The ROC Gap: Face Detection
Current computer vision: approximately one hallucination per scene at ninety percent detection.
SLIDE 25
Francisco’s Kitchen
SLIDE 26
Notation
I: greyscale image
Y: distinguished descriptions of I. Ex: strings of (class, pose) pairs
Y ∈ {0} ∪ Y: hidden r.v.
Ŷ(I): estimated description(s) from Y
L = {(I, Y)}: finite training set, in theory i.i.d. under P(I, Y)
SLIDE 27
Description Machine Specs
DESIGN and LEARNING: An explicit set of instructions for building Ŷ from L.
COMPUTATION: An explicit set of instructions for evaluating Ŷ(I) with as little computation as possible.
ANALYSIS: A “supporting theory” which guides construction and predicts performance.
SLIDE 28 Ground Truth
For Y sufficiently restricted, it is reasonable to assume a “true interpretation” of I:
- Y = {face}, Y = {indoor, outdoor}, …
- More generally, Y = {(c1,θ1),…,(ck,θk)}, limited to specific categories and rough poses.
Corresponds to Pemp(Y|I) = δf(I)(Y), where Pemp(I,Y) is the empirical distribution over a gigantic sample (I1,Y1), (I2,Y2), …
SLIDE 29
Outline
Semantic Scene Interpretation
Frameworks, Theories
Hierarchical Testing
The Efficiency of Abstraction
SLIDE 30
Deceased Frameworks
Traditional “AI” (60s, 70s)
Stepwise, bottom-up 3D metric reconstruction (80s)
Algebraic, geometric invariants (90s)
… but who knows
SLIDE 31
Living Frameworks
Generative modeling
Discriminative learning
Information-theoretic
SLIDE 32
Generative Modeling
Not all observations and explanations are equally likely. Construct P(I, Y) from:
- A distribution P(Y) on interpretations.
- A data model P(I|Y).
Inference principle:
Ŷ(I) = arg maxY P(Y|I) = arg minY {−log P(I|Y) − log P(Y)}
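To make the inference principle concrete, here is a minimal sketch in Python. The candidate set, log_prior, and log_likelihood are hypothetical stand-ins for a real generative model (templates, HMMs, etc.); the point is only the arg-min over −log P(I|Y) − log P(Y).

```python
import math

# Hypothetical toy model: a few candidate interpretations Y with a
# prior P(Y); in a real system P(I|Y) would come from deformable
# templates, HMMs, graphical models, and so on.
log_prior = {"face": math.log(0.01), "car": math.log(0.02), "background": math.log(0.97)}

def log_likelihood(image, y):
    """Placeholder data model log P(I|Y=y); assumed given."""
    scores = {"face": -120.0, "car": -150.0, "background": -140.0}
    return scores[y]

def map_interpretation(image):
    # Y_hat(I) = argmax_Y P(Y|I) = argmin_Y { -log P(I|Y) - log P(Y) }
    return min(log_prior, key=lambda y: -log_likelihood(image, y) - log_prior[y])

print(map_interpretation(None))  # -> "face" for these toy numbers
```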
SLIDE 33
Examples
Deformable templates
Hidden Markov models
Probabilities on part hierarchies
Graphical models, e.g., Bayesian networks
Gaussian models (LDA, mixtures, etc.)
SLIDE 34
Generative: Critique
In principle, a very general framework. In practice:
- Diabolically hard to model and learn P(Y).
- Intense online computation.
- P(I|Y) alone (i.e., “templates-for-everything”) lacks selectivity and requires too much computation.
SLIDE 35 Discriminative Learning
Proceed (almost) directly from data to decision boundaries. Representation and learning (see the sketch below):
- Replace I by a fixed-length feature vector X
- Quantize Y to a small number of classes
- Specify a family F of “classifiers” f(X)
- Induce f(X) directly from a training set L
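A minimal sketch of this four-step recipe, assuming a toy grey-level-histogram feature map and 1-nearest-neighbor as the classifier family F; the training “images” are random placeholders.

```python
import numpy as np

def features(image):
    """Step 1: replace I by a fixed-length feature vector X
    (placeholder: a 16-bin histogram of grey levels)."""
    hist, _ = np.histogram(image, bins=16, range=(0, 256))
    return hist / hist.sum()

# Step 2: Y quantized to a small number of classes (here 0/1).
# Step 3: family F = 1-nearest-neighbor rules.
# Step 4: induce f from the training set L by memorizing it.
def fit_1nn(train_images, train_labels):
    X = np.array([features(im) for im in train_images])
    y = np.array(train_labels)
    def f(image):
        d = np.linalg.norm(X - features(image), axis=1)
        return y[np.argmin(d)]
    return f

# Hypothetical training set L = {(I, Y)}.
rng = np.random.default_rng(0)
L_images = [rng.integers(0, 256, size=(64, 64)) for _ in range(20)]
L_labels = [i % 2 for i in range(20)]
f = fit_1nn(L_images, L_labels)
print(f(L_images[3]))  # -> 1 (a memorized training example)
```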
SLIDE 36 Examples
In effect, learn P(Y|X) (or log posterior odds ratios) directly:
- Artificial neural networks
- k-NN with smart metrics
- Decision trees
- Support vector machines (interpretation as Bayes rule via logistic regression)
- Multiple classifiers (e.g., random forests)
SLIDE 37 Learning: Critique
In principle:
- Universal learning machines which mimic natural processes and “learn” everything (e.g., invariance).
- Solid foundations in statistical learning theory (although |L| ↓ 1 is the interesting limit).
In practice, lacks a global structure to address:
- A very large number of classes (say 30,000)
- Small samples, bias vs. variance, invariance vs. selectivity.
SLIDE 38
Information-theoretic
Established connections between IT and imaging, but mostly at the “tool” level and for “low-level vision.” Two emerging frameworks:
- “Information scaling” (Zhu)
- Resource/complexity tradeoffs and “information refinement” (O’Sullivan et al.)
Both tilted towards “theory”.
SLIDE 39
SLIDE 40
An Information Theory Constellation
Slide credit: Laurent Younes
SLIDE 41
Overall Critique
Current generative and discriminative methods lack efficiency.
Problem-specific structure is absent, and hence so is a global organizing principle for vision.
Sparse theoretical support (especially for practical systems).
SLIDE 42 Hierarchical Vision
Exploit shared components among objects and interpretations.
Incorporate discriminative and generative methods as necessary.
Can yield efficient representation, learning and computation.
SLIDE 43
Simple Part Hierarchy
SLIDE 44
Examples
Compositional systems (S. Geman)
Hierarchies of fragments (Ullman)
Hierarchies of conjunctions and disjunctions (Poggio)
Convolutional neural networks (LeCun)
Hierarchical generative models (Amit; Torralba; Perona; etc.)
Hierarchical testing
SLIDE 45
Emerging Theory
“Theory of reusable parts” (S. Geman)
- Inspired by MDL and speech technology.
- Non-Markovian (“context-sensitive”) priors.
- Theoretical results on efficient representation and selectivity.
However, contextual constraints are enforced at the expense of learning and computation.
SLIDE 46
Outline
Semantic Scene Interpretation
Frameworks, Theories
Hierarchical Testing
The Efficiency of Abstraction
SLIDE 47
Hierarchical Testing
Coarse-to-fine modeling of both the interpretations and the computational process:
- Unites representation and processing.
- Concentrates processing on ambiguous areas.
- Evidence for CTF processing in neural systems.
- Scales to many categories.
SLIDE 48 Density of Work
(Panels: original image; spatial concentration of processing.)
SLIDE 49 Collaborators: Hierarchical Testing
Evgeniy Bart, IMA
Sachin Gangaputra, Inductus Corp.
Xiaodong Fan, Microsoft
François Fleuret, EPFL, Lausanne
Gilles Blanchard, Fraunhofer
Hichem Sahbi, Cambridge U.
Yali Amit
SLIDE 50
From Source Coding to Hierarchical Testing
Y: r.v. with distribution p(y), y ∈ Y
Code for p: a CTF exploration of Y:
- Can ask all questions XA of the form “Is Y ∈ A?”, A ⊂ Y
- All answers are exact.
- Y is the only source of uncertainty.
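A minimal sketch of this errorless game, assuming we may ask about any subset and simply halve the remaining hypotheses at each question; the hypothesis list is hypothetical. Roughly log2 |Y| questions identify Y, which is the source-coding connection.

```python
def identify(y_true, hypotheses):
    """Exact twenty questions: repeatedly ask 'Is Y in A?' for A = the
    first half of the remaining hypotheses. Every answer is exact, so
    the root-to-leaf path is unique and errorless."""
    remaining = list(hypotheses)
    questions = 0
    while len(remaining) > 1:
        A = remaining[:len(remaining) // 2]   # the question "Is Y in A?"
        questions += 1
        remaining = A if y_true in A else remaining[len(remaining) // 2:]
    return remaining[0], questions

print(identify("face", ["car", "cat", "face", "tree"]))  # ('face', 2)
```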
SLIDE 51
From Source Coding to Hierarchical Testing (cont)
Constrained 20 questions:
- Restrict to selected subsets A ⊂ Y
- Still, Y determines {XA} and vice versa
- Still an errorless, unique path (root to leaf)
Realizable tests:
- Make XA observable (XA = XA(I))
- Requires appearance-based shared properties among elements of Y
SLIDE 52
From Source Coding to Hierarchical Testing (cont)
Accommodate mistakes:
- Preserve P(XA = 1 | Y ∈ A) = 1
- But allow P(XA = 1 | Y ∉ A) ≠ 0; hence, only negative answers eliminate hypotheses
Generalize paths to “traces”:
- The outcome of processing is now a labeled subtree in a hierarchy of tests.
- Ŷ(I) is the union of leaves reached.
SLIDE 53 Representation of Y
Natural groupings A ⊂ Y based on shared parts or attributes.
- Ex: Shape similarities between (c,θ) and (c′,θ′) for nearby poses.
In fact, natural nested coverings or hierarchies of attributes:
Hattr = { Aξ , ξ ∈ T }
SLIDE 54
Two Attribute Hierarchies
SLIDE 55
Which Decomposition?
Another story ….
SLIDE 56 Statistical Structure
For each ξ ∈ T, consider a binary test Xξ = XAξ dedicated to H0: Y ∈ Aξ against Ha: Balt(ξ) ⊂ {Y ∉ Aξ}.
Define Htest = { Xξ , ξ ∈ T }.
Constraint: Each Xξ satisfies inv(Xξ): P(Xξ = 1 | Y ∈ Aξ) ≅ 1, where P = Pemp estimated from L.
SLIDE 57
Summary
T: nodes ξ ∈ T
Hattr: attributes Aξ , Aξ ⊂ Y
Htest: tests Xξ , Xξ: “Y ∈ Aξ?”
SLIDE 58
Statistical Structure (cont)
Explore Htest under some “strategy”.
Ŷ(I): y ∈ Y not ruled out by any (performed) test:
Ŷ = Y \ ∪ { Aξ : Xξ = 0 }
Our constraint implies Pemp(Y ∈ Ŷ(I)) ≅ 1.
Could replace Pemp by a parametric model if available.
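A minimal sketch of one such strategy, depth-first coarse-to-fine search: a node's test is performed only if every ancestor's test was positive, and Ŷ is the union of the hypotheses never ruled out. The node representation and the toy test outcomes are illustrative assumptions.

```python
def ctf_search(node, perform_test):
    """Depth-first CTF exploration of H_test.
    node: dict with keys "cell" (the set A_xi of hypotheses) and
    "children" (empty at the leaves); perform_test(node) returns the
    observable 0/1 outcome of X_xi."""
    if perform_test(node) == 0:          # negative answer: eliminate A_xi
        return set()
    if not node["children"]:             # positive leaf: its cell survives
        return set(node["cell"])
    surviving = set()                    # positive internal node: recurse
    for child in node["children"]:
        surviving |= ctf_search(child, perform_test)
    return surviving

# Toy hierarchy over Y = {1, 2, 3, 4}, as in the recursive-partition example.
leaf = lambda c: {"cell": c, "children": []}
root = {"cell": {1, 2, 3, 4}, "children": [
    {"cell": {1, 2}, "children": [leaf({1}), leaf({2})]},
    {"cell": {3, 4}, "children": [leaf({3}), leaf({4})]},
]}
# Hypothetical outcomes: only cells containing hypothesis 3 test positive.
print(ctf_search(root, lambda n: int(3 in n["cell"])))  # -> {3}
```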
SLIDE 59
Example
A recursive partitioning of Y with four levels; there is a binary test for each of the 15 cells.
(A) Positive tests are shown in black.
(B) Ŷ is the union of leaves 3 and 4.
(C) Tests performed under coarse-to-fine search.
SLIDE 60
Tests in Practice
Complex, e.g.:
- Build an SVM from training examples.
Simple, e.g.:
- VQ 16x16 patches into k types; ask two types of questions:
  - “Is there a patch in W with label k?”
  - “Are there two patches in W with labels k, l?”
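A sketch of the “simple” tests under stated assumptions: a hypothetical codebook of 16x16 prototypes defines the vector quantization, and both question types reduce to label counts over the patches in the window W.

```python
from collections import Counter
import numpy as np

def quantize(patch, codebook):
    """VQ: assign a 16x16 patch to the nearest of the k prototypes."""
    return int(np.argmin([np.linalg.norm(patch - c) for c in codebook]))

def patch_labels(window, codebook, size=16):
    """Labels of the non-overlapping 16x16 patches inside window W."""
    h, w = window.shape
    return Counter(quantize(window[i:i + size, j:j + size], codebook)
                   for i in range(0, h - size + 1, size)
                   for j in range(0, w - size + 1, size))

def has_label(window, codebook, k):
    """Test: 'Is there a patch in W with label k?'"""
    return patch_labels(window, codebook)[k] >= 1

def has_pair(window, codebook, k, l):
    """Test: 'Are there two patches in W with labels k, l?'"""
    counts = patch_labels(window, codebook)
    return counts[k] >= 1 and counts[l] >= (2 if k == l else 1)

rng = np.random.default_rng(1)
codebook = [rng.normal(size=(16, 16)) for _ in range(8)]  # hypothetical prototypes
W = rng.normal(size=(64, 64))
print(has_label(W, codebook, 0), has_pair(W, codebook, 0, 1))
```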
SLIDE 61
Example: Face Detection
Detect and localize all instances of upright, frontal faces.
Θ = {(z, σ, φ)} = {(position, scale, tilt)}
Y: subsets of Θ
Construct the hierarchy Hattr = { Aξ , ξ ∈ T } based on recursive partitions of Θ.
SLIDE 62
Face Pose Hierarchy: Upper
First split: coarse scale (8-15, 16-31, 32-63, etc.)
Second split: coarse position (disjoint 16x16 blocks B)
Without diffuse, extended attributes (e.g., color), these tests are “virtual” (always passed).
SLIDE 63
Face Pose Hierarchy: Lower
Construct Hattr from
{ (z, σ, φ): z ∈ 16x16 block, 8 ≤ σ ≤ 15, −20° ≤ φ ≤ 20° }
by splitting individual coordinates. The 64x64 subimage surrounding each B is then labeled as “face(θ)” or “background.”
SLIDE 64 Face Pose Hierarchy
{ (z, σ, φ): z ∈ 16x16 block, 8 ≤ σ ≤ 15, −20° ≤ φ ≤ 20° }
{ (z, σ, φ): z ∈ 2x2 block, 14 ≤ σ ≤ 15, 10° ≤ φ ≤ 20° }
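A sketch of building this lower hierarchy by recursively splitting individual pose coordinates; the cell representation, round-robin split order, and stopping depth are illustrative assumptions (z is collapsed to one interval for brevity).

```python
def split_cell(cell, coord):
    """Split a pose cell in half along one coordinate; a cell maps each
    coordinate name to a (low, high) interval."""
    lo, hi = cell[coord]
    mid = (lo + hi) / 2.0
    left, right = dict(cell), dict(cell)
    left[coord], right[coord] = (lo, mid), (mid, hi)
    return left, right

def build_hierarchy(cell, coords, depth=0, max_depth=6):
    """Recursively split coordinates in round-robin order, producing
    the tree of nested pose cells A_xi (i.e., H_attr)."""
    if depth == max_depth:
        return {"cell": cell, "children": []}
    left, right = split_cell(cell, coords[depth % len(coords)])
    return {"cell": cell,
            "children": [build_hierarchy(left, coords, depth + 1, max_depth),
                         build_hierarchy(right, coords, depth + 1, max_depth)]}

# Root cell from the slides: position in a 16x16 block, scale 8-15, tilt +/-20 deg.
root = build_hierarchy({"z": (0.0, 16.0), "scale": (8.0, 15.0), "tilt": (-20.0, 20.0)},
                       coords=["z", "scale", "tilt"])

def count_leaves(node):
    return 1 if not node["children"] else sum(count_leaves(c) for c in node["children"])
print(count_leaves(root))  # -> 64 fine pose cells
```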
SLIDE 65 Detection Algorithm
Loop over resolution.
Loop over location (non-overlapping 16x16 blocks).
Process the “reference” (lower) hierarchy.
Collect chains of positive responses.
(Scales 8 to 16, 16 to 32, 32 to 64.)
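A sketch of these loops under the same assumptions, reusing ctf_search from the earlier sketch; the downsampling and the block-level test are crude placeholders.

```python
def detect_faces(image, pose_hierarchy, perform_test, ctf_search):
    """image: 2-D grey-level array; perform_test(coarse, block, node)
    evaluates a node's binary test on a 16x16 block (assumed given);
    ctf_search is the recursive routine sketched earlier."""
    detections = []
    for level in range(3):                          # scales 8-16, 16-32, 32-64
        coarse = image[::2 ** level, ::2 ** level]  # crude downsampling placeholder
        h, w = coarse.shape
        for i in range(0, h - 15, 16):              # non-overlapping 16x16 blocks
            for j in range(0, w - 15, 16):
                test = lambda node, ij=(i, j): perform_test(coarse, ij, node)
                surviving = ctf_search(pose_hierarchy, test)
                if surviving:                       # a chain of positive responses
                    detections.append((level, (i, j), surviving))
    return detections
```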
SLIDE 66 Examples
(ROC curves comparing a learned hierarchy and a manual hierarchy; horizontal axis: false positives per image.)
SLIDE 67
Results
SLIDE 68
Face Tracking
SLIDE 69
Reading License Plates
37 classes.
Synthesized training shapes from class prototypes.
Nearly 99% classification rate per symbol on 380 plates.
SLIDE 70
Outline
Semantic Scene Interpretation
Frameworks, Theories
Hierarchical Testing
The Efficiency of Abstraction
SLIDE 71 Efficient Computation
Strategy S: any sequential, adaptive exploration of some or all of the tests.
c(Xξ): cost of Xξ ; sel(Xξ) = P(Xξ = 0 | Balt(Aξ))
Cost of testing:
Ctest(S, I) = Σ c(Xξ), summed over the tests performed by S on I.
Total computation:
E[ Ctest(S) + c*(Ŷ(S)) ], where c*(Ŷ(S)) is the cost of post-processing the surviving interpretations.
SLIDE 72
Assumptions
(i) “Background” is statistically dominant: P(Y = 0) >> P(Y ≠ 0). (Valid for sufficiently fine cells A, e.g., in the third level of the face hierarchy.)
(ii) Total computation is driven by P(·|Y = 0).
(iii) The tests { Xξ , ξ ∈ T } are conditionally independent under this probability.
SLIDE 73 CTF Optimality Criterion
THEOREM (G. Blanchard / DG): CTF is optimal if
c(Xξ) ≤ Σ_{η ∈ C(ξ)} (sel(Xξ) / sel(Xη)) c(Xη),  ∀ ξ ∈ T,
where C(ξ) = direct children of ξ in T.
A numerical example:
c(X1) = c(X2) = c(X3); sel(X1) = 1/2, sel(X2) = sel(X3) = 9/10. Do X1 first!
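One way to check the example, under assumptions (i)-(iii): compare the expected computation of the two orders directly, using the background law and conditional independence. The arithmetic below uses only the slide's numbers.

```python
# Expected computation under P(.|Y=0), tests conditionally independent.
c1 = c2 = c3 = 1.0                 # equal costs c(X1) = c(X2) = c(X3)
sel1, sel2, sel3 = 0.5, 0.9, 0.9   # sel(X) = P(X = 0 | background)

# Coarse-to-fine: perform X1; only if it is positive, perform X2 and X3.
cost_ctf = c1 + (1 - sel1) * (c2 + c3)

# Fine-to-coarse: perform X2 and X3; X1 is needed only if some child
# test is positive (otherwise both leaves are already eliminated).
cost_ftc = c2 + c3 + (1 - sel2 * sel3) * c1

print(cost_ctf, cost_ftc)  # 2.0 vs ~2.19: do X1 first
```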
SLIDE 74
Back to Twenty Questions
Optimal strategies are what we intuitively expect from playing 20Q:
A steady progression from high scope coupled with relatively poor selectivity to high selectivity coupled with dedication to specific interpretations.
SLIDE 75 “Right” Alternative Hypothesis
Due to CTF search, Xξ is performed ⇔ all ancestor tests are performed and are positive. Hence
Balt(ξ) = { Y ∉ Aξ } ∩ { Xη = 1, ∀ η ∈ A(ξ) }
where A(ξ) = ancestors of node ξ in T.
SLIDE 76
Efficient Learning: Trace Model
Encodes the computational history of CTF search.
T: tree underlying the hierarchy S(I) : subtree of T determined by CTF search on image I Z(I) = { Xξ , ξ ∈ S(I) } , the “trace”, a labeled subtree.
SLIDE 77 Trace Representation
(Panels: tree hierarchy; realization of all tests; subtree from CTF search; trace.)
SLIDE 78 Trace Representation (cont)
Top: a 3-node hierarchy and its n(T) = 5 possible traces (vs. 2³ = 8 realizations of the tests).
Bottom: a 7-node hierarchy and 5 of its n(T) = 26 possible traces (vs. 2⁷ = 128).
SLIDE 79 Trace Distributions
THEOREM: Let { pξ , ξ ∈ T } be any set of numbers with 0 ≤ pξ ≤ 1. Then
P(z) = Π_{ξ ∈ Sz} pξ(xξ)
defines a probability distribution on traces, where Sz is the subtree identified with z and pξ(1) = pξ , pξ(0) = 1 − pξ. In the trace model,
pξ(x) = P(Xξ = x | Xη = 1, ∀ η ∈ A(ξ)).
SLIDE 80
Learning
Needn’t model P(Xξ , ξ ∈ T); only need one parameter per node.
After CTF search, a trace-based likelihood ratio test.
Interpretation: exact Bayesian inference with the trace as a (tree-structured) sufficient statistic:
P(I|Y) = P(tr(I)|Y) / #{tr(I)}
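A minimal sketch of the trace likelihood and its ratio test, with hypothetical per-node parameters pξ learned separately for “face” and “background”; the trace and node identifiers are placeholders.

```python
import math

def log_trace_prob(trace, p):
    """log P(z) = sum over nodes xi in S_z of log p_xi(x_xi), with
    p_xi(1) = p[xi] and p_xi(0) = 1 - p[xi].
    trace: dict node id -> observed outcome in {0, 1}, for exactly
    the nodes visited by CTF search."""
    return sum(math.log(p[xi] if x == 1 else 1.0 - p[xi])
               for xi, x in trace.items())

def trace_lrt(trace, p_face, p_bg, threshold=0.0):
    """Accept 'face' if the trace-based log likelihood ratio exceeds
    the threshold (one learned parameter per node and hypothesis)."""
    return log_trace_prob(trace, p_face) - log_trace_prob(trace, p_bg) > threshold

# Hypothetical parameters (one number per node) and a CTF trace:
p_face = {"root": 0.99, "left": 0.95, "right": 0.90}
p_bg   = {"root": 0.50, "left": 0.10, "right": 0.10}
z = {"root": 1, "left": 1, "right": 0}
print(trace_lrt(z, p_face, p_bg))  # -> True: the detection is kept
```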
SLIDE 81 Selectivity of the LRT
Top: raw results of pure detection.
Bottom: false positives are eliminated with the trace model.
SLIDE 82 Pruning Detections
Detection rate vs. false positives on the MIT+CMU test set. Ex: 0.77 FPs/image at 89.1% detection with |L| = 400.
SLIDE 83
Conclusions: Hierarchical Testing
Can be seen as an adaptation of source coding to vision.
Hardwires invariance and computational efficiency.
Limited by an impoverished contextual analysis.
Resolve particular, data-dependent ambiguities by tests constructed on-line.
SLIDE 84
Conclusions: General
A dramatic “ROC gap” with natural vision. Also, no Shannon yet.
Ambitious proposals center on hierarchical structures.
None is simultaneously efficient and contextual.