In Search of a Unifying Theory for Image Interpretation
Donald Geman - PowerPoint PPT Presentation


SLIDE 1

In Search of a Unifying Theory for Image Interpretation

Donald Geman

Department of Applied Mathematics and Statistics and Center for Imaging Science, Whitaker Institute, Johns Hopkins University

SLIDE 2

Outline

  • Semantic Scene Interpretation
  • Frameworks, Theories
  • Hierarchical Testing
  • The Efficiency of Abstraction

SLIDE 3

Orientation within Imaging

  • Sensors to Images
  • Images to Images
  • Images to Words

SLIDE 4

Images to Words

Computational vision remains a major challenge (and natural vision a mystery). Assume one grey-level image. Generally massive local ambiguity. But less so globally, e.g., at the semantic level of keywords.

SLIDE 5

Tasks

  • Object identification (find my car) and categorization (find all cars)
  • Recognition of multiple objects, activities, contexts, etc.
  • Ideally, a description machine from images to rich scene interpretations.

SLIDE 6

Scene

Slide credit: Fei-Fei Li

SLIDE 7

Is that a picture of Mao?

SLIDE 8

Are there cars?

SLIDE 9

Multiple Object Categorization

Labels: sky, building, flag, wall, face, banner, street lamp, bus, bus, cars

SLIDE 10

Scene Categorization

SLIDE 11

Confounding Factors

  • Clutter
  • Invariance to viewpoint and photometric variation
  • Invariance vs. selectivity

SLIDE 12

Clutter

SLIDE 13

Clutter

Klimt, 1913

SLIDE 14

Viewpoint Variation

Michelangelo 1475-1564

Slide credit: Fei-Fei Li

SLIDE 15

Lighting Variation

Slide credit: Shimon Ullman

SLIDE 16

Occlusion

Magritte, 1957

SLIDE 17

Occlusion

Xu Beihong, 1943

SLIDE 18

Within-Class Variation

SLIDE 19

Within-Class Variation

SLIDE 20

How Many Samples are Needed?

SLIDE 21

Where Things Stand

Reasonable performance for several classes of semi-rigid objects. Even for face detection, a large “ROC gap” with human vision. Full scene parsing is currently beyond reach.

SLIDE 22

Where Are the Faces?

SLIDE 23

The ROC Gap: Face Detection

Current computer vision: approximately one hallucination per scene at ninety percent detection.

SLIDE 24

Bruegel, 1564

SLIDE 25

Francisco’s Kitchen

SLIDE 26

Notation

  • I: greyscale image
  • Y: distinguished descriptions of I; e.g., strings of (class, pose) pairs
  • Y ∈ {0} ∪ Y: hidden r.v.
  • Ŷ(I): estimated description(s) from Y
  • L = {(I, Y)}: finite training set, in theory i.i.d. under P(I, Y)

SLIDE 27

Description Machine Specs

  • DESIGN and LEARNING: an explicit set of instructions for building Ŷ from L.
  • COMPUTATION: an explicit set of instructions for evaluating Ŷ(I) with as little computation as possible.
  • ANALYSIS: a “supporting theory” which guides construction and predicts performance.

SLIDE 28

Ground Truth

For Y sufficiently restricted, it is reasonable to assume a “true interpretation” of I:

  • Y = {face}, Y = {indoor, outdoor}, …
  • More generally, Y = {(c1,θ1),…,(ck,θk)}, limited to specific categories and rough poses.

Corresponds to Pemp(Y|I) = δf(I)(Y), where Pemp(I,Y) is the empirical distribution over a gigantic sample (I1,Y1), (I2,Y2), …

SLIDE 29

Outline

  • Semantic Scene Interpretation
  • Frameworks, Theories
  • Hierarchical Testing
  • The Efficiency of Abstraction

SLIDE 30

Deceased Frameworks

  • Traditional “AI” (60’s, 70’s)
  • Stepwise, bottom-up 3D metric reconstruction (80’s)
  • Algebraic, geometric invariants (90’s)

… but who knows

SLIDE 31

Living Frameworks

  • Generative modeling
  • Discriminative learning
  • Information-theoretic

SLIDE 32

Generative Modeling

Not all observations and explanations are equally likely. Construct P(I,Y) from:

  • A distribution P(Y) on interpretations.
  • A data model P(I|Y).

Inference principle:

Ŷ(I) = arg max P(Y|I) = arg min {-log P(I|Y) - log P(Y)}
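
A minimal sketch of this inference principle on a toy discrete problem (the classes, prior, and one-feature Gaussian data model below are illustrative assumptions, not from the talk):

```python
import math

# Hypothetical toy problem: three interpretations, one scalar image feature.
# All numbers are illustrative, not from the talk.
prior = {"background": 0.90, "face": 0.08, "car": 0.02}   # P(Y)
mean  = {"background": 0.0,  "face": 2.0,  "car": 3.0}    # data-model means
sigma = 0.5

def neg_log_data_model(x, y):
    """-log P(I|Y) for scalar feature x, up to an additive constant."""
    return 0.5 * ((x - mean[y]) / sigma) ** 2

def map_estimate(x):
    # Y_hat(I) = arg max P(Y|I) = arg min { -log P(I|Y) - log P(Y) }
    return min(prior, key=lambda y: neg_log_data_model(x, y) - math.log(prior[y]))

print(map_estimate(1.8))   # -> 'face': the data term overcomes the background prior
```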

SLIDE 33

Examples

  • Deformable templates
  • Hidden Markov models
  • Probabilities on part hierarchies
  • Graphical models, e.g., Bayesian networks
  • Gaussian models (LDA, mixtures, etc.)

SLIDE 34

Generative: Critique

In principle, a very general framework. In practice:

  • Diabolically hard to model and learn P(Y).
  • Intense online computation.
  • P(I|Y) alone (i.e., “templates-for-everything”) lacks selectivity and requires too much computation.

SLIDE 35

Discriminative Learning

Proceed (almost) directly from data to decision boundaries. Representation and learning:

  • Replace I by a fixed-length feature vector X
  • Quantize Y to a small number of classes
  • Specify a family F of “classifiers” f(X)
  • Induce f(X) directly from a training set L
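
A minimal sketch of this pipeline, with a hand-rolled logistic regression as the family F (the feature map and the synthetic training set are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def features(image):
    """Replace I by a fixed-length feature vector X: mean, std, edge energy, bias."""
    dx = np.diff(image, axis=1)
    return np.array([image.mean(), image.std(), np.abs(dx).mean(), 1.0])

# Illustrative training set L: "face" patches are brighter and more textured.
faces = [rng.normal(0.6, 0.2, (16, 16)) for _ in range(100)]
backs = [rng.normal(0.3, 0.1, (16, 16)) for _ in range(100)]
X = np.array([features(im) for im in faces + backs])
y = np.array([1] * 100 + [0] * 100)            # Y quantized to two classes

# Induce f(X) from L: logistic regression fit by gradient ascent (family F = linear).
w = np.zeros(X.shape[1])
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))         # model of P(Y=1|X)
    w += 0.1 * X.T @ (y - p) / len(y)

f = lambda im: int(features(im) @ w > 0)       # the induced classifier f(X)
print(f(rng.normal(0.6, 0.2, (16, 16))))       # -> 1 on a face-like patch (typically)
```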

SLIDE 36

Examples

In effect, learn P(Y|X) (or log posterior odds ratios) directly:

  • Artificial neural networks
  • k-NN with smart metrics
  • Decision trees
  • Support vector machines (interpretation as Bayes rule via logistic regression)
  • Multiple classifiers (e.g., random forests)

SLIDE 37

Learning: Critique

In principle:

  • Universal learning machines which mimic natural processes and “learn” everything (e.g., invariance).
  • Solid foundations in statistical learning theory (although |L| ↓ 1 is the interesting limit).

In practice, lacks a global structure to address:

  • A very large number of classes (say 30,000)
  • Small samples, bias vs. variance, invariance vs. selectivity.

SLIDE 38

Information-theoretic

Established connections between IT and imaging, but mostly at the “tool” level and for “low-level vision.” Two emerging frameworks:

  • “Information scaling” (Zhu)
  • Resource/complexity tradeoffs and “information refinement” (O’Sullivan et al.)

Both tilted towards “theory.”

SLIDE 39
SLIDE 40

An Information Theory Constellation

Slide credit: Laurent Younes

SLIDE 41

Overall Critique

  • Current generative and discriminative methods lack efficiency.
  • Problem-specific structure is absent, and hence so is a global organizing principle for vision.
  • Sparse theoretical support (especially for practical systems).

SLIDE 42

Hierarchical Vision

  • Exploit shared components among objects and interpretations.
  • Incorporate discriminative and generative methods as necessary.
  • Can yield efficient representation, learning and computation.

SLIDE 43

Simple Part Hierarchy

SLIDE 44

Examples

  • Compositional systems (S. Geman)
  • Hierarchies of fragments (Ullman)
  • Hierarchies of conjunctions and disjunctions (Poggio)
  • Convolutional neural networks (LeCun)
  • Hierarchical generative models (Amit; Torralba; Perona; etc.)
  • Hierarchical testing

SLIDE 45

Emerging Theory

“Theory of reusable parts” (S. Geman)

Inspired by MDL and speech technology. Non-Markovian (“context sensitive”) priors. Theoretical results on efficient representation and selectivity.

However, contextual constraints are enforced at the expense of learning and computation.

SLIDE 46

Outline

  • Semantic Scene Interpretation
  • Frameworks, Theories
  • Hierarchical Testing
  • The Efficiency of Abstraction

SLIDE 47

Hierarchical Testing

Coarse-to-fine modeling of both the interpretations and the computational process:

  • Unites representation and processing.
  • Concentrates processing on ambiguous areas.
  • Evidence for CTF processing in neural systems.
  • Scales to many categories.

SLIDE 48

Density of Work

Original image; spatial concentration of processing.
SLIDE 49

Collaborators: Hierarchical Testing

  • Evgeniy Bart, IMA
  • Sachin Gangaputra, Inductus Corp.
  • Xiaodong Fan, Microsoft
  • François Fleuret, EPFL, Lausanne
  • Gilles Blanchard, Fraunhofer
  • Hichem Sahbi, Cambridge U.
  • Yali Amit, U. Chicago
SLIDE 50

From Source Coding to Hierarchical Testing

Y: r.v. with distribution p(y), y ∈ Y. A code for p is a CTF exploration of Y:

  • Can ask all questions XA of the form “Is Y ∈ A?”, A ⊂ Y
  • All answers are exact.
  • Y is the only source of uncertainty.
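
The correspondence with source coding can be made concrete: with exact answers, an optimal questioning strategy is a prefix code for p, and the expected number of questions is within one bit of the entropy H(Y). A sketch using Huffman's greedy pairing (the distribution below is illustrative):

```python
import heapq, math

p = {"y1": 0.5, "y2": 0.25, "y3": 0.15, "y4": 0.10}   # illustrative p(y)

# Huffman's greedy pairing: each merge corresponds to one question "Is Y in A?"
heap = [(prob, i, (y,)) for i, (y, prob) in enumerate(p.items())]
heapq.heapify(heap)
depth = {y: 0 for y in p}                  # questions on the path to each y
i = len(heap)
while len(heap) > 1:
    p1, _, a = heapq.heappop(heap)
    p2, _, b = heapq.heappop(heap)
    for y in a + b:
        depth[y] += 1                      # everything merged is one level deeper
    heapq.heappush(heap, (p1 + p2, i, a + b))
    i += 1

expected_questions = sum(p[y] * depth[y] for y in p)
entropy = -sum(q * math.log2(q) for q in p.values())
print(expected_questions, entropy)         # -> 1.75 vs H(Y) ~ 1.74
```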

SLIDE 51

From Source Coding to Hierarchical Testing (cont)

Constrained 20 questions:

  • Restrict to selected subsets A ⊂ Y
  • Still, Y determines {XA} and vice-versa
  • Still an errorless, unique path (root to leaf)

Realizable tests:

  • Make XA observable (XA = XA(I))
  • Requires appearance-based shared properties among elements of Y

SLIDE 52

From Source Coding to Hierarchical Testing (cont)

Accommodate mistakes:

  • Preserve P(XA = 1 | Y ∈ A) = 1
  • But allow P(XA = 1 | Y ∉ A) ≠ 0; hence, only negative answers eliminate hypotheses

Generalize paths to “traces”:

  • The outcome of processing is now a labeled subtree in a hierarchy of tests.
  • Ŷ(I) is the union of leaves reached.

SLIDE 53

Representation of Y

Natural groupings A ⊂ Y based on shared parts or attributes.

  • Ex: shape similarities between (c,θ) and (c',θ') for nearby poses.

In fact, natural nested coverings or hierarchies of attributes:

Hattr = { Aξ , ξ ∈ T }

SLIDE 54

Two Attribute Hierarchies

SLIDE 55

Which Decomposition?

Another story ….

SLIDE 56

Statistical Structure

For each ξ ∈ T, consider a binary test Xξ = XAξ dedicated to H0: Y ∈ Aξ against Ha: Balt(ξ) ⊂ {Y ∉ Aξ}.

Define Htest = { Xξ , ξ ∈ T }.

Constraint: each Xξ satisfies inv(Xξ): P(Xξ = 1 | Y ∈ Aξ) ≅ 1, where P = Pemp estimated from L.

SLIDE 57

Summary

  • T: the tree of nodes ξ ∈ T
  • Hattr: the attribute cells Aξ ⊂ Y, one per node
  • Htest: the tests Xξ: “Y ∈ Aξ?”, one per node

SLIDE 58

Statistical Structure (cont)

  • Explore Htest under some “strategy.”
  • Ŷ(I): y ∈ Y not ruled out by any (performed) test: Ŷ = Y \ ∪ {Aξ : Xξ = 0}
  • Our constraint implies Pemp(Y ∈ Ŷ(I)) ≅ 1
  • Could replace Pemp by a parametric model if available.
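
A minimal sketch of this elimination rule under depth-first CTF search (the three-node hierarchy and the threshold tests are illustrative stand-ins):

```python
# Each node xi carries a cell A_xi (a set of hypotheses) and a test X_xi(I);
# a test is performed only if every ancestor test was positive, and a
# negative answer removes the whole cell A_xi at once.

class Node:
    def __init__(self, cell, test, children=()):
        self.cell, self.test, self.children = set(cell), test, children

def ctf_estimate(root, image):
    """Y_hat(I) = Y minus the union of { A_xi : X_xi = 0 } over performed tests."""
    eliminated = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node.test(image):              # X_xi = 1: keep refining below xi
            stack.extend(node.children)
        else:                             # X_xi = 0: eliminate A_xi wholesale
            eliminated |= node.cell
    return root.cell - eliminated

# Illustrative 3-node hierarchy over Y = {1,2,3,4}; the tests are stand-ins.
leaf_a = Node({1, 2}, test=lambda I: I > 0.3)
leaf_b = Node({3, 4}, test=lambda I: I > 0.7)
root   = Node({1, 2, 3, 4}, test=lambda I: I > 0.1, children=(leaf_a, leaf_b))

print(ctf_estimate(root, 0.5))   # -> {1, 2}: leaf_b answered 0, so {3,4} is ruled out
```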

SLIDE 59

Example

A recursive partitioning of Y with four levels; there is a binary test for each of the 15 cells. (A): Positive tests are shown in black. (B): Ŷ is the union of leaves 3 and 4. (C): Tests performed under coarse-to-fine search.

SLIDE 60

Tests in Practice

Complex, e.g.:

  • Build an SVM from training examples

Simple, e.g.:

  • VQ 16x16 patches into k types; ask two types of questions:
    “Is there a patch in W with label k?”
    “Are there two patches in W with labels k, l?”
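
A sketch of the “simple” tests (the quantizer here, binning a patch's mean intensity, is an illustrative stand-in for the actual VQ):

```python
import numpy as np

def patch_labels(image, k=8, patch=16):
    """Assign each non-overlapping 16x16 patch a label in {0,...,k-1}.
    Stand-in quantizer: bin the patch's mean intensity into k levels."""
    h, w = image.shape
    labels = {}
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            m = image[i:i + patch, j:j + patch].mean()
            labels[(i, j)] = min(int(m * k), k - 1)
    return labels

def has_label(labels, window, k):
    """Test: 'Is there a patch in W with label k?'"""
    return any(lab == k for pos, lab in labels.items() if pos in window)

def has_pair(labels, window, k, l):
    """Test: 'Are there two patches in W with labels k and l?'
    (For k == l this should require two distinct patches; omitted for brevity.)"""
    return has_label(labels, window, k) and has_label(labels, window, l)

img = np.random.default_rng(0).random((64, 64))
labs = patch_labels(img)
W = {(0, 0), (0, 16), (16, 0), (16, 16)}      # a window of four patch positions
print(has_label(labs, W, 3), has_pair(labs, W, 3, 4))
```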

SLIDE 61

Example: Face Detection

  • Detect and localize all instances of upright, frontal faces.
  • Θ = {(z,σ,φ)} = {(position, scale, tilt)}
  • Y: subsets of Θ
  • Construct the hierarchy Hattr = { Aξ , ξ ∈ T } based on recursive partitions of Θ

SLIDE 62

Face Pose Hierarchy: Upper

  • First split: coarse scale (8-15, 16-31, 32-63, etc.)
  • Second split: coarse position (disjoint 16x16 blocks B)
  • These tests are “virtual” (always passed) without diffuse, extended attributes (e.g., color).

SLIDE 63

Face Pose Hierarchy: Lower

Construct Hattr from

{(z,σ,φ): z ∈ 16x16 block, 8 ≤ σ ≤ 15, -20° ≤ φ ≤ 20°}

by splitting individual coordinates. The 64x64 subimage surrounding each B is then labeled as “face(θ)” or “background.”
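
A sketch of building such a hierarchy by recursively splitting one pose coordinate at a time (position is treated as a single interval for brevity; the split order and depth are illustrative):

```python
# A pose cell is a dict of (low, high) intervals for position z, scale sigma
# and tilt phi (position collapsed to one interval here for brevity).

def split(cell, coord):
    lo, hi = cell[coord]
    mid = (lo + hi) / 2.0
    left, right = dict(cell), dict(cell)
    left[coord], right[coord] = (lo, mid), (mid, hi)
    return left, right

def build_hierarchy(cell, coords, depth=0, max_depth=6):
    """Recursively partition the pose cell, cycling through the coordinates."""
    if depth == max_depth:
        return {"cell": cell, "children": []}
    a, b = split(cell, coords[depth % len(coords)])
    return {"cell": cell,
            "children": [build_hierarchy(c, coords, depth + 1, max_depth)
                         for c in (a, b)]}

root_cell = {"z": (0, 16), "sigma": (8, 15), "phi": (-20, 20)}   # phi in degrees
H = build_hierarchy(root_cell, ["z", "sigma", "phi"])
print(H["children"][0]["cell"])   # first refinement: z restricted to (0, 8.0)
```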

SLIDE 64

Face Pose Hierarchy

Root cell: {(z,σ,φ): z ∈ 16x16 block, 8 ≤ σ ≤ 15, -20° ≤ φ ≤ 20°}
Fine cell: {(z,σ,φ): z ∈ 2x2 block, 14 ≤ σ ≤ 15, 10° ≤ φ ≤ 20°}

SLIDE 65

Detection Algorithm

  • Loop over resolution
  • Loop over location (non-overlapping 16x16 blocks)
  • Process the “reference” (lower) hierarchy
  • Collect chains of positive responses (sketched below)

Scales: 8 to 16, 16 to 32, 32 to 64
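
A runnable outline of these loops, with a single stand-in test in place of the reference hierarchy (the resampler, threshold, and planted “face” are all illustrative assumptions):

```python
import numpy as np

def downsample(image, factor):
    return image[::factor, ::factor]                 # crude stand-in resampler

def blocks(image, size=16):
    h, w = image.shape
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            yield (i, j), image[i:i + size, j:j + size]

def block_test(patch):
    """Stand-in for the block's reference hierarchy: a single coarse test."""
    return patch.mean() > 0.65                       # 'face-like' brightness

def detect(image, factors=(1, 2, 4)):                # ~ scale bands 8-16, 16-32, 32-64
    hits = []
    for f in factors:                                # loop over resolution
        im = downsample(image, f)
        for pos, patch in blocks(im):                # non-overlapping 16x16 blocks
            if block_test(patch):                    # CTF tests on this block
                hits.append((f, pos))                # a chain of positive responses
    return hits

img = np.random.default_rng(1).random((64, 64))
img[16:32, 16:32] += 0.4                             # plant a bright 'face'
print(detect(img))                                   # -> [(1, (16, 16))] (typically)
```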

SLIDE 66

Examples

[Figure: ROC curves comparing the learned hierarchy vs. the manual hierarchy; x-axis: false positives per image]

SLIDE 67

Results

SLIDE 68

Face Tracking

SLIDE 69

Reading License Plates

37 classes. Synthesized training shapes from class prototypes. Nearly 99% classification rate per symbol on 380 plates.

SLIDE 70

Outline

  • Semantic Scene Interpretation
  • Frameworks, Theories
  • Hierarchical Testing
  • The Efficiency of Abstraction

SLIDE 71

Efficient Computation

Strategy S: any sequential, adaptive exploration of some or all of the tests.

c(Xξ): cost of Xξ; sel(Xξ) = P(Xξ = 0 | Balt(Aξ))

Cost of testing: Ctest(S, I) = Σ c(Xξ), the sum over all tests Xξ performed by S on I.

Total computation: E[Ctest(S) + c*(Ŷ(S))]

SLIDE 72

Assumptions

(i) “Background” is statistically dominant: P(Y=0) >> P(Y≠0). (Valid for sufficiently fine cells A, e.g., in the third level of the face hierarchy.)
(ii) Total computation is driven by P(·|Y=0).
(iii) The tests { Xξ , ξ ∈ T } are conditionally independent under this probability.

SLIDE 73

CTF Optimality Criterion

THEOREM (G. Blanchard / DG): CTF is optimal if

c(Xξ)/sel(Xξ) ≤ Σ_{η ∈ C(ξ)} c(Xη)/sel(Xη)  for all ξ ∈ T,

where C(ξ) = direct children of ξ in T.

A numerical example (checked below): c(X1) = c(X2) = c(X3); sel(X1) = 1/2, sel(X2) = sel(X3) = 9/10. Do X1 first!
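
A quick check of the criterion on the slide's numbers, using the inequality as reconstructed above:

```python
# Criterion at node xi (as reconstructed above):
#   c(X_xi)/sel(X_xi) <= sum over children eta of c(X_eta)/sel(X_eta).
# Numbers from the slide; X2 and X3 are the children of X1.
c   = {1: 1.0, 2: 1.0, 3: 1.0}
sel = {1: 0.5, 2: 0.9, 3: 0.9}

lhs = c[1] / sel[1]                       # 2.00
rhs = c[2] / sel[2] + c[3] / sel[3]       # 2.22...
print(lhs, rhs, lhs <= rhs)               # criterion holds: perform X1 first
```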

SLIDE 74

Back to Twenty Questions

Optimal strategies are what we intuitively expect from playing 20Q: a steady progression from high scope coupled with relatively poor selectivity to high selectivity coupled with dedication to specific interpretations.

SLIDE 75

“Right” Alternative Hypothesis

Due to CTF search, Xξ is performed ⇔ all ancestor tests are performed and are positive. Hence

Balt(ξ) = {Y ∉ Aξ} ∩ {Xη = 1 ∀ η ∈ A(ξ)}

where A(ξ) = ancestors of node ξ in T.

SLIDE 76

Efficient Learning: Trace Model

Encodes the computational history of CTF search:

  • T: tree underlying the hierarchy
  • S(I): subtree of T determined by CTF search on image I
  • Z(I) = { Xξ , ξ ∈ S(I) }, the “trace,” a labeled subtree.

SLIDE 77

Trace Representation

[Panels: the tree hierarchy; a realization of all tests; the subtree from CTF search; the trace]

SLIDE 78

Trace Representation (cont)

Top: a 3-node hierarchy and its n(T) = 5 possible traces (vs. 2^3 = 8 realizations of all tests). Bottom: a 7-node hierarchy and 5 of its n(T) = 26 possible traces (vs. 2^7 = 128). The counts are checked below.
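
The counts can be checked recursively: a node's test answers 0 (and the trace ends there) or 1 (in which case each child subtree contributes independently), and a leaf contributes two traces. A sketch:

```python
from math import prod

def n_traces(children):
    """Count the possible traces of CTF search on a tree given as nested lists."""
    if not children:               # leaf: its test answers 0 or 1
        return 2
    # internal node: either its test is negative (one trace ends here), or it
    # is positive and every child subtree then contributes independently
    return 1 + prod(n_traces(c) for c in children)

three_node = [[], []]                   # a root with two leaf children
seven_node = [[[], []], [[], []]]       # a depth-3 complete binary tree
print(n_traces(three_node), n_traces(seven_node))   # -> 5 26
```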

SLIDE 79

Trace Distributions

THEOREM: Let {pξ , ξ ∈ T} be any set of numbers with 0 ≤ pξ ≤ 1. Then

P(z) = Π_{ξ ∈ Sz} pξ(xξ)

defines a probability distribution on traces, where Sz is the subtree identified with z and pξ(1) = pξ, pξ(0) = 1 - pξ.

Interpretation: pξ(x) = P(Xξ = x | Xη = 1 ∀ η ∈ A(ξ)).
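
A sketch of the trace likelihood, used as a likelihood-ratio test by evaluating it under object and background parameters (all probabilities below are illustrative):

```python
def trace_prob(trace, p):
    """P(z) = product over performed nodes xi of p_xi if x_xi = 1, else 1 - p_xi."""
    out = 1.0
    for node, x in trace.items():
        out *= p[node] if x == 1 else 1.0 - p[node]
    return out

# Illustrative trace on a 3-node hierarchy: root positive, one leaf of each sign.
z = {"root": 1, "leaf_a": 1, "leaf_b": 0}
p_face = {"root": 0.95, "leaf_a": 0.9, "leaf_b": 0.3}   # p_xi under Y = face
p_bg   = {"root": 0.30, "leaf_a": 0.2, "leaf_b": 0.2}   # p_xi under background

lr = trace_prob(z, p_face) / trace_prob(z, p_bg)
print(lr)   # likelihood ratio >> 1: keep the detection
```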

SLIDE 80

Learning

Needn’t model P(Xξ , ξ∈T); only need one parameter per node After CTF search, a trace-based likelihood ratio test. Interpretation: exact Bayesian inference with the trace as a (tree-structured) sufficient statistic: P(I|Y) = P(tr(I)|Y) / #{tr(I)}

SLIDE 81

Selectivity of the LRT

Top: raw results of pure detection. Bottom: false positives are eliminated with the trace model.

SLIDE 82

Pruning Detections

Detection rate vs. false positives on the MIT+CMU test set; e.g., 0.77 FPs/image at 89.1% detection with |L| = 400.

SLIDE 83

Conclusions: Hierarchical Testing

  • Can be seen as an adaptation of source coding to vision.
  • Hardwires invariance and computational efficiency.
  • Limited by an impoverished contextual analysis.
  • Next: resolve the particular, data-dependent ambiguities by tests constructed on-line.

SLIDE 84

Conclusions: General

  • A dramatic “ROC gap” with natural vision. Also, no Shannon yet.
  • Ambitious proposals center on hierarchical structures.
  • None is simultaneously efficient and contextual.