SLIDE 1 In Search of a Unifying Theory for Image Interpretation
Donald Geman
Department of Applied Mathematics and Statistics and Center for Imaging Science, Whitaker Institute, Johns Hopkins University
SLIDE 2
Outline
Semantic Scene Interpretation
Frameworks, Theories
Hierarchical Testing
The Efficiency of Abstraction
SLIDE 3
Orientation within Imaging
Sensors to Images
Images to Images
Images to Words
SLIDE 4
Images to Words
Computational vision remains a major challenge (and natural vision a mystery).
Assume one grey-level image.
Generally massive local ambiguity, but less so globally, e.g., at the semantic level of keywords.
SLIDE 5
Tasks
Object identification (find my car) and categorization (find all cars)
Recognition of multiple objects, activities, contexts, etc.
Ideally, a description machine from images to rich scene interpretations.
SLIDE 6
Scene
Slide credit: Li Fei-Fei
SLIDE 7
Is that a picture of Mao?
SLIDE 8
Are there cars?
SLIDE 9 Multiple Object Categorization
(Image labels: sky, building, flag, wall, face, banner, street lamp, bus, bus, cars)
SLIDE 10
Scene Categorization
SLIDE 11
Confounding Factors
Clutter
Invariance to viewpoint and photometric variation
Invariance vs. selectivity
SLIDE 12
Clutter
SLIDE 13 Clutter
Klimt, 1913
SLIDE 14 Viewpoint Variation
Michelangelo 1475-1564
Slide credit: Li Fei-Fei
SLIDE 15
Lighting Variation
Slide credit: Shimon Ullman
SLIDE 16 Occlusion
Magritte, 1957
SLIDE 17 Occlusion
Xu Beihong, 1943
SLIDE 18
Within-Class Variation
SLIDE 19
Within-Class Variation
SLIDE 20
How Many Samples are Needed?
SLIDE 21
Where Things Stand
Reasonable performance for several classes of semi-rigid objects.
Even for face detection, a large “ROC gap” with human vision.
Full scene parsing is currently beyond reach.
SLIDE 22
Where Are the Faces?
SLIDE 23 The ROC Gap: Face Detection
Current computer vision: approximately one hallucination per scene at ninety percent detection.
SLIDE 25
Francisco’s Kitchen
SLIDE 26
Notation
I: greyscale image
Y: distinguished descriptions of I. Ex: strings of (class, pose) pairs
Y ∈ {0} ∪ Y: hidden r.v.
Ŷ(I): estimated description(s) from Y
L = {(I, Y)}: finite training set, in theory i.i.d. under P(I, Y)
SLIDE 27
Description Machine Specs
DESIGN and LEARNING: An explicit set of instructions for building Ŷ from L.
COMPUTATION: An explicit set of instructions for evaluating Ŷ(I) with as little computation as possible.
ANALYSIS: A “supporting theory” which guides construction and predicts performance.
SLIDE 28 Ground Truth
For Y sufficiently restricted, it is reasonable to assume a “true interpretation” of I:
- Y = {face}, Y = {indoor, outdoor}, …
- More generally, Y = {(c1,θ1),…,(ck,θk)}, limited to specific categories and rough poses.
Corresponds to Pemp(Y|I) = δf(I)(Y), where Pemp(I,Y) is the empirical distribution over a gigantic sample (I1,Y1), (I2,Y2), …
SLIDE 29
Outline
Semantic Scene Interpretation
Frameworks, Theories
Hierarchical Testing
The Efficiency of Abstraction
SLIDE 30
Deceased Frameworks
Traditional “AI” (60s, 70s)
Stepwise, bottom-up 3D metric reconstruction (80s)
Algebraic, geometric invariants (90s)
… but who knows
SLIDE 31
Living Frameworks
Generative modeling
Discriminative learning
Information-theoretic
SLIDE 32
Generative Modeling
Not all observations and explanations are equally likely. Construct P(I, Y) from:
- A distribution P(Y) on interpretations.
- A data model P(I|Y).
Inference principle:
Ŷ(I) = arg maxY P(Y|I) = arg minY {−log P(I|Y) − log P(Y)}
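To make the inference principle concrete, here is a minimal sketch in Python. The candidate set, log_prior, and log_likelihood are hypothetical stand-ins for a real generative model (templates, HMMs, etc.); the point is only the arg-min over −log P(I|Y) − log P(Y).

```python
import math

# Hypothetical toy model: a few candidate interpretations Y with a
# prior P(Y); in a real system P(I|Y) would come from deformable
# templates, HMMs, graphical models, and so on.
log_prior = {"face": math.log(0.01), "car": math.log(0.02), "background": math.log(0.97)}

def log_likelihood(image, y):
    """Placeholder data model log P(I|Y=y); assumed given."""
    scores = {"face": -120.0, "car": -150.0, "background": -140.0}
    return scores[y]

def map_interpretation(image):
    # Y_hat(I) = argmax_Y P(Y|I) = argmin_Y { -log P(I|Y) - log P(Y) }
    return min(log_prior, key=lambda y: -log_likelihood(image, y) - log_prior[y])

print(map_interpretation(None))  # -> "face" for these toy numbers
```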
SLIDE 33
Examples
Deformable templates
Hidden Markov models
Probabilities on part hierarchies
Graphical models, e.g., Bayesian networks
Gaussian models (LDA, mixtures, etc.)
SLIDE 34
Generative: Critique
In principle, a very general framework. In practice:
- Diabolically hard to model and learn P(Y).
- Intense online computation.
- P(I|Y) alone (i.e., “templates-for-everything”) lacks selectivity and requires too much computation.
SLIDE 35 Discriminative Learning
Proceed (almost) directly from data to decision boundaries. Representation and learning (see the sketch below):
- Replace I by a fixed-length feature vector X
- Quantize Y to a small number of classes
- Specify a family F of “classifiers” f(X)
- Induce f(X) directly from a training set L
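A minimal sketch of this four-step recipe, assuming a toy grey-level-histogram feature map and 1-nearest-neighbor as the classifier family F; the training “images” are random placeholders.

```python
import numpy as np

def features(image):
    """Step 1: replace I by a fixed-length feature vector X
    (placeholder: a 16-bin histogram of grey levels)."""
    hist, _ = np.histogram(image, bins=16, range=(0, 256))
    return hist / hist.sum()

# Step 2: Y quantized to a small number of classes (here 0/1).
# Step 3: family F = 1-nearest-neighbor rules.
# Step 4: induce f from the training set L by memorizing it.
def fit_1nn(train_images, train_labels):
    X = np.array([features(im) for im in train_images])
    y = np.array(train_labels)
    def f(image):
        d = np.linalg.norm(X - features(image), axis=1)
        return y[np.argmin(d)]
    return f

# Hypothetical training set L = {(I, Y)}.
rng = np.random.default_rng(0)
L_images = [rng.integers(0, 256, size=(64, 64)) for _ in range(20)]
L_labels = [i % 2 for i in range(20)]
f = fit_1nn(L_images, L_labels)
print(f(L_images[3]))  # -> 1 (a memorized training example)
```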
SLIDE 36 Examples
In effect, learn P(Y|X) (or log posterior odds ratios) directly:
- Artificial neural networks
- k-NN with smart metrics
- Decision trees
- Support vector machines (interpretation as Bayes rule via logistic regression)
- Multiple classifiers (e.g., random forests)
SLIDE 37 Learning: Critique
In principle:
- Universal learning machines which mimic natural processes and “learn” everything (e.g., invariance).
- Solid foundations in statistical learning theory (although |L| ↓ 1 is the interesting limit).
In practice, lacks a global structure to address:
- A very large number of classes (say 30,000)
- Small samples, bias vs. variance, invariance vs. selectivity.
SLIDE 38
Information-theoretic
Established connections between IT and imaging, but mostly at the “tool” level and for “low-level vision.” Two emerging frameworks:
- “Information scaling” (Zhu)
- Resource/complexity tradeoffs and “information refinement” (O’Sullivan et al.)
Both tilted towards “theory”.
SLIDE 39
SLIDE 40
An Information Theory Constellation
Slide credit: Laurent Younes
SLIDE 41
Overall Critique
Current generative and discriminative methods lack efficiency.
Problem-specific structure is absent, and hence so is a global organizing principle for vision.
Sparse theoretical support (especially for practical systems).
SLIDE 42 Hierarchical Vision
Exploit shared components among objects and interpretations.
Incorporate discriminative and generative methods as necessary.
Can yield efficient representation, learning and computation.
SLIDE 43
Simple Part Hierarchy
SLIDE 44
Examples
Compositional systems (S. Geman)
Hierarchies of fragments (Ullman)
Hierarchies of conjunctions and disjunctions (Poggio)
Convolutional neural networks (LeCun)
Hierarchical generative models (Amit; Torralba; Perona; etc.)
Hierarchical testing
SLIDE 45
Emerging Theory
“Theory of reusable parts” (S. Geman)
- Inspired by MDL and speech technology.
- Non-Markovian (“context-sensitive”) priors.
- Theoretical results on efficient representation and selectivity.
However, contextual constraints are enforced at the expense of learning and computation.
SLIDE 46
Outline
Semantic Scene Interpretation
Frameworks, Theories
Hierarchical Testing
The Efficiency of Abstraction
SLIDE 47
Hierarchical Testing
Coarse-to-fine modeling of both the interpretations and the computational process:
- Unites representation and processing.
- Concentrates processing on ambiguous areas.
- Evidence for CTF processing in neural systems.
- Scales to many categories.
SLIDE 48 Density of Work
(Panels: original image; spatial concentration of processing.)
SLIDE 49 Collaborators: Hierarchical Testing
Evgeniy Bart, IMA
Sachin Gangaputra, Inductus Corp.
Xiaodong Fan, Microsoft
François Fleuret, EPFL, Lausanne
Gilles Blanchard, Fraunhofer
Hichem Sahbi, Cambridge U.
Yali Amit
SLIDE 50
From Source Coding to Hierarchical Testing
Y: r.v. with distribution p(y), y ∈ Y
Code for p: a CTF exploration of Y:
- Can ask all questions XA of the form “Is Y ∈ A?”, A ⊂ Y
- All answers are exact.
- Y is the only source of uncertainty.
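A minimal sketch of this errorless game, assuming we may ask about any subset and simply halve the remaining hypotheses at each question; the hypothesis list is hypothetical. Roughly log2 |Y| questions identify Y, which is the source-coding connection.

```python
def identify(y_true, hypotheses):
    """Exact twenty questions: repeatedly ask 'Is Y in A?' for A = the
    first half of the remaining hypotheses. Every answer is exact, so
    the root-to-leaf path is unique and errorless."""
    remaining = list(hypotheses)
    questions = 0
    while len(remaining) > 1:
        A = remaining[:len(remaining) // 2]   # the question "Is Y in A?"
        questions += 1
        remaining = A if y_true in A else remaining[len(remaining) // 2:]
    return remaining[0], questions

print(identify("face", ["car", "cat", "face", "tree"]))  # ('face', 2)
```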
SLIDE 51
From Source Coding to Hierarchical Testing (cont)
Constrained 20 questions:
- Restrict to selected subsets A ⊂ Y
- Still, Y determines {XA} and vice versa
- Still an errorless, unique path (root to leaf)
Realizable tests:
- Make XA observable (XA = XA(I))
- Requires appearance-based shared properties among elements of Y
SLIDE 52
From Source Coding to Hierarchical Testing (cont)
Accommodate mistakes:
- Preserve P(XA = 1 | Y ∈ A) = 1
- But allow P(XA = 1 | Y ∉ A) ≠ 0; hence, only negative answers eliminate hypotheses
Generalize paths to “traces”:
- The outcome of processing is now a labeled subtree in a hierarchy of tests.
- Ŷ(I) is the union of leaves reached.
SLIDE 53 Representation of Y
Natural groupings A ⊂ Y based on shared parts or attributes.
- Ex: Shape similarities between (c,θ) and (c′,θ′) for nearby poses.
In fact, natural nested coverings or hierarchies of attributes:
Hattr = { Aξ , ξ ∈ T }
SLIDE 54
Two Attribute Hierarchies
SLIDE 55
Which Decomposition?
Another story ….
SLIDE 56 Statistical Structure
For each ξ ∈ T, consider a binary test Xξ = XAξ dedicated to H0: Y ∈ Aξ against Ha: Balt(ξ) ⊂ {Y ∉ Aξ}.
Define Htest = { Xξ , ξ ∈ T }.
Constraint: Each Xξ satisfies inv(Xξ): P(Xξ = 1 | Y ∈ Aξ) ≅ 1, where P = Pemp estimated from L.
SLIDE 57
Summary
T: nodes ξ ∈ T
Hattr: attributes Aξ , Aξ ⊂ Y
Htest: tests Xξ , Xξ: “Y ∈ Aξ?”
SLIDE 58
Statistical Structure (cont)
Explore Htest under some “strategy”.
Ŷ(I): y ∈ Y not ruled out by any (performed) test:
Ŷ = Y \ ∪ { Aξ : Xξ = 0 }
Our constraint implies Pemp(Y ∈ Ŷ(I)) ≅ 1.
Could replace Pemp by a parametric model if available.
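A minimal sketch of one such strategy, depth-first coarse-to-fine search: a node's test is performed only if every ancestor's test was positive, and Ŷ is the union of the hypotheses never ruled out. The node representation and the toy test outcomes are illustrative assumptions.

```python
def ctf_search(node, perform_test):
    """Depth-first CTF exploration of H_test.
    node: dict with keys "cell" (the set A_xi of hypotheses) and
    "children" (empty at the leaves); perform_test(node) returns the
    observable 0/1 outcome of X_xi."""
    if perform_test(node) == 0:          # negative answer: eliminate A_xi
        return set()
    if not node["children"]:             # positive leaf: its cell survives
        return set(node["cell"])
    surviving = set()                    # positive internal node: recurse
    for child in node["children"]:
        surviving |= ctf_search(child, perform_test)
    return surviving

# Toy hierarchy over Y = {1, 2, 3, 4}, as in the recursive-partition example.
leaf = lambda c: {"cell": c, "children": []}
root = {"cell": {1, 2, 3, 4}, "children": [
    {"cell": {1, 2}, "children": [leaf({1}), leaf({2})]},
    {"cell": {3, 4}, "children": [leaf({3}), leaf({4})]},
]}
# Hypothetical outcomes: only cells containing hypothesis 3 test positive.
print(ctf_search(root, lambda n: int(3 in n["cell"])))  # -> {3}
```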
SLIDE 59
Example
A recursive partitioning of Y with four levels; there is a binary test for each of the 15 cells.
(A) Positive tests are shown in black.
(B) Ŷ is the union of leaves 3 and 4.
(C) Tests performed under coarse-to-fine search.
SLIDE 60
Tests in Practice
Complex, e.g.:
- Build an SVM from training examples.
Simple, e.g.:
- VQ 16x16 patches into k types; ask two types of questions:
  - “Is there a patch in W with label k?”
  - “Are there two patches in W with labels k, l?”
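A sketch of the “simple” tests under stated assumptions: a hypothetical codebook of 16x16 prototypes defines the vector quantization, and both question types reduce to label counts over the patches in the window W.

```python
from collections import Counter
import numpy as np

def quantize(patch, codebook):
    """VQ: assign a 16x16 patch to the nearest of the k prototypes."""
    return int(np.argmin([np.linalg.norm(patch - c) for c in codebook]))

def patch_labels(window, codebook, size=16):
    """Labels of the non-overlapping 16x16 patches inside window W."""
    h, w = window.shape
    return Counter(quantize(window[i:i + size, j:j + size], codebook)
                   for i in range(0, h - size + 1, size)
                   for j in range(0, w - size + 1, size))

def has_label(window, codebook, k):
    """Test: 'Is there a patch in W with label k?'"""
    return patch_labels(window, codebook)[k] >= 1

def has_pair(window, codebook, k, l):
    """Test: 'Are there two patches in W with labels k, l?'"""
    counts = patch_labels(window, codebook)
    return counts[k] >= 1 and counts[l] >= (2 if k == l else 1)

rng = np.random.default_rng(1)
codebook = [rng.normal(size=(16, 16)) for _ in range(8)]  # hypothetical prototypes
W = rng.normal(size=(64, 64))
print(has_label(W, codebook, 0), has_pair(W, codebook, 0, 1))
```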
SLIDE 61
Example: Face Detection
Detect and localize all instances of upright, frontal faces.
Θ = {(z, σ, φ)} = {(position, scale, tilt)}
Y: subsets of Θ
Construct the hierarchy Hattr = { Aξ , ξ ∈ T } based on recursive partitions of Θ.
SLIDE 62
Face Pose Hierarchy: Upper
First split: coarse scale (8-15, 16-31, 32-63, etc.)
Second split: coarse position (disjoint 16x16 blocks B)
Without diffuse, extended attributes (e.g., color), these tests are “virtual” (always passed).
SLIDE 63
Face Pose Hierarchy: Lower
Construct Hattr from
{ (z, σ, φ): z ∈ 16x16 block, 8 ≤ σ ≤ 15, −20° ≤ φ ≤ 20° }
by splitting individual coordinates. The 64x64 subimage surrounding each B is then labeled as “face(θ)” or “background.”
SLIDE 64 Face Pose Hierarchy
{ (z, σ, φ): z ∈ 16x16 block, 8 ≤ σ ≤ 15, −20° ≤ φ ≤ 20° }
{ (z, σ, φ): z ∈ 2x2 block, 14 ≤ σ ≤ 15, 10° ≤ φ ≤ 20° }
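A sketch of building this lower hierarchy by recursively splitting individual pose coordinates; the cell representation, round-robin split order, and stopping depth are illustrative assumptions (z is collapsed to one interval for brevity).

```python
def split_cell(cell, coord):
    """Split a pose cell in half along one coordinate; a cell maps each
    coordinate name to a (low, high) interval."""
    lo, hi = cell[coord]
    mid = (lo + hi) / 2.0
    left, right = dict(cell), dict(cell)
    left[coord], right[coord] = (lo, mid), (mid, hi)
    return left, right

def build_hierarchy(cell, coords, depth=0, max_depth=6):
    """Recursively split coordinates in round-robin order, producing
    the tree of nested pose cells A_xi (i.e., H_attr)."""
    if depth == max_depth:
        return {"cell": cell, "children": []}
    left, right = split_cell(cell, coords[depth % len(coords)])
    return {"cell": cell,
            "children": [build_hierarchy(left, coords, depth + 1, max_depth),
                         build_hierarchy(right, coords, depth + 1, max_depth)]}

# Root cell from the slides: position in a 16x16 block, scale 8-15, tilt +/-20 deg.
root = build_hierarchy({"z": (0.0, 16.0), "scale": (8.0, 15.0), "tilt": (-20.0, 20.0)},
                       coords=["z", "scale", "tilt"])

def count_leaves(node):
    return 1 if not node["children"] else sum(count_leaves(c) for c in node["children"])
print(count_leaves(root))  # -> 64 fine pose cells
```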
SLIDE 65 Detection Algorithm
Loop over resolution.
Loop over location (non-overlapping 16x16 blocks).
Process the “reference” (lower) hierarchy.
Collect chains of positive responses.
(Scales 8 to 16, 16 to 32, 32 to 64.)
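A sketch of these loops under the same assumptions, reusing ctf_search from the earlier sketch; the downsampling and the block-level test are crude placeholders.

```python
def detect_faces(image, pose_hierarchy, perform_test, ctf_search):
    """image: 2-D grey-level array; perform_test(coarse, block, node)
    evaluates a node's binary test on a 16x16 block (assumed given);
    ctf_search is the recursive routine sketched earlier."""
    detections = []
    for level in range(3):                          # scales 8-16, 16-32, 32-64
        coarse = image[::2 ** level, ::2 ** level]  # crude downsampling placeholder
        h, w = coarse.shape
        for i in range(0, h - 15, 16):              # non-overlapping 16x16 blocks
            for j in range(0, w - 15, 16):
                test = lambda node, ij=(i, j): perform_test(coarse, ij, node)
                surviving = ctf_search(pose_hierarchy, test)
                if surviving:                       # a chain of positive responses
                    detections.append((level, (i, j), surviving))
    return detections
```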
SLIDE 66 Examples
(ROC curves comparing a learned hierarchy and a manual hierarchy; horizontal axis: false positives per image.)
SLIDE 67
Results
SLIDE 68
Face Tracking
SLIDE 69
Reading License Plates
37 classes.
Synthesized training shapes from class prototypes.
Nearly 99% classification rate per symbol on 380 plates.
SLIDE 70
Outline
Semantic Scene Interpretation
Frameworks, Theories
Hierarchical Testing
The Efficiency of Abstraction
SLIDE 71 Efficient Computation
Strategy S: any sequential, adaptive exploration of some or all of the tests.
c(Xξ): cost of Xξ ; sel(Xξ) = P(Xξ = 0 | Balt(Aξ))
Cost of testing:
Ctest(S, I) = Σ c(Xξ), summed over the tests performed by S on I.
Total computation:
E[ Ctest(S) + c*(Ŷ(S)) ], where c*(Ŷ(S)) is the cost of post-processing the surviving interpretations.
SLIDE 72
Assumptions
(i) “Background” is statistically dominant: P(Y = 0) >> P(Y ≠ 0). (Valid for sufficiently fine cells A, e.g., in the third level of the face hierarchy.)
(ii) Total computation is driven by P(·|Y = 0).
(iii) The tests { Xξ , ξ ∈ T } are conditionally independent under this probability.
SLIDE 73 CTF Optimality Criterion
THEOREM (G. Blanchard / DG): CTF is optimal if
c(Xξ) ≤ Σ_{η ∈ C(ξ)} (sel(Xξ) / sel(Xη)) c(Xη),  ∀ ξ ∈ T,
where C(ξ) = direct children of ξ in T.
A numerical example:
c(X1) = c(X2) = c(X3); sel(X1) = 1/2, sel(X2) = sel(X3) = 9/10. Do X1 first!
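One way to check the example, under assumptions (i)-(iii): compare the expected computation of the two orders directly, using the background law and conditional independence. The arithmetic below uses only the slide's numbers.

```python
# Expected computation under P(.|Y=0), tests conditionally independent.
c1 = c2 = c3 = 1.0                 # equal costs c(X1) = c(X2) = c(X3)
sel1, sel2, sel3 = 0.5, 0.9, 0.9   # sel(X) = P(X = 0 | background)

# Coarse-to-fine: perform X1; only if it is positive, perform X2 and X3.
cost_ctf = c1 + (1 - sel1) * (c2 + c3)

# Fine-to-coarse: perform X2 and X3; X1 is needed only if some child
# test is positive (otherwise both leaves are already eliminated).
cost_ftc = c2 + c3 + (1 - sel2 * sel3) * c1

print(cost_ctf, cost_ftc)  # 2.0 vs ~2.19: do X1 first
```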
SLIDE 74
Back to Twenty Questions
Optimal strategies are what we intuitively expect from playing 20Q:
A steady progression from high scope coupled with relatively poor selectivity to high selectivity coupled with dedication to specific interpretations.
SLIDE 75 “Right” Alternative Hypothesis
Due to CTF search, Xξ is performed ⇔ all ancestor tests are performed and are positive. Hence
Balt(ξ) = { Y ∉ Aξ } ∩ { Xη = 1, ∀ η ∈ A(ξ) }
where A(ξ) = ancestors of node ξ in T.
SLIDE 76
Efficient Learning: Trace Model
Encodes the computational history of CTF search.
T: tree underlying the hierarchy S(I) : subtree of T determined by CTF search on image I Z(I) = { Xξ , ξ ∈ S(I) } , the “trace”, a labeled subtree.
SLIDE 77 Trace Representation
(Panels: tree hierarchy; realization of all tests; subtree from CTF search; trace.)
SLIDE 78 Trace Representation (cont)
Top: a 3-node hierarchy and its n(T) = 5 possible traces (vs. 2³ = 8 realizations of the tests).
Bottom: a 7-node hierarchy and 5 of its n(T) = 26 possible traces (vs. 2⁷ = 128).
SLIDE 79 Trace Distributions
THEOREM: Let { pξ , ξ ∈ T } be any set of numbers with 0 ≤ pξ ≤ 1. Then
P(z) = Π_{ξ ∈ Sz} pξ(xξ)
defines a probability distribution on traces, where Sz is the subtree identified with z and pξ(1) = pξ , pξ(0) = 1 − pξ. In the trace model,
pξ(x) = P(Xξ = x | Xη = 1, ∀ η ∈ A(ξ)).
SLIDE 80
Learning
Needn’t model P(Xξ , ξ ∈ T); only need one parameter per node.
After CTF search, a trace-based likelihood ratio test.
Interpretation: exact Bayesian inference with the trace as a (tree-structured) sufficient statistic:
P(I|Y) = P(tr(I)|Y) / #{tr(I)}
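A minimal sketch of the trace likelihood and its ratio test, with hypothetical per-node parameters pξ learned separately for “face” and “background”; the trace and node identifiers are placeholders.

```python
import math

def log_trace_prob(trace, p):
    """log P(z) = sum over nodes xi in S_z of log p_xi(x_xi), with
    p_xi(1) = p[xi] and p_xi(0) = 1 - p[xi].
    trace: dict node id -> observed outcome in {0, 1}, for exactly
    the nodes visited by CTF search."""
    return sum(math.log(p[xi] if x == 1 else 1.0 - p[xi])
               for xi, x in trace.items())

def trace_lrt(trace, p_face, p_bg, threshold=0.0):
    """Accept 'face' if the trace-based log likelihood ratio exceeds
    the threshold (one learned parameter per node and hypothesis)."""
    return log_trace_prob(trace, p_face) - log_trace_prob(trace, p_bg) > threshold

# Hypothetical parameters (one number per node) and a CTF trace:
p_face = {"root": 0.99, "left": 0.95, "right": 0.90}
p_bg   = {"root": 0.50, "left": 0.10, "right": 0.10}
z = {"root": 1, "left": 1, "right": 0}
print(trace_lrt(z, p_face, p_bg))  # -> True: the detection is kept
```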
SLIDE 81 Selectivity of the LRT
Top: raw results of pure detection.
Bottom: false positives are eliminated with the trace model.
SLIDE 82 Pruning Detections
Detection rate vs. false positives on the MIT+CMU test set. Ex: 0.77 FPs/image at 89.1% detection with |L| = 400.
SLIDE 83
Conclusions: Hierarchical Testing
Can be seen as an adaptation of source coding to vision.
Hardwires invariance and computational efficiency.
Limited by an impoverished contextual analysis.
Resolve particular, data-dependent ambiguities by tests constructed on-line.
SLIDE 84
Conclusions: General
A dramatic “ROC gap” with natural vision. Also, no Shannon yet.
Ambitious proposals center on hierarchical structures.
None is simultaneously efficient and contextual.