Biologically Inspired Machine Perception
Nicholas Butko, Machine Perception Lab, Winter 2010
Chapter 1: Artificial Intelligence vs. Natural Intelligence; Borrowed Intelligence vs. Owned Intelligence
Early on, Artificial Intelligence grabbed hold of my imagination and wouldn’t let go.

“The Age of Spiritual Machines”: by 2020, computers will have more transistors than brains have neurons. That won’t be sufficient for computers to be intelligent:
- Can’t write a summary of a movie
- Can’t tie shoe-laces
- Can’t recognize humor

AI is not limited by computing power, but by our understanding of “intelligence.” A revolution in that understanding is required before we can create truly cognitive machines. I wanted to be part of that revolution.
Freshman year of undergrad: volunteered in the lab of an AI professor in CSE, “labeling” eyes and mouths in thousands of images. The computer used this information to help figure out facial expressions. This is one of the most successful paradigms in AI, “Supervised Learning”:
- “Learn” about facial expressions from thousands of examples
- Use statistics, calculus, and linear algebra
“Supervised Learning” has been very successful; My own lab uses it extensively to develop sophisticated facial-expression recognizers.
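The supervised-learning recipe can be sketched in a few lines: labeled examples in, a statistical model fit by calculus (gradient descent) and linear algebra out. This is a minimal illustrative sketch, not the lab's recognizer; the "smile" features and data are invented for the example.

```python
import math

def train_logistic(examples, labels, lr=0.5, epochs=2000):
    """Fit a tiny logistic-regression 'expression recognizer' by gradient descent.

    examples: feature vectors (e.g., hypothetical mouth-corner measurements);
    labels: 1 for 'smile', 0 for 'no smile'.
    """
    n = len(examples[0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of 'smile'
            err = p - y                      # gradient of the log-loss w.r.t. z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

# Hypothetical labeled data: [mouth-corner lift, mouth width]
X = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
y = [1, 1, 0, 0]
w, b = train_logistic(X, y)
print([predict(w, b, x) for x in X])  # → [1, 1, 0, 0]
```

With enough labeled examples, the same machinery scales up to real expression recognition; the point here is only the teacher-supplies-the-answers structure.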
[Demo at end, if we have time]
Widely Applicable:
Driver Drowsiness Lie Detection Real/Fake Pain Autism Therapy Tutoring Smile Shutter Art
Different from how humans learn: Nobody points out thousands of eyes and mouths to babies to help them learn about faces.
Daniel Wolpert, “The Master Puppeteer” Crick Memorial Lecture, 2005
http://royalsociety.org/event.asp?id=3773
Artificial domains like chess have a clear, well defined structure. Natural domains like “seeing” are rife with ambiguity. Consider a simple problem like “how to look at something.”
To know “how to look somewhere”, it is helpful to know “where did I look?” From many experiences of sending signals to your eye-muscle neurons, your brain can learn the relationship between actions and consequences. Even the question “Where did I look?” is hard to answer! Lots of things could go wrong. Can we ever make explicit rules for all of them?
[Figure: a whole scene and two views of it, illustrating matching difficulties:]
- No match at all
- Same object? (Which lightpost?)
- Same object type? (Lake or cloud?)
- Same location? (Moving target)
1) Which of these is easiest for a computer program: seeing, doing your laundry, playing Sudoku, writing a book report, or laughing at funny jokes?
2) We gave four reasons that it’s tough to know where you’re looking. Can you remember them? What’s the main difficulty that unites them?
3) If you were going to use today’s state-of-the-art approaches to make an intelligent computer program that “knows how to teach,” what is the first thing you should do?
Study the “aerodynamics” of natural intelligence -- the underlying principles and objectives organizing behavior.
We want a theory that’s not just about humans: flying is not about birds and feathers. Different organisms or systems may not have access to the types of actuators and sensors that humans have, but we still want to understand and build intelligent systems.
Choose problems that will help us understand behavior in real life.
E.g. “Learning how to look somewhere.”
Trying to understand perception by studying only neurons is like trying to understand bird flight by studying only feathers: it just cannot be done. In order to understand bird flight, we have to understand aerodynamics; only then do the structure of the feathers and the different shapes of birds’ wings make sense. [David Marr, Vision, 1982]
A “Generative Model” is a tool to describe the structure of the problems organisms face. You must describe how the things you can see relate to the things you want to know. You must describe your uncertainty about how things are and how things will be. Probability theory then tells us how to make the best guess, based on everything you’ve seen before, about how the things you want to know are now and how they will be.
[Generative-model diagram: a motor command value, filtered through how the motors work, determines where the camera is looking; world appearance together with camera position determines the camera image. The model is unrolled over t = 1, 2, 3, with sensory variables ψ, motor actions a, and gaze positions {x, y}.]
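The generative model above can be written as a tiny simulation: a motor command goes through noisy "motors" to move the gaze, and the camera returns a noisy view of the world at the new gaze position. All the numbers here (a 5-location world, integer motor slip, Gaussian pixel noise) are assumptions for illustration.

```python
import random

WORLD = [0.1, 0.9, 0.3, 0.7, 0.5]   # world appearance (brightness) at 5 locations

def motors(position, command, noise=1):
    """How the motors work: intended move plus integer slip, clipped to the world."""
    moved = position + command + random.choice(range(-noise, noise + 1))
    return max(0, min(len(WORLD) - 1, moved))

def camera(position, sigma=0.05):
    """Camera image: a noisy observation of the world at the current gaze."""
    return WORLD[position] + random.gauss(0, sigma)

random.seed(0)
gaze = 2
for t in (1, 2, 3):
    gaze = motors(gaze, command=1)        # try to look one step to the right
    print(t, gaze, round(camera(gaze), 2))
```

Running the model forward generates exactly the kind of data the organism sees; inverting it (below) answers “Where am I looking?”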
A little probability theory, and a little algebra, give us all the rules for making the best guess about where we are looking.

Where am I looking?

p(τ_t | ψ_{1:t}, τ_{1:t−1}, a_{1:t}) = p(τ_t | τ_{1:t−1}, a_{1:t}) · p(ψ_t | ψ_{1:t−1}, τ_{1:t}) · p(ψ_{1:t−1} | τ_{1:t−1}) / p(ψ_{1:t} | τ_{1:t−1})

Predicted Motion: p(τ_t | τ_{1:t−1}, a_{1:t}). Match: p(ψ_t | ψ_{1:t−1}, τ_{1:t}).

g(τ_t) = −0.5 (τ_t − C^α_t K_t)ᵀ (Σ_t + Q^α)⁻¹ (τ_t − C^α_t K_t) − 0.5 Σ_{xy} (ψ^{xy}_t − λ^{xy}_{K_t})² / ((σ^{xy}_{λK_t})² + q²_λ) − 0.5 Σ_{xy} log((σ^{xy}_{λK_t})² + q²_λ)

ψ_{1:t−1} is everything you’ve seen so far; ψ_t is what you see right now.
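The "Predicted Motion times Match" recipe is a Bayesian filter, and it can be sketched on a small grid of candidate gaze positions. The slip probabilities and likelihoods below are invented for illustration; the structure (shift by the motor command, then reweight by how well each position explains the image) is the point.

```python
def normalize(p):
    s = sum(p)
    return [x / s for x in p]

def predict_motion(belief, command):
    """Predicted Motion: shift the belief by the motor command, with slippage."""
    n = len(belief)
    out = [0.0] * n
    for i, b in enumerate(belief):
        for move, prob in ((command - 1, 0.2), (command, 0.6), (command + 1, 0.2)):
            j = max(0, min(n - 1, i + move))   # clip to the edge of the world
            out[j] += b * prob
    return out

def match_update(belief, likelihood):
    """Match: weight each position by how well it explains the current image."""
    return normalize([b * l for b, l in zip(belief, likelihood)])

belief = [0.2] * 5                               # no idea where we are looking
belief = predict_motion(belief, command=1)       # we tried to move one step right
belief = match_update(belief, [0.1, 0.1, 0.9, 0.2, 0.1])  # image fits position 2 best
print(belief.index(max(belief)))                 # → 2, the best guess
```

The same two-step update, with Gaussians in place of this grid, is what the algebra above computes in closed form.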
Where you think you’re looking, based on everything you’ve seen so far.
Possible Match → OK Match → Good Match
Avoid if possible
Best Guess!
[Plot: error on desired eye-movement as a function of the number of eye-movements made while learning.]
In 1992, Duhamel et al. showed that the parietal cortex does something similar to what we just described. Just before an eye-movement, cells “remap” their visual representation to be in line with what they expect to see. This does not mean the brain is doing probability theory and algebra. It may mean the brain found a way to implement the solution probability theory and algebra give.
[Embedded article excerpt: J.-R. Duhamel, C. L. Colby, M. E. Goldberg, “The Updating of the Representation of Visual Space in Parietal Cortex by Intended Eye Movements,” Science 255, 90 (1992). Neurons in the lateral intraparietal area (LIP) of alert monkeys shift their receptive fields transiently just before a saccade: parietal cortex anticipates the retinal consequences of the intended eye movement and updates the retinal coordinates of remembered stimuli, generating a continuously accurate representation of visual space.]
1) True or false? Scientists know that intelligence definitely requires neurons.
2) Make a generative model for the question, “Is this street safe to cross?”: A) What do you want to know? B) What can you do? C) What can you see? D) How is the answer likely to change in the future?
3) What is the role of “Nurture” in learning to look? What is the role of “Nature”?
Information helps us answer a question: Who was the 17th president? Will it rain tomorrow? What am I supposed to talk about next?
We constantly gather visual information by moving our eyes!
People don’t closely examine every inch of the world. Eye-movements are tuned to optimally gather information. We turned two of the shortcuts that people use into new machine perception technologies.
1) Fast Visual Saliency 2) Digital Eye
- Maximum: the uniform distribution has the most information, because we can’t make a good guess.
- Additivity: we get as much information from two events together as from each one separately.
- Continuity: small changes in probability give small changes in information.
- Symmetry: reordering/renaming outcomes doesn’t change information.
H(X) = −∫_{−∞}^{+∞} p(x) log p(x) dx
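The axioms above can be checked numerically with the discrete version of this formula; a minimal sketch:

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

uniform = [0.25] * 4
peaked = [0.97, 0.01, 0.01, 0.01]

# Maximum: the uniform distribution carries the most information (2 bits here).
assert entropy(uniform) == 2.0
assert entropy(peaked) < entropy(uniform)

# Additivity: two independent events give the sum of their separate informations.
coin = [0.5, 0.5]
joint = [pa * pb for pa in uniform for pb in coin]
assert abs(entropy(joint) - (entropy(uniform) + entropy(coin))) < 1e-12
```

Symmetry is visible in the code itself: `entropy` only sums over probabilities, so reordering or renaming outcomes cannot change the result.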
Salient objects “pop out” of visual scenes.
Simple preprocessing step directs computational resources. Rare (improbable) image features are more salient than common (probable) ones.
Improbable events carry more information. We developed an efficient way to model the statistics of a video stream, and analyze it for salient “pop out”.
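The "improbable features pop out" idea can be sketched with running per-pixel statistics: model each pixel's history with an adaptive mean and variance, and score new frames by how improbable they are under that model. The update rule and constants below are illustrative assumptions, not the lab's actual video model.

```python
def make_model(n):
    """Running Gaussian statistics for n pixels."""
    return {"mean": [0.0] * n, "var": [1.0] * n}

def surprise(model, frame, rate=0.05):
    """Score each pixel by improbability under its history, then update the model."""
    scores = []
    for i, x in enumerate(frame):
        m, v = model["mean"][i], model["var"][i]
        scores.append((x - m) ** 2 / v)   # squared z-score: improbability
        # drift the running statistics toward the new frame
        model["mean"][i] = (1 - rate) * m + rate * x
        model["var"][i] = (1 - rate) * v + rate * (x - m) ** 2
    return scores

model = make_model(4)
for _ in range(50):                               # a static background...
    surprise(model, [0.5, 0.5, 0.5, 0.5])
scores = surprise(model, [0.5, 0.5, 5.0, 0.5])    # ...then pixel 2 changes
print(scores.index(max(scores)))                  # → 2: the rare event "pops out"
```

Because the statistics adapt online, anything that stays constant becomes unsurprising, and attention is free to go to whatever just changed.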
Offline: Video Analysis Online: Camera Control
Tracks people in pre-school:
68.04% of salience-tracking images contained people. 34.81% of playback images contained people.
Predicts Key-frames in Video Annotation:
Video sequence labeled by coders for “Change in activity.” [RED] Initial attempts at salience-based video statistics can give up to 70% signal correlation [BLUE] Can also be used to make a “virtual cameraman” to focus on areas of a scene.
Salience approaches don’t really pay attention to what they see:
- Inhibition of return: the saccade trajectory can be pre-computed from the first image, so the system is not reacting to information in the image. The image is treated as constant, and all image analysis is pre-computed.
- What is the consequence of each eye movement? This is an information-gathering model, but what information was gathered? What question were we trying to answer?
Visual popout can be useful for robots, and it seems to be important in people, but it can’t account for task-specific looking behavior. It has long been known that where people look depends on what questions they are trying to answer. [Yarbus 1967] Current studies have difficulty making quantitative claims: “Fixations are tightly linked in time to the evolution of the task. Very few irrelevant regions are fixated.” [Hayhoe & Ballard 2005]
H(S|A, O): uncertainty after I open my eyes. H(S): uncertainty with my eyes closed.
Mutual Information is “How much was my uncertainty about a question reduced by the things I do and see?” Is this street safe to cross? Don’t look: very uncertain. Look left: somewhat uncertain. Look right: not uncertain.
I(S; A, O) = ∫ p(S|A, O) log p(S|A, O) dS − ∫ p(S) log p(S) dS = H(S) − H(S|A, O)
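For the street-crossing example, I(S; A, O) = H(S) − H(S|A, O) can be computed directly from a small discrete model. The prior and the observation likelihoods below are made-up numbers chosen only to illustrate the calculation.

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Hypothetical model: S in {safe, unsafe}, O in {clear, car} seen when I look left.
p_s = [0.5, 0.5]                       # prior over S: a coin flip before looking
p_o_given_s = {"safe": [0.9, 0.1],     # a safe street usually looks clear
               "unsafe": [0.2, 0.8]}   # an unsafe street usually shows a car

h_prior = entropy(p_s)                 # H(S): uncertainty with my eyes closed

h_post = 0.0                           # H(S|O): expected uncertainty after looking
for o in (0, 1):
    joint = [p_s[i] * p_o_given_s[s][o] for i, s in enumerate(("safe", "unsafe"))]
    p_o = sum(joint)
    h_post += p_o * entropy([j / p_o for j in joint])  # Bayes rule per observation

info = h_prior - h_post                # mutual information I(S; O), in bits
print(round(info, 3))                  # → 0.397
```

Looking left buys about 0.4 of the 1 bit of uncertainty; a second, independent look (to the right) would buy most of the rest, matching the "somewhat uncertain" / "not uncertain" progression above.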
- Supervised: student/teacher model of learning; the teacher knows the right answer. Learning is judged by % correct.
- Infomax: confidence in the response to a question (information).
- Reinforcement Learning: given a reinforcement signal, learn how to act to maximize it.
In the Infomax approach, learn to gain information in order to become confident (can’t be confident without information).
[Illustration: confidence (“No”/“Yes”) over four fixations. Fixations 1-3: no face; fixation 4: face!]
A Generative Model for Visual Search
[Adapted from Najemnik & Geisler 2005]
[Figure: belief, likelihood, and state/action panels at t = 0, 1, 2, 3; target signal strength from 0.5 to 3.0; Signal vs. Signal+Noise ~ N(0,1); Infomax reward.]
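The visual-search model can be sketched as Bayesian belief updating over candidate target locations: fixating a location draws a sample from N(SIGNAL, 1) if the target is there and N(0, 1) otherwise. The 5-location grid, the signal strength, and the greedy fixation rule (look where the target currently seems likeliest) are simplifying assumptions; the full model chooses fixations by expected information gain.

```python
import math
import random

SIGNAL = 2.0   # assumed target signal strength

def gauss(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2)   # unnormalized N(mu, 1) density

def look(belief, fix, x):
    """Bayes-update the belief after fixating `fix` and sampling value x there.

    The hypothesis 'target at fix' predicts x ~ N(SIGNAL, 1); every other
    hypothesis predicts pure noise, x ~ N(0, 1).
    """
    post = [b * (gauss(x, SIGNAL) if i == fix else gauss(x, 0.0))
            for i, b in enumerate(belief)]
    s = sum(post)
    return [p / s for p in post]

random.seed(0)
target, belief = 3, [0.2] * 5
for t in range(20):
    fix = belief.index(max(belief))          # greedy stand-in for the infomax rule
    sample = random.gauss(SIGNAL if fix == target else 0.0, 1.0)
    belief = look(belief, fix, sample)
print([round(b, 2) for b in belief])         # belief concentrates on the target
```

Each fixation either confirms the current best guess or knocks it down, sending the eyes somewhere more informative; that is the search dynamic the Najemnik & Geisler model formalizes.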
960×540 video (½ Mpx). Digital Retina: 25 FPS. Viola-Jones: 1.25 FPS.
1) True or false? If you close your eyes (and ears, nose, etc.), you get no information about whether a street is safe to cross.
2) Why is it a good idea to be bad at “Where’s Waldo?”
3) In Infomax Control approaches, you reward yourself for doing things that make you more certain about the answer to a question. What keeps you from just tricking yourself into believing things with complete certainty?
It takes 2-month-old infants about 40 minutes to learn new contingencies (head moves → mobile moves). By 10 months, infants have become experts at learning new contingencies: it takes them only a few seconds to detect them.
[Movellan & Watson 1985]
Example: vocalization contingency.
- Actions: vocalize, remain quiet
- Question: are the sound statistics after my vocalization different from background?
- Goal: choose the length of the waiting period to quickly become confident in the answer to this question.
[Illustration: volume trace over time with vocalizations marked; confidence (“No”/“Yes”) growing across trials.]
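Contingency detection of this kind can be sketched as a sequential test: after each vocalization, the response volume is weighed against two hypotheses, "contingent" (louder responses) vs. "background", and evidence accumulates until confidence crosses a threshold. The means, noise level, and threshold below are assumed for illustration, not the infant-study values.

```python
import random

CONTINGENT_MEAN, CONFIDENT = 1.0, 3.0   # assumed response mean and log-odds threshold

def log_likelihood_ratio(volume):
    """log p(volume | contingent) - log p(volume | background), unit-variance Gaussians
    with means CONTINGENT_MEAN and 0."""
    return volume * CONTINGENT_MEAN - 0.5 * CONTINGENT_MEAN ** 2

def trials_to_confidence(responses):
    """Accumulate evidence until the log-odds cross the confidence threshold."""
    llr = 0.0
    for n, v in enumerate(responses, start=1):
        llr += log_likelihood_ratio(v)
        if abs(llr) >= CONFIDENT:
            return n, llr > 0          # (trials used, believed contingent?)
    return len(responses), llr > 0

random.seed(0)
contingent_world = [random.gauss(1.0, 1.0) for _ in range(100)]
silent_world = [random.gauss(0.0, 1.0) for _ in range(100)]
print(trials_to_confidence(contingent_world))   # few trials, answer True
print(trials_to_confidence(silent_world))       # few trials, answer False
```

Tuning the waiting period (and so the noise per trial) changes how many trials the threshold takes to reach, which is exactly the 2-month vs. 10-month difference modeled on the next slide.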
[Plot: minutes of interaction needed for accurate contingency detection vs. months of development; 2 months: 18 minutes, 10 months: 3.4 minutes. Butko & Movellan, ICDL 2007]
Is it possible to learn about the visual appearance of people based on contingency?
- [John Watson (1972), 2-month infants]: contingency is the driver of social development; contingency defines the concept of “caregiver.”
- Computational analysis: Is Watson’s hypothesis computationally plausible? If so, how long does it take to gather enough information to learn reliably?
- Infomax model of detecting contingencies: high reliability in real-world, real-time robotic applications.
- Movellan and Fasel (2006): Segmental Boltzmann Fields identify and locate objects in cluttered scenes from a weak training label (“A leopard is probably in this scene”).
- Use contingency to teach yourself about people: “A social being is probably in this scene.”
3700 images collected over 90 minutes of interaction:
- No experimenter intervention
- Variety of lighting and background conditions
- No post-processing of images (rectification, etc.)
Learning in the Wild
[Sample images labeled Contingency / No Contingency: 17% Face, 20% Person; 18% No Face, 4% No Person.]
- Learns what people look like with high accuracy, very quickly (6 minutes)
- Shows the preference for schematic faces shown by infants shortly after birth (40 minutes)
- Shows the preference for caregivers over other people shown by infants shortly after birth (2 days)
[Plot: 2AFC performance (Face vs. No Person), 0.5 to 1.0, as a function of the number of training images (50 to 250), for caregivers and other people.]

1) From a computational point of view, in what ways is social intelligence “special,” or fundamentally different from low-level perceptual intelligence?
2) Under the Infomax Hypothesis, how do babies learn to be good scientists, i.e., ask the right question at the right time?
3) What ultimately enabled BEV to learn what people look like, without borrowing the expertise of a human teacher?
Artificial Intelligence is an exciting field where we are constantly pushing the boundaries of imagination. Get involved in research labs early. Take lots of math classes:
Calculus, Linear Algebra, Probability, Statistics, Discrete Math, Algorithms & Data Structures
For more info: http://mplab.ucsd.edu nbutko@ucsd.edu