Statistical Filtering and Control for AI and Robotics
Exploration and information gathering
Alessandro Farinelli
Outline
• POMDPs
  – The POMDP model
  – Finite world POMDP algorithm
  – Point based value iteration
• Exploration
  – Information gain
  – Exploration in occupancy grid maps
  – Extension to MRS
• Acknowledgment: material based on
  – Thrun, Burgard, Fox; Probabilistic Robotics
POMDPs
• In POMDPs we apply the same idea as in MDPs.
• Since the state is not observable, the agent has to make its decisions based on the belief state, which is a posterior distribution over states.
• Let b be the belief of the agent about the state under consideration.
• POMDPs compute a value function over belief space:

  V_T(b) = max_u [ r(b, u) + ∫ V_{T-1}(b') p(b' | u, b) db' ]
Problems
• Each belief is a probability distribution; thus, each value in a POMDP is a function of an entire probability distribution.
• This is problematic, since probability distributions are continuous.
• Additionally, we have to deal with the huge complexity of belief spaces.
• For finite worlds with finite state, action, and measurement spaces and finite horizons, however, we can effectively represent the value functions by piecewise linear functions.
  – This is possible because expectation is a linear operator.
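As a minimal sketch of this representation (the helper names and the use of numpy are illustrative, not from the slides): each linear piece is an "alpha vector" holding the value of a plan at the corner states, and a belief is scored by taking the maximum inner product over the pieces.

```python
import numpy as np

# Each alpha vector stores the value of a plan at the corner states
# (x1, x2); a belief b = (p1, 1 - p1) is scored by the inner product,
# and V(b) is the max over the finite set of vectors.
alphas = np.array([
    [-100.0, 100.0],   # payoff line of action u1 (numbers from the example below)
    [ 100.0,  -50.0],  # payoff line of action u2
])

def V(p1, alphas):
    """Piecewise-linear, convex value: max over linear functions of the belief."""
    b = np.array([p1, 1.0 - p1])
    return float(np.max(alphas @ b))

print(V(0.3, alphas))  # 40.0: with the belief mostly in x2, the u1 line dominates
```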
Example
• Two states x1, x2; actions u1, u2 are terminal, while u3 switches the state:
  – p(x2' | x1, u3) = 0.8, p(x1' | x1, u3) = 0.2
  – p(x1' | x2, u3) = 0.8, p(x2' | x2, u3) = 0.2
• Measurements z1, z2:
  – p(z1 | x1) = 0.7, p(z2 | x1) = 0.3
  – p(z1 | x2) = 0.3, p(z2 | x2) = 0.7
• Payoffs of the terminal actions:
  – r(x1, u1) = -100, r(x2, u1) = +100
  – r(x1, u2) = +100, r(x2, u2) = -50
Discussion on the example
• The two states have different optimal actions:
  – u2 in x1 and u1 in x2
• Action u3 is non-deterministic: it flips the state and acquires knowledge at a small cost
  – z1 increases confidence of being in x1
  – z2 increases confidence of being in x2
  – cost is -1 (see later)
• Two states: the belief is summarized by p1 = p(x1), with p(x2) = 1 - p1, so the belief space is the interval [0, 1]
Payoff in POMDPs
• In MDPs, the payoff (or reward) depends on the state of the system.
• In POMDPs the true state is not exactly known.
• Therefore, we compute the expected payoff by integrating over all states:

  r(b, u) = E_x[ r(x, u) ] = ∫ r(x', u) p(x') dx' = p1 r(x1, u) + p2 r(x2, u)
Payoffs in the example I
• If we are in x1 and execute u1 we receive -100
• If we are in x2 and execute u1 we receive +100
• When we are not certain of the state, we get a linear combination weighted by the probabilities:

  r(b, u1) = -100 p1 + 100 (1 - p1)
  r(b, u2) =  100 p1 -  50 (1 - p1)
  r(b, u3) = -1
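A small sketch of this computation with the example's numbers (the dictionary layout and function name are illustrative):

```python
# Expected payoff r(b, u) = p1 * r(x1, u) + (1 - p1) * r(x2, u)
# with the payoffs of the two-state example.
r = {
    "u1": (-100.0, 100.0),
    "u2": ( 100.0,  -50.0),
    "u3": (  -1.0,   -1.0),  # small cost of the information-gathering action
}

def expected_payoff(p1, u):
    r_x1, r_x2 = r[u]
    return p1 * r_x1 + (1.0 - p1) * r_x2

for u in r:
    print(u, expected_payoff(0.5, u))  # u1: 0.0, u2: 25.0, u3: -1.0
```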
Payoffs in the example II
(Figure: the payoff lines r(b, u1), r(b, u2), r(b, u3) plotted as functions of p1.)
The resulting policy for T=1
• Finite POMDP with T=1: use V1(b) to determine the optimal policy
  – Choose the best next action among u1, u2, u3
• In our example, the optimal policy for T=1 is

  π1(b) = u1 if p1 ≤ 3/7
          u2 if p1 > 3/7

• This is the upper thick graph in the diagram.
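The 3/7 threshold is just the intersection of the two payoff lines; a one-line check (sketch):

```python
# Solve -100 p1 + 100 (1 - p1) = 100 p1 - 50 (1 - p1):
# -200 p1 + 100 = 150 p1 - 50  ->  350 p1 = 150  ->  p1 = 3/7
from fractions import Fraction

p1_star = Fraction(150, 350)
print(p1_star)  # 3/7: choose u1 below this belief, u2 above it
```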
Piecewise linearity and convexity
• The resulting value function V1(b) is the maximum of the three payoff functions at each point:

  V1(b) = max { -100 p1 + 100 (1 - p1),
                 100 p1 -  50 (1 - p1),
                 -1 }

• It is piecewise linear and convex.
Pruning
• Only the first two components contribute.
• The third component can be pruned away from V1(b).
• Pruning is crucial for an efficient solution approach:

  V1(b) = max { -100 p1 + 100 (1 - p1),
                 100 p1 -  50 (1 - p1) }
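A simple approximate prune for the one-dimensional belief space (sketch, assuming numpy; an exact test would use a small linear program, but sampling the belief interval suffices for this example). The helper name `prune` is illustrative and is reused in later sketches.

```python
import numpy as np

def prune(alphas, n_grid=1001):
    """Keep only the linear pieces that attain the maximum somewhere on [0, 1].

    Coarse sketch: evaluate every piece on a grid of beliefs and keep
    those that win at least one grid point.
    """
    alphas = np.asarray(alphas, dtype=float)
    p1 = np.linspace(0.0, 1.0, n_grid)
    beliefs = np.stack([p1, 1.0 - p1])             # shape (2, n_grid)
    winners = np.unique(np.argmax(alphas @ beliefs, axis=0))
    return alphas[winners]

V1 = [[-100.0, 100.0], [100.0, -50.0], [-1.0, -1.0]]
print(prune(V1))  # the constant -1 piece of u3 is dominated and dropped
```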
Increasing the time horizon
• Assume the robot can make an observation before acting.
• Sensing will provide a better belief; how much better?
(Figure: V1(b).)
Sensing
• Suppose the robot perceives z1.
• Recall: p(z1 | x1) = 0.7 and p(z1 | x2) = 0.3.
• Given the observation z1 we update the belief using Bayes rule:

  p1' = p(x1 | z1) = p(z1 | x1) p(x1) / p(z1) = 0.7 p1 / p(z1)
  p2' = p(x2 | z1) = p(z1 | x2) p(x2) / p(z1) = 0.3 (1 - p1) / p(z1)
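The update is a two-line function (sketch, illustrative names):

```python
def belief_update(p1, z, p_z_given_x):
    """Bayes update of p1 = p(x1) after observing z; also returns p(z)."""
    pz_x1, pz_x2 = p_z_given_x[z]
    pz = pz_x1 * p1 + pz_x2 * (1.0 - p1)   # normalizer p(z)
    return pz_x1 * p1 / pz, pz

# Observation model of the example.
p_z_given_x = {"z1": (0.7, 0.3), "z2": (0.3, 0.7)}
p1_new, pz1 = belief_update(0.5, "z1", p_z_given_x)
print(p1_new, pz1)  # 0.7 0.5: seeing z1 shifts the belief toward x1
```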
Value function considering z1
(Figure: V1(b) is projected through the belief update b' = p(x1 | z1) to obtain V1(b | z1).)
Computing the new value function
• Suppose the robot perceives z1.
• We update the belief using Bayes rule.
• We can compute V1(b | z1) by replacing p1 with p1':

  V1(b | z1) = max { -100 · 0.7 p1 / p(z1) + 100 · 0.3 (1 - p1) / p(z1),
                      100 · 0.7 p1 / p(z1) -  50 · 0.3 (1 - p1) / p(z1) }
             = (1 / p(z1)) max { -70 p1 + 30 (1 - p1),
                                  70 p1 - 15 (1 - p1) }
Expected value after measuring
• We do not know in advance which measurement we will get next.
• We therefore compute the expectation over measurements:

  V̄1(b) = E_z[ V1(b | z) ] = Σ_{i=1,2} p(z_i) V1(b | z_i)

• Since each V1(b | z_i) carries a factor 1/p(z_i) from the Bayes normalizer, the product p(z_i) V1(b | z_i) is again piecewise linear in p1.
Expected value after measuring
• Substituting the two measurement-conditioned value functions:

  V̄1(b) = p(z1) V1(b | z1) + p(z2) V1(b | z2)
        = max { -70 p1 + 30 (1 - p1), 70 p1 - 15 (1 - p1) }
          + max { -30 p1 + 70 (1 - p1), 30 p1 - 35 (1 - p1) }
Resulting value function
• We need to consider the four possible combinations and take the max.
• As before, we can perform pruning:

  V̄1(b) = max { (-70 p1 + 30 (1 - p1)) + (-30 p1 + 70 (1 - p1)),
                 (-70 p1 + 30 (1 - p1)) + ( 30 p1 - 35 (1 - p1)),
                 ( 70 p1 - 15 (1 - p1)) + (-30 p1 + 70 (1 - p1)),
                 ( 70 p1 - 15 (1 - p1)) + ( 30 p1 - 35 (1 - p1)) }
        = max { -100 p1 + 100 (1 - p1),
                  40 p1 +  55 (1 - p1),
                 100 p1 -  50 (1 - p1) }
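The four candidate lines come from choosing one piece of V1 per observation; a sketch that enumerates them (assuming numpy; array layout is illustrative):

```python
import numpy as np
from itertools import product

# Pruned pieces of V1, as (value at x1, value at x2).
V1 = np.array([[-100.0, 100.0], [100.0, -50.0]])
# Observation likelihoods: rows z1, z2; columns x1, x2.
Z = np.array([[0.7, 0.3], [0.3, 0.7]])

# Scaling a piece componentwise by p(z|x) turns p(z) * V1(b|z) into a
# linear function of the *prior* belief; summing one chosen piece per
# observation enumerates all candidate pieces of the expected value.
candidates = [sum(V1[i] * Z[z] for z, i in enumerate(choice))
              for choice in product(range(len(V1)), repeat=len(Z))]
for c in candidates:
    print(c)  # [-100 100], [-40 -5], [40 55], [100 -50]; the second gets pruned
```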
Value function considering sensing
(Figure: the scaled value functions p(z1) V1(b | z1) and p(z2) V1(b | z2), with regions where u1 or u2 is clearly optimal and an unclear region in between.)
State transition
• We need to consider how actions affect the state.
• In our case u1 and u2 lead to final states and are deterministic.
• u3 has a non-deterministic effect on the state:

  p1' = E[ p(x1' | x, u3) ] = Σ_{i=1,2} p(x1' | x_i, u3) p_i
      = p(x1' | x1, u3) p1 + p(x1' | x2, u3) (1 - p1)
      = 0.2 p1 + 0.8 (1 - p1) = 0.8 - 0.6 p1
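The belief prediction is just a convex combination (sketch, illustrative names):

```python
def predict(p1, t_u):
    """Propagate the belief through the action's transition model:
    p1' = p(x1'|x1, u) * p1 + p(x1'|x2, u) * (1 - p1)."""
    return t_u[0] * p1 + t_u[1] * (1.0 - p1)

t_u3 = (0.2, 0.8)          # p(x1'|x1, u3), p(x1'|x2, u3) from the example
print(predict(0.5, t_u3))  # 0.5: the uniform belief is a fixed point here
print(predict(1.0, t_u3))  # 0.2 = 0.8 - 0.6 * 1
```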
State transition
(Figure: the belief mapping p1' = 0.8 - 0.6 p1 plotted against p1.)
Resulting value function after u3
• Considering the state transition we can compute V̄1(b | u3) by substituting p1' into V̄1:

  V̄1(b | u3) = max { -100 p1' + 100 (1 - p1'),
                       40 p1' +  55 (1 - p1'),
                      100 p1' -  50 (1 - p1') }
             = max {  60 p1 - 60 (1 - p1),
                      52 p1 + 43 (1 - p1),
                     -20 p1 + 70 (1 - p1) }
Value function considering u3
(Figure: V̄1(b), with its u1 / unclear / u2 regions, projected through the state transition; in V̄1(b | u3) the regions appear swapped as u2 / unclear / u1.)
Resulting value function for T=2
• The robot can execute any of the three actions u1, u2, u3.
• We need to discount the cost (-1) for u3:

  V2(b) = max { -100 p1 + 100 (1 - p1),
                 100 p1 -  50 (1 - p1),
                  59 p1 -  61 (1 - p1),
                  51 p1 +  42 (1 - p1),
                 -21 p1 +  69 (1 - p1) }
        = max { -100 p1 + 100 (1 - p1),
                 100 p1 -  50 (1 - p1),
                  51 p1 +  42 (1 - p1) }
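Putting the pieces together, a sketch of the u3 part of the T=2 backup (assuming numpy and reusing `prune` from the pruning sketch above): each piece of the post-sensing value function is backed up through u3's transition model and charged its cost, then the terminal payoff lines of u1 and u2 are added before pruning.

```python
import numpy as np

# Pruned pieces of the expected value after sensing.
Vbar1 = np.array([[-100.0, 100.0], [40.0, 55.0], [100.0, -50.0]])
T_u3 = np.array([[0.2, 0.8], [0.8, 0.2]])  # T[x, x'] = p(x'|x, u3)
r_u3 = -1.0                                # cost of the sensing action

# Back up each piece: alpha'(x) = r(x, u3) + sum_x' p(x'|x, u3) alpha(x')
V2_u3 = r_u3 + Vbar1 @ T_u3.T
# u1 and u2 are terminal, so their payoff lines enter V2 directly.
V2 = np.vstack([[-100.0, 100.0], [100.0, -50.0], V2_u3])
print(prune(V2))  # the three surviving pieces listed above
```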
Graphical representation of V2(b)
(Figure: V2(b) with the region where u1 is optimal, the region where u2 is optimal, and an unclear region in between, where the outcome of the measurement is important.)
Deep horizons and pruning
• We have now completed a full backup in belief space.
• This process can be applied recursively.
• The value functions for T=10 and T=20 are shown in the figures (omitted here).
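A sketch of the recursion (reusing `Z`, `T_u3`, `r_u3`, and `prune` from the earlier sketches): alternate the measurement backup with the action backup, pruning after each step so the number of linear pieces stays manageable.

```python
import numpy as np
from itertools import product

def backup(V, horizon):
    """Repeat the full belief-space backup for the given horizon."""
    for _ in range(horizon):
        # 1. Measurement backup: one candidate per combination of pieces.
        cands = [sum(V[i] * Z[z] for z, i in enumerate(c))
                 for c in product(range(len(V)), repeat=len(Z))]
        Vbar = prune(np.array(cands))
        # 2. Action backup for u3, plus the terminal payoff lines of u1, u2.
        V = prune(np.vstack([[-100.0, 100.0], [100.0, -50.0],
                             r_u3 + Vbar @ T_u3.T]))
    return V

V10 = backup(np.array([[-100.0, 100.0], [100.0, -50.0], [-1.0, -1.0]]), 10)
print(len(V10))  # pruning keeps the piece count small even for deep horizons
```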
Importance of pruning
(Figure: the linear pieces of V̄1(b) and V2(b) before and after pruning.)