Statistical Filtering and Control for AI and Robotics
Exploration and information gathering
Alessandro Farinelli
Outline
• POMDPs
  – The POMDP model
  – Finite world POMDP algorithm
  – Point based value iteration
• Exploration
  – Information gain
  – Exploration in occupancy grid maps
  – Extension to MRS
• Acknowledgment: material based on
  – Thrun, Burgard, Fox; Probabilistic Robotics
POMDPs
• In POMDPs we apply the same idea as in MDPs.
• Since the state is not observable, the agent has to make its decisions based on the belief state, which is a posterior distribution over states.
• Let b be the belief of the agent about the state under consideration.
• POMDPs compute a value function over belief space:

  V_T(b) = max_u [ r(b, u) + ∫ V_{T-1}(b') p(b' | u, b) db' ]
Problems
• Each belief is a probability distribution; thus, each value in a POMDP is a function of an entire probability distribution.
• This is problematic, since probability distributions are continuous.
• Additionally, we have to deal with the huge complexity of belief spaces.
• For finite worlds with finite state, action, and measurement spaces and finite horizons, however, we can effectively represent the value functions by piecewise linear functions.
  – This is possible because expectation is a linear operator.
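As a minimal sketch of this representation (the helper names and the use of numpy are illustrative, not from the slides): each linear piece is an "alpha vector" holding the value of a plan at the corner states, and a belief is scored by taking the maximum inner product over the pieces.

```python
import numpy as np

# Each alpha vector stores the value of a plan at the corner states
# (x1, x2); a belief b = (p1, 1 - p1) is scored by the inner product,
# and V(b) is the max over the finite set of vectors.
alphas = np.array([
    [-100.0, 100.0],   # payoff line of action u1 (numbers from the example below)
    [ 100.0,  -50.0],  # payoff line of action u2
])

def V(p1, alphas):
    """Piecewise-linear, convex value: max over linear functions of the belief."""
    b = np.array([p1, 1.0 - p1])
    return float(np.max(alphas @ b))

print(V(0.3, alphas))  # 40.0: with the belief mostly in x2, the u1 line dominates
```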
Example
• Two states x1, x2; actions u1, u2 are terminal, while u3 switches the state:
  – p(x2' | x1, u3) = 0.8, p(x1' | x1, u3) = 0.2
  – p(x1' | x2, u3) = 0.8, p(x2' | x2, u3) = 0.2
• Measurements z1, z2:
  – p(z1 | x1) = 0.7, p(z2 | x1) = 0.3
  – p(z1 | x2) = 0.3, p(z2 | x2) = 0.7
• Payoffs of the terminal actions:
  – r(x1, u1) = -100, r(x2, u1) = +100
  – r(x1, u2) = +100, r(x2, u2) = -50
Discussion on the example
• The two states have different optimal actions:
  – u2 in x1 and u1 in x2
• Action u3 is non-deterministic: it flips the state and acquires knowledge at a small cost
  – z1 increases confidence of being in x1
  – z2 increases confidence of being in x2
  – cost is -1 (see later)
• Two states: the belief is summarized by p1 = p(x1), with p(x2) = 1 - p1, so the belief space is the interval [0, 1]
Payoff in POMDPs
• In MDPs, the payoff (or reward) depends on the state of the system.
• In POMDPs the true state is not exactly known.
• Therefore, we compute the expected payoff by integrating over all states:

  r(b, u) = E_x[ r(x, u) ] = ∫ r(x', u) p(x') dx' = p1 r(x1, u) + p2 r(x2, u)
Payoffs in the example I
• If we are in x1 and execute u1 we receive -100
• If we are in x2 and execute u1 we receive +100
• When we are not certain of the state, we get a linear combination weighted by the probabilities:

  r(b, u1) = -100 p1 + 100 (1 - p1)
  r(b, u2) =  100 p1 -  50 (1 - p1)
  r(b, u3) = -1
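A small sketch of this computation with the example's numbers (the dictionary layout and function name are illustrative):

```python
# Expected payoff r(b, u) = p1 * r(x1, u) + (1 - p1) * r(x2, u)
# with the payoffs of the two-state example.
r = {
    "u1": (-100.0, 100.0),
    "u2": ( 100.0,  -50.0),
    "u3": (  -1.0,   -1.0),  # small cost of the information-gathering action
}

def expected_payoff(p1, u):
    r_x1, r_x2 = r[u]
    return p1 * r_x1 + (1.0 - p1) * r_x2

for u in r:
    print(u, expected_payoff(0.5, u))  # u1: 0.0, u2: 25.0, u3: -1.0
```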
Payoffs in the example II
(Figure: the payoff lines r(b, u1), r(b, u2), r(b, u3) plotted as functions of p1.)
The resulting policy for T=1
• Finite POMDP with T=1: use V1(b) to determine the optimal policy
  – Choose the best next action among u1, u2, u3
• In our example, the optimal policy for T=1 is

  π1(b) = u1 if p1 ≤ 3/7
          u2 if p1 > 3/7

• This is the upper thick graph in the diagram.
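The 3/7 threshold is just the intersection of the two payoff lines; a one-line check (sketch):

```python
# Solve -100 p1 + 100 (1 - p1) = 100 p1 - 50 (1 - p1):
# -200 p1 + 100 = 150 p1 - 50  ->  350 p1 = 150  ->  p1 = 3/7
from fractions import Fraction

p1_star = Fraction(150, 350)
print(p1_star)  # 3/7: choose u1 below this belief, u2 above it
```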
Piecewise linearity and convexity
• The resulting value function V1(b) is the maximum of the three payoff functions at each point:

  V1(b) = max { -100 p1 + 100 (1 - p1),
                 100 p1 -  50 (1 - p1),
                 -1 }

• It is piecewise linear and convex.
Pruning
• Only the first two components contribute.
• The third component can be pruned away from V1(b).
• Pruning is crucial for an efficient solution approach:

  V1(b) = max { -100 p1 + 100 (1 - p1),
                 100 p1 -  50 (1 - p1) }
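A simple approximate prune for the one-dimensional belief space (sketch, assuming numpy; an exact test would use a small linear program, but sampling the belief interval suffices for this example). The helper name `prune` is illustrative and is reused in later sketches.

```python
import numpy as np

def prune(alphas, n_grid=1001):
    """Keep only the linear pieces that attain the maximum somewhere on [0, 1].

    Coarse sketch: evaluate every piece on a grid of beliefs and keep
    those that win at least one grid point.
    """
    alphas = np.asarray(alphas, dtype=float)
    p1 = np.linspace(0.0, 1.0, n_grid)
    beliefs = np.stack([p1, 1.0 - p1])             # shape (2, n_grid)
    winners = np.unique(np.argmax(alphas @ beliefs, axis=0))
    return alphas[winners]

V1 = [[-100.0, 100.0], [100.0, -50.0], [-1.0, -1.0]]
print(prune(V1))  # the constant -1 piece of u3 is dominated and dropped
```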
Increasing the time horizon
• Assume the robot can make an observation before acting.
• Sensing will provide a better belief; how much better?
(Figure: V1(b).)
Sensing
• Suppose the robot perceives z1.
• Recall: p(z1 | x1) = 0.7 and p(z1 | x2) = 0.3.
• Given the observation z1 we update the belief using Bayes rule:

  p1' = p(x1 | z1) = p(z1 | x1) p(x1) / p(z1) = 0.7 p1 / p(z1)
  p2' = p(x2 | z1) = p(z1 | x2) p(x2) / p(z1) = 0.3 (1 - p1) / p(z1)
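The update is a two-line function (sketch, illustrative names):

```python
def belief_update(p1, z, p_z_given_x):
    """Bayes update of p1 = p(x1) after observing z; also returns p(z)."""
    pz_x1, pz_x2 = p_z_given_x[z]
    pz = pz_x1 * p1 + pz_x2 * (1.0 - p1)   # normalizer p(z)
    return pz_x1 * p1 / pz, pz

# Observation model of the example.
p_z_given_x = {"z1": (0.7, 0.3), "z2": (0.3, 0.7)}
p1_new, pz1 = belief_update(0.5, "z1", p_z_given_x)
print(p1_new, pz1)  # 0.7 0.5: seeing z1 shifts the belief toward x1
```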
Value function considering z1
(Figure: V1(b) is projected through the belief update b' = p(x1 | z1) to obtain V1(b | z1).)
Computing the new value function
• Suppose the robot perceives z1.
• We update the belief using Bayes rule.
• We can compute V1(b | z1) by replacing p1 with p1':

  V1(b | z1) = max { -100 · 0.7 p1 / p(z1) + 100 · 0.3 (1 - p1) / p(z1),
                      100 · 0.7 p1 / p(z1) -  50 · 0.3 (1 - p1) / p(z1) }
             = (1 / p(z1)) max { -70 p1 + 30 (1 - p1),
                                  70 p1 - 15 (1 - p1) }
Expected value after measuring
• We do not know in advance which measurement we will get next.
• We therefore compute the expectation over measurements:

  V̄1(b) = E_z[ V1(b | z) ] = Σ_{i=1,2} p(z_i) V1(b | z_i)

• Since each V1(b | z_i) carries a factor 1/p(z_i) from the Bayes normalizer, the product p(z_i) V1(b | z_i) is again piecewise linear in p1.
Expected value after measuring
• Substituting the two measurement-conditioned value functions:

  V̄1(b) = p(z1) V1(b | z1) + p(z2) V1(b | z2)
        = max { -70 p1 + 30 (1 - p1), 70 p1 - 15 (1 - p1) }
          + max { -30 p1 + 70 (1 - p1), 30 p1 - 35 (1 - p1) }
Resulting value function
• We need to consider the four possible combinations and take the max.
• As before, we can perform pruning:

  V̄1(b) = max { (-70 p1 + 30 (1 - p1)) + (-30 p1 + 70 (1 - p1)),
                 (-70 p1 + 30 (1 - p1)) + ( 30 p1 - 35 (1 - p1)),
                 ( 70 p1 - 15 (1 - p1)) + (-30 p1 + 70 (1 - p1)),
                 ( 70 p1 - 15 (1 - p1)) + ( 30 p1 - 35 (1 - p1)) }
        = max { -100 p1 + 100 (1 - p1),
                  40 p1 +  55 (1 - p1),
                 100 p1 -  50 (1 - p1) }
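The four candidate lines come from choosing one piece of V1 per observation; a sketch that enumerates them (assuming numpy; array layout is illustrative):

```python
import numpy as np
from itertools import product

# Pruned pieces of V1, as (value at x1, value at x2).
V1 = np.array([[-100.0, 100.0], [100.0, -50.0]])
# Observation likelihoods: rows z1, z2; columns x1, x2.
Z = np.array([[0.7, 0.3], [0.3, 0.7]])

# Scaling a piece componentwise by p(z|x) turns p(z) * V1(b|z) into a
# linear function of the *prior* belief; summing one chosen piece per
# observation enumerates all candidate pieces of the expected value.
candidates = [sum(V1[i] * Z[z] for z, i in enumerate(choice))
              for choice in product(range(len(V1)), repeat=len(Z))]
for c in candidates:
    print(c)  # [-100 100], [-40 -5], [40 55], [100 -50]; the second gets pruned
```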
Value function considering sensing
(Figure: the scaled value functions p(z1) V1(b | z1) and p(z2) V1(b | z2), with regions where u1 or u2 is clearly optimal and an unclear region in between.)
State transition
• We need to consider how actions affect the state.
• In our case u1 and u2 lead to final states and are deterministic.
• u3 has a non-deterministic effect on the state:

  p1' = E[ p(x1' | x, u3) ] = Σ_{i=1,2} p(x1' | x_i, u3) p_i
      = p(x1' | x1, u3) p1 + p(x1' | x2, u3) (1 - p1)
      = 0.2 p1 + 0.8 (1 - p1) = 0.8 - 0.6 p1
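The belief prediction is just a convex combination (sketch, illustrative names):

```python
def predict(p1, t_u):
    """Propagate the belief through the action's transition model:
    p1' = p(x1'|x1, u) * p1 + p(x1'|x2, u) * (1 - p1)."""
    return t_u[0] * p1 + t_u[1] * (1.0 - p1)

t_u3 = (0.2, 0.8)          # p(x1'|x1, u3), p(x1'|x2, u3) from the example
print(predict(0.5, t_u3))  # 0.5: the uniform belief is a fixed point here
print(predict(1.0, t_u3))  # 0.2 = 0.8 - 0.6 * 1
```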
State transition
(Figure: the belief mapping p1' = 0.8 - 0.6 p1 plotted against p1.)
Resulting value function after u3
• Considering the state transition we can compute V̄1(b | u3) by substituting p1' into V̄1:

  V̄1(b | u3) = max { -100 p1' + 100 (1 - p1'),
                       40 p1' +  55 (1 - p1'),
                      100 p1' -  50 (1 - p1') }
             = max {  60 p1 - 60 (1 - p1),
                      52 p1 + 43 (1 - p1),
                     -20 p1 + 70 (1 - p1) }
Value function considering u3
(Figure: V̄1(b), with its u1 / unclear / u2 regions, projected through the state transition; in V̄1(b | u3) the regions appear swapped as u2 / unclear / u1.)
Resulting value function for T=2
• The robot can execute any of the three actions u1, u2, u3.
• We need to discount the cost (-1) for u3:

  V2(b) = max { -100 p1 + 100 (1 - p1),
                 100 p1 -  50 (1 - p1),
                  59 p1 -  61 (1 - p1),
                  51 p1 +  42 (1 - p1),
                 -21 p1 +  69 (1 - p1) }
        = max { -100 p1 + 100 (1 - p1),
                 100 p1 -  50 (1 - p1),
                  51 p1 +  42 (1 - p1) }
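Putting the pieces together, a sketch of the u3 part of the T=2 backup (assuming numpy and reusing `prune` from the pruning sketch above): each piece of the post-sensing value function is backed up through u3's transition model and charged its cost, then the terminal payoff lines of u1 and u2 are added before pruning.

```python
import numpy as np

# Pruned pieces of the expected value after sensing.
Vbar1 = np.array([[-100.0, 100.0], [40.0, 55.0], [100.0, -50.0]])
T_u3 = np.array([[0.2, 0.8], [0.8, 0.2]])  # T[x, x'] = p(x'|x, u3)
r_u3 = -1.0                                # cost of the sensing action

# Back up each piece: alpha'(x) = r(x, u3) + sum_x' p(x'|x, u3) alpha(x')
V2_u3 = r_u3 + Vbar1 @ T_u3.T
# u1 and u2 are terminal, so their payoff lines enter V2 directly.
V2 = np.vstack([[-100.0, 100.0], [100.0, -50.0], V2_u3])
print(prune(V2))  # the three surviving pieces listed above
```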
Graphical representation of V2(b)
(Figure: V2(b) with the region where u1 is optimal, the region where u2 is optimal, and an unclear region in between, where the outcome of the measurement is important.)
Deep horizons and pruning
• We have now completed a full backup in belief space.
• This process can be applied recursively.
• The value functions for T=10 and T=20 are shown in the figures (omitted here).
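A sketch of the recursion (reusing `Z`, `T_u3`, `r_u3`, and `prune` from the earlier sketches): alternate the measurement backup with the action backup, pruning after each step so the number of linear pieces stays manageable.

```python
import numpy as np
from itertools import product

def backup(V, horizon):
    """Repeat the full belief-space backup for the given horizon."""
    for _ in range(horizon):
        # 1. Measurement backup: one candidate per combination of pieces.
        cands = [sum(V[i] * Z[z] for z, i in enumerate(c))
                 for c in product(range(len(V)), repeat=len(Z))]
        Vbar = prune(np.array(cands))
        # 2. Action backup for u3, plus the terminal payoff lines of u1, u2.
        V = prune(np.vstack([[-100.0, 100.0], [100.0, -50.0],
                             r_u3 + Vbar @ T_u3.T]))
    return V

V10 = backup(np.array([[-100.0, 100.0], [100.0, -50.0], [-1.0, -1.0]]), 10)
print(len(V10))  # pruning keeps the piece count small even for deep horizons
```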
Importance of pruning
(Figure: the linear pieces of V̄1(b) and V2(b) before and after pruning.)