

SLIDE 1

Machine Learning: Symbol-based

10d

10.0 Introduction
10.1 A Framework for Symbol-based Learning
10.2 Version Space Search
10.3 The ID3 Decision Tree Induction Algorithm
10.4 Inductive Bias and Learnability
10.5 Knowledge and Learning
10.6 Unsupervised Learning
10.7 Reinforcement Learning
10.8 Epilogue and References
10.9 Exercises

Additional references for the slides: Thomas Dean, James Allen, and Yiannis Aloimonos, Artificial Intelligence: Theory and Practice, Addison Wesley, 1995, Section 5.9.

SLIDE 2

Reinforcement Learning

  • A form of learning where the agent can explore and learn through interaction with the environment.
  • The agent learns a policy, which is a mapping from states to actions. The policy tells what the best move is in a particular state.
  • It is a general methodology: planning, decision making, and search can all be viewed as some form of reinforcement learning.

SLIDE 3

Tic-tac-toe: a different approach

  • Recall the minimax approach: the agent knows its current state. It generates a two-layer search tree taking into account all the possible moves for itself and the opponent, backs up values from the leaf nodes, and takes the best move assuming that the opponent will also do so.
  • An alternative is to start playing directly against an opponent (who does not have to be perfect, but could as well be). Assume no prior knowledge or lookahead. Assign "values" to states:
    1 for a win
    0 for a loss or a draw
    0.5 for anything else

SLIDE 4

Notice that 0.5 is arbitrary; it cannot differentiate between good moves and bad moves, so the learner has no guidance initially. It engages in playing. When a game ends, if it is a win, the value 1 is propagated backwards; if it is a draw or a loss, the value 0 is propagated backwards. Eventually, earlier states will be labeled to reflect their "true" value. After several plays, the learner will have learned the best move given a state (a policy).
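
As a concrete illustration, here is one minimal way the backup could be realized in Python; the slides do not fix the exact update rule, so the incremental rule and the step size alpha below are assumptions.

    # Hypothetical sketch: after each game, nudge the value of every state the
    # learner visited toward the game's outcome (1 for a win, 0 for a loss or
    # draw). States not seen before start at the arbitrary default of 0.5.
    def backup(values, visited_states, outcome, alpha=0.1):
        for s in visited_states:
            v = values.get(s, 0.5)
            values[s] = v + alpha * (outcome - v)

Over many games the stored values come to reflect which states tend to lead to wins, which is the policy information described above.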

SLIDE 5

Issues in generalizing this approach

  • How will the state values be initialized or propagated backwards?
  • What if there is no end to the game (infinite horizon)?
  • This is an optimization problem, which suggests that it is hard. How can an optimal policy be learned?

SLIDE 6

A simple robot domain

[Figure: offices 0, 1, 2, 3 arranged in a ring, with labeled action arcs between them]

The robot is in one of the states 0, 1, 2, 3. Each one represents an office; the offices are connected in a ring. Three actions are available:
  + moves to the "next" state
  - moves to the "previous" state
  @ remains at the same state

SLIDE 7

The robot domain (cont’d)

  • The robot can observe the label of the state it is in and perform any action corresponding to an arc leading out of its current state.
  • We assume that there is a clock governing the passage of time, and that at each tick of the clock the robot has to perform an action.
  • The environment is deterministic: there is a unique state resulting from any initial state and action.
  • Each state has a reward: 10 for state 3, 0 for the others.
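
The domain is small enough to write down directly. Below is a minimal Python sketch of it; the names STATES, ACTIONS, f, and R are chosen here for illustration.

    # Four offices in a ring: '+' moves to the next office, '-' to the
    # previous one, '@' stays put. The reward is 10 in office 3, 0 elsewhere.
    STATES = [0, 1, 2, 3]
    ACTIONS = ['+', '-', '@']

    def f(state, action):
        """Deterministic state-transition function."""
        if action == '+':
            return (state + 1) % 4
        if action == '-':
            return (state - 1) % 4
        return state  # '@' keeps the robot where it is

    def R(state):
        """Reward for ending up in a state."""
        return 10 if state == 3 else 0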

SLIDE 8

The reinforcement learning problem

  • Given information about the environment:
  • States
  • Actions
  • State-transition function (or diagram)
  • Output a policy π: states → actions, i.e., find the best action to execute at each state.
  • Assumes that the state is completely observable (the agent always knows which state it is in).

SLIDE 9

Compare three policies

  • a. Every state is mapped to @. The value of this policy is 0, because the robot will never get to office 3.
  • b. Every state is mapped to + (policy 0). The value of this policy is ∞, because the robot will end up in office 3 infinitely often.
  • c. Every state except 3 is mapped to +, and 3 is mapped to @ (policy 1). The value of this policy is also ∞, because the robot will end up in (and stay in) office 3 infinitely often.
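
Written against the sketch above (again, the variable names are illustrative), the three policies are:

    policy_a = {s: '@' for s in STATES}          # a: always stay put
    policy_0 = {s: '+' for s in STATES}          # b: always move to the next office
    policy_1 = {0: '+', 1: '+', 2: '+', 3: '@'}  # c: move forward, stay in office 3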

SLIDE 10

Compare three policies

So it is easy to rule out case a, but how can we show that policy 1 is better than policy 0? One way would be to compute the average reward per tick:

POLICY 1: the average reward per tick for state 0 is 10.
POLICY 0: the average reward per tick for state 0 is 10/4.

Another way would be to assign higher values to immediate rewards and apply a discount to future rewards.
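
The average reward per tick is easy to estimate by simulation with the earlier sketch (the function name is illustrative):

    # Long-run average reward per tick, estimated by running the policy.
    def average_reward(policy, start, ticks=10000):
        total, state = 0, start
        for _ in range(ticks):
            total += R(state)
            state = f(state, policy[state])
        return total / ticks

    print(average_reward(policy_0, 0))   # 2.5  (i.e. 10/4)
    print(average_reward(policy_1, 0))   # just under 10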

SLIDE 11

Discounted cumulative reward

Assume that the robot associates a higher value with more immediate rewards and therefore discounts future rewards. The discount rate γ is a number between 0 and 1 used to discount future rewards. The discounted cumulative reward for a particular state with respect to a given policy is the sum, for n from 0 to infinity, of γ^n times the reward associated with the state reached after the n-th tick of the clock.
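
In symbols, writing j_n for the state reached after n ticks when the robot starts in state j and follows policy π (a notation introduced here for convenience):

    V^π(j) = Σ_{n=0..∞} γ^n R(j_n)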

POLICY 1: the discounted cumulative reward for state 0 is 2.5.
POLICY 0: the discounted cumulative reward for state 0 is 1.33.

SLIDE 12

Discounted cumulative reward (cont’d)

Take γ = 0.5.

For state 0 with respect to policy 0:
0.5^0 x 0 + 0.5^1 x 0 + 0.5^2 x 0 + 0.5^3 x 10 + 0.5^4 x 0 + 0.5^5 x 0 + 0.5^6 x 0 + 0.5^7 x 10 + … = 1.25 + 0.078 + … = 1.33 in the limit

For state 0 with respect to policy 1:
0.5^0 x 0 + 0.5^1 x 0 + 0.5^2 x 0 + 0.5^3 x 10 + 0.5^4 x 10 + 0.5^5 x 10 + 0.5^6 x 10 + 0.5^7 x 10 + … = 2.5 in the limit
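
These sums can be checked numerically with the earlier sketch (the function name is illustrative; truncating the series at 100 ticks is harmless because γ^n shrinks quickly):

    # Discounted cumulative reward from a start state under a fixed policy.
    def discounted_return(policy, start, gamma=0.5, horizon=100):
        total, state = 0.0, start
        for n in range(horizon):
            total += (gamma ** n) * R(state)
            state = f(state, policy[state])
        return total

    print(discounted_return(policy_0, 0))   # 1.333...
    print(discounted_return(policy_1, 0))   # 2.5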

SLIDE 13

Discounted cumulative reward (cont’d)

Let
  j be a state,
  R(j) be the reward for ending up in state j,
  π be a fixed policy,
  π(j) be the action dictated by π in state j,
  f(j, a) be the next state given that the robot starts in state j and performs action a,
  V^π_i(j) be the estimated value of state j with respect to policy π after the i-th iteration of the algorithm.

Using a dynamic programming algorithm, one can obtain a good estimate of V^π, the value function for policy π, as i → ∞.

SLIDE 14

A dynamic programming algorithm to compute values for states for a policy π

  • 1. For each j, set V^π_0(j) to 0.
  • 2. Set i to 0.
  • 3. For each j, set V^π_{i+1}(j) to R(j) + γ V^π_i( f(j, π(j)) ).
  • 4. Set i to i + 1.
  • 5. If i is equal to the maximum number of iterations, then return V^π_i; otherwise, return to step 3.
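
A direct transcription in Python, reusing the earlier sketch of the domain (the function name and the synchronous dictionary update are choices made here, not mandated by the slides):

    # Iterative evaluation of a fixed policy pi. One pass of the loop
    # corresponds to one "iteration" on the following slides, so iteration 0
    # of the slides is the first pass.
    def evaluate_policy(pi, gamma=0.5, iterations=5):
        V = {j: 0.0 for j in STATES}                                  # step 1
        for _ in range(iterations):                                   # steps 2, 4, 5
            V = {j: R(j) + gamma * V[f(j, pi[j])] for j in STATES}    # step 3
        return V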

SLIDE 15

Values of states for policy 0

  • initialize
  • V(0) = 0
  • V(1) = 0
  • V(2) = 0
  • V(3) = 0
  • iteration 0
  • For office 0: R(0) + γ V(1) = 0 + 0.5 x 0 = 0
  • For office 1: R(1) + γ V(2) = 0 + 0.5 x 0 = 0
  • For office 2: R(2) + γ V(3) = 0 + 0.5 x 0 = 0
  • For office 3: R(3) + γ V(0) = 10 + 0.5 x 0 = 10
  • (iteration 0 essentially initializes the values of states to their immediate rewards)

SLIDE 16

Values of states for policy 0 (cont’d)

  • After iteration 0: V(0) = V(1) = V(2) = 0, V(3) = 10
  • iteration 1
  • For office 0: R(0) + γ V(1) = 0 + 0.5 x 0 = 0
  • For office 1: R(1) + γ V(2) = 0 + 0.5 x 0 = 0
  • For office 2: R(2) + γ V(3) = 0 + 0.5 x 10 = 5
  • For office 3: R(3) + γ V(0) = 10 + 0.5 x 0 = 10
  • iteration 2
  • For office 0: R(0) + γ V(1) = 0 + 0.5 x 0 = 0
  • For office 1: R(1) + γ V(2) = 0 + 0.5 x 5 = 2.5
  • For office 2: R(2) + γ V(3) = 0 + 0.5 x 10 = 5
  • For office 3: R(3) + γ V(0) = 10 + 0.5 x 0 = 10

SLIDE 17

Values of states for policy 0 (cont’d)

  • After iteration 2: V(0) = 0, V(1) = 2.5, V(2) = 5, V(3) = 10
  • iteration 3
  • For office 0: R(0) + γ V(1) = 0 + 0.5 x 2.5 = 1.25
  • For office 1: R(1) + γ V(2) = 0 + 0.5 x 5 = 2.5
  • For office 2: R(2) + γ V(3) = 0 + 0.5 x 10 = 5
  • For office 3: R(3) + γ V(0) = 10 + 0.5 x 0 = 10
  • iteration 4
  • For office 0: R(0) + γ V(1) = 0 + 0.5 x 2.5 = 1.25
  • For office 1: R(1) + γ V(2) = 0 + 0.5 x 5 = 2.5
  • For office 2: R(2) + γ V(3) = 0 + 0.5 x 10 = 5
  • For office 3: R(3) + γ V(0) = 10 + 0.5 x 1.25 = 10.625

SLIDE 18

Values of states for policy 1

  • initialize
  • V(0) = 0
  • V(1) = 0
  • V(2) = 0
  • V(3) = 0
  • iteration 0
  • For office 0: R(0) + γ V(1) = 0 + 0.5 x 0 = 0
  • For office 1: R(1) + γ V(2) = 0 + 0.5 x 0 = 0
  • For office 2: R(2) + γ V(3) = 0 + 0.5 x 0 = 0
  • For office 3: R(3) + γ V(3) = 10 + 0.5 x 0 = 10

SLIDE 19

Values of states for policy 1 (cont’d)

  • After iteration 0: V(0) = V(1) = V(2) = 0, V(3) = 10
  • iteration 1
  • For office 0: R(0) + γ V(1) = 0 + 0.5 x 0 = 0
  • For office 1: R(1) + γ V(2) = 0 + 0.5 x 0 = 0
  • For office 2: R(2) + γ V(3) = 0 + 0.5 x 10 = 5
  • For office 3: R(3) + γ V(3) = 10 + 0.5 x 10 = 15
  • iteration 2
  • For office 0: R(0) + γ V(1) = 0 + 0.5 x 0 = 0
  • For office 1: R(1) + γ V(2) = 0 + 0.5 x 5 = 2.5
  • For office 2: R(2) + γ V(3) = 0 + 0.5 x 15 = 7.5
  • For office 3: R(3) + γ V(3) = 10 + 0.5 x 15 = 17.5

SLIDE 20

Values of states for policy 1 (cont’d)

  • After iteration 2: V(0) = 0, V(1) = 2.5, V(2) = 7.5, V(3) = 17.5
  • iteration 3
  • For office 0: R(0) + γ V(1) = 0 + 0.5 x 2.5 = 1.25
  • For office 1: R(1) + γ V(2) = 0 + 0.5 x 7.5 = 3.75
  • For office 2: R(2) + γ V(3) = 0 + 0.5 x 17.5 = 8.75
  • For office 3: R(3) + γ V(3) = 10 + 0.5 x 17.5 = 18.75
  • iteration 4
  • For office 0: R(0) + γ V(1) = 0 + 0.5 x 3.75 = 1.875
  • For office 1: R(1) + γ V(2) = 0 + 0.5 x 8.75 = 4.375
  • For office 2: R(2) + γ V(3) = 0 + 0.5 x 18.75 = 9.375
  • For office 3: R(3) + γ V(3) = 10 + 0.5 x 18.75 = 19.375

SLIDE 21

Compare policies

  • Policy 0 after iteration 4
  • For office 0: R(0) + γ V(1) = 0 + 0.5 x 2.5 = 1.25
  • For office 1: R(1) + γ V(2) = 0 + 0.5 x 5 = 2.5
  • For office 2: R(2) + γ V(3) = 0 + 0.5 x 10 = 5
  • For office 3: R(3) + γ V(0) = 10 + 0.5 x 1.25 = 10.625
  • Policy 1 after iteration 4
  • For office 0: R(0) + γ V(1) = 0 + 0.5 x 3.75 = 1.875
  • For office 1: R(1) + γ V(2) = 0 + 0.5 x 8.75 = 4.375
  • For office 2: R(2) + γ V(3) = 0 + 0.5 x 18.75 = 9.375
  • For office 3: R(3) + γ V(3) = 10 + 0.5 x 18.75 = 19.375
  • Policy 1 is better because every state has a higher value under policy 1 than under policy 0.
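
Running the evaluate_policy sketch from slide 14 reproduces these tables; five passes correspond to iterations 0 through 4:

    print(evaluate_policy(policy_0, iterations=5))
    # {0: 1.25, 1: 2.5, 2: 5.0, 3: 10.625}
    print(evaluate_policy(policy_1, iterations=5))
    # {0: 1.875, 1: 4.375, 2: 9.375, 3: 19.375}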

SLIDE 22

Temporal credit assignment problem

  • It is the problem of assigning credit or blame to the actions in a sequence of actions where feedback is available only at the end of the sequence.
  • When you lose a game of chess or checkers, the blame for your loss cannot necessarily be attributed to the last move you made, or even the next-to-the-last move.
  • Dynamic programming solves the temporal credit assignment problem by propagating rewards backwards to earlier states, and hence to actions earlier in the sequence of actions determined by a policy.

SLIDE 23

Computing an optimal policy

Given a method for estimating the value of states with respect to a fixed policy, it is possible to find an optimal policy. We would like to maximize the discounted cumulative reward. Policy iteration [Howard, 1960] is an algorithm that uses the algorithm for computing the value of a state as a subroutine.

SLIDE 24

Policy iteration algorithm

  • 1. Let π_0 be an arbitrary policy.
  • 2. Set i to 0.
  • 3. Compute V^π_i(j) for each j.
  • 4. Compute a new policy π_{i+1} so that π_{i+1}(j) is the action a maximizing R(j) + γ V^π_i( f(j, a) ).
  • 5. If π_{i+1} = π_i, then return π_i; otherwise, set i to i + 1, and go to step 3.
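
A Python sketch of the algorithm, reusing evaluate_policy from slide 14 (the function name and the fixed evaluation horizon are choices made here):

    # Policy iteration: evaluate the current policy, then make the policy
    # greedy with respect to the resulting value estimates; stop when the
    # policy no longer changes.
    def policy_iteration(gamma=0.5, eval_iterations=50):
        pi = {j: '@' for j in STATES}                            # step 1
        while True:
            V = evaluate_policy(pi, gamma, eval_iterations)      # step 3
            new_pi = {j: max(ACTIONS, key=lambda a: R(j) + gamma * V[f(j, a)])
                      for j in STATES}                           # step 4
            if new_pi == pi:                                     # step 5
                return pi
            pi = new_pi

With the ring transitions sketched earlier, this converges after a couple of rounds to a policy that stays in office 3 and heads there by a shortest route from every other office (in particular, from office 0 it moves backwards with '-').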

SLIDE 25

Policy iteration algorithm (cont’d)

A policy π is said to be optimal if there is no other policy π′ and state j such that V^π′(j) > V^π(j) and, for all k ≠ j, V^π′(k) ≥ V^π(k). The policy iteration algorithm is guaranteed to terminate in a finite number of steps with an optimal policy.

SLIDE 26

Comments on reinforcement learning

  • A general model where an agent can learn to function in dynamic environments.
  • The agent can learn while interacting with the environment.
  • No prior knowledge except the (probabilistic) transitions is assumed.
  • Can be generalized to stochastic domains (an action might have several different probabilistic consequences, i.e., the state-transition function is not deterministic); see the sketch after this list.
  • Can also be generalized to domains where the reward function is not known.
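
As a sketch of the stochastic generalization mentioned above: with a transition model P(k | j, a) in place of the deterministic f (the model P is an assumption, not part of the slides' domain), step 3 of the evaluation algorithm becomes an expectation over next states:

    V^π_{i+1}(j) = R(j) + γ Σ_k P(k | j, π(j)) V^π_i(k)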

SLIDE 27

Famous example: TD-Gammon (Tesauro, 1995)

  • Learns to play Backgammon.
  • Immediate reward:
    +100 if win
    -100 if lose
    0 for all other states
  • Trained by playing 1.5 million games against itself (several weeks).
  • Now approximately equal to the best human player (won the World Cup of Backgammon in 1992; among the top 3 since 1995).
  • Predecessor: NeuroGammon [Tesauro and Sejnowski, 1989] learned from examples of labeled moves (very tedious for a human expert).

SLIDE 28

Other examples

  • Robot learning to dock on a battery charger
  • Pole balancing
  • Elevator dispatching [Crites and Barto, 1995]: better than the industry standard
  • Inventory management [Van Roy et al.]: 10-15% improvement over industry standards
  • Job-shop scheduling for NASA space missions [Zhang and Dietterich, 1997]
  • Dynamic channel assignment in cellular phones [Singh and Bertsekas, 1994]
  • Robotic soccer

SLIDE 29

Common characteristics

  • delayed reward
  • opportunity for active exploration
  • possibility that state only partially observable
  • possible need to learn multiple tasks with the same sensors/effectors

  • there may not be an adequate teacher