SLIDE 1

Generalised weakened fictitious play and random belief learning

David S. Leslie
12 April 2010
Collaborators: Sean Collins, Claudio Mezzetti, Archie Chapman

SLIDE 2

Overview

  • Learning in games
  • Stochastic approximation
  • Generalised weakened fictitious play

    – Random belief learning
    – Oblivious learners

SLIDE 3

Normal form games

  • Players i = 1, . . . , N
  • Action sets Ai
  • Reward functions ri : A1 × · · · × AN → R
SLIDE 4

Mixed strategies

  • Mixed strategies πi ∈ ∆i
  • Joint mixed strategy π = (π1, . . . , πN)
  • Reward function extended so that ri(π) = Eπ[ri(a)]
SLIDE 5

Best responses

Assume other players use mixed strategy π−i. Player i should choose a mixed strategy in the best response set

  bi(π−i) = argmax_{π̃i ∈ ∆i} ri(π̃i, π−i)

SLIDE 6

Best responses

Assume other players use mixed strategy π−i. Player i should choose a mixed strategy in the best response set

  bi(π−i) = argmax_{π̃i ∈ ∆i} ri(π̃i, π−i)

A Nash equilibrium is a fixed point of the best response map:

  πi ∈ bi(π−i) for all i

SLIDE 7

A problem with Nash

Consider the game

  (2, 0)   (0, 1)
  (0, 2)   (1, 0)

  • with unique Nash equilibrium π1 = (2/3, 1/3), π2 = (1/3, 2/3)

SLIDE 8

A problem with Nash

Consider the game

  (2, 0)   (0, 1)
  (0, 2)   (1, 0)

  • with unique Nash equilibrium π1 = (2/3, 1/3), π2 = (1/3, 2/3)
  • ri(ai, π−i) = 2/3 for each i, ai (checked numerically below)
  • How does Player 1 know to use π1 = (2/3, 1/3)?
  • And Player 2 to use π2 = (1/3, 2/3)?
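
As a quick numerical check of the indifference claim above, here is a minimal NumPy sketch (the payoff matrices are transcribed from the bimatrix on this slide; nothing else is assumed):

```python
import numpy as np

# Payoff matrices for the 2x2 game above: r1[a1, a2] is Player 1's reward and
# r2[a1, a2] Player 2's reward when Player 1 plays row a1 and Player 2 column a2.
r1 = np.array([[2.0, 0.0],
               [0.0, 1.0]])
r2 = np.array([[0.0, 1.0],
               [2.0, 0.0]])

pi1 = np.array([2 / 3, 1 / 3])   # claimed equilibrium strategy for Player 1
pi2 = np.array([1 / 3, 2 / 3])   # claimed equilibrium strategy for Player 2

# Expected reward of each pure action against the opponent's mixed strategy.
print(r1 @ pi2)    # [0.6667 0.6667]  -> Player 1 is indifferent
print(pi1 @ r2)    # [0.6667 0.6667]  -> Player 2 is indifferent
```
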
SLIDE 9

Learning in games

  • Attempts to justify equilibrium play as the end point of a learning process
  • Generally assumes pretty stupid players!
  • Related to evolutionary game theory
SLIDE 10

Multi-armed bandits

At time n, choose action an, and receive reward Rn

SLIDE 11

Multi-armed bandits

Estimate after time n of the expected reward for action a ∈ A is:

  Qn(a) = ( Σ_{m ≤ n : am = a} Rm ) / κn(a)

where κn(a) = Σ_{m=1}^{n} I{am = a}

SLIDE 12

Multi-armed bandits

If an ≠ a, then κn(a) = κn−1(a) and:

  Qn(a) = ( Σ_{m=1}^{n−1} I{am = a} Rm + 0 ) / κn−1(a) = Qn−1(a)

SLIDE 13

Multi-armed bandits

If an = a:

  Qn(a) = ( Σ_{m=1}^{n−1} I{am = a} Rm + Rn ) / κn(a)
        = ( 1 − 1/κn(a) ) Qn−1(a) + ( 1/κn(a) ) Rn

SLIDE 14

Multi-armed bandits

Update estimates using

  Qn(a) = Qn−1(a) + ( 1/κn(a) ) {Rn − Qn−1(a)}   if an = a
  Qn(a) = Qn−1(a)                                 if an ≠ a

At time n + 1 use Qn to choose an action an+1
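
As an illustration of this incremental update, here is a minimal Python sketch (the ε-greedy choice rule and the Bernoulli reward arms are my own assumptions; the slide leaves the action-selection rule open):

```python
import numpy as np

def run_bandit(reward_fn, n_actions, n_steps, eps=0.1, rng=None):
    """Incremental multi-armed bandit estimation, as on this slide:
    Q_n(a) = Q_{n-1}(a) + (1/kappa_n(a)) * (R_n - Q_{n-1}(a)) when a_n = a."""
    rng = rng or np.random.default_rng()
    Q = np.zeros(n_actions)       # reward estimates Q_n(a)
    kappa = np.zeros(n_actions)   # play counts kappa_n(a)
    for _ in range(n_steps):
        # epsilon-greedy choice of a_{n+1} from Q_n (one possible choice rule)
        if rng.random() < eps:
            a = rng.integers(n_actions)
        else:
            a = int(np.argmax(Q))
        R = reward_fn(a)
        kappa[a] += 1
        Q[a] += (R - Q[a]) / kappa[a]   # only the played action is updated
    return Q

# Example: three arms with Bernoulli rewards of different means.
est = run_bandit(lambda a: float(np.random.rand() < [0.2, 0.5, 0.8][a]), 3, 5000)
print(est)
```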

SLIDE 15

Fictitious play

At iteration n + 1, player i:

  • forms beliefs σ−i_n ∈ ∆−i about the other players' strategies
  • chooses an action in bi(σ−i_n)

SLIDE 16

Belief formation

The beliefs about player j are simply the MLE:

  σj_n(aj) = κj_n(aj) / n

where κj_n(aj) = Σ_{m=1}^{n} I{aj_m = aj}

SLIDE 17

Belief formation

The beliefs about player j are simply the MLE:

  σj_n(aj) = κj_n(aj) / n

where κj_n(aj) = Σ_{m=1}^{n} I{aj_m = aj}

Recursive update:

  σj_{n+1}(aj) = κj_{n+1}(aj) / (n+1)
               = ( κj_n(aj) + I{aj_{n+1} = aj} ) / (n+1)

SLIDE 18

Belief formation

The beliefs about player j are simply the MLE:

  σj_n(aj) = κj_n(aj) / n

where κj_n(aj) = Σ_{m=1}^{n} I{aj_m = aj}

Recursive update:

  σj_{n+1}(aj) = κj_{n+1}(aj) / (n+1)
               = ( κj_n(aj) + I{aj_{n+1} = aj} ) / (n+1)
               = ( n/(n+1) ) · ( κj_n(aj) / n ) + I{aj_{n+1} = aj} / (n+1)

SLIDE 19

Belief formation

The beliefs about player j are simply the MLE:

  σj_n(aj) = κj_n(aj) / n

where κj_n(aj) = Σ_{m=1}^{n} I{aj_m = aj}

Recursive update:

  σj_{n+1}(aj) = ( 1 − 1/(n+1) ) σj_n(aj) + ( 1/(n+1) ) I{aj_{n+1} = aj}

SLIDE 20

Belief formation

The beliefs about player j are simply the MLE:

  σj_n(aj) = κj_n(aj) / n

where κj_n(aj) = Σ_{m=1}^{n} I{aj_m = aj}

Recursive update:

  σj_{n+1} = ( 1 − 1/(n+1) ) σj_n + ( 1/(n+1) ) e_{aj_{n+1}}

SLIDE 21

Belief formation

The beliefs about player j are simply the MLE:

  σj_n(aj) = κj_n(aj) / n

where κj_n(aj) = Σ_{m=1}^{n} I{aj_m = aj}

Recursive update:

  σj_{n+1} = ( 1 − 1/(n+1) ) σj_n + ( 1/(n+1) ) e_{aj_{n+1}}

In terms of best responses:

  σj_{n+1} ∈ ( 1 − 1/(n+1) ) σj_n + ( 1/(n+1) ) bj(σ−j_n)

SLIDE 22

Belief formation

The beliefs about player j are simply the MLE:

  σj_n(aj) = κj_n(aj) / n

where κj_n(aj) = Σ_{m=1}^{n} I{aj_m = aj}

Recursive update:

  σj_{n+1} = ( 1 − 1/(n+1) ) σj_n + ( 1/(n+1) ) e_{aj_{n+1}}

In terms of best responses:

  σn+1 ∈ ( 1 − 1/(n+1) ) σn + ( 1/(n+1) ) b(σn)
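
Putting the last few slides together, here is a minimal sketch of classical fictitious play for a two-player game, using exactly the recursive belief update above (the deterministic tie-breaking and the uniform initial beliefs are assumptions, not part of the slides):

```python
import numpy as np

def fictitious_play(r1, r2, n_steps):
    """Classical two-player fictitious play.
    r1[a1, a2] and r2[a1, a2] are the players' reward matrices; beliefs sigma are
    the empirical frequencies of the opponent's past actions."""
    sigma1 = np.ones(r1.shape[0]) / r1.shape[0]   # beliefs about Player 1
    sigma2 = np.ones(r1.shape[1]) / r1.shape[1]   # beliefs about Player 2
    for n in range(1, n_steps + 1):
        # Each player best-responds to the current beliefs about the other.
        a1 = int(np.argmax(r1 @ sigma2))
        a2 = int(np.argmax(sigma1 @ r2))
        # Recursive MLE update: sigma_{n+1} = (1 - 1/(n+1)) sigma_n + (1/(n+1)) e_a
        step = 1.0 / (n + 1)
        e1 = np.zeros_like(sigma1); e1[a1] = 1.0
        e2 = np.zeros_like(sigma2); e2[a2] = 1.0
        sigma1 = sigma1 + step * (e1 - sigma1)
        sigma2 = sigma2 + step * (e2 - sigma2)
    return sigma1, sigma2

# On the 2x2 game from the earlier slide the beliefs approach (2/3, 1/3) and (1/3, 2/3).
r1 = np.array([[2.0, 0.0], [0.0, 1.0]])
r2 = np.array([[0.0, 1.0], [2.0, 0.0]])
print(fictitious_play(r1, r2, 100000))
```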

SLIDE 23

Stochastic approximation

SLIDE 24

Stochastic approximation

θn+1 ∈ θn + αn+1 {F(θn) + Mn+1}

SLIDE 25

Stochastic approximation

θn+1 ∈ θn + αn+1 {F(θn) + Mn+1}

  • F : Θ → Θ is a (bounded u.s.c.) set-valued map
  • αn → 0 and Σn αn = ∞
  • For any T > 0,

      lim_{n→∞}  sup_{ k > n : Σ_{i=n}^{k−1} αi+1 ≤ T }  ‖ Σ_{i=n}^{k−1} αi+1 Mi+1 ‖ = 0

The last condition is implied by Σn (αn)² < ∞, E[Mn+1 | θn] → 0, and Var[Mn+1 | θn] < C almost surely.

SLIDE 26

Stochastic approximation

  θn+1 ∈ θn + αn+1 {F(θn) + Mn+1}

  (θn+1 − θn) / αn+1 ∈ F(θn) + Mn+1

  ↓

  dθ/dt ∈ F(θ),

a differential inclusion (Benaïm, Hofbauer and Sorin, 2005)

SLIDE 27

Stochastic approximation

θn+1 ∈ θn + αn+1 {F(θn) + Mn+1}

In fictitious play:

  σn+1 ∈ σn + (1/(n+1)) {b(σn) − σn}

  dσ/dt ∈ b(σ) − σ,

the best response differential inclusion. Hence σn converges to the set of Nash equilibria in zero-sum games, potential games, and generic 2 × m games.

SLIDE 28

Generalised weakened fictitious play

SLIDE 29

Weakened fictitious play

  • Van der Genugten (2000) showed that the convergence rate of fictitious play can be improved if players use εn-best responses (for 2-player zero-sum games, and a very specific choice of εn)
  • π ∈ bεn(σn)  ⇒  π ∈ b(σn) + Mn+1, where Mn → 0 as εn → 0 (by continuity properties of b and boundedness of r)
  • For general games and general εn → 0 this fits into the stochastic approximation framework

SLIDE 30

Generalised weakened fictitious play

Theorem: Any process such that

  σn+1 ∈ σn + αn+1 {bεn(σn) − σn + Mn+1}

where

  • εn → 0 as n → ∞
  • αn → 0 as n → ∞
  • for any T > 0,  lim_{n→∞}  sup_{ k > n : Σ_{i=n}^{k−1} αi+1 ≤ T }  ‖ Σ_{i=n}^{k−1} αi+1 Mi+1 ‖ = 0

converges to the set of Nash equilibria for zero-sum games, potential games and generic 2 × m games.

SLIDE 31

Recency

  • For classical fictitious play αn = 1/n, εn ≡ 0 and Mn ≡ 0
  • For any αn → 0 the conditions are met (since Mn ≡ 0)
  • How about αn = 1/√n, or even αn = 1/log n? (compared in the sketch below)
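
The comparison behind the figure on the next slide can be sketched as follows (the matching pennies payoffs and the step-size rules are as stated; the initial beliefs and tie-breaking are my assumptions):

```python
import numpy as np

def modified_fp_matching_pennies(alpha, n_steps=200):
    """Belief process sigma_{n+1} = sigma_n + alpha(n) * (b(sigma_n) - sigma_n) on
    matching pennies; returns the trace of the belief that Player 1 plays Heads."""
    r1 = np.array([[1.0, -1.0], [-1.0, 1.0]])   # Player 1 wants to match
    r2 = -r1                                     # zero-sum: Player 2 wants to mismatch
    sigma1 = np.array([0.5, 0.5])                # beliefs about Player 1 (Heads, Tails)
    sigma2 = np.array([0.5, 0.5])                # beliefs about Player 2
    trace = []
    for n in range(1, n_steps + 1):
        a1 = int(np.argmax(r1 @ sigma2))
        a2 = int(np.argmax(sigma1 @ r2))
        sigma1 = sigma1 + alpha(n) * (np.eye(2)[a1] - sigma1)
        sigma2 = sigma2 + alpha(n) * (np.eye(2)[a2] - sigma2)
        trace.append(sigma1[0])
    return trace

# Classical fictitious play versus a step size placing more weight on recent plays.
classical = modified_fp_matching_pennies(lambda n: 1.0 / (n + 1))
recency = modified_fp_matching_pennies(lambda n: 1.0 / np.sqrt(n + 1))
print(classical[-3:], recency[-3:])   # both oscillate around the equilibrium value 1/2
```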

SLIDE 32

Recency

[Figure] Belief that Player 1 plays Heads over 200 plays of the two-player matching pennies game under classical fictitious play (top), under a modified fictitious play with αn = 1/√n (middle), and with αn = 1/log n (bottom).

SLIDE 33

Stochastic fictitious play

In fictitious play, players always choose pure actions

⇒ strategies never converge to mixed strategies

(beliefs do, but played strategies do not)

SLIDE 34

Stochastic fictitious play

Instead consider smooth best responses:

  βi_τ(σ−i) = argmax_{πi ∈ ∆i} { ri(πi, σ−i) + τ v(πi) }

For example:

  βi_τ(σ−i)(ai) = exp{ri(ai, σ−i)/τ} / Σ_{a∈Ai} exp{ri(a, σ−i)/τ}
SLIDE 35

Stochastic fictitious play

Instead consider smooth best responses:

  βi_τ(σ−i) = argmax_{πi ∈ ∆i} { ri(πi, σ−i) + τ v(πi) }

For example:

  βi_τ(σ−i)(ai) = exp{ri(ai, σ−i)/τ} / Σ_{a∈Ai} exp{ri(a, σ−i)/τ}

Strategies evolve according to

  σn+1 = σn + (1/(n+1)) {βτ(σn) + Mn+1 − σn}

where E[Mn+1 | σn] = 0
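
A sketch of a single stochastic fictitious play iteration with the logit smooth best response above (the temperature value and the game used in the usage example are illustrative):

```python
import numpy as np

def logit_response(q, tau):
    """Smooth best response beta_tau: softmax of expected rewards q at temperature tau."""
    z = np.exp((q - q.max()) / tau)   # subtract the max for numerical stability
    return z / z.sum()

def stochastic_fp_step(sigma1, sigma2, r1, r2, n, tau, rng):
    """One iteration: each player samples an action from its smooth best response,
    then beliefs are updated by sigma_{n+1} = sigma_n + (1/(n+1)) (e_a - sigma_n)."""
    a1 = rng.choice(len(sigma1), p=logit_response(r1 @ sigma2, tau))
    a2 = rng.choice(len(sigma2), p=logit_response(sigma1 @ r2, tau))
    step = 1.0 / (n + 1)
    e1 = np.eye(len(sigma1))[a1]
    e2 = np.eye(len(sigma2))[a2]
    return sigma1 + step * (e1 - sigma1), sigma2 + step * (e2 - sigma2)

# Usage on the 2x2 game from earlier: beliefs drift towards the mixed equilibrium.
rng = np.random.default_rng(0)
r1 = np.array([[2.0, 0.0], [0.0, 1.0]])
r2 = np.array([[0.0, 1.0], [2.0, 0.0]])
s1 = np.array([0.5, 0.5])
s2 = np.array([0.5, 0.5])
for n in range(1, 20001):
    s1, s2 = stochastic_fp_step(s1, s2, r1, r2, n, tau=0.1, rng=rng)
print(s1, s2)
```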

SLIDE 36

Convergence

  σn+1 = σn + (1/(n+1)) {βτ(σn) − σn + Mn+1}

SLIDE 37

Convergence

  σn+1 = σn + (1/(n+1)) {βτ(σn) − σn + Mn+1}
       ∈ σn + (1/(n+1)) {bε(σn) − σn + Mn+1}

SLIDE 38

Convergence

  σn+1 = σn + (1/(n+1)) {βτ(σn) − σn + Mn+1}
       ∈ σn + (1/(n+1)) {bε(σn) − σn + Mn+1}

But we can now consider the effect of using smooth best responses βτn with τn → 0...

...it means that εn → 0, resulting in a GWFP!

SLIDE 39

Random belief learning

SLIDE 40

Random beliefs

(Friedman and Mezzetti 2005) Best response ‘assumes’ complete confidence in:

  • knowledge of the reward functions
  • beliefs σ about opponent strategy
SLIDE 41

Random beliefs

(Friedman and Mezzetti 2005) Best response ‘assumes’ complete confidence in:

  • knowledge of the reward functions
  • beliefs σ about opponent strategy
SLIDE 42

Random beliefs

(Friedman and Mezzetti 2005) Best response ‘assumes’ complete confidence in:

  • knowledge of the reward functions
  • beliefs σ about opponent strategy

Uncertainty in the beliefs σn  ↔  a distribution on belief space

SLIDE 43

Belief distributions

  • The belief about player j is that πj ∼ µj
  • Eµj[πj] = σj, the focus of µj.
SLIDE 44

Belief distributions

  • The belief about player j is that πj ∼ µj
  • Eµj[πj] = σj, the focus of µj.

Response to random beliefs: sample π−i ∼ µ−i and play ai ∈ bi(π−i).

Let b̃i(µ−i) be the resulting mixed strategy.

SLIDE 45

Random belief equilibrium

A random belief equilibrium is a set of belief distributions µi such that the focus of µi is equal to the mixed strategy played by i:

  Eµi[πi] = b̃i(µ−i)

A refinement of Nash equilibria when µi depends on ε and Var_{µj_ε}[πj] → 0 as ε → 0.

SLIDE 46

Inference

  • In fictitious play, σj_n is the MLE of πj

SLIDE 47

Inference

  • In fictitious play, σj_n is the MLE of πj
  • Fudenberg and Levine (1998): if the prior is Dirichlet(α), then the posterior is Dirichlet(α + κ)

Fictitious play is doing Bayesian learning, with best replies taken with respect to the expected opponent strategy

SLIDE 48

Random belief learning

  • Start with priors µj
SLIDE 49

Random belief learning

  • Start with priors µj
  • After observing actions for n steps, have posteriors µj_n

SLIDE 50

Random belief learning

  • Start with priors µj
  • After observing actions for n steps, have posteriors µj_n
  • Select actions using the response to random beliefs, i.e. the mixed strategy b̃i(µ−i_n) (sketched below)
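
A sketch of random belief learning with Dirichlet priors and posteriors, in the spirit of the Fudenberg and Levine observation two slides back (the uniform prior, the two-player setting, and the pure best response to the sampled belief are illustrative assumptions):

```python
import numpy as np

def random_belief_learning(r1, r2, n_steps, prior=1.0, rng=None):
    """Two-player random belief learning: each player keeps a Dirichlet posterior
    over the opponent's mixed strategy, samples a belief pi from it, and plays a
    best response to the sampled belief."""
    rng = rng or np.random.default_rng()
    alpha1 = np.full(r1.shape[0], prior)   # Player 2's Dirichlet counts over Player 1's actions
    alpha2 = np.full(r1.shape[1], prior)   # Player 1's Dirichlet counts over Player 2's actions
    for _ in range(n_steps):
        pi2 = rng.dirichlet(alpha2)        # Player 1 samples a belief about Player 2
        pi1 = rng.dirichlet(alpha1)        # Player 2 samples a belief about Player 1
        a1 = int(np.argmax(r1 @ pi2))      # best response to the sampled belief
        a2 = int(np.argmax(pi1 @ r2))
        alpha1[a1] += 1                    # posterior update: Dirichlet(alpha + kappa)
        alpha2[a2] += 1
    # The foci (posterior means) play the role of the fictitious play beliefs sigma.
    return alpha1 / alpha1.sum(), alpha2 / alpha2.sum()

r1 = np.array([[2.0, 0.0], [0.0, 1.0]])
r2 = np.array([[0.0, 1.0], [2.0, 0.0]])
print(random_belief_learning(r1, r2, 20000))   # foci should approach (2/3, 1/3) and (1/3, 2/3)
```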

SLIDE 51

Convergence

Can show:

  • b̃i(µ−i_n) ∈ bεn(σ−i_n)
  • So the beliefs follow a GWFP process

Unfortunately it is the beliefs, not the played strategies, that converge.

SLIDE 52

Learning the game

Best response ‘assumes’ complete confidence in:

  • knowledge of the reward functions
  • beliefs σ about opponent strategy
SLIDE 53

Learning the game

Best response ‘assumes’ complete confidence in:

  • knowledge of the reward functions
  • beliefs σ about opponent strategy

Learn the reward matrices using reinforcement learning ideas (see the sketch below):

  • at iteration n, observe joint action an and reward Ri(an) = ri(an) + εn
  • update estimates σ−i of opponent strategies
  • update estimate Qi(an) of ri(an)
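
A possible rendering of this scheme for two players (the exploration rate, the Gaussian observation noise, and the running-average estimator are my assumptions):

```python
import numpy as np

def fp_learn_rewards(true_r1, true_r2, n_steps, noise=0.1, explore=0.05, rng=None):
    """Fictitious play while learning the reward matrices: keep empirical opponent
    frequencies sigma and running-average estimates Q of the joint-action rewards."""
    rng = rng or np.random.default_rng()
    n1, n2 = true_r1.shape
    Q1 = np.zeros((n1, n2))          # estimate of r1(a)
    Q2 = np.zeros((n1, n2))          # estimate of r2(a)
    count = np.zeros((n1, n2))       # visit counts for each joint action a
    k1 = np.ones(n1)                 # counts of Player 1's observed actions
    k2 = np.ones(n2)                 # counts of Player 2's observed actions
    for _ in range(n_steps):
        sigma1, sigma2 = k1 / k1.sum(), k2 / k2.sum()
        # Occasional exploration keeps every joint action being played.
        a1 = rng.integers(n1) if rng.random() < explore else int(np.argmax(Q1 @ sigma2))
        a2 = rng.integers(n2) if rng.random() < explore else int(np.argmax(sigma1 @ Q2))
        R1 = true_r1[a1, a2] + noise * rng.standard_normal()   # noisy reward observations
        R2 = true_r2[a1, a2] + noise * rng.standard_normal()
        count[a1, a2] += 1
        Q1[a1, a2] += (R1 - Q1[a1, a2]) / count[a1, a2]        # Q(a) -> r(a)
        Q2[a1, a2] += (R2 - Q2[a1, a2]) / count[a1, a2]
        k1[a1] += 1                                            # update opponent models
        k2[a2] += 1
    return Q1, Q2, k1 / k1.sum(), k2 / k2.sum()
```
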
SLIDE 54

Convergence

Assume all joint actions a are played infinitely often. Can show:

  • Qi_n(a) → ri(a) for all a
  • Best responses to σ−i_n computed with respect to Qi_n are εn-best responses with respect to ri
  • So the beliefs follow a GWFP process

Potentially very useful in DCOP games (Chapman, Rogers, Jennings and Leslie 2008)

SLIDE 55

Oblivious learners

SLIDE 56

Oblivious learners

What if players are oblivious to their opponents? Each individual treats the problem as a multi-armed bandit. Can we expect equilibrium play?

SLIDE 57

Best response/inertia

Suppose individuals (somehow, by magic) actually know Qi(ai) = ri(ai, π−i_n).

They can adjust their own strategy towards a best response:

  πi_{n+1} = (1 − αn+1) πi_n + αn+1 bi(π−i_n)

Strategies converge, not just beliefs. But it's just not possible.

SLIDE 58

If π−i were fixed...

  • Player i actually faces a multi-armed bandit
  • So can learn Qi(ai) by playing all actions infinitely often
  • Then adjust πi
SLIDE 59

Actor–critic learning

  Qi_{n+1}(ai_{n+1}) = Qi_n(ai_{n+1}) + λn+1 { Rn+1 − Qi_n(ai_{n+1}) }

  πi_{n+1} = πi_n + αn { bi(Qi_n) − πi_n }

SLIDE 60

Actor–critic learning

  Qi_{n+1}(ai_{n+1}) = Qi_n(ai_{n+1}) + λn+1 { Rn+1 − Qi_n(ai_{n+1}) }

  πi_{n+1} = πi_n + αn { bi(Qi_n) − πi_n }

  • With all players adjusting simultaneously, need to be careful

If αn/λn → 0, the system can be analysed as if all players have accurate Q values.
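
A sketch of this oblivious actor–critic process for two players (the particular step-size sequences satisfying αn/λn → 0, and the pure best response to each player's own Q values, are illustrative choices):

```python
import numpy as np

def actor_critic(r1, r2, n_steps, rng=None):
    """Oblivious actor-critic: each player only sees its own action and reward.
    Critic: Q_{n+1}(a) = Q_n(a) + lambda_{n+1} (R_{n+1} - Q_n(a)) for the played action.
    Actor:  pi_{n+1} = pi_n + alpha_n (b(Q_n) - pi_n), with alpha_n / lambda_n -> 0."""
    rng = rng or np.random.default_rng()
    n1, n2 = r1.shape
    Q1, Q2 = np.zeros(n1), np.zeros(n2)
    pi1, pi2 = np.ones(n1) / n1, np.ones(n2) / n2
    for n in range(1, n_steps + 1):
        lam = 1.0 / n ** 0.6          # critic step size lambda_n
        alpha = 1.0 / n ** 0.9        # actor step size alpha_n (alpha_n / lambda_n -> 0)
        a1 = rng.choice(n1, p=pi1)    # players sample from their own mixed strategies
        a2 = rng.choice(n2, p=pi2)
        Q1[a1] += lam * (r1[a1, a2] - Q1[a1])    # critic updates for played actions only
        Q2[a2] += lam * (r2[a1, a2] - Q2[a2])
        b1 = np.eye(n1)[np.argmax(Q1)]           # (pure) best response to own Q values
        b2 = np.eye(n2)[np.argmax(Q2)]
        pi1 += alpha * (b1 - pi1)                # actor moves the strategy towards it
        pi2 += alpha * (b2 - pi2)
    return pi1, pi2

r1 = np.array([[2.0, 0.0], [0.0, 1.0]])
r2 = np.array([[0.0, 1.0], [2.0, 0.0]])
# Played strategies, not just beliefs, should drift towards the mixed equilibrium.
print(actor_critic(r1, r2, 50000))
```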

SLIDE 61

Convergence

  • Can show that |Qi_n(ai) − ri(ai, π−i_n)| → 0
  • So best responses with respect to the Qi's are ε-best responses to π−i_n
  • So the πn follow a GWFP process

We have a process under which played strategy converges to Nash equilibrium

SLIDE 62

Conclusions

  • Generalised weakened fictitious play is a class of processes closely related to the best response dynamics
  • All GWFP processes converge to Nash equilibrium in zero-sum games, potential games, and generic 2 × m games
  • GWFP encompasses numerous models of learning in games:

    – Fictitious play with greater weight on recent observations
    – Stochastic fictitious play with vanishing smoothing
    – Random belief learning
    – Fictitious play while learning the reward matrices
    – An oblivious actor–critic process