SLIDE 1

Multi-agent learning

Repeated games

Gerard Vreeswijk, Intelligent Systems Group, Computer Science Department, Faculty of Sciences, Utrecht University, The Netherlands. Last modified on February 9th, 2012 at 17:15.

SLIDE 2

Repeated games: motivation

1. Much interaction in multi-agent systems can be modelled through games.
2. Much learning in multi-agent systems can therefore be modelled through learning in games.
3. Learning in games usually takes place through the (gradual) adaptation of strategies (hence, behaviour) in a repeated game.
4. In most repeated games, one game (a.k.a. the stage game) is played repeatedly. Possibilities:
   • A finite number of times.
   • An indefinite (same: indeterminate) number of times.
   • An infinite number of times.
5. Therefore, familiarity with the basic concepts and results from the theory of repeated games is essential to understand multi-agent learning.

SLIDE 3

Plan for today

• NE in normal form games that are repeated a finite number of times.
  – Principle of backward induction.
• NE in normal form games that are repeated an indefinite number of times.
  – Discount factor. Models the probability of continuation.
  – Folk theorem. (Actually many FTs.) Repeated games generally have infinitely many Nash equilibria.
  – Trigger strategy, on-path vs. off-path play, the threat to "minmax" an opponent.

This presentation draws heavily on (Peters, 2008).*

* H. Peters (2008): Game Theory: A Multi-Leveled Approach. Springer, ISBN: 978-3-540-69290-4. Ch. 8: Repeated games.

SLIDE 4

Example 1: Nash equilibria in playing the PD twice

Prisoners' Dilemma (row = You, column = Other):

                 Cooperate   Defect
    Cooperate    (3, 3)      (0, 5)
    Defect       (5, 0)      (1, 1)

• Even if mixed strategies are allowed, the PD possesses one Nash equilibrium, viz. (D, D) with payoffs (1, 1).
• This equilibrium is Pareto sub-optimal. (Because (3, 3) makes both players better off.)
• Does the situation change if two parties get to play the Prisoners' Dilemma two times in succession?
• The following diagram (hopefully) shows that playing the PD two times in succession does not yield an essentially new NE.

SLIDE 5

Example 1: Nash equilibria in playing the PD twice (2)

The game tree of the twice-played PD, flattened into cumulative payoffs (round-1 profile, then round-2 profile; totals start from the root (0, 0)):

    Round 1       then CC   then CD   then DC   then DD
    CC (3, 3)     (6, 6)    (3, 8)    (8, 3)    (4, 4)
    CD (0, 5)     (3, 8)    (0, 10)   (5, 5)    (1, 6)
    DC (5, 0)     (8, 3)    (5, 5)    (10, 0)   (6, 1)
    DD (1, 1)     (4, 4)    (1, 6)    (6, 1)    (2, 2)

SLIDE 6

Example 1: Nash equilibria in playing the PD twice (3)

In normal form (row = You, column = Other):

          CC        CD       DC       DD
    CC    (6, 6)    (3, 8)   (3, 8)   (0, 10)
    CD    (8, 3)    (4, 4)   (5, 5)   (1, 6)
    DC    (8, 3)    (5, 5)   (4, 4)   (1, 6)
    DD    (10, 0)   (6, 1)   (6, 1)   (2, 2)

• The action profile (DD, DD) is the only Nash equilibrium.
• With 3 successive games, we obtain a 2^3 × 2^3 (i.e. 8 × 8) matrix, where the action profile (DDD, DDD) still would be the only Nash equilibrium.
• Generalise to N repetitions: (D^N, D^N) still is the only Nash equilibrium in a repeated game where the PD is played N times in succession.

SLIDE 7

Backward induction (version for repeated games)

• Suppose G is a game in normal form for p players, where all players possess the same arsenal of possible actions A = {a_1, . . . , a_m}.
• The game G^n arises by playing the stage game G a number of n times in succession.
• A history h of length k is an element of (A^p)^k. E.g., for p = 3 and k = 10,

  (a7 a5 a3) (a6 a1 a9) (a2 a7 a7) (a3 a6 a9) (a2 a4 a2) (a9 a9 a1) (a1 a4 a1) (a2 a7 a9) (a6 a1 a1) (a8 a2 a4)

  is a history of length ten in a game with three players. The set of all possible histories is denoted by H. (Hence, |H_k| = m^kp.)
• A (possibly mixed) strategy for one player is a function H → Pr(A).
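As a concrete, hypothetical rendering of these definitions (my own toy code, for m = 2 actions and p = 2 players), one can enumerate H_k and represent a strategy as a plain function from histories to distributions; this also checks the counting claim |H_k| = m^kp:

```python
from itertools import product

A = ('C', 'D')          # the action set, |A| = m = 2
p = 2                   # number of players

def histories(k):
    """All histories of length k, i.e. the elements of (A^p)^k."""
    profiles = list(product(A, repeat=p))       # action profiles: A^p
    return list(product(profiles, repeat=k))    # sequences thereof

def always_defect(history):
    """A (degenerate) mixed strategy: a function H -> Pr(A)."""
    return {'C': 0.0, 'D': 1.0}

# |H_k| = m^(kp): for k = 2 and m = p = 2 this gives 16.
print(len(histories(2)))
```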

SLIDE 8

Backward induction (version for repeated games)

• For some repeated games of length n, the dominating (read: "clearly best") strategy for all players in round n (the last round) does not depend on the history of play. E.g., for the Prisoners' Dilemma in the last round:

  "No matter what happened in rounds 1 . . . n − 1, I am better off playing D."

• Fixed strategies (D, D) in round n determine play after round n − 1.
• Independence of history, plus a determined future, leads to the following justification for playing D in round n − 1:

  "No matter what happened in rounds 1 . . . n − 2 (the past), and given that I will receive a payoff of 1 in round n (the future), I am better off playing D now."

• Per induction in round k, where k ≥ 1:

  "No matter what happened in rounds 1 . . . k − 1, and given that I will receive a payoff of (n − k) · 1 in rounds (k + 1) . . . n, I am better off playing D in round k."

SLIDE 9

Indefinite number of repetitions

• A Pareto-suboptimal outcome can be avoided in case the following three conditions are met:
  1. The Prisoners' Dilemma is repeated an indefinite number of times (rounds).
  2. A so-called discount factor δ ∈ [0, 1] determines the probability of continuing the game after each round.
  3. The probability to continue, δ, must be large enough.
• Under these conditions suddenly infinitely many Nash equilibria exist. This is sometimes called an embarrassment of richness (Peters, 2008).
• Various Folk theorems state the existence of multiple equilibria in infinitely repeated games.(a)
• We now informally discuss one version of "the" Folk Theorem.

(a) Folk Theorems are named such because their exact origin cannot be traced.

SLIDE 10

Example 2: Prisoners’ Dilemma repeated indefinitely

• Consider the game G∗(δ) where the PD is played a number of times in succession. We write G∗(δ): G_0, G_1, G_2, . . . .
• The number of times the stage game is played is determined by a parameter 0 ≤ δ ≤ 1. The probability that the next stage (and the stages thereafter) will be played is δ. Thus, the probability that stage game G_t will be played is δ^t. (What if t = 0?)
• The PD (of which every G_t is an incarnation) is called the stage game, as opposed to the overall game G∗(δ).
• A history h of length t of a repeated game is a sequence of action profiles of length t.
• A realisation h is a countably infinite sequence of action profiles.
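The reading of δ as a continuation probability can be sanity-checked by simulation (a sketch of mine, not from the slides): the number of stage games actually played is geometrically distributed, with mean 1/(1 − δ).

```python
import random

def play_length(delta, rng):
    """Number of stage games played when, after each round, the game
    continues with probability delta (round t = 0 is always played)."""
    t = 1
    while rng.random() < delta:
        t += 1
    return t

rng = random.Random(0)      # fixed seed for reproducibility
delta = 0.75
n = 200_000
avg = sum(play_length(delta, rng) for _ in range(n)) / n
print(avg)    # should be close to 1/(1 - delta) = 4
```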

SLIDE 11

Example 2: Prisoners' Dilemma repeated indefinitely (2)

• Example of a history of length t = 10:

  Round:           1  2  3  4  5  6  7  8  9  10
  Row player:      C  D  D  D  C  C  D  D  D  D
  Column player:   C  D  D  D  D  D  D  C  D  D

• The set of all possible histories (of any length) is denoted by H.
• A (mixed) strategy for Player i is a function s_i : H → Pr({C, D}) such that

  Pr( Player i plays C in round |h| + 1 | h ) = s_i(h)(C).

• A strategy profile s is a combination of strategies, one for each player.
• The expected payoff for player i given s can be computed. It is

  Expected payoff_i(s) = Σ_{t=0}^{∞} δ^t · Expected payoff_{i,t}(s).

SLIDE 12

Example: The expected payoff of a stage game

Prisoners' Dilemma (row = You, column = Other):

                 Cooperate   Defect
    Cooperate    (3, 3)      (0, 5)
    Defect       (5, 0)      (1, 1)

• Suppose the following strategy profile for one stage game:
  – Row player (you) plays with mixed strategy 0.8 on C (hence, 0.2 on D).
  – Column player (other) plays with mixed strategy 0.7 on C.
• Your expected payoff is 0.8 (0.7 · 3 + 0.3 · 0) + 0.2 (0.7 · 5 + 0.3 · 1) = 2.44.
• General formula (cf., e.g., Leyton-Brown et al., 2008):

  Expected payoff_{i,t}(s) = Σ_{(i_1, . . . , i_n) ∈ A^n} ( Π_{k=1}^{n} s_{k,i_k} ) · payoff_i(a_{i_1}, . . . , a_{i_n})
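The slide's arithmetic can be replayed directly from the formula (a minimal sketch of mine; the dictionary encoding is an assumption):

```python
# Row player's stage-game payoffs in the PD: ROW_PAYOFF[(your move, their move)].
ROW_PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def expected_payoff(row_mix, col_mix):
    """Sum over action profiles of (product of probabilities) x payoff."""
    return sum(row_mix[a] * col_mix[b] * ROW_PAYOFF[(a, b)]
               for a in 'CD' for b in 'CD')

v = expected_payoff({'C': 0.8, 'D': 0.2}, {'C': 0.7, 'D': 0.3})
print(v)    # approximately 2.44, as computed on the slide
```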

SLIDE 13

Expected payoffs for P1 and P2 in stage PD with mixed strategies

[Figure: expected-payoff surfaces over the players' mixing probabilities. Player 1 may only move "back–front"; Player 2 may only move "left–right".]
SLIDE 14

Subgame perfect equilibria of G∗(δ): D∗

Recall: a subgame perfect Nash equilibrium of an extensive (in this case: repeated) game is a Nash equilibrium of this extensive game whose restriction to every subgame (read: tailgame) also is a Nash equilibrium of that subgame. Consider the strategy of iterated defection D∗: "always defect, no matter what".(a)

• Claim. The strategy profile (D∗, D∗) is a subgame perfect equilibrium in G∗(δ).
• Proof. Consider any tailgame starting at round t. We are done if we can show that (D∗, D∗) is a NE for this subgame. This is true: given that one player always defects, it never pays off for the other player to play C at any time. Hence, everyone plays D∗.

(a) A notation like D∗ or (worse) D∞ is suggestive. Mathematically it makes no sense, but intuitively it does.

SLIDE 15

Part II: Trigger strategies

SLIDE 16

Example 3: Cost of deviating in Round 4 of the repeated PD

Consider the so-called trigger strategy T: “always play C unless D has been played at least once. In that case play D forever”.

• Claim. The strategy profile (T, T) is a subgame perfect equilibrium in G∗(δ), provided the probability of continuation, δ, is sufficiently large.
• Proof. Consider a typical play:

  Round:           1  2  3  4  5  6  7  8  9  10  . . .
  Row player:      C  C  C  C  C  D  D  D  D  D   . . .
  Column player:   C  C  C  C  D  D  D  D  D  D   . . .

  The column player defects after Round 4. By doing so he expects a payoff of

  Σ_{t=0}^{4} δ^t · 3 + δ^5 · 5 + Σ_{t=6}^{∞} δ^t · 1
SLIDE 17

Example 3: Cost of deviating in Round 4 (2)

By cooperating throughout, the column player could have expected Σ_{t=0}^{∞} δ^t · 3, which means he forfeited

  Σ_{t=0}^{∞} δ^t · 3 − ( Σ_{t=0}^{4} δ^t · 3 + δ^5 · 5 + Σ_{t=6}^{∞} δ^t · 1 ) = −2δ^5 + 2 Σ_{t=6}^{∞} δ^t

by deviating from T. If δ ≠ 0,

  −2δ^5 + 2 Σ_{t=6}^{∞} δ^t > 0
  ⇔ −1 + Σ_{t=1}^{∞} δ^t > 0          (divide by 2δ^5)
  ⇔ Σ_{t=0}^{∞} δ^t − 2 > 0
  ⇔ 1/(1 − δ) − 2 > 0
  ⇔ δ > 1/2.

Thus, if δ > 1/2, the column player forfeits payoff by deviating from T.
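Numerically, the forfeited amount −2δ^5 + 2 Σ_{t=6}^{∞} δ^t indeed changes sign at δ = 1/2 (a check of mine, not from the slides; the geometric tail is truncated at a long finite horizon):

```python
def forfeit(delta, horizon=5000):
    """Payoff the column player gives up by defecting after Round 4
    against the trigger strategy T, truncated at `horizon` rounds."""
    return -2 * delta**5 + 2 * sum(delta**t for t in range(6, horizon))

print(forfeit(0.6) > 0, forfeit(0.4) > 0)   # costly above 1/2, profitable below
```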

SLIDE 18

Analysis of trigger strategy generalised to deviation in round N

Player starts to defect at Round N. By doing so he expects a payoff of

  Σ_{t=0}^{N−1} δ^t · 3 + δ^N · 5 + Σ_{t=N+1}^{∞} δ^t · 1

By playing C throughout, the column player could have expected Σ_{t=0}^{∞} δ^t · 3, which means he forfeited

  δ^N · (3 − 5) + Σ_{t=N+1}^{∞} δ^t · (3 − 1) = −2δ^N + 2 Σ_{t=N+1}^{∞} δ^t

by deviating in round N (and further).

SLIDE 19

Analysis of trigger strategy generalised to deviation in round N (continued)

If δ ≠ 0, the deviator forfeits payoff exactly when

  −2δ^N + 2 Σ_{t=N+1}^{∞} δ^t > 0
  ⇔ −1 + Σ_{t=N+1}^{∞} δ^(t−N) > 0      (divide by 2δ^N)
  ⇔ −1 + δ Σ_{t=0}^{∞} δ^t > 0
  ⇔ −1 + δ/(1 − δ) > 0
  ⇔ δ > 1/2.

Thus, if δ > 1/2, every player forfeits payoff by deviating from T.

SLIDE 20

Example 4: An alternating trigger strategy for the repeated PD

Yet another subgame perfect equilibrium. Informal definition of strategies: A and B tacitly agree to alternate actions, i.e. (C, D), (D, C), (C, D), . . . . If one of them deviates, the other party plays D forever. (Consequently, the party who originally deviated plays D forever thereafter as well.) Notice the CKR aspect! Let A be the strategy that plays C in Round 1. Let B be the other strategy.

• Claim. The strategy profile (A, B) is a subgame perfect equilibrium in G∗(δ), provided the probability of continuation, δ, is sufficiently large. An analysis of this situation and a proof of this claim can be found in (Peters, 2008), pp. 104-105.*

* H. Peters (2008): Game Theory: A Multi-Leveled Approach. Springer, ISBN: 978-3-540-69290-4.

SLIDE 21

Generalisation of trigger strategies

The idea of trigger strategies can be generalised.

• Both parties A, B reside in a strategy pair that consists of patterns of repeated action profiles of the stage game PD.
• Every convex combination(a) of payoffs

  α_1 (3, 3) + α_2 (0, 5) + α_3 (5, 0) + α_4 (1, 1)

  can be established by smartly picking appropriate strategy patterns. E.g.: "We play 4 times (C, C). Then we play 7 times (C, D), (D, C), . . . ".
• As long as these limiting average payoffs exceed payoff(D, D) for each player (which is 1), associated trigger strategies can be formulated that lead to these payoffs and trigger eternal play of (D, D) after a deviation.
• For δ high enough, such strategies again form a SGP Nash equilibrium.

(a) Meaning α_i ≥ 0 and α_1 + α_2 + α_3 + α_4 = 1.
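A quick check of the limiting-average idea (my own code; I read the slide's example as the 18-round pattern "four times (C, C), then seven times the pair (C, D), (D, C)" — an assumption, since the slide is terse):

```python
# Stage-game payoff pairs of the PD: (row payoff, column payoff).
STAGE = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
         ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def limiting_average(pattern):
    """Limiting average payoffs of a pattern of action profiles repeated
    forever (equal to the plain average over one period)."""
    n = len(pattern)
    row = sum(STAGE[prof][0] for prof in pattern) / n
    col = sum(STAGE[prof][1] for prof in pattern) / n
    return row, col

pattern = [('C', 'C')] * 4 + [('C', 'D'), ('D', 'C')] * 7
row, col = limiting_average(pattern)
print(row, col)    # both 47/18, comfortably above the minmax payoff 1
```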

SLIDE 22

Folk theorem for SGP Nash equilibria in the repeated PD

1. Feasible payoffs (striped): payoff combos that can be obtained by jointly repeating patterns of actions (more accurately: patterns of action profiles).
2. Enforceable payoffs (shaded): everyone resides above minmax.

For every payoff pair (x, y) in (1) ∩ (2), there is a δ(x, y) ∈ (0, 1), such that for all δ ≥ δ(x, y) the payoff (x, y) can be obtained as the limiting average in a subgame perfect equilibrium of G∗(δ).

[Figure: the feasible and enforceable payoff regions of the PD; both axes run 1–5, with the point (3, 3) marked.]
SLIDE 23

Family of Folk Theorems

There actually exist many Folk Theorems.

• Horizon. May the game be repeated infinitely (as in our case) or is there an upper bound to the number of plays?
• Information. Do players act on the basis of CKR (present case), or are certain parts of the history hidden?
• Reward. Do players collect their payoff through a discount factor (present case) or through average rewards?
• Equilibrium. Do we consider Nash equilibria, or other forms of equilibria, such as so-called ε-Nash equilibria or so-called correlated equilibria?
• Subgame perfectness. Do we consider subgame perfect equilibria (present case) or just Nash equilibria?

SLIDE 24

General theorems

For the prisoners' dilemma game PD we have established that each player always playing D is a subgame perfect equilibrium of the repeated game based on PD. This is the one-stage deviation principle in action. The following result follows from exactly the same logic.

• Theorem. Let G be an arbitrary (not necessarily finite) n-person game, and let the strategy combination s be a Nash equilibrium of the stage game G. Let δ ∈ (0, 1). Then each player i playing s_i at every moment t is a subgame perfect equilibrium in G∗(δ).
• Theorem (Folk theorem for subgame perfect equilibrium). Let (p, q) be a Nash equilibrium of the stage game G, and let (x, y) be a feasible payoff pair such that x > pAq and y > pBq. Then there is a δ(x, y) ∈ (0, 1) such that for every δ ≥ δ(x, y) there is a subgame perfect equilibrium in the repeated game G∗(δ) with limiting average payoffs (x, y).

SLIDE 25

Existence of non-SGP Nash equilibria in repeated games

• We have seen that many subgame perfect equilibria exist for repeated games. (At least for repeated games where both players have all information, the horizon is infinite, and further assumptions hold.)
• What about the existence of non-SGP Nash equilibria in repeated games, i.e., equilibria that are not necessarily subgame perfect?
• Without the requirement of subgame perfection, deviations can be punished more severely: the equilibrium does not have to induce a Nash equilibrium in the punishment subgame.
• We will now consider the consequences of relaxing the subgame perfection requirement for a Nash equilibrium in an infinitely repeated game.

SLIDE 26

Example 5: A repeated game with a non-SGP NE

Some game (row = You, column = Other):

                 Left (L)   Right (R)
    Up (U)       (1, 1)     (0, 0)
    Down (D)     (0, 0)     (−1, 4)

1. For you, U is a dominating strategy.
2. The pure profile (U, L) is the only mixed strategy profile that is a Nash equilibrium. (Hence, (U, L), (U, L), . . . is a SGP-NE in the repeated game.)
3. Define trigger strategies (T1, T2) such that the pattern [(D, R), (U, L)³]∗ is played indefinitely. (So we have periods of length 4.) If this pattern is violated:
   • The fallback strategy of the row player (you) is the mixed strategy (0.8, 0.2)∗.
   • The fallback strategy of the column player is the pure strategy R∗.
   This combination of fallback strategies is not a Nash equilibrium. (Cf. 2.)

SLIDE 27

Example 5: A repeated game with a non-SGP NE (2)

Some game (row = You, column = Other):

                 Left (L)   Right (R)
    Up (U)       (1, 1)     (0, 0)
    Down (D)     (0, 0)     (−1, 4)

• Claim. The combination of trigger strategies (T1, T2) is a Nash equilibrium for some δ ∈ (0, 1).
• T1 ⇒ T2. If you play (the non-degenerate part of) T1, then the column player cannot do much different than T2, for T2 is a best response to T1.
• T2 ⇒ T1. If at all, the best moment for you to deviate is at D, for that would give you an incidental advantage of 1. After that your opponent falls back to R∗. Total payoff for you: 0 (for cheating) + 0 + · · · + 0 (for being punished by your opponent).

SLIDE 28

Example 5: A repeated game with a non-SGP NE (3)

Some game (row = You, column = Other):

                 Left (L)   Right (R)
    Up (U)       (1, 1)     (0, 0)
    Down (D)     (0, 0)     (−1, 4)

• T2 ⇒ T1 (continued). Total payoff for the row player: 0 (for cheating) + 0 + · · · + 0 (for being punished by the column player). Payoff for the row player if he was loyal:

  (−1 + 1·δ + 1·δ^2 + 1·δ^3) + (−1·δ^4 + 1·δ^5 + 1·δ^6 + 1·δ^7) + . . .
  = Σ_{k=0}^{∞} δ^k − 2 Σ_{k=0}^{∞} δ^(4k)
  = 1/(1 − δ) − 2/(1 − δ^4)

  This expression is positive only if δ ⪆ 0.54. (Solve a third-degree equation.)
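The threshold "δ ⪆ 0.54" can be recovered numerically (a sketch of mine). Using 1 − δ^4 = (1 − δ)(1 + δ + δ^2 + δ^3), the expression 1/(1 − δ) − 2/(1 − δ^4) is positive iff δ^3 + δ^2 + δ − 1 > 0, and a bisection locates the root of that cubic:

```python
def loyal_minus_cheat(delta):
    """Loyal payoff minus the deviator's payoff: 1/(1-d) - 2/(1-d^4)."""
    return 1 / (1 - delta) - 2 / (1 - delta**4)

# Bisect the cubic d^3 + d^2 + d - 1 on (0, 1); it is increasing there.
lo, hi = 0.0, 1.0
for _ in range(60):
    mid = (lo + hi) / 2
    if mid**3 + mid**2 + mid - 1 > 0:
        hi = mid
    else:
        lo = mid

print(round(hi, 4))    # the root, roughly 0.5437
```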

SLIDE 29

Retaliation in repeated games: playing the minmax value

In the previous game, your "Plan B" was to play a mixed strategy (0.8, 0.2). Questions:

• Why may a mixed strategy (0.8, 0.2) be considered a punishment?
• Is mixed strategy (0.8, 0.2) the most severe punishment?
• If so, why?

SLIDE 30

Retaliation in repeated games: playing the minmax value

Payoff matrix (row = You, column = Other; actions 0, 1, 2):

           0        1        2
    0    (4, 9)   (7, 9)   (5, 8)
    1    (6, 7)   (8, 7)   (4, 8)
    2    (7, 9)   (5, 7)   (6, 7)

First for pure strategies.

• Which action of you (the row player) minimises the maximum payoff of your opponent?
• This is the pure minmax: minimise the maximum payoff, which can be found by "scanning blue rows" (= payoffs of the opponent). It turns out that Action 1 keeps the payoff of your opponent below 9.
• Similarly, if your opponent wishes to punish you, he scans "green columns" to minimise your payoff ⇒ Action 2.
• Alert 1: minmax may be ≠ maxmin (= security level strategy).
• Alert 2: mixed minmax may be < pure minmax.
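The row/column scan can be written out (my own code; the cell values are the slide's bimatrix, with actions indexed 0, 1, 2):

```python
# CELL[i][j] = (your payoff, opponent's payoff) when you play i, he plays j.
CELL = [[(4, 9), (7, 9), (5, 8)],
        [(6, 7), (8, 7), (4, 8)],
        [(7, 9), (5, 7), (6, 7)]]

# Your pure minmax action: minimise, over your rows, the opponent's
# maximum payoff in that row ("scanning blue rows").
row_caps = [max(cell[1] for cell in row) for row in CELL]
your_punish = min(range(3), key=lambda i: row_caps[i])

# His pure minmax action: minimise, over his columns, your maximum
# payoff in that column ("scanning green columns").
col_caps = [max(CELL[i][j][0] for i in range(3)) for j in range(3)]
his_punish = min(range(3), key=lambda j: col_caps[j])

print(your_punish, his_punish)    # Action 1 for you, Action 2 for him
```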

SLIDE 31

Minmax: payoff surface of the opponent (mixed strategies)

[Figure: the opponent's payoff surface as a function of your mix and his mix; your mix pushes his attainable payoff range to its minimum.]

SLIDE 32

The minmax value as a threat

• Your (possibly mixed) minmax strategy can be used as a threat to deter your opponent from deviating from the tacitly agreed path.
• Cf. the threat: "If you're not complying with our normal pattern of actions, I am going to minmax you."
• By actually executing this threat you might harm yourself as well ⇒ non-SGP.
• For finite two-person strictly competitive games (such as zero-sum games), minmax = maxmin. (Finite in the sense that the arsenal of playable actions, hence the payoff matrix, is finite.)

SLIDE 33

Example 5: A repeated game with a non-SGP NE ( 4 )

Example 5: A repeated game with a non-SGP NE (4)

Some game (row = You, column = Other):

            L         R
    U     (1, 1)    (0, 0)
    D     (0, 0)    (−1, 4)

• Your opponent can punish you maximally by playing R∗. How you can punish your opponent is less obvious.
• If you play D∗, your opponent will earn 4 forever. If you play U∗, your opponent will earn 1 forever.
• It is possible to punish your opponent even more by becoming unpredictable (within CKR!). Given your mixed strategy (u, d), your opponent maximises his payoff by choosing the right mix (l, r):

  max_{l,r} [ ul · 1 + dr · 4 ]
  = max_l [ ul + 4 (1 − u)(1 − l) ]
  = max_l [ (5u − 4) l + 4 − 4u ]

• If 5u − 4 = 0, it does not matter what your opponent chooses for l; his expected payoff always equals 4 − 4(4/5) = 4/5.
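A brute-force check of this calculation (my own code): for each of your mixes u, the opponent's best response is pure in l (the objective is linear in l), so his best payoff is the maximum over l ∈ {0, 1} of (5u − 4)l + 4 − 4u; minimising that over a grid of u recovers u = 4/5 and the mixed-minmax value 4/5.

```python
def opp_best_payoff(u):
    """Opponent's best-response payoff against your mix (u on U, 1-u on D):
    max over l of (5u - 4) * l + 4 - 4u (linear in l, so pure l suffices)."""
    return max((5 * u - 4) * l + 4 - 4 * u for l in (0.0, 1.0))

grid = [u / 1000 for u in range(1001)]
best_u = min(grid, key=opp_best_payoff)
print(best_u, opp_best_payoff(best_u))    # 0.8 and about 4/5
```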

SLIDE 34

Example 5: A repeated game with a non-SGP NE ( 5 )

Draw a picture of the payoff surface of the opponent.

• If 5u − 4 = 0, it does not matter what your opponent chooses for l. He expects 4 − 4(4/5) = 4/5.
• If 5u − 4 > 0, your opponent will play l = 1, and expects to earn (5u − 4) + 4 − 4u = u, which is > 4/5.
• If 5u − 4 < 0, your opponent will play l = 0, and expects to earn 4 − 4u, which, again, is > 4/5.

These calculations are done by hand, and do not easily generalise to higher dimensions.

[Figure: the opponent's best-response payoff over u and l, with minimum 4/5 at u = 4/5.]
SLIDE 35

Literature

Literature on game theory is vast. For me, the following sources were helpful and (no less important) offered different perspectives on repeated games.

* C.M. Gintis (2009): Game Theory Evolving: A Problem-Centered Introduction to Modeling Strategic Interaction. Second Edition. Princeton University Press. Ch. 9: Repeated Games.
* H. Peters (2008): Game Theory: A Multi-Leveled Approach. Springer, ISBN: 978-3-540-69290-4. Ch. 8: Repeated games.
* K. Leyton-Brown & Y. Shoham (2008): Essentials of Game Theory: A Concise, Multidisciplinary Introduction. Morgan and Claypool Publishers. Ch. 6: Repeated and Stochastic Games.
* S.P. Hargreaves Heap & Y. Varoufakis (2004): Game Theory: A Critical Text. Routledge. Ch. 5: The Prisoners' Dilemma: the riddle of co-operation and its implications for collective agency.
* J. Ratliff (1997): Graduate-Level Course in Game Theory. Lecture notes, Dept. of Economics, University of Arizona. (Available through the web but not officially published.) Sec. 5.3: A Folk Theorem Sampler.
* M.J. Osborne & A. Rubinstein (1994): A Course in Game Theory. MIT Press. Ch. 8: Repeated games.

SLIDE 36

What next?

Now that we know that infinitely many equilibria exist in repeated games (an embarrassment of richness), there are a number of ways in which we may proceed.

• Gradient Dynamics. Approximate NE of single-shot games (stage games) through gradient ascent (hill-climbing).
• Reinforcement Learning. Agents simply execute the action(s) with maximal rewards in the past.
• No-regret learning. Agents execute the action(s) with maximal virtual rewards in the past.
• Fictitious Play. Sample the actions of opponent(s) and play a best response.
