slide-1

Background and motivation Preliminaries The core scheme Learning with noisy feedback

LEARNING IN GAMES WITH NOISY PAYOFF OBSERVATIONS Mario Bravo Panayotis Mertikopoulos

Universidad de Santiago de Chile CNRS – Laboratoire d’Informatique de Grenoble

ADGO 2016 – Santiago, January 28, 2016

P. Mertikopoulos — CNRS, Laboratoire d’Informatique de Grenoble

slide-2

Outline ▸ Background and motivation ▸ Preliminaries ▸ The core scheme ▸ Learning with noisy feedback

slide-3

Learning in Games

The basic context:

▸ Decision-making: agents choose actions, each seeking to optimize some objective.

Example: a trader chooses asset proportions in an investment portfolio.

▸ Payoffs: rewards are determined by the decisions of all interacting agents.

Example: asset placements determine returns.

▸ Learning: the agents adjust their decisions and the process continues.

Example: change asset proportions based on performance.

When does the agents’ learning process lead to a “reasonable” outcome?

slide-7

Motivation

▸ In many applications, decisions taken at very fast time-scales.

Example: in high-frequency trading (HFT), decision times are on the µs scale.

▸ Regulations/physical constraints limit changes in decisions.

Example: the SEC requires small differences in HFT orders to reduce volatility.

▸ Fast time-scales have adverse effects on quality of feedback.

Example: volatility estimates are highly inaccurate at the µs time-scale.

slide-8

The Flash Crash of 2010

A trillion-dollar NYSE crash (and partial rebound) that lasted roughly 36 minutes (14:32–15:08). Aggressive selling due to imperfect volatility estimates induced a huge drop in liquidity and precipitated the crash (Vuorenmaa and Wang, 2014).

slide-9

What this talk is about:

Examine the robustness of a class of continuous-time learning schemes with noisy feedback.

slide-10

Outline ▸ Background and motivation ▸ Preliminaries ▸ The core scheme ▸ Learning with noisy feedback

slide-13

Game setup

Throughout this talk, we focus on finite games:

▸ Finite set of players: N = {1, . . . , N}
▸ Finite set of actions per player: Ak = {αk,1, αk,2, . . . }
▸ Reward of player k determined by the corresponding payoff function uk∶ ∏k Ak → R:

(α1, . . . , αN) ↦ uk(α1, . . . , αN)

▸ Mixed strategies xk ∈ Xk ≡ ∆(Ak) yield expected payoffs

uk(x1, . . . , xN) = ∑α1 ⋯ ∑αN x1,α1 ⋯ xN,αN uk(α1, . . . , αN)

▸ Strategy profiles: x = (x1, . . . , xN) ∈ X ≡ ∏k Xk
▸ Payoff vector of player k: vk(x) = (vkα(x))α∈Ak, where vkα(x) = vk(α; x−k) is the payoff to the α-th action of player k in the mixed strategy profile x ∈ X.
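The definitions above are easy to make concrete. The sketch below (Python, with a hypothetical 2×2 coordination game as the payoff data) computes the multilinear expected payoff uk(x) and the payoff vector vk(x) by direct enumeration of pure profiles:

```python
import itertools
import numpy as np

# Hypothetical 2x2 coordination game: u[k][a1][a2] is player k's payoff
# at the pure profile (a1, a2).  Purely illustrative data.
u = [np.array([[1.0, 0.0], [0.0, 1.0]]),   # player 1
     np.array([[1.0, 0.0], [0.0, 1.0]])]   # player 2

def expected_payoff(k, x):
    """u_k(x_1, ..., x_N): payoffs at pure profiles weighted by the product of mixed strategies."""
    total = 0.0
    for profile in itertools.product(*(range(len(xk)) for xk in x)):
        prob = np.prod([x[j][profile[j]] for j in range(len(x))])
        total += prob * u[k][profile]
    return float(total)

def payoff_vector(k, x):
    """v_k(x): payoff of each pure action alpha of player k against x_{-k}."""
    vecs = []
    for alpha in range(len(x[k])):
        e = np.zeros(len(x[k])); e[alpha] = 1.0   # pure strategy alpha
        x_dev = list(x); x_dev[k] = e
        vecs.append(expected_payoff(k, x_dev))
    return np.array(vecs)

x = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]   # uniform mixed profile
```

Note the identity uk(x) = ⟨vk(x)∣xk⟩, which the assertions below also check.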

slide-18

Regret

Suppose players follow a trajectory of play x(t) (based on some learning/adjustment rule, to be discussed later). How does xk(t) compare on average to the “best possible” action αk ∈ Ak?

Regk(t) = maxα∈Ak ∫₀ᵗ [uk(α; x−k(s)) − uk(x(s))] ds

Definition: x(t) leads to no regret if Regk(t) = o(t) for all k ∈ N, i.e. if every player’s average regret is non-positive in the long run.

NB: unilateral definition, no need for a game
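As a numerical illustration of the regret integral, the sketch below evaluates Reg₁(t) for player 1 of a 2×2 zero-sum game along a sampled trajectory, replacing the time integral by a Riemann sum; the payoff matrix and trajectories are illustrative assumptions, not data from the talk:

```python
import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # u_1(a, b) = A[a, b] (matching pennies)

def regret1(traj1, traj2, dt):
    """Reg_1(t) = max_a sum_s [u_1(a; x_2(s)) - u_1(x(s))] * dt (Riemann sum)."""
    pure = traj2 @ A.T                          # pure[s, a] = u_1(a; x_2(s))
    mixed = np.einsum('sa,sa->s', traj1, pure)  # mixed[s] = u_1(x(s))
    return float(np.max(np.sum(pure - mixed[:, None], axis=0)) * dt)

n_steps, dt = 1000, 0.01
uniform = np.tile([0.5, 0.5], (n_steps, 1))
reg = regret1(uniform, uniform, dt)   # both players uniform: zero regret
```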

slide-19

Dominated strategies

Definition

A (pure) strategy α ∈ Ak is dominated by β ∈ Ak if vkα(x) < vkβ(x) for all x ∈ X. More generally, a mixed strategy p ∈ Xk is dominated by q ∈ Xk if ⟨vk(x)∣p − q⟩ < 0 for all x ∈ X.

Variants: weakly/iteratively dominated strategies are defined analogously.

slide-20

Nash equilibrium

Definition

A strategy profile x∗ ∈ X is a Nash equilibrium if

uk(x∗k ; x∗−k) ≥ uk(xk; x∗−k) for all xk ∈ Xk, k ∈ N, (NE)

i.e. when no player has an incentive to deviate from x∗. Variants:

▸ Pure: x∗ is a corner of X (the support of x∗ is a singleton)
▸ Strict: (NE) holds as an equality iff xk = x∗k for all k ∈ N; equivalently, x∗ is strict iff x∗ is pure and uk(α; x∗−k) < uk(x∗) for all α ∉ supp(x∗k)
▸ Restricted: (NE) holds for all xk whose support is contained in that of x∗k (like Nash equilibrium, but players are not allowed to deviate to actions not present in x∗)

strict ⊆ pure ⊆ Nash ⊆ restricted

slide-21

Some basic questions

▸ Does x(t) lead to no regret?
▸ Are dominated strategies eliminated along x(t)?
▸ What are the possible limit points of x(t)?
▸ Does x(t) converge to Nash equilibrium?
▸ If not, do time averages converge?
▸ ⋯

slide-22

Outline ▸ Background and motivation ▸ Preliminaries ▸ The core scheme ▸ Learning with noisy feedback

slide-24

Exponential reinforcement learning

A well-known strategy adjustment process is exponential learning:

ykα(t) = ∫₀ᵗ vkα(x(s)) ds        xkα(t) = exp(ykα(t)) / ∑β exp(ykβ(t))        (XL)

In words:

▸ Score actions based on their cumulative payoffs.
▸ Assign probability weights exponentially proportionally to these scores (exponential reinforcement of the highest-scoring strategies).

Continuous-time analogue of the EXP3/EWA class of online learning algorithms (Vovk, 1990; Littlestone and Warmuth, 1994; Sorin, 2009; …)
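A crude Euler discretization of (XL) for a single player facing a fixed payoff vector (an illustrative assumption; in a game, v would be recomputed as vk(x(t)) at every step):

```python
import numpy as np

v = np.array([1.0, 0.5, 0.0])    # fixed payoff vector (illustrative)
dt, n_steps = 0.01, 2000
y = np.zeros_like(v)             # cumulative scores y_ka
for _ in range(n_steps):
    y += v * dt                  # ydot_ka = v_ka
x = np.exp(y - y.max())          # logit map (numerically stabilized softmax)
x /= x.sum()
```

As expected, the mass concentrates exponentially fast on the highest-scoring action.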

slide-26

Links with evolutionary game theory

Trajectories of play under (XL) follow the replicator dynamics (Taylor & Jonker, 1978):

ẋkα = xkα [vkα(x) − ∑β xkβ vkβ(x)]        (RD)

The most widely studied dynamics in evolutionary game theory; known properties include:

▸ Dominated strategies become extinct under interior solutions of (RD)
▸ Nash equilibria are stationary under (RD); stationary points of (RD) are restricted equilibria
▸ Limit points of interior solutions are Nash equilibria
▸ Strict Nash equilibria are locally stable and attracting
▸ Convergence to restricted equilibria in potential games
▸ ⋯

slide-27

An alternative characterization of exponential learning

The logit map yα ↦ e^yα / ∑β e^yβ can be equivalently characterized as

y ↦ arg maxx∈∆ {⟨y∣x⟩ − h(x)}

where h(x) = ∑β xβ log xβ is the negative Gibbs entropy.

In words: agents play mixed strategies that maximize their expected cumulative payoff minus a penalty.

Interpretation: the entropic penalty promotes exploration (contrast with greedily playing arg max ⟨y∣x⟩).
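This variational characterization is easy to check numerically: the sketch below compares the value attained by the logit map against random points of the simplex, and against the closed-form maximum log ∑β e^yβ (all concrete numbers are illustrative):

```python
import numpy as np

# Check that softmax(y) maximizes <y|x> - h(x) over the simplex,
# with h(x) = sum_b x_b log x_b.
rng = np.random.default_rng(1)
y = np.array([0.3, -1.2, 0.8])

def objective(x, y):
    safe = np.where(x > 0, x, 1.0)                 # convention: 0 log 0 = 0
    return float(y @ x - np.sum(x * np.log(safe)))

logit = np.exp(y) / np.exp(y).sum()
best = objective(logit, y)
# compare against random points of the simplex
vals = [objective(x, y) for x in rng.dirichlet(np.ones(3), size=2000)]
```

The maximum value is exactly the log-sum-exp of the scores, the convex conjugate of h.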

slide-29

Reinforcement learning via regularization

A general reinforcement principle:

▸ Score actions by keeping track of their cumulative payoffs over time.
▸ Play an “approximate” best response to the resulting score vector.

Formally:

ẏk = vk(x)        xk(t) = Qk(yk(t))        (RL)

where the approximate best response (or choice map) Qk is defined as

Qk(yk) = arg maxxk∈Xk {⟨yk∣xk⟩ − hk(xk)}

for some penalty function hk∶ Xk → R.

Assumptions for h: continuous on X; smooth on the interiors of the faces of X; strongly convex:

h(tx + (1 − t)x′) ≤ t h(x) + (1 − t) h(x′) − ½ K t(1 − t) ∥x − x′∥²    for all t ∈ [0, 1]

slide-32

Examples

  • Ex. 1. Entropic penalty: h(x) = ∑β xβ log xβ

Induces the logit map Gα(v) = exp(vα) / ∑β exp(vβ)

  • Ex. 2. Quadratic penalty: h(x) = ½ ∑β x²β

Induces the closest point projection map Π(v) = arg minx∈∆ ∥v − x∥ = proj∆ v

Important dichotomy: h is steep ❀ im Q = ∆○; h is non-steep ❀ im Q = ∆
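For the quadratic penalty, the choice map is the Euclidean projection onto the simplex; a standard sort-based implementation (a sketch, not code from the talk) is:

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection of v onto the probability simplex: the
    closest-point map induced by the quadratic penalty."""
    u = np.sort(v)[::-1]                     # coordinates in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]   # largest feasible index
    theta = (1.0 - css[rho]) / (rho + 1)     # shift making the positive part sum to 1
    return np.maximum(v + theta, 0.0)
```

Note that the projection can land on the boundary of the simplex (some coordinates exactly 0), which is precisely the non-steep side of the dichotomy above.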

slide-34

Examples of dynamics

  • Ex. 1. The entropic penalty leads to exponential reinforcement learning:

ẏkα = vkα(x)        xkα = exp(ykα) / ∑β exp(ykβ)        (XL)

Trajectories of (XL) satisfy the replicator dynamics.

  • Ex. 2. The quadratic penalty h(x) = ½ ∑β x²β leads to projected reinforcement learning:

ẏk = vk(x)        x = projX y        (PL)

Closely related to the projection dynamics of Friedman (1991):

ẋkα = vkα(x) − ∣supp(xk)∣⁻¹ ∑β∈supp(xk) vkβ(x)  if α ∈ supp(xk);  ẋkα = 0 otherwise        (PD)

The x-orbits of (PL) satisfy (PD) on an open dense set of times (M & Sandholm, 2015).

slide-35

Example portraits

[Phase portrait: projection dynamics (q = 2), h(x) = ½ ∑β x²β]

slide-36

Example portraits

[Phase portrait: q-replicator dynamics (q = 3/2)]

slide-37

Example portraits

[Phase portrait: replicator dynamics (q = 1), h(x) = ∑β xβ log xβ]

slide-38

Example portraits

[Phase portrait: log-barrier dynamics (q = 0), h(x) = −∑β log xβ]

slide-39

Extinction of Dominated Strategies

Recall:

▸ pk is dominated by p′k if ⟨vk(x)∣pk − p′k⟩ < 0 for all x ∈ X.
▸ A strategy pk ∈ Xk becomes extinct along x(t) if min{xkα(t) ∶ α ∈ supp(pk)} → 0 as t → ∞

Theorem (M & Sandholm, 2015)

Dominated strategies become extinct under the reinforcement learning dynamics (RL).

slide-44

Stability and convergence analysis

Recall:

▸ x∗ is a Nash equilibrium iff uk(x∗) ≥ uk(xk; x∗−k) for all xk ∈ Xk, k ∈ N.
▸ A Nash equilibrium is strict if the above inequality is strict for all xk ≠ x∗k.

Theorem (M & Sandholm, 2015)

Let x(t) = Q(y(t)) be an orbit of (RL).

  • I. If x(t) → x∗, then x∗ is a Nash equilibrium.
  • II. x∗ is stable and attracting iff it is a strict Nash equilibrium.
  • III. x(t) converges to Nash equilibrium in potential games.

Special case: the EGT “folk theorem” for the replicator dynamics

slide-45

Convergence to Equilibrium

[Phase portrait figure]

slide-47

Outline ▸ Background and motivation ▸ Preliminaries ▸ The core scheme ▸ Learning with noisy feedback

slide-48

The model

Noisy payoff observations lead to the stochastically perturbed learning model

dYk = vk(X) dt + dZk        Xk = Qk(ηk Yk)        (SRL)

where:

▸ the noise process Zk is an Itô martingale (think Brownian motion) with covariance dZkα ⋅ dZℓβ = Σkα,ℓβ dt (noise possibly state-dependent and/or correlated across players and strategies)
▸ ηk ≡ ηk(t) is a (possibly variable) learning parameter, introduced for flexibility
▸ the rest, as before

Assumptions for the noise (Z) and the learning parameter (η):

▸ supt ∥Σ(t)∥ < ∞
▸ η(t) smooth, nonincreasing, and η(t) = ω(1/t) (i.e. limt→∞ t η(t) = ∞)

slide-49

Evolution of mixed strategies

How do mixed strategies evolve under (SRL)?

Proposition

Suppose that the penalty function of player k is of the form hk(xk) = ∑α θk(xkα) and Zk is a Wiener process. Then, X(t) locally follows the stochastic differential equation

dXkα = (ηk/θ′′kα) [vkα − Θ′′k ∑β vkβ/θ′′kβ] dt
 + (ηk/θ′′kα) [σkα dWkα − Θ′′k ∑β (σkβ/θ′′kβ) dWkβ]
 + (η̇k/ηk) (1/θ′′kα) [θ′kα − Θ′′k ∑β θ′kβ/θ′′kβ] dt
 − ½ (1/θ′′kα) [θ′′′kα U²kα − Θ′′k ∑β (θ′′′kβ/θ′′kβ) U²kβ] dt,

where: a) Θ′′k = (∑β 1/θ′′kβ)⁻¹, and b) U²kα = (ηk/θ′′kα)² [σ²kα (1 − Θ′′k/θ′′kα)² + ∑β≠α (Θ′′k/θ′′kβ)² σ²kβ].

AWFUL!

slide-51

Examples

The entropic penalty h(x) = ∑α xα log xα yields the stochastic replicator dynamics

dXkα = ηk Xkα [vkα − ∑β Xkβ vkβ] dt    (drift)
 + ηk Xkα [σkα dWkα − ∑β σkβ Xkβ dWkβ]    (noise)
 + (η̇k/ηk) Xkα [log Xkα − ∑β Xkβ log Xkβ] dt    (due to η̇)
 + ½ η²k Xkα [σ²kα (1 − 2Xkα) − ∑β σ²kβ Xkβ (1 − 2Xkβ)] dt.    (Itô)
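The SDE above can be sampled without ever discretizing it directly: simulate the score process dY = v dt + dZ by Euler–Maruyama and push it through the logit map, exactly as (SRL) prescribes. The sketch below does this for one player with a fixed payoff vector and constant σ (all concrete numbers are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
v = np.array([1.0, 0.0])                    # action 0 dominates action 1
sigma, dt, n_steps = 0.5, 0.01, 20_000
Y = np.zeros_like(v)                        # cumulative (noisy) scores
for _ in range(n_steps):
    dW = rng.normal(size=v.size) * np.sqrt(dt)   # Brownian increments
    Y += v * dt + sigma * dW                     # dY = v dt + dZ
eta = (1.0 + n_steps * dt) ** -0.5          # decreasing parameter eta(t) = (1 + t)^{-1/2}
z = eta * Y
X = np.exp(z - z.max()); X /= X.sum()       # X = Q(eta * Y), logit choice map
```

Despite the noise, the mass ends up concentrated on the dominant action, previewing the extinction result at the end of the talk.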

slide-52

Examples

The quadratic penalty h(x) = ½ ∑α x²α yields the stochastic projection dynamics

dXkα = [vkα − ∣supp(Xk)∣⁻¹ ∑β∈supp(Xk) vkβ] dt    (drift)
 + [σkα dWkα − ∣supp(Xk)∣⁻¹ ∑β∈supp(Xk) σkβ dWkβ]    (noise)
 + (η̇k/ηk) [Xkα − ∣supp(Xk)∣⁻¹] dt.    (due to η̇)

NB: There is no Itô correction, but X(t) follows this SDE only locally

slide-53

Examples

[Figures: evolution of play under (SRL) with logit and projection choice maps (fixed σ)]

slide-55

Consistency and regret

(XL) leads to no regret (Sorin, 2009); in fact, so does (RL) (Kwon & M, 2014). Is this still true in the presence of noise?

Yes, provided that the learning parameter η(t) tends to zero.

Theorem (Bravo & M, 2015)

If a player runs (SRL) with η(t) such that limt→∞ η(t) = 0, then

Reg(t) ≤ Ω/η(t) + (σ²max ∣A∣ / (2K)) ∫₀ᵗ η(s) ds + O(σmax √(t log log t))    (a.s.),

where Ω and K are constants related to the player’s penalty function.

Corollary

If η(t) ∼ t⁻ᵞ, the optimal regret bound is obtained for γ = 1/2 and is of order O(√(t log log t)); the subleading term is 2σmax √(Ω∣A∣ t / K).

slide-56

Proof

Sketch of proof.

▸ Introduce the (primal-dual) Fenchel coupling F(x, y) = h(x) + h∗(y) − ⟨y∣x⟩
▸ Fix some test strategy p ∈ X and consider the rate-adjusted coupling H(t) = (1/η(t)) F(p, η(t)Y(t))
▸ Use Itô’s lemma to calculate dH(t)
▸ Bound each of the resulting terms (law of the iterated logarithm for the noise, strong convexity for the Itô correction, etc.)
▸ Maximize over all p ∈ X to obtain the bound on the regret.
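For the entropic penalty, the Fenchel coupling has a closed form: h∗(y) = log ∑β e^yβ, and F(x, y) coincides with the Kullback–Leibler divergence DKL(x ∥ Q(y)), so F ≥ 0 with equality iff x = Q(y). A numerical sketch (all concrete values are illustrative):

```python
import numpy as np

def fenchel_coupling(x, y):
    """F(x, y) = h(x) + h*(y) - <y|x> for the entropic penalty
    h(x) = sum_b x_b log x_b, with conjugate h*(y) = log sum_b exp(y_b)."""
    h = float(np.sum(x * np.log(x)))
    h_star = float(np.log(np.sum(np.exp(y))))
    return h + h_star - float(np.dot(y, x))

x = np.array([0.2, 0.3, 0.5])     # a test strategy
y = np.array([1.0, -1.0, 0.0])    # a score vector
F = fenchel_coupling(x, y)
Q = np.exp(y) / np.exp(y).sum()   # the logit choice map Q(y)
```

The assertions below check nonnegativity, the KL identity, and that the coupling vanishes exactly at x = Q(y).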

P . Mertikopoulos CNRS – Laboratoire d’Informatique de Grenoble

slide-57
SLIDE 57 Sunday, October 7, 2012


slide-59


Extinction of dominated strategies

Are dominated strategies eliminated under (SRL)?

Yes, with no vanishing parameter assumptions on η(t)

Theorem (Bravo & M, 2015)

If pk ∈ Xk is dominated (even iteratively), then it becomes extinct along X(t) almost surely.

Extinction rate of a pure dominated strategy α ∈ Ak:

▸ If ηk is constant, hk(xk) = ∑β θ(xkβ) and τδ = inf{t > 0 : Xkα(t) < δ}, then

E[τδ] ≤ (Ck − θ′k(δ)) / (ηk mk)  for some Ck > 0, mk > 0

▸ If θk is non-steep, dominated strategies become extinct in finite time (a.s.)
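To see why steepness of the penalty kernel θ matters, one can plug two concrete kernels into the bound E[τδ] ≤ (Ck − θ′k(δ))/(ηk mk); the constants below are hypothetical placeholders, and only the dependence on δ matters:

```python
import math

# Illustration of the extinction-rate bound E[tau_delta] <= (C - theta'(delta)) / (eta * m)
# for two concrete penalty kernels. C, eta, m are hypothetical placeholders.
C, eta, m = 1.0, 0.5, 1.0

def bound(theta_prime, delta):
    return (C - theta_prime(delta)) / (eta * m)

# Steep kernel: entropy theta(x) = x log x, so theta'(x) = 1 + log x -> -inf at 0+
entropy_prime = lambda x: 1 + math.log(x)
# Non-steep kernel: quadratic theta(x) = x**2 / 2, so theta'(x) = x, finite at 0
quad_prime = lambda x: x

for delta in (1e-2, 1e-4, 1e-8):
    print(delta, bound(entropy_prime, delta), bound(quad_prime, delta))
# The entropic bound grows like log(1/delta) -- dominated strategies die out
# only at an exponential rate -- while the quadratic bound stays bounded as
# delta -> 0: for non-steep penalties, extinction happens in finite time.
```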


slide-61


Stability and convergence properties

What is the dynamics’ long-term behavior with regard to Nash equilibria?

Theorem

Let x∗ ∈ X. Then:

▸ If a trajectory X(t) converges to x∗ with positive probability, then x∗ is a Nash equilibrium.
▸ If x∗ is a strict Nash equilibrium, it is stochastically stable and attracting: for all ε > 0 and for every neighborhood U of x∗, there exists a neighborhood U′ ⊆ U of x∗ such that, whenever X(0) ∈ U′, P(X(t) ∈ U for all t ≥ 0 and lim_{t→∞} X(t) = x∗) ≥ 1 − ε.

NB: no vanishing parameter assumptions on η(t)
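A minimal Euler–Maruyama sketch of the attracting property, in a 2×2 coordination game where (A, A) is a strict equilibrium; the learning rate, noise level, step size and horizon below are illustrative choices, not parameters from the talk:

```python
import math, random

# Noisy exponential learning in a 2x2 coordination game:
# u(A,A) = u(B,B) = 1, mismatches pay 0, so (A, A) is a strict equilibrium.
random.seed(0)
eta, sigma, dt, T = 1.0, 0.2, 0.01, 200.0

def logit(u):
    # Probability of action A: softmax of the scaled score difference
    return 1.0 / (1.0 + math.exp(-eta * u))

# Score differences y_A - y_B of the two players; start biased toward (A, A)
u1, u2 = 1.0, 1.0
t = 0.0
while t < T:
    x1, x2 = logit(u1), logit(u2)
    # Payoff difference v_A - v_B = x_opponent_A - x_opponent_B = 2*x_opp - 1
    du1 = (2 * x2 - 1) * dt + sigma * math.sqrt(dt) * random.gauss(0, 1)
    du2 = (2 * x1 - 1) * dt + sigma * math.sqrt(dt) * random.gauss(0, 1)
    u1, u2 = u1 + du1, u2 + du2
    t += dt

print(logit(u1), logit(u2))   # both close to 1: X(t) locks on to (A, A)
```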


slide-63


Long-term time averages

In zero-sum games, the dynamics do not converge to a Nash equilibrium, but their time averages do (Hofbauer et al., 2009; M & Sandholm, 2015). Is this still true for (SRL)?

Yes, provided that the learning parameter η(t) tends to zero.

Theorem (Bravo & M, 2015)

Let G be a zero-sum 2-player game with an interior equilibrium. If both players run (SRL) with vanishing learning parameters (ηk(t) → 0), the time averages

X̄(t) = t⁻¹ ∫₀ᵗ X(s) ds

converge to the Nash set of G.

(Corollary of a more general result linking the time averages of (SRL) to the best-response dynamics)
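The theorem can be illustrated with a rough Euler–Maruyama simulation of (SRL) with the logit choice map in a 2×2 zero-sum game whose interior equilibrium is x1∗ = (2/3, 1/3), x2∗ = (1/2, 1/2); the payoff matrix, noise level, step size and vanishing rate η(t) = (1 + t)^(−1/4) are illustrative choices:

```python
import math, random

# Zero-sum game with player 1's payoff matrix [[1, -1], [-2, 2]]:
# equalizing payoffs gives the interior equilibrium x1* = 2/3, x2* = 1/2
# (probabilities of the first action). The trajectory cycles; we track its
# time average.
random.seed(0)
sigma, dt, T = 0.1, 0.01, 2000.0

u, w = 0.0, 0.0                     # score differences y_H - y_T of players 1, 2
t, avg1, avg2, n = 0.0, 0.0, 0.0, 0
while t < T:
    eta = (1.0 + t) ** -0.25        # vanishing learning parameter
    x1 = 1.0 / (1.0 + math.exp(-eta * u))   # logit choice map
    x2 = 1.0 / (1.0 + math.exp(-eta * w))
    # Payoff differences v_H - v_T induced by the opponent's mixed strategy:
    du = (6 * x2 - 3) * dt + sigma * math.sqrt(dt) * random.gauss(0, 1)
    dw = (4 - 6 * x1) * dt + sigma * math.sqrt(dt) * random.gauss(0, 1)
    u, w = u + du, w + dw
    avg1, avg2, n = avg1 + x1, avg2 + x2, n + 1
    t += dt

avg1, avg2 = avg1 / n, avg2 / n
print(avg1, avg2)                   # time averages, near (2/3, 1/2)
```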


slide-64


Time averages

[Figure: (a) A sample trajectory and its time average. (b) Distribution of time averages at time T.]


slide-65


Concluding remarks

▸ Dichotomy between “converging to a face” (undominated strategies, strict equilibria) and “average” results (regret, time averages, …): constant η is better for the former, vanishing η for the latter
▸ Itô’s formula introduces second-order terms: the same control trade-offs as in discrete time
▸ Some results extend to more general games (e.g. continuous action sets); others are trickier
▸ It is possible to handle more intense noise processes (semimartingale noise, fractional Brownian motion), but the results differ
▸ Other directions?
