Learning and Efficiency in Games
(with Dynamic Population)
Éva Tardos
Cornell
Joint work with Thodoris Lykouris and Vasilis Syrgkanis
Large population games: traffic routing
Traffic subject to congestion delays: cars and packets follow…
advertising auctions
Nash: stable solution, no incentive to deviate. But how did the players find it?
Projects: which project should I try?
Uniform players and fair sharing = congestion game. Unfair sharing and/or different abilities: Vetta utility game.
Daskalakis-Goldberg-Papadimitriou '06: Nash exists, but… finding Nash is PPAD-hard.
[Figure: repeated play over time. At each step $t$, the players choose actions $(a_1^t, a_2^t, \dots, a_n^t)$, which determine the outcome at step $t$.]
[Figure: the same repeated play over time.]
Maybe early on they don't know how to play, or who the other players are…
By later rounds they have a better idea…
Nash equilibrium: stable actions $a$ with no regret: for any alternate strategy $x$, $cost_i(x, a_{-i}) \ge cost_i(a)$.
[Figure: play varies in early rounds ($a^1, a^2, a^3, \dots$), then settles on a fixed profile $a = (a_1, \dots, a_n)$ repeated in every later round: no regret.]
No regret: for any fixed action $x$ (with $d$ options): $\sum_t cost_i(a^t) \le \sum_t cost_i(x, a_{-i}^t) + o(T)$
Regret: $R_i(x,T) = \sum_t cost_i(a^t) - \sum_t cost_i(x, a_{-i}^t)$
Many simple rules ensure $R_i(x,T) \approx \sqrt{T \log d}$ for all $x$: MWU (Hedge), Regret Matching, etc.
[Figure: the play sequence $a^1, \dots, a^T$ with total regret $\le o(T)$: no-regret.]
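To make the MWU (Hedge) claim concrete, here is a minimal sketch of Hedge for $d$ actions with per-round costs in $[0,1]$, with the textbook learning-rate tuning $\eta = \sqrt{\log d / T}$ that yields $O(\sqrt{T \log d})$ regret. All names and the test data are illustrative, not from the talk.

```python
import numpy as np

def hedge_regret(cost_matrix, eta):
    """Run MWU (Hedge) on a T x d matrix of per-round action costs in [0,1].

    Returns the regret against the best fixed action in hindsight.
    """
    T, d = cost_matrix.shape
    weights = np.ones(d)
    total_cost = 0.0
    for t in range(T):
        probs = weights / weights.sum()           # mixed strategy at round t
        total_cost += probs @ cost_matrix[t]      # expected cost this round
        weights *= np.exp(-eta * cost_matrix[t])  # downweight costly actions
    best_fixed = cost_matrix.sum(axis=0).min()    # best single action in hindsight
    return total_cost - best_fixed                # R_i(x*, T)

T, d = 10_000, 5
costs = np.random.default_rng(0).random((T, d))
eta = np.sqrt(np.log(d) / T)                      # standard tuning
print(hedge_regret(costs, eta))                   # small relative to T
```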
Approximate no-regret: for any fixed action $x$ (with $d$ options): $\sum_t cost_i(a^t) \le (1+\epsilon) \sum_t cost_i(x, a_{-i}^t) + O(\log d / \epsilon)$
Regret: $R_i(x,T) = \sum_t cost_i(a^t) - (1+\epsilon) \sum_t cost_i(x, a_{-i}^t)$
Many simple rules ensure $R_i(x,T) \approx O(\log d / \epsilon)$ for all $x$: MWU (Hedge), Regret Matching, etc. (Foster, Li, Lykouris, Sridharan, T '16)
[Figure: the play sequence with total regret bounded independently of $T$: approximate no-regret.]
Example: Rock-Paper-Scissors (payoffs for the row player):
      R    P    S
R     0   −1   +1
P    +1    0   −1
S    −1   +1    0
Nash: play Rock, Paper, Scissors each with probability 1/3.
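As an illustration of how learning can find this equilibrium, here is a sketch of two Hedge learners in self-play on the matrix above. In zero-sum games the time-averaged play of no-regret learners approaches Nash, so the empirical frequencies should come out near (1/3, 1/3, 1/3). The asymmetric starting weights are my choice, to avoid starting exactly at the equilibrium.

```python
import numpy as np

# Row player's payoff matrix; rows/columns = (Rock, Paper, Scissors).
A = np.array([[ 0., -1.,  1.],
              [ 1.,  0., -1.],
              [-1.,  1.,  0.]])

def selfplay_average(T=50_000, eta=0.05):
    w1, w2 = np.array([1., 2., 3.]), np.array([3., 1., 2.])  # asymmetric start
    avg1 = np.zeros(3)
    for _ in range(T):
        p1, p2 = w1 / w1.sum(), w2 / w2.sum()
        avg1 += p1
        u1 = A @ p2            # row player's expected payoffs per action
        u2 = -(A.T @ p1)       # column player's payoffs (zero-sum)
        w1 *= np.exp(eta * u1) # MWU update, maximizing payoff
        w2 *= np.exp(eta * u2)
    return avg1 / T

print(selfplay_average())      # approx [1/3, 1/3, 1/3]
```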
Learning dynamics
Payoffs/utility
Examples
internet routing, advertising auctions
Theorem (Roughgarden-T '02): in any network with continuous, non-decreasing cost functions and small users, the cost of Nash with rates $r_i$ for all $i$ is at most the cost of opt with rates $2r_i$ for all $i$.
Nash equilibrium: stable solution where no player has incentive to deviate.
Price of Anarchy = (cost of worst Nash equilibrium) / ("socially optimal" cost)
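A standard worked example (Pigou's two-link network; my addition, not on the slide) makes both the Price of Anarchy and the bicriteria theorem above concrete:

```latex
% Pigou's example: one unit of traffic, two parallel links with
% delays c_1(x) = 1 (constant) and c_2(x) = x.
\begin{align*}
\mathrm{cost}(\mathrm{Nash}) &= 1
  && \text{(all traffic on link 2, delay $c_2(1) = 1$)}\\
\mathrm{cost}(\mathrm{Opt}) &= \tfrac12 \cdot 1 + \tfrac12 \cdot \tfrac12 = \tfrac34
  && \text{(split the traffic $\tfrac12 / \tfrac12$)}\\
\mathrm{PoA} &= \frac{1}{3/4} = \frac43 .
\end{align*}
% Bicriteria check: opt at double rate 2 routes x on link 2 and 2 - x on
% link 1, with cost x^2 + (2 - x), minimized at x = 1/2 giving 7/4 >= 1,
% consistent with the Roughgarden-T '02 theorem above.
```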
Bounds average welfare assuming no-regret learners [Blum, Hajiaghayi, Ligett, Roth, 2008]
Price of Total Anarchy $= \lim_{T \to \infty} \dfrac{\frac{1}{T} \sum_{t=1}^{T} cost(a^t)}{\text{"socially optimal" cost}}$
Theorem (Blum, Even-Dar, Ligett '06; Roughgarden '09): price of anarchy bounds developed for Nash equilibria extend to no-regret learning outcomes.
Classical model: assumes a stable set of participants.
[Figure: repeated play by a fixed player set over time.]
Dynamic population model: At each step t each player i is replaced with an arbitrary new player with probability p
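A tiny simulation (a sketch; the function and parameter names are mine) shows what the replacement probability $p$ means for the population: about a $p$ fraction turns over every step, and a player survives only about $1/p$ steps, which is all the time it has to learn.

```python
import numpy as np

def simulate_turnover(n=1000, p=0.2, T=500, seed=0):
    """Each step, every player is independently replaced with probability p."""
    rng = np.random.default_rng(seed)
    ages = np.zeros(n, dtype=int)      # steps since each player arrived
    replaced_frac = []
    for _ in range(T):
        replaced = rng.random(n) < p
        ages[replaced] = 0             # a new player starts fresh
        ages[~replaced] += 1
        replaced_frac.append(replaced.mean())
    return np.mean(replaced_frac), ages.mean()

frac, mean_age = simulate_turnover()
print(frac)      # approx p = 0.2: a fifth of the population changes each step
print(mean_age)  # approx (1 - p)/p steps: short lifetimes to learn in
```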
Bound average welfare assuming adaptive no-regret learners:
$PoA_D = \lim_{T \to \infty} \dfrac{\sum_{t=1}^{T} cost(a^t; w^t)}{\sum_{t=1}^{T} Opt(w^t)}$
where $w^t$ is the vector of player types at time $t$, even when the rate of change is high, i.e. a large fraction of players can turn over at every step.
Example 1: routing, with a changing environment.
[Figure: routing game repeated over time as the environment and population change.]
Example 2: matching (project selection), with a changing environment.
[Figure: players repeatedly choose among projects as the environment and population change.]
Adaptive regret: for all players $i$, strategies $x$, and intervals $[t_1, t_2]$:
$R_i(x, t_1, t_2) = \sum_{t=t_1}^{t_2} \left[ cost_i(a^t; w^t) - cost_i(x, a_{-i}^t; w^t) \right] \le o(t_2 - t_1)$, with rates of $\approx \sqrt{t_2 - t_1}$
⇒ regret with respect to a strategy that changes $k$ times is at most $\approx \sqrt{kT}$.
[Figure: time axis with an interval $[t_1, t_2]$ marked on the play sequence.]
Approximate adaptive regret: for all players $i$, strategies $x$, and intervals $[t_1, t_2]$:
$R_i(x, t_1, t_2) = \sum_{t=t_1}^{t_2} \left[ cost_i(a^t; w^t) - (1+\epsilon) \, cost_i(x, a_{-i}^t; w^t) \right] \le O(k \log d / \epsilon)$ with respect to a strategy that changes $k$ times,
using any of MWU (Hedge), Regret Matching, etc., mixed with a bit of "forgetting".
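The "forgetting" the slide mentions can be realized in the fixed-share style of Herbster-Warmuth: after each multiplicative update, mix a little uniform weight back in, so weights never collapse and the learner can re-adapt after the population or environment shifts. A minimal sketch, with parameter names of my choosing; this is one standard way to get regret against $k$-shifting comparators, not necessarily the exact rule from the paper:

```python
import numpy as np

def fixed_share(cost_rows, eta=0.1, alpha=0.01):
    """MWU with 'forgetting': after the usual multiplicative update,
    redistribute an alpha fraction of the total weight uniformly,
    so the learner can track a comparator that changes k times."""
    d = len(cost_rows[0])
    w = np.ones(d) / d
    played = []
    for c in cost_rows:
        played.append(w / w.sum())                 # mixed strategy this round
        w = w * np.exp(-eta * np.asarray(c))       # multiplicative update
        w = (1 - alpha) * w + alpha * w.sum() / d  # forgetting step
    return played
```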
Bound average welfare close to the Price of Anarchy for Nash, assuming adaptive no-regret learners, even when the rate of change is high: $p \approx 1/\mathrm{polylog}(n)$ with $n$ players.
Adaptive no-regret: $R_i(x) = \sum_{t=1}^{T} \left[ cost_i(a^t; w^t) - cost_i(x, a_{-i}^t; w^t) \right] \le o(T)$
The best action varies with the choices of others… Consider the optimal solution, and let $x = a_i^*$ be player $i$'s choice in OPT.
No regret for all players $i$: $\sum_t cost_i(a^t) \le \sum_t cost_i(a_i^*, a_{-i}^t)$
Players don't have to know $a_i^*$.
Consider the optimal solution: player $i$ does action $a_i^*$ in the optimum.
No regret: $\sum_t cost_i(a^t) \le \sum_t cost_i(a_i^*, a_{-i}^t)$ (player $i$ doesn't need to know $a_i^*$).
A game is $(\lambda,\mu)$-smooth ($\lambda > 0$, $\mu < 1$) if for all strategy vectors $a$: $\sum_i cost_i(a_i^*, a_{-i}) \le \lambda \, Opt + \mu \, cost(a)$
A Nash equilibrium $a$ has $cost(a) = \sum_i cost_i(a) \le \sum_i cost_i(a_i^*, a_{-i}) \le \lambda \, Opt + \mu \, cost(a)$, hence $cost(a) \le \frac{\lambda}{1-\mu} \, Opt$.
Consider the optimal solution: player $i$ does action $a_i^*$ in the optimum.
No regret: $\sum_t cost_i(a^t) \le \sum_t cost_i(a_i^*, a_{-i}^t)$ (player $i$ doesn't need to know $a_i^*$).
A cost-minimization game is $(\lambda,\mu)$-smooth ($\lambda > 0$, $\mu < 1$) if for all strategy vectors $a$: $\sum_i cost_i(a_i^*, a_{-i}) \le \lambda \, Opt + \mu \, cost(a)$
A no-regret sequence $a^t$ has $\frac{1}{T} \sum_t \sum_i cost_i(a^t) \le \frac{1}{T} \sum_t \sum_i cost_i(a_i^*, a_{-i}^t)$ and hence $\frac{1}{T} \sum_t cost(a^t) \le \frac{\lambda}{1-\mu} \, Opt$.
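Spelling out the chain behind this slide (my expansion of its inequalities): sum the no-regret guarantees over the players, apply smoothness at every step, then rearrange.

```latex
\begin{align*}
\frac{1}{T}\sum_{t}\mathrm{cost}(a^t)
  &= \frac{1}{T}\sum_{t}\sum_{i}\mathrm{cost}_i(a^t)\\
  &\le \frac{1}{T}\sum_{t}\sum_{i}\mathrm{cost}_i(a_i^*, a_{-i}^t) + o(1)
  && \text{(no regret, summed over players)}\\
  &\le \frac{1}{T}\sum_{t}\bigl[\lambda\,\mathrm{Opt} + \mu\,\mathrm{cost}(a^t)\bigr] + o(1)
  && \text{($(\lambda,\mu)$-smoothness at each $t$)}\\
\Rightarrow\quad
\frac{1}{T}\sum_{t}\mathrm{cost}(a^t)
  &\le \frac{\lambda}{1-\mu}\,\mathrm{Opt} + o(1).
\end{align*}
```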
Credit allocation: monotone utilities, $util_i$ = expected credit. The game is (1,1)-smooth: with $a_i^*$ the choice in Opt, for every action vector $a$:
$\sum_i util_i(a_i^*, a_{-i}) \ge Opt - util(a)$
Note: $util(a) = \sum_i util_i(a)$ is the total value of successful projects.
True project by project: let $n_j$ and $n_j^*$ be the number of players choosing project $j$ in $a$ and in OPT. If $n_j \ge n_j^*$, the contribution of project $j$ to the right-hand side is non-positive.
Else: players benefit more than in OPT from trying their opt project.
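The project-by-project argument can be sanity-checked by brute force on small random instances. The sketch below is my construction, under a fair-sharing interpretation of the slide (a player on project $j$ gets credit $v_j / n_j$, and welfare is the total value of projects attempted by at least one player); it verifies the (1,1)-smoothness inequality for every action vector $a$.

```python
import itertools
import random

def check_11_smooth(values, n):
    """Brute-force check of (1,1)-smoothness for the fair-sharing project game:
    sum_i util_i(a_i*, a_-i) >= Opt - util(a) for every action vector a."""
    m = len(values)

    def welfare(a):                      # total value of projects chosen by someone
        return sum(values[j] for j in set(a))

    # Optimum: cover the min(n, m) highest-value projects.
    order = sorted(range(m), key=lambda j: -values[j])
    a_star = [order[i] if i < m else order[0] for i in range(n)]
    opt = welfare(a_star)

    for a in itertools.product(range(m), repeat=n):
        lhs = 0.0
        for i in range(n):
            j = a_star[i]
            share = sum(1 for k in range(n) if k != i and a[k] == j) + 1
            lhs += values[j] / share     # i's credit after deviating to a_i*
        assert lhs >= opt - welfare(a) - 1e-9, (values, a)

random.seed(0)
for _ in range(100):
    check_11_smooth([random.random() for _ in range(4)], n=3)
print("(1,1)-smoothness held on all sampled instances")
```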
⇒ Nash cost ≤ opt with double traffic rate (Roughgarden-T '02)
⇒ 4/3 price of anarchy
Atomic congestion games are $(\frac{5}{3}, \frac{1}{3})$-smooth (Awerbuch-Azar-Epstein & Christodoulou-Koutsoupias '05) ⇒ 2.5 price of anarchy. The resulting bounds are tight.
Other applications include: … (… Mansour, Nisan EC '11)
Inequality we "wish to have": $\sum_t cost_i(a^t; w^t) \le \sum_t cost_i(a_i^{*t}, a_{-i}^t; w^t)$
where $a_i^{*t}$ is the optimum strategy for the players present at time $t$.
✗ The true optimum is too sensitive to learn: with $p \gg 1/N$, more than one player turns over per step in expectation, so the optimal solution can change at every step.
Theorem: if a game satisfies a "smoothness property" [Roughgarden '09], and the welfare optimization problem admits an approximation algorithm whose output is stable to changes in one player's type, then any adaptive learning outcome is approximately efficient, even when the rate of change is high.
Proof idea: use this approximate solution as $a^*$ in the Price of Anarchy proof. With $a^*$ not changing much, learners have time to learn not to regret following $a^*$. Note: the learner doesn't have to know $a^*$!!
Recall: the regret of adaptive learning is bounded by $\approx \sqrt{kT}$ with respect to any strategy that changes $k$ times.
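A sketch of how the pieces combine (my expansion of the proof idea, under the stated assumptions: $\sigma^t$ is the stable approximate optimum, changing only $k \ll T$ times, and the game is $(\lambda,\mu)$-smooth):

```latex
\begin{align*}
\sum_{t}\mathrm{cost}(a^t; w^t)
  &\le \sum_{t}\sum_i \mathrm{cost}_i(\sigma_i^t, a_{-i}^t; w^t)
       + n\cdot O\!\bigl(\sqrt{kT}\bigr)
  && \text{(adaptive regret vs.\ a $k$-shifting strategy)}\\
  &\le \sum_{t}\bigl[\lambda\,(1+\epsilon)\,\mathrm{Opt}(w^t)
       + \mu\,\mathrm{cost}(a^t; w^t)\bigr]
       + n\cdot O\!\bigl(\sqrt{kT}\bigr)
  && \text{(smoothness w.r.t.\ the approximate optimum $\sigma^t$)}\\
\Rightarrow\quad
\frac{1}{T}\sum_t \mathrm{cost}(a^t; w^t)
  &\le \frac{\lambda(1+\epsilon)}{1-\mu}\cdot
       \frac{1}{T}\sum_t \mathrm{Opt}(w^t) + o(1)
  \quad\text{once } k \ll T .
\end{align*}
```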
The true optimum is too sensitive; use instead an approximately optimal solution that changes rarely:
round values to powers of 2 only (loss of a factor of 2);
the log of the allocation's value only increases ($m \log w_{max}$ increases in total), apart from decreases due to departures.
Joint privacy [Kearns et al. '14, Dwork et al. '06]: a randomized algorithm is jointly differentially private if changing one player's input changes the distribution of the outcomes of the other players by less than $\epsilon$ (in the differential-privacy sense).
⇒ a solution sequence with a small number of changes, using the Coupling Lemma.
Theorem 1. Routing with increasing costs:
$\frac{1}{T} \sum_t cost(a^t; w^t) \le 2.5 \, (1+\epsilon) \, \frac{1}{T} \sum_t OPT(w^t)$
with $p = \dfrac{\epsilon}{\mathrm{poly}(\dots) \, \mathrm{polylog}(n)}$
if each player controls only a $1/n$ fraction of the total flow. Almost a constant fraction of change each step: the dependence on the number of players is only polylogarithmic.
Using the jointly differentially private algorithm of Hsu et al. '14:
Theorem 2. Matching markets, if values are in $[\delta, 1]$:
$\frac{1}{T} \sum_t W(a^t; w^t) \ge \frac{1}{4(1+\epsilon)} \, \frac{1}{T} \sum_t OPT(w^t)$
with $p = \dfrac{\epsilon^2 \delta^2}{\mathrm{polylog}(n, 1/\epsilon, 1/\delta)}$
Theorem 3. Large combinatorial markets with gross substitutes:
$\frac{1}{T} \sum_t W(a^t; w^t) \ge \frac{1}{2(1+\epsilon)} \, \frac{1}{T} \sum_t OPT(w^t)$
with $p = \dfrac{\epsilon^5 \delta^5}{m \, \mathrm{polylog}(n)}$
with each item in large supply, $\Omega(\mathrm{polylog}(n) \cdot \log(1/\epsilon, 1/\delta))$, and $\Theta(m)$ items.
Value of advertiser? Regret in bid data:
[Figure: histogram of advertisers' multiplicative regret, showing frequency and cumulative %.]
Maybe they converged to best response; strictly positive regret suggests a learning phase.
dynamic population