What can be sampled locally? Yitong Yin, Nanjing University - PowerPoint PPT Presentation



SLIDE 1

What can be sampled locally?

Yitong Yin, Nanjing University

Joint work with: Weiming Feng, Yuxin Sun

SLIDE 2

Local Computation

the LOCAL model [Linial ’87]: “What can be computed locally?” [Naor, Stockmeyer, STOC’93, SICOMP’95]

  • Communications are synchronized.
  • In each round: each node can send messages of unbounded size to all its neighbors.
  • Local computations are free.
  • Complexity: # of rounds to terminate in the worst case.
  • In t rounds: each node can collect information up to distance t.
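This information-gathering view of the LOCAL model is easy to simulate. Below is a minimal sketch (the function `local_rounds` and its dictionary-based message format are illustrative, not from the talk): after t synchronized rounds, a node knows exactly the inputs within distance t.

```python
def local_rounds(adj, inputs, t):
    """Simulate t rounds of the LOCAL model as pure information gathering."""
    # knowledge[v] maps node ids to the inputs that v has learned so far
    knowledge = {v: {v: inputs[v]} for v in adj}
    for _ in range(t):
        # synchronized round: everyone sends its entire knowledge to all
        # neighbors (messages of unbounded size, local computation free)
        snapshot = {v: dict(k) for v, k in knowledge.items()}
        for v in adj:
            for u in adj[v]:
                knowledge[v].update(snapshot[u])
    return knowledge

# path 0-1-2-3-4: after t = 2 rounds, node 0 knows the inputs within distance 2
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
known = local_rounds(adj, {v: f"x{v}" for v in adj}, 2)
```

The snapshot taken at the start of each round is what makes information travel exactly one hop per round.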

SLIDE 3

Local Computation

the LOCAL model [Linial ’87]: network G(V,E)

  • In t rounds: each node can collect information up to distance t.

Locally Checkable Labeling (LCL) problems [Naor, Stockmeyer ’93]: CSPs with local constraints, verifiable by local constraints in O(1) rounds.

  • Construct a feasible solution: vertex/edge coloring, Lovász local lemma.
  • Find a local optimum: MIS, MM.
  • Approximate a global optimum: maximum matching, minimum vertex cover, minimum dominating set.

Q: “What locally definable problems are locally computable?” (i.e., in a small number of rounds)
SLIDE 4

“What can be sampled locally?”

network G(V,E): a CSP with local constraints on the network, e.g.

  • proper q-coloring;
  • independent set.

Sample a uniform random feasible solution by distributed algorithms (in the LOCAL model).

Q: “What locally definable joint distributions are locally sample-able?”

SLIDE 5

Markov Random Fields

network G(V,E):

  • Each vertex corresponds to a variable with finite domain [q].
  • Each edge e=(u,v)∈E imposes a weighted binary constraint: Ae : [q]² → R≥0.
  • Each vertex v∈V imposes a weighted unary constraint: bv : [q] → R≥0.
  • Gibbs distribution µ: ∀σ∈[q]^V,

        µ(σ) ∝ Π_{e=(u,v)∈E} Ae(σu, σv) · Π_{v∈V} bv(σv)

A random configuration X ∈ [q]^V following µ is a Markov random field (MRF).
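For tiny instances the Gibbs distribution can be tabulated by brute force, which makes the definition concrete. A sketch under that assumption (`gibbs_distribution` is an illustrative helper, not code from the paper), using the independent-set constraint matrices from the talk:

```python
import itertools

def gibbs_distribution(n, q, edges, A, b):
    """Tabulate mu(sigma) ∝ prod_e Ae(sigma_u, sigma_v) * prod_v bv(sigma_v)."""
    weights = {}
    for sigma in itertools.product(range(q), repeat=n):
        w = 1.0
        for (u, v) in edges:
            w *= A[sigma[u]][sigma[v]]
        for v in range(n):
            w *= b[sigma[v]]
        weights[sigma] = w
    Z = sum(weights.values())  # partition function
    return {s: w / Z for s, w in weights.items()}

# Independent sets on the path 0-1-2: Ae forbids two adjacent occupied vertices.
A_is = [[1, 1], [1, 0]]
b_is = [1, 1]
mu = gibbs_distribution(3, 2, [(0, 1), (1, 2)], A_is, b_is)
# the 5 independent sets of the path each get probability 1/5
```

This is exponential in n, of course; it only serves to check the definition, not to sample.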

SLIDE 6

Markov Random Fields

network G(V,E): X ∈ [q]^V follows the Gibbs distribution µ (MRF): ∀σ∈[q]^V,

    µ(σ) ∝ Π_{e=(u,v)∈E} Ae(σu, σv) · Π_{v∈V} bv(σv)

  • proper q-coloring: Ae is the q×q matrix with 0 on the diagonal and 1 elsewhere; bv = (1, …, 1).
  • independent set: Ae = [[1, 1], [1, 0]] (adjacent occupied vertices are forbidden); bv = (1, 1).
  • local conflict colorings [Fraigniaud, Heinrich, Kosowski FOCS’16]: arbitrary Ae ∈ {0,1}^{q×q}, bv ∈ {0,1}^q.

SLIDE 7

A Motivation: Distributed Machine Learning

  • Data are stored in a distributed system.
  • Sampling from a probabilistic graphical model (e.g. the Markov random field) by distributed algorithms.

SLIDE 8

Glauber Dynamics

MRF on G(V,E): ∀σ ∈ [q]^V,

    µ(σ) ∝ Π_{e=(u,v)∈E} Ae(σu, σv) · Π_{v∈V} bv(σv)

starting from an arbitrary X0 ∈ [q]^V, transition for Xt → Xt+1:

  pick a uniform random vertex v;
  resample X(v) according to the marginal distribution induced by µ at vertex v, conditioning on Xt(N(v)).

marginal distribution:

    Pr[Xv = x | XN(v)] = bv(x) Π_{u∈N(v)} A(u,v)(Xu, x) / Σ_{y∈[q]} bv(y) Π_{u∈N(v)} A(u,v)(Xu, y)

stationary distribution: µ

mixing time: τmix = max_{X0} min { t : dTV(Xt, µ) ≤ 1/(2e) }
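One transition of the chain can be coded directly from the marginal formula above. A minimal sketch (`glauber_step` is an illustrative name, not from the paper; it assumes the current state gives the marginal a positive normalizing sum, as holds e.g. for proper colorings with q > Δ):

```python
import math
import random

def glauber_step(X, adj, q, A, b, rng=random):
    """One Glauber transition: pick a uniform random vertex v and resample
    X[v] from the marginal distribution conditioned on its neighbors."""
    v = rng.randrange(len(X))
    # unnormalized marginal: bv(x) * prod_{u in N(v)} A(X[u], x)
    weights = [b[x] * math.prod(A[X[u]][x] for u in adj[v]) for x in range(q)]
    r = rng.random() * sum(weights)
    for x, w in enumerate(weights):
        r -= w
        if r < 0:
            X[v] = x
            break
    return X

# proper 5-coloring of a triangle (Ae = 0 on the diagonal, 1 off it)
random.seed(0)
q = 5
A = [[0 if i == j else 1 for j in range(q)] for i in range(q)]
b = [1] * q
adj = [[1, 2], [0, 2], [0, 1]]
X = [0, 1, 2]
for _ in range(100):
    glauber_step(X, adj, q, A, b)  # each step keeps the coloring proper
```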

SLIDE 9

Mixing of Glauber Dynamics

Δ = max-degree

influence matrix {ρv,u}v,u∈V:
ρv,u = max discrepancy (in total variation distance) of the marginal distributions at v caused by any pair σ,τ of boundary conditions that differ only at u.

Dobrushin’s condition:

    ‖ρ‖∞ = max_{v∈V} Σ_{u∈V} ρv,u ≤ 1 − ε

i.e., contraction of the one-step optimal coupling in the worst case w.r.t. Hamming distance.

Theorem (Dobrushin ’70; Salas, Sokal ’97): Dobrushin’s condition implies τmix = O(n log n) for Glauber dynamics.

For q-coloring, Dobrushin’s condition holds when q ≥ (2+ε)Δ.

SLIDE 10

Parallelization

Glauber dynamics: starting from an arbitrary X0 ∈ [q]^V, transition for Xt → Xt+1:

  pick a uniform random vertex v;
  resample X(v) according to the marginal distribution induced by µ at vertex v, conditioning on Xt(N(v)).

Parallelization:

  • Chromatic scheduler [folklore] [Gonzalez et al., AISTATS’11]: vertices in the same color class are updated in parallel.
  • “Hogwild!” [Niu, Recht, Ré, Wright, NIPS’11] [De Sa, Olukotun, Ré, ICML’16]: all vertices are updated in parallel, ignoring concurrency issues.

SLIDE 11

Warm-up: When Luby meets Glauber

starting from an arbitrary X0 ∈ [q]^V, at each step, for each vertex v∈V:

  independently sample a random number βv∈[0,1];
  if βv is a local maximum among its neighborhood N(v):
    resample X(v) according to the marginal distribution induced by µ at vertex v, conditioning on Xt(N(v)).

  • Luby step: independently sample a random independent set.
  • Glauber step: for the independent-set vertices, update correctly according to the current marginal distributions.
  • Stationary distribution: the Gibbs distribution µ.
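One round of the combined chain can be sketched as follows (`luby_glauber_round` is an illustrative helper, not code from the paper; since the winners form an independent set, conditioning on the start-of-round state is the same as conditioning on the current neighbors):

```python
import math
import random

def luby_glauber_round(X, adj, q, A, b, rng=random):
    """One LubyGlauber round: local maxima of i.i.d. betas form a random
    independent set (Luby step); those vertices are resampled from their
    conditional marginals in parallel (Glauber step)."""
    n = len(X)
    beta = [rng.random() for _ in range(n)]
    winners = [v for v in range(n) if all(beta[v] > beta[u] for u in adj[v])]
    old = list(X)  # winners are pairwise non-adjacent, so using the
                   # start-of-round state for neighbors is equivalent
    for v in winners:
        weights = [b[x] * math.prod(A[old[u]][x] for u in adj[v]) for x in range(q)]
        r = rng.random() * sum(weights)
        for x, w in enumerate(weights):
            r -= w
            if r < 0:
                X[v] = x
                break
    return X

# proper 5-coloring of a triangle stays proper under every round
random.seed(1)
q = 5
A = [[0 if i == j else 1 for j in range(q)] for i in range(q)]
b = [1] * q
adj = [[1, 2], [0, 2], [0, 1]]
X = [0, 1, 2]
for _ in range(50):
    luby_glauber_round(X, adj, q, A, b)
```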
slide-12
SLIDE 12

Mixing of LubyGlauber

influence matrix {ρv,u}v,u∈V

Dobrushin’s condition:

    ‖ρ‖∞ = max_{v∈V} Σ_{u∈V} ρv,u ≤ 1 − ε

Theorem (Dobrushin ’70; Salas, Sokal ’97): Dobrushin’s condition implies τmix = O(n log n) for Glauber dynamics.

Dobrushin’s condition implies τmix = O(Δ log n) for the LubyGlauber chain.

SLIDE 13

influence matrix {ρv,u}v,u∈V

Dobrushin’s condition:

    ‖ρ‖∞ = max_{v∈V} Σ_{u∈V} ρv,u ≤ 1 − ε

Dobrushin’s condition implies τmix = O(Δ log n) for the LubyGlauber chain.

Proof (similar to [Hayes ’04] [Dyer, Goldberg, Jerrum ’06]):

In the one-step optimal coupling (Xt, Yt), let p(t)_v = Pr[Xt(v) ≠ Yt(v)]. Then

    p(t+1) ≤ M p(t),   where M = (I − D) + Dρ,

D is diagonal, and D_{v,v} = Pr[v is picked in the Luby step] ≥ 1/(deg(v) + 1). Hence

    Pr[Xt ≠ Yt] ≤ ‖p(t)‖1 ≤ n‖p(t)‖∞ ≤ n‖M‖∞^t ‖p(0)‖∞ ≤ n(1 − ε/(Δ + 1))^t.

slide-14
SLIDE 14

Crossing the Chromatic # Barrier

Δ = max-degree, χ = chromatic no.

Glauber: O(n log n); LubyGlauber: O(Δ log n); parallel speedup = Θ(n/Δ).

LubyGlauber does not update adjacent vertices simultaneously, so it takes ≥ χ steps to update all vertices at least once.

Q: “How to update all variables simultaneously and still converge to the correct distribution?”

SLIDE 15

The LocalMetropolis Chain

starting from an arbitrary X ∈ [q]^V, at each step:

  each vertex v∈V independently proposes a random σv∈[q] with probability bv(σv) / Σ_{i∈[q]} bv(i);
  each edge e=(u,v) passes its check independently with probability Ae(Xu, σv) · Ae(σu, Xv) · Ae(σu, σv) / max_{i,j∈[q]} (Ae(i, j))³ (a collective coin flip made between u and v);
  each vertex v∈V accepts its proposal and updates Xv to σv if all incident edges pass their checks.

  • [Feng, Sun, Y. ’17]: the LocalMetropolis chain is time-reversible w.r.t. the MRF Gibbs distribution µ.
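The step above can be coded directly from its description. A minimal sketch (`local_metropolis_step` is an illustrative name, not code from the paper):

```python
import random

def local_metropolis_step(X, edges, q, A, b, rng=random):
    """One LocalMetropolis step: all vertices propose in parallel; each edge
    flips a collective coin; a vertex accepts iff all incident edges pass."""
    n = len(X)
    # each vertex proposes sigma_v with probability b(sigma_v) / sum_i b(i)
    sigma = [rng.choices(range(q), weights=b)[0] for _ in range(n)]
    Amax = max(max(row) for row in A)
    ok = [True] * n
    for (u, v) in edges:
        # the edge passes with prob A(Xu,sv) * A(su,Xv) * A(su,sv) / Amax^3
        p = A[X[u]][sigma[v]] * A[sigma[u]][X[v]] * A[sigma[u]][sigma[v]] / Amax ** 3
        if not rng.random() < p:
            ok[u] = ok[v] = False  # both endpoints reject their proposals
    for v in range(n):
        if ok[v]:
            X[v] = sigma[v]
    return X

# proper 5-coloring of a triangle: the fully-parallel step keeps it proper,
# because an accepting vertex's proposal passed checks against both the old
# and the proposed values of every neighbor
random.seed(2)
q = 5
A = [[0 if i == j else 1 for j in range(q)] for i in range(q)]
b = [1] * q
edges = [(0, 1), (1, 2), (0, 2)]
X = [0, 1, 2]
for _ in range(50):
    local_metropolis_step(X, edges, q, A, b)
```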

slide-16
SLIDE 16

Detailed Balance Equation: ∀X, Y ∈ [q]^V,

    µ(X) P(X, Y) = µ(Y) P(Y, X)

σ ∈ [q]^V: the proposals of all vertices
C ∈ {0,1}^E: indicates whether each edge e∈E passes its check

    ΩX→Y ≜ {(σ, C) | X → Y when the random choice is (σ, C)}

    P(X, Y) / P(Y, X) = Σ_{(σ,C)∈ΩX→Y} Pr(σ) Pr(C | σ, X) / Σ_{(σ,C)∈ΩY→X} Pr(σ) Pr(C | σ, Y)

A bijection φX,Y : ΩX→Y → ΩY→X, (σ, C) ↦ (σ′, C′), is constructed as: C′ = C; σ′v = Xv if Ce = 1 for all e incident with v, and σ′v = σv otherwise. It satisfies

    Pr(σ) Pr(C | σ, X) / Pr(σ′) Pr(C′ | σ′, Y) = Π_{v∈V} bv(Yv)/bv(Xv) · Π_{e=uv∈E} Ae(Yu, Yv)/Ae(Xu, Xv) = µ(Y)/µ(X),

which gives the detailed balance equation.

slide-17
SLIDE 17

The LocalMetropolis Chain (the same slide as SLIDE 15, shown again)

slide-18
SLIDE 18

LocalMetropolis for the Hardcore Model

the hardcore model on G(V,E) with fugacity λ: ∀ independent set I in G,

    µ(I) = λ^|I| / Σ_{I′: IS in G} λ^|I′|

starting from an arbitrary X ∈ {0,1}^V (with 1 indicating occupied), at each step, each vertex v∈V:

  proposes a random σv∈{0,1} independently, with σv = 1 with probability λ/(1+λ) and σv = 0 with probability 1/(1+λ);
  accepts the proposal and updates Xv to σv unless for some neighbor u of v: Xu=σv=1, or σu=Xv=1, or σu=σv=1.

  • λ < 1/Δ: τmix = O(log n), even for unbounded Δ.
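A sketch of this specialized step (`hardcore_step` is an illustrative name, not code from the paper). Note that the update preserves independence of the occupied set: an edge check fails exactly when the pair of current values and proposals could create two adjacent occupied vertices.

```python
import random

def hardcore_step(X, edges, lam, rng=random):
    """One LocalMetropolis step for the hardcore model with fugacity lam."""
    n = len(X)
    # propose occupied (1) with probability lam / (1 + lam)
    sigma = [1 if rng.random() < lam / (1 + lam) else 0 for _ in range(n)]
    ok = [True] * n
    for (u, v) in edges:
        # the edge fails if Xu = sv = 1, or su = Xv = 1, or su = sv = 1
        if (X[u] and sigma[v]) or (sigma[u] and X[v]) or (sigma[u] and sigma[v]):
            ok[u] = ok[v] = False
    for v in range(n):
        if ok[v]:
            X[v] = sigma[v]
    return X

random.seed(3)
edges = [(i, i + 1) for i in range(4)]  # path on 5 vertices
X = [0] * 5
for _ in range(200):
    hardcore_step(X, edges, 0.15)  # X stays an independent set throughout
```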
slide-19
SLIDE 19

LocalMetropolis for q-Coloring

starting from an arbitrary X ∈ [q]^V, at each step, each vertex v∈V:

  proposes a color σv∈[q] uniformly and independently at random;
  accepts the proposal and updates Xv to σv if for all v’s neighbors u: Xu≠σv ∧ σu≠Xv ∧ σu≠σv.

Theorem (Feng, Sun, Y. ’17): τmix = O(log n) for LocalMetropolis on q-coloring when q ≥ (2 + √2 + ε)Δ.

The O(log n) mixing time bound holds even for unbounded Δ and q.

slide-20
SLIDE 20

Δ-regular tree; Xroot = red, Yroot = blue; ∀ non-root v, Xv = Yv ∉ {red, blue}.

each v: proposes a uniform random color σv∈[q]; updates Xv to σv if for all v’s neighbors u: Xu≠σv ∧ σu≠Xv ∧ σu≠σv.

coupling: couple the proposals (σX, σY) so that (X, Y) → (X′, Y′):

  1. the root proposes consistently;
  2. each child of the root proposes bijectively;
  3. each vertex of depth ≥2 proposes bijectively if its parent proposed different colors in the two chains, and proposes consistently otherwise.

vertex v proposes consistently: σX_v = σY_v.
vertex v proposes bijectively: σX_v = red if σY_v = blue; σX_v = blue if σY_v = red; σX_v = σY_v otherwise.
slide-21
SLIDE 21

Δ-regular tree; Xroot = red, Yroot = blue; ∀ non-root v, Xv = Yv ∉ {red, blue}.

each v: proposes a uniform random color σv∈[q]; updates Xv to σv if for all v’s neighbors u: Xu≠σv ∧ σu≠Xv ∧ σu≠σv.

coupling the proposals (σX, σY) so that (X, Y) → (X′, Y′):

root:

    Pr[X′root ≠ Y′root] ≤ 1 − (1 − Δ/q)(1 − 2/q)^Δ

non-root u at level ℓ:

    Pr[X′u ≠ Y′u] ≤ (1/q)(1 − 2/q)^(Δ−1) (2/q)^(ℓ−1)

summing over all vertices (assume q ≥ αΔ):

    Pr[X′root ≠ Y′root] + Σ_{non-root u} Pr[X′u ≠ Y′u]
        ≤ 1 − (1 − Δ/q)(1 − 2/q)^Δ + (Δ/(q − 2Δ))(1 − 2/q)^(Δ−1)
        ≤ 1 − e^(−2/α) (1 − 1/α − 1/(α − 2))

slide-22
SLIDE 22

Δ-regular tree; Xroot = red, Yroot = blue; ∀ non-root v, Xv = Yv ∉ {red, blue}.

each v: proposes a uniform random color σv∈[q]; updates Xv to σv if for all v’s neighbors u: Xu≠σv ∧ σu≠Xv ∧ σu≠σv.

for a general graph:

  1. deal with irregularity by a metric in path coupling;
  2. deal with cycles by self-avoiding walks;
  3. deal with red/blue non-root vertices by a monotone argument.

slide-23
SLIDE 23

LocalMetropolis for q-Coloring

starting from an arbitrary X ∈ [q]^V, at each step, each vertex v∈V:

  proposes a color σv∈[q] uniformly and independently at random;
  accepts the proposal and updates Xv to σv if for all v’s neighbors u: Xu≠σv ∧ σu≠Xv ∧ σu≠σv.

τmix = O(log n) when q ≥ (2 + √2 + ε)Δ.

  • The mixing time bound holds even for unbounded Δ and q.
  • q ≥ (1+ε)Δ: each vertex is updated at Ω(1) rate in LocalMetropolis.
slide-24
SLIDE 24

Lower Bounds

Q: “How local can a distributed sampling algorithm be?”
Q: “What cannot be sampled locally?”

slide-25
SLIDE 25

The LOCAL Model

the LOCAL model:

  • In t rounds: each node can collect information up to distance t.
  • Outputs returned by vertices at distance >2t from each other are mutually independent.

slide-26
SLIDE 26

Ω(log n) Lower Bound for Sampling

Theorem (Feng, Sun, Y. ’17): For any non-degenerate MRF, any distributed algorithm that samples from its distribution µ within bounded total variation distance requires Ω(log n) rounds of communication.

On a path, for vertices u, v at distance >2t:

  • outputs of a t-round algorithm: the Xv’s are mutually independent;
  • Gibbs distribution µ: exponential correlation between the Xv’s:

        ‖µv^σu − µv^τu‖TV ≥ exp(−O(t)) > n^(−1/4)   for σu ≠ τu and a t = O(log n);

  • hence dTV(X, X̃) > 1/(2e) for any product distribution X̃.

slide-27
SLIDE 27

Ω(log n) Lower Bound for Sampling

Theorem (Feng, Sun, Y. ’17): For any non-degenerate MRF, any distributed algorithm that samples from its distribution µ within bounded total variation distance requires Ω(log n) rounds of communication.

  • The Ω(log n) lower bound holds for all MRFs with exponential correlation: non-trivial spin systems with O(1) spin states.
  • O(log n) is the new criterion of “being local” for distributed sampling algorithms.

slide-28
SLIDE 28

An Ω(diam) Lower Bound

Sampling an almost-uniform independent set in graphs with max-degree Δ by poly-time Turing machines:

  • [Weitz ’06] If Δ≤5, there are poly-time algorithms.
  • [Sly ’10] If Δ≥6, there is no poly-time algorithm unless NP=RP.

The Ω(diam) lower bound holds for sampling from the hardcore model with fugacity

    λ > λc(Δ) = (Δ−1)^(Δ−1) / (Δ−2)^Δ

Theorem (Feng, Sun, Y. ’17): For any Δ≥6, any distributed algorithm that samples a uniform independent set within bounded total variation distance in graphs with max-degree Δ requires Ω(diam) rounds of communication.

slide-29
SLIDE 29

Theorem (Feng, Sun, Y. ’17): For any Δ≥6, any distributed algorithm that samples a uniform independent set within bounded total variation distance in graphs with max-degree Δ requires Ω(diam) rounds of communication.

Construction: G is an even cycle; H is the random Δ-regular bipartite gadget of [Sly ’10]; GH combines them and has max-degree Δ if Δ≥6.

Sampling a nearly uniform independent set in GH amounts to sampling a nearly uniform max-cut in the even cycle G (long-range correlation!).
slide-30
SLIDE 30

Theorem (Feng, Sun, Y. ’17): For any Δ≥6, any distributed algorithm that samples a uniform independent set within bounded total variation distance in graphs with max-degree Δ requires Ω(diam) rounds of communication.

A strong separation of sampling from other local computation tasks:

  • An independent set is trivial to construct locally (because ∅ is an independent set).
  • The Ω(diam) lower bound for sampling holds even when every vertex knows the entire graph: the lower bound holds not because of the locality of input information, but because of the locality of randomness.

slide-31
SLIDE 31

Summary

Sampling from a locally-defined joint distribution via distributed algorithms:

  • LubyGlauber: O(Δ log n) rounds under Dobrushin’s condition;
  • LocalMetropolis: may achieve O(log n) rounds;
  • Ω(log n) lower bound for sampling from almost all nontrivial joint distributions;
  • Ω(diam) lower bound for sampling from joint distributions exhibiting the (non-uniqueness) phase transition property.

Weiming Feng, Yuxin Sun, Yitong Yin. What can be sampled locally? In PODC’17. arXiv:1702.00142.

slide-32
SLIDE 32

Open Problems

  • Better analysis of LocalMetropolis.
  • Distributed sampling of:
    • matchings;
    • the ferromagnetic Ising model on graphs of unbounded degree;
    • anti-ferromagnetic 2-spin systems in the uniqueness regime on graphs of unbounded degree.
  • Self-reducible sampling in the LOCAL model?
  • Complexity hierarchy for distributed sampling?
  • New ideas for distributed sampling: e.g. the LLL sampler for the hardcore model of Guo, Jerrum, Liu.

slide-33
SLIDE 33

Thank you!

Any questions?