SLIDE 1

Distributed Consensus Optimization

Ming Yan, Michigan State University, CMSE/Mathematics
September 14, 2018

SLIDE 2

why do we need decentralized optimization?

Decentralized vehicle/aircraft coordination [1]

[Images: a flock of birds; an aircraft formation]

◮ Average consensus problem

min over {x(1), . . . , x(n)}:  ‖x(1) − b1‖₂² + · · · + ‖x(n) − bn‖₂²
s.t.  x(1) = · · · = x(n)

[1] Ren, Wei, Randal W. Beard, and Ella M. Atkins. "Information consensus in multivehicle cooperative control." IEEE Control Systems 27.2 (2007): 71-82.
SLIDE 3

why do we need decentralized optimization?

Decentralized state estimation for the smart grid [2]

◮ Least squares (Gaussian noise) + ℓ1 norm (sparse anomalies)

min over {x(i) ∈ Xi, v(i)}:  Σ_{i=1}^n fi(x(i), v(i))
s.t.  x(h)[j] = x(j)[h], ∀j ∈ Nh, ∀h,

where fi(x(i), v(i)) = ‖zi − Hi x(i) − v(i)‖₂² + λ‖v(i)‖₁, and the model parameter λ can be obtained through cross validation.

[2] Kekatos, Vassilis, and Georgios B. Giannakis. "Distributed robust power system state estimation." IEEE Transactions on Power Systems 28.2 (2013): 1617-1626.
SLIDE 4

why do we need decentralized optimization?

Decentralized dictionary learning [3]

◮ Matrix factorization + regularization

min over D, A:  (1/2) Σ_{j=1}^c ( ‖Y:,j − D A:,j‖²_F + λ‖A:,j‖₁ ) + γ‖D‖²_F,

where
Y ∈ R^{m×n} – training data distributed over c agents
D ∈ R^{m×p} – dictionary that constitutes Y
A ∈ R^{p×n} – sparse coefficient vectors that encode Y, divided into n parts

[3] Wai, Hoi-To, Tsung-Hui Chang, and Anna Scaglione. "A consensus-based decentralized algorithm for non-convex optimization with application to dictionary learning." Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015.
SLIDE 5

why do we need decentralized optimization?

Decentralized data/signal processing [4]

◮ cost/risk minimization: min f̄(x) = (1/n) Σ_{i=1}^n fi(x)
◮ balance between communication and computation; robustness
◮ privacy preservation: exchange fi? No! Exchange the iterates xk(i)
◮ ......
◮ unmanned vehicle coordination, in-vehicle networking
◮ smart grid management, power station management
◮ decentralized recommender systems, multi-group cooperation
◮ decentralized network utility maximization
◮ decentralized resource allocation
◮ ......

[4] Ren, Wei, Randal W. Beard, and Ella M. Atkins. "Information consensus in multivehicle cooperative control." IEEE Control Systems 27.2 (2007): 71-82.
SLIDE 6

what is decentralized optimization?

Decentralized consensus optimization

x* = arg min_{x ∈ C ⊆ R^p}  f̄(x) = (1/n) Σ_{i=1}^n fi(x)    (1)

[Figure: a network of 10 agents, labeled 1-10]

◮ involves multiple agents connected by a network
◮ messaging with 1-hop neighbors
◮ each agent owns a private objective
◮ each agent makes a local decision
◮ the overall objective is optimized
◮ all agents reach consensus

◮ Compared to a centralized system: robustness, balanced communication and computation, privacy preservation
◮ Related topics: in-vehicle networking, internet of things, cloud computing, big data
SLIDE 7

simplest decentralized consensus problem: averaging

SLIDES 8-14

decentralized averaging

One iteration: xk+1 = Wxk, where x = [x1, . . . , xn]⊤.

◮ W encodes the network; nonzero entries correspond to edges; we assume that W is symmetric (for undirected networks). Two examples, a fully connected 4-node network and a 4-node path (rows separated by semicolons):

W = [1/4, 1/4, 1/4, 1/4; 1/4, 1/4, 1/4, 1/4; 1/4, 1/4, 1/4, 1/4; 1/4, 1/4, 1/4, 1/4],

W = [2/3, 1/3, 0, 0; 1/3, 1/3, 1/3, 0; 0, 1/3, 1/3, 1/3; 0, 0, 1/3, 2/3].

◮ 1n×1 is a fixed point, i.e., W has an eigenvalue 1; each row/column sums to one.
◮ Any fixed point is a consensus, i.e., x* = c·1n×1.
◮ All other eigenvalues of W lie in (−1, 1).
◮ 1⊤xk+1 = 1⊤Wxk = 1⊤xk; the sum is preserved during the iteration.
◮ The convergence speed depends on the second largest eigenvalue of W in absolute value.
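A minimal NumPy sketch of this iteration, assuming the 4-node path-graph W from the second example above; the agent values and iteration count are illustrative:

```python
import numpy as np

# Mixing matrix of the 4-node path graph from the slide: symmetric,
# rows and columns sum to one, all other eigenvalues inside (-1, 1).
W = np.array([[2/3, 1/3, 0.0, 0.0],
              [1/3, 1/3, 1/3, 0.0],
              [0.0, 1/3, 1/3, 1/3],
              [0.0, 0.0, 1/3, 2/3]])

x = np.array([1.0, 2.0, 3.0, 4.0])  # one scalar per agent
for k in range(200):
    x = W @ x                        # x^{k+1} = W x^k

print(x)        # all entries approach the average 2.5
print(x.sum())  # the sum 1^T x is preserved: 10.0
```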
SLIDES 15-19

decentralized averaging as gradient descent

Decentralized averaging: xk+1 = Wxk. Rewrite the iteration as

xk+1 = Wxk = xk − (I − W)xk.

It is equivalent to gradient descent with stepsize 1 for

minimize_x  (1/2)‖√(I − W) x‖₂².

◮ The final solution depends on the initial sum 1⊤x0.
◮ The Lipschitz constant of the gradient (I − W)x is smaller than 2, so we can choose stepsize 1.
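A quick numerical check of this equivalence, a sketch using the same path-graph W as above (√(I − W) exists since I − W is positive semidefinite when W is symmetric with λ(W) ≤ 1):

```python
import numpy as np

W = np.array([[2/3, 1/3, 0.0, 0.0],
              [1/3, 1/3, 1/3, 0.0],
              [0.0, 1/3, 1/3, 1/3],
              [0.0, 0.0, 1/3, 2/3]])
I = np.eye(4)
x = np.random.default_rng(0).standard_normal(4)

# f(x) = (1/2)‖sqrt(I - W) x‖² = (1/2) x^T (I - W) x, so ∇f(x) = (I - W) x.
grad = (I - W) @ x
assert np.allclose(W @ x, x - grad)  # averaging step = gradient step, stepsize 1
```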
SLIDES 20-24

decentralized gradient descent

Consider the problem

minimize_x  f(x) = Σ_{i=1}^n fi(xi),  s.t.  x1 = x2 = · · · = xn.

Decentralized gradient descent (DGD) (Nedic-Ozdaglar '09):

xk+1 = Wxk − λ∇f(xk).

◮ Rewrite it as xk+1 = xk − ((I − W)xk + λ∇f(xk)); DGD is gradient descent with stepsize one applied to

minimize_x  (1/2)‖√(I − W) x‖₂² + λf(x).

◮ The fixed point is generally not a consensus, i.e., Wx* = x* + λ∇f(x*) ≠ x*.
◮ Hence DGD uses a diminishing stepsize, i.e., λ is decreased during the iteration.
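A minimal DGD sketch, assuming the same path-graph W and the quadratic objective fi(xi) = (xi − bi)²/2, so that ∇f(x) = x − b entrywise; with a constant λ the iterates stall near, but not at, the consensus average:

```python
import numpy as np

def dgd(W, b, lam, iters=500):
    """DGD: x^{k+1} = W x^k - λ ∇f(x^k), here with ∇f(x) = x - b."""
    x = np.zeros_like(b)
    for k in range(iters):
        x = W @ x - lam * (x - b)
    return x

W = np.array([[2/3, 1/3, 0.0, 0.0],
              [1/3, 1/3, 1/3, 0.0],
              [0.0, 1/3, 1/3, 1/3],
              [0.0, 0.0, 1/3, 2/3]])
b = np.array([1.0, 2.0, 3.0, 4.0])

print(dgd(W, b, lam=0.1))   # near, but not exactly, [2.5 2.5 2.5 2.5]
print(dgd(W, b, lam=0.01))  # smaller λ: closer to consensus, slower progress
```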
SLIDES 25-28

constant stepsize?

◮ alternating direction method of multipliers (ADMM) (Shi et al. '14, Chang-Hong-Wang '15, Hong-Chang '17)
◮ multi-consensus inner loops (Chen-Ozdaglar '12, Jakovetic-Xavier-Moura '14)
◮ EXTRA/PG-EXTRA (Shi et al. '15)
SLIDES 29-32

decentralized smooth optimization

Problem:  minimize_x  f(x),  s.t.  √(I − W) x = 0.

◮ Lagrangian:

f(x) + ⟨√(I − W) x, s⟩,

where s is the Lagrange multiplier.

◮ Optimality (KKT) conditions:

0 = ∇f(x*) + √(I − W) s*,
0 = −√(I − W) x*.

◮ It is the same as the block system (blocks listed row by row, semicolons separating rows):

[−∇f(x*); 0] = [0, √(I − W); −√(I − W), 0] [x*; s*].
SLIDES 33-37

forward-backward

◮ The KKT system:

[−∇f(x*); 0] = [0, √(I − W); −√(I − W), 0] [x*; s*].

◮ Using forward-backward on the KKT system with the metric [αI, −√(I − W); −√(I − W), βI]:

[αI, −√(I − W); −√(I − W), βI] [xk; sk] − [∇f(xk); 0]
    = [αI, −√(I − W); −√(I − W), βI] [xk+1; sk+1] + [0, √(I − W); −√(I − W), 0] [xk+1; sk+1].

◮ It reduces to

[αI, −√(I − W); −√(I − W), βI] [xk; sk] − [∇f(xk); 0] = [αI, 0; −2√(I − W), βI] [xk+1; sk+1].

◮ It is equivalent to

αxk − √(I − W) sk − ∇f(xk) = αxk+1,
−√(I − W) xk + βsk = −2√(I − W) xk+1 + βsk+1.

◮ For simplicity, let t = √(I − W) s, and we have

αxk − tk − ∇f(xk) = αxk+1,
−(I − W)xk + βtk = −2(I − W)xk+1 + βtk+1.
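A small numerical check, a sketch assuming ∇f(x) = x − b and the path-graph W from earlier, that the (x, t) recursion above reproduces the EXTRA update derived on the next slide when αβ = 2 and t⁰ = 0:

```python
import numpy as np

W = np.array([[2/3, 1/3, 0.0, 0.0],
              [1/3, 1/3, 1/3, 0.0],
              [0.0, 1/3, 1/3, 1/3],
              [0.0, 0.0, 1/3, 2/3]])
I = np.eye(4)
b = np.array([1.0, 2.0, 3.0, 4.0])
grad = lambda x: x - b          # ∇f for f_i(x_i) = (x_i - b_i)^2 / 2
alpha, beta = 1.0, 2.0          # αβ = 2

# Run the (x, t) recursion from this slide.
x, t = np.zeros(4), np.zeros(4)
xs = [x]
for k in range(6):
    x_new = x - (t + grad(x)) / alpha             # α x^k - t^k - ∇f(x^k) = α x^{k+1}
    t = t + (I - W) @ (2 * x_new - x) / beta      # rearranged t-update
    x = x_new
    xs.append(x)

# Compare with EXTRA: x^{k+1} = (I+W)/2 (2x^k - x^{k-1}) - (1/α)(∇f(x^k) - ∇f(x^{k-1})).
M = (I + W) / 2
for k in range(1, 6):
    rhs = M @ (2 * xs[k] - xs[k - 1]) - (grad(xs[k]) - grad(xs[k - 1])) / alpha
    assert np.allclose(xs[k + 1], rhs)
```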
SLIDES 38-40

EXact firsT-ordeR Algorithm (EXTRA)

◮ From the previous slide:

αxk − tk − ∇f(xk) = αxk+1,
−(I − W)xk + βtk = −2(I − W)xk+1 + βtk+1.

◮ We have

αxk+1 = αxk − tk − ∇f(xk)
      = αxk − ((I − W)/β)(2xk − xk−1) − tk−1 − ∇f(xk)
      = αxk − ((I − W)/β)(2xk − xk−1) + αxk + ∇f(xk−1) − αxk−1 − ∇f(xk)
      = (αI − (I − W)/β)(2xk − xk−1) + ∇f(xk−1) − ∇f(xk).

◮ Let αβ = 2, and we have EXTRA (Shi et al. '15):

xk+1 = (I + W)/2 (2xk − xk−1) − (1/α)(∇f(xk) − ∇f(xk−1)).
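A compact implementation sketch of EXTRA, again assuming the quadratic ∇f(x) = x − b (names and parameter values are illustrative):

```python
import numpy as np

def extra(W, b, alpha, iters=300):
    """EXTRA with mixing matrix (I+W)/2 and stepsize 1/α, for ∇f(x) = x - b."""
    n = len(b)
    M = (np.eye(n) + W) / 2
    grad = lambda x: x - b
    x_prev = np.zeros(n)
    x = x_prev - grad(x_prev) / alpha   # initialization x^1 = x^0 - (1/α)∇f(x^0)
    for k in range(iters):
        x, x_prev = M @ (2 * x - x_prev) - (grad(x) - grad(x_prev)) / alpha, x
    return x

# The stepsize must satisfy both conditions on the next slides; for the
# path-graph W above, alpha = 1.0 works and yields consensus at b.mean().
```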
SLIDES 41-45

convergence conditions for EXTRA: I

EXTRA: xk+1 = (I + W)/2 (2xk − xk−1) − (1/α)(∇f(xk) − ∇f(xk−1)).

◮ If f = 0:

[xk+1; xk] = [I + W, −(I + W)/2; I, 0] [xk; xk−1].

◮ Let I + W = UΣU⊤. Then

[I + W, −(I + W)/2; I, 0] = [U, 0; 0, U] [Σ, −Σ/2; I, 0] [U⊤, 0; 0, U⊤].

◮ The iteration becomes

[U⊤xk+1; U⊤xk] = [Σ, −Σ/2; I, 0] [U⊤xk; U⊤xk−1].

◮ The condition for W is −2/3 < λ(Σ) = λ(W + I) ≤ 2, which is −5/3 < λ(W) ≤ 1.
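A numerical illustration of this condition, a sketch: for each eigenvalue σ of I + W, the 2×2 block [σ, −σ/2; 1, 0] must have spectral radius at most 1, with 1 attained only by the consensus mode at σ = 2:

```python
import numpy as np

for sigma in [-1.0, -0.7, -2/3, -0.5, 0.0, 0.5, 1.0, 1.5, 1.99, 2.0]:
    A = np.array([[sigma, -sigma / 2], [1.0, 0.0]])
    rho = max(abs(np.linalg.eigvals(A)))
    print(f"sigma = {sigma:6.3f}   spectral radius = {rho:.4f}")

# The printout shows rho < 1 exactly for -2/3 < sigma < 2, matching
# -2/3 < λ(I + W) ≤ 2, i.e., -5/3 < λ(W) ≤ 1.
```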
SLIDES 46-49

convergence conditions for EXTRA: II

EXTRA: xk+1 = (I + W)/2 (2xk − xk−1) − (1/α)(∇f(xk) − ∇f(xk−1)).

◮ If ∇f(xk) = xk − b:

[xk+1; xk] = [I + W − (1/α)I, −(I + W)/2 + (1/α)I; I, 0] [xk; xk−1].

◮ Let I + W = UΣU⊤. Then

[I + W − (1/α)I, −(I + W)/2 + (1/α)I; I, 0] = [U, 0; 0, U] [Σ − (1/α)I, −Σ/2 + (1/α)I; I, 0] [U⊤, 0; 0, U⊤].

◮ The condition for W is 4/(3α) − 2/3 < λ(Σ) = λ(W + I) ≤ 2, which is 4/(3α) − 5/3 < λ(W) ≤ 1. In addition, we have stepsize 1/α < 2.
SLIDES 50-54

conditions for general EXTRA

EXTRA: xk+1 = (I + W)/2 (2xk − xk−1) − (1/α)(∇f(xk) − ∇f(xk−1)).

Initialization: x1 = x0 − (1/α)∇f(x0).

Convergence condition (Li-Yan '17): 4/(3α) − 5/3 < λ(W) ≤ 1 and 1/α < 2/L.

Linear convergence condition:
◮ f(x) is strongly convex (Li-Yan '17), or
◮ a weaker condition on f(x) but more restrictive conditions on both parameters (Shi et al. '15).
SLIDE 55

large stepsizes, as in the centralized setting?
SLIDES 56-59

decentralized smooth optimization

Problem:  minimize_x  f(x),  s.t.  √(I − W) x = 0.

◮ Lagrangian: f(x) + ⟨√(I − W) x, s⟩, where s is the Lagrange multiplier.

◮ Optimality (KKT) conditions:

0 = ∇f(x*) + √(I − W) s*,
0 = −√(I − W) x*.

◮ It is the same as

[−∇f(x*); 0] = [0, √(I − W); −√(I − W), 0] [x*; s*].
SLIDES 60-64

forward-backward

◮ Now use forward-backward on the KKT system with the block-diagonal metric [αI, 0; 0, βI − (1/α)(I − W)]:

[αI, 0; 0, βI − (1/α)(I − W)] [xk; sk] − [∇f(xk); 0]
    = [αI, 0; 0, βI − (1/α)(I − W)] [xk+1; sk+1] + [0, √(I − W); −√(I − W), 0] [xk+1; sk+1].

◮ Combine the right-hand side:

[αI, 0; 0, βI − (1/α)(I − W)] [xk; sk] − [∇f(xk); 0] = [αI, √(I − W); −√(I − W), βI − (1/α)(I − W)] [xk+1; sk+1].

◮ Apply Gaussian elimination (add (1/α)√(I − W) times the first row to the second):

[αI, 0; √(I − W), βI − (1/α)(I − W)] [xk; sk] − [∇f(xk); (1/α)√(I − W)∇f(xk)] = [αI, √(I − W); 0, βI] [xk+1; sk+1].

◮ It is equivalent to

αxk − ∇f(xk) − √(I − W) sk+1 = αxk+1,
√(I − W) xk + β(I − (1/(αβ))(I − W)) sk − (1/α)√(I − W)∇f(xk) = βsk+1.
SLIDES 65-69

NIDS (Li-Shi-Yan '17)

From the previous slide:

αxk − ∇f(xk) − √(I − W) sk+1 = αxk+1,
√(I − W) xk + β(I − (1/(αβ))(I − W)) sk − (1/α)√(I − W)∇f(xk) = βsk+1.

Let t = √(I − W) s:

αxk − ∇f(xk) − tk+1 = αxk+1,
(I − W)xk + β(I − (1/(αβ))(I − W)) tk − (1/α)(I − W)∇f(xk) = βtk+1.

We have

αxk+1 = αxk − ∇f(xk) − tk+1
      = αxk − ∇f(xk) − (I − (1/(αβ))(I − W)) tk − (1/β)(I − W)xk + (1/(αβ))(I − W)∇f(xk)
      = (I − (1/(αβ))(I − W)) (αxk − tk − ∇f(xk))
      = (I − (1/(αβ))(I − W)) (αxk + αxk − αxk−1 + ∇f(xk−1) − ∇f(xk)).

Thus

xk+1 = (I − (1/(αβ))(I − W)) (2xk − xk−1 − (1/α)(∇f(xk) − ∇f(xk−1))).
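A compact NIDS sketch under the same quadratic assumption ∇f(x) = x − b; note that, unlike EXTRA, the gradient correction is also multiplied by the mixing matrix:

```python
import numpy as np

def nids(W, b, alpha, iters=300):
    """NIDS with αβ = 2 and mixing matrix (I+W)/2, for ∇f(x) = x - b (L = 1)."""
    n = len(b)
    M = (np.eye(n) + W) / 2
    grad = lambda x: x - b
    x_prev = np.zeros(n)
    x = x_prev - grad(x_prev) / alpha   # x^1 = x^0 - (1/α)∇f(x^0)
    for k in range(iters):
        x, x_prev = M @ (2 * x - x_prev - (grad(x) - grad(x_prev)) / alpha), x
    return x

# Here the stepsize condition is network independent: any 1/α < 2/L works,
# e.g., alpha = 0.6 (1/α ≈ 1.67) already gives consensus at b.mean().
```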
SLIDES 70-74

convergence conditions for NIDS

NIDS (with αβ = 2): xk+1 = (I + W)/2 (2xk − xk−1 − (1/α)(∇f(xk) − ∇f(xk−1))).

◮ If f = 0 (same as EXTRA): the condition for W is −5/3 < λ(W) ≤ 1.

◮ If ∇f(xk) = xk − b:

[xk+1; xk] = [(2 − 1/α)(I + W)/2, −(1 − 1/α)(I + W)/2; I, 0] [xk; xk−1].

◮ Let I + W = UΣU⊤. Then

[U⊤xk+1; U⊤xk] = [(2 − 1/α)Σ/2, −(1 − 1/α)Σ/2; I, 0] [U⊤xk; U⊤xk−1].

◮ Therefore, one sufficient condition is −5/3 < λ(W) ≤ 1 and 1/α < 2.
SLIDES 75-78

conditions of NIDS for general smooth functions

NIDS (with αβ = 2): xk+1 = (I + W)/2 (2xk − xk−1 − (1/α)(∇f(xk) − ∇f(xk−1))).

Initialization: x1 = x0 − (1/α)∇f(x0).

Convergence condition (Li-Yan '17): −5/3 < λ(W) ≤ 1 and 1/α < 2/L.

Linear convergence condition:
◮ f(x) is strongly convex and −1 < λ(W) ≤ 1 (Li-Shi-Yan '17), with rate

O( max{ 1 − µ/L, 1 − (1 − λ2(W))/(1 − λn(W)) } ).
SLIDES 79-82

NIDS vs EXTRA

EXTRA: xk+1 = (I + W)/2 (2xk − xk−1) − (1/α)(∇f(xk) − ∇f(xk−1)).
NIDS:  xk+1 = (I + W)/2 (2xk − xk−1 − (1/α)(∇f(xk) − ∇f(xk−1))).

◮ The difference is in the data to be communicated: NIDS also mixes the gradient correction.
◮ But NIDS has a larger range for its parameters than EXTRA.
◮ NIDS is faster than EXTRA.
SLIDES 83-86

advantages of NIDS

NIDS: xk+1 = (I + W)/2 (2xk − xk−1 − (1/α)(∇f(xk) − ∇f(xk−1))).

◮ The stepsize is large and does not depend on the network topology: 1/α < 2/L.
◮ Individual stepsizes can be included: 1/αi < 2/Li.
◮ The linear convergence contributions from the functions and from the network are separated:

O( max{ 1 − µ/L, 1 − (1 − λ2(W))/(1 − λn(W)) } ).

It matches the results for gradient descent and decentralized averaging without acceleration.
SLIDES 87-91

D2: stochastic NIDS (Huang et al. '18)

NIDS: xk+1 = (I + W)/2 (2xk − xk−1 − (1/α)(∇f(xk) − ∇f(xk−1))).

NIDS-stochastic (D2: Decentralized Training over Decentralized Data):

xk+1 = (I + W)/2 (2xk − xk−1 − (1/α)(∇f(xk, ξk) − ∇f(xk−1, ξk−1))).

◮ ∇f(xk, ξk) is a stochastic gradient obtained by sampling ξk from a distribution D.
◮ Unbiasedness: E_{ξ∼D} ∇f(x; ξ) = ∇f(x), ∀x.
◮ Bounded variance: E_{ξ∼D} ‖∇f(x; ξ) − ∇f(x)‖² ≤ σ², ∀x.
◮ Convergence result: if the stepsize is small enough (on the order of (c + √(T/n))⁻¹), the convergence rate is

O( σ/√(nT) + 1/T ).
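A stochastic variant sketch: the same NIDS recursion with the true gradient replaced by a noisy one (additive Gaussian noise of variance σ² stands in for sampling ξ ∼ D; all names are illustrative). Note that ∇f(xk−1, ξk−1) reuses the previous sample rather than redrawing it:

```python
import numpy as np

def d2(W, b, alpha, sigma=0.1, iters=2000, seed=0):
    """D2 / stochastic NIDS for ∇f(x) = x - b plus Gaussian gradient noise."""
    rng = np.random.default_rng(seed)
    n = len(b)
    M = (np.eye(n) + W) / 2
    sgrad = lambda x: x - b + sigma * rng.standard_normal(n)
    x_prev = np.zeros(n)
    g_prev = sgrad(x_prev)                   # sample ξ^0 at x^0
    x = x_prev - g_prev / alpha
    for k in range(iters):
        g = sgrad(x)                         # fresh sample ξ^k at x^k
        x, x_prev, g_prev = M @ (2 * x - x_prev - (g - g_prev) / alpha), x, g
    return x

# The iterates fluctuate around consensus at b.mean(), with a noise floor
# that shrinks as the stepsize 1/α decreases.
```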
SLIDE 92

numerical experiments

SLIDE 93

compared algorithms

◮ NIDS
◮ EXTRA/PG-EXTRA
◮ DIGing-ATC (Nedic et al. '16):

xk+1 = W(xk − αyk),
yk+1 = W(yk + ∇f(xk+1) − ∇f(xk)).

◮ accelerated distributed Nesterov gradient descent (Acc-DNGD-SC) (Qu-Li '17)
◮ dual friendly optimal algorithm (OA) for distributed optimization (Uribe et al. '17)
SLIDES 94-95

strongly convex: same stepsize

[Figure: convergence curves; x-axis: number of iterations (10-90); y-axis: 10⁻¹⁴ to 10⁻².]
SLIDE 96

strongly convex: adaptive stepsize

[Figure: convergence curves; x-axis: number of iterations (20-140); y-axis: 10⁻¹⁴ to 10⁻².]
SLIDES 97-98

linear convergence rate bottleneck

[Figures: convergence curves; x-axis: number of iterations (up to 400-450); y-axis: 10⁻²⁰ to 10⁵.]
SLIDE 99

nonsmooth functions

[Figure: convergence curves vs. number of iterations (up to 2×10⁴); y-axis: 10⁻⁸ to 10²; curves: NIDS-1/L, NIDS-1.5/L, NIDS-1.9/L, PGEXTRA-1/L, PGEXTRA-1.2/L, PGEXTRA-1.3/L, PGEXTRA-1.4/L.]
SLIDE 100

stochastic case: shuffled

[Figures: loss vs. # epochs for (a) TRANSFERLEARNING and (b) LENET; curves: Decentralized, D2, Centralized.]
SLIDE 101

stochastic case: unshuffled

[Figures: loss vs. # epochs for (a) TRANSFERLEARNING and (b) LENET; curves: Decentralized, D2, Centralized.]
SLIDES 102-111

conclusion and open questions

conclusion
◮ optimal bounds for EXTRA/PG-EXTRA
◮ new algorithm NIDS

open questions
◮ network construction
◮ preconditioning
◮ acceleration?
◮ directed networks?
◮ dynamic networks?
◮ asynchronous updates?
SLIDE 112

Paper 1: Z. Li, W. Shi, and M. Yan, "A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates," arXiv:1704.07807. Code: https://github.com/mingyan08/NIDS

Paper 2: Z. Li and M. Yan, "A primal-dual algorithm with optimal stepsizes and its application in decentralized consensus optimization," arXiv:1711.06785.

Paper 3: H. Tang, X. Lian, M. Yan, C. Zhang, and J. Liu, "D2: decentralized training over decentralized data," ICML 2018, 4848-4856. http://proceedings.mlr.press/v80/tang18a.html

Thank You!
