Simultaneous Inference for Massive Data: Distributed Bootstrap
Yang Yu1, Shih-Kang Chao2, Guang Cheng1
1Purdue University 2University of Missouri
ICML 2020

Setup
◮ We have N i.i.d. data points: Z1, . . . , ZN
◮ Estimation: fit a model that has an unknown parameter θ∗ ∈ Rd
◮ Target: θ∗ = argmin_{θ∈Rd} EZ[L(θ; Z)]
◮ Estimator: θ̂ = argmin_{θ∈Rd} (1/N) Σ_{i=1}^N L(θ; Zi)
◮ Linear regression: Z = (X, Y), L(θ; Z) = (Y − X⊤θ)²/2
◮ Logistic regression: Z = (X, Y), L(θ; Z) = −Y X⊤θ + log(1 + exp(X⊤θ))
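As a concrete illustration, the two losses above can be written down directly; a minimal NumPy sketch (function names are ours, not from the talk):

```python
import numpy as np

def linear_loss(theta, X, y):
    # Squared-error loss (y - x^T theta)^2 / 2, averaged over the sample
    r = y - X @ theta
    return 0.5 * np.mean(r ** 2)

def logistic_loss(theta, X, y):
    # Logistic loss -y x^T theta + log(1 + exp(x^T theta)), averaged;
    # np.logaddexp(0, s) evaluates log(1 + exp(s)) stably
    s = X @ theta
    return np.mean(-y * s + np.logaddexp(0.0, s))
```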
Inference
◮ Marginal: P(θ∗1 ∈ [L, U]) ≈ 95%
◮ Simultaneous: P(θ∗1 ∈ [L1, U1], . . . , θ∗d ∈ [Ld, Ud]) ≈ 95%
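A common way to calibrate such simultaneous intervals is via the quantile of a bootstrapped max-|t| statistic; the sketch below shows that generic recipe given bootstrap draws of the estimator (an illustration of the general idea, not the paper's algorithm):

```python
import numpy as np

def simultaneous_ci(theta_hat, boot_draws, level=0.95):
    # boot_draws: (B, d) bootstrap replicates of the d-dimensional estimator.
    # Calibrate one critical value for the max-|t| statistic so that all d
    # intervals [L_r, U_r] hold jointly with probability ~ level.
    dev = boot_draws - theta_hat            # (B, d) bootstrap deviations
    scale = dev.std(axis=0)                 # per-coordinate spread
    tmax = np.abs(dev / scale).max(axis=1)  # sup over the d coordinates
    c = np.quantile(tmax, level)            # joint critical value
    half = c * scale
    return theta_hat - half, theta_hat + half
```

Because one critical value c covers the maximum over coordinates, all d intervals hold jointly, unlike d separate marginal intervals.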
Distributed setting
◮ 1 master node M1
◮ k − 1 worker nodes M2, M3, . . . , Mk
◮ Zij: the i-th data point at machine Mj
1 Kleiner, et al. "A scalable bootstrap for massive data." JRSS-B (2014)
2 Sengupta, et al. "A subsampled double bootstrap for massive data." JASA (2016)
◮ θ̂ can be approximated by existing efficient distributed estimation methods
◮ Traditional bootstrap cannot be efficiently applied in the distributed setting
◮ BLB1 and SDB2 are computationally expensive due to repeated model refitting
Contributions
◮ We propose communication-efficient and computation-efficient distributed bootstrap methods for simultaneous inference
◮ We prove a sufficient number of communication rounds that guarantees statistical accuracy
[Figure: the N data points, drawn i.i.d. from a distribution D, are partitioned across the k machines, with n = N/k samples per machine.]
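The data layout above can be sketched as a random split of the N rows into k shards (a minimal sketch; the function name and seeding are ours):

```python
import numpy as np

def partition(Z, k, seed=0):
    # Shuffle the N rows and deal them into k shards of n = N // k rows each;
    # shard 0 plays the role of the master M1, the rest the workers M2..Mk.
    rng = np.random.default_rng(seed)
    N = Z.shape[0]
    n = N // k
    idx = rng.permutation(N)[: n * k]
    return [Z[idx[j * n:(j + 1) * n]] for j in range(k)]
```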
Distributed estimation via the Communication-efficient Surrogate Likelihood (CSL)1
◮ Workers Mj, j = 2, . . . , k, send their local average gradients ∇Lj(θ̄) to the master
◮ The master aggregates (1/k) Σ_{j=1}^k ∇Lj(θ̄) and refines the estimate by minimizing a surrogate loss; this update is repeated over τ communication rounds
1 Jordan, et al. "Communication-efficient distributed statistical inference." JASA (2019)
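For illustration, a k-grad-style multiplier bootstrap over the machine-level gradients might look as follows; the normalization and variable names here are simplified assumptions, not a faithful reproduction of the paper's procedure:

```python
import numpy as np

def k_grad_bootstrap(grads, B=500, seed=0):
    # grads: (k, d) array, row j = average gradient on machine M_j at the
    # current iterate.  Each bootstrap draw reweights the recentered
    # machine-level gradients with i.i.d. N(0, 1) multipliers and records
    # the sup-norm of the resulting average; quantiles of these statistics
    # can then calibrate simultaneous confidence intervals.
    rng = np.random.default_rng(seed)
    k, d = grads.shape
    centered = grads - grads.mean(axis=0)   # recenter across machines
    eps = rng.standard_normal((B, k))       # Gaussian multipliers
    stats = np.abs(eps @ centered / np.sqrt(k)).max(axis=1)
    return stats                            # (B,) sup-norm statistics
```

Only the k machine-level gradients are resampled, so no raw data leaves the workers and no model is ever refit, which is the source of the communication and computation savings.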
[Figure: theoretical minimum number of communication rounds τmin over (γn, γk) = (logd n, logd k); left: k-grad, right: n+k-1-grad; shaded regions mark τmin = 1, 2, 3, 4 and τmin ≥ 5.]
◮ τmin ր logarithmically as k ր, n ց and d ր
◮ τmin,n+k-1-grad ≤ τmin,k-grad
◮ τmin,n+k-1-grad ≥ 1
◮ γk has to be large for k-grad, but not for n+k-1-grad
[Figure: the same τmin regions over a wider range, γn, γk ∈ [3, 15].]
[Figure: empirical coverage vs. log2 k for τ = 1, 2, 3, 4.]
◮ Width (logistic regression, left: d = 2^5, right: d = 2^7)
[Figure: interval width (×10−1) vs. log2 k for k-grad and n+k-1-grad at τ = 1 and τ = 4, compared with BLB and SDB.]
◮ Run time in seconds (linear regression, d = 2^7)
Extensions
◮ To other models, e.g., graphical models
◮ To high-dimensional sparse models (in progress)