Nonparametric Distributed Learning Architecture: Algorithm and Application
Scott Bruce, Zeda Li, Hsiang-Chieh (Alex) Yang, and Subhadeep (DEEP) Mukhopadhyay
Temple University
Big Data Statistical Inference: Motivating Example
Goal: A nonparametric two-sample inference algorithm for the Expedia personalized hotel recommendation engine. We develop a scalable distributed algorithm that can mine search data from millions of travelers to find the features that best predict a customer's likelihood of booking a hotel.

Key Challenges
Variety: Different data types require different statistical measures.
Volume: Over 10 million observations across 52 variables.
Scalability: Distributed, parallel processing for massive data analysis.
Summary of Main Contributions
Dramatic increases in the size of datasets have made traditional "centralized" statistical inference techniques prohibitive. Surprisingly, very little attention has been given to developing inferential algorithms for data whose volume exceeds the capacity of a single-machine system; indeed, big data statistical inference is very much in its nascent stage of development. A question of immediate concern is: how can we design a data-intensive statistical inference architecture without changing the fundamental data modeling principles developed for 'small' data over the last century? To address this problem we present MetaLP, a flexible and distributed statistical modeling paradigm suitable for large-scale data analysis that addresses (1) massive volume and (2) variety, the mixed data problem.
LP Nonparametric Harmonic Analysis
The conventional statistical approach fails to address the 'mixed data problem'. We resolve this by representing the data in a new transform domain via a specially designed procedure (analogous to the time → frequency domain representation via the Fourier transform).

Theorem (Mukhopadhyay and Parzen, 2014). Any random variable X (discrete or continuous) with finite variance admits the decomposition
X − E(X) = \sum_{j>0} T_j(X; X) \, E[X \, T_j(X; X)], with probability 1.

Traditional and modern statistical measures developed for different data types can be compactly expressed as inner products in the LP Hilbert space.
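As a rough illustration of how the score functions T_j(X; X) can be built empirically (a minimal sketch of ours, not the authors' implementation): take the centered, scaled mid-distribution transform as T_1 and Gram-Schmidt orthonormalize its powers under the empirical distribution. Function and variable names below are our own.

```python
import numpy as np

def lp_score_functions(x, m=4):
    """Sketch: first m empirical LP orthonormal score functions evaluated at x.

    Assumes x has more than m distinct values; follows the mid-distribution
    construction described in Mukhopadhyay and Parzen (2014)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    vals, counts = np.unique(x, return_counts=True)
    p = counts / n                                  # empirical probability of each atom
    Fmid = np.cumsum(p) - 0.5 * p                   # mid-distribution function at the atoms
    fm = Fmid[np.searchsorted(vals, x)]             # mid-distribution value per observation
    t1 = (fm - fm.mean()) / fm.std()                # first score function T_1
    powers = np.column_stack([t1 ** j for j in range(1, m + 1)])
    T = np.empty_like(powers)
    for j in range(m):
        v = powers[:, j] - powers[:, j].mean()      # orthogonal to the constant function
        for k in range(j):                          # remove projections on earlier scores
            v -= np.mean(v * T[:, k]) * T[:, k]
        T[:, j] = v / np.sqrt(np.mean(v ** 2))      # unit empirical L2 norm
    return T

# toy check: the scores are empirically orthonormal
rng = np.random.default_rng(0)
los = rng.poisson(2.0, size=2000) + 1               # toy discrete variable
T = lp_score_functions(los, m=4)
print(np.round(T.T @ T / len(los), 2))              # approximately the identity matrix
```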
Data-Adaptive Shapes
Figure 1: The left 2x2 panel shows the first four LP orthonormal score functions for the discrete variable length_of_stay. The right 2x2 panel shows the shapes of the score functions for the continuous variable price_usd.
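A hedged sketch of how one might reproduce the flavor of Figure 1: evaluate the first four empirical score functions of a toy discrete variable (a stand-in for length_of_stay) and plot them against an approximate mid-distribution scale, reusing lp_score_functions() from the sketch above. Plotting details are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.poisson(2.0, size=2000) + 1                 # toy discrete variable
T = lp_score_functions(x, m=4)
u = (np.argsort(np.argsort(x)) + 0.5) / len(x)      # approximate rank (u) scale

fig, axes = plt.subplots(2, 2, figsize=(7, 5), sharex=True)
order = np.argsort(u)
for j, ax in enumerate(axes.ravel()):
    ax.step(u[order], T[order, j], where="mid")     # piecewise-constant shape for discrete X
    ax.set_title(f"T{j + 1}")
plt.tight_layout()
plt.show()
```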
LP Hilbert Functional Representation
Define the two-sample LP statistic for variable selection of a mixed random variable X (either continuous or discrete) based on our specially designed score functions:

LP[j; X, Y] = Cor[T_j(X; X), Y] = E[T_j(X; X) \, T_1(Y; Y)].   (1)

Properties
Sample LP statistics: \sqrt{n} \, \widehat{LP}[j; X, Y] asymptotically converge to i.i.d. standard normal distributions (Mukhopadhyay and Parzen, 2014).
LP[1; X, Y] unifies various measures of linear association for different data-type combinations.
Higher-order LP statistics capture distributional differences.
Allows data scientists to write a single computing formula irrespective of data type, with a common metric and common asymptotic characteristics: a step towards unified algorithms.
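A minimal sketch of Eq. (1) for a binary Y (e.g. a booking indicator), reusing lp_score_functions() from the earlier sketch. LP[j; X, Y] is estimated as the sample mean of T_j(X; X) times the standardized Y; names and the toy data are illustrative.

```python
import numpy as np

def lp_statistics(x, y, m=4):
    """Estimate LP[j; X, Y] = E[T_j(X; X) T_1(Y; Y)] for j = 1, ..., m."""
    T = lp_score_functions(x, m=m)
    y = np.asarray(y, dtype=float)
    t1y = (y - y.mean()) / y.std()                  # T_1(Y; Y) for a binary Y
    return T.T @ t1y / len(y)

# toy usage: under no association, sqrt(n) * LP[j; X, Y] is approximately N(0, 1)
rng = np.random.default_rng(2)
x = rng.normal(size=5000)
y = (x + rng.normal(size=5000) > 0).astype(int)
print(np.round(lp_statistics(x, y, m=4), 3))
```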
Meta-Analysis and Data-Parallelism
The key is to recognize that meta-analytic logic provides a formal statistical framework for addressing: how to judiciously combine the "local" LP-inferences executed in parallel by different servers to get the "global" inference for the original big data?

Towards large-scale parallel computing: we use meta-analysis to parallelize the statistical inference process for massive datasets.
What to Combine?
Instead of simply providing point estimates, we seek a distribution estimator (analogous to the Bayesian posterior distribution) for the LP statistics via a Confidence Distribution (CD), which contains the information needed for virtually all types of statistical inference (e.g. estimation, hypothesis testing, confidence intervals).

Definition (Confidence Distribution). Suppose \Theta is the parameter space of the unknown parameter of interest \theta, and \omega is the sample space corresponding to the data X_n = \{X_1, X_2, \ldots, X_n\}. Then a function H_n(\cdot) = H_n(X_n, \cdot) on \omega \times \Theta \to [0, 1] is a confidence distribution (CD) if:
(i) for each given X_n \in \omega, H_n(\cdot) is a continuous cumulative distribution function on \Theta;
(ii) at the true parameter value \theta = \theta_0, H_n(\theta_0) = H_n(X_n, \theta_0), as a function of the sample X_n, follows U[0, 1].
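For concreteness (an example of ours, implied by the asymptotic normality of the sample LP statistics stated earlier), a single subpopulation \ell of size n_\ell yields the normal-based asymptotic CD (aCD)

H_\ell(LP[j; X, Y]) = \Phi\big( \sqrt{n_\ell} \, ( LP[j; X, Y] − \widehat{LP}_\ell[j; X, Y] ) \big),

which is exactly the form that the next slides combine across subpopulations.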
How to Combine?
The combining function for CDs across k different studies (subpopulations) can be expressed as
H^{(c)}(LP[j; X, Y]) = G_c\{ g_c( H(LP_1[j; X, Y]), \ldots, H(LP_k[j; X, Y]) ) \}.
The function G_c is determined by the monotonic function g_c via G_c(t) = P( g_c(U_1, \ldots, U_k) \le t ), where U_1, \ldots, U_k are independent U[0, 1] random variables. A popular and useful choice for g_c is
g_c(u_1, \ldots, u_k) = \alpha_1 F_0^{-1}(u_1) + \cdots + \alpha_k F_0^{-1}(u_k),
where F_0(\cdot) is a given cumulative distribution function and \alpha_\ell \ge 0, with at least one \alpha_\ell \neq 0, are generic weights.
Combining Formula for the LP CDs
Theorem (Bruce, Li, Yang and Mukhopadhyay, 2016). Setting F_0^{-1}(t) = \Phi^{-1}(t) and \alpha_\ell = \sqrt{n_\ell}, where n_\ell is the size of subpopulation \ell = 1, \ldots, k, the combined aCD for LP[j; X, Y] is
H^{(c)}(LP[j; X, Y]) = \Phi\Big( \big( \sum_{\ell=1}^{k} n_\ell \big)^{1/2} \big( LP[j; X, Y] − \overline{LP}^{(c)}[j; X, Y] \big) \Big),
where
\overline{LP}^{(c)}[j; X, Y] = \frac{ \sum_{\ell=1}^{k} n_\ell \, \widehat{LP}_\ell[j; X, Y] }{ \sum_{\ell=1}^{k} n_\ell },
and \overline{LP}^{(c)}[j; X, Y] and \big( \sum_{\ell=1}^{k} n_\ell \big)^{-1} are, respectively, the mean and variance of the combined aCD for LP[j; X, Y].
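A minimal sketch of this combining step: each worker reports its local LP estimate and subpopulation size, and the driver forms the combined aCD, which is Normal with the mean and variance stated above. Function and variable names and the toy numbers are our own.

```python
import numpy as np
from scipy import stats

def combine_lp_cd(lp_local, n_local):
    """Mean and variance of the combined aCD for LP[j; X, Y]."""
    lp_local, n_local = np.asarray(lp_local, float), np.asarray(n_local, float)
    mean = np.sum(n_local * lp_local) / np.sum(n_local)   # n_l-weighted mean of local estimates
    var = 1.0 / np.sum(n_local)                           # (sum of n_l)^(-1)
    return mean, var

# usage: a 95% confidence interval read off the combined aCD (toy numbers)
m, v = combine_lp_cd([0.061, 0.058, 0.064, 0.060], [25000, 24000, 26000, 25000])
lo, hi = stats.norm.ppf([0.025, 0.975], loc=m, scale=np.sqrt(v))
print(round(m, 4), (round(lo, 4), round(hi, 4)))
```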
Parallel 'Broken' Big Datasets are Often Heterogeneous
Failure to take heterogeneity into account can easily spoil the big data discovery process. We model the subpopulation-level estimates hierarchically:
\widehat{LP}_\ell[j; X, Y] \mid LP_\ell[j; X, Y], s_\ell \;\sim\; N( LP_\ell[j; X, Y], s_\ell^2 ), independently,   (2)
LP_\ell[j; X, Y] \mid LP[j; X, Y], \tau \;\overset{iid}{\sim}\; N( LP[j; X, Y], \tau^2 ).   (3)
Figure 2: Histograms of the LP statistic for the variable price_usd under a random partition and under a partition by visitor_location_country_id.
Heterogeneity-Corrected LP Confidence Distribution
Theorem (Bruce, Li, Yang and Mukhopadhyay, 2016). Setting F_0^{-1}(t) = \Phi^{-1}(t) and \alpha_\ell = 1/\sqrt{\tau^2 + (1/n_\ell)}, where n_\ell is the size of subpopulation \ell = 1, \ldots, k, the combined aCD for LP[j; X, Y] is
H^{(c)}(LP[j; X, Y]) = \Phi\Big( \big( \sum_{\ell=1}^{k} \frac{1}{\tau^2 + (1/n_\ell)} \big)^{1/2} \big( LP[j; X, Y] − \overline{LP}^{(c)}[j; X, Y] \big) \Big),
where
\overline{LP}^{(c)}[j; X, Y] = \frac{ \sum_{\ell=1}^{k} (\tau^2 + (1/n_\ell))^{-1} \, \widehat{LP}_\ell[j; X, Y] }{ \sum_{\ell=1}^{k} (\tau^2 + (1/n_\ell))^{-1} },
and \overline{LP}^{(c)}[j; X, Y] and \big( \sum_{\ell=1}^{k} 1/(\tau^2 + (1/n_\ell)) \big)^{-1} are, respectively, the mean and variance of the combined aCD for LP[j; X, Y].
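A corresponding sketch for the heterogeneity-corrected theorem: the same recipe as before, but each subpopulation is down-weighted by (tau^2 + 1/n_l). The heterogeneity parameter tau^2 is assumed to be estimated separately (e.g. by a standard meta-analytic estimator); here it is simply an input.

```python
import numpy as np

def combine_lp_cd_heterogeneous(lp_local, n_local, tau2):
    """Mean and variance of the heterogeneity-corrected combined aCD."""
    lp_local, n_local = np.asarray(lp_local, float), np.asarray(n_local, float)
    w = 1.0 / (tau2 + 1.0 / n_local)                # weights (tau^2 + 1/n_l)^(-1)
    return np.sum(w * lp_local) / np.sum(w), 1.0 / np.sum(w)

# usage: with tau2 = 0 this reduces exactly to the homogeneous formula above
print(combine_lp_cd_heterogeneous([0.05, 0.07, 0.02], [30000, 20000, 10000], tau2=1e-4))
```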
Expedia: Variable Importance, Impact of Regularization
Figure 3: 95% confidence intervals for each variable's LP statistic under random sampling partitioning (black) and country ID partitioning (red).
Expedia - Robust to Subpopulation Size
[Figure 4: nine panels, one per variable (prop_location_score2, promotion_flag, price_usd, srch_children_count, srch_booking_window, srch_room_count, prop_log_historical_price, comp1_inv, comp1_rate_percent_diff), each plotting the LP statistic against the number of subpopulations, from 100 to 500.]
Figure 4: LP statistics and 95% confidence intervals for nine variables across different numbers of subpopulations (dotted line is at zero).
Simpson’s Paradox - MetaLP Perspective
Dept   Male              Female
A      62% (512/825)     82% (89/108)
B      63% (353/560)     68% (17/25)
C      37% (120/325)     34% (202/593)
D      33% (138/417)     35% (131/375)
E      28% (53/191)      24% (94/393)
F       6% (22/373)       7% (24/341)
All    45% (1198/2691)   30% (557/1835)
Table 1: UC Berkeley admission rates by gender and department (Bickel et al., 1975).
Figure 5: 95% confidence intervals for the LP statistic between gender and admission within each department (A-F) and for the MetaLP aggregate.
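A hedged sketch of the MetaLP reading of Table 1: within each department, the LP[1] statistic between gender and admission reduces to the phi coefficient (the Pearson correlation of two binary variables), and the department-level estimates are then combined with the heterogeneity-corrected recipe above. The tau2 value is illustrative, not taken from the paper.

```python
import numpy as np

# (admitted_male, total_male, admitted_female, total_female), from Table 1
berkeley = {
    "A": (512, 825, 89, 108), "B": (353, 560, 17, 25), "C": (120, 325, 202, 593),
    "D": (138, 417, 131, 375), "E": (53, 191, 94, 393), "F": (22, 373, 24, 341),
}

lp_local, n_local = [], []
for am, tm, af, tf in berkeley.values():
    x = np.r_[np.ones(tm), np.zeros(tf)]                              # 1 = male
    y = np.r_[np.ones(am), np.zeros(tm - am), np.ones(af), np.zeros(tf - af)]
    lp_local.append(np.corrcoef(x, y)[0, 1])                          # phi coefficient
    n_local.append(tm + tf)

# department-level associations differ in sign (compare departments A and C in Table 1)
print(combine_lp_cd_heterogeneous(lp_local, n_local, tau2=0.01))
```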
Stein’s Paradox - MetaLP Perspective
Name            hits/AB   mu_hat(MLE)   mu_i    mu_hat(JS)   mu_hat(LP)
Clemente        18/45     .400          .346    .294         .276
F Robinson      17/45     .378          .298    .289         .274
F Howard        16/45     .356          .276    .285         .272
Johnstone       15/45     .333          .222    .280         .270
Berry           14/45     .311          .273    .275         .268
Spencer         14/45     .311          .270    .275         .268
Kessinger       13/45     .289          .263    .270         .265
L Alvarado      12/45     .267          .210    .266         .263
Santo           11/45     .244          .269    .261         .261
Swoboda         11/45     .244          .230    .261         .261
Unser           10/45     .222          .264    .256         .258
Williams        10/45     .222          .256    .256         .258
Scott           10/45     .222          .303    .256         .258
Petrocelli      10/45     .222          .264    .256         .258
E Rodriguez     10/45     .222          .226    .256         .258
Campaneris       9/45     .200          .286    .252         .256
Munson           8/45     .178          .316    .247         .253
Alvis            7/45     .156          .200    .242         .251
Grand Average             .265          .265    .265         .263
Table 2: Batting averages mu_hat(MLE)_i for 18 major league players early in the 1970 season; the mu_i values are averages over the remainder of the season. The James-Stein estimates mu_hat(JS)_i and the MetaLP estimates mu_hat(LP)_i provide much more accurate overall predictions of the mu_i values than the MLE: the MSE ratio of mu_hat(JS)_i to mu_hat(MLE)_i is 0.283 and the MSE ratio of mu_hat(LP)_i to mu_hat(MLE)_i is 0.293, showing comparable efficiency.
Final Remarks on Big Data Statistical Inference
To address the methodological and computational challenges of big data analysis, we have outlined a general theoretical foundation that, we believe, may provide the missing link between small-data and big-data science. Instead of developing distributed versions of statistical algorithms on a case-by-case basis, we develop a systematic and automatic strategy: a generic platform that extends traditional and modern statistical modeling tools to large datasets by fully utilizing parallel and distributed processing power, thereby addressing one of the biggest bottlenecks for data-intensive statistical inference.
Selected References
Mukhopadhyay, S. and Parzen, E. (2014). LP approach to statistical modeling. arXiv:1405.2601.
Parzen, E. (2013). Discussion of "Confidence distribution, the frequentist distribution estimator of a parameter: a review" by Min-ge Xie and Kesar Singh. International Statistical Review, 81, 48-52.
Hedges, L. V. and Olkin, I. (1985). Statistical Methods for Meta-Analysis. London: Academic Press.
Xie, M., Singh, K., and Strawderman, W. E. (2011). Confidence distributions and a unifying framework for meta-analysis. Journal of the American Statistical Association, 106, 320-333.