Nonparametric Distributed Learning Architecture: Algorithm and Application
Scott Bruce, Zeda Li, Hsiang-Chieh (Alex) Yang, and Subhadeep (DEEP) Mukhopadhyay
Temple University
Big Data Statistical Inference: Motivating Example
Goal: A nonparametric two-sample inference algorithm for the Expedia personalized hotel recommendation engine. We develop a scalable distributed algorithm that can mine search data from millions of travelers to find the features that best predict a customer's likelihood of booking a hotel.

Key Challenges
Variety: Different data types require different statistical measures.
Volume: Over 10 million observations across 52 variables.
Scalability: Distributed, parallel processing for massive data analysis.
Summary of Main Contributions
Dramatic increases in the size of datasets have made traditional "centralized" statistical inference techniques prohibitive. Surprisingly, very little attention has been given to developing inferential algorithms for data whose volume exceeds the capacity of a single-machine system; indeed, big data statistical inference is very much in its nascent stage of development. A question of immediate concern is: how can we design a data-intensive statistical inference architecture without changing the fundamental data modeling principles developed for 'small' data over the last century? To address this problem we present MetaLP, a flexible and distributed statistical modeling paradigm suitable for large-scale data analysis that addresses (1) massive volume and (2) variety, the mixed data problem.
LP Nonparametric Harmonic Analysis
The conventional statistical approach fails to address the 'mixed data problem'. We resolve this by representing the data in a new transform domain via a specially designed procedure (analogous to the time → frequency domain representation via the Fourier transform).

Theorem (Mukhopadhyay and Parzen, 2014). Any random variable X (discrete or continuous) with finite variance admits the decomposition
X − E(X) = \sum_{j>0} T_j(X; X) \, E[X \, T_j(X; X)], with probability 1.

Traditional and modern statistical measures developed for different data types can be compactly expressed as inner products in the LP Hilbert space.
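As a rough illustration of how the score functions T_j(X; X) can be built empirically (a minimal sketch of ours, not the authors' implementation): take the centered, scaled mid-distribution transform as T_1 and Gram-Schmidt orthonormalize its powers under the empirical distribution. Function and variable names below are our own.

```python
import numpy as np

def lp_score_functions(x, m=4):
    """Sketch: first m empirical LP orthonormal score functions evaluated at x.

    Assumes x has more than m distinct values; follows the mid-distribution
    construction described in Mukhopadhyay and Parzen (2014)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    vals, counts = np.unique(x, return_counts=True)
    p = counts / n                                  # empirical probability of each atom
    Fmid = np.cumsum(p) - 0.5 * p                   # mid-distribution function at the atoms
    fm = Fmid[np.searchsorted(vals, x)]             # mid-distribution value per observation
    t1 = (fm - fm.mean()) / fm.std()                # first score function T_1
    powers = np.column_stack([t1 ** j for j in range(1, m + 1)])
    T = np.empty_like(powers)
    for j in range(m):
        v = powers[:, j] - powers[:, j].mean()      # orthogonal to the constant function
        for k in range(j):                          # remove projections on earlier scores
            v -= np.mean(v * T[:, k]) * T[:, k]
        T[:, j] = v / np.sqrt(np.mean(v ** 2))      # unit empirical L2 norm
    return T

# toy check: the scores are empirically orthonormal
rng = np.random.default_rng(0)
los = rng.poisson(2.0, size=2000) + 1               # toy discrete variable
T = lp_score_functions(los, m=4)
print(np.round(T.T @ T / len(los), 2))              # approximately the identity matrix
```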
Data-Adaptive Shapes
Figure 1: The left 2x2 panel shows the first four LP orthonormal score functions for the discrete variable length_of_stay. The right 2x2 panel shows the shapes of the score functions for the continuous variable price_usd.
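A hedged sketch of how one might reproduce the flavor of Figure 1: evaluate the first four empirical score functions of a toy discrete variable (a stand-in for length_of_stay) and plot them against an approximate mid-distribution scale, reusing lp_score_functions() from the sketch above. Plotting details are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.poisson(2.0, size=2000) + 1                 # toy discrete variable
T = lp_score_functions(x, m=4)
u = (np.argsort(np.argsort(x)) + 0.5) / len(x)      # approximate rank (u) scale

fig, axes = plt.subplots(2, 2, figsize=(7, 5), sharex=True)
order = np.argsort(u)
for j, ax in enumerate(axes.ravel()):
    ax.step(u[order], T[order, j], where="mid")     # piecewise-constant shape for discrete X
    ax.set_title(f"T{j + 1}")
plt.tight_layout()
plt.show()
```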
LP Hilbert Functional Representation
Define the two-sample LP statistic for variable selection of a mixed random variable X (either continuous or discrete) based on our specially designed score functions:

LP[j; X, Y] = Cor[T_j(X; X), Y] = E[T_j(X; X) \, T_1(Y; Y)].   (1)

Properties
Sample LP statistics: \sqrt{n} \, \widehat{LP}[j; X, Y] asymptotically converge to i.i.d. standard normal distributions (Mukhopadhyay and Parzen, 2014).
LP[1; X, Y] unifies various measures of linear association for different data-type combinations.
Higher-order LP statistics capture distributional differences.
Allows data scientists to write a single computing formula irrespective of data type, with a common metric and common asymptotic characteristics: a step towards unified algorithms.
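A minimal sketch of Eq. (1) for a binary Y (e.g. a booking indicator), reusing lp_score_functions() from the earlier sketch. LP[j; X, Y] is estimated as the sample mean of T_j(X; X) times the standardized Y; names and the toy data are illustrative.

```python
import numpy as np

def lp_statistics(x, y, m=4):
    """Estimate LP[j; X, Y] = E[T_j(X; X) T_1(Y; Y)] for j = 1, ..., m."""
    T = lp_score_functions(x, m=m)
    y = np.asarray(y, dtype=float)
    t1y = (y - y.mean()) / y.std()                  # T_1(Y; Y) for a binary Y
    return T.T @ t1y / len(y)

# toy usage: under no association, sqrt(n) * LP[j; X, Y] is approximately N(0, 1)
rng = np.random.default_rng(2)
x = rng.normal(size=5000)
y = (x + rng.normal(size=5000) > 0).astype(int)
print(np.round(lp_statistics(x, y, m=4), 3))
```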
Meta-Analysis and Data-Parallelism
The key is to recognize that meta-analytic logic provides a formal statistical framework for addressing: how to judiciously combine the "local" LP-inferences executed in parallel by different servers to get the "global" inference for the original big data?

Towards large-scale parallel computing: we use meta-analysis to parallelize the statistical inference process for massive datasets.
What to Combine?
Instead of simply providing point estimates, we seek a distribution estimator (analogous to the Bayesian posterior distribution) for the LP statistics via a Confidence Distribution (CD), which contains the information needed for virtually all types of statistical inference (e.g. estimation, hypothesis testing, confidence intervals).

Definition (Confidence Distribution). Suppose \Theta is the parameter space of the unknown parameter of interest \theta, and \omega is the sample space corresponding to the data X_n = \{X_1, X_2, \ldots, X_n\}. Then a function H_n(\cdot) = H_n(X_n, \cdot) on \omega \times \Theta \to [0, 1] is a confidence distribution (CD) if:
(i) for each given X_n \in \omega, H_n(\cdot) is a continuous cumulative distribution function on \Theta;
(ii) at the true parameter value \theta = \theta_0, H_n(\theta_0) = H_n(X_n, \theta_0), as a function of the sample X_n, follows U[0, 1].
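For concreteness (an example of ours, implied by the asymptotic normality of the sample LP statistics stated earlier), a single subpopulation \ell of size n_\ell yields the normal-based asymptotic CD (aCD)

H_\ell(LP[j; X, Y]) = \Phi\big( \sqrt{n_\ell} \, ( LP[j; X, Y] − \widehat{LP}_\ell[j; X, Y] ) \big),

which is exactly the form that the next slides combine across subpopulations.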
How to Combine?
The combining function for CDs across k different studies (subpopulations) can be expressed as
H^{(c)}(LP[j; X, Y]) = G_c\{ g_c( H(LP_1[j; X, Y]), \ldots, H(LP_k[j; X, Y]) ) \}.
The function G_c is determined by the monotonic function g_c via G_c(t) = P( g_c(U_1, \ldots, U_k) \le t ), where U_1, \ldots, U_k are independent U[0, 1] random variables. A popular and useful choice for g_c is
g_c(u_1, \ldots, u_k) = \alpha_1 F_0^{-1}(u_1) + \cdots + \alpha_k F_0^{-1}(u_k),
where F_0(\cdot) is a given cumulative distribution function and \alpha_\ell \ge 0, with at least one \alpha_\ell \neq 0, are generic weights.
Combining Formula for the LP CDs
Theorem (Bruce, Li, Yang and Mukhopadhyay, 2016). Setting F_0^{-1}(t) = \Phi^{-1}(t) and \alpha_\ell = \sqrt{n_\ell}, where n_\ell is the size of subpopulation \ell = 1, \ldots, k, the combined aCD for LP[j; X, Y] is
H^{(c)}(LP[j; X, Y]) = \Phi\Big( \big( \sum_{\ell=1}^{k} n_\ell \big)^{1/2} \big( LP[j; X, Y] − \overline{LP}^{(c)}[j; X, Y] \big) \Big),
where
\overline{LP}^{(c)}[j; X, Y] = \frac{ \sum_{\ell=1}^{k} n_\ell \, \widehat{LP}_\ell[j; X, Y] }{ \sum_{\ell=1}^{k} n_\ell },
and \overline{LP}^{(c)}[j; X, Y] and \big( \sum_{\ell=1}^{k} n_\ell \big)^{-1} are, respectively, the mean and variance of the combined aCD for LP[j; X, Y].
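A minimal sketch of this combining step: each worker reports its local LP estimate and subpopulation size, and the driver forms the combined aCD, which is Normal with the mean and variance stated above. Function and variable names and the toy numbers are our own.

```python
import numpy as np
from scipy import stats

def combine_lp_cd(lp_local, n_local):
    """Mean and variance of the combined aCD for LP[j; X, Y]."""
    lp_local, n_local = np.asarray(lp_local, float), np.asarray(n_local, float)
    mean = np.sum(n_local * lp_local) / np.sum(n_local)   # n_l-weighted mean of local estimates
    var = 1.0 / np.sum(n_local)                           # (sum of n_l)^(-1)
    return mean, var

# usage: a 95% confidence interval read off the combined aCD (toy numbers)
m, v = combine_lp_cd([0.061, 0.058, 0.064, 0.060], [25000, 24000, 26000, 25000])
lo, hi = stats.norm.ppf([0.025, 0.975], loc=m, scale=np.sqrt(v))
print(round(m, 4), (round(lo, 4), round(hi, 4)))
```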
Parallel 'Broken' Big Datasets are Often Heterogeneous
Failure to take heterogeneity into account can easily spoil the big data discovery process. We model the subpopulation-level estimates hierarchically:
\widehat{LP}_\ell[j; X, Y] \mid LP_\ell[j; X, Y], s_\ell \;\sim\; N( LP_\ell[j; X, Y], s_\ell^2 ), independently,   (2)
LP_\ell[j; X, Y] \mid LP[j; X, Y], \tau \;\overset{iid}{\sim}\; N( LP[j; X, Y], \tau^2 ).   (3)
Figure 2: Histograms of the LP statistic for the variable price_usd under a random partition and under a partition by visitor_location_country_id.
Heterogeneity-Corrected LP Confidence Distribution
Theorem (Bruce, Li, Yang and Mukhopadhyay, 2016). Setting F_0^{-1}(t) = \Phi^{-1}(t) and \alpha_\ell = 1/\sqrt{\tau^2 + (1/n_\ell)}, where n_\ell is the size of subpopulation \ell = 1, \ldots, k, the combined aCD for LP[j; X, Y] is
H^{(c)}(LP[j; X, Y]) = \Phi\Big( \big( \sum_{\ell=1}^{k} \frac{1}{\tau^2 + (1/n_\ell)} \big)^{1/2} \big( LP[j; X, Y] − \overline{LP}^{(c)}[j; X, Y] \big) \Big),
where
\overline{LP}^{(c)}[j; X, Y] = \frac{ \sum_{\ell=1}^{k} (\tau^2 + (1/n_\ell))^{-1} \, \widehat{LP}_\ell[j; X, Y] }{ \sum_{\ell=1}^{k} (\tau^2 + (1/n_\ell))^{-1} },
and \overline{LP}^{(c)}[j; X, Y] and \big( \sum_{\ell=1}^{k} 1/(\tau^2 + (1/n_\ell)) \big)^{-1} are, respectively, the mean and variance of the combined aCD for LP[j; X, Y].
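A corresponding sketch for the heterogeneity-corrected theorem: the same recipe as before, but each subpopulation is down-weighted by (tau^2 + 1/n_l). The heterogeneity parameter tau^2 is assumed to be estimated separately (e.g. by a standard meta-analytic estimator); here it is simply an input.

```python
import numpy as np

def combine_lp_cd_heterogeneous(lp_local, n_local, tau2):
    """Mean and variance of the heterogeneity-corrected combined aCD."""
    lp_local, n_local = np.asarray(lp_local, float), np.asarray(n_local, float)
    w = 1.0 / (tau2 + 1.0 / n_local)                # weights (tau^2 + 1/n_l)^(-1)
    return np.sum(w * lp_local) / np.sum(w), 1.0 / np.sum(w)

# usage: with tau2 = 0 this reduces exactly to the homogeneous formula above
print(combine_lp_cd_heterogeneous([0.05, 0.07, 0.02], [30000, 20000, 10000], tau2=1e-4))
```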
Expedia: Variable Importance, Impact of Regularization
Figure 3: 95% confidence intervals for each variable's LP statistic under random sampling partitioning (black) and country ID partitioning (red).
Expedia - Robust to Subpopulation Size
[Figure 4: nine panels, one per variable (prop_location_score2, promotion_flag, price_usd, srch_children_count, srch_booking_window, srch_room_count, prop_log_historical_price, comp1_inv, comp1_rate_percent_diff), each plotting the LP statistic against the number of subpopulations, from 100 to 500.]
Figure 4: LP statistics and 95% confidence intervals for nine variables across different numbers of subpopulations (dotted line is at zero).
Simpson’s Paradox - MetaLP Perspective
Dept   Male              Female
A      62% (512/825)     82% (89/108)
B      63% (353/560)     68% (17/25)
C      37% (120/325)     34% (202/593)
D      33% (138/417)     35% (131/375)
E      28% (53/191)      24% (94/393)
F       6% (22/373)       7% (24/341)
All    45% (1198/2691)   30% (557/1835)
Table 1: UC Berkeley admission rates by gender and department (Bickel et al., 1975).
Figure 5: 95% confidence intervals for the LP statistic between gender and admission within each department (A-F) and for the MetaLP aggregate.
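A hedged sketch of the MetaLP reading of Table 1: within each department, the LP[1] statistic between gender and admission reduces to the phi coefficient (the Pearson correlation of two binary variables), and the department-level estimates are then combined with the heterogeneity-corrected recipe above. The tau2 value is illustrative, not taken from the paper.

```python
import numpy as np

# (admitted_male, total_male, admitted_female, total_female), from Table 1
berkeley = {
    "A": (512, 825, 89, 108), "B": (353, 560, 17, 25), "C": (120, 325, 202, 593),
    "D": (138, 417, 131, 375), "E": (53, 191, 94, 393), "F": (22, 373, 24, 341),
}

lp_local, n_local = [], []
for am, tm, af, tf in berkeley.values():
    x = np.r_[np.ones(tm), np.zeros(tf)]                              # 1 = male
    y = np.r_[np.ones(am), np.zeros(tm - am), np.ones(af), np.zeros(tf - af)]
    lp_local.append(np.corrcoef(x, y)[0, 1])                          # phi coefficient
    n_local.append(tm + tf)

# department-level associations differ in sign (compare departments A and C in Table 1)
print(combine_lp_cd_heterogeneous(lp_local, n_local, tau2=0.01))
```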
Stein’s Paradox - MetaLP Perspective
Name            hits/AB   mu_hat(MLE)   mu_i    mu_hat(JS)   mu_hat(LP)
Clemente        18/45     .400          .346    .294         .276
F Robinson      17/45     .378          .298    .289         .274
F Howard        16/45     .356          .276    .285         .272
Johnstone       15/45     .333          .222    .280         .270
Berry           14/45     .311          .273    .275         .268
Spencer         14/45     .311          .270    .275         .268
Kessinger       13/45     .289          .263    .270         .265
L Alvarado      12/45     .267          .210    .266         .263
Santo           11/45     .244          .269    .261         .261
Swoboda         11/45     .244          .230    .261         .261
Unser           10/45     .222          .264    .256         .258
Williams        10/45     .222          .256    .256         .258
Scott           10/45     .222          .303    .256         .258
Petrocelli      10/45     .222          .264    .256         .258
E Rodriguez     10/45     .222          .226    .256         .258
Campaneris       9/45     .200          .286    .252         .256
Munson           8/45     .178          .316    .247         .253
Alvis            7/45     .156          .200    .242         .251
Grand Average             .265          .265    .265         .263
Table 2: Batting averages mu_hat(MLE)_i for 18 major league players early in the 1970 season; the mu_i values are averages over the remainder of the season. The James-Stein estimates mu_hat(JS)_i and the MetaLP estimates mu_hat(LP)_i provide much more accurate overall predictions of the mu_i values than the MLE: the MSE ratio of mu_hat(JS)_i to mu_hat(MLE)_i is 0.283 and the MSE ratio of mu_hat(LP)_i to mu_hat(MLE)_i is 0.293, showing comparable efficiency.
Final Remarks on Big Data Statistical Inference
To address the methodological and computational challenges of big data analysis, we have outlined a general theoretical foundation that, we believe, may provide the missing link between small-data and big-data science. Instead of developing distributed versions of statistical algorithms on a case-by-case basis, we develop a systematic and automatic strategy: a generic platform that extends traditional and modern statistical modeling tools to large datasets by fully utilizing parallel and distributed processing power, thereby addressing one of the biggest bottlenecks for data-intensive statistical inference.
Selected References
Mukhopadhyay, S. and Parzen, E. (2014). LP approach to statistical modeling. arXiv:1405.2601.
Parzen, E. (2013). Discussion of "Confidence distribution, the frequentist distribution estimator of a parameter: a review" by Min-ge Xie and Kesar Singh. International Statistical Review, 81, 48-52.
Hedges, L. V. and Olkin, I. (1985). Statistical Methods for Meta-Analysis. London: Academic Press.
Xie, M., Singh, K., and Strawderman, W. E. (2011). Confidence distributions and a unifying framework for meta-analysis. Journal of the American Statistical Association, 106, 320-333.