SLIDE 1

Computational and Statistical Aspects of Statistical Machine Learning

John Lafferty
Department of Statistics Retreat, Gleacher Center

SLIDE 2

Outline

  • “Modern” nonparametric inference for high dimensional data

◮ Nonparametric reduced rank regression

  • Risk-computation tradeoffs

◮ Covariance-constrained linear regression

  • Other research and teaching activities


SLIDE 3

Context for High Dimensional Nonparametrics

There has been great progress in recent years on high dimensional linear models, but many problems have important nonlinear structure. We have been studying “purely functional” methods for high dimensional, nonparametric inference:

  • no basis expansions
  • no Mercer kernels

SLIDE 4

Additive Models

Fully nonparametric models appear hopeless:

  • Logarithmic scaling, p = log n (e.g., “Rodeo,” Lafferty and Wasserman (2008))

Additive models are a useful compromise:

  • Exponential scaling, p = exp(n^c) (e.g., “SpAM,” Ravikumar, Lafferty, Liu and Wasserman (2009))

SLIDE 5

Additive Models

[Figure: estimated additive component functions for the predictors Age, Bmi, Map, and Tc.]

SLIDE 6

Multivariate Regression

Y ∈ R^q and X ∈ R^p, with regression function m(X) = E(Y | X). Linear model: Y = BX + ε, where B ∈ R^{q×p}. Reduced rank regression: r = rank(B) ≤ C. Recent work has studied the properties and high dimensional scaling of reduced rank regression where the nuclear norm ‖B‖_* is used as a convex surrogate for the rank constraint (Yuan et al., 2007; Negahban and Wainwright, 2011). For example,

$$\|\hat B_n - B^*\|_F = O_P\left(\sqrt{\frac{\operatorname{Var}(\epsilon)\, r\,(p + q)}{n}}\right)$$
SLIDE 7

Low-Rank Matrices and Convex Relaxation

[Figure: the set of low rank matrices {X : rank(X) ≤ t} and its convex hull, the nuclear norm ball {X : ‖X‖_* ≤ t}.]


SLIDE 8

Nuclear Norm Regularization

Algorithms for nuclear norm minimization are a lot like iterative soft thresholding for lasso problems. To project a matrix B onto the nuclear norm ball {X : ‖X‖_* ≤ t}:

  • Compute the SVD: B = U diag(σ) V^T

  • Soft threshold the singular values: B ← U diag(Soft_λ(σ)) V^T
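As a concrete illustration, here is a minimal NumPy sketch of the singular value soft-thresholding step; the function name is mine, not from the slides:

```python
import numpy as np

def svd_soft_threshold(B, lam):
    """Soft-threshold the singular values of B at level lam,
    i.e., the proximal operator of lam * (nuclear norm)."""
    U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    sigma_thr = np.maximum(sigma - lam, 0.0)  # Soft_lam(sigma)
    return U @ np.diag(sigma_thr) @ Vt
```

For an exact projection onto the ball ‖X‖_* ≤ t, λ would be chosen (e.g., by bisection) so that the thresholded singular values sum to t.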

SLIDE 9

Nonparametric Reduced Rank Regression

Foygel, Horrell, Drton and Lafferty (NIPS 2012)

Nonparametric multivariate regression: m(X) = (m^1(X), . . . , m^q(X))^T, where each component is an additive model:

$$m^k(X) = \sum_{j=1}^p m^k_j(X_j)$$

What is the nonparametric analogue of the ‖B‖_* penalty?

SLIDE 10

Low Rank Functions

What does it mean for a set of functions m^1(x), . . . , m^q(x) to be low rank? Let x_1, . . . , x_n be a collection of points. We require that the n × q matrix M(x_{1:n}) = [m^k(x_i)] is low rank. Stochastic setting: M = [m^k(X_i)]. A natural penalty is

$$\frac{1}{\sqrt{n}}\|M\|_* = \frac{1}{\sqrt{n}}\sum_{s=1}^{q} \sigma_s(M) = \sum_{s=1}^{q} \sqrt{\lambda_s\!\left(\tfrac{1}{n}M^T M\right)}$$

Population version:

$$|||M|||_* := \big\|\operatorname{Cov}(M(X))^{1/2}\big\|_* = \big\|\Sigma(M)^{1/2}\big\|_*$$
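A quick numerical check of the identity above (a sketch; the variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 200, 5
M = rng.standard_normal((n, q))

# (1/sqrt(n)) * nuclear norm of M, via the singular values
penalty_svd = np.linalg.svd(M, compute_uv=False).sum() / np.sqrt(n)

# the same quantity via the eigenvalues of (1/n) M^T M
eigs = np.linalg.eigvalsh(M.T @ M / n)
penalty_eig = np.sqrt(np.clip(eigs, 0, None)).sum()

assert np.isclose(penalty_svd, penalty_eig)
```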

SLIDE 11

Constrained Rank Additive Models (CRAM)

Let Σ_j = Cov(M_j). Two natural penalties:

$$\big\|\Sigma_1^{1/2}\big\|_* + \big\|\Sigma_2^{1/2}\big\|_* + \cdots + \big\|\Sigma_p^{1/2}\big\|_* \qquad \text{and} \qquad \big\|\big(\Sigma_1^{1/2}\;\Sigma_2^{1/2}\;\cdots\;\Sigma_p^{1/2}\big)\big\|_*$$

Population risk (first penalty):

$$\frac{1}{2}\,\mathbb{E}\Big\|Y - \sum_j M_j(X_j)\Big\|_2^2 + \lambda \sum_j |||M_j|||_*$$

Linear case:

$$\sum_{j=1}^p \big\|\Sigma_j^{1/2}\big\|_* = \sum_{j=1}^p \|B_j\|_2, \qquad \big\|\big(\Sigma_1^{1/2}\;\Sigma_2^{1/2}\;\cdots\;\Sigma_p^{1/2}\big)\big\|_* = \|B\|_*$$
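A short sketch of evaluating both penalties from sampled component values (the function names are mine; the covariances Σ_j are estimated empirically):

```python
import numpy as np

def sqrt_nuclear(Sigma):
    """Nuclear norm of Sigma^{1/2}: the sum of square roots of eigenvalues."""
    eigs = np.clip(np.linalg.eigvalsh(Sigma), 0, None)
    return np.sqrt(eigs).sum()

def cram_penalties(components):
    """components: list of (n, q) arrays, the sampled values M_j(X_ij)."""
    covs = [np.cov(M.T) for M in components]        # each Sigma_j, q x q
    penalty1 = sum(sqrt_nuclear(S) for S in covs)   # sum_j ||Sigma_j^{1/2}||_*
    # concatenate the matrix square roots column-wise for the second penalty
    sqrts = []
    for S in covs:
        eigs, U = np.linalg.eigh(S)
        sqrts.append(U @ np.diag(np.sqrt(np.clip(eigs, 0, None))) @ U.T)
    penalty2 = np.linalg.svd(np.hstack(sqrts), compute_uv=False).sum()
    return penalty1, penalty2
```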

SLIDE 12

CRAM Backfitting Algorithm (Penalty 1)

Input: data (X_i, Y_i), regularization parameter λ.

Iterate until convergence, for each j = 1, . . . , p:

  • Compute the residual: R_j = Y − Σ_{k≠j} M̂_k(X_k)
  • Estimate the projection P_j = E(R_j | X_j) by smoothing: P̂_j = S_j R_j
  • Compute the SVD: (1/n) P̂_j P̂_j^T = U diag(τ) U^T
  • Soft-threshold: M̂_j = U diag([1 − λ/√τ]_+) U^T P̂_j

Output: the estimator M̂(X_i) = Σ_j M̂_j(X_{ij}).
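A minimal NumPy sketch of one backfitting pass under the first penalty; the smoother matrices S_j (e.g., local linear smoothers) and all function names are assumptions for illustration, and the shrinkage step uses the equivalent q × q eigendecomposition of (1/n) P̂_j^T P̂_j in place of the n × n SVD above:

```python
import numpy as np

def cram_backfit_pass(Y, smoothers, M, lam):
    """One CRAM backfitting pass.
    Y: (n, q) responses; smoothers: list of (n, n) linear smoother
    matrices S_j; M: list of (n, q) fitted component values."""
    n, q = Y.shape
    for j, S in enumerate(smoothers):
        # residual with the j-th component held out
        R = Y - sum(M[k] for k in range(len(M)) if k != j)
        P = S @ R                              # smoothed projection P_j
        tau, U = np.linalg.eigh(P.T @ P / n)   # q x q eigendecomposition
        tau = np.clip(tau, 0.0, None)
        shrink = np.maximum(1.0 - lam / np.sqrt(np.maximum(tau, 1e-12)), 0.0)
        M[j] = P @ U @ np.diag(shrink) @ U.T   # soft-thresholded update
    return M
```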

SLIDE 13

Scaling of Estimation Error

Using a “double covering” technique (½ parametric, ½ nonparametric), we bound the deviation between the empirical and population functional covariance matrices in spectral norm:

$$\sup_V \big\|\Sigma(V) - \widehat\Sigma_n(V)\big\|_{\mathrm{sp}} = O_P\left(\sqrt{\frac{q + \log(pq)}{n}}\right)$$

This allows us to bound the excess risk of the empirical estimator relative to an oracle.

SLIDE 14

Summary

  • Variations on additive models enjoy most of the good statistical and computational properties of sparse or low-rank linear models.

  • We’re building a toolbox for large scale, high dimensional nonparametric inference.

SLIDE 15

Computation-Risk Tradeoffs

  • In “traditional” computational learning theory, the dividing line between learnable and non-learnable is polynomial vs. exponential time

  • Valiant’s PAC model

  • Mostly negative results: it is not possible to efficiently learn in natural settings

  • Claim: distinctions within polynomial time matter most

SLIDE 16

Analogy: Numerical Optimization

In numerical optimization, it is well understood how to trade off computation for speed of convergence:

  • First order methods: linear cost, linear convergence
  • Quasi-Newton methods: quadratic cost, superlinear convergence
  • Newton’s method: cubic cost, quadratic convergence

Are similar tradeoffs possible in statistical learning?

SLIDE 17

Hints of a Computation-Risk Tradeoff

Graph estimation:

  • Our method for estimating the graph of an Ising model: n = Ω(d³ log p), T = O(p⁴) for graphs with p nodes and maximum degree d

  • Information-theoretic lower bound: n = Ω(d log p)

SLIDE 18

Statistical vs. Computational Efficiency

Challenge: understand how families of estimators with different computational efficiencies can yield different statistical efficiencies:

$$\mathrm{Rate}_{\mathcal{H},\mathcal{F}}(n) = \inf_{\hat m_n \in \mathcal{H}}\; \sup_{m \in \mathcal{F}}\; \mathrm{Risk}(\hat m_n, m)$$

  • H: computationally constrained hypothesis class
  • F: smoothness constraints on the “true” model

SLIDE 19

Computation-Risk Tradeoffs for Linear Regression

Dinah Shender has been studying such a tradeoff in the setting of high dimensional linear regression.

SLIDE 20

Computation-Risk Tradeoffs for Linear Regression

The standard ridge estimator solves

$$\Big(\tfrac{1}{n} X^T X + \lambda_n I\Big)\, \hat\beta_\lambda = \tfrac{1}{n} X^T Y$$

Sparsify the sample covariance to get the estimator

$$\Big(T_t[\widehat\Sigma] + \lambda_n I\Big)\, \hat\beta_{t,\lambda} = \tfrac{1}{n} X^T Y$$

where T_t[Σ̂] is the hard-thresholded sample covariance: T_t([m_{ij}]) = [m_{ij} 1(|m_{ij}| > t)].

  • Recent advance in theoretical CS (Spielman et al.): solving a symmetric diagonally dominant linear system with m nonzero matrix entries can be done in time O(m log² p)
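A minimal NumPy sketch of the sparsified estimator; a dense solve stands in for the fast SDD solver, and the function name is mine:

```python
import numpy as np

def thresholded_ridge(X, Y, t, lam):
    """Ridge regression with a hard-thresholded sample covariance.
    Solves (T_t[Sigma_hat] + lam * I) beta = (1/n) X^T Y."""
    n, p = X.shape
    Sigma = X.T @ X / n                      # sample covariance Sigma_hat
    Sigma_t = Sigma * (np.abs(Sigma) > t)    # hard threshold: T_t[Sigma_hat]
    # direct dense solve; the point of sparsification is that a fast
    # solver can exploit the sparsity of Sigma_t + lam * I
    return np.linalg.solve(Sigma_t + lam * np.eye(p), X.T @ Y / n)
```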

SLIDE 21

Computation-Risk Tradeoffs for Linear Regression

Dinah has recently proved that the statistical error scales as

$$\frac{\|\hat\beta_{t,\lambda} - \beta^*\|}{\|\beta^*\|} = O_P\big(\|T_t(\Sigma) - \Sigma\|_2\big) = O\big(t^{1-q}\big)$$

for the class of covariance matrices with rows in sparse ℓ_q balls (as studied by Bickel and Levina).

  • Combined with the computational advance, this gives us an explicit, fine-grained risk/computation tradeoff

SLIDE 22

Simulation

[Figure: simulated risk as a function of the regularization parameter λ.]

SLIDE 23

Some Other Projects

  • Minhua Chen: Convex optimization for dictionary learning
  • Eric Janofsky: Nonparanormal component analysis
  • Min Xu: High dimensional conditional density and graph estimation

SLIDE 24

Courses in the Works

  • Winter 2013: Nonparametric Inference (Undergraduate and Masters)

  • Spring 2013: Machine Learning for Big Data (Undergraduate Statistics and Computer Science)

Charles Cary is developing cloud-based infrastructure for the course. Candidate data: 80 million images, Yahoo! clickthrough data, Science journal articles, City of Chicago datasets.