

SLIDE 1

Two problems from neural data analysis:

Sparse entropy estimation and efficient adaptive experimental design

Liam Paninski

Department of Statistics and Center for Theoretical Neuroscience, Columbia University
http://www.stat.columbia.edu/~liam
liam@stat.columbia.edu
September 7, 2006

SLIDE 2

The fundamental question in neuroscience

The neural code: what is P(response | stimulus)?

Main question: how to estimate P(r | s) from (sparse) experimental data?

SLIDE 3

Curse of dimensionality

Both stimulus and response can be very high-dimensional.

Stimuli:

  • images
  • sounds
  • time-varying behavior

Responses:

  • observations from single or multiple simultaneously-recorded point processes

SLIDE 4

Avoiding the curse of insufficient data

1: Estimate some functional f(p) instead of the full joint p(r, s) — information-theoretic functionals
2: Select stimuli more efficiently — optimal experimental design
3: Improved nonparametric estimators — minimax theory for discrete distributions under KL loss
(4: Parametric approaches; connections to biophysical models)

SLIDE 5

Part 1: Estimation of information

Many central questions in neuroscience are inherently information-theoretic:

  • What inputs are most reliably encoded by a given neuron?
  • Are sensory neurons optimized to transmit information about the world to the brain?
  • Do noisy synapses limit the rate of information flow from neuron to neuron?

Quantification of “information” is a fundamental problem. (...of interest in neuroscience, but also in physics, telecommunications, genomics, etc.)

SLIDE 6

Shannon mutual information

I(X; Y) = ∫_{X×Y} dp(x, y) log [ dp(x, y) / d(p(x) × p(y)) ]

Information-theoretic justifications:

  • invariance
  • “uncertainty” axioms
  • data processing inequality
  • channel and source coding theorems

But an obvious open experimental question:

  • is this computable for real data?
SLIDE 7

How to estimate information

I is very hard to estimate in general...

...but lower bounds are easier. Data processing inequality:

I(X; Y) ≥ I(S(X); T(Y))

This suggests a sieves-like approach: maximize the lower bound over increasingly rich classes of functions S, T.

SLIDE 8

Discretization approach

Discretize X, Y → X_disc, Y_disc, and estimate I_discrete(X; Y) = I(X_disc; Y_disc).

  • Data processing inequality ⇒ I_discrete ≤ I
  • I_discrete ↗ I as the partition is refined

Strategy: refine the partition as the sample size N increases; if the number of bins m doesn't grow too fast, Î → I_discrete ↗ I (a minimal sketch follows below).

Completely nonparametric, but obvious concerns:

  • Want N ≫ m(N) samples, to “fill in” the histograms p(x, y)
  • How large are the bias and variance for fixed m?
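To make the strategy concrete, here is a minimal plug-in version of the discretization approach (my own sketch, not code from the talk; bin counts and variable names are illustrative):

```python
import numpy as np

def plugin_mi(x, y, mx=10, my=10):
    """Plug-in (MLE) estimate of I(X_disc; Y_disc) in bits,
    after discretizing x and y into an mx-by-my grid of bins."""
    counts, _, _ = np.histogram2d(x, y, bins=(mx, my))
    p_xy = counts / counts.sum()              # empirical joint
    p_x = p_xy.sum(axis=1, keepdims=True)     # empirical marginals
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0                           # 0 log 0 = 0 convention
    return float((p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = x + rng.normal(size=5000)   # dependent: true I > 0
print(plugin_mi(x, y))          # refine mx, my slowly as N grows
```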
SLIDE 9

Bias is major problem

Î_MLE(X; Y) = Σ_{x=1}^{m_x} Σ_{y=1}^{m_y} p̂_MLE(x, y) log [ p̂_MLE(x, y) / (p̂_MLE(x) p̂_MLE(y)) ]

p̂_MLE(x) = p_N(x) = n(x)/N   (empirical measure)

Fix p(x, y), m_x, m_y and let the sample size N → ∞. Then:

  • Bias(Î_MLE) ∼ (m_x m_y − m_x − m_y + 1)/2N
  • Variance(Î_MLE) ∼ (log m)²/N; dominated by the bias if m = m_x m_y is large
  • No unbiased estimator exists.

(Miller, 1955; Paninski, 2003)
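The fixed-m bias formula is easy to check by simulation (again my sketch; sizes are illustrative): for independent X and Y the true information is zero, yet the plug-in estimate hovers near (m_x m_y − m_x − m_y + 1)/(2N) nats.

```python
import numpy as np

rng = np.random.default_rng(1)
mx, my, N, reps = 10, 10, 2000, 200   # N/m = 20: the fixed-m, large-N regime

def plugin_mi_nats(counts):
    p = counts / counts.sum()
    px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    mask = p > 0
    return (p[mask] * np.log(p[mask] / (px @ py)[mask])).sum()

est = []
for _ in range(reps):
    x = rng.integers(mx, size=N)      # X and Y independent, uniform: true I = 0
    y = rng.integers(my, size=N)
    counts = np.zeros((mx, my))
    np.add.at(counts, (x, y), 1)
    est.append(plugin_mi_nats(counts))

print("mean plug-in MI (nats):", np.mean(est))
print("Miller bias    (nats):", (mx * my - mx - my + 1) / (2 * N))
```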

SLIDE 10

Convergence of common information estimators

Result 1: If N/m → ∞, Î_MLE and related estimators are universally almost surely consistent. Converse: if N/m → c < ∞, Î_MLE and related estimators typically converge to the wrong answer almost surely. (The asymptotic bias can often be computed explicitly.) Implication: if N/m is small, the bias is large even though the error bars vanish, and even if “bias-corrected” estimators are used (Paninski, 2003).

SLIDE 11

Estimating information on m bins with fewer than m samples

Result 2: a new estimator that is uniformly consistent as N → ∞ even if N/m → 0 (albeit sufficiently slowly). The error bounds are good for all underlying distributions: the estimator works well even in the worst case. Interpretation: information is strictly easier to estimate than p itself! (Paninski, 2004)

SLIDE 12

Derivation of new estimator

It suffices to develop a good estimator of the discrete entropy:

I_discrete(X; Y) = H(X_disc) + H(Y_disc) − H(X_disc, Y_disc)

H(X) = − Σ_{x=1}^{m_x} p(x) log p(x)

SLIDE 13

Derivation of new estimator

Variational idea: choose the estimator that minimizes an upper bound on error over the class

H = { Ĥ : Ĥ(p_N) = Σ_i g(p_N(i)) }   (p_N = empirical measure)

Approximation-theoretic (binomial) bias bound:

max_p Bias_p(Ĥ) ≤ B*(Ĥ) ≡ m · max_{0≤p≤1} | −p log p − Σ_{j=0}^{N} g(j/N) B_{N,j}(p) |

McDiarmid-Steele bound on the variance:

max_p Var_p(Ĥ) ≤ V*(Ĥ) ≡ N · max_j [ g(j/N) − g((j−1)/N) ]²
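As a sanity check, both bounds can be evaluated numerically for any candidate g; the sketch below (mine, with an illustrative grid) uses the plug-in choice g(j/N) = −(j/N) log(j/N), for which B_{N,j}(p) is just the binomial pmf.

```python
import numpy as np
from scipy.stats import binom

N, m = 50, 200
j = np.arange(N + 1)
g = -(j / N) * np.log(np.where(j > 0, j / N, 1.0))   # plug-in g; g(0) = 0

p_grid = np.linspace(1e-6, 1, 400)
# Bernstein polynomials B_{N,j}(p) = C(N,j) p^j (1-p)^(N-j) = binomial pmf
B = binom.pmf(j[:, None], N, p_grid[None, :])        # shape (N+1, grid)
target = -p_grid * np.log(p_grid)
bias_bound = m * np.max(np.abs(target - g @ B))      # B*(H_hat)
var_bound = N * np.max(np.abs(np.diff(g))) ** 2      # V*(H_hat)
print(bias_bound, var_bound)
```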
SLIDE 14

Derivation of new estimator

Choose the estimator to minimize the (convex) error bound over the (convex) space H:

Ĥ_BUB = argmin_{Ĥ ∈ H} [ B*(Ĥ)² + V*(Ĥ) ]

Optimization of convex functions over convex parameter spaces is computationally tractable by simple descent methods.

The consistency proof involves the Stone-Weierstrass theorem and penalized polynomial approximation theory in the Poisson limit N/m → c.
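One possible rendering of this convex program (a sketch under my assumptions, using cvxpy as a stand-in for the paper's own descent method; grid sizes are illustrative): treat the vector g ∈ R^{N+1} as the decision variable, since both B* and V* are convex in g.

```python
import cvxpy as cp
import numpy as np
from scipy.stats import binom

N, m = 30, 100
j = np.arange(N + 1)
p_grid = np.linspace(1e-6, 1, 200)
B = binom.pmf(j[:, None], N, p_grid[None, :])       # Bernstein basis
target = -p_grid * np.log(p_grid)

g = cp.Variable(N + 1)
bias_bound = m * cp.norm_inf(target - B.T @ g)       # B*(g): convex in g
var_bound = N * cp.square(cp.norm_inf(cp.diff(g)))   # V*(g): convex in g
prob = cp.Problem(cp.Minimize(cp.square(bias_bound) + var_bound))
prob.solve()
print("optimized worst-case bound:", prob.value)
```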

SLIDE 15

Error comparisons: upper and lower bounds

[Figure: upper and lower bounds on the maximum RMS error (bits) of the BUB and jackknife (JK) estimators, as a function of N (10¹ to 10⁶); N/m = 0.25.]

SLIDE 16

Undersampling example

[Figure: true p(x, y), estimated p(x, y), and |error| over the (x, y) grid.]

m_x = m_y = 1000; N/(m_x m_y) = 0.25
Î_MLE = 2.42 bits
“bias-corrected” Î_MLE = −0.47 bits
Î_BUB = 0.74 bits; conservative (worst-case RMS upper bound) error: ±0.2 bits
true I(X; Y) = 0.76 bits

SLIDE 17

Shannon (−p log p) is special

Obvious conjecture: Σ_i p_i^α, 0 < α < 1 (Rényi entropy) should behave similarly.

[Figure: −p log(p) and √p plotted over p ∈ [0, 0.2].]

Result 3: Surprisingly, not true: no estimator can uniformly estimate Σ_i p_i^α, α ≤ 1/2, if N ∼ m (Paninski, 2004).

In fact, one needs N > m^{(1−α)/α}: smaller α ⇒ more data needed. (Proof via Bayesian lower bounds on the minimax error.)

SLIDE 18

Directions

  • KL-minimax estimation of the full distribution in the sparse limit N/m → 0 (Paninski, 2005b)
  • Continuous (unbinned) entropy estimators: a similar result holds for kernel density estimates
  • Sparse testing for uniformity: much easier than estimation (N ≫ m^{1/2} suffices)
  • Open questions: 1/2 < α < 1? Other functionals?
SLIDE 19

Part 2: Adaptive optimal design of experiments

Assume:

  • a parametric model p_θ(y | x) on outputs y given inputs x
  • a prior distribution p(θ) on a finite-dimensional model space

Goal: estimate θ from experimental data.

Usual approach: draw stimuli i.i.d. from a fixed p(x).

Adaptive approach: choose p(x) on each trial to maximize I(θ; y | x) (e.g. “staircase” methods); a toy version is sketched below.
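For a one-dimensional θ the adaptive loop fits in a few lines. The sketch below is my own toy version (the logistic model, grid sizes, and parameter values are assumptions): on each trial, score every candidate stimulus by I(θ; y | x) = H(y | x) − E_θ H(y | x, θ) under the current posterior, and present the argmax.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_grid = np.linspace(0, 1, 200)    # discretized model space
x_grid = np.linspace(0, 1, 100)        # candidate stimuli
post = np.full(theta_grid.size, 1 / theta_grid.size)   # prior p(theta)
theta0 = 0.63                          # "true" threshold (unknown to us)

def p_yes(x, theta, a=0.05):           # assumed psychometric model
    return 1 / (1 + np.exp(-(x - theta) / a))

def entropy(q):                        # binary entropy, elementwise
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -(q * np.log(q) + (1 - q) * np.log(1 - q))

for trial in range(50):
    P = p_yes(x_grid[:, None], theta_grid[None, :])    # p(y=1 | x, theta)
    marg = P @ post                                    # p(y=1 | x)
    info = entropy(marg) - entropy(P) @ post           # I(theta; y | x)
    x = x_grid[np.argmax(info)]                        # infomax stimulus
    y = rng.random() < p_yes(x, theta0)                # run the "trial"
    lik = p_yes(x, theta_grid) if y else 1 - p_yes(x, theta_grid)
    post = post * lik
    post /= post.sum()                                 # Bayes update

print("posterior mean:", post @ theta_grid)            # near theta0
```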

SLIDE 20

Snapshot: one-dimensional simulation

[Figure: snapshot at trial 100, comparing optimized vs. i.i.d. stimulus selection; panels show p(y = 1 | x, θ₀), the information I(y; θ | x) as a function of x, and the posteriors p(θ).]

SLIDE 21

Asymptotic result

Under regularity conditions, a posterior CLT holds (Paninski, 2005a):

p_N(√N (θ − θ₀)) → N(µ_N, σ²);   µ_N ∼ N(0, σ²)

  • (σ²_iid)⁻¹ = E_x(I_x(θ₀))
  • (σ²_info)⁻¹ = argmax_{C ∈ co(I_x(θ₀))} log |C|

⇒ σ²_iid > σ²_info unless I_x(θ₀) is constant in x

co(I_x(θ₀)) = convex closure (over x) of the Fisher information matrices I_x(θ₀). (log |C| is strictly concave: the maximum is unique.)

SLIDE 22

Illustration of theorem

[Figure: illustration of the theorem; panels show the evolution of the posteriors over θ, the posterior mean E(p), the posterior width σ(p) (log scale), and P(θ₀), as a function of trial number.]

SLIDE 23

Technical details

Stronger regularity conditions than usual are needed to prevent “obsessive” sampling and to ensure consistency.

Significant complication: exponential decay of the posteriors p_N off of neighborhoods of θ₀ does not necessarily hold.
SLIDE 24

Psychometric example

  • stimuli x one-dimensional: intensity
  • responses y binary: detect/no detect

p(1|x, θ) = f((x − θ)/a)

  • scale parameter a (assumed known)
  • want to learn threshold parameter θ as quickly as possible

[Figure: the psychometric function p(1 | x, θ) as a function of x.]

SLIDE 25

Psychometric example: results

  • variance-minimizing and info-theoretic methods are asymptotically the same
  • there is just one function f* for which σ_iid = σ_opt; for any other f, σ_iid > σ_opt (a numeric check follows below)

I_x(θ) = (ḟ_{a,θ})² / [ f_{a,θ}(1 − f_{a,θ}) ]

f* solves ḟ_{a,θ} = c √( f_{a,θ}(1 − f_{a,θ}) )

f*(t) = (sin(ct) + 1) / 2

σ²_iid / σ²_opt ∼ 1/a for small a
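A quick numeric check (mine, with arbitrary c) of why this f* is the equalizer: for f*(t) = (sin(ct) + 1)/2, the ratio ḟ²/(f(1 − f)) is constant in t, so the Fisher information does not depend on the stimulus and i.i.d. sampling loses nothing.

```python
import numpy as np

c = 2.0
t = np.linspace(-0.7, 0.7, 9)        # keep ct inside (-pi/2, pi/2)
f = (np.sin(c * t) + 1) / 2
fdot = c * np.cos(c * t) / 2
print(fdot**2 / (f * (1 - f)))       # constant: equals c^2 everywhere
```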
SLIDE 26

Computing the optimal stimulus

A simple Poisson regression model for neural data (sketched in code below):

y_i ∼ Poiss(λ_i),   λ_i | x_i, θ = f(θ · x_i)

Goal: learn θ in as few trials as possible. Problems:

  • θ is very high-dimensional; difficult to update p(θ | x_i, y_i) and to compute I(θ; y | x)
  • x is very high-dimensional; difficult to optimize I(θ; y | x)

— Joint work with J. Lewi and R. Butera, Georgia Tech (Lewi et al., 2006)
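For concreteness, the generative model in a few lines (my sketch; the dimension is an assumption, and exp is the nonlinearity adopted two slides below):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 100                                  # stimulus dimension
theta = rng.normal(size=d) / np.sqrt(d)  # unknown filter to be learned
x = rng.normal(size=d)                   # one stimulus (one trial)
rate = np.exp(theta @ x)                 # lambda = f(theta . x), f = exp
y = rng.poisson(rate)                    # observed spike count
```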

SLIDE 27

Efficient updating

Idea: Laplace approximation, p(θ | {x_i, y_i}_{i≤N}) ≈ N(µ_N, C_N)

Justification:

  • posterior CLT
  • f convex and log-concave ⇒ log-likelihood concave in θ ⇒ log-posterior is also concave

SLIDE 28

Efficient updating

The likelihood is “rank-1” — it depends on θ only along x.

Updating µ_N: a one-dimensional search.

Updating C_N: rank-one update, C_N = (C_{N−1}⁻¹ + b x xᵗ)⁻¹ — use the Woodbury lemma (see the sketch below).

Total time for the posterior update: O(d²)
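The covariance step can be written directly with the Sherman-Morrison form of the Woodbury lemma (my rendering; in the actual algorithm the scalar b would come from the curvature of the log-likelihood at the current posterior mean):

```python
import numpy as np

def rank_one_update(C, x, b):
    """C_N = (C_{N-1}^{-1} + b x x^T)^{-1} without any O(d^3) inversion.
    Sherman-Morrison: C - b (C x)(C x)^T / (1 + b x^T C x).  Cost: O(d^2)."""
    Cx = C @ x
    return C - np.outer(Cx, Cx) * (b / (1 + b * (x @ Cx)))

d = 500
C = np.eye(d)                                 # current posterior covariance
x = np.random.default_rng(4).normal(size=d)   # current stimulus
C_new = rank_one_update(C, x, b=0.1)
# compare against np.linalg.inv on a small d to verify, if desired
```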

SLIDE 29

Step 3: Efficient stimulus optimization

Laplace approximation ⇒ I(θ; r | x) ∼ E_{r|x} log( |C_{N−1}| / |C_N| )

— this is nonlinear and difficult, but we can simplify using perturbation theory: log |I + A| ≈ trace(A).

Now we can take averages over p(r | x) = ∫ p(r | θ, x) p_N(θ) dθ: a standard Fisher information calculation, given the Poisson assumption on r.

Further assuming f(.) = exp(.) allows us to compute the expectation exactly, using the m.g.f. of a Gaussian.

...finally, we want to maximize F(x) = g(µ_N · x) h(xᵗ C_N x).

SLIDE 30

Computing the optimal x

max_x g(µ_N · x) h(xᵗ C_N x) increases with ||x||₂: constraining ||x||₂ reduces the problem to a nonlinear eigenvalue problem.

A Lagrange multiplier approach reduces the problem to a 1-d line search, once the eigendecomposition is computed — much easier than a full d-dimensional optimization!

The rank-one update of the eigendecomposition may be performed in O(d²) time (Gu and Eisenstat, 1994).

⇒ Computing the optimal stimulus takes O(d²) time.
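For intuition only, here is a naive projected-gradient ascent on the sphere ||x||₂ = R (my stand-in, not the paper's O(d²) eigendecomposition method), with the illustrative choices g(u) = exp(u) and h(q) = q:

```python
import numpy as np

rng = np.random.default_rng(5)
d, R = 50, 1.0
mu = rng.normal(size=d) / np.sqrt(d)     # posterior mean
A = rng.normal(size=(d, d)) / np.sqrt(d)
C = A @ A.T + 0.1 * np.eye(d)            # posterior covariance (PD)

def F(x):                                # assumed objective g(mu.x) h(x'Cx)
    return np.exp(mu @ x) * (x @ C @ x)

x = R * mu / np.linalg.norm(mu)          # initialize on the sphere
for _ in range(500):
    q = x @ C @ x
    grad = np.exp(mu @ x) * (q * mu + 2 * C @ x)   # product rule
    x = x + 0.05 * grad                  # ascent step
    x = R * x / np.linalg.norm(x)        # project back to ||x|| = R
print(F(x))
```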

SLIDE 31

Near real-time adaptive design

[Figure: computation time (seconds, log scale) vs. stimulus dimensionality (200-600), broken out into total time, diagonalization, posterior update, and 1-d line search.]

SLIDE 32

Asymptotic variance

[Figure: asymptotic variance σ² vs. trial number (log-log scales), parallel and orthogonal to the mean, comparing info. max (true and predicted) with i.i.d. (true and predicted).]

— despite approximations, asymptotic covariance achieves analytical prediction from CLT.

SLIDE 33

Visual neuron model example

— the infomax approach is an order of magnitude more efficient.
— can also track non-stationary θ via an extended Kalman filter formulation (Lewi et al., 2006).
— currently extending results to the GLM case (not just the Poisson regression model).

SLIDE 34

Conclusions

  • Neuroscience is becoming more “statistics-hungry”: growing demand for sophisticated data analysis and statistical theories of the brain.
  • Conversely, neuroscience is a rich source of interesting statistical problems.

SLIDE 35

References

Gu, M. and Eisenstat, S. (1994). A stable and efficient algorithm for the rank-one modification of the symmetric eigenproblem. SIAM J. Matrix Anal. Appl., 15(4):1266-1276.

Lewi, J., Butera, R., and Paninski, L. (2006). Real-time adaptive information-theoretic optimization of neurophysiological experiments. NIPS.

McDiarmid, C. (1989). On the method of bounded differences. In Surveys in Combinatorics, pages 148-188. Cambridge University Press.

Miller, G. (1955). Note on the bias of information estimates. In Information Theory in Psychology II-B, pages 95-100.

Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation, 15:1191-1253.

Paninski, L. (2004). Estimating entropy on m bins given fewer than m samples. IEEE Transactions on Information Theory, 50:2200-2203.

Paninski, L. (2005a). Asymptotic theory of information-theoretic experimental design. Neural Computation, 17:1480-1507.

Paninski, L. (2005b). Variational minimax estimation of discrete distributions under KL loss. Advances in Neural Information Processing Systems, 17.
SLIDE 36

Entropy bias bound

Bias_p(Ĥ) = E_p(Ĥ) − H(p) = Σ_{i=1}^{m} [ p(i) log p(i) + Σ_{j=0}^{N} g(j/N) B_{N,j}(p(i)) ]

≤ m · max_{0≤p≤1} | −p log p − Σ_{j=0}^{N} g(j/N) B_{N,j}(p) |

  • B_{N,j}(p) = (N choose j) p^j (1 − p)^{N−j}: a polynomial in p
  • If Σ_j g(j/N) B_{N,j}(p) is close to −p log p for all p, the bias will be small

⇒ standard uniform polynomial approximation theory

SLIDE 37

Entropy variance bound

“Method of bounded differences” (McDiarmid, 1989): let F(x₁, x₂, ..., x_N) be a function of N i.i.d. r.v.'s. If any single x_i has a small effect on F, i.e.

sup | F(..., x, ...) − F(..., y, ...) | < c,

then Var(F) < (N/4) c² (inequalities due to Azuma-Hoeffding, Efron-Stein, Steele, etc.).

Our case: Ĥ = Σ_i g(n(i)/N);

max_j | g(j/N) − g((j−1)/N) | < c ⇒ Var( Σ_i g(n(i)/N) ) ≤ N c²
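A simulation check (mine; sizes are illustrative) for the plug-in entropy estimator: moving one sample changes two bin counts, so each coordinate's effect is at most twice the largest increment of g, which is where the slide's N c² (rather than (N/4) c²) comes from. The empirical variance indeed sits well below the bound.

```python
import numpy as np

rng = np.random.default_rng(6)
N, m, reps = 200, 50, 4000

jg = np.arange(N + 1) / N
g = -jg * np.log(np.where(jg > 0, jg, 1.0))     # plug-in g, g(0) = 0

vals = []
for _ in range(reps):
    n = np.bincount(rng.integers(m, size=N), minlength=m)   # bin counts
    vals.append(g[n].sum())                     # H_hat = sum_i g(n_i / N)

c = np.max(np.abs(np.diff(g)))                  # per-count increment of g
print("empirical Var:", np.var(vals))
print("bound N c^2  :", N * c**2)
```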
