Cramér-Rao Bounds and Monte Carlo Calculation of the Fisher Information Matrix (PowerPoint presentation)


SLIDE 1

Cramér-Rao Bounds and Monte Carlo Calculation of the Fisher Information Matrix
Interfaces 2004

James C. Spall
The Johns Hopkins University Applied Physics Laboratory
james.spall@jhuapl.edu

SLIDE 2

Introduction

  • The fundamental role of data analysis is to extract information from data
  • Parameter estimation for models is central to the process of extracting information
  • The Fisher information matrix plays a central role in parameter estimation for measuring information: the information matrix summarizes the amount of information in the data relative to the parameters being estimated

SLIDE 3

Problem Setting

  • Consider the classical statistical problem of estimating a parameter vector θ from n data vectors z1, z2, …, zn
  • Suppose a probability density and/or mass function is associated with the data
  • The parameters θ appear in the probability function and affect the nature of the distribution
    – Example: zi ∼ N(mean(θ), covariance(θ)) for all i
  • Let L(θ | z1, z2, …, zn) represent the likelihood function, i.e., the p.d.f./p.m.f. viewed as a function of θ conditioned on the data
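To make the setting concrete, here is a minimal sketch (added illustration, not from the slides; function and variable names are placeholders) of a Gaussian log-likelihood of the form used in the example above:

```python
import numpy as np

def gaussian_log_likelihood(theta, z, mean_fn, cov_fn):
    """log L(theta | z1, ..., zn) for i.i.d. rows z[i] ~ N(mean_fn(theta), cov_fn(theta)).

    mean_fn and cov_fn map theta to the mean vector and covariance matrix;
    both stand in for a user-supplied model.
    """
    m, C = mean_fn(theta), cov_fn(theta)
    d = z.shape[1]
    _, logdet = np.linalg.slogdet(C)               # log det(C), numerically stable
    resid = z - m                                  # (n, d) residuals
    quad = np.einsum("ni,ij,nj->n", resid, np.linalg.inv(C), resid)
    return -0.5 * np.sum(quad + logdet + d * np.log(2 * np.pi))
```

Viewing this as a function of θ with the data fixed gives the likelihood surface that the remaining slides differentiate.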

SLIDE 4

Selected Applications

  • The information matrix is a measure of performance for several applications. Four uses are:
  • 1. Confidence regions for parameter estimation
    – Uses asymptotic normality and/or the Cramér-Rao inequality
  • 2. Prediction bounds for mathematical models
  • 3. Basis for the “D-optimal” criterion in experimental design
    – The information matrix serves as a measure of how well θ can be estimated for a given set of inputs
  • 4. Basis for the “noninformative prior” in Bayesian analysis
    – Sometimes used for “objective” Bayesian inference

SLIDE 5

Information Matrix

  • Recall the likelihood function L(θ | z1, z2, …, zn)
  • The information matrix is defined as

        Fn(θ) = E[ (∂ log L / ∂θ) (∂ log L / ∂θ)^T ]

    where the expectation is w.r.t. z1, z2, …, zn
  • Equivalent form based on the Hessian matrix:

        Fn(θ) = −E[ ∂² log L / (∂θ ∂θ^T) ]

  • Fn(θ) is positive semidefinite of dimension p×p (p = dim(θ))
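As a quick sanity check (an illustration added here, not part of the slides), the outer-product form can be evaluated by Monte Carlo for the scalar model zi ∼ N(θ, σ²) with σ known, where Fn(θ) = n/σ² exactly:

```python
import numpy as np

def empirical_fisher(theta, sigma, n, reps=20000, seed=0):
    """Monte Carlo estimate of F_n(theta) = E[(d log L / d theta)^2]
    for i.i.d. z_i ~ N(theta, sigma^2); the exact value is n / sigma^2."""
    rng = np.random.default_rng(seed)
    z = rng.normal(theta, sigma, size=(reps, n))
    # score of the Gaussian likelihood: d/dtheta log L = sum_i (z_i - theta) / sigma^2
    score = np.sum(z - theta, axis=1) / sigma**2
    return np.mean(score**2)
```

With, say, σ = 2 and n = 30 the estimate lands close to the exact value n/σ² = 7.5.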

SLIDE 6

Information Matrix (cont’d)

  • The connection between Fn(θ) and the uncertainty in the estimate θ̂n is rigorously specified via two famous results (θ∗ = true value of θ):
  • 1. Asymptotic normality:

        √n (θ̂n − θ∗) → N(0, F̄⁻¹) in distribution, where F̄ ≡ lim n→∞ Fn(θ)/n

  • 2. Cramér-Rao inequality (for unbiased θ̂n):

        cov(θ̂n) ≥ Fn(θ)⁻¹ for all n

  • The above two results indicate: greater variability of θ̂n ⇒ “smaller” Fn(θ) (and vice versa)
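A small simulation (added illustration, not from the slides) shows the Cramér-Rao bound being attained by the sample mean of N(θ, σ²) data, whose variance equals Fn(θ)⁻¹ = σ²/n:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true, sigma, n, reps = 0.5, 2.0, 30, 50000

# MLE of the mean of N(theta, sigma^2) data is the sample mean
est = rng.normal(theta_true, sigma, size=(reps, n)).mean(axis=1)

crb = sigma**2 / n        # Cramer-Rao bound: F_n(theta)^{-1} = sigma^2 / n
var_hat = est.var()       # observed variance of the estimator across replications
```

Here the bound is met with equality; for general nonlinear models, cov(θ̂n) can strictly exceed Fn(θ)⁻¹.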

SLIDE 7

Computation of Information Matrix

  • The analytical formula for Fn(θ) requires first- or second-derivative information and an expectation calculation
    – Often impossible or very difficult to compute in real-world models
    – Involves the expected value of highly nonlinear (possibly unknown) functions of the data
  • The schematic below summarizes an “easy” Monte Carlo-based method for determining Fn(θ)
    – Uses averages of very efficient (simultaneous perturbation) Hessian estimates
    – Hessian estimates evaluated at artificial (pseudo) data
    – Computational horsepower instead of analytical analysis
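A minimal sketch of this scheme, assuming the exact gradient of log L is available and M = 1 Hessian estimate per pseudo-data set (all function names here are illustrative placeholders; the full method is in the SP reference cited two slides below):

```python
import numpy as np

def sp_hessian_estimate(grad, theta, c, rng):
    """One simultaneous-perturbation (SP) estimate of the Hessian of -log L at theta.
    grad(theta) must return the gradient of log L for one pseudo-data realization."""
    p = len(theta)
    delta = rng.choice([-1.0, 1.0], size=p)        # Bernoulli +/-1 perturbation vector
    dg = grad(theta + c * delta) - grad(theta - c * delta)
    half = np.outer(dg / (2.0 * c), 1.0 / delta)   # rank-style SP estimate
    return -0.5 * (half + half.T)                  # symmetrize; minus sign gives -Hessian

def monte_carlo_fim(grad_for_data, sample_data, theta, N, c=1e-4, seed=0):
    """Average SP Hessian estimates over N pseudo-data sets generated at theta."""
    rng = np.random.default_rng(seed)
    F = np.zeros((len(theta), len(theta)))
    for _ in range(N):
        z = sample_data(theta, rng)                # artificial (pseudo) data
        F += sp_hessian_estimate(lambda t: grad_for_data(t, z), theta, c, rng)
    return F / N
```

For a Gaussian-mean model with identity covariance, the average converges to the known answer Fn(θ) = n·I, which gives a cheap end-to-end check of the sketch.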

SLIDE 8

Schematic of Monte Carlo Method for Estimating Information Matrix

SLIDE 9

Optimal Implementation

  • Several implementation questions/answers:
  • Q. How to compute (cheap) Hessian estimates?
  • A. Use a simultaneous perturbation (SP) based method (IEEE Trans. Auto. Control, 2000, pp. 1839–1853)
  • Q. How to allocate per-realization (M) and across-realization (N) averaging?
  • A. M = 1 is the optimal solution for a fixed total number of Hessian estimates. However, M > 1 is useful when accounting for the cost of generating pseudo data.
  • Q. Can correlation be introduced to improve the overall accuracy of F̂M,N(θ)?
  • A. Yes, antithetic random numbers can reduce the variance of the elements in F̂M,N(θ). Discussed on the slides below.

SLIDE 10

Antithetic Random Numbers

  • The above solution (M = 1) assumes all Hessian estimates are generated with independent perturbation vectors
  • Is it possible to introduce correlated perturbations to reduce variability?
  • Implemented based on M > 1
    – Contrasts with the optimal solution above of M = 1
  • Antithetic random numbers (ARNs) are a way to reduce the variability of sums of pseudorandom numbers
    – Contrast with common random numbers, used for differences of pseudorandom numbers
  • Based on introducing negative correlation according to

        var(X + Y) = var(X) + var(Y) + 2 cov(X, Y)
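A toy illustration of this identity (added here, not in the slides): estimating E[exp(U)] for U ∼ Uniform(0,1), where the antithetic pair (U, 1 − U) makes the covariance term negative for a monotone integrand:

```python
import numpy as np

def mc_estimates(f, n_pairs, antithetic, seed=0):
    """Return n_pairs estimates of E[f(U)], each averaging one pair of draws.
    With antithetic pairing (U, 1-U), cov(f(U), f(1-U)) < 0 for monotone f,
    which shrinks the variance of the pair average."""
    rng = np.random.default_rng(seed)
    u = rng.random(n_pairs)
    v = 1.0 - u if antithetic else rng.random(n_pairs)
    return 0.5 * (f(u) + f(v))

est_ind = mc_estimates(np.exp, 100000, antithetic=False)
est_ant = mc_estimates(np.exp, 100000, antithetic=True)
```

Both estimators are unbiased for e − 1, but the antithetic version has a much smaller per-pair variance, exactly as the var(X + Y) identity predicts.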

SLIDE 11

Implementing Antithetic Random Numbers

  • Implementing ARNs represents both art and science
    – Typically more difficult than common random numbers
  • It is possible to write down an analytical basis for the “best” implementation of ARNs
    – Unusable in practice
    – Requires full knowledge of the true Hessian values
  • Practical implementation requires problem insight and approximations
  • Not a panacea, but sometimes useful to increase accuracy and/or reduce computational cost

SLIDE 12

Numerical Experiments for Monte Carlo Method of Estimating Information Matrix

  • Consider the problem of estimating µ and Σ from data zi ∼ N(µ, Σ + Pi) ∀ i. Let n = 30
    – A problem with known information matrix
    – Useful for comparing the approach here with the known result
    – Pi’s assumed known (non-identical)
  • Have dim(zi) = 4 and dim(θ) = 14 ⇒ 14×(14+1)/2 = 105 unique elements in Fn(θ) need to be calculated
  • Real-world implementation of the Monte Carlo method is for problems where the solution is not known (unlike this example)

SLIDE 13

Evaluation Criteria

  • Let F̂M,N(θ) denote the estimate of the Fisher information matrix from M Hessian estimates at each pseudodata vector and N pseudodata vectors
  • There are many ways of comparing F̂M,N(θ) and the true matrix Fn(θ) = F30(θ)
  • As a summary measure we use the standard matrix (spectral) norm (scaled):

        ‖F̂M,N(θ) − Fn(θ)‖ / ‖Fn(θ)‖
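The scaled criterion is straightforward to compute; a sketch using NumPy's matrix 2-norm (largest singular value), with illustrative variable names:

```python
import numpy as np

def scaled_spectral_error(F_hat, F_true):
    """Scaled spectral-norm error ||F_hat - F_true|| / ||F_true||,
    where ||.|| (ord=2) is the largest singular value of the matrix."""
    return np.linalg.norm(F_hat - F_true, 2) / np.linalg.norm(F_true, 2)
```

A value of, say, 0.05 means the worst-case relative error of the estimate, measured along any direction in parameter space, is about 5 percent.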

SLIDE 14

Focus of Numerical Experiments

  • The two tables below show results of numerical studies of various implementations
    – Optimality of M = 1 under a fixed budget B = MN of Hessian estimates
    – Value of gradient information (when available) in improving the estimate
    – Value of ARNs
  • Assume only likelihood values are available (i.e., no gradient) in the study of M = 1
    – Crude Hessian estimates based on differences of SP gradient estimates
    – Harder to obtain good Hessian estimates than when the exact gradient is available
SLIDE 15

Two Studies: Optimality of M = 1 and Value of Gradient Information

  • Values in columns (a), (b), and (c) are scaled matrix norms; associated statistical P-values shown below
  • Constant budget B of SP Hessian estimates (B = MN)
  • P-values based on two-sided t-test

    (a) M = 1,  N = 40,000, likelihood values: 0.0502
    (b) M = 20, N = 2000,   likelihood values: 0.0532
    (c) M = 1,  N = 40,000, gradient values:   0.0183
    P-value, (a) vs. (b): 0.0009
    P-value, (a) vs. (c): < 10⁻¹⁰

SLIDE 16

Test of Antithetic Random Numbers for µ Portion of Fn(θ): Matrix Norms and P-Value

  • Constant budget of SP Hessian estimates (B = MN)
  • P-values based on two-sided t-test
  • SP Hessian estimates based on true gradient values

    M = 1, N = 40,000 (no ARNs): 0.0084
    M = 2, N = 20,000 (ARNs):    0.0071
    P-value: 0.018

SLIDE 17

Concluding Remarks

  • The Fisher information matrix is a central quantity in data analysis and parameter estimation
    – Measures information in the data relative to the quantities being estimated
    – Applications in confidence bounds, prediction error bounds, experimental design, Bayesian analysis, etc.
  • Direct computation of the information matrix in general nonlinear problems is usually impossible
  • Described a Monte Carlo approach for computing the matrix in arbitrarily complex (nonlinear) estimation problems
    – Replaces detailed analytical analysis with computational power via resampling
    – Easy to implement, but may be computationally demanding