Innovations
ECE 539/639, Portland State University
J. McNames (Ver. 1.02)

Notation

\[ \hat{x} = K_o y \]

  • Boldface will be used to denote random variables (vectors) in this set of slides
  • Normal face will be used to denote non-random variables (known constants, scalars, vectors, and matrices)
  • Upper case letters will be reserved for matrices, as before
  • The time index will be indicated by a subscript rather than a parenthetical expression, e.g., $x_n$
  • Our goal is still to estimate a random vector $x_n$ from the observed random vector $y_n$
    – Thus the roles of $x_n$ and $y_n$ have been reversed
  • The operator $^*$ will be used to denote the conjugate transpose (formerly $^H$)


Overview of Innovations Topics

  • Notation
  • Normal equations and MMSE revisited (generalized)
  • Derivation
  • Equivalence
  • Time-updates
  • Computational savings?

Linear Estimation Problem Definition Revisited

\[ \hat{x} = K_o y \]

  • Note again that the roles of $x$ and $y$ have been reversed
  • Also, the coefficient vector (formerly $c_o$) is now a matrix $K_o$
  • Suppose we wish to estimate a random vector $x \in \mathbb{C}^{\ell\times 1}$ from a random vector $y \in \mathbb{C}^{p\times 1}$
  • The estimator will be denoted as $\hat{x}$ for now
  • As before, we will restrict our attention to linear estimators for the time being
  • However, this is more general than our earlier discussion because $x$ is a vector


Introduction

  • Most of the lecture materials on the Kalman filter will be drawn from [1]
  • Best book on the Kalman filter that I'm aware of
  • Much more thorough than most books on statistical signal processing


The Error Covariance Matrix

\[ \mathrm{E}[|\tilde{z}|^2] = \mathrm{E}[\tilde{z}\tilde{z}^*] = \mathrm{E}[(a^*\tilde{x})(a^*\tilde{x})^*] = a^*\,\mathrm{E}[\tilde{x}\tilde{x}^*]\,a = a^* P(K) a \]

  • For any given vector $a$, the MSE is minimized by finding the error covariance matrix such that
    \[ a^* P(K) a \ge a^* P(K_o) a \quad\Longleftrightarrow\quad a^*[P(K) - P(K_o)]a \ge 0 \]
  • This means that an equivalent problem is to find $K_o$ such that $P(K) - P(K_o)$ is nonnegative definite for all $K$
  • The solution to this problem is independent of $a$!
  • Commonly denoted as $P(K) \ge P(K_o)$
  • Thus we can generalize the notion of minimizing the MSE in the vector case to mean minimizing the matrix error covariance $P(K) = \mathrm{E}[\tilde{x}\tilde{x}^*]$ (checked numerically in the sketch below)

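A quick numerical check of this claim (not part of the original slides; the dimensions and covariances are made up for illustration): for any competing gain $K$, $P(K) - P(K_o)$ should be nonnegative definite, so $a^*P(K)a \ge a^*P(K_o)a$ for every $a$.

```python
import numpy as np

rng = np.random.default_rng(0)
l, p = 3, 5                               # illustrative dimensions of x and y

# Random joint covariance R = [[Rx, Rxy], [Ryx, Ry]] >= 0 (real case, so * = T)
A = rng.standard_normal((l + p, l + p))
R = A @ A.T
Rx, Rxy = R[:l, :l], R[:l, l:]
Ryx, Ry = R[l:, :l], R[l:, l:]

def P(K):
    """Error covariance P(K) = Rx - K Ryx - Rxy K* + K Ry K*."""
    return Rx - K @ Ryx - Rxy @ K.T + K @ Ry @ K.T

Ko = Rxy @ np.linalg.inv(Ry)              # optimal gain (derived later in the slides)
K = rng.standard_normal((l, p))           # an arbitrary competing gain

# P(K) - P(Ko) should have no (significantly) negative eigenvalues
print(np.linalg.eigvalsh(P(K) - P(Ko)).min() >= -1e-9)   # True
```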

Inner Products

\[ \langle x, y\rangle \triangleq \mathrm{E}[xy^*] \qquad \|x\|^2 \triangleq \langle x, x\rangle \qquad x \perp y \;:\; \langle x, y\rangle = 0 \]

  • The $\langle\cdot,\cdot\rangle$ notation is used to denote an inner product
  • It is unusual that our inner product is actually an (expected) outer product of vectors
  • Nonetheless, all of the properties of an inner product space are satisfied and the projection theorem holds
  • Thus the optimal estimate is still orthogonal to the residuals (errors): $\langle y,\, x - K_o y\rangle = 0$
  • This generalizes all of the solutions to any inner product space
  • For all of our applications, $\langle\cdot,\cdot\rangle$ just denotes the expected outer product of random vectors

Solving the Normal Equations (again)

\[
\begin{aligned}
P(K) &= \mathrm{E}\left[(x - Ky)(x - Ky)^*\right] \\
     &= \mathrm{E}[xx^*] - K\,\mathrm{E}[yx^*] - \mathrm{E}[xy^*]K^* + K\,\mathrm{E}[yy^*]K^* \\
     &= R_x - K R_{yx} - R_{xy}K^* + K R_y K^* \\
     &= \begin{bmatrix} I & -K \end{bmatrix}
        \begin{bmatrix} R_x & R_{xy} \\ R_{yx} & R_y \end{bmatrix}
        \begin{bmatrix} I \\ -K^* \end{bmatrix}
      = \begin{bmatrix} I & -K \end{bmatrix} R \begin{bmatrix} I \\ -K^* \end{bmatrix}
\end{aligned}
\]

  • Suppose $R_y > 0$ (positive definite, invertible); then we can factor the joint covariance matrix in upper-lower block-triangular form as
    \[
    \begin{bmatrix} R_x & R_{xy} \\ R_{yx} & R_y \end{bmatrix}
    =
    \begin{bmatrix} I & R_{xy}R_y^{-1} \\ 0 & I \end{bmatrix}
    \begin{bmatrix} R_{\tilde{x}} & 0 \\ 0 & R_y \end{bmatrix}
    \begin{bmatrix} I & 0 \\ R_y^{-1}R_{yx} & I \end{bmatrix}
    \]
  • where $R_{xy} = R_{yx}^*$ and
    \[ R_{\tilde{x}} \triangleq R_x - R_{xy}R_y^{-1}R_{yx} \]
    is called the Schur complement of $R_x$ in $R$ (the factorization is checked numerically below)

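As a sanity check (an illustrative sketch with a randomly generated joint covariance, not the lecture's code), the block-triangular factorization and the Schur complement can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
l, p = 2, 4
A = rng.standard_normal((l + p, l + p))
R = A @ A.T                                   # joint covariance [[Rx, Rxy], [Ryx, Ry]]
Rx, Rxy = R[:l, :l], R[:l, l:]
Ryx, Ry = R[l:, :l], R[l:, l:]

Ry_inv = np.linalg.inv(Ry)
Rxt = Rx - Rxy @ Ry_inv @ Ryx                 # Schur complement R_x~ (the error covariance)

U = np.block([[np.eye(l), Rxy @ Ry_inv], [np.zeros((p, l)), np.eye(p)]])
M = np.block([[Rxt, np.zeros((l, p))], [np.zeros((p, l)), Ry]])
L = np.block([[np.eye(l), np.zeros((l, p))], [Ry_inv @ Ryx, np.eye(p)]])

print(np.allclose(R, U @ M @ L))              # True: R factors as claimed
```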

Problem Definition: Revisiting the MSE

\[ P(K) \triangleq \mathrm{E}\left[(x - \hat{x})(x - \hat{x})^*\right] = \mathrm{E}[\tilde{x}\tilde{x}^*] \]

  • If we have multiple outputs that we are estimating simultaneously, how do we calculate the mean squared error (MSE)?
  • What if estimating some elements of $x$ is more important than others?
  • Let's say we were interested in a linear combination of the individual errors for each element of $x$:
    \[
    z \triangleq a^* x \qquad
    \hat{z} = \mathrm{E}[z \mid y] = \mathrm{E}[a^*x \mid y] = a^*\,\mathrm{E}[x \mid y] = a^*\hat{x} \qquad
    \tilde{z} \triangleq z - \hat{z} = a^*(x - \hat{x}) = a^*\tilde{x}
    \]
  • How is the minimum of $\mathrm{E}[|\tilde{z}|^2]$ related to the error covariance matrix $P(K)$?


Normal Equations Revisited

More generally, you can show that the MMSE estimator is given by any solution of the normal equations, even when $R_y$ is not invertible:
\[ K_o R_y = R_{xy} \]
These look quite different from the normal equations we obtained earlier, $R c_o = d$, but they become the same if we consider that we are now estimating $x$ from $y$. Suppose $x$ is a scalar; then we can express the normal equations in a form similar to the one we worked with before:
\[
\underbrace{K_o}_{1\times p}\;\underbrace{R_y}_{p\times p} = \underbrace{R_{xy}}_{1\times p}
\qquad\Longleftrightarrow\qquad
\underbrace{R_y}_{p\times p}\;\underbrace{K_o^*}_{p\times 1} = \underbrace{R_{yx}}_{p\times 1}
\]


Normal Equations: Completing the Square

Then

\[
\begin{aligned}
P(K) &= \begin{bmatrix} I & -K \end{bmatrix}
        \begin{bmatrix} R_x & R_{xy} \\ R_{yx} & R_y \end{bmatrix}
        \begin{bmatrix} I \\ -K^* \end{bmatrix} \\
     &= \begin{bmatrix} I & -K \end{bmatrix}
        \begin{bmatrix} I & R_{xy}R_y^{-1} \\ 0 & I \end{bmatrix}
        \begin{bmatrix} R_{\tilde{x}} & 0 \\ 0 & R_y \end{bmatrix}
        \begin{bmatrix} I & 0 \\ R_y^{-1}R_{yx} & I \end{bmatrix}
        \begin{bmatrix} I \\ -K^* \end{bmatrix} \\
     &= \begin{bmatrix} I & R_{xy}R_y^{-1} - K \end{bmatrix}
        \begin{bmatrix} R_{\tilde{x}} & 0 \\ 0 & R_y \end{bmatrix}
        \begin{bmatrix} I \\ R_y^{-1}R_{yx} - K^* \end{bmatrix} \\
     &= R_{\tilde{x}} + (R_{xy}R_y^{-1} - K)\,R_y\,(R_{xy}R_y^{-1} - K)^* \\
     &= R_x - R_{xy}R_y^{-1}R_{yx} + (R_{xy}R_y^{-1} - K)\,R_y\,(R_{xy}R_y^{-1} - K)^*
\end{aligned}
\]


Exploiting Structure

  • So far we have tried to estimate one random variable (formerly $y$, now $x_n$) from another (formerly $x$, now $y$)
  • In many applications the random variables have additional structure that we can exploit
    – More insightful solutions
    – Can frame the problem in terms of known quantities and not just the auto- and cross-correlations
    – Can reduce the computation
  • Example: stationary FIR filters

Normal Equations: Finding the Optimal Solution

\[
\begin{aligned}
P(K) &= R_{\tilde{x}} + (R_{xy}R_y^{-1} - K)\,R_y\,(R_{xy}R_y^{-1} - K)^* \\
a^* P(K) a &= a^* R_{\tilde{x}}\, a + a^*(R_{xy}R_y^{-1} - K)\,R_y\,(R_{xy}R_y^{-1} - K)^* a
\end{aligned}
\]

Let $\acute{a} \triangleq (R_{xy}R_y^{-1} - K)^* a$. Then

\[
a^* P(K) a = a^* R_{\tilde{x}}\, a + \acute{a}^* R_y \acute{a},
\qquad \acute{a}^* R_y \acute{a} \ge 0 \text{ for any vector } \acute{a}
\qquad\Longrightarrow\qquad
a^* P(K) a \ge a^* R_{\tilde{x}}\, a
\]

Since the final expression doesn't depend on $K$, $P(K)$ is minimized when $K = K_o \triangleq R_{xy}R_y^{-1}$, and the minimum error covariance matrix is

\[ P(K_o) = R_{\tilde{x}} = R_x - R_{xy}R_y^{-1}R_{yx} = R_x - R_{xy}K_o^* \]

(A small numerical sketch of computing $K_o$ and $P(K_o)$ follows below.)

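A minimal numerical sketch of this result (invented dimensions and sample covariances, not code from the lecture): solve $K_o R_y = R_{xy}$ as a linear system rather than forming $R_y^{-1}$ explicitly, and form the minimum error covariance.

```python
import numpy as np

rng = np.random.default_rng(2)
l, p, n = 2, 4, 100_000

# Draw jointly correlated, zero-mean samples of x (l x n) and y (p x n)
mix = rng.standard_normal((l + p, l + p))
xy = mix @ rng.standard_normal((l + p, n))
x, y = xy[:l], xy[l:]

Rx  = x @ x.T / n                  # sample covariances (real case, * = T)
Rxy = x @ y.T / n
Ry  = y @ y.T / n

# Ko Ry = Rxy  <=>  Ry Ko* = Ryx, so solve a linear system instead of inverting Ry
Ko = np.linalg.solve(Ry, Rxy.T).T
P_min = Rx - Ko @ Rxy.T            # P(Ko) = Rx - Rxy Ry^{-1} Ryx

x_hat = Ko @ y                     # linear MMSE estimate of each column of x
err = x - x_hat
print(np.allclose(err @ err.T / n, P_min, atol=1e-8))   # empirical error covariance = P(Ko)
```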


Large Linear Estimation Problems

Recall from the projection theorem that our estimator is given by the orthogonal projection of $x \in \mathbb{C}^{\ell\times 1}$ onto the set of all observations $\{y_0, y_1, \ldots, y_N\}$, where $y_n \in \mathbb{C}^{p\times 1}$:
\[ K_o R_y = R_{xy} \]

\[
\underbrace{R_y}_{(N+1)p \times (N+1)p} = \left[\langle y_i, y_j\rangle\right]_{i,j=0}^{N} =
\begin{bmatrix}
\|y_0\|^2 & \langle y_0, y_1\rangle & \cdots & \langle y_0, y_N\rangle \\
\langle y_1, y_0\rangle & \|y_1\|^2 & \cdots & \langle y_1, y_N\rangle \\
\vdots & \vdots & \ddots & \vdots \\
\langle y_N, y_0\rangle & \langle y_N, y_1\rangle & \cdots & \|y_N\|^2
\end{bmatrix}
\]

\[
\underbrace{R_{xy}}_{\ell \times (N+1)p} = \left[\langle x, y_i\rangle\right]_{i=0}^{N} =
\begin{bmatrix} \langle x, y_0\rangle & \langle x, y_1\rangle & \cdots & \langle x, y_N\rangle \end{bmatrix}
\]

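To make the size issue concrete, here is a small illustrative construction (invented sizes and signal model, not from the lecture) of the stacked observation vector and the resulting $(N+1)p \times (N+1)p$ matrix $R_y$ from samples:

```python
import numpy as np

rng = np.random.default_rng(3)
l, p, N, trials = 2, 3, 9, 20_000

# Stack N+1 observations of a p-vector process into y = col{y0, ..., yN}
Y = rng.standard_normal(((N + 1) * p, trials))        # (N+1)p x trials sample columns
C = rng.standard_normal((l, (N + 1) * p))             # mixing matrix tying x to the observations
x = C @ Y + 0.1 * rng.standard_normal((l, trials))

Ry  = Y @ Y.T / trials          # (N+1)p x (N+1)p -- grows quadratically with N
Rxy = x @ Y.T / trials          # l x (N+1)p
print(Ry.shape, Rxy.shape)      # (30, 30) (2, 30)
```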

Normal Equations for Random Processes

\[ K_o R_y = R_{xy} \]

  • The signal processing context has a bit more structure than the general linear estimation problem
  • Suppose
    – We can make sequential observations of a vector-valued random process $y_n$
    – We have made $N+1$ observations of $y_n$
  • Our goal is then to estimate a random vector $x \in \mathbb{C}^{\ell\times 1}$ from the entire set of observations $y_n \in \mathbb{C}^{p\times 1}$ for $n = 0, \ldots, N$
  • If we collect all $N+1$ observation vectors into a composite $(N+1)p \times 1$ vector $y = \mathrm{col}\{y_0, \ldots, y_N\}$, then the solution is still given by the normal equations
  • But now the auto-correlation matrix $R_y$ is enormous: $(N+1)p \times (N+1)p$


Property 1: Linearity of Linear MMSE Estimators

There are two important properties of projections onto a linear space $\mathcal{L}\{y\}$. The first is that, for any two matrices $A_1$ and $A_2$ of appropriate dimensions, with $\acute{x} \triangleq A_1 x_1 + A_2 x_2$,

\[
R_{\acute{x}y} = \langle \acute{x}, y\rangle = \langle A_1 x_1 + A_2 x_2,\; y\rangle
= A_1\langle x_1, y\rangle + A_2\langle x_2, y\rangle
= A_1 R_{x_1 y} + A_2 R_{x_2 y}
\]

\[
\begin{aligned}
\hat{\acute{x}} = \widehat{(A_1 x_1 + A_2 x_2)}
&= R_{\acute{x}y} R_y^{-1} y \\
&= (A_1 R_{x_1 y} + A_2 R_{x_2 y}) R_y^{-1} y \\
&= A_1 R_{x_1 y} R_y^{-1} y + A_2 R_{x_2 y} R_y^{-1} y \\
&= A_1 \hat{x}_1 + A_2 \hat{x}_2
\end{aligned}
\]

Thus $\widehat{(A_1 x_1 + A_2 x_2)} = A_1\hat{x}_1 + A_2\hat{x}_2$.


Normal Equations: Opportunities

\[ K_o R_y = R_{xy} \]

  • In general, if $R_y$ is of dimension $M$, it requires $O(M^3)$ operations to solve the normal equations
  • On modern computers this can be slow (seconds) if $M \ge 2000$ (7 s on my computer)
  • For large $M$, there may also be storage problems
  • If $M = (N+1)p$ is growing (the signal processing context), these problems are much worse
  • Ideally, we could update our estimator each time a new observation arrived and then discard the observation
  • This only works when the problem has some special structure
  • We know that in the stationary FIR filter case, the solution requires only $O(N^2)$ operations

Equivalent, but Orthogonal Observations

\[ \mathcal{L}_N \triangleq \mathcal{L}\{y_0, \ldots, y_N\} = \mathcal{L}\{e_0, \ldots, e_N\} \]

  • Fortunately, we can transform $\{y_0, \ldots, y_N\}$ into an equivalent set of orthogonal vectors $\{e_0, \ldots, e_N\}$ such that $e_i \perp e_j$ for all $i \ne j$
  • It would be nicer still if we could efficiently calculate $e_{N+1}$ from $y_{N+1}$ and $e_N$ without having to recalculate all of the other $\{e_0, \ldots, e_N\}$
  • Thus, we want the random process $e_n$ to have a recursive relationship to $y_n$
  • Then we can calculate $e_{N+1}$ as
    \[ e_{N+1} = y_{N+1} - \hat{y}_{N+1|N} \]
    where $\hat{y}_{N+1|N} \triangleq$ the linear MMSE estimator of $y_{N+1}$ given $\{y_0, \ldots, y_N\}$, i.e., $\hat{y}_{N+1|N} = \langle y_{N+1}, y\rangle \|y\|^{-2} y = R_{y_{N+1}y} R_y^{-1} y$
  • This is just the forward prediction error filter (generalized)

Property 2: The Orthogonality Condition

\[
\begin{aligned}
\hat{x}|_{y_1,y_2} = R_{xy}R_y^{-1}y
&= \begin{bmatrix} R_{xy_1} & R_{xy_2} \end{bmatrix}
   \begin{bmatrix} R_{y_1} & R_{y_1y_2} \\ R_{y_2y_1} & R_{y_2} \end{bmatrix}^{-1}
   \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \\
&= \begin{bmatrix} R_{xy_1} & R_{xy_2} \end{bmatrix}
   \begin{bmatrix} R_{y_1} & 0 \\ 0 & R_{y_2} \end{bmatrix}^{-1}
   \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}
   \qquad (\text{using } R_{y_1y_2} = 0) \\
&= \begin{bmatrix} R_{xy_1} & R_{xy_2} \end{bmatrix}
   \begin{bmatrix} R_{y_1}^{-1} & 0 \\ 0 & R_{y_2}^{-1} \end{bmatrix}
   \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \\
&= \begin{bmatrix} R_{xy_1}R_{y_1}^{-1} & R_{xy_2}R_{y_2}^{-1} \end{bmatrix}
   \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \\
&= R_{xy_1}R_{y_1}^{-1}y_1 + R_{xy_2}R_{y_2}^{-1}y_2 \\
&= \hat{x}|_{y_1} + \hat{x}|_{y_2}
\end{aligned}
\]

\[ \hat{x}|_{y_1,y_2} = \hat{x}|_{y_1} + \hat{x}|_{y_2} \quad \text{if and only if} \quad y_1 \perp y_2 \;\; (R_{y_1y_2} = 0) \]

(A numerical illustration follows below.)

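An illustrative sample-based check (made-up dimensions and signal model, not from the slides): when two observation blocks are uncorrelated, the joint estimate is simply the sum of the two individual estimates.

```python
import numpy as np

rng = np.random.default_rng(4)
l, p, n = 2, 3, 200_000

y1 = rng.standard_normal((p, n))            # independent blocks, so y1 is orthogonal to y2
y2 = rng.standard_normal((p, n))
B1 = rng.standard_normal((l, p))
B2 = rng.standard_normal((l, p))
x = B1 @ y1 + B2 @ y2 + 0.1 * rng.standard_normal((l, n))

def lmmse(x, y):
    """Linear MMSE estimate of x from y using sample covariances (zero mean assumed)."""
    Rxy = x @ y.T / y.shape[1]
    Ry = y @ y.T / y.shape[1]
    Ko = np.linalg.solve(Ry, Rxy.T).T
    return Ko @ y

x_joint = lmmse(x, np.vstack([y1, y2]))     # estimate from the stacked observation
x_split = lmmse(x, y1) + lmmse(x, y2)       # sum of the individual estimates

print(np.abs(x_joint - x_split).max())      # small, since the sample R_{y1 y2} is nearly 0
```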

Finding Equivalent, Orthogonal Observations

The estimates can be obtained easily from the prior estimates using the orthogonality condition:

\[
\begin{aligned}
\hat{y}_{N+1|N} &\triangleq \text{the linear MMSE estimator of } y_{N+1} \text{ given } \{y_0, \ldots, y_N\} = \langle y_{N+1}, y\rangle \|y\|^{-2} y \\
&= \text{the linear MMSE estimator of } y_{N+1} \text{ given } \{e_0, \ldots, e_N\} = \langle y_{N+1}, e\rangle \|e\|^{-2} e \\
&= \sum_{n=0}^{N} \hat{y}_{N+1|e_n} \qquad \text{since } e_n \perp e_k \text{ for } n \ne k \\
&= \sum_{n=0}^{N} \langle y_{N+1}, e_n\rangle \|e_n\|^{-2} e_n
\end{aligned}
\]


Linear Estimation with Orthogonal Observations

\[ \underbrace{K_o}_{\ell\times(N+1)p}\;\underbrace{R_y}_{(N+1)p\times(N+1)p} = \underbrace{R_{xy}}_{\ell\times(N+1)p} \]

  • If $R_y$ were a diagonal matrix, this would be much easier to solve
  • Only $O\bigl((N+1)p\bigr)$ operations, rather than $O\bigl([(N+1)p]^3\bigr)$
  • Equivalently, it would be much easier to invert if $R_y$ were block diagonal: $O\bigl((N+1)p^3\bigr)$ operations
  • In other words, $y_i \perp y_j$ for all $i \ne j$
  • In this case
    \[
    \hat{x}_N \triangleq \text{the linear MMSE estimator of } x \text{ given } \{y_0, \ldots, y_N\}
    = R_{xy}R_y^{-1}y = \langle x, y\rangle \|y\|^{-2} y
    = \sum_{n=0}^{N} R_{xy_n} R_{y_n}^{-1} y_n
    \]
  • This would not happen in most applications

Finding the Innovations

  • If our only goal was to diagonalize the covariance matrix $R_y$, there are many possible transformations:
    \[ \epsilon \triangleq A y \qquad R_\epsilon = \langle \epsilon, \epsilon\rangle = A R_y A^* = \text{a diagonal matrix} \]
  • If we want $\epsilon$ to contain the same information, $A$ must be nonsingular:
    \[ R_y = A^{-1} R_\epsilon A^{-*}, \qquad R_\epsilon \text{ diagonal} \]
  • If $A$ is an $M \times M$ matrix, there are $M^2$ free parameters in $A$ and only $M(M+1)/2$ constraints imposed by requiring $R_\epsilon$ to be a diagonal matrix
  • However, we also desire $A$ to be lower triangular so that $e_n$ is a causal function of $y_n$ and vice versa


Gram-Schmidt Orthogonalization

Then the next orthogonalized random vector can be obtained from

\[ e_{N+1} = y_{N+1} - \hat{y}_{N+1|N} = y_{N+1} - \sum_{n=0}^{N} \langle y_{N+1}, e_n\rangle \|e_n\|^{-2} e_n \]

which can be initialized with $e_0 = y_0$.

  • This is the Gram-Schmidt orthogonalization procedure for random vectors (a sample-based sketch follows below)
  • Note, however, that this does not tell you how to calculate the cross-covariance $\langle y_{N+1}, e_n\rangle$ or the autocovariance matrices $R_{e_n} = \|e_n\|^2$ from $R_{y_n} = \|y_n\|^2$
  • To obtain these matrices requires $O\bigl([(N+1)p]^3\bigr)$ operations
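A sample-based sketch of this procedure (illustrative code, not the lecture's; the inner product $\langle a, b\rangle = \mathrm{E}[ab^*]$ is estimated from sample columns):

```python
import numpy as np

def innovations_gram_schmidt(ys):
    """Classical Gram-Schmidt on zero-mean random vectors.

    Each y in ys is a (p x trials) array of samples; <a, b> = E[a b*] is
    estimated by the sample average a @ b.T / trials.  Returns [e0, ..., eN].
    """
    es = []
    for y in ys:
        e = y.copy()
        for ek in es:
            R_y_ek = y @ ek.T / y.shape[1]             # <y_{N+1}, e_k>
            R_ek = ek @ ek.T / ek.shape[1]             # ||e_k||^2
            e -= R_y_ek @ np.linalg.solve(R_ek, ek)    # subtract the projection onto e_k
        es.append(e)
    return es

# Example: an AR(1)-like correlated vector process
rng = np.random.default_rng(5)
p, N, trials = 2, 4, 50_000
ys = [rng.standard_normal((p, trials))]
for _ in range(N):
    ys.append(0.8 * ys[-1] + rng.standard_normal((p, trials)))

es = innovations_gram_schmidt(ys)
print(np.abs(es[0] @ es[3].T / trials).max())          # ~ 0: the innovations are orthogonal
```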

Finding the Innovations Continued

\[ R_y = LDL^* \qquad e = L^{-1}y \qquad R_e = L^{-1} R_y L^{-*} = D \]

  • We have three constraints on the linear transformation $\epsilon \triangleq Ay$:
    1. The linear transformation of $y$ must be invertible ($A$ is nonsingular)
    2. The linear transformation must be causal ($A$ is lower triangular)
    3. $R_e$ must be diagonal
  • If $L$ has a unit diagonal and $R_y$ is positive definite, this factorization is unique
  • These are the same innovations that are obtained via the Gram-Schmidt procedure: $\epsilon = e$, $A = L^{-1}$ (see the LDL sketch below)

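A small sketch (scalar blocks, numpy only; the unit-diagonal $L$ is built from a Cholesky factor rather than a dedicated LDL routine, which is an implementation choice, not the lecture's): factor $R_y = LDL^*$ and whiten with $L^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(6)
M, n = 6, 100_000

A = rng.standard_normal((M, M))
Ry = A @ A.T                                   # positive definite covariance

# Ry = G G* (Cholesky)  =>  L = G diag(G)^{-1} has a unit diagonal and Ry = L D L*
G = np.linalg.cholesky(Ry)
d = np.diag(G)
L = G / d                                      # divide column j by G[j, j]
D = np.diag(d**2)
print(np.allclose(Ry, L @ D @ L.T))            # True

# Whitening: e = L^{-1} y is causal (L is lower triangular) and Re = D
y = G @ rng.standard_normal((M, n))            # samples with covariance Ry
e = np.linalg.solve(L, y)
Re = e @ e.T / n
print(np.abs(Re - D).max())                    # small (sampling error only)
```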

Innovations

\[ e_{N+1} = y_{N+1} - \sum_{n=0}^{N} \langle y_{N+1}, e_n\rangle \|e_n\|^{-2} e_n \]

  • Note that $e_{N+1}$ is the portion of the random variation in $y_{N+1}$ that cannot be estimated from the previous observations $\{y_0, \ldots, y_N\}$
  • The remainder contains the "new information"
  • It is in this sense that they are called innovations:
    \[ \{e_n\} = \text{the innovations process associated with } \{y_n\} \]
  • Each innovation $e_n$ is uncorrelated with the prior innovations and thus only contains the "new information"
  • In other words, it is a white noise process

Estimation with Innovations

The estimate of $x$ is now straightforward:

\[
\hat{x}|_N \triangleq \text{the MMSE estimator of } x \text{ given } \{y_0, \ldots, y_N\}
= \text{the MMSE estimator of } x \text{ given } \{e_0, \ldots, e_N\}
= \sum_{n=0}^{N} \langle x, e_n\rangle \|e_n\|^{-2} e_n
\]


Equivalence of Causal, Invertible Innovations and Gram-Schmidt

\[ y_n = e_n + \sum_{i=0}^{n-1} \langle y_n, e_i\rangle \|e_i\|^{-2} e_i \qquad\Longleftrightarrow\qquad y = Le \]

\[
L = \begin{bmatrix}
I & & & & \\
\langle y_1, e_0\rangle\|e_0\|^{-2} & I & & & \\
\langle y_2, e_0\rangle\|e_0\|^{-2} & \langle y_2, e_1\rangle\|e_1\|^{-2} & I & & \\
\vdots & \vdots & \vdots & \ddots & \\
\langle y_N, e_0\rangle\|e_0\|^{-2} & \langle y_N, e_1\rangle\|e_1\|^{-2} & \langle y_N, e_2\rangle\|e_2\|^{-2} & \cdots & I
\end{bmatrix}
\]

\[ R_y = LDL^* \qquad D = \mathrm{diag}\{\|e_n\|^2\} \]


Recursive Estimation with Innovations

The really cool thing is that we can update our estimate easily when $y_{N+1}$ shows up:

\[ e_{N+1} = y_{N+1} - \hat{y}_{N+1|N} = y_{N+1} - \sum_{n=0}^{N} \langle y_{N+1}, e_n\rangle \|e_n\|^{-2} e_n \]

\[
\hat{x}|_{N+1} = \sum_{n=0}^{N} \langle x, e_n\rangle \|e_n\|^{-2} e_n + \langle x, e_{N+1}\rangle \|e_{N+1}\|^{-2} e_{N+1}
= \hat{x}|_N + \langle x, e_{N+1}\rangle \|e_{N+1}\|^{-2} e_{N+1}
\]

(A sample-based sketch of this update follows below.)

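A sample-based sketch of this time update (invented signal model; inner products are estimated from sample columns, and this is not the lecture's code): each new observation contributes one new innovation and one additive correction to $\hat{x}$.

```python
import numpy as np

rng = np.random.default_rng(7)
l, n = 2, 100_000

def cov(a, b):
    """Sample estimate of <a, b> = E[a b*] for zero-mean sample columns."""
    return a @ b.T / a.shape[1]

x = rng.standard_normal((l, n))                 # the random vector to estimate
es = []                                         # innovations found so far
x_hat = np.zeros((l, n))                        # running estimate x_hat|_N

for N in range(5):
    y_new = 0.7 * x + rng.standard_normal((l, n))          # new observation y_{N+1}

    # e_{N+1} = y_{N+1} - sum_n <y_{N+1}, e_n> ||e_n||^{-2} e_n
    e_new = y_new.copy()
    for e in es:
        e_new -= cov(y_new, e) @ np.linalg.solve(cov(e, e), e)

    # x_hat|_{N+1} = x_hat|_N + <x, e_{N+1}> ||e_{N+1}||^{-2} e_{N+1}
    x_hat += cov(x, e_new) @ np.linalg.solve(cov(e_new, e_new), e_new)
    es.append(e_new)

    print(N + 1, float(np.mean((x - x_hat) ** 2)))   # MSE shrinks as observations accrue
```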

Innovations and Whitening

\[ y = Le \qquad y_n = e_n + \sum_{i=0}^{n-1} \langle y_n, e_i\rangle \|e_i\|^{-2} e_i \qquad R_y = LDL^* \qquad D = \mathrm{diag}\{\|e_n\|^2\} \]

  • Thus a white-noise innovations process can be obtained by performing an LDL (block matrix) decomposition of $R_y$
  • Sometimes $L$ is called the canonical modeling filter
  • Similarly, $L^{-1}$ is called the canonical whitening filter
  • The representation $y = Le$ is called the innovations representation of $y_n$


Summary

\[ e_{N+1} = y_{N+1} - \hat{y}_{N+1|N} \qquad \hat{x}|_{N+1} = \hat{x}|_N + \langle x, e_{N+1}\rangle \|e_{N+1}\|^{-2} e_{N+1} \]

  • The innovations process will ultimately help us
    – Reduce computation
    – Reduce storage requirements
    – Exploit structure in the covariance matrix $R_y$
    – Give us insight into the solution
  • This is essentially a whitening filter followed by an estimator
  • The same architecture is used for order-recursive algorithms and linear FIR/IIR estimation problems


Similarity of Time- and Order-Updates

\[ e_{N+1} = y_{N+1} - \hat{y}_{N+1|N} \qquad \hat{x}|_{N+1} = \hat{x}|_N + \langle x, e_{N+1}\rangle \|e_{N+1}\|^{-2} e_{N+1} \]

  • Note the similarity of these time updates with the order updates obtained from the Levinson algorithm:
    1. Estimate the new observations/variables from all of the previous observations/variables
    2. The residuals are white and orthogonal
    3. Estimate the variable of interest with the residuals

References

[1] T. Kailath, A. H. Sayed, and B. Hassibi. Linear Estimation. Prentice Hall, 2000.


Computational Savings?

\[ K_o R_y = R_{xy} \qquad R_y = LDL^* \]

  • So far the innovations have given us insight
  • They have not saved us any computation
  • Performing an LDL decomposition of an $M \times M$ matrix requires $O(M^3)$ operations, the same amount required to solve the normal equations directly
  • As always, we need additional structure in the matrices to reduce the computation
  • The state-space formulation gives us a structure that permits us to solve the normal equations much more efficiently
