Estimating Filaments and Manifolds
Larry Wasserman, Department of Statistics and Machine Learning Department, Carnegie Mellon University. June 2012.
Co-authors — Geometry: Chris Genovese, Marco Perone-Pacifico and Isa Verdinelli. Topology: Sivaraman Balakrishnan, Ale Rinaldo, Don Sheehy and Aarti Singh.
The geometric problem: find an estimate M̂ that is close to an unknown manifold M. The topological problem: find an estimate M̂ that has the same homology as an unknown manifold M. When the manifold is one-dimensional we call it a filament. We are not using manifolds for dimension reduction; we are interested in estimating the manifold itself.
Genovese, Perone-Pacifico, Verdinelli, Wasserman (2010) (arXiv:1003.5536, arXiv:1007.0549, arXiv:1109.4540). Rinaldo, Sheehy, Balakrishnan, Singh, Wasserman (2011).
Many datasets exhibit complex, low-dimensional structure. [Figures: more examples.]
In addition, high-dimensional datasets often have hidden structure that we would like to identify. Several distinct problems here, including: Dimension Reduction, Clustering, and Estimation.
Manifolds give a useful representation of low dimensional structure. A manifold is a space that looks locally like a Euclidean space of some dimension (called the dimension of the manifold).
Examples: a point (0-dim), filaments (1-dim), the surface of a sphere or torus (2-dim), the three-dimensional sphere (3-dim), space-time (4-dim).
To allow for intersections and other complexities, consider a union of manifolds embedded in R^D with maximal dimension d < D. We call this a d-dimensional manifold complex.
Outline:
1. The Geometric Problem
   a. Minimax Theory
   b. Methods
2. The Topological Problem
Four noise models:
1. Noiseless: Y_i ∼ G where support(G) = M.
2. Clutter: Y_i ∼ (1 − π)U + πG where U is uniform.
3. Perpendicular: Y_i = X_i + ε_i where X_i ∼ G and ε_i is perpendicular to M (Niyogi, Smale, Weinberger 2008).
4. Additive: Y_i = X_i + ε_i where ε_i ∼ Φ.
The loss is the Hausdorff distance H(M̂, M), where H(A, B) = inf{ε : A ⊂ B ⊕ ε and B ⊂ A ⊕ ε}, with A ⊕ ε = ∪_{x∈A} B(x, ε) and B(x, ε) = {y : ||x − y|| ≤ ε}. The minimax risk is
R_n = inf_{M̂} sup_{Q∈Q} E_Q H(M̂, M).
[Figure: two sets A and B with H(A, B) = max{2.5, 1.5} = 2.5.]
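Since the Hausdorff distance drives everything that follows, here is a minimal sketch of computing it between two finite point sets (for manifolds one would discretize first; the toy sets A and B below are mine, chosen to echo the max{2.5, 1.5} figure):

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between finite point sets A, B (arrays of shape (n, D)).

    For finite sets, H(A, B) = max( max_a min_b ||a - b||, max_b min_a ||a - b|| ),
    which agrees with the inf{eps : A in B + eps and B in A + eps} definition.
    """
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # pairwise ||a - b||
    d_AB = dists.min(axis=1).max()  # farthest that A strays from B
    d_BA = dists.min(axis=0).max()  # farthest that B strays from A
    return max(d_AB, d_BA)

# Toy sets: the two directed distances are 1.5 and 2.5
A = np.array([[0.0, 0.0]])
B = np.array([[0.0, 1.5], [0.0, -2.5]])
print(hausdorff(A, B))  # max{1.5, 2.5} = 2.5
```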
How much can M curve? Define ∆(M), the condition number (or reach) of M: the largest r such that every point within distance r of M has a unique projection onto M. A ball of radius greater than ∆(M) cannot roll freely along M, and a manifold with small ∆(M) is close to being self-intersecting. From Gonzalez and Maddocks (1999): a large value of ∆(M) corresponds to a manifold that is smooth and far from looping back on itself.
[Figure: curves probed by rolling circles of radius r; panels illustrate the cases κ > r, κ > 2r, κ < r, and κ < 2r.]
[Figure: a 2-d manifold in 3-d space.]
Related work:
- Computational geometry (e.g., Cheng et al. 2005, Dey 2006). Here "noise" does not have the statistical meaning of points drawn randomly from a distribution; instead, data must be close to M but not too close to each other. (There are a few notable exceptions.)
- Manifold learning (e.g., Ozertem and Erdogmus 2011). The primary focus here is on dimension reduction.
- Homology estimation (e.g., Niyogi, Smale, and Weinberger 2009). The focus is on topological rather than geometric information. (More on this later.)
- Filaments, principal curves, support estimation, . . . (e.g., Hastie and Stuetzle 1989, Tibshirani 1992, Arias-Castro et al.)
Define M ≡ M_κ = {M : ∆(M) ≥ κ} and Q = {Q_M : M ∈ M}, where
Q_M(A) = ∫ Φ(Y ∈ A | X = x) dG(x)
is the induced distribution of Y. Draw Y_1, . . . , Y_n IID from Q_M and estimate M with M̂ ≡ M̂_n. Goal: determine the minimax risk
R_n = inf_{M̂_n} sup_{Q∈Q} E_Q H(M̂_n, M),
at least up to rates, under Hausdorff loss.
Theorem (rates):
inf_{M̂_n} sup_{Q∈Q} E_Q H(M̂_n, M) ≍ C ψ_n

Noise model | ψ_n
Perpendicular, compact | n^{−2/(2+d)}
Clutter/noiseless | (nπ)^{−2/d}
Additive, compact/polynomial | in progress
Additive, sub-Gaussian | (log n)^{−1}

Note that these rates do not depend on the ambient dimension D. There are strong connections between the additive noise model and errors-in-variables regression, but also some notable differences.
The lower bounds will all be found using Le Cam's lemma. Let Y_1, . . . , Y_n ∼ Q ∈ Q, let ρ be a metric, and let θ̂ = θ̂(Y_1, . . . , Y_n) be any estimator. Then for any pair Q_0, Q_1 ∈ Q,
sup_{Q∈Q} E_{Q^n} ρ(θ̂, θ(Q)) ≥ ρ(θ(Q_0), θ(Q_1)) × [1 − TV(Q_0, Q_1)]^{2n},
where TV(Q_0, Q_1) = sup_A |Q_0(A) − Q_1(A)| = (1/2) ∫ |q_0 − q_1|. The key is to choose a least favorable pair Q_0, Q_1.
Perpendicular noise: draw X_1, . . . , X_n ∼ G, where support(G) = M, and Y_i = X_i + ε_i, where ε_i is uniform on the normal (perpendicular) space to M at X_i.
Theorem:
R_n = inf_{M̂} sup_{Q∈Q} E_Q H(M̂, M) ≍ n^{−2/(2+d)}.
Note that this rate depends on d but not on D.
Lower bound: for any estimator M̂,
sup_{Q∈Q} E_{Q^n}[H(M̂, M)] ≥ H(M_0, M_1) × (1 − TV)^{2n} = γ(1 − cγ^{(d+2)/2})^{2n}.
Setting γ = n^{−2/(d+2)} yields the result. The least favorable pair: M_0 = a plane and M_1 = a "flying saucer," the plane with a small dome of height γ pushed up. A naive bump of height γ has ∆(M) = 0, so the construction must be smoothed to keep ∆(M) = κ > 0.
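Filling in the arithmetic behind this choice of γ (a one-line check):

```latex
\text{With } \gamma = n^{-2/(d+2)}:\quad \gamma^{(d+2)/2} = n^{-1},
\qquad\text{so}\qquad
\gamma\bigl(1 - c\,\gamma^{(d+2)/2}\bigr)^{2n}
  = n^{-2/(d+2)}\Bigl(1 - \tfrac{c}{n}\Bigr)^{2n}
  \;\longrightarrow\; e^{-2c}\, n^{-2/(d+2)}.
```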
Upper bound: construct an "estimator" that achieves the bound.
1. Split the data into two halves.
2. Using the first half, construct a pilot estimator. This is a (sieve) maximum likelihood estimator.
3. Cover the pilot estimator with thin, long slabs S_1, . . . , S_N.
4. Fit local linear estimators M̂_j in each slab (using the second half of the data).
5. Take M̂ = ∪_j M̂_j.
The details are messy and the estimator is not practical, but it suffices for establishing the bound.
Clutter noise: Y_1, . . . , Y_n ∼ Q ≡ (1 − π)U + πG, where 0 < π ≤ 1, U is uniform on a compact set K ⊂ R^D, and the support of G is M. Then
inf_{M̂} sup_{Q∈Q} E_{Q^n}[H(M̂, M)] ≍* C (1/(nπ))^{2/d}.
The ≍* means I am hiding log factors. The lower bound uses the same least favorable pair.
Upper bound: let ε_n = (log n / n)^{2/d} and let Q̂_n be the empirical measure. For y ∈ M, let S_M(y) denote a thin slab centered at y, of side b_1 √ε_n in the d tangent directions and b_2 ε_n in the D − d normal directions (so S_M(y) has volume of order ε_n^{d/2} × ε_n^{D−d}). Define
ŝ(M) = inf_{y∈M} Q̂_n(S_M(y))
and let M̂ maximize ŝ(M) over M ∈ M.
Additive noise: X_1, X_2, . . . , X_n ∼ G where support(G) = M, and Y_i = X_i + Z_i, i = 1, . . . , n, where Z_i ∼ Φ = Gaussian. This is like the usual deconvolution problem except:
1. We want to estimate the support of G, not G itself.
2. G is singular.
3. The underlying object is a manifold, not a function.
For technical reasons, we allow the manifolds to be noncompact and define a truncated loss function, L(M̂, M) = H(M̂ ∩ K, M ∩ K). Lower bound:
inf_{M̂} sup_{Q∈Q} E_Q[L(M̂, M)] ≥ C / log n.
The rate is similar to deconvolution, but the proof is somewhat different (since Q_0 and Q_1 have different supports). [Figure: the least favorable pair.]
Upper bound: let ĝ be a deconvolution density estimator (even though G has no density) and let M̂ = {ĝ > λ}. Fix any 0 < δ < 1/2. Then
inf_{M̂} sup_{Q∈Q} E_Q[L(M̂, M)] ≤ C (1/log n)^{(1−δ)/2}.
In some special cases we can achieve the 1/log n rate, but in general we cannot.
The estimators we described are not practical; they are used simply to prove the upper bounds. A more practical approach is to use the ridges of the density q as a surrogate for M. Genovese, Perone-Pacifico, Verdinelli and Wasserman (2009) find the ridges by tracing mean-shift paths; Ozertem and Erdogmus (2011) find the ridges using projected mean-shift paths.
Before discussing ridges, let's review modes. Let m_1, . . . , m_k be the modes of a density q. For any point y there is an integral curve (gradient ascent curve) π leading from y to a mode m_j. Formally, π′(t) = ∇q(π(t)) and destination(y) = lim_{t→∞} π(t) = m_j. Let A_j = {y : destination(y) = m_j} be the ascending manifold for m_j. These partition the space.
Y_1, . . . , Y_n are sampled IID from Q = (1 − π)U + π(G ⋆ Φ_σ), the additive model with clutter. Let q be the density of Q, with gradient g(x) = ∇q(x) and Hessian H(x). Write H(x) = U(x)Λ(x)U(x)^T with eigenvalues ordered λ_1(x) ≥ · · · ≥ λ_D(x), and let V(x) be the D − d columns of U(x) corresponding to the smallest eigenvalues. Let L(x) = V(x)V(x)^T be the projector onto the local normal space defined by the columns of V(x). Define the vector field W(x) = L(x)g(x), and let Π denote the set of integral curves of W. Thus, π ∈ Π if and only if
π′(t) = W(π(t)) = L(π(t)) g(π(t)). (1)
The ridge R consists of the destinations of the integral curves: x ∈ R if lim_{t→∞} π(t) = x for some π ∈ Π. Under weak conditions, H(M, R) = O(σ) and R and M have the same topology.
[Figure: the manifold M, its ridge R, and the smoothed ridge R_h.]
Before finding the ridges, first suppose we wanted to find the modes of a kernel density estimator q̂. Select mesh points v_1, . . . , v_N and iterate, for t = 1, 2, 3, . . . :
v_j(t + 1) ← Σ_i Y_i K_h(v_j(t) − Y_i) / Σ_i K_h(v_j(t) − Y_i)
(the mean-shift update, with kernel K_h). Then the v_j(t) converge to the modes as t → ∞.
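A minimal sketch of this iteration with a Gaussian kernel (the bandwidth h, iteration count, and toy data are placeholder choices, not from the talk):

```python
import numpy as np

def mean_shift(Y, v, h=0.5, iters=200):
    """Iterate v <- sum_i Y_i w_i / sum_i w_i with Gaussian kernel weights
    w_i = exp(-||v - Y_i||^2 / (2 h^2)); the fixed point is a mode of the
    kernel density estimate."""
    for _ in range(iters):
        w = np.exp(-np.sum((Y - v) ** 2, axis=1) / (2 * h ** 2))
        v = (w[:, None] * Y).sum(axis=0) / w.sum()
    return v

# Two-cluster toy data: mesh points flow to the two modes
rng = np.random.default_rng(0)
Y = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(3, 0.3, size=(50, 2))])
print(mean_shift(Y, np.array([0.5, 0.5])))  # converges near (0, 0)
print(mean_shift(Y, np.array([2.5, 2.5])))  # converges near (3, 3)
```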
In Genovese, Perone-Pacifico, Verdinelli and Wasserman (2009), we used the trajectories {v_j(t) : t = 1, 2, 3, . . .} to trace out the ridges. Ozertem and Erdogmus (2011) give a more direct approach: project each mean-shift step onto the space spanned by the D − d eigenvectors of the Hessian with the smallest eigenvalues. This is SCMS (Subspace Constrained Mean Shift).
Steps (a sketch of one SCMS iteration follows this list):
1. Estimation: estimate the density q, its gradient g, and its Hessian H.
2. Denoising: remove background clutter and low-probability regions, restricting attention to a set where q̂ is not too small.
3. Mean-Shift: apply the SCMS algorithm within the restricted set.
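A sketch of a single SCMS iteration consistent with these steps (Gaussian kernel; the bandwidth h and ridge dimension d are placeholder parameters, and positive constants in the Hessian estimate are dropped since only the eigenvector directions matter):

```python
import numpy as np

def scms_step(v, Y, h=0.5, d=1):
    """One subspace-constrained mean-shift step.

    The ordinary mean-shift vector is projected onto the span of the D - d
    eigenvectors of the estimated Hessian with the smallest eigenvalues,
    so v moves only in the directions normal to the ridge.
    """
    D = Y.shape[1]
    u = (Y - v) / h
    w = np.exp(-0.5 * np.sum(u ** 2, axis=1))          # Gaussian kernel weights
    # Hessian of the kernel density estimate, up to a positive constant:
    # sum_i w_i [ (Y_i - v)(Y_i - v)^T / h^2 - I ]
    H = np.einsum('i,ij,ik->jk', w, u, u) - w.sum() * np.eye(D)
    vals, vecs = np.linalg.eigh(H)                     # eigenvalues ascending
    V = vecs[:, :D - d]                                # D - d smallest eigenvalues
    ms = (w[:, None] * Y).sum(axis=0) / w.sum() - v    # ordinary mean-shift vector
    return v + V @ (V.T @ ms)                          # projected (constrained) step

# Iterating scms_step from each mesh point traces out the ridge estimate.
```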
We can show that H(R̂, R) = O_P(n^{−2/(4+D)}). If instead we accept some bias, then we can set h = O(σ) and estimate the smoothed ridge R_h at an essentially dimension-independent rate, H(R̂_h, R_h) = O_P(n^{−1/2}). We are currently developing more of the theory. Here are two examples. [Figures: two examples of estimated ridges.]
But we need to denoise first or else ...
We want M̂ to have the same topological properties as M. Specifically, we want H(M̂) = H(M), where H(M) denotes the homology of M. What is homology? Here is the two-slide explanation:
Begin with a finite set of points and form p-simplices: a 0-simplex is a point, a 1-simplex an edge, a 2-simplex a triangle, and so on. A simplicial complex K is a collection of simplices such that:
1. If σ ∈ K and τ is a face of σ, then τ ∈ K.
2. If σ_1, σ_2 ∈ K, then σ_1 ∩ σ_2 is a face of both σ_1 and σ_2.
The Čech complex consists of all σ such that ∩_{x∈σ} B_ε(x) ≠ ∅.
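A small sketch of the Čech complex up to 2-simplices (the function names are mine). It uses the fact that ∩_{x∈σ} B_ε(x) ≠ ∅ exactly when the smallest ball enclosing the points of σ has radius ≤ ε:

```python
import numpy as np
from itertools import combinations

def min_enclosing_radius(P):
    """Radius of the smallest ball containing two or three points."""
    if len(P) == 2:
        return np.linalg.norm(P[0] - P[1]) / 2
    # Three points: if some side's diametral ball contains the third point
    # (the obtuse/degenerate case), that ball is smallest ...
    for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
        c = (P[i] + P[j]) / 2
        if np.linalg.norm(P[k] - c) <= np.linalg.norm(P[i] - P[j]) / 2:
            return np.linalg.norm(P[i] - P[j]) / 2
    # ... otherwise the answer is the circumradius abc / (4 * area).
    a, b, c = (np.linalg.norm(P[1] - P[2]), np.linalg.norm(P[0] - P[2]),
               np.linalg.norm(P[0] - P[1]))
    area = 0.25 * np.sqrt(max((a + b + c) * (-a + b + c) * (a - b + c) * (a + b - c), 0.0))
    return a * b * c / (4 * area)

def cech_complex(X, eps):
    """All simplices (up to dimension 2) whose eps-balls share a common point."""
    n = len(X)
    simplices = [(i,) for i in range(n)]
    for k in (2, 3):  # candidate edges and triangles
        for s in combinations(range(n), k):
            if min_enclosing_radius(X[list(s)]) <= eps:
                simplices.append(s)
    return simplices
```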
A chain is a "sum" of simplices; mod 2, σ_1 + σ_2 acts as a symmetric difference. Example: [a, b] + [b, c] = [a, c] (concatenating paths). If σ = [v_0, v_1, . . . , v_p], then the boundary of σ is
∂_p σ = [v_1, . . . , v_p] + [v_0, v_2, . . . , v_p] + [v_0, v_1, v_3, . . . , v_p] + · · ·
(each term omits one vertex). The boundary of a chain c = Σ_j σ_j is ∂_p c = Σ_j ∂_p σ_j.
[Figure: the filled triangle on vertices a, b, c.]
Example: K = {∅, [a], [b], [c], [a, b], [a, c], [b, c], [a, b, c]}.
∂_0[a] = ∂_0[b] = ∂_0[c] = ∅.
∂_1[a, b] = [a] + [b], ∂_1[a, c] = [a] + [c], ∂_1[b, c] = [b] + [c].
∂_2[a, b, c] = [b, c] + [a, c] + [b, a], and ∂_1∂_2[a, b, c] = ([b]+[c]) + ([a]+[c]) + ([a]+[b]) = ∅ mod 2: the boundary of a boundary is empty.
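The mod-2 cancellation in this example is easy to check by machine. A tiny sketch, treating a chain as a set of simplices (mod-2 addition = symmetric difference):

```python
from itertools import combinations

def boundary(chain):
    """Boundary of a chain over Z/2: symmetric difference of the faces
    obtained by dropping one vertex from each simplex."""
    out = set()
    for s in chain:
        for face in combinations(s, len(s) - 1):
            out ^= {face}   # mod-2 addition = symmetric difference
    return out

tri = {(0, 1, 2)}
print(boundary(tri))            # {(0, 1), (0, 2), (1, 2)}
print(boundary(boundary(tri)))  # set(): the boundary of a boundary is empty
```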
A p-chain is a "sum" of p-simplices; these form a group C_p.

symbol | name | intuition
C_p | chains | paths
Z_p | cycles | loops
B_p | boundary cycles | filled-in loops

with B_p ⊂ Z_p ⊂ C_p. Here ∂_p σ is the boundary of the simplex σ, and for a chain c = {σ_1, . . . , σ_k}, ∂c = Σ_j ∂σ_j. The boundary maps fit into a chain complex
· · · → C_{p+1} —∂_{p+1}→ C_p —∂_p→ C_{p−1} → · · ·
with Z_p = kernel(∂_p) and B_p = image(∂_{p+1}).
If z, z′ ∈ Z_p differ by a boundary cycle, z′ = z + b with b ∈ B_p, we consider them the same (you can deform one loop into the other).
[Figure: z ≈ z + b; from Zomorodian (2005).]
The equivalence classes of cycles in Z_p form H_p, the p-th homology group: H_p = Z_p/B_p = kernel ∂_p / image ∂_{p+1}. Roughly, H_p records the distinct loops. The Betti number β_p is the rank of H_p:
β_0 = number of clusters, β_1 = number of holes, β_2 = number of voids. H = {H_0, H_1, H_2, . . .}.
The Niyogi, Smale, Weinberger (2008) estimator:
1. Throw away low-density points.
2. For the remaining points, form the Čech complex.
The homology groups can be extracted from the Čech complex by matrix calculations. We use this (or a modified version) for the upper bounds, and Le Cam's lemma for the lower bounds.
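To make "matrix calculations" concrete, here is a sketch that computes Betti numbers over Z/2 from boundary matrices, using β_p = dim Z_p − dim B_p = (#p-simplices − rank ∂_p) − rank ∂_{p+1}. Simplices are sorted vertex tuples, such as those produced by the Čech sketch above:

```python
import numpy as np
from itertools import combinations

def boundary_matrix(simplices, p):
    """Matrix of the boundary map on p-chains over Z/2:
    rows index (p-1)-simplices, columns index p-simplices."""
    rows = [s for s in simplices if len(s) == p]       # (p-1)-simplices
    cols = [s for s in simplices if len(s) == p + 1]   # p-simplices
    idx = {s: i for i, s in enumerate(rows)}
    D = np.zeros((len(rows), len(cols)), dtype=np.uint8)
    for j, s in enumerate(cols):
        for face in combinations(s, p):                # drop one vertex at a time
            D[idx[face], j] = 1
    return D

def rank_gf2(D):
    """Rank of a 0/1 matrix over the field with two elements."""
    D, rank = D.copy(), 0
    for c in range(D.shape[1]):
        pivot = next((r for r in range(rank, D.shape[0]) if D[r, c]), None)
        if pivot is None:
            continue
        D[[rank, pivot]] = D[[pivot, rank]]            # swap pivot row into place
        for r in range(D.shape[0]):
            if r != rank and D[r, c]:
                D[r] ^= D[rank]                        # eliminate column c mod 2
        rank += 1
    return rank

def betti(simplices, p):
    """beta_p = (#p-simplices - rank boundary_p) - rank boundary_{p+1}."""
    n_p = sum(1 for s in simplices if len(s) == p + 1)
    rank_p = rank_gf2(boundary_matrix(simplices, p)) if p > 0 else 0
    return n_p - rank_p - rank_gf2(boundary_matrix(simplices, p + 1))

# The hollow triangle: one connected component and one loop
K = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
print(betti(K, 0), betti(K, 1))  # 1 1
```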
Minimax risk: R_n = inf_{Ĥ} sup_{Q∈Q} P(Ĥ ≠ H). Recall that κ is the condition number. Results:
Noiseless: R_n ≍ e^{−nκ^d}.
Clutter: R_n ≍ e^{−nπκ^d}.
Additive, with σ < κ/(8√D): R_n ≍ e^{−cnκ^d}.
(In the additive case, we need to deconvolve first, then use a Niyogi-Smale-Weinberger style estimator.)
The homology of the Čech complex is very sensitive to the choice of ε. In computational topology, one uses persistence diagrams, which show topological features as a function of ε. Our current work is on data-driven methods for choosing ε.
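For illustration, here is how one might inspect a persistence diagram in practice using the third-party ripser package (which computes Vietoris-Rips rather than Čech persistence, serving the same purpose; the toy circle data is mine):

```python
import numpy as np
from ripser import ripser  # third-party package: pip install ripser

# Sample noisy points from a circle; the persistence diagram should show
# one long-lived H1 feature (the loop) across a wide range of scales.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.05, (200, 2))

dgms = ripser(X)['dgms']                      # dgms[p] = birth/death pairs for H_p
persistence = dgms[1][:, 1] - dgms[1][:, 0]   # lifetimes of the H1 features
print(dgms[1][np.argmax(persistence)])        # the dominant loop
```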
Conclusions:
1. Manifolds show up in many problems.
2. Manifold estimation is a special case; more generally, we want to find structure in data.
3. Choosing tuning parameters is difficult and is the subject of our current work. (Even choosing the number of clusters is not a settled problem.)
4. Statisticians have a lot to learn from the computational geometry community, and vice versa.