Estimating Filaments and Manifolds
Larry Wasserman, Department of Statistics and Machine Learning Department, Carnegie Mellon University. June 2012.
Co-authors — Geometry: Chris Genovese, Marco Perone-Pacifico and Isa Verdinelli. Topology: Sivaraman Balakrishnan, Ale Rinaldo, Don Sheehy and Aarti Singh.
The geometric problem: find an estimate M̂ that is close to an unknown manifold M. The topological problem: find an estimate M̂ that has the same homology as an unknown manifold M. When the manifold is one-dimensional we call it a filament. We are not using manifolds for dimension reduction; we are interested in estimating the manifold itself.
Genovese, Perone-Pacifico, Verdinelli, Wasserman (2010) (arXiv:1003.5536, arXiv:1007.0549, arXiv:1109.4540). Rinaldo, Sheehy, Balakrishnan, Singh, Wasserman (2011).
Many datasets exhibit complex, low-dimensional structure. [Figures: more examples.]
In addition, high-dimensional datasets often have hidden structure that we would like to identify. Several distinct problems here, including: Dimension Reduction, Clustering, and Estimation.
Manifolds give a useful representation of low dimensional structure. A manifold is a space that looks locally like a Euclidean space of some dimension (called the dimension of the manifold).
Examples: a point (0-dim), filaments (1-dim), the surface of a sphere or torus (2-dim), the three-dimensional sphere (3-dim), space-time (4-dim).
To allow for intersections and other complexities, consider a union of manifolds embedded in R^D with maximal dimension d < D. We call this a d-dimensional manifold complex.
Outline:
1. The Geometric Problem
   a. Minimax Theory
   b. Methods
2. The Topological Problem
Four noise models:
1. Noiseless: Y_i ∼ G where support(G) = M.
2. Clutter: Y_i ∼ (1 − π)U + πG where U is uniform.
3. Perpendicular: Y_i = X_i + ε_i where X_i ∼ G and ε_i is perpendicular to M (Niyogi, Smale, Weinberger 2008).
4. Additive: Y_i = X_i + ε_i where ε_i ∼ Φ.
The loss is the Hausdorff distance H(M̂, M), where H(A, B) = inf{ε : A ⊂ B ⊕ ε and B ⊂ A ⊕ ε}, with A ⊕ ε = ∪_{x∈A} B(x, ε) and B(x, ε) = {y : ||x − y|| ≤ ε}. The minimax risk is
R_n = inf_{M̂} sup_{Q∈Q} E_Q H(M̂, M).
[Figure: two sets A and B with H(A, B) = max{2.5, 1.5} = 2.5.]
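Since the Hausdorff distance drives everything that follows, here is a minimal sketch of computing it between two finite point sets (for manifolds one would discretize first; the toy sets A and B below are mine, chosen to echo the max{2.5, 1.5} figure):

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between finite point sets A, B (arrays of shape (n, D)).

    For finite sets, H(A, B) = max( max_a min_b ||a - b||, max_b min_a ||a - b|| ),
    which agrees with the inf{eps : A in B + eps and B in A + eps} definition.
    """
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # pairwise ||a - b||
    d_AB = dists.min(axis=1).max()  # farthest that A strays from B
    d_BA = dists.min(axis=0).max()  # farthest that B strays from A
    return max(d_AB, d_BA)

# Toy sets: the two directed distances are 1.5 and 2.5
A = np.array([[0.0, 0.0]])
B = np.array([[0.0, 1.5], [0.0, -2.5]])
print(hausdorff(A, B))  # max{1.5, 2.5} = 2.5
```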
How much can M curve? Define ∆(M), the condition number (or reach) of M: the largest r such that every point within distance r of M has a unique projection onto M. A ball of radius greater than ∆(M) cannot roll freely along M, and a manifold with small ∆(M) is close to being self-intersecting. From Gonzalez and Maddocks (1999): a large value of ∆(M) corresponds to a manifold that is smooth and far from looping back on itself.
[Figure: curves probed by rolling circles of radius r; panels illustrate the cases κ > r, κ > 2r, κ < r, and κ < 2r.]
[Figure: a 2-d manifold in 3-d space.]
Related work:
- Computational geometry (e.g., Cheng et al. 2005, Dey 2006). Here "noise" does not have the statistical meaning of points drawn randomly from a distribution; instead, data must be close to M but not too close to each other. (There are a few notable exceptions.)
- Manifold learning (e.g., Ozertem and Erdogmus 2011). The primary focus here is on dimension reduction.
- Homology estimation (e.g., Niyogi, Smale, and Weinberger 2009). The focus is on topological rather than geometric information. (More on this later.)
- Filaments, principal curves, support estimation, . . . (e.g., Hastie and Stuetzle 1989, Tibshirani 1992, Arias-Castro et al.)
Define M ≡ M_κ = {M : ∆(M) ≥ κ} and Q = {Q_M : M ∈ M}, where
Q_M(A) = ∫ Φ(Y ∈ A | X = x) dG(x)
is the induced distribution of Y. Draw Y_1, . . . , Y_n IID from Q_M and estimate M with M̂ ≡ M̂_n. Goal: determine the minimax risk
R_n = inf_{M̂_n} sup_{Q∈Q} E_Q H(M̂_n, M),
at least up to rates, under Hausdorff loss.
Theorem (rates):
inf_{M̂_n} sup_{Q∈Q} E_Q H(M̂_n, M) ≍ C ψ_n

Noise model | ψ_n
Perpendicular, compact | n^{−2/(2+d)}
Clutter/noiseless | (nπ)^{−2/d}
Additive, compact/polynomial | in progress
Additive, sub-Gaussian | (log n)^{−1}

Note that these rates do not depend on the ambient dimension D. There are strong connections between the additive noise model and errors-in-variables regression, but also some notable differences.
The lower bounds will all be found using Le Cam's lemma. Let Y_1, . . . , Y_n ∼ Q ∈ Q, let ρ be a metric, and let θ̂ = θ̂(Y_1, . . . , Y_n) be any estimator. Then for any pair Q_0, Q_1 ∈ Q,
sup_{Q∈Q} E_{Q^n} ρ(θ̂, θ(Q)) ≥ ρ(θ(Q_0), θ(Q_1)) × [1 − TV(Q_0, Q_1)]^{2n},
where TV(Q_0, Q_1) = sup_A |Q_0(A) − Q_1(A)| = (1/2) ∫ |q_0 − q_1|. The key is to choose a least favorable pair Q_0, Q_1.
Perpendicular noise: draw X_1, . . . , X_n ∼ G, where support(G) = M, and Y_i = X_i + ε_i, where ε_i is uniform on the normal (perpendicular) space to M at X_i.
Theorem:
R_n = inf_{M̂} sup_{Q∈Q} E_Q H(M̂, M) ≍ n^{−2/(2+d)}.
Note that this rate depends on d but not on D.
Lower bound: for any estimator M̂,
sup_{Q∈Q} E_{Q^n}[H(M̂, M)] ≥ H(M_0, M_1) × (1 − TV)^{2n} = γ(1 − cγ^{(d+2)/2})^{2n}.
Setting γ = n^{−2/(d+2)} yields the result. The least favorable pair: M_0 = a plane and M_1 = a "flying saucer," the plane with a small dome of height γ pushed up. A naive bump of height γ has ∆(M) = 0, so the construction must be smoothed to keep ∆(M) = κ > 0.
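Filling in the arithmetic behind this choice of γ (a one-line check):

```latex
\text{With } \gamma = n^{-2/(d+2)}:\quad \gamma^{(d+2)/2} = n^{-1},
\qquad\text{so}\qquad
\gamma\bigl(1 - c\,\gamma^{(d+2)/2}\bigr)^{2n}
  = n^{-2/(d+2)}\Bigl(1 - \tfrac{c}{n}\Bigr)^{2n}
  \;\longrightarrow\; e^{-2c}\, n^{-2/(d+2)}.
```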
Upper bound: construct an "estimator" that achieves the bound.
1. Split the data into two halves.
2. Using the first half, construct a pilot estimator. This is a (sieve) maximum likelihood estimator.
3. Cover the pilot estimator with thin, long slabs S_1, . . . , S_N.
4. Fit local linear estimators M̂_j in each slab (using the second half of the data).
5. Take M̂ = ∪_j M̂_j.
The details are messy and the estimator is not practical, but it suffices for establishing the bound.
Clutter noise: Y_1, . . . , Y_n ∼ Q ≡ (1 − π)U + πG, where 0 < π ≤ 1, U is uniform on a compact set K ⊂ R^D, and the support of G is M. Then
inf_{M̂} sup_{Q∈Q} E_{Q^n}[H(M̂, M)] ≍* C (1/(nπ))^{2/d}.
The ≍* means I am hiding log factors. The lower bound uses the same least favorable pair.
Upper bound: let ε_n = (log n / n)^{2/d} and let Q̂_n be the empirical measure. For y ∈ M, let S_M(y) denote a thin slab centered at y, of side b_1 √ε_n in the d tangent directions and b_2 ε_n in the D − d normal directions (so S_M(y) has volume of order ε_n^{d/2} × ε_n^{D−d}). Define
ŝ(M) = inf_{y∈M} Q̂_n(S_M(y))
and let M̂ maximize ŝ(M) over M ∈ M.
Additive noise: X_1, X_2, . . . , X_n ∼ G where support(G) = M, and Y_i = X_i + Z_i, i = 1, . . . , n, where Z_i ∼ Φ = Gaussian. This is like the usual deconvolution problem except:
1. We want to estimate the support of G, not G itself.
2. G is singular.
3. The underlying object is a manifold, not a function.
For technical reasons, we allow the manifolds to be noncompact and define a truncated loss function, L(M̂, M) = H(M̂ ∩ K, M ∩ K). Lower bound:
inf_{M̂} sup_{Q∈Q} E_Q[L(M̂, M)] ≥ C / log n.
The rate is similar to deconvolution, but the proof is somewhat different (since Q_0 and Q_1 have different supports). [Figure: the least favorable pair.]
Upper bound: let ĝ be a deconvolution density estimator (even though G has no density) and let M̂ = {ĝ > λ}. Fix any 0 < δ < 1/2. Then
inf_{M̂} sup_{Q∈Q} E_Q[L(M̂, M)] ≤ C (1/log n)^{(1−δ)/2}.
In some special cases we can achieve the 1/log n rate, but in general we cannot.
The estimators we described are not practical; they are used simply to prove the upper bounds. A more practical approach is to use the ridges of the density q as a surrogate for M. Genovese, Perone-Pacifico, Verdinelli and Wasserman (2009) find the ridges by tracing mean-shift paths; Ozertem and Erdogmus (2011) find the ridges using projected mean-shift paths.
Before discussing ridges, let's review modes. Let m_1, . . . , m_k be the modes of a density q. For any point y there is an integral curve (gradient ascent curve) π leading from y to a mode m_j. Formally, π′(t) = ∇q(π(t)) and destination(y) = lim_{t→∞} π(t) = m_j. Let A_j = {y : destination(y) = m_j} be the ascending manifold for m_j. These partition the space.
Y_1, . . . , Y_n are sampled IID from Q = (1 − π)U + π(G ⋆ Φ_σ), the additive model with clutter. Let q be the density of Q, with gradient g(x) = ∇q(x) and Hessian H(x). Write H(x) = U(x)Λ(x)U(x)^T with eigenvalues ordered λ_1(x) ≥ · · · ≥ λ_D(x), and let V(x) be the D − d columns of U(x) corresponding to the smallest eigenvalues. Let L(x) = V(x)V(x)^T be the projector onto the local normal space defined by the columns of V(x). Define the vector field W(x) = L(x)g(x), and let Π denote the set of integral curves of W. Thus, π ∈ Π if and only if
π′(t) = W(π(t)) = L(π(t)) g(π(t)). (1)
The ridge R consists of the destinations of the integral curves: x ∈ R if lim_{t→∞} π(t) = x for some π ∈ Π. Under weak conditions, H(M, R) = O(σ) and R and M have the same topology.
[Figure: the manifold M, its ridge R, and the smoothed ridge R_h.]
Before finding the ridges, first suppose we wanted to find the modes of a kernel density estimator q̂. Select mesh points v_1, . . . , v_N and iterate, for t = 1, 2, 3, . . . :
v_j(t + 1) ← Σ_i Y_i K_h(v_j(t) − Y_i) / Σ_i K_h(v_j(t) − Y_i)
(the mean-shift update, with kernel K_h). Then the v_j(t) converge to the modes as t → ∞.
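A minimal sketch of this iteration with a Gaussian kernel (the bandwidth h, iteration count, and toy data are placeholder choices, not from the talk):

```python
import numpy as np

def mean_shift(Y, v, h=0.5, iters=200):
    """Iterate v <- sum_i Y_i w_i / sum_i w_i with Gaussian kernel weights
    w_i = exp(-||v - Y_i||^2 / (2 h^2)); the fixed point is a mode of the
    kernel density estimate."""
    for _ in range(iters):
        w = np.exp(-np.sum((Y - v) ** 2, axis=1) / (2 * h ** 2))
        v = (w[:, None] * Y).sum(axis=0) / w.sum()
    return v

# Two-cluster toy data: mesh points flow to the two modes
rng = np.random.default_rng(0)
Y = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(3, 0.3, size=(50, 2))])
print(mean_shift(Y, np.array([0.5, 0.5])))  # converges near (0, 0)
print(mean_shift(Y, np.array([2.5, 2.5])))  # converges near (3, 3)
```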
In Genovese, Perone-Pacifico, Verdinelli and Wasserman (2009), we used the trajectories {v_j(t) : t = 1, 2, 3, . . .} to trace out the ridges. Ozertem and Erdogmus (2011) give a more direct approach: project each mean-shift step onto the space spanned by the D − d eigenvectors of the Hessian with the smallest eigenvalues. This is SCMS (Subspace Constrained Mean Shift).
Steps (a sketch of one SCMS iteration follows this list):
1. Estimation: estimate the density q, its gradient g, and its Hessian H.
2. Denoising: remove background clutter and low-probability regions, restricting attention to a set where q̂ is not too small.
3. Mean-Shift: apply the SCMS algorithm within the restricted set.
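A sketch of a single SCMS iteration consistent with these steps (Gaussian kernel; the bandwidth h and ridge dimension d are placeholder parameters, and positive constants in the Hessian estimate are dropped since only the eigenvector directions matter):

```python
import numpy as np

def scms_step(v, Y, h=0.5, d=1):
    """One subspace-constrained mean-shift step.

    The ordinary mean-shift vector is projected onto the span of the D - d
    eigenvectors of the estimated Hessian with the smallest eigenvalues,
    so v moves only in the directions normal to the ridge.
    """
    D = Y.shape[1]
    u = (Y - v) / h
    w = np.exp(-0.5 * np.sum(u ** 2, axis=1))          # Gaussian kernel weights
    # Hessian of the kernel density estimate, up to a positive constant:
    # sum_i w_i [ (Y_i - v)(Y_i - v)^T / h^2 - I ]
    H = np.einsum('i,ij,ik->jk', w, u, u) - w.sum() * np.eye(D)
    vals, vecs = np.linalg.eigh(H)                     # eigenvalues ascending
    V = vecs[:, :D - d]                                # D - d smallest eigenvalues
    ms = (w[:, None] * Y).sum(axis=0) / w.sum() - v    # ordinary mean-shift vector
    return v + V @ (V.T @ ms)                          # projected (constrained) step

# Iterating scms_step from each mesh point traces out the ridge estimate.
```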
We can show that H(R̂, R) = O_P(n^{−2/(4+D)}). If instead we accept some bias, then we can set h = O(σ) and estimate the smoothed ridge R_h at an essentially dimension-independent rate, H(R̂_h, R_h) = O_P(n^{−1/2}). We are currently developing more of the theory. Here are two examples. [Figures: two examples of estimated ridges.]
But we need to denoise first or else ...
We want M̂ to have the same topological properties as M. Specifically, we want H(M̂) = H(M), where H(M) denotes the homology of M. What is homology? Here is the two-slide explanation:
Begin with a finite set of points and form p-simplices: a 0-simplex is a point, a 1-simplex an edge, a 2-simplex a triangle, and so on. A simplicial complex K is a collection of simplices such that:
1. If σ ∈ K and τ is a face of σ, then τ ∈ K.
2. If σ_1, σ_2 ∈ K, then σ_1 ∩ σ_2 is a face of both σ_1 and σ_2.
The Čech complex consists of all σ such that ∩_{x∈σ} B_ε(x) ≠ ∅.
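A small sketch of the Čech complex up to 2-simplices (the function names are mine). It uses the fact that ∩_{x∈σ} B_ε(x) ≠ ∅ exactly when the smallest ball enclosing the points of σ has radius ≤ ε:

```python
import numpy as np
from itertools import combinations

def min_enclosing_radius(P):
    """Radius of the smallest ball containing two or three points."""
    if len(P) == 2:
        return np.linalg.norm(P[0] - P[1]) / 2
    # Three points: if some side's diametral ball contains the third point
    # (the obtuse/degenerate case), that ball is smallest ...
    for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
        c = (P[i] + P[j]) / 2
        if np.linalg.norm(P[k] - c) <= np.linalg.norm(P[i] - P[j]) / 2:
            return np.linalg.norm(P[i] - P[j]) / 2
    # ... otherwise the answer is the circumradius abc / (4 * area).
    a, b, c = (np.linalg.norm(P[1] - P[2]), np.linalg.norm(P[0] - P[2]),
               np.linalg.norm(P[0] - P[1]))
    area = 0.25 * np.sqrt(max((a + b + c) * (-a + b + c) * (a - b + c) * (a + b - c), 0.0))
    return a * b * c / (4 * area)

def cech_complex(X, eps):
    """All simplices (up to dimension 2) whose eps-balls share a common point."""
    n = len(X)
    simplices = [(i,) for i in range(n)]
    for k in (2, 3):  # candidate edges and triangles
        for s in combinations(range(n), k):
            if min_enclosing_radius(X[list(s)]) <= eps:
                simplices.append(s)
    return simplices
```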
A chain is a "sum" of simplices; mod 2, σ_1 + σ_2 acts as a symmetric difference. Example: [a, b] + [b, c] = [a, c] (concatenating paths). If σ = [v_0, v_1, . . . , v_p], then the boundary of σ is
∂_p σ = [v_1, . . . , v_p] + [v_0, v_2, . . . , v_p] + [v_0, v_1, v_3, . . . , v_p] + · · ·
(each term omits one vertex). The boundary of a chain c = Σ_j σ_j is ∂_p c = Σ_j ∂_p σ_j.
[Figure: the filled triangle on vertices a, b, c.]
Example: K = {∅, [a], [b], [c], [a, b], [a, c], [b, c], [a, b, c]}.
∂_0[a] = ∂_0[b] = ∂_0[c] = ∅.
∂_1[a, b] = [a] + [b], ∂_1[a, c] = [a] + [c], ∂_1[b, c] = [b] + [c].
∂_2[a, b, c] = [b, c] + [a, c] + [b, a], and ∂_1∂_2[a, b, c] = ([b]+[c]) + ([a]+[c]) + ([a]+[b]) = ∅ mod 2: the boundary of a boundary is empty.
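The mod-2 cancellation in this example is easy to check by machine. A tiny sketch, treating a chain as a set of simplices (mod-2 addition = symmetric difference):

```python
from itertools import combinations

def boundary(chain):
    """Boundary of a chain over Z/2: symmetric difference of the faces
    obtained by dropping one vertex from each simplex."""
    out = set()
    for s in chain:
        for face in combinations(s, len(s) - 1):
            out ^= {face}   # mod-2 addition = symmetric difference
    return out

tri = {(0, 1, 2)}
print(boundary(tri))            # {(0, 1), (0, 2), (1, 2)}
print(boundary(boundary(tri)))  # set(): the boundary of a boundary is empty
```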
A p-chain is a "sum" of p-simplices; these form a group C_p.

symbol | name | intuition
C_p | chains | paths
Z_p | cycles | loops
B_p | boundary cycles | filled-in loops

with B_p ⊂ Z_p ⊂ C_p. Here ∂_p σ is the boundary of the simplex σ, and for a chain c = {σ_1, . . . , σ_k}, ∂c = Σ_j ∂σ_j. The boundary maps fit into a chain complex
· · · → C_{p+1} —∂_{p+1}→ C_p —∂_p→ C_{p−1} → · · ·
with Z_p = kernel(∂_p) and B_p = image(∂_{p+1}).
If z, z′ ∈ Z_p differ by a boundary cycle, z′ = z + b with b ∈ B_p, we consider them the same (you can deform one loop into the other).
[Figure: z ≈ z + b; from Zomorodian (2005).]
The equivalence classes of cycles in Z_p form H_p, the p-th homology group: H_p = Z_p/B_p = kernel ∂_p / image ∂_{p+1}. Roughly, H_p records the distinct loops. The Betti number β_p is the rank of H_p:
β_0 = number of clusters, β_1 = number of holes, β_2 = number of voids. H = {H_0, H_1, H_2, . . .}.
The Niyogi, Smale, Weinberger (2008) estimator:
1. Throw away low-density points.
2. For the remaining points, form the Čech complex.
The homology groups can be extracted from the Čech complex by matrix calculations. We use this (or a modified version) for the upper bounds, and Le Cam's lemma for the lower bounds.
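To make "matrix calculations" concrete, here is a sketch that computes Betti numbers over Z/2 from boundary matrices, using β_p = dim Z_p − dim B_p = (#p-simplices − rank ∂_p) − rank ∂_{p+1}. Simplices are sorted vertex tuples, such as those produced by the Čech sketch above:

```python
import numpy as np
from itertools import combinations

def boundary_matrix(simplices, p):
    """Matrix of the boundary map on p-chains over Z/2:
    rows index (p-1)-simplices, columns index p-simplices."""
    rows = [s for s in simplices if len(s) == p]       # (p-1)-simplices
    cols = [s for s in simplices if len(s) == p + 1]   # p-simplices
    idx = {s: i for i, s in enumerate(rows)}
    D = np.zeros((len(rows), len(cols)), dtype=np.uint8)
    for j, s in enumerate(cols):
        for face in combinations(s, p):                # drop one vertex at a time
            D[idx[face], j] = 1
    return D

def rank_gf2(D):
    """Rank of a 0/1 matrix over the field with two elements."""
    D, rank = D.copy(), 0
    for c in range(D.shape[1]):
        pivot = next((r for r in range(rank, D.shape[0]) if D[r, c]), None)
        if pivot is None:
            continue
        D[[rank, pivot]] = D[[pivot, rank]]            # swap pivot row into place
        for r in range(D.shape[0]):
            if r != rank and D[r, c]:
                D[r] ^= D[rank]                        # eliminate column c mod 2
        rank += 1
    return rank

def betti(simplices, p):
    """beta_p = (#p-simplices - rank boundary_p) - rank boundary_{p+1}."""
    n_p = sum(1 for s in simplices if len(s) == p + 1)
    rank_p = rank_gf2(boundary_matrix(simplices, p)) if p > 0 else 0
    return n_p - rank_p - rank_gf2(boundary_matrix(simplices, p + 1))

# The hollow triangle: one connected component and one loop
K = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
print(betti(K, 0), betti(K, 1))  # 1 1
```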
Minimax risk: R_n = inf_{Ĥ} sup_{Q∈Q} P(Ĥ ≠ H). Recall that κ is the condition number. Results:
Noiseless: R_n ≍ e^{−nκ^d}.
Clutter: R_n ≍ e^{−nπκ^d}.
Additive, with σ < κ/(8√D): R_n ≍ e^{−cnκ^d}.
(In the additive case, we need to deconvolve first, then use a Niyogi-Smale-Weinberger style estimator.)
The homology of the Čech complex is very sensitive to the choice of ε. In computational topology, one uses persistence diagrams, which show topological features as a function of ε. Our current work is on data-driven methods for choosing ε.
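For illustration, here is how one might inspect a persistence diagram in practice using the third-party ripser package (which computes Vietoris-Rips rather than Čech persistence, serving the same purpose; the toy circle data is mine):

```python
import numpy as np
from ripser import ripser  # third-party package: pip install ripser

# Sample noisy points from a circle; the persistence diagram should show
# one long-lived H1 feature (the loop) across a wide range of scales.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.05, (200, 2))

dgms = ripser(X)['dgms']                      # dgms[p] = birth/death pairs for H_p
persistence = dgms[1][:, 1] - dgms[1][:, 0]   # lifetimes of the H1 features
print(dgms[1][np.argmax(persistence)])        # the dominant loop
```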
Conclusions:
1. Manifolds show up in many problems.
2. Manifold estimation is a special case; more generally, we want to find structure in data.
3. Choosing tuning parameters is difficult and is the subject of our current work. (Even choosing the number of clusters is not a settled problem.)
4. Statisticians have a lot to learn from the computational geometry community, and vice versa.