Linear Manifold Clustering
Robert Haralick and Rave Harpaz
Outline

- Background
- The linear manifold cluster model
- The linear manifold clustering algorithm
- Linear manifold modeling
- Linear manifold subspace correlation clustering
- Conclusion
Background
Clustering is the process of classifying a collection of patterns into classes, called clusters, so that the patterns within a cluster are "similar" to one another, yet "dissimilar" to patterns in other clusters. Every clustering technique makes implicit assumptions about:

- the shape of the clusters
- the similarity criterion
- the grouping technique
Cluster Models
[Figure: cluster shapes found in a database: hyper-spherical, hyper-ellipsoidal, arbitrarily shaped, linear, and nonlinear clusters]
K-Means Hyper-Spherical Clusters
1. Choose K points at random to be the cluster centers.
2. Assign each point to its closest cluster center.
3. Make the new cluster centers be the cluster means.
4. Iterate steps 2-3 until the assignments stop changing.
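A minimal NumPy sketch of these four steps (our own illustrative implementation, not the authors' code; it assumes no cluster empties out during iteration):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means for an (N, d) data array X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]    # 1. random centers
    for _ in range(n_iter):
        # 2. assign each point to its closest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. new centers are the cluster means
        #    (assumes every cluster keeps at least one point)
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):            # 4. iterate until stable
            break
        centers = new_centers
    return labels, centers
```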
K-Means Clusters
Subspace Clustering
Definition
Subspace clustering produces clusters which are compact on a subset of dimensions aligned with the coordinate axes and not compact on the orthogonal complement of those dimensions.
[Figure: a cluster shown in the full (x, y, z) space and in its x-z subspace projection]

Subspace clustering handles:
- high-dimensional data
- irrelevant features
Pattern and Correlation Clustering
[Figure: parallel-coordinate view of patterns across eight features]

Object similarity is no longer measured by physical distance, but by the behavior patterns objects manifest or the magnitude of the correlations they induce.

Problem statement: identify groups of points that exhibit coherent behavior patterns across a subset of the measurement features.
Pattern and Correlation Clustering - Applications
- Gene-expression microarray analysis: identify groups of genes that exhibit similar expression patterns under some subset of conditions, from which gene function or regulatory mechanisms may be inferred.
- Collaborative filtering / recommendation systems: identify sets of customers/users with similar interest patterns so that future interests can be predicted and proper recommendations made.
- Dimensionality reduction by correlation.
- Finance: identify groups of stocks that show similar price fluctuations over a certain time period.
Linear Manifold Clusters
Definition
$L$ is a linear manifold of a vector space $V$ if and only if, for some subspace $S$ of $V$ and translation $t \in V$,

$$L = \{x \in V \mid x = t + s \text{ for some } s \in S\}.$$

The dimension of $L$ is the dimension of $S$, and if the dimension of $L$ is one less than the dimension of $V$, then $L$ is called a hyperplane. In other words, a linear manifold is a subspace that may have been shifted away from the origin; a subspace is a linear manifold that contains the origin.
Dense Linear Manifold Clusters
[Figure: three dense linear manifold clusters C1, C2, C3 in 3-D]
The Linear Manifold Cluster Model
The linear manifold cluster model has the following properties:

- The points in each cluster are embedded in a lower-dimensional linear manifold.
- The intrinsic dimensionality of the cluster is the dimensionality of the linear manifold.
- The manifold is arbitrarily oriented.
- The points in the cluster induce a correlation among two or more attributes (or linear combinations of attributes) of the data.
- In the orthogonal complement space to the manifold, the points form a compact, densely populated region, which can be used to cluster the data.
Comment
Classical clustering algorithms such as K-means assume that each cluster is associated with a zero-dimensional manifold (its center), and therefore omit the possibility that a cluster may have a non-zero-dimensional linear manifold associated with it.
The Range Space of a Matrix
Suppose $B$ is a matrix with columns $b_1, b_2, \ldots, b_N$, and $x = (x_1, x_2, \ldots, x_N)'$ is a vector. Let $y = Bx$. Then

$$y = Bx = \sum_{n=1}^{N} x_n b_n,$$

so $y$ is a linear combination of the columns of $B$.
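A quick numerical check of this identity (our own snippet):

```python
import numpy as np

B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 3.0]])                  # columns b1, b2
x = np.array([4.0, 5.0])
y = B @ x                                   # y = B x
assert np.allclose(y, 4.0 * B[:, 0] + 5.0 * B[:, 1])  # y = sum_n x_n b_n
```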
The Linear Manifold Cluster Model
Each point $x$ in a $k$-dimensional linear manifold cluster is modeled by

$$x = \mu + B\phi + \bar{B}\epsilon$$

where

- $x$: $d \times 1$ random vector
- $\mu$: $d \times 1$ translation vector in $\mathbb{R}^d$
- $B$: $d \times k$ matrix
- $\bar{B}$: $d \times (d - k)$ matrix, with $B'\bar{B} = 0$
- $\phi$: $k \times 1$ random vector $\sim U(-R, R)$
- $\epsilon$: $(d - k) \times 1$ random vector $\sim N(0, \Sigma)$, with $|\Sigma|$ small

[Figure: a cluster spread along manifold basis vectors $b_1, b_2$, with small noise along the normal direction $b_3$, translated by $\mu$]
Linear Manifold Cluster Model

$$E[x] = E[\mu + B\phi + \bar{B}\epsilon] = E[\mu] + B\,E[\phi] + \bar{B}\,E[\epsilon] = \mu,$$

since $E[\phi] = 0$ and $E[\epsilon] = 0$.
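As a concrete illustration, here is a NumPy sketch that generates points from this model (our own hypothetical generator; the name sample_lm_cluster and the parameter choices are ours):

```python
import numpy as np

def sample_lm_cluster(n, d, k, R=10.0, noise=0.05, seed=0):
    """Sample n points from x = mu + B*phi + Bbar*eps.

    B spans a random k-D manifold and Bbar its orthogonal complement;
    phi ~ U(-R, R)^k spreads points along the manifold, and
    eps ~ N(0, noise^2 I) adds small off-manifold noise.
    """
    rng = np.random.default_rng(seed)
    # Random orthonormal basis of R^d via QR; split into B and Bbar.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    B, Bbar = Q[:, :k], Q[:, k:]           # B' Bbar = 0 by construction
    mu = rng.uniform(-R, R, size=d)        # translation of the manifold
    phi = rng.uniform(-R, R, size=(n, k))
    eps = rng.normal(0.0, noise, size=(n, d - k))
    return mu + phi @ B.T + eps @ Bbar.T   # (n, d) array of points
```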
Orthogonal Projection
Definition
Let $V$ be a vector space and $W$ any subspace of $V$. Represent a vector $v \in V$ as $v = w + w^{\perp}$, where $w \in W$ and $w^{\perp} \in W^{\perp}$. Then $w$ is called the orthogonal projection of $v$ onto $W$, and $w^{\perp}$ is the orthogonal projection of $v$ onto $W^{\perp}$.
Theorem
Let $V$ be a vector space and $W$ any subspace of $V$. Let $B$ be a matrix whose columns constitute an orthonormal basis of $W$, and let $v \in V$ satisfy $v = w + w^{\perp}$, where $w \in W$ and $w^{\perp} \in W^{\perp}$. Then

$$w = BB'v.$$
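A small numerical illustration of the theorem (our own snippet):

```python
import numpy as np

# Orthonormal basis of the x-y plane inside R^3.
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
v = np.array([3.0, 4.0, 5.0])
w = B @ B.T @ v                    # orthogonal projection of v onto W
assert np.allclose(w, [3.0, 4.0, 0.0])
```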
Singular Value Decomposition
Definition
The singular value decomposition of a real matrix $X_{N \times K}$ is the factoring of $X$ as

$$X_{N \times K} = U_{N \times N}\,\Lambda_{N \times K}\,V'_{K \times K}$$

where $UU' = I$, $VV' = I$, and $\Lambda$ is rectangular diagonal.
Thin Singular Value Decomposition
Definition
The thin singular value decomposition of a real matrix $X_{N \times K}$, $K < N$, is the factoring of $X$ as

$$X_{N \times K} = U_K\,\Lambda_K\,V'$$

where $U_K$ is $N \times K$ with $U_K'U_K = I_{K \times K}$, $V$ is $K \times K$ with $VV' = I_{K \times K}$, and $\Lambda_K$ is $K \times K$ diagonal.
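In NumPy, the thin SVD is what np.linalg.svd returns with full_matrices=False; a quick check of the identities above (our own snippet):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))                 # N = 100, K = 5, K < N

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # thin SVD: U is 100 x 5
assert np.allclose(U.T @ U, np.eye(5))            # U_K' U_K = I
assert np.allclose(Vt @ Vt.T, np.eye(5))          # V V' = I
assert np.allclose(U @ np.diag(s) @ Vt, X)        # X = U_K Lambda_K V'
```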
Orthonormal Basis of Subspace
Theorem
Let $X_{N \times K}$ have columns which span a $K$-dimensional subspace $W$, and let the thin singular value decomposition of $X$ be

$$X_{N \times K} = U_K\,\Lambda_K\,V'.$$

Then $U_K U_K' X = X$.

Proof.

$$U_K U_K' X = U_K U_K' (U_K \Lambda_K V') = U_K (U_K' U_K) \Lambda_K V' = U_K \Lambda_K V' = X.$$
Distance To Linear Manifold
Theorem
Let a linear manifold $L$ be represented by $L = \{z \mid z = \mu + B\phi\}$, where $\mu$ is a vector that translates the origin to the manifold and the columns of $B$ are orthonormal. Then the Euclidean distance of $x$ to $L$ is given by

$$\rho(x, L) = \|(I - BB')(x - \mu)\|.$$

Proof.

- $BB'$ is the orthogonal projection operator onto the subspace spanned by the columns of $B$.
- $I - BB'$ is the orthogonal projection operator onto the orthogonal complement of the subspace spanned by the columns of $B$.
- $(I - BB')(x - \mu)$ is the projection of $x - \mu$ onto the orthogonal complement of the linear manifold $L$.
- $\|(I - BB')(x - \mu)\|$ is therefore the distance of $x$ to the manifold $L$.
Distance To Linear Manifold
Proposition
Let $B$ be a matrix whose columns are orthonormal. Then

$$\|(I - BB')y\|^2 = \|y\|^2 - \|B'y\|^2.$$

Proof.

$$
\begin{aligned}
\|(I - BB')y\|^2 &= \|y - BB'y\|^2 \\
&= (y - BB'y)'(y - BB'y) \\
&= y'y - 2y'BB'y + y'(BB')(BB')y \\
&= y'y - 2y'BB'y + y'(B(B'B)B')y \\
&= y'y - 2y'BB'y + y'(BB')y \\
&= y'y - y'BB'y \\
&= \|y\|^2 - \|B'y\|^2.
\end{aligned}
$$
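This proposition gives an efficient way to evaluate $\rho(x, L)$; a NumPy sketch (our own function, not the authors' code):

```python
import numpy as np

def dist_to_manifold(X, mu, B):
    """Distance of each row of X (n, d) to the manifold {mu + B phi}.

    B (d, k) has orthonormal columns.  Uses the proposition
    ||(I - BB')y||^2 = ||y||^2 - ||B'y||^2  with y = x - mu,
    so the d x d projector I - BB' is never formed.
    """
    Y = X - mu
    sq = (Y ** 2).sum(axis=1) - ((Y @ B) ** 2).sum(axis=1)
    return np.sqrt(np.maximum(sq, 0.0))   # clip tiny negatives from round-off
```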
The Linear Manifold Clustering Algorithm
[Figure: clusters C1, C2, C3 and the distance histogram with its threshold]

Outline: a stochastic model-fitting technique.

1. Sample trial linear manifolds of various dimensions.
2. Compute distance histograms of the data to each trial manifold.
3. Of all the manifolds sampled, select the one whose associated histogram shows the best separation between a mode near zero and the rest of the data.
4. Partition the data based on the best separation.
5. Repeat the procedure on each block of the partitioned data.
How Are Trial Linear Manifolds Sampled?
To construct a $k$-dimensional linear manifold we need to sample $k + 1$ points.

Constructing a 2-D linear manifold $x = \mu + B\phi$ from sampled points $x_0, x_1, x_2$:

$$\mu = x_0, \qquad X = (x_1 - x_0,\; x_2 - x_0) = U_K \Lambda_K V', \qquad B = U_K.$$
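A NumPy sketch of this sampling step (our reading of the slide; the function name is ours):

```python
import numpy as np

def sample_trial_manifold(data, k, rng):
    """Sample k+1 points and return (mu, B) for a k-D trial manifold."""
    idx = rng.choice(len(data), k + 1, replace=False)
    x0, rest = data[idx[0]], data[idx[1:]]
    # Columns of X span the manifold's direction subspace.
    X = (rest - x0).T                             # d x k
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return x0, U                                  # mu = x0, B = U_K (orthonormal)
```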
Selecting the best trial manifold/best separation
[Figure: three trial manifolds and their distance histograms]

To compute a separation score we first need to identify the two classes, or distributions, involved. This problem is cast as a histogram-thresholding problem: the left and right sides of the histogram are estimated parametrically.
Bayesian Minimum Error Thresholding
[Figure: classification error for a mixture of two Gaussians, with threshold T]

Find the threshold $T$ to minimize

$$P(\text{error}; T) = \int_{x > T} p(x \mid c_1)\,P(c_1)\,dx + \int_{x \le T} p(x \mid c_2)\,P(c_2)\,dx$$

where

$$p(x \mid c_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\,e^{-\frac{1}{2}\left(\frac{x - \mu_1}{\sigma_1}\right)^2}, \qquad p(x \mid c_2) = \frac{1}{\sqrt{2\pi}\,\sigma_2}\,e^{-\frac{1}{2}\left(\frac{x - \mu_2}{\sigma_2}\right)^2}.$$
Kittler and Illingworth Minimum Error Thresholding (1986)
Let the range of the distance to the manifold be divided into $N$ bins, each of width $\Delta$. For any threshold $T = n\Delta$, define

$$p_1(T) = \sum_{i=0}^{n} h(i\Delta) \qquad p_2(T) = \sum_{i=n+1}^{N-1} h(i\Delta)$$

$$\mu_1(T) = \frac{\sum_{i=0}^{n} i\Delta\, h(i\Delta)}{p_1(T)} \qquad \mu_2(T) = \frac{\sum_{i=n+1}^{N-1} i\Delta\, h(i\Delta)}{p_2(T)}$$

$$\sigma_1^2(T) = \frac{\sum_{i=0}^{n} (i\Delta - \mu_1(T))^2\, h(i\Delta)}{p_1(T)} \qquad \sigma_2^2(T) = \frac{\sum_{i=n+1}^{N-1} (i\Delta - \mu_2(T))^2\, h(i\Delta)}{p_2(T)}$$

For any distance $\delta$ to the manifold, define $P(\delta \mid 1) = P(\delta \mid \mu_1, \sigma_1)$ and $P(\delta \mid 2) = P(\delta \mid \mu_2, \sigma_2)$.
Kittler and Illingworth Minimum Error
For quantized distance $m\Delta$, the probability of correct classification is

$$P_c(m\Delta, T) = \begin{cases} P(m\Delta \mid \mu_1(T), \sigma_1(T))\,P_1(T) & m\Delta \le T \\[4pt] P(m\Delta \mid \mu_2(T), \sigma_2(T))\,P_2(T) & m\Delta > T \end{cases} = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\sigma_1(T)}\, e^{-\frac{1}{2}\left(\frac{m\Delta - \mu_1(T)}{\sigma_1(T)}\right)^2} P_1(T) & m\Delta \le T \\[8pt] \dfrac{1}{\sqrt{2\pi}\,\sigma_2(T)}\, e^{-\frac{1}{2}\left(\frac{m\Delta - \mu_2(T)}{\sigma_2(T)}\right)^2} P_2(T) & m\Delta > T \end{cases}$$

$$-2\log P_c(m\Delta, T, 1) = \log(2\pi) + 2\log\sigma_1(T) + \left(\frac{m\Delta - \mu_1(T)}{\sigma_1(T)}\right)^2 - 2\log P_1(T), \quad m\Delta \le T$$

$$-2\log P_c(m\Delta, T, 2) = \log(2\pi) + 2\log\sigma_2(T) + \left(\frac{m\Delta - \mu_2(T)}{\sigma_2(T)}\right)^2 - 2\log P_2(T), \quad m\Delta > T$$
Kittler and Illingworth Minimum Error
Find the $T$ to minimize

$$
\begin{aligned}
J(T) &= \sum_{m=0}^{n}\left[2\log\sigma_1(T) + \left(\frac{m\Delta - \mu_1(T)}{\sigma_1(T)}\right)^2 - 2\log P_1(T)\right] h(m\Delta) \\
&\quad + \sum_{m=n+1}^{N-1}\left[2\log\sigma_2(T) + \left(\frac{m\Delta - \mu_2(T)}{\sigma_2(T)}\right)^2 - 2\log P_2(T)\right] h(m\Delta) \\
&= 2\log\sigma_1(T)\sum_{m=0}^{n} h(m\Delta) + \frac{1}{\sigma_1^2(T)}\sum_{m=0}^{n}(m\Delta - \mu_1(T))^2\, h(m\Delta) - 2\log P_1(T)\sum_{m=0}^{n} h(m\Delta) \\
&\quad + 2\log\sigma_2(T)\sum_{m=n+1}^{N-1} h(m\Delta) + \frac{1}{\sigma_2^2(T)}\sum_{m=n+1}^{N-1}(m\Delta - \mu_2(T))^2\, h(m\Delta) - 2\log P_2(T)\sum_{m=n+1}^{N-1} h(m\Delta) \\
&= 2P_1(T)\log\sigma_1(T) + P_1(T) - 2P_1(T)\log P_1(T) + 2P_2(T)\log\sigma_2(T) + P_2(T) - 2P_2(T)\log P_2(T) \\
&= 1 + 2\left[P_1(T)\log\sigma_1(T) + P_2(T)\log\sigma_2(T) - P_1(T)\log P_1(T) - P_2(T)\log P_2(T)\right]
\end{aligned}
$$
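A compact NumPy sketch of minimizing this criterion over a histogram (our own implementation of the final expression for J(T); distances are measured in bin widths, i.e. Δ = 1):

```python
import numpy as np

def kittler_illingworth(h):
    """Return (bin index, J value) minimizing J(T) for histogram counts h."""
    h = h / h.sum()                        # normalize to a probability mass
    bins = np.arange(len(h), dtype=float)  # distances in units of Delta = 1
    best_J, best_t = np.inf, None
    for n in range(1, len(h) - 1):         # candidate thresholds T = n * Delta
        p1, p2 = h[:n + 1].sum(), h[n + 1:].sum()
        if p1 <= 0 or p2 <= 0:
            continue
        mu1 = (bins[:n + 1] * h[:n + 1]).sum() / p1
        mu2 = (bins[n + 1:] * h[n + 1:]).sum() / p2
        v1 = ((bins[:n + 1] - mu1) ** 2 * h[:n + 1]).sum() / p1   # sigma_1^2
        v2 = ((bins[n + 1:] - mu2) ** 2 * h[n + 1:]).sum() / p2   # sigma_2^2
        if v1 <= 0 or v2 <= 0:             # degenerate one-bin side; skip
            continue
        J = 1 + 2 * (p1 * 0.5 * np.log(v1) + p2 * 0.5 * np.log(v2)
                     - p1 * np.log(p1) - p2 * np.log(p2))
        if J < best_J:
            best_J, best_t = J, n
    return best_t, best_J
```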
Selecting The Best Trial Manifold
[Figure: classification error for a mixture of two Gaussians with threshold T; the curve J(t), whose depth is measured between the minimum τ and the neighboring maximum τ'; example distance histograms]

Find $T$ to maximize

$$\text{discriminability}(T) = \frac{(\mu_1(T) - \mu_2(T))^2}{\sigma_1^2(T) + \sigma_2^2(T)}\,\bigl[J(T') - J(T)\bigr]$$

where $J(T') - J(T)$ is the depth of the minimum of the criterion $J$ at $T$ relative to the neighboring local maximum $T'$.
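Continuing the sketch, a scoring helper that multiplies discriminability by the depth of the J minimum (our reading of the slide; the helper and its arguments are ours):

```python
import numpy as np

def separation_score(h, t, J_t, J_neighbor):
    """Score threshold bin t: discriminability times the depth of J.

    h is the distance histogram, J_t the criterion value at t, and
    J_neighbor the value at the nearby local maximum tau'.  Assumes
    both sides of the histogram at t are non-empty.
    """
    h = h / h.sum()
    bins = np.arange(len(h), dtype=float)
    p1, p2 = h[:t + 1].sum(), h[t + 1:].sum()
    mu1 = (bins[:t + 1] * h[:t + 1]).sum() / p1
    mu2 = (bins[t + 1:] * h[t + 1:]).sum() / p2
    v1 = ((bins[:t + 1] - mu1) ** 2 * h[:t + 1]).sum() / p1
    v2 = ((bins[t + 1:] - mu2) ** 2 * h[t + 1:]).sum() / p2
    disc = (mu1 - mu2) ** 2 / (v1 + v2)    # discriminability
    depth = J_neighbor - J_t               # depth of the J(T) minimum
    return disc * depth
```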
Probability All Draws From Same Cluster
Suppose there are $C$ clusters, each of about the same size. What is the probability $p$ that in $k + 1$ random draws, all will be from the same cluster?

$$p = C^{-k}$$
Number of Trials
Each trial consists of $k + 1$ draws, and $p$ is the probability that in $k + 1$ random draws all will be from the same cluster. What is the probability that in $S$ trials, each of $k + 1$ draws, at least one trial will have all of its draws from the same cluster?

$$P(\text{Success}) = 1 - (1 - p)^S$$

We want $P(\text{Success}) \ge 1 - \epsilon$:

$$1 - (1 - p)^S \ge 1 - \epsilon \;\Rightarrow\; (1 - p)^S \le \epsilon \;\Rightarrow\; S \log(1 - p) \le \log \epsilon \;\Rightarrow\; S \ge \frac{\log \epsilon}{\log(1 - p)}$$
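A small helper to compute the required number of trials (our own):

```python
import math

def num_trials(C, k, eps=0.01):
    """Trials needed so that, with probability >= 1 - eps, at least one
    trial draws all k+1 points from one of C equal-sized clusters."""
    p = C ** float(-k)                    # p = C^(-k)
    return math.ceil(math.log(eps) / math.log(1.0 - p))

# Example: C = 5 clusters, k = 2 (three draws per trial)
# num_trials(5, 2) -> 113
```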
A Run of the Algorithm

[Figure: successive iterations on the three clusters C1, C2, C3; at each iteration the distance histogram is thresholded and the data are partitioned]
Empirical Evaluation - Accuracy
Each algorithm column gives accuracy and running time (h:mm:ss).

dataset   size    clusters  dim  LM dim   LMCLUS          ORCLUS           DBSCAN           HPPC
D1        3000    3         4    2-3      0.95  0:00:08   0.80  0:00:22    0.34  0:00:09    0.72  0:00:51
D2        3000    3         20   13-17    0.98  0:00:33   0.59  0:02:18    0.65  0:00:36    0.97  0:01:39
D3        30000   4         30   1-4      1.00  0:15:38   0.65  1:05:30    1.00  1:31:52    0.99  0:01:32
D4        6000    3         30   4-12     0.99  0:09:22   0.98  0:08:20    0.66  0:03:49    0.97  0:00:12
D5        4000    3         100  2-3      1.00  0:00:20   0.88  0:54:30    0.65  0:05:24    0.99  0:03:54
D6        90000   3         10   1-2      0.99  0:00:29   1.00  0:29:02    0.67  4:58:49    1.00  0:01:23
D7        5000    4         10   2-6      0.99  0:02:05   0.99  0:02:41    0.74  0:00:54    0.96  0:00:35
D8        10000   5         50   1-4      0.99  0:01:42   0.63  1:33:52    1.00  0:17:00    0.99  0:03:43
D9        80000   8         30   2-7      0.99  3:12:46   0.96  13:30:30   1.00  10:51:15   0.99  0:04:57
D10       5000    5         3    1-2      0.86  0:00:48   0.68  0:00:45    0.59  0:00:05    0.78  0:00:33
⋆D11      1500    3         3    1        0.98  0:00:01   0.99  0:00:10    0.43  0:00:02    0.33  0:00:52
⋆D12      1500    3         3    2        0.97  0:00:02   0.99  0:00:11    0.34  0:00:02    0.33  0:00:26
⋆D13      1500    3         7    3        0.97  0:00:05   0.99  0:00:17    0.33  0:00:04    0.33  0:00:34
⋆D14      5000    5         20   4        0.99  0:05:46   1.00  0:10:42    0.21  0:01:39    0.20  0:01:30
⋆D15      4000    4         50   3        0.99  0:09:14   1.00  0:25:52    0.25  0:02:34    0.25  0:03:20
Summary

alg      # data sets with accuracy ≥ 0.85   able to cluster ⋆   time rank
LMCLUS   15                                 +                   1.5
ORCLUS   10                                 +                   10
DBSCAN   3                                  -                   9
HPPC     8                                  -                   1
Efficiency and Scalability
[Figure: scalability, time (seconds) vs. number of points, for LMCLUS, ORCLUS, and DBSCAN]

[Figure: scalability, time (seconds) vs. number of dimensions, for LMCLUS, ORCLUS, and DBSCAN]
alg      complexity
LMCLUS   O(N²K²L³d)
ORCLUS   O(K³ + KNd + K²d³)
DBSCAN   O(N²d)
HPPC     O(Nd)
Handwritten Digit Recognition 3823 × 64 (UCI MLR)
Each digit is a 32 × 32 bitmap represented as a feature vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{i64})$.

alg      even digits accuracy   odd digits accuracy
LMCLUS   0.99                   0.87
ORCLUS   0.85                   0.83
DBSCAN   0.82                   0.58
HPPC     0.50                   0.93

even: 5 clusters; odd: 7 clusters (digits 1 and 9 split into two clusters each); LM dim: 1-2.
E3D Point Cloud Segmentation (ALPHATECH Inc.)
Time Series Clustering 600 × 60 (UCI KDD Archive)
Output (600 series, 100 per true class: normal, cyclic, increasing trend, decreasing trend, upward shift, downward shift):

cluster   composition (counts by true class)   total
C1        57                                   57
C2        80 + 1                               81
C3        43 + 99                              142
C4        20 + 98                              118
C5        99                                   99
C6        41                                   41
C7        23                                   23
C8        1 + 36 + 1 + 1                       39
total                                          600

alg      accuracy
LMCLUS   0.89
ORCLUS   0.50
DBSCAN   0.68
HPPC     0.64
Publications
1. Linear manifold clustering in high dimensional spaces by stochastic search (with Robert Haralick). Pattern Recognition (2007), vol. 40(10), pp. 2672-2684.
2. Linear Manifold Correlation Clustering (with Robert Haralick). Invited paper, International Journal of Information Technology and Intelligent Computing (2007), vol. 2, no. 2.
3. Mining Subspace Correlations (with Robert Haralick). In Proceedings of the IEEE Symposium on Computational Intelligence and Data Mining (CIDM 2007), pp. 335-342.
4. Exploiting the Geometry of Gene Expression Patterns for Unsupervised Learning (with Robert Haralick). In Proceedings of the 18th International Conference on Pattern Recognition (ICPR 2006), vol. 2, pp. 670-674.
5. Linear Manifold Clustering (with Robert Haralick). In Proceedings of the International Conference on Data Mining and Machine Learning (MLDM 2005), Lecture Notes in Computer Science, Springer Verlag, LNAI 3587, pp. 132-141.
6. Linear Manifold Embedding of Pattern Clusters (with Robert Haralick). DIMACS Workshop on Detecting and Processing Regularities in High Throughput Biological Data, 2005.