Semi-Supervised Kernel Mean Shift Clustering
Motivation: Need for Supervision
- Data may not form clusters in the input space.
- Mapping to a different space helps.
- Pairwise constraints can guide clustering to find the desired structure.
- The mapping function is not always known.
[Figure: constraint points c1, c2 mapped to a feature space (x, x²) as φ(c1), φ(c2); the constraint vector φ(c2) − φ(c1) guides the projection. Panels: Input Points, Projection, Clustering.]
Previous Work: Kernel Mean Shift

- Given data points $x_i \in \mathcal{X}$ and a p.s.d. kernel $k(\cdot,\cdot)$ satisfying $k(x, y) = \phi(x)^\top \phi(y)$, where $\phi$ is an unknown mapping to a feature space.
- For n points $x_1, \ldots, x_n$, the n × n kernel matrix K is computed using the kernel function: $K_{ij} = k(x_i, x_j)$.
- The distance between two points $\phi(x_i)$ and $\phi(x_j)$ in the feature space can be computed implicitly using the kernel matrix: $\|\phi(x_i) - \phi(x_j)\|^2 = K_{ii} + K_{jj} - 2K_{ij}$ (see the sketch below).

Tuzel, O., Porikli, F., Meer, P. Kernel methods for weakly supervised mean shift clustering. ICCV, 48–55, 2009.
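As a minimal illustration, the implicit computation takes a few lines of NumPy (the function names and the generic `kernel` argument are illustrative, not from the paper):

```python
import numpy as np

def kernel_matrix(X, kernel):
    """n x n kernel matrix K_ij = k(x_i, x_j) for a p.s.d. kernel k."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def feature_dist2(K, i, j):
    """Squared feature-space distance ||phi(x_i) - phi(x_j)||^2,
    computed implicitly as K_ii + K_jj - 2 K_ij."""
    return K[i, i] + K[j, j] - 2.0 * K[i, j]
```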
Previous Work: Kernel Mean Shift

- Every point on a mean shift trajectory lies in the span of the mapped data, so it can be written as $y = \Phi \alpha_y$ with $\Phi = [\phi(x_1), \ldots, \phi(x_n)]$. The mean shift update can then be written using the kernel matrix alone: $\|y - \phi(x_i)\|^2 = \alpha_y^\top K \alpha_y + e_i^\top K e_i - 2\,\alpha_y^\top K e_i$, where $e_i$ denotes the i-th canonical basis vector of $\mathbb{R}^n$, and the updated point is the kernel-weighted mean $\alpha_{\bar y} \propto \sum_i e_i\, g\!\left(\|y - \phi(x_i)\|^2 / h^2\right)$ (sketched below).
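A compact sketch of the resulting fixed-point iteration, assuming a Gaussian profile and illustrative function names:

```python
import numpy as np

def kernel_mean_shift(K, h, iters=100, tol=1e-8):
    """Kernel mean shift sketch: each trajectory point y = Phi @ alpha is
    tracked via its coefficient vector alpha in R^n. K: (n, n) kernel
    matrix, h: bandwidth; Gaussian profile assumed."""
    n = K.shape[0]
    A = np.eye(n)                  # row i: alpha for trajectory of x_i (starts at e_i)
    diagK = np.diag(K).copy()
    for _ in range(iters):
        A_new = np.empty_like(A)
        for i in range(n):
            a = A[i]
            Ka = K @ a
            d2 = a @ Ka + diagK - 2.0 * Ka    # d^2(y_i, phi(x_j)) for all j
            w = np.exp(-d2 / (2.0 * h**2))    # Gaussian profile weights
            A_new[i] = w / w.sum()            # y <- sum_j w_j phi(x_j) / sum_j w_j
        if np.max(np.abs(A_new - A)) < tol:
            return A_new
        A = A_new
    return A   # rows that converge to the same vector share a mode (cluster)
```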
Previous Work: Kernel Learning Using Linear Projections

- Let $\{(j_1, k_1), \ldots, (j_m, k_m)\}$ be the set of similarity constraint pairs; the constraint matrix is $A = [\phi(x_{j_1}) - \phi(x_{k_1}), \ldots, \phi(x_{j_m}) - \phi(x_{k_m})]$, whose columns are the constraint difference vectors.
- A transformation $\phi(x) \mapsto P\,\phi(x)$ is defined using the projection matrix $P = I - A(A^\top A)^{+} A^\top$, which projects the feature space onto the null space of the constraint vectors, so both points of each similarity pair map to the same image.
[Figure: feature space projection along the constraint vector φ(c2) − φ(c1).]
Previous Work: Kernel Learning Using Linear Projections

- The corresponding learned kernel function is $\hat k(x, y) = (P\,\phi(x))^\top (P\,\phi(y)) = \phi(x)^\top P\,\phi(y)$, since P is a symmetric idempotent projection.
- The learned kernel matrix can be expressed in terms of the original kernel as $\hat K = K - K_A\, S\, K_A^\top$, where $K_A = KC$ is n × m with each column corresponding to a constraint pair in the kernel matrix (C has columns $e_{j_i} - e_{k_i}$), and $S = (C^\top K C)^{+}$ is the m × m scaling matrix (see the sketch below).
Limitations: Kernel Learning Using Linear Projections
- No dissimilarity constraints.
- No relaxation of constraints: prone to overfitting.
- Sensitive to labeling errors in the training samples.

[Figure: five concentric circles example: initial data with constraint points; clustering using clean training data; clustering after adding one mislabeled constraint (black link).]
Bregman Divergences

- For a strictly convex function $\varphi$, the Bregman divergence between real, symmetric n × n matrices X and Y is defined as $D_\varphi(X, Y) = \varphi(X) - \varphi(Y) - \operatorname{tr}\!\big(\nabla\varphi(Y)^\top (X - Y)\big)$.
- Examples of Bregman divergences: $\varphi(X) = \|X\|_F^2$ gives the squared Frobenius norm, $\varphi(x) = \sum_i x_i \log x_i$ gives the K-L divergence, and $\varphi(x) = \|x\|^2$ gives the squared Euclidean distance.
- The log det divergence is a Bregman divergence for the convex function $\varphi(X) = -\log\det X$: $D_{\ell d}(X, Y) = \operatorname{tr}(XY^{-1}) - \log\det(XY^{-1}) - n$ (a sketch follows).
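A direct NumPy transcription of the LogDet divergence (illustrative name; assumes X and Y are symmetric positive definite):

```python
import numpy as np

def logdet_div(X, Y):
    """LogDet divergence D_ld(X, Y) = tr(X Y^{-1}) - log det(X Y^{-1}) - n
    for symmetric positive definite X, Y."""
    n = X.shape[0]
    M = np.linalg.solve(Y, X)          # Y^{-1} X: same trace and det as X Y^{-1}
    sign, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - n
```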
Log Det Divergence: Properties
- Nonnegative scalar function: $D_{\ell d}(X, Y) \ge 0$, with equality iff X = Y. It is not a metric, since it does not satisfy the triangle inequality.
- Transformation invariance: $D_{\ell d}(M^\top X M,\, M^\top Y M) = D_{\ell d}(X, Y)$ for any n × n invertible matrix M (verified numerically below).
- Defined only for positive semidefinite matrices.
Kernel Learning Using the Log Det Divergence

- The kernel is learned by minimizing the log det divergence to the initial kernel, $\min_{\hat K \succeq 0} D_{\ell d}(\hat K, K)$, subject to both similarity constraints ($\hat d^2(x_j, x_k) \le u$ for similarity pairs) and dissimilarity constraints ($\hat d^2(x_j, x_k) \ge \ell$ for dissimilarity pairs). This addresses all the limitations of the linear projections method.
Kernel Learning Using the Log Det Divergence

- For each constraint pair (j, k), the optimization is solved by Bregman projection based updates of the form $\hat K_{t+1} = \hat K_t + \beta_t\, \hat K_t (e_j - e_k)(e_j - e_k)^\top \hat K_t$.
- The updates are cycled over the constraints and repeated until convergence (a simplified sketch follows the reference).
- If the initial kernel matrix K has rank r ≤ n, it can be factored as $K = G G^\top$ with G an n × r matrix, and the update can be rewritten as an update of G.
- The scalar variable $\beta_t$ is computed in each iteration using the kernel matrix and the constraint pair.
- The n × r matrix G is updated using the Cholesky decomposition.
- The final learned kernel matrix is $\hat K = G G^\top$.

Jain, P., Kulis, B., Davis, J. V., Dhillon, I. S. Metric and kernel learning using a linear transformation. JMLR, 13:519–547, 2012.
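A heavily simplified sketch of the cyclic updates; the full algorithm of Jain et al. (2012) additionally maintains dual variables and slack on the bounds, which this sketch omits:

```python
import numpy as np

def learn_kernel_logdet(K0, sim, dis, u, l, iters=100):
    """Simplified Bregman-projection sketch. sim/dis: lists of (j, k)
    constraint pairs; u, l: upper/lower bounds on squared distances."""
    K = K0.copy()
    n = K.shape[0]
    constraints = [(j, k, u, True) for j, k in sim] + \
                  [(j, k, l, False) for j, k in dis]
    for _ in range(iters):
        for j, k, target, is_sim in constraints:
            v = np.zeros(n); v[j], v[k] = 1.0, -1.0
            Kv = K @ v
            p = v @ Kv                           # current squared distance
            violated = (p > target) if is_sim else (p < target)
            if violated and p > 1e-12:
                beta = (target - p) / p**2       # projection onto the bound
                K = K + beta * np.outer(Kv, Kv)  # K <- K + beta K v v^T K
    return K
```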
Low rank kernel learning algorithm
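As a rough illustration of the low-rank form, here is how the n × r factor absorbs the rank-one update through a Cholesky factor, so the full n × n matrix is never formed (`lowrank_update` is an illustrative name):

```python
import numpy as np

def lowrank_update(G, j, k, beta):
    """Low-rank form of the rank-one update. With K = G G^T (G is n x r) and
    v = e_j - e_k:  K + beta (K v)(K v)^T = G (I_r + beta w w^T) G^T, w = G^T v.
    The middle factor is Cholesky-decomposed (valid while 1 + beta ||w||^2 > 0),
    so the learned kernel keeps the factored form K_hat = G G^T."""
    w = G[j] - G[k]                              # w = G^T (e_j - e_k), r-vector
    L = np.linalg.cholesky(np.eye(len(w)) + beta * np.outer(w, w))
    return G @ L                                 # updated n x r factor
```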
Kernel Learning Using the Log Det Divergence

- For very large datasets it is infeasible to learn the entire kernel matrix or to store it in memory.
- The method generalizes to out-of-sample points, where x or y or both are out of sample.
- Distances can be computed using the learned kernel function, which (following Jain et al., 2012) takes the form $\hat k(x, y) = k(x, y) + \mathbf{k}_x^\top K^{+} (\hat K - K) K^{+} \mathbf{k}_y$, with $\mathbf{k}_x = (k(x, x_1), \ldots, k(x, x_n))^\top$ over the n in-sample points (see the sketch below).
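A sketch of the out-of-sample evaluation under the formula above (illustrative names; `kernel` is the original kernel function):

```python
import numpy as np

def learned_kernel_fn(kernel, X, K, K_hat):
    """Returns k_hat(x, y) = k(x, y) + k_x^T K^+ (K_hat - K) K^+ k_y,
    the out-of-sample extension following Jain et al. (2012)."""
    Kp = np.linalg.pinv(K)
    M = Kp @ (K_hat - K) @ Kp                # precomputed n x n middle factor
    def k_hat(x, y):
        kx = np.array([kernel(x, xi) for xi in X])
        ky = np.array([kernel(y, xi) for xi in X])
        return kernel(x, y) + kx @ M @ ky
    return k_hat
```

Learned feature-space distances then follow exactly as before: $\hat k(x, x) + \hat k(y, y) - 2\,\hat k(x, y)$.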
Semi-Supervised Kernel Mean Shift Clustering
- Input:
  – Unlabeled data
  – Pairwise constraints
  – Number of expected clusters
- Output:
  – Clusters and labels
Kernel Parameter Selection σ

- A Gaussian kernel function $k(x, y) = \exp\!\big(-\|x - y\|^2 / 2\sigma^2\big)$ is used for the initial kernel matrix, $K_{ij} = k(x_i, x_j)$.
- The kernel parameter σ is estimated using the log det divergence between the kernel-induced distances and the desired distances on the constraint pairs (see the sketch below).
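A hypothetical selection loop in this spirit; the squared-error score stands in for the paper's LogDet-based criterion, and all names are illustrative:

```python
import numpy as np

def gaussian_K(X, sigma):
    """Initial kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def select_sigma(X, pairs, d_target, sigmas):
    """Pick the sigma whose kernel-induced squared distances on the
    constraint pairs best match the desired distances."""
    def score(sigma):
        K = gaussian_K(X, sigma)
        d = np.array([K[j, j] + K[k, k] - 2.0 * K[j, k] for j, k in pairs])
        return np.sum((d - d_target)**2)
    return min(sigmas, key=score)
```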
Low Rank Representation

- A low-rank approximation $K_r$ of the initial kernel matrix is computed.
- Learning with the log det divergence is used to find the learned kernel $\hat K$.
- The mean shift parameters are estimated from the curves.
- The trade-off parameter was determined by cross-validation.
Experimental Evaluation
- Two synthetic data sets:
  – Olympic circles (5 classes)
  – Concentric circles (10 classes)
  Both are nonlinearly separable and can have intersecting boundaries.
- Four real data sets:
  – Small number of classes: USPS (10 classes), MIT Scene (8 classes)
  – Large number of classes: PIE faces (68 classes), Caltech objects (50 classes)
Comparisons
1. Efficient and exhaustive constraint propagation for spectral clustering.
2. Semi-supervised kernel k-means.
3. Kernel k-means using Bregman divergences.

All of these methods must be given the number of clusters as input.

1. Lu, Z., Ip, H. H. S. Constrained spectral clustering via exhaustive and efficient constraint propagation. ECCV, 1–14, 2010.
2. Kulis, B., Basu, S., Dhillon, I. S., Mooney, R. J. Semi-supervised graph clustering: a kernel approach. Machine Learning, 74:1–22, 2009.
Evaluation Criterion: Adjusted Rand Index

- A scalar measure to evaluate clustering performance from the clustering output.
- Counting over all point pairs, the Rand index is $RI = \frac{TP + TN}{TP + TN + FP + FN}$, where TP = true positive, TN = true negative, FP = false positive, and FN = false negative pairs.
- The adjusted version, $ARI = \frac{RI - \mathbb{E}[RI]}{\max RI - \mathbb{E}[RI]}$, compensates for chance; randomly assigned cluster labels get a score close to zero (see the sketch below).
Pairwise Constraint Generation
- b labeled points are selected at random from each class.
- All b(b−1)/2 similarity pairs are generated from each class.
- An equal number of dissimilarity pairs is also generated (see the sketch below).
- The value of b is varied.
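A sketch of the generation procedure under these assumptions (illustrative names; `labels_by_class` maps each class to its labeled point indices):

```python
import random
from itertools import combinations

def make_constraints(labels_by_class, b, rng=random.Random(0)):
    """Sample b labeled points per class; all b(b-1)/2 within-class pairs
    become similarity constraints, and an equal number of across-class
    pairs are drawn as dissimilarity constraints."""
    chosen = {c: rng.sample(pts, b) for c, pts in labels_by_class.items()}
    sim = [p for pts in chosen.values() for p in combinations(pts, 2)]
    classes = list(chosen)
    dis = set()
    while len(dis) < len(sim):
        c1, c2 = rng.sample(classes, 2)
        dis.add((rng.choice(chosen[c1]), rng.choice(chosen[c2])))
    return sim, list(dis)
```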
Synthetic Example 1: Olympic Circles
- 300 points along each of the five circles.
- 25 points per class selected at random.
- Experiment 1:
  – Varied the number of labeled points per class [5, 7, 10, 12, 15, 17, 20, 25] to generate pairwise constraints.

[Figure: original data, sample constraints, and clustering result.]
Synthetic Example 1: Olympic Circles
- Experiment 2:
  – 20 labeled points per class.
  – Labeling errors introduced by swapping similarity pairs with dissimilarity pairs.
  – Varied the fraction of mislabeled constraints.
Synthetic Example 2: Concentric Circles
- 100 points along each of the ten concentric circles.
- Experiment 1:
  – Varied the number of labeled points per class [5, 7, 10, 12, 15, 17, 20, 25] to generate pairwise constraints.
- Experiment 2:
  – 25 labeled points per class.
  – Labeling errors introduced by swapping similarity pairs with dissimilarity pairs.
Real Example 1: USPS Digits
- Ten classes with 1100 points per class, for a total of 11000 points.
- 100 points per class → 1000 × 1000 initial kernel matrix K.
- Varied the number of labeled points per class [5, 7, 10, 12, 15, 17, 20, 25] to generate pairwise constraints.
- All 11000 data points are clustered by generalizing to the remaining 10000 points: ARI = 0.7529 ± 0.051.

[Figure: 11000 × 11000 pairwise distance matrix (PDM).]
Real Example 2: MIT Scene
- Eight classes with 2688 points in total; the number of samples per class ranges between 260 and 410.
- 100 points per class → 800 × 800 initial kernel matrix K.
- Varied the number of labeled points per class [5, 7, 10, 12, 15, 17, 20] to generate pairwise constraints.
- All 2688 data points are clustered by generalizing to the remaining 1888 points.
Real Example 3: PIE Faces
- 68 subjects with 21 samples per subject.
- 1428 × 1428 full initial kernel matrix K.
- Varied the number of labeled points per class [3, 4, 5, 6, 7] to generate pairwise constraints.
- Obtained perfect clustering with more than 5 labeled points per class.
Real Example 4: Caltech-101 (subset)
- 50 categories, with the number of samples per class ranging between 31 and 40.
- 1959 × 1959 full initial kernel matrix K.
- Varied the number of labeled points per class [5, 7, 10, 12, 15] to generate pairwise constraints.