SLIDE 1

Semi-Supervised Kernel Mean Shift Clustering

A Semi-Supervised Clustering Approach

SLIDE 2


Motivation: Need for Supervision

  • Data may not form clusters in the input space.
  • Mapping to a different space helps.
  • Pairwise constraints can guide clustering to find the desired structure.
  • The mapping function is not always known.

[Figure: toy example with input points x and constraint points c1, c2; mapping to the feature space (x, x²); constraint vector φ(c2)-φ(c1); projection; clustering]

SLIDE 3
  • Given data points x_i and a p.s.d. kernel function satisfying k(x, y) = φ(x)ᵀφ(y), where φ is an unknown mapping to a feature space.
  • For n points x_1, ..., x_n, the n x n kernel matrix K is computed using the kernel function: K_ij = k(x_i, x_j).
  • The distance between two points φ(x_i) and φ(x_j) in the feature space can be computed implicitly using the kernel matrix: ||φ(x_i) - φ(x_j)||² = K_ii + K_jj - 2K_ij.

Tuzel, O., Porikli, F., Meer, P., "Kernel methods for weakly supervised mean shift clustering". ICCV, 48–55, 2009.

Previous Work: Kernel Mean Shift
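Not from the slides: a minimal sketch of building the kernel matrix for a generic kernel function and computing feature-space distances implicitly from it; the helper names and the polynomial kernel choice are illustrative.

```python
import numpy as np

def kernel_matrix(X, kernel):
    """n x n kernel matrix K with K[i, j] = kernel(x_i, x_j)."""
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

def feature_space_dist_sq(K, i, j):
    """||phi(x_i) - phi(x_j)||^2 = K_ii + K_jj - 2 K_ij, computed without knowing phi."""
    return K[i, i] + K[j, j] - 2.0 * K[i, j]

# Example with a polynomial kernel; phi is never constructed explicitly.
X = np.random.default_rng(0).normal(size=(5, 2))
poly = lambda x, y: (1.0 + x @ y) ** 2
K = kernel_matrix(X, poly)
print(feature_space_dist_sq(K, 0, 1))
```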

SLIDE 4
  • The mean shift update can be written using this kernel matrix: each feature-space point is represented as ȳ = Φα_ȳ with Φ = [φ(x_1) ... φ(x_n)], its squared distance to φ(x_i) is α_ȳᵀKα_ȳ + e_iᵀKe_i - 2α_ȳᵀKe_i, and the weighted-mean update is applied to the coefficient vector α_ȳ; e_i denotes the i-th canonical basis vector of Rⁿ.

Previous Work: Kernel Mean Shift
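A hedged sketch of a single kernel mean shift step acting on the coefficient vector, assuming a flat (truncated uniform) profile and a single bandwidth h; the names and the profile choice are illustrative, not the authors' implementation.

```python
import numpy as np

def kernel_mean_shift_step(K, alpha, h):
    """One mean shift step in the feature space, done implicitly via the kernel matrix K.

    alpha: length-n coefficient vector representing the current point y = Phi @ alpha.
    Returns the coefficients of the weighted mean of the points within bandwidth h.
    """
    # Squared feature-space distances d2[i] = ||y - phi(x_i)||^2, from K only.
    d2 = alpha @ K @ alpha + np.diag(K) - 2.0 * (K @ alpha)
    # Flat kernel profile: points within bandwidth h get weight 1, others 0.
    w = (d2 <= h ** 2).astype(float)
    if w.sum() == 0:
        return alpha
    return w / w.sum()   # new point = Phi @ (w / sum(w))

# Each data point starts at itself: alpha = e_i, the i-th canonical basis vector.
```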

SLIDE 5
  • Let {(c_{j,1}, c_{j,2})}, j = 1, ..., m, be the set of similarity constraint pairs; the constraint matrix A = [φ(c_{1,1}) - φ(c_{1,2}), ..., φ(c_{m,1}) - φ(c_{m,2})] collects the constraint vectors in the feature space.
  • A transformation is defined using the projection matrix P = I - A(AᵀA)⁺Aᵀ, which projects the feature space onto the null space of the constraint vectors so that the points of each constrained pair coincide.

Previous Work: Kernel Learning Using Linear Projections

[Figure: feature space projection along the constraint vector φ(c2)-φ(c1)]

SLIDE 6
  • The corresponding learned kernel function is k̂(x, y) = φ(x)ᵀPφ(y).
  • The learned kernel matrix can be expressed in terms of the original kernel as K̂ = K - K_c S⁺ K_cᵀ, where K_c = KC is n x m with one column per constraint pair (the difference of the corresponding columns of the kernel matrix) and S = CᵀKC is the m x m scaling matrix.

Previous Work: Kernel Learning Using Linear Projections
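A minimal sketch of this learned kernel matrix, under the assumption (consistent with the projection described above) that the similarity-constraint directions are projected out; the symbols C and S and the function name are illustrative.

```python
import numpy as np

def learned_kernel_linear_projection(K, constraints):
    """Learned kernel matrix after projecting out the similarity-constraint directions.

    K: n x n initial kernel matrix.
    constraints: list of index pairs (i, j) that should cluster together.
    Returns K_hat = K - (K C) S^+ (K C)^T, with C the n x m matrix whose columns are
    e_i - e_j and S = C^T K C the m x m scaling matrix.
    """
    n = K.shape[0]
    m = len(constraints)
    C = np.zeros((n, m))
    for col, (i, j) in enumerate(constraints):
        C[i, col] = 1.0
        C[j, col] = -1.0
    KC = K @ C                      # n x m, one column per constraint pair
    S = C.T @ K @ C                 # m x m scaling matrix
    return K - KC @ np.linalg.pinv(S) @ KC.T

# After the projection, the feature-space distance of every constrained pair is zero:
# K_hat[i, i] + K_hat[j, j] - 2 * K_hat[i, j] == 0 for each (i, j) in constraints.
```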


SLIDE 7

Limitations: Kernel Learning Using Linear Projections

  • No dissimilarity constraints.
  • No relaxation of constraints
    – Prone to overfitting.
  • Sensitive to labeling errors in training samples.

[Figure: example with five concentric circles; initial data with constraint points, clustering using clean training data, and clustering after adding one mislabeled constraint (black link)]

SLIDE 8

  • For a strictly convex function φ, the Bregman divergence between real, symmetric n x n matrices X and Y is defined as D_φ(X, Y) = φ(X) - φ(Y) - tr(∇φ(Y)ᵀ(X - Y)).
  • Examples of Bregman divergences:
    – φ(X) = ||X||²_F gives the squared Frobenius norm,
    – φ(x) = Σ_i x_i log x_i gives the K-L divergence,
    – φ(x) = ||x||² gives the squared Euclidean distance.
  • The log det divergence is the Bregman divergence for the convex function φ(X) = -log det X:  D_ld(X, Y) = tr(XY⁻¹) - log det(XY⁻¹) - n.

Bregman Divergences
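A small sketch computing the log det divergence between two symmetric positive definite matrices from the formula above; for illustration only.

```python
import numpy as np

def logdet_divergence(X, Y):
    """D_ld(X, Y) = tr(X Y^-1) - log det(X Y^-1) - n, for symmetric positive definite X, Y."""
    n = X.shape[0]
    XYinv = X @ np.linalg.inv(Y)
    sign, logdet = np.linalg.slogdet(XYinv)
    return np.trace(XYinv) - logdet - n

# D_ld(X, X) == 0, and the divergence is asymmetric in general.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
B = np.eye(2)
print(logdet_divergence(A, B), logdet_divergence(B, A))
```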

SLIDE 9

Log Det Divergence: Properties

  • Nonnegative scalar function: D_ld(X, Y) ≥ 0, and D_ld(X, Y) = 0 iff X = Y. It is not a metric since it does not satisfy the triangle inequality.
  • Transformation invariance: D_ld(MᵀXM, MᵀYM) = D_ld(X, Y) for any n x n invertible matrix M.
  • Defined only for positive semidefinite matrices.
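A quick numerical check of the transformation-invariance property; the matrices are arbitrary and the helper is a stripped-down version of the divergence above.

```python
import numpy as np

def logdet_div(X, Y):
    n = X.shape[0]
    XYinv = X @ np.linalg.inv(Y)
    return np.trace(XYinv) - np.linalg.slogdet(XYinv)[1] - n

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n)); X = A @ A.T + n * np.eye(n)   # symmetric positive definite
B = rng.normal(size=(n, n)); Y = B @ B.T + n * np.eye(n)
M = rng.normal(size=(n, n))                                 # invertible with probability 1

print(np.isclose(logdet_div(X, Y), logdet_div(M.T @ X @ M, M.T @ Y @ M)))   # True
```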

SLIDE 10
  • Kernel learning using the log det divergence incorporates both similarity (S) and dissimilarity (D) constraint sets.
  • This addresses all the limitations of the linear projections method.

Kernel Learning Using the Log Det Divergence

SLIDE 11
  • For each constraint, the optimization is solved by Bregman projection based updates of the form K_{t+1} = K_t + β_t K_t (e_i - e_j)(e_i - e_j)ᵀ K_t.
  • Updates are repeated until convergence.
  • If the initial kernel matrix K has rank r ≤ n, then K = GGᵀ for an n x r matrix G, and the update can be rewritten directly in terms of G.
  • The scalar variable β_t is computed in each iteration using the kernel matrix and the constraint pair.
  • The n x r matrix G is updated using the Cholesky decomposition.
  • The final learned kernel matrix is K̂ = ĜĜᵀ.

Kernel Learning Using the Log Det Divergence

Jain, P., et al., "Metric and kernel learning using a linear transformation". JMLR, 13:519–547, 2012.
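A minimal sketch of cyclic Bregman projection updates on the full kernel matrix, assuming hard distance targets u (similarity) and l (dissimilarity) and omitting the slack/relaxation and the low-rank Cholesky-factor update of the actual algorithm; all names are illustrative.

```python
import numpy as np

def learn_kernel_logdet(K0, sim, dis, u, l, n_sweeps=50):
    """LogDet kernel learning by cyclic Bregman projections (simplified sketch).

    K0:  n x n initial kernel matrix.
    sim: similarity pairs (i, j) whose feature-space distance should be <= u.
    dis: dissimilarity pairs (i, j) whose feature-space distance should be >= l.
    """
    K = K0.copy()
    n = K.shape[0]
    for _ in range(n_sweeps):
        for (i, j), target, is_sim in [(c, u, True) for c in sim] + [(c, l, False) for c in dis]:
            v = np.zeros(n)
            v[i], v[j] = 1.0, -1.0
            p = v @ K @ v                       # current distance K_ii + K_jj - 2 K_ij
            violated = p > target if is_sim else p < target
            if not violated or p <= 0:
                continue
            beta = (target - p) / (p * p)       # chosen so that the new distance equals the target
            Kv = K @ v
            K = K + beta * np.outer(Kv, Kv)     # rank-one Bregman projection update
    return K
```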

SLIDE 12

Low rank kernel learning algorithm

SLIDE 13
  • For very large datasets, it is infeasible
    – to learn the entire kernel matrix, and
    – to store it in memory.
  • The learned kernel therefore has to generalize to out-of-sample points, i.e., pairs (x, y) where x or y or both are out of sample.
  • Distances between such points can be computed using the learned kernel function.

Kernel Learning Using the Log Det Divergence
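A hedged sketch of out-of-sample evaluation, assuming the learned kernel has the form k̂(x, y) = k(x, y) + kxᵀ K⁻¹ (K̂ - K) K⁻¹ ky reported in Jain et al. (2012), with kx = [k(x, x₁), ..., k(x, xₙ)]ᵀ; treat the exact form and all names here as assumptions.

```python
import numpy as np

def learned_kernel_out_of_sample(kernel, X_train, K, K_hat, x, y):
    """Evaluate the learned kernel on points that may lie outside the training set.

    kernel:   original kernel function k(., .)
    X_train:  the n training points used to learn K_hat
    K, K_hat: original and learned n x n kernel matrices
    """
    kx = np.array([kernel(x, xi) for xi in X_train])
    ky = np.array([kernel(y, xi) for xi in X_train])
    Kinv = np.linalg.pinv(K)
    M = Kinv @ (K_hat - K) @ Kinv      # n x n correction term learned from the constraints
    return kernel(x, y) + kx @ M @ ky

def learned_distance_sq(kernel, X_train, K, K_hat, x, y):
    """Squared feature-space distance under the learned kernel."""
    kxx = learned_kernel_out_of_sample(kernel, X_train, K, K_hat, x, x)
    kyy = learned_kernel_out_of_sample(kernel, X_train, K, K_hat, y, y)
    kxy = learned_kernel_out_of_sample(kernel, X_train, K, K_hat, x, y)
    return kxx + kyy - 2.0 * kxy
```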

SLIDE 14

Semi-Supervised Kernel Mean Shift Clustering

  • Input:
    – Unlabeled data
    – Pairwise constraints
    – Number of expected clusters
  • Output:
    – Clusters and labels

SLIDE 15
  • A Gaussian kernel function k(x, y) = exp(-||x - y||² / (2σ²)) is used for the initial kernel matrix.
  • The kernel parameter σ is estimated using the log det divergence.
  • The initial kernel matrix is K₀, computed with the estimated σ.

Kernel Parameter Selection σ

[Plot: desired distances used for selecting σ]
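A small sketch of building the initial Gaussian kernel matrix for a given σ; the bandwidth value in the example is illustrative, not the selected one.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Initial kernel matrix K0 with K0[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X * X, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.clip(sq_dists, 0.0, None, out=sq_dists)   # guard against tiny negative values
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.random.default_rng(0).normal(size=(100, 2))
K0 = gaussian_kernel_matrix(X, sigma=0.5)
```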

SLIDE 16
  • K_r: a low rank approximation of the initial kernel matrix.
  • Learning using the log det divergence to find the learned kernel matrix.
  • Mean shift parameters estimated from the curves.
  • The trade-off parameter was determined by cross-validation.

Low Rank Representation
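The slides do not spell out how K_r is constructed; a common choice is a truncated eigendecomposition, sketched below as an assumption.

```python
import numpy as np

def low_rank_kernel(K, r):
    """Rank-r approximation K_r = G G^T using the top-r eigenpairs of the PSD matrix K."""
    eigvals, eigvecs = np.linalg.eigh(K)          # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:r]           # indices of the r largest eigenvalues
    vals = np.clip(eigvals[idx], 0.0, None)
    G = eigvecs[:, idx] * np.sqrt(vals)           # n x r factor
    return G @ G.T, G

# K_r = G @ G.T keeps the dominant structure of K, while the n x r factor G is cheap to store.
```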

SLIDE 17

Experimental Evaluation

  • Two synthetic data sets
    – Olympic circles (5 classes)
    – Concentric circles (10 classes)
    Both are nonlinearly separable and can have intersecting boundaries.
  • Four real data sets
    – Small number of classes:
      • USPS (10 classes)
      • MIT Scene (8 classes)
    – Large number of classes:
      • PIE faces (68 classes)
      • Caltech objects (50 classes)

SLIDE 18

Comparisons

1. Efficient and exhaustive constraint propagation for spectral clustering [1].
2. Semi-supervised kernel k-means [2].
3. Kernel k-means using Bregman divergences.

All these methods have to be given the number of clusters as input.

[1] Zhiwu Lu and Horace H.S. Ip, "Constrained Spectral Clustering via Exhaustive and Efficient Constraint Propagation". ECCV, 1–14, 2010.
[2] B. Kulis, S. Basu, I. S. Dhillon, and R. J. Mooney, "Semi-supervised graph clustering: A kernel approach". Machine Learning, 74:1–22, 2009.

SLIDE 19

Evaluation Criterion: Adjusted Rand Index

  • A scalar measure to evaluate clustering performance from the clustering output, based on pair counts: TP – true positive, TN – true negative, FP – false positive, FN – false negative pairs. The Rand index is (TP + TN) / (TP + TN + FP + FN).
  • The adjusted version compensates for chance; randomly assigned cluster labels get a low score.
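For reference, scikit-learn ships a chance-adjusted implementation; a minimal usage sketch with made-up labels:

```python
from sklearn.metrics import adjusted_rand_score

true_labels    = [0, 0, 1, 1, 2, 2]
cluster_labels = [1, 1, 0, 0, 2, 2]   # same partition, different label names
random_labels  = [0, 1, 2, 0, 1, 2]

print(adjusted_rand_score(true_labels, cluster_labels))  # 1.0; label permutation does not matter
print(adjusted_rand_score(true_labels, random_labels))   # near 0 (or negative) for random labels
```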

SLIDE 20

Pairwise Constraint Generation

  • Assuming b labeled points are selected at random from each class.
  • b(b - 1)/2 similarity pairs are generated from each class.
  • An equal number of dissimilarity pairs are also generated (see the sketch below).
  • The value of b is varied.
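A minimal sketch of the constraint generation described above; the sampling details and names are illustrative.

```python
import random
from itertools import combinations

def generate_constraints(labels, b, seed=0):
    """Sample b labeled points per class; return similarity and dissimilarity index pairs."""
    rng = random.Random(seed)
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    chosen = {lab: rng.sample(pts, b) for lab, pts in by_class.items()}

    # All b(b-1)/2 pairs within a class are similarity constraints.
    sim = [pair for pts in chosen.values() for pair in combinations(pts, 2)]

    # Sample an equal number of dissimilarity pairs across different classes.
    classes = list(chosen)
    dis = set()
    while len(dis) < len(sim):
        c1, c2 = rng.sample(classes, 2)
        dis.add(tuple(sorted((rng.choice(chosen[c1]), rng.choice(chosen[c2])))))
    return sim, list(dis)
```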

SLIDE 21

Synthetic Example 1: Olympic Circles

  • 300 points along each of the five circles.
  • 25 points per class selected at random.
  • Experiment 1:
    – Varied the number of labeled points [5, 7, 10, 12, 15, 17, 20, 25] from each class to generate pairwise constraints.

[Figure: original data and a sample clustering result]

SLIDE 22

Synthetic Example 1: Olympic Circles

  • Experiment 2:
    – 20 labeled points per class.
    – Introduced labeling errors by swapping similarity pairs with dissimilarity pairs.
    – Varied the fraction of mislabeled constraints.

SLIDE 23

Synthetic Example 2: Concentric Circles

  • 100 points along each of the ten concentric circles.
  • Experiment 1:
    – Varied the number of labeled points [5, 7, 10, 12, 15, 17, 20, 25] from each class to generate pairwise constraints.
  • Experiment 2:
    – 25 labeled points per class.
    – Introduced labeling errors by swapping similarity pairs with dissimilarity pairs.

SLIDE 24

Real Example 1: USPS Digits

  • Ten classes with 1100 points per class; a total of 11000 points.
  • 100 points per class → 1000 x 1000 initial kernel matrix K.
  • Varied the number of labeled points [5, 7, 10, 12, 15, 17, 20, 25] from each class to generate pairwise constraints.
  • Clustered all 11000 data points by generalizing to the remaining 10000 points. ARI = 0.7529 ± 0.051.

[Figure: 11000 x 11000 PDM]

SLIDE 25

Real Example 2: MIT Scene

  • Eight classes with 2688 points in total; the number of samples per class ranges between 260 and 410.
  • 100 points per class → 800 x 800 initial kernel matrix K.
  • Varied the number of labeled points [5, 7, 10, 12, 15, 17, 20] from each class to generate pairwise constraints.
  • Clustered all 2688 data points by generalizing to the remaining 1888 points.

SLIDE 26

Real Example 3: PIE Faces

  • 68 subjects with 21 samples per subject.
  • 1428 x 1428 full initial kernel matrix K.
  • Varied the number of labeled points [3, 4, 5, 6, 7] from each class to generate pairwise constraints.
  • Obtained perfect clustering for more than 5 labeled points per class.

SLIDE 27

Real Example 4: Caltech-101 (subset)

  • 50 categories, with the number of samples per class ranging between 31 and 40.
  • 1959 x 1959 full initial kernel matrix K.
  • Varied the number of labeled points [5, 7, 10, 12, 15] from each class to generate pairwise constraints.
