SLIDE 1

Introduction Approach Validation Implementation Project Schedule Deliverables References

Spectral Clustering on Handwritten Digits Database

Danielle Middlebrooks (dmiddle1@math.umd.edu)
Advisor: Kasso Okoudjou (kasso@umd.edu)

Department of Mathematics, University of Maryland, College Park
Advanced Scientific Computing I

October 6, 2015

Middlebrooks Spectral Clustering on Handwritten Digits Database

SLIDE 2

Outline

1. Introduction
2. Approach
3. Validation
4. Implementation
5. Project Schedule
6. Deliverables
7. References

SLIDE 3

Background Information

Spectral clustering is a clustering technique that makes use of the spectrum of the similarity matrix derived from the data set. It runs a clustering algorithm in a reduced dimension.

Advantages: The algorithm is simple to implement and uses standard linear algebra methods to solve the problem efficiently.

Motivation: Implement an algorithm that groups the objects in a data set with other objects that behave similarly.

SLIDE 4

Definitions

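A graph $G = (V, E)$ where $V = \{v_1, \dots, v_n\}$.

$W$: adjacency matrix,

$$w_{ij} = \begin{cases} 1, & \text{if } v_i, v_j \text{ are connected by an edge} \\ 0, & \text{otherwise} \end{cases}$$

The degree of a vertex: $d_i = \sum_{j=1}^{n} w_{ij}$.

The degree matrix, denoted $D$, is the diagonal matrix with $d_1, \dots, d_n$ on the diagonal.

Denote a subset of vertices by $A \subset V$ and its complement by $\bar{A} = V \setminus A$.

$|A|$ = number of vertices in $A$

$\mathrm{vol}(A) = \sum_{i \in A} d_i$

$W(A, B) = \sum_{i \in A,\, j \in B} w_{ij}$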
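As a small illustration of these definitions, the sketch below computes the degrees, $\mathrm{vol}(A)$ and $W(A, B)$ in plain Python (the project itself targets Matlab). The 6-vertex graph and the helper names `adjacency`, `degrees`, `vol` and `weight_between` are assumptions made for this example and do not come from the slides.

```python
# Illustrative graph (an assumption): two triangles joined by the edge (2, 3).
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
N = 6


def adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix W as a list of rows."""
    w = [[0] * n for _ in range(n)]
    for i, j in edges:
        w[i][j] = 1
        w[j][i] = 1
    return w


def degrees(w):
    """d_i = sum_j w_ij for every vertex."""
    return [sum(row) for row in w]


def vol(w, a):
    """vol(A) = sum of the degrees of the vertices in A."""
    d = degrees(w)
    return sum(d[i] for i in a)


def weight_between(w, a, b):
    """W(A, B) = sum of w_ij over i in A and j in B."""
    return sum(w[i][j] for i in a for j in b)


W = adjacency(N, EDGES)
A = {0, 1, 2}
A_bar = set(range(N)) - A

print(degrees(W))                    # degree of each vertex
print(len(A), vol(W, A))             # |A| and vol(A)
print(weight_between(W, A, A_bar))   # weight of the edges leaving A
```

For this graph only the edge (2, 3) leaves $A$, so $W(A, \bar{A})$ counts a single edge.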

SLIDE 5

Example

[Figure: graph on the vertices 1, ..., 7]

$G = (V, E)$ where $V = \{1, \dots, 7\}$.

$W$: the $7 \times 7$ adjacency matrix of $G$, with 18 entries equal to 1 (the positions of the entries are not recoverable here).

$D = \mathrm{diag}(2, 2, 2, 3, 3, 3, 3)$

SLIDE 6

Definitions

Similarity graph: Given a data set $x_1, \dots, x_n$ and a notion of “similar”, a similarity graph is a graph in which $x_i$ and $x_j$ are joined by an edge if they are considered “similar”. Some ways to determine whether data points are similar:

$\epsilon$-neighborhood graph
k-nearest neighbor graph
Similarity function

Unnormalized Laplacian matrix: $L = D - W$

Normalized Laplacian matrix: $L_{sym} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$
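The two Laplacians can be built entry by entry from $W$. The following is a minimal sketch in plain Python, assuming a small illustrative graph (two triangles joined by one edge) in which every vertex has positive degree; the variable names are chosen for this example only.

```python
import math

# Illustrative graph (an assumption): two triangles joined by the edge (2, 3).
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
N = 6

W = [[0] * N for _ in range(N)]
for i, j in EDGES:
    W[i][j] = W[j][i] = 1

d = [sum(row) for row in W]  # vertex degrees, i.e. the diagonal of D

# Unnormalized Laplacian L = D - W.
L = [[(d[i] if i == j else 0) - W[i][j] for j in range(N)] for i in range(N)]

# Normalized Laplacian L_sym = I - D^{-1/2} W D^{-1/2}.
# Entry (i, j) is delta_ij - w_ij / sqrt(d_i * d_j); this needs every d_i > 0.
L_sym = [[(1.0 if i == j else 0.0) - W[i][j] / math.sqrt(d[i] * d[j])
          for j in range(N)] for i in range(N)]

for row in L:
    print(row)
```

Each row of $L$ sums to zero, which is why the constant vector is always an eigenvector of $L$ with eigenvalue 0.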

SLIDE 7

Why this works?

Spectral clustering is motivated by approximating the RatioCut or NCut of a given graph. Given a similarity graph, the simplest way to construct a partition is to solve the min cut problem, that is, to choose $A_1, \dots, A_k$ minimizing

$$\mathrm{cut}(A_1, \dots, A_k) := \frac{1}{2} \sum_{i=1}^{k} W(A_i, \bar{A}_i)$$

To ensure that each part is reasonably large, use RatioCut or NCut instead. They measure the size of a part by its number of vertices $|A_i|$ or by the weights of its edges $\mathrm{vol}(A_i)$, respectively.

SLIDE 8

Why this works?

Thus

$$\mathrm{RatioCut}(A_1, \dots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A}_i)}{|A_i|}$$

$$\mathrm{NCut}(A_1, \dots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A}_i)}{\mathrm{vol}(A_i)}$$

Adding these balancing conditions makes the problem NP-hard. Spectral clustering solves relaxed versions of these problems. [2.]
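To make the three objectives concrete, the sketch below evaluates cut, RatioCut and NCut for two partitions of a small graph, following the definitions above with the factor $\frac{1}{2}$. The graph and the function names `weight_between` and `cut_values` are assumptions made for this illustration.

```python
# Illustrative graph (an assumption): two triangles joined by the edge (2, 3).
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
N = 6

W = [[0] * N for _ in range(N)]
for i, j in EDGES:
    W[i][j] = W[j][i] = 1


def weight_between(w, a, b):
    """W(A, B) = sum of w_ij over i in A and j in B."""
    return sum(w[i][j] for i in a for j in b)


def cut_values(w, parts):
    """Return (cut, RatioCut, NCut) for a partition given as a list of vertex sets."""
    n = len(w)
    d = [sum(row) for row in w]
    cut = ratio_cut = n_cut = 0.0
    for a in parts:
        a_bar = set(range(n)) - set(a)
        w_a = weight_between(w, a, a_bar)
        cut += 0.5 * w_a
        ratio_cut += 0.5 * w_a / len(a)               # size measured by |A_i|
        n_cut += 0.5 * w_a / sum(d[i] for i in a)     # size measured by vol(A_i)
    return cut, ratio_cut, n_cut


balanced = cut_values(W, [{0, 1, 2}, {3, 4, 5}])
lopsided = cut_values(W, [{0}, {1, 2, 3, 4, 5}])
print(balanced)
print(lopsided)
```

On this graph the balanced split along the bridge edge scores lower than the split that isolates a single vertex under all three objectives; the division by $|A_i|$ or $\mathrm{vol}(A_i)$ is what penalizes very small parts in general.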

SLIDE 9

Why this works?

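Case $k = 2$. Given a subset $A \subset V$, and dropping the constant factor $\frac{1}{2}$ from the general definition (it does not change the minimizer),

$$\mathrm{NCut}(A, \bar{A}) = \frac{W(A, \bar{A})}{\mathrm{vol}(A)} + \frac{W(A, \bar{A})}{\mathrm{vol}(\bar{A})}$$

Define the cluster indicator vector $f$ by

$$f(v_i) = f_i = \begin{cases} \dfrac{1}{\mathrm{vol}(A)}, & \text{if } v_i \in A \\[2ex] -\dfrac{1}{\mathrm{vol}(\bar{A})}, & \text{if } v_i \in \bar{A} \end{cases}$$

Then

$$f^T L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2 = W(A, \bar{A}) \left( \frac{1}{\mathrm{vol}(A)} + \frac{1}{\mathrm{vol}(\bar{A})} \right)^2$$

$$f^T D f = \sum_{i=1}^{n} d_i f_i^2 = \frac{1}{\mathrm{vol}(A)} + \frac{1}{\mathrm{vol}(\bar{A})}$$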
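These identities can be checked numerically. The sketch below, on an assumed 6-vertex graph with $A = \{0, 1\}$, builds the indicator vector $f$ and compares the quotient $f^T L f / f^T D f$ with the two-set NCut value; all names are chosen for this illustration.

```python
# Illustrative graph (an assumption): two triangles joined by the edge (2, 3).
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
N = 6

W = [[0] * N for _ in range(N)]
for i, j in EDGES:
    W[i][j] = W[j][i] = 1
d = [sum(row) for row in W]

A = {0, 1}
A_bar = set(range(N)) - A
vol_a = sum(d[i] for i in A)
vol_a_bar = sum(d[i] for i in A_bar)
w_cross = sum(W[i][j] for i in A for j in A_bar)   # W(A, A_bar)

# Cluster indicator vector: 1/vol(A) on A and -1/vol(A_bar) on the complement.
f = [1.0 / vol_a if i in A else -1.0 / vol_a_bar for i in range(N)]

# f^T L f written as (1/2) * sum_ij w_ij (f_i - f_j)^2, and f^T D f.
f_L_f = 0.5 * sum(W[i][j] * (f[i] - f[j]) ** 2 for i in range(N) for j in range(N))
f_D_f = sum(d[i] * f[i] ** 2 for i in range(N))

ncut = w_cross / vol_a + w_cross / vol_a_bar
f_D_one = sum(d[i] * f[i] for i in range(N))       # f^T D 1

print(f_L_f / f_D_f, ncut, f_D_one)
```

The quotient and the NCut value agree, and $f^T D \mathbf{1}$ vanishes for every indicator vector of this form, which is the constraint kept in the relaxed problem.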

SLIDE 10

Why this works?

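Thus the NCut problem is equivalent to

$$\min_{A \subset V} \mathrm{NCut}(A, \bar{A}) = \min_{f} \frac{f^T L f}{f^T D f}$$

where $f$ ranges over the cluster indicator vectors of the form above. The relaxed problem is given by

$$\min_{f \in \mathbb{R}^n} \frac{f^T L f}{f^T D f} \quad \text{subject to } f^T D \mathbf{1} = 0$$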

SLIDE 11

Why this works?

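It can be shown that the relaxed problem is a form of the Rayleigh-Ritz quotient. The Rayleigh-Ritz theorem states: given a Hermitian (here real symmetric) matrix $A$,

$$\lambda_{\min} = \min_{x \neq 0} \frac{x^T A x}{x^T x}$$

Thus the solution $f$ of the relaxed problem is the second eigenvector of the generalized eigenvalue problem $L f = \lambda D f$, that is, the eigenvector for its second-smallest eigenvalue.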
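One way to see the connection is the substitution $g = D^{1/2} f$, sketched below under the assumption that every vertex has positive degree, so that $D^{1/2}$ is invertible. The symbols $g$ and $\lambda_2$ are introduced here for the derivation only.

```latex
\begin{align*}
g &= D^{1/2} f, \\
\frac{f^T L f}{f^T D f}
  &= \frac{g^T D^{-1/2} L D^{-1/2} g}{g^T g}
   = \frac{g^T L_{sym}\, g}{g^T g}, \\
f^T D \mathbf{1} = 0
  &\iff g^T D^{1/2} \mathbf{1} = 0, \\
L_{sym}\, D^{1/2} \mathbf{1}
  &= D^{-1/2} L \mathbf{1} = 0 .
\end{align*}
```

So $D^{1/2} \mathbf{1}$ is an eigenvector of $L_{sym}$ for the eigenvalue 0, which is the smallest one because $L_{sym}$ is positive semidefinite. Minimizing the Rayleigh quotient over vectors $g$ orthogonal to it yields the second-smallest eigenvalue $\lambda_2$ of $L_{sym}$, and the corresponding $f = D^{-1/2} g$ satisfies $L f = \lambda_2 D f$.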

SLIDE 12

Procedure

1. Database
2. Similarity graph
3. Normalized Laplacian
4. Compute the eigenvectors
5. Put the eigenvectors in a matrix and normalize
6. Perform dimension reduction
7. Cluster the points

SLIDE 13

Databases

The database I will be using is the MNIST handwritten digits database. It has 1000 images of each digit 0-9. Each image is 28 × 28 pixels.

Figure: Test images. Simon A. J. Winder

SLIDE 14

Similarity Graph

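Gaussian similarity function:

$$s(x_i, x_j) = e^{-\frac{\|x_i - x_j\|_2^2}{2\sigma^2}}$$

where $\sigma$ is a parameter. If $s(x_i, x_j) > \epsilon$, connect $x_i$ and $x_j$ by an edge, so that only sufficiently similar images are joined.

Each $x_i \in \mathbb{R}^{28 \times 28}$ corresponds to an image. Thus

$$\|x_i - x_j\|_2^2 = \sum_{k=1}^{28} \sum_{l=1}^{28} \left( x^i_{kl} - x^j_{kl} \right)^2$$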
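A minimal sketch of this step in plain Python follows. It uses three toy 2 × 2 "images" instead of 28 × 28 MNIST digits, and it assumes that an edge is placed when the similarity exceeds the threshold, so that similar images are connected. The function names and the values of `sigma` and `eps` are illustrative assumptions.

```python
import math


def sq_dist(x, y):
    """Squared entrywise distance between two equal-sized images (lists of rows)."""
    return sum((a - b) ** 2 for rx, ry in zip(x, y) for a, b in zip(rx, ry))


def gaussian_similarity(x, y, sigma):
    """s(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return math.exp(-sq_dist(x, y) / (2.0 * sigma ** 2))


def similarity_adjacency(images, sigma, eps):
    """0/1 adjacency matrix with an edge where the similarity exceeds eps (assumed rule)."""
    n = len(images)
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if gaussian_similarity(images[i], images[j], sigma) > eps:
                w[i][j] = w[j][i] = 1
    return w


# Toy 2 x 2 images (assumed data): a and b are close, c is far from both.
a = [[0, 0], [0, 1]]
b = [[0, 0], [1, 1]]
c = [[5, 5], [5, 5]]

W = similarity_adjacency([a, b, c], sigma=1.0, eps=0.5)
print(W)
```

The parameter $\sigma$ controls how quickly the similarity decays with distance, so the choice of $\sigma$ and the threshold together determine how dense the graph is.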

SLIDE 15

Laplacian Matrix

$W$: adjacency matrix,

$$w_{ij} = \begin{cases} 1, & \text{if } s(x_i, x_j) > \epsilon \\ 0, & \text{otherwise} \end{cases}$$

$D$: degree matrix

$L = D - W$

$L_{sym} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$

SLIDE 16

Computing Eigenvectors

Use an iterative method called the power method to find the first $k$ eigenvectors of $L_{sym} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$.

Start with an initial nonzero vector $v_0$ as the guess for the eigenvector. Let $B = D^{-1/2} W D^{-1/2}$. Form the sequence given by:

    for i = 1, ..., l
        x_i = B v_{i-1}
        v_i = x_i / ||x_i||
    end

SLIDE 17

Computing Eigenvectors (cont.)

For large values of $l$ we obtain a good approximation of the dominant eigenvector of $B$, that is, the eigenvector corresponding to the largest eigenvalue of $B$. Since $L_{sym} = I - B$, this eigenvalue corresponds to the smallest eigenvalue of $L_{sym}$. To find the next eigenvector, select a random initial vector $v_0$ and subtract the component of $v_0$ that is parallel to the eigenvector of the largest eigenvalue before iterating.
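The iteration and the deflation step can be sketched as follows in plain Python. This is an illustration under several assumptions, not the project's Matlab code: the graph is a small example whose leading eigenvalues of $B$ are positive and dominate in magnitude, the start vector is fixed rather than random so the run is reproducible, and the projection is repeated in every iteration because rounding errors would otherwise reintroduce the dominant component. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(m, v):
    return [dot(row, v) for row in m]


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def remove_components(v, basis):
    """Subtract from v its components along the orthonormal vectors in basis."""
    for u in basis:
        c = dot(v, u)
        v = [a - c * b for a, b in zip(v, u)]
    return v


def power_method(b, k, iterations=500):
    """Approximate the k dominant eigenpairs of a symmetric matrix b."""
    n = len(b)
    values, vectors = [], []
    for _ in range(k):
        start = [float(i + 1) for i in range(n)]   # fixed start vector (assumption)
        v = normalize(remove_components(start, vectors))
        for _step in range(iterations):
            x = remove_components(matvec(b, v), vectors)   # deflate every step
            v = normalize(x)
        vectors.append(v)
        values.append(dot(v, matvec(b, v)))        # Rayleigh quotient estimate
    return values, vectors


# Illustrative graph (an assumption): two triangles joined by the edge (2, 3).
EDGES = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
N = 6
W = [[0] * N for _ in range(N)]
for i, j in EDGES:
    W[i][j] = W[j][i] = 1
d = [sum(row) for row in W]
B = [[W[i][j] / math.sqrt(d[i] * d[j]) for j in range(N)] for i in range(N)]

values, vectors = power_method(B, 2)
print(values)
print([round(x, 3) for x in vectors[1]])
```

On this graph the second eigenvector takes one sign on the first triangle and the opposite sign on the second, which is the split a spectral method would be expected to recover. If the matrix had a negative eigenvalue of largest magnitude, plain power iteration on $B$ would converge to that one instead, so a shifted matrix such as $B + I$ may be needed in general.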

SLIDE 18

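Computing Eigenvectors (cont.)

We put the first $k$ eigenvectors into the columns of a matrix $V \in \mathbb{R}^{n \times k}$ and normalize its rows. Let $T \in \mathbb{R}^{n \times k}$ be the resulting matrix, whose rows have norm 1. Set

$$t_{ij} = \frac{v_{ij}}{\left( \sum_{m=1}^{k} v_{im}^2 \right)^{1/2}}$$

$$\begin{pmatrix} v_{11} & v_{12} & v_{13} & \cdots & v_{1k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ v_{i1} & v_{i2} & v_{i3} & \cdots & v_{ik} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ v_{n1} & v_{n2} & v_{n3} & \cdots & v_{nk} \end{pmatrix} \Rightarrow \begin{pmatrix} t_{11} & t_{12} & t_{13} & \cdots & t_{1k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ t_{i1} & t_{i2} & t_{i3} & \cdots & t_{ik} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ t_{n1} & t_{n2} & t_{n3} & \cdots & t_{nk} \end{pmatrix}$$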
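The row normalization amounts to dividing each row by its Euclidean norm. A minimal sketch in plain Python, assuming no row of the eigenvector matrix is entirely zero; the function name `normalize_rows` and the sample matrix are made up for illustration.

```python
import math


def normalize_rows(v):
    """Return T with t_ij = v_ij / sqrt(sum_m v_im^2); assumes no all-zero row."""
    t = []
    for row in v:
        norm = math.sqrt(sum(x * x for x in row))
        t.append([x / norm for x in row])
    return t


# Sample n x k matrix with n = 3 points and k = 2 eigenvector columns (assumed data).
V = [[3.0, 4.0],
     [1.0, 0.0],
     [-2.0, 2.0]]

T = normalize_rows(V)
for row in T:
    print(row)
```

After this step every row of `T` lies on the unit sphere in $\mathbb{R}^k$, and each row is the reduced representation of one data point.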

SLIDE 19

Dimension Reduction

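Project the eigenvectors onto a new space. Let $y_i \in \mathbb{R}^k$ be the vector formed from the $i$th row of $T$:

$$\begin{pmatrix} t_{11} & t_{12} & t_{13} & \cdots & t_{1k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ t_{i1} & t_{i2} & t_{i3} & \cdots & t_{ik} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ t_{n1} & t_{n2} & t_{n3} & \cdots & t_{nk} \end{pmatrix} \Rightarrow y_i = \begin{pmatrix} t_{i1} \\ t_{i2} \\ \vdots \\ t_{ik} \end{pmatrix}$$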

SLIDE 21

Clustering

Perform the k-means algorithm on the set of vectors $y_1, \dots, y_n$:

1. Randomly select $k$ cluster centroids $z_j$.
2. Calculate the distance between each $y_i$ and each $z_j$.
3. Assign each data point to the closest centroid.
4. Recalculate the centroids and the distances from the data points to the new centroids.
5. If no data point was reassigned, stop; otherwise reassign the data points and repeat.

Finally, assign the original point $x_i$ to cluster $j$ if and only if row $i$ of the matrix $Y$ (the vector $y_i$) was assigned to cluster $j$.
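The k-means loop described above can be sketched in plain Python as follows. The function name `kmeans`, the seeded random generator, the iteration cap and the handling of empty clusters (the old centroid is kept) are assumptions of this sketch; the sample points stand in for the rows $y_i$.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd-style k-means; returns one cluster label per point and the centroids."""
    rng = random.Random(seed)
    # Randomly select k of the data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assign every point to its closest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:      # no point was reassigned: stop
            break
        labels = new_labels
        # Recalculate each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:               # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


# Stand-in rows y_i (assumed data): two well-separated groups of three points.
Y = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
     [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]

labels, centroids = kmeans(Y, k=2)
print(labels)   # labels[i] is also the cluster of the original point x_i
```

Because the rows $y_i$ are in one-to-one correspondence with the original points, the label of row $i$ is reused directly as the cluster of $x_i$. The numbering of the clusters depends on the random initial centroids, so only the grouping is meaningful.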

SLIDE 22

Validation

Validate the k-means clustering on a well-known clustered data set and check that the results match.

Validate the computation of the eigenvectors by comparing against a Matlab toolbox.

Validate the clustering solution by checking whether similar images are grouped together.

SLIDE 23

Implementation

Personal Laptop: Macbook Pro. I will be using Matlab R2014b for the coding.

SLIDE 24

Project Schedule

End of October / early November: Construct the similarity graph and the normalized Laplacian matrix.

End of November / early December: Compute the first k eigenvectors and validate this.

February: Normalize the rows of the matrix of eigenvectors and perform dimension reduction.

March / April: Cluster the points using k-means and validate this step.

End of spring semester: Implement the entire algorithm, optimize, and obtain final results.

SLIDE 25

Results

By the end of the project, I will deliver:

Code that delivers the database
Code that implements the algorithm
Final report with the algorithm outline, testing on the database, and results
Final presentation

SLIDE 26

References

[1.] von Luxburg, U. A Tutorial on Spectral Clustering. Statistics and Computing, 17 (2007) 4.

[2.] Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22 (2000) 8.

[3.] Chung, Fan. Spectral Graph Theory. N.p.: American Mathematical Society, 1997. Regional Conference Series in Mathematics, Ser. 92.

[4.] Vishnoi, Nisheeth K. Lx = b: Laplacian Solvers and their Algorithmic Applications. N.p.: Foundations and Trends in Theoretical Computer Science, 2012.

SLIDE 27

Thank you
