Clustering Algorithms
Dalya Baron (Tel Aviv University) XXX Winter School, November 2018
Objects in the Feature 1 vs. Feature 2 plane can separate into groups, e.g. cluster #1 and cluster #2.

Why should we look for clusters?
K-means

Input: measured features, and the number of clusters, k. The algorithm classifies all the objects in the sample into k clusters.
(I) The algorithm randomly places k points that represent the centroids of the clusters. It then performs several iterations; in each of them: (II) it associates each object with a single cluster, according to the object's distance from the cluster centroids, and (III) it recalculates each cluster centroid from the objects that are associated with it.
Two centroids are randomly placed.
The objects are associated with the closest cluster centroid (Euclidean distance).
New cluster centroids are computed using the average location of the cluster members.
The objects are re-associated with the closest of the new cluster centroids (Euclidean distance).
The process stops when the objects associated with each cluster no longer change.
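The iterative procedure above can be sketched with scikit-learn's `KMeans` (the two-feature dataset below is synthetic, invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the "Feature 1" / "Feature 2" plane:
# two well-separated blobs of 50 objects each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0.0, 0.0), 0.5, (50, 2)),
               rng.normal((5.0, 5.0), 0.5, (50, 2))])

# k must be given in advance; KMeans repeats the assign-objects /
# recompute-centroids iterations until the assignments converge.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

labels = kmeans.labels_              # cluster index for each object
centroids = kmeans.cluster_centers_  # final centroid locations
```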
Internal choices and/or internal cost function: (I) Initial centroids are randomly selected from the set of examples. (II) The global cost function that is minimized by K-means:
J = Σ_{i=1}^{k} Σ_{x ∈ C_i} ||x − μ_i||²

where μ_i are the cluster centroids, C_i are the cluster members, and ||·|| is the Euclidean distance.
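In scikit-learn this cost is exposed as the `inertia_` attribute of a fitted `KMeans` model; a minimal sanity check on synthetic data, recomputing the sum of squared Euclidean distances of the members to their centroids:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal((0.0, 0.0), 0.5, (40, 2)),
               rng.normal((4.0, 4.0), 0.5, (40, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# J: sum over clusters, and over each cluster's members, of the
# squared Euclidean distance between member and centroid.
J = sum(np.sum((X[km.labels_ == i] - mu) ** 2)
        for i, mu in enumerate(km.cluster_centers_))
# J agrees with km.inertia_, scikit-learn's name for this cost.
```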
k=3, with two different random placements of the initial centroids: the random initialization can lead to different final clusterings.
Input dataset: a list of objects with measured features. For which datasets should we use K-means?
Input dataset: a list of objects with measured features. What happens when we have an outlier in the dataset?
Input dataset: a list of objects with measured features. What happens when the features have different physical units?
How can we avoid this?
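One common way to avoid this (an assumption here; the remedy is not spelled out in the text) is to standardize each feature to zero mean and unit variance before clustering, so that no feature dominates the Euclidean distance purely because of its units. A sketch with scikit-learn's `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features with very different physical units:
# one of order 1, one of order 10^4.
rng = np.random.default_rng(2)
X = np.column_stack([rng.normal(0.0, 1.0, 100),
                     rng.normal(0.0, 1e4, 100)])

# Rescale each feature to zero mean and unit variance, so both
# contribute comparably to the Euclidean distances K-means uses.
X_scaled = StandardScaler().fit_transform(X)
```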
Hyper-parameters: the number of clusters, k. Can we find the optimal k using the cost function?
k=2 k=3 k=5
Plotting the minimal cost function against the number of clusters, the curve drops steeply and then flattens; the “elbow” of the curve marks a good choice of k.
Correa-Gallego+ 2016
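The elbow curve can be reproduced by refitting K-means over a range of k and recording the minimal cost (`inertia_`). A sketch on synthetic data with three true clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three true clusters.
rng = np.random.default_rng(3)
centers = [(0.0, 0.0), (6.0, 0.0), (3.0, 5.0)]
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in centers])

# Minimal cost as a function of k: it drops steeply until k reaches
# the true number of clusters, then flattens -- the "elbow".
costs = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
         for k in range(1, 7)}
```

The cost always decreases as k grows, so its minimum alone cannot select k; the bend of the curve is the useful signal.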
Hierarchical clustering

Input: measured features, or a distance matrix that represents the pair-wise distances between the objects. We must also specify a linkage method. Initialization: each object is a cluster of size 1.
Next: the algorithm merges the two closest clusters into a single cluster, then re-calculates the distance of the newly formed cluster to all the rest.
Each merge is recorded in a dendrogram, whose branches join at the distance at which the corresponding clusters were merged.
The process stops when all the objects are merged into a single cluster.
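The merge-and-recompute loop above is what `scipy.cluster.hierarchy.linkage` implements; a minimal sketch on synthetic data:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic stand-in for the two-feature dataset.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal((0.0, 0.0), 0.3, (20, 2)),
               rng.normal((4.0, 4.0), 0.3, (20, 2))])

# Each object starts as its own cluster; linkage() repeatedly merges
# the two closest clusters and records every merge and its distance,
# producing the (n-1)-row merge table that a dendrogram visualizes.
Z = linkage(X, method='average')

# Cut the tree into a chosen number of clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
```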
Internal choices and/or internal cost function: the linkage method defines the distance between two clusters, and in particular between a newly formed cluster and all the rest.
Single linkage: the distance between two clusters is the minimal pair-wise distance between their members.
Complete linkage: the distance between two clusters is the maximal pair-wise distance between their members.
Average linkage: the distance between two clusters is the average pair-wise distance between their members.
Hyper-parameters: clusters are defined by cutting the dendrogram at a distance threshold d; everything merged beneath d forms a cluster. Alternatively, we can select the threshold d that corresponds to the desired number of clusters, k.
We can use the resulting dendrogram to choose a “good” threshold: a natural cut crosses the long vertical branches, where the merge distance jumps and the clustering is stable over a wide range of d.
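Cutting the dendrogram at a threshold d, or asking directly for k clusters, corresponds to the two `criterion` modes of `scipy.cluster.hierarchy.fcluster` (sketch on synthetic data):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(5)
X = np.vstack([rng.normal((0.0, 0.0), 0.3, (20, 2)),
               rng.normal((5.0, 5.0), 0.3, (20, 2))])
Z = linkage(X, method='average')

# Clusters are whatever remains connected beneath the cut at d.
d = 2.0
labels_d = fcluster(Z, t=d, criterion='distance')

# Equivalently, ask for the threshold that yields k clusters.
labels_k = fcluster(Z, t=2, criterion='maxclust')
```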
Input dataset: can either be a list of objects with measured properties, or a distance matrix that represents pair-wise distances between objects. What happens if we have an outlier in the dataset?
What happens if the dataset does not have clear clusters?
Different linkage methods are helpful with different datasets.
single linkage complete linkage average linkage
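This can be seen on a non-convex dataset such as scikit-learn's two-moons toy data (synthetic, used here as a stand-in for the figure): single linkage can chain along each elongated shape, while complete and average linkage need not.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons

# Two interleaved half-moons: elongated, non-Gaussian clusters.
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

labels = {}
for method in ('single', 'complete', 'average'):
    model = AgglomerativeClustering(n_clusters=2, linkage=method)
    labels[method] = model.fit_predict(X)
# Single linkage merges nearest neighbours, so it can follow each
# moon; complete/average linkage favour compact, round clusters.
```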
“Statistics, Data Mining, and Machine Learning in Astronomy”, by Ivezić, Connolly, VanderPlas, and Gray (2013).
Visualizing similarity matrices with Hierarchical Clustering
Input: 10,000 emission line spectra, covering the wavelength range 300–700 nm.
We compute a correlation matrix of all the observed wavelengths.
We convert the correlation matrix to a distance matrix, and build a dendrogram.
We reorder the correlation matrix (the wavelengths) according to the resulting dendrogram.
de Souza et al. 2015
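The reordering steps can be sketched with scipy (the data below are synthetic stand-ins for the spectra, and the conversion distance = 1 − correlation is one common choice, assumed here rather than taken from the text):

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform

# Synthetic stand-in for the spectra: 500 "objects" x 10 "wavelength"
# channels, built as two groups of strongly correlated channels.
rng = np.random.default_rng(6)
base = rng.normal(size=(500, 2))
data = np.column_stack([base[:, 0]] * 5 + [base[:, 1]] * 5)
data = data + 0.1 * rng.normal(size=data.shape)

corr = np.corrcoef(data, rowvar=False)  # channel-channel correlations
dist = 1.0 - corr                       # one common distance choice
np.fill_diagonal(dist, 0.0)

# Build the dendrogram from the condensed distance matrix, then
# reorder the correlation matrix by the dendrogram's leaf order,
# so correlated channels end up adjacent.
Z = linkage(squareform(dist, checks=False), method='average')
order = leaves_list(Z)
corr_reordered = corr[np.ix_(order, order)]
```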
See: http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html#sphx-glr-auto-examples-mixture-plot-gmm- covariances-py