

SLIDE 1

HAL Id: hal-01810380 https://hal.inria.fr/hal-01810380

Submitted on 7 Jun 2018. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Going further in cluster analysis and classification: Bi-clustering and co-clustering

C. Biernacki

To cite this version:

C. Biernacki. Going further in cluster analysis and classification: Bi-clustering and co-clustering. Summer School on Clustering, Data Analysis and Visualization of Complex Data, May 2018, Catania, Italy. ⟨hal-01810380⟩
SLIDE 2

Going further in cluster analysis and classification: Bi-clustering and co-clustering

C. Biernacki

Summer School on Clustering, Data Analysis and Visualization of Complex Data, May 21-25, 2018, University of Catania, Italy

SLIDE 3

Outline

1. HD clustering
2. Modeling
3. Estimating
4. Selecting
5. BlockCluster in MASSICCC
6. To go further

SLIDE 4

Motivation

High dimensional (HD) data sets are now frequent:
- Marketing: d ∼ 10^2
- Microarray gene expression: d ∼ 10^2–10^4
- SNP data: d ∼ 10^6
- Curves: depends on the discretization but can be very high
- Text mining, ...

Clustering has to be applied to HD datasets for the same reasons as to lower-dimensional ones:
- Data summary
- Data exploration
- Preprocessing for more flexibility in a forthcoming prediction step

But clustering is even more important since visualization in the HD setting can be hazardous...

SLIDE 5

Today: exponential growth of the dimension¹

¹S. Alelyani, J. Tang and H. Liu (2013). Feature Selection for Clustering: A Review. Data Clustering: Algorithms and Applications, 29

SLIDE 6

HD data: definition (1/2)

An attempt in the non-parametric case

Dataset x = (x_1, ..., x_n), each x_i described by d variables, where n = o(e^d)

Justifications:
- To approximate within error ε a (Lipschitz) function of d variables, about (1/ε)^d evaluations on a grid are required [Bellman, 61]
- To approximate a Gaussian distribution with fixed Gaussian kernels with an approximation error of about 10% [Silverman, 86]: log10 n(d) ≈ 0.6(d − 0.25). For instance, n(10) ≈ 7·10^5
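As a quick arithmetic check of Silverman's rule above, a minimal sketch (the function name is ours):

```python
def required_sample_size(d):
    """Sample size for ~10% Gaussian kernel density error [Silverman, 86]:
    log10 n(d) ~ 0.6 * (d - 0.25)."""
    return 10 ** (0.6 * (d - 0.25))

print(f"n(10) = {required_sample_size(10):.1e}")  # n(10) = 7.1e+05, as quoted above
```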

SLIDE 7

HD data: definition (2/2)

An attempt in the parametric case

Dataset x = (x_1, ..., x_n), each x_i described by d variables, and a model m with ν parameters, where n = o(g(ν)), with g a given function

Justification: consider the heteroscedastic Gaussian mixture with true parameter θ* with K* components, and let θ̂ be the Gaussian MLE with K* components. We have g linear from the following result [Michel, 08]: there exist constants κ, A and C such that

E_x[Hellinger²(p_θ*, p_θ̂)] ≤ C [ κ (ν/n) (2A ln d + 1 − ln(1 ∧ (ν/n) A ln d)) + 1/n ]

But ν can be high since ν ∼ d²/2, combined with potentially large constants.

SLIDE 8

HD density estimation: curse

A two-component d-variate Gaussian mixture: π_1 = π_2 = 1/2, X_1|z_11 = 1 ∼ N_d(0, I), X_1|z_12 = 1 ∼ N_d(1, I)

Components are more and more separated when d grows: ‖µ_2 − µ_1‖_I = √d ...

[Figure: sample scatter plot (x1, x2) and Kullback–Leibler divergence of the estimated density as a function of d]

. . . but density estimation quality decreases with d

SLIDE 9

HD clustering: blessing (1/2)

A two-component d-variate Gaussian mixture: π_1 = π_2 = 1/2, X_1|z_11 = 1 ∼ N_d(0, I), X_1|z_12 = 1 ∼ N_d(1, I)

Each variable provides an equal and own amount of separation information

The theoretical error decreases when d grows: err_theo = Φ(−√d/2) ...

[Figure: sample scatter plot (x1, x2) and empirical vs theoretical error rate err as a function of d]

. . . and the empirical error rate also decreases with d!
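This blessing is easy to reproduce numerically. A minimal simulation sketch of the setting above (the sample size and function name are our choices):

```python
import numpy as np
from scipy.stats import norm

def errors_vs_dimension(d_values, n=2000, seed=1):
    """Empirical vs theoretical error for the mixture N_d(0, I) / N_d(1, I)."""
    rng = np.random.default_rng(seed)
    for d in d_values:
        z = rng.integers(0, 2, size=n)                # true labels (pi_1 = pi_2)
        x = rng.standard_normal((n, d)) + z[:, None]  # mean 0 or 1 on every axis
        z_hat = (x.sum(axis=1) > d / 2).astype(int)   # optimal (Bayes) rule
        print(f"d={d:3d}  empirical={np.mean(z_hat != z):.3f}"
              f"  theoretical={norm.cdf(-np.sqrt(d) / 2):.3f}")

errors_vs_dimension([1, 2, 5, 10, 20])
```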

SLIDE 10

HD clustering: blessing (2/2)

FDA

[Figure: data projected on the first two FDA axes for d = 2, 20, 200 and 400]

SLIDE 11

HD clustering: curse (1/2)

Many variables provide no separation information. Same parameter setting except: X_1|z_12 = 1 ∼ N_d((1 0 ... 0)′, I)

Groups are not separated more when d grows: ‖µ_2 − µ_1‖_I = 1 ...

[Figure: sample scatter plot (x1, x2) and empirical vs theoretical error rate err as a function of d]

. . . thus the theoretical error is constant (= Φ(−1/2)) and the empirical error increases with d

SLIDE 12

HD clustering: curse (2/2)

Many variables provide redundant separation information. Same parameter setting except: X_1^j = X_1^1 + N_1(0, 1) (j = 2, ..., d)

Groups are not separated more when d grows: ‖µ_2 − µ_1‖_Σ = 1 ...

[Figure: sample scatter plot (x1, x2) and empirical vs theoretical error rate err as a function of d]

. . . thus err_theo is constant (= Φ(−1/2)) and the empirical error increases (more slowly) with d

SLIDE 13

The trade-off bias/variance

The fundamental statistical principle

Always minimize an error err between the truth (z) and the estimate (ẑ)

Gap between the true (z) and model-based (Z_p) partitions: z* = arg min_{z̃ ∈ Z_p} ∆(z, z̃)

Estimation ẑ of z* in Z_p: any relevant method (bias, consistency, efficiency...)

Fundamental decomposition of the observed error err(z, ẑ):

err(z, ẑ) = [err(z, z*) − err(z, z)] + [err(z, ẑ) − err(z, z*)]
          = bias + variance
          = error of approximation + error of estimation
SLIDE 14

Bias/variance in HD: reduce variance, accept bias

A two-component d-variate Gaussian mixture with intra-dependency: π_1 = π_2 = 1/2, X_1|z_11 = 1 ∼ N_d(0, Σ), X_1|z_12 = 1 ∼ N_d(1, Σ)

Each variable provides an equal and own amount of separation information

The theoretical error decreases when d grows: err_theo = Φ(−‖µ_2 − µ_1‖_{Σ⁻¹}/2)

The empirical error rate with the (true) intra-correlated model gets worse with d

The empirical error rate with the (false) intra-independent model gets better with d!

[Figure: sample scatter plot (x1, x2) and error rate vs d for the empirical correlated model, the empirical independence model and the theoretical error]

SLIDE 15

Some alternatives for reducing variance

- Dimension reduction in a non-canonical space (typically PCA-like)
- Dimension reduction in the canonical space (variable selection)
- Model parsimony in the initial HD space (constraints on model parameters)

But which kind of parsimony?

Remember that clustering is a way of dealing with large n. Why not reuse this idea for large d?

Co-clustering

It achieves parsimony of the row clustering through variable clustering

SLIDE 16

From clustering to co-clustering

[Govaert, 2011]

SLIDE 17

Bi-clustering

A generalization of co-clustering: look for submatrices of x which are homogeneous

We do not consider bi-clustering here

SLIDE 18

Outline

1. HD clustering
2. Modeling
3. Estimating
4. Selecting
5. BlockCluster in MASSICCC
6. To go further

SLIDE 19

Notations

z_i: the cluster of row i
w_j: the cluster of column j
(z_i, w_j): the block of the element x_ij (row i, column j)
z = (z_1, ..., z_n): partition of the individuals into K clusters of rows
w = (w_1, ..., w_d): partition of the variables into L clusters of columns
(z, w): bi-partition of the whole data set x
The two partition spaces are denoted by Z and W respectively

Restriction

All variables are of the same kind (see discussion at the end)

SLIDE 20

A geometric approach

Example in the continuous case: x ∈ R^{n×d}

It could be possible to define a within-block inertia criterion

W(z, w) = Σ_{k=1..K} Σ_{l=1..L} Σ_{i=1..n} Σ_{j=1..d} z_ik w_jl (x_ij − µ_kl)²

with µ_kl the center of block (k, l):

µ_kl = (1/n_kl) Σ_{i,j} z_ik w_jl x_ij, where n_kl = Σ_{i,j} z_ik w_jl is the sample size of block (k, l)

But we know now that it hides some model-based assumptions...
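For concreteness, a minimal sketch of this within-block inertia for hard partitions (function and variable names are ours):

```python
import numpy as np

def within_block_inertia(x, z, w, K, L):
    """W(z, w) from the formula above: x is an (n, d) continuous matrix,
    z gives row labels in 0..K-1, w gives column labels in 0..L-1."""
    W = 0.0
    for k in range(K):
        for l in range(L):
            block = x[np.ix_(z == k, w == l)]  # all cells of block (k, l)
            if block.size:                     # mu_kl is the block mean
                W += ((block - block.mean()) ** 2).sum()
    return W
```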

SLIDE 21

The latent block model (LBM)

Generalization of some existing non-probabilistic methods

Extends the latent class principle of local (or conditional) independence: x_ij is assumed to be independent once z_i and w_j are fixed (α = (α_kl)):

p(x|z, w; α) = ∏_{i,j} p(x_ij; α_{z_i w_j})

π = (π_k): vector of probabilities π_k that a row belongs to the k-th row cluster
ρ = (ρ_l): vector of probabilities ρ_l that a column belongs to the l-th column cluster
Independence between all z_i and w_j

Extension of the traditional mixture model-based clustering:

p(x; θ) = Σ_{(z,w) ∈ Z×W} ∏_i π_{z_i} ∏_j ρ_{w_j} ∏_{i,j} p(x_ij; α_{z_i w_j})
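To make the generative mechanism concrete, a minimal sketch of sampling from a Bernoulli LBM (function name and default values are ours):

```python
import numpy as np

def sample_bernoulli_lbm(n, d, pi, rho, alpha, seed=0):
    """Draw (x, z, w): pi (K,) row proportions, rho (L,) column
    proportions, alpha (K, L) Bernoulli parameter of each block."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=n, p=pi)     # independent row labels
    w = rng.choice(len(rho), size=d, p=rho)   # independent column labels
    # Local independence: x_ij ~ B(alpha[z_i, w_j]) given (z, w)
    x = rng.binomial(1, alpha[np.ix_(z, w)])
    return x, z, w

x, z, w = sample_bernoulli_lbm(500, 10, [0.5, 0.5], [0.5, 0.5],
                               np.array([[0.9, 0.1], [0.2, 0.8]]))
```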

SLIDE 22

Distribution for different kinds of data

[Govaert and Nadif, 2014] The pdf p(·; α_{z_i w_j}) depends on the kind of data x_ij:

- Binary data: x_ij ∈ {0, 1}, p(·; α_kl) = B(α_kl)
- Categorical data with m levels: x_ij = (x_ijh) ∈ {0, 1}^m with Σ_{h=1..m} x_ijh = 1, and p(·; α_kl) = M(α_kl) with α_kl = (α_klh)
- Count data: x_ij ∈ N, p(·; α_kl) = P(µ_k ν_l γ_kl)²
- Continuous data: x_ij ∈ R, p(·; α_kl) = N(µ_kl, σ²_kl)

²The Poisson parameter is here split into µ_k and ν_l, the effects of row k and column l respectively, and γ_kl, the effect of block (k, l). Unfortunately, this parameterization is not identifiable: it is not possible to estimate µ_k, ν_l and γ_kl simultaneously without imposing further constraints. The constraints Σ_k π_k γ_kl = Σ_l ρ_l γ_kl = 1, Σ_k µ_k = 1 and Σ_l ν_l = 1 are a possibility.

SLIDE 23

Extreme parsimony ability

Model         Number of parameters
Binary        dim(π) + dim(ρ) + KL
Categorical   dim(π) + dim(ρ) + KL(m − 1)
Contingency   dim(π) + dim(ρ) + KL
Continuous    dim(π) + dim(ρ) + 2KL

Very parsimonious, thus well suited to the (ultra) HD setting:

nb. param._HD = nb. param._classic × (L/d)

Other advantage: it stays in the canonical space, thus meaningful for the end-user

SLIDE 24

Binary illustration: easy interpretation

[Govaert, 2011]

SLIDE 25

Binary illustration: user-friendly visualization

[Govaert, 2011]

n = 500, d = 10, K = 6, L = 4

SLIDE 26

Other kind of data: ordinal

[Jacques and Biernacki, 2018]

SLIDE 27

Other kind of data: functional

[Jacques, 2016]

SLIDE 28

Other kind of data: image

SLIDE 29

Particular case: graph clustering

Stochastic Block Model (SBM): adjacency matrix with n = d and K = L

SLIDE 30

Outline

1. HD clustering
2. Modeling
3. Estimating
4. Selecting
5. BlockCluster in MASSICCC
6. To go further

SLIDE 31

MLE estimation: log-likelihood(s)

Remember Lesson 3: first estimate θ, then deduce the estimate of (z, w)

Observed log-likelihood: ℓ(θ; x) = ln p(x; θ)

MLE: θ̂ = arg max_θ ℓ(θ; x)

Complete log-likelihood:

ℓ_c(θ; x, z, w) = ln p(x, z, w; θ) = Σ_{i,k} z_ik ln π_k + Σ_{j,l} w_jl ln ρ_l + Σ_{i,j,k,l} z_ik w_jl ln p(x_ij; α_kl)

Be careful with asymptotics...

If ln(d)/n → 0 and ln(n)/d → 0 when n → ∞ and d → ∞, then the MLE is consistent [Brault et al., 2017]
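A minimal sketch of this complete log-likelihood in the Bernoulli case (our own naming, with hard labels):

```python
import numpy as np

def complete_loglik_bernoulli(x, z, w, pi, rho, alpha, eps=1e-10):
    """l_c(theta; x, z, w) from the formula above, for binary x (n, d),
    hard row labels z (n,) and hard column labels w (d,)."""
    a = alpha[np.ix_(z, w)]  # alpha[z_i, w_j] for every cell (i, j)
    return (np.log(pi[z] + eps).sum()          # sum_i ln pi_{z_i}
            + np.log(rho[w] + eps).sum()       # sum_j ln rho_{w_j}
            + (x * np.log(a + eps)
               + (1 - x) * np.log(1 - a + eps)).sum())
```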

SLIDE 32

MLE estimation: EM algorithm

E-step of EM (iteration q):

Q(θ, θ^(q)) = E[ℓ_c(θ; x, z, w) | x; θ^(q)]
            = Σ_{i,k} t_ik^(q) ln π_k + Σ_{j,l} s_jl^(q) ln ρ_l + Σ_{i,j,k,l} e_ijkl^(q) ln p(x_ij; α_kl)

with t_ik^(q) = p(z_i = k | x; θ^(q)), s_jl^(q) = p(w_j = l | x; θ^(q)) and e_ijkl^(q) = p(z_i = k, w_j = l | x; θ^(q))

M-step of EM (iteration q): classical. For instance, in the Bernoulli case it gives

π_k^(q+1) = Σ_i t_ik^(q) / n,  ρ_l^(q+1) = Σ_j s_jl^(q) / d,  α_kl^(q+1) = Σ_{i,j} e_ijkl^(q) x_ij / Σ_{i,j} e_ijkl^(q)

SLIDE 33

MLE: intractable E step

e_ijkl^(q) is usually intractable...
- A consequence of the dependency between the x_ij (link between rows and columns)
- Involves K^n L^d computations (number of possible blocks)
- Example: if n = d = 20 and K = L = 2, then 10^12 blocks
- Example (cont'd): 33 years with a computer calculating 100,000 blocks/second

Alternatives to EM
- Variational EM (numerical approx.): conditional independence assumption p(z, w|x; θ) ≈ p(z|x; θ) p(w|x; θ)
- SEM-Gibbs (stochastic approx.): replace the E-step by an S-step, approximated by Gibbs sampling of z|x, w; θ and w|x, z; θ

SLIDE 34

MLE: variational EM (1/2)

Use a general variational result from [Hathaway, 1985]: maximizing ℓ(θ; x) over θ is equivalent to maximizing ℓ̃_c(θ; x, e) over (θ, e), where

ℓ̃_c(θ; x, e) = Σ_{i,k} t_ik ln π_k + Σ_{j,l} s_jl ln ρ_l + Σ_{i,j,k,l} e_ijkl ln p(x_ij; α_kl)

with e = (e_ijkl), e_ijkl ∈ [0, 1], Σ_{k,l} e_ijkl = 1, t_ik = Σ_{j,l} e_ijkl, s_jl = Σ_{i,k} e_ijkl

Of course, maximizing ℓ(θ; x) or ℓ̃_c(θ; x, e) is intractable in both cases

Idea: restrict e to obtain tractability: e_ijkl = t_ik s_jl. The new variables are thus t = (t_ik) and s = (s_jl). As a consequence, it is a maximization of a lower bound of the maximum likelihood:

max_θ ℓ(θ; x) ≥ max_{θ,t,s} ℓ̃_c(θ; x, e)

SLIDE 35

MLE: variational EM (2/2)

Approximated E-step:

Q(θ, θ^(q)) ≈ Σ_{i,k} t_ik^(q) ln π_k + Σ_{j,l} s_jl^(q) ln ρ_l + Σ_{i,j,k,l} t_ik^(q) s_jl^(q) ln p(x_ij; α_kl)

We now call it VEM. Also known as the mean-field approximation. Consistency of the variational estimate [Brault et al., 2017]
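Putting the approximated E-step and the classical M-step together, here is a minimal toy VEM for the Bernoulli LBM (our own sketch under the mean-field restriction above, not the BlockCluster implementation):

```python
import numpy as np

def vem_bernoulli_lbm(x, K, L, n_iter=100, seed=0, eps=1e-10):
    """Variational EM for a Bernoulli latent block model (sketch).

    x: (n, d) binary matrix. Returns parameter estimates and the
    soft memberships t (rows) and s (columns)."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    t = rng.dirichlet(np.ones(K), size=n)  # t[i, k] ~ p(z_i = k | x)
    s = rng.dirichlet(np.ones(L), size=d)  # s[j, l] ~ p(w_j = l | x)
    for _ in range(n_iter):
        # M-step: closed forms of the Bernoulli case
        pi, rho = t.mean(axis=0), s.mean(axis=0)
        alpha = (t.T @ x @ s) / (np.outer(t.sum(0), s.sum(0)) + eps)
        alpha = np.clip(alpha, eps, 1 - eps)
        # Variational E-step: update t given s, then s given t
        log_a, log_1a = np.log(alpha), np.log(1 - alpha)
        log_t = np.log(pi + eps) + x @ s @ log_a.T + (1 - x) @ s @ log_1a.T
        t = np.exp(log_t - log_t.max(axis=1, keepdims=True))
        t /= t.sum(axis=1, keepdims=True)
        log_s = np.log(rho + eps) + x.T @ t @ log_a + (1 - x).T @ t @ log_1a
        s = np.exp(log_s - log_s.max(axis=1, keepdims=True))
        s /= s.sum(axis=1, keepdims=True)
    return pi, rho, alpha, t, s
```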

SLIDE 36

MLE: local maxima

More local maxima than in classical mixture models: a consequence of having many more latent variables (blocks). Thus: either run VEM many times, or use the SEM-Gibbs algorithm.

SLIDE 37

MLE: SEM-Gibbs

We have already seen the SEM algorithm in Lesson 3 (thus we do not detail it further). It limits the dependency on the starting point, so it limits local maxima.

The S-step: a draw (z^(q), w^(q)) ∼ p(z, w | x; θ^(q)) instead of an expectation. But it is still intractable, thus use a Gibbs algorithm to approximate this draw.

Approximated S-step: two easy draws

z^(q) ∼ p(z | w^(q−1), x; θ^(q))  and  w^(q) ∼ p(w | z^(q), x; θ^(q))

Rigorously speaking, many draws should be performed within the S-step: Gibbs has to reach a stochastic convergence. In practice it works well while saving computation time.
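A minimal sketch of this approximated S-step for the Bernoulli LBM (our own naming; theta = (pi, rho, alpha) is assumed given):

```python
import numpy as np

def sem_gibbs_step(x, pi, rho, alpha, w, rng, eps=1e-10):
    """One approximated S-step: draw z | w, x; theta, then w | z, x; theta,
    instead of the intractable expectation over all K^n L^d blocks."""
    K, L = alpha.shape
    log_a, log_1a = np.log(alpha + eps), np.log(1 - alpha + eps)
    s_ind = np.eye(L)[w]                         # (d, L) one-hot columns
    log_pz = np.log(pi + eps) + x @ s_ind @ log_a.T + (1 - x) @ s_ind @ log_1a.T
    pz = np.exp(log_pz - log_pz.max(axis=1, keepdims=True))
    pz /= pz.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(K, p=p) for p in pz])
    z_ind = np.eye(K)[z]                         # (n, K) one-hot rows
    log_pw = np.log(rho + eps) + x.T @ z_ind @ log_a + (1 - x).T @ z_ind @ log_1a
    pw = np.exp(log_pw - log_pw.max(axis=1, keepdims=True))
    pw /= pw.sum(axis=1, keepdims=True)
    w = np.array([rng.choice(L, p=p) for p in pw])
    return z, w
```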

SLIDE 38

MLE: degeneracy

More degenerate situations than in classical mixture models: again a consequence of having many more latent variables (blocks). Bayesian regularization (instead of the MLE) can be an answer.

SLIDE 39

Illustration of a degenerate situation

[Figure: estimated density along SEM iterations 1, 2, 50 and 77–82, showing the progressive degeneracy of a component]

SLIDE 40

Bayesian estimation: pitch

Everything passes through the posterior distribution of θ:

p(θ | x) ∝ p(x | θ) × p(θ)
           (likelihood)  (prior)

Then take (for instance) the MAP as the θ estimate (using a VEM-like algorithm):

θ̂ = arg max_θ p(θ | x)

SLIDE 41

Bayesian estimation: limiting degeneracy

The interest of the prior for avoiding degeneracy: it acts as a penalization term. Typical choices are Dirichlet for π and ρ (with independence between π, ρ, α):

p(θ) = p(π) × p(ρ) × p(α), with p(π) = D_K(a, ..., a), p(ρ) = D_L(a, ..., a) and p(α) model dependent

The Dirichlet distribution is conjugate, thus easy calculus. Control the degeneracy frequency with the value of a:
- a = 1: uniform prior, so θ̂ is strictly the MLE (no regularization)
- a = 1/2: Jeffreys prior, classical (non-informative) but may favor degeneracy
- a = 4: a rule of thumb working well for limiting the degeneracy frequency
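As an illustration of how such a Dirichlet prior regularizes, a minimal sketch of the MAP update of the row proportions (the function name is ours):

```python
def map_proportions(t, a=4.0):
    """MAP of pi under a Dirichlet(a, ..., a) prior, t being the (n, K)
    soft row memberships. a = 1 gives back the MLE; a > 1 adds (a - 1)
    pseudo-counts per cluster, keeping proportions away from 0."""
    n, K = t.shape
    return (t.sum(axis=0) + a - 1.0) / (n + K * (a - 1.0))
```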

SLIDE 42

Bayesian estimation: prior overview

SLIDE 43

Block estimation: estimate

Once we have a parameter estimate θ̂, we need a block estimate (ẑ, ŵ)

But the MAP is not directly available because of the following maximization difficulty:

(ẑ, ŵ) = arg max_{(z,w)} p(z, w | x; θ̂)   (intractable)

Instead, the following estimates (easy, as for classical mixtures) are usually retained:

ẑ = arg max_z p(z | x; θ̂)  and  ŵ = arg max_w p(w | x; θ̂)

SLIDE 44

Block estimation: evaluation

Empirical error rate between the "true" blocks (z, w) and the estimated blocks (ẑ, ŵ):

err_blocks((z, w), (ẑ, ŵ)) = err(z, ẑ) + err(w, ŵ) − err(z, ẑ) × err(w, ŵ)

Rand index between blocks: a recent definition also exists...
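A one-line check of this formula (the numbers below are just an illustration):

```python
def err_blocks(err_rows, err_cols):
    """Block error rate from the row and column error rates (formula above):
    a cell's block is wrong as soon as its row or its column cluster is wrong."""
    return err_rows + err_cols - err_rows * err_cols

print(err_blocks(0.05, 0.10))  # 0.145
```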

SLIDE 45

Block estimation: consistency

[Mariadassou and Matias, 12]

θ̂ → θ* as n, d → ∞ (we have seen that...)

⇒ p(ẑ = z*, ŵ = w* | x; θ̂) → 1 as n, d → ∞ (exact bi-partition retrieval!)

Thus we retrieve the HD clustering blessing...

SLIDE 46

Block estimation: non-asymptotic properties (1/2)

Binary case: the marginals are seen as simple mixtures! [Brault, 14]

SLIDE 47

Block estimation: non-asymptotic properties (2/2)

[Brault, 14]

The probability of x_ij with no regard to the column membership is Bernoulli:

p(x_ij = 1 | z_ik = 1) = τ_k = Σ_{l=1..L} α_kl ρ_l

Thus the marginal distribution of the row sums is a binomial mixture (independence of the x_ij conditionally on z_ik = 1):

(Σ_j x_ij) | z_ik = 1 ∼ B(d, τ_k)

Control of the error on the partition estimate ẑ_mix of this mixture of binomial distributions:

p(ẑ_mix ≠ z*) ≤ 2n exp(−(d/8) min_{k≠k′} |τ_k − τ_k′|²) + K (1 − min_k π_k)^n

(the min term measures the overlap between clusters)

We also retrieve consistency for very high dimension, under the constraint ln(n) = o(d)

SLIDE 48

Illustration: document clustering (1/2)

- Mixture of 1033 medical summaries and 1398 aeronautics summaries
- Rows: 2431 documents
- Columns: the words present (stop words excluded), thus 9275 unique words
- Data matrix: document × word cross-counts
- Poisson model

SLIDE 49

Illustration: document clustering (2/2)

Results with 2×2 blocks:

             Medline   Cranfield
Medline      1033      0
Cranfield    0         1398

The experiment illustrates the previous theory: HD clustering is a blessing

SLIDE 50

Outline

1. HD clustering
2. Modeling
3. Estimating
4. Selecting
5. BlockCluster in MASSICCC
6. To go further

SLIDE 51

Models in competition

m = (K, L) typically, but not restricted to it

SLIDE 52

BIC criterion: two difficulties

Difficulty 1: which BIC definition, because of the double asymptotics in n and d?

Difficulty 2: the observed log-likelihood value is intractable:

ℓ(θ; x) = ln Σ_{(z,w) ∈ Z×W} p(x, z, w; θ)

It could be estimated by a harmonic mean, but this is time consuming and has high variance

SLIDE 53

ICL criterion: overcome both difficulties

ICL uses the complete likelihood, thus no intractability:

ICL = ln p(x, ẑ, ŵ) = ln p(x | ẑ, ŵ) + ln p(ẑ) + ln p(ŵ)

Multinomial case (r levels) [Keribin et al., 2014]:
- Derive an exact (non-asymptotic) ICL version
- Deduce an asymptotic approximation of ICL:

ICL_bic = ℓ_c(θ̂; x, ẑ, ŵ) − (K − 1)/2 ln(n) − (L − 1)/2 ln(d) − KL(r − 1)/2 ln(nd)

We can make a conjecture for the general case:

ICL_bic = ℓ_c(θ̂; x, ẑ, ŵ) − (K − 1)/2 ln(n) − (L − 1)/2 ln(d) − KL ν_α/2 ln(nd)

with ν_α the number of free parameters of each block distribution α_kl
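A minimal sketch of the asymptotic criterion (our own naming; nu_alpha is the per-block parameter count, e.g. r − 1 in the multinomial case, 2 in the Gaussian case):

```python
import numpy as np

def icl_bic(comp_loglik, n, d, K, L, nu_alpha):
    """ICL_bic from the formula above, given the complete log-likelihood
    l_c(theta_hat; x, z_hat, w_hat) of a fitted (K, L) latent block model."""
    return (comp_loglik
            - (K - 1) / 2 * np.log(n)
            - (L - 1) / 2 * np.log(d)
            - K * L * nu_alpha / 2 * np.log(n * d))
```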

SLIDE 54

ICL criterion: consistency

We can obtain a BIC expression from ICL_bic:

BIC = ICL_bic − ln p(ẑ, ŵ | x; θ̂) = ℓ(θ̂; x) − (K − 1)/2 ln(n) − (L − 1)/2 ln(d) − KL(m − 1)/2 ln(nd)

where ℓ(θ̂; x) is the difficult term

[Brault et al., 2017] establish that, asymptotically in n and d, "ℓ(θ̂; x) = ℓ_c(θ̂; x, ẑ, ŵ)"

Thus, since BIC is consistent, ICL is also consistent. Again, the HD clustering blessing is here!

SLIDE 55

Strategy for smart browsing of (K, L)

[Robert, 2017] Algorithm Bi-KM1

SLIDE 56

Illustration: discuss the dimension (1/2)

SPAM E-mail Database³

- n = 4601 e-mails composed of 1813 "spams" and 2788 "good e-mails"
- d = 48 + 6 = 54 continuous descriptors⁴
  - 48 percentages that a given word appears in an e-mail ("make", "you", ...)
  - 6 percentages that a given char appears in an e-mail (";", "$", ...)
- Transformation of the continuous descriptors into binary ones: x_ij = 1 if word/char j appears in e-mail i, 0 otherwise

³https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/
⁴There are 3 other continuous descriptors we do not use

SLIDE 57

Illustration: discuss the dimension (2/2)

Perform co-clustering with K = 2 and L = 5: ICL = −92,682, err = 0.1984

[Figure: original data matrix vs co-clustered data matrix]

Perform clustering⁵ with K = 2: ICL = −89,433, err = 0.1837

Thus preferably use co-clustering in the HD setting

⁵Equivalent to co-clustering with L = 54

SLIDE 58

Outline

1. HD clustering
2. Modeling
3. Estimating
4. Selecting
5. BlockCluster in MASSICCC
6. To go further

SLIDE 59

MASSICCC platform for the BLOCKCLUSTER software

https://massiccc.lille.inria.fr/

SLIDE 60

MASSICCC?

A high-quality and easy-to-use web platform where mature research software for clustering (and more) is transferred to (non-academic) professionals

SLIDE 61

Here is the computer you need!

SLIDE 62

Running BlockCluster

SLIDE 63

Running BlockCluster

SLIDE 64

Running BlockCluster

SLIDE 65

Running BlockCluster

SLIDE 66

Outline

1. HD clustering
2. Modeling
3. Estimating
4. Selecting
5. BlockCluster in MASSICCC
6. To go further

SLIDE 67

Co-clustering of mixed data

Same partition on the rows, disjoint partitions on the columns. Example: the TED talks data set, with talks × (terms, scores)
