SLIDE 1

Clustering Multivariate Binary Outcomes with Restricted Latent Class Models: A Bayesian Approach

Zhenke Wu

Assistant Professor of Biostatistics
School of Public Health, University of Michigan, Ann Arbor

Joint Statistical Meetings 2018, Vancouver
August 2, 2018

zhenkewu@umich.edu
zhenkewu.com

R Package: rewind

https://github.com/zhenkewu/rewind

SLIDE 2

Motivating Example

Aug 2, 2018 JSM2018 zhenkewu@umich.edu

[Figure: six panels. Two panels, "($Y_{il}$): data" and "($Y_{il}$): design", with axes Dimension (1:L) and Subject (1:N). Four Subject (1:N) by Subject (1:N) panels: "Hierarchical Clustering (cut with true # clusters)", "Standard LCA (true # clusters)", "Subset clustering (Hoff, 2005)", and "Proposed".]

SLIDE 3

Accurate clustering of multivariate binary data that 1) automatically selects feature subsets and 2) works well for unbalanced cluster sizes.

Take-away: We achieve this goal via Boolean matrix decomposition or, more generally, restricted latent class models.

SLIDE 4

Boolean Matrix Decomposition (noise-free version) (a special case of restricted latent class models)

Broad Applications:

  • Medicine: clustering based on autoantibodies in autoimmune diseases
  • Disease epidemiology: childhood pneumonia etiology estimation
  • Purchasing behavior: grocery shopping
  • Computer Science: text mining
  • Educational assessment: cognitive classifications
  • Mobile health: latent constructs, e.g., engagement with interventions, vulnerability and receptivity
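As a minimal sketch of the noise-free decomposition named in this slide's title, the snippet below forms the Boolean product of a subject-by-state matrix and a state-by-dimension matrix. The function name `boolean_product` and the toy matrices are assumptions made for illustration; they are not taken from the rewind package.

```python
from typing import List

Matrix = List[List[int]]


def boolean_product(eta: Matrix, Q: Matrix) -> Matrix:
    """Noise-free Boolean product: Y[i][l] = OR over m of (eta[i][m] AND Q[m][l])."""
    n_latent = len(Q)
    n_dims = len(Q[0])
    return [
        [int(any(row[m] and Q[m][l] for m in range(n_latent))) for l in range(n_dims)]
        for row in eta
    ]


# Hypothetical toy example: M = 2 latent states, L = 4 dimensions, N = 3 subjects.
Q = [
    [1, 1, 0, 0],  # latent state 1 switches on dimensions 1 and 2
    [0, 1, 1, 0],  # latent state 2 switches on dimensions 2 and 3
]
eta = [
    [1, 0],  # subject 1 carries latent state 1 only
    [0, 1],  # subject 2 carries latent state 2 only
    [1, 1],  # subject 3 carries both latent states
]
Y = boolean_product(eta, Q)
```

Subjects that share a row of `eta` produce identical rows of `Y`, which is why distinct latent state patterns can be read as clusters.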

SLIDE 5

Statistical Formulation

SLIDE 6

Model Setup: Quick Overview

  • Data: $Y_i = (Y_{i1}, \ldots, Y_{iL})^\top \in \{0,1\}^L$, $i = 1, \ldots, N$
  • Latent state vector: $\eta_i \in \mathcal{A} \subset \{0,1\}^M$
  • Latent dimension: M
  • Latent class: K distinct patterns of $\eta_i$
  • The number of clusters, K, is unknown (no greater than $2^M$)
  • Q-matrix (M by L; binary): Q

SLIDE 7

Model Setup: Quick Overview

1) Given a latent state dimension M, specify the likelihood $[Y_i \mid \eta_i, Q]$ via restricted latent class models (RLCM), with conditional independence. For example, for dimension l:

  • Needs just one required state in $\{m : Q_{ml} = 1\}$ for a positive ideal response $\Gamma_{il} = 1$.
  • Referred to as the partially latent class model in epidemiology (Wu et al., 2016); Deterministic Input, Noisy "Or" gate (DINO) in psychology (Junker and Sijtsma, 2001); non-negative matrix factorization if rows of Q are orthogonal (Lee and Seung, 1999)
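One way to write the "just one required state" rule above is as a Boolean OR over the required states. The display below is a sketch under stated assumptions: the per-dimension true positive rate $\theta_l$ and false positive rate $\psi_l$ are symbols introduced here for illustration and may not match the notation on the original slide. It uses the convention $0^0 = 1$.

```latex
\Gamma_{il} = 1 - \prod_{m=1}^{M} (1 - \eta_{im})^{Q_{ml}},
\qquad
P(Y_{il} = 1 \mid \eta_i, Q) = \theta_l^{\Gamma_{il}} \, \psi_l^{\,1 - \Gamma_{il}} .
```

Under conditional independence, the likelihood for subject i would then be the product of these Bernoulli terms over dimensions l = 1, ..., L.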

SLIDE 8

Model Setup: Quick Overview

In two steps,

1) Given a latent state dimension M, first specify the likelihood $[Y_i \mid \eta_i, Q]$ via restricted latent class models (RLCM), with conditional independence.

2) Specify a prior distribution $[\eta_i, i = 1, \ldots, N]$ obtained from a clustering mechanism with an unknown number of clusters K (represented by cluster assignment indicators $\{Z_i, i = 1, \ldots, N\}$). We use a mixture of finite mixtures (Miller and Harrison, 2017 JASA).
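To make the clustering mechanism in step 2 concrete, here is a minimal sketch of drawing cluster assignment indicators from a mixture-of-finite-mixtures prior: draw a number of components, draw symmetric Dirichlet weights, then assign each subject. The prior K - 1 ~ Poisson(rate), the values of `gamma` and `rate`, and all function names are assumptions chosen for illustration, not the settings used in the talk or in rewind.

```python
import math
import random
from typing import List


def sample_poisson(rate: float, rng: random.Random) -> int:
    """Draw from Poisson(rate) with Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_mfm_assignments(
    n_subjects: int, gamma: float, rate: float, rng: random.Random
) -> List[int]:
    """Draw cluster indicators for n_subjects from a mixture-of-finite-mixtures prior."""
    # Assumed prior on the number of components: K - 1 ~ Poisson(rate).
    n_components = 1 + sample_poisson(rate, rng)
    # Symmetric Dirichlet(gamma) weights obtained by normalizing Gamma draws.
    raw = [rng.gammavariate(gamma, 1.0) for _ in range(n_components)]
    total = sum(raw)
    weights = [w / total for w in raw]
    # Each subject picks a component independently given the weights.
    return rng.choices(range(n_components), weights=weights, k=n_subjects)


rng = random.Random(2018)
Z = sample_mfm_assignments(n_subjects=50, gamma=1.0, rate=2.0, rng=rng)
n_occupied = len(set(Z))  # occupied clusters can be fewer than the components drawn
```

Subjects with the same indicator would share one latent state pattern, so the number of occupied clusters is random rather than fixed in advance.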

SLIDE 9

Challenges: Boolean Matrix Decomposition (an example of restricted latent class models)

  • C1. High-dimensional discrete space
      Sparse priors that encourage:
      1. small # of latent state dimensions
      2. small # of distinct latent state patterns
  • C2. Unknown number of latent state dimensions
      Infinite dimension model (based on semi-ordered formulation of Indian Buffet Process); identifiability issue
  • C3. Unknown number of clusters (i.e., # latent classes)
      Mixture of finite mixture model
  • T1. Identifiability of model parameters based on likelihood only
      Open and frontier problem; exciting progress at Michigan

C: computational; T: theoretical

SLIDE 10

Comparison of variants of latent class analysis of multivariate binary data

SLIDE 11

Data

SLIDE 12
  • 76 autoantibody patterns from patients with rheumatic disease & cancer
  • all were negative for autoantibodies against prominent defined specificities

Can an algorithm be developed to identify common autoantibody signatures? And estimate clusters among patients?

SLIDE 13

Raw Intensity Scan Data (20 lanes on a single gel)

[Figure: 20 panels, Lane 1 through Lane 20; axes: Location on Gel ($t_j^{(g)}$) and Intensity.]

SLIDE 14

Scientific Questions

  • How many clusters? What are the clusters? [the clustering problem]
  • How many machines are there and what are the component auto-antigens? [estimation of latent state dimensions]
  • What makes the clusters different in terms of presence or absence of machines? [interpretability of the clusters]

SLIDE 15

Preprocessing Step I-a: Automated Peak Detection

Example: Gel Set 1

[Figure: detected peaks, marked with *, for Lanes 1 to 20; axes: Location on Gel ($t_j^{(g)}$) and Lane.]

SLIDE 16

Align the peaks (Wu et al., 2017)

R package: spotgear

SLIDE 17

Posterior Computation

SLIDE 18

Posterior Computation

  • Designed and implemented MCMC algorithms that deal with
      a) an unknown number of clusters (mixture of finite mixture models; split-merge), and
      b) an unknown number of machines (slice sampler for infinite Indian Buffet Process).
  • Also works for a pre-specified number of machines.

R package: rewind

SLIDE 19

Simulation

SLIDE 20

Simulation Setup

[Figure: three panels. "($\Gamma_{il}$): Design Matrix", Subject (1:N) by Dimension (1:L); "($\eta_{im}$): Latent State", Subject (1:N) by Latent State (1:M); "($Q_{ml}$): True Q (ordered)", Latent State (1:M) by Dimension (1:L). Axis ticks run to 50 for subjects, 100 for dimensions, and 3 for latent states.]
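A simulation of this kind can be sketched as follows: build the design matrix from latent states and Q, then flip entries with per-cell noise. The sizes N = 50, L = 100, M = 3 are read off the axis ticks of this slide; the sparsity levels, the true and false positive rates, and the function name `simulate_binary_data` are assumptions for illustration only and are not the settings used in the talk.

```python
import random
from typing import List

Matrix = List[List[int]]


def simulate_binary_data(
    Gamma: Matrix, true_positive_rate: float, false_positive_rate: float,
    rng: random.Random,
) -> Matrix:
    """Draw Y[i][l] ~ Bernoulli(TPR) where Gamma[i][l] == 1, else Bernoulli(FPR)."""
    return [
        [
            int(rng.random() < (true_positive_rate if g else false_positive_rate))
            for g in row
        ]
        for row in Gamma
    ]


rng = random.Random(0)
N, L, M = 50, 100, 3  # sizes suggested by the axis ticks on this slide

# Assumed sparsity: each state requires about 20% of dimensions,
# and each subject carries each state with probability 0.5.
Q = [[int(rng.random() < 0.2) for _ in range(L)] for _ in range(M)]
eta = [[int(rng.random() < 0.5) for _ in range(M)] for _ in range(N)]

# Design matrix: a cell is 1 when the subject has at least one required state.
Gamma = [
    [int(any(eta[i][m] and Q[m][l] for m in range(M))) for l in range(L)]
    for i in range(N)
]

# Assumed noise level: 90% true positive rate, 10% false positive rate.
Y = simulate_binary_data(Gamma, 0.9, 0.1, rng)
```

Moving the two rates closer together would give noisier data, which is one way to produce low, intermediate, and high noise scenarios.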

SLIDE 21

Recovery of the matrix Q (low noise)

SLIDE 22

Recovery of the matrix Q (intermediate noise)

SLIDE 23

Recovery of the matrix Q (high noise)

SLIDE 24

Preliminary clustering results based on machine models

Data: CTP negative sera
Method: Bayesian machine-based restricted latent class analysis
Figure: Three estimated clusters (top three panels) with distinct enrichment of three distinct estimated machines (bottom panel)
Colored labels: red, blue, green, for clusters obtained by the standard method; this algorithm is agnostic to them.

SLIDE 25

Main Points Once Again

  • Goal: Based on multivariate binary data, find scientifically structured, interpretable clusters
  • Proposed a framework for clustering using restricted latent class models

SRF

  • Designed and implemented MCMC algorithms that deal with unknown number of clusters and machines; Bayesian binary factorization algorithm
  • Superior clustering performance compared to standard analyses; improved estimation under unbalanced cluster sizes.

SLIDE 26

References

Wu Z, Casciola-Rosen L, Shah A, Rosen A, Zeger SL. Estimating autoantibody signatures to detect autoimmune disease patient subsets. Biostatistics. In press. doi: 10.1093/biostatistics/kxx061

Wu Z, Zeger SL. Clustering Multivariate Binary Outcomes with Restricted Latent Class Models: A Bayesian Approach. Working paper.

Wu Z, Deloria-Knoll M, Hammitt LL, Zeger SL, for the PERCH Core Team. Partially-latent class models (pLCM) for case-control studies of childhood pneumonia etiology. Journal of the Royal Statistical Society, Series C. 65: 97-114, 2016.

Wu Z, Deloria-Knoll M, Zeger SL. Nested, partially-latent class models for dependent binary data with application to estimating disease etiology. Biostatistics. 18(2): 200-213, 2016.

SLIDE 27

Open Source Software

  • spotgear: Subset Profiling and Organizing Tools for Gel Electrophoresis Autoradiography in R
  • rewind: Reconstructing Etiology with Binary Decomposition

Available from https://github.com/zhenkewu

SLIDE 28

Thank you