Reasoning about Sets using Redescription Mining Mohammed J. Zaki - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Reasoning about Sets using Redescription Mining

Mohammed J. Zaki Naren Ramakrishnan

zaki@cs.rpi.edu naren@cs.vt.edu

slide-2
SLIDE 2

What are redescriptions?

A shift-of-vocabulary, or a different way of communicating a given piece of information.

slide-3
SLIDE 3

Input to Redescription Mining

France Brazil Chile UK USA Russia Canada Argentina Cuba China

B R G Y

slide-4
SLIDE 4

Input to Redescription Mining (contd.)

France Brazil Chile UK USA Russia Canada Argentina Cuba China

B Y R G

slide-5
SLIDE 5

Input to Redescription Mining (contd.)

France Brazil Chile UK USA Russia Canada Argentina Cuba China

R B Y G

slide-6
SLIDE 6

Input to Redescription Mining (contd.)

France Brazil Chile UK USA Russia Canada Argentina Cuba China

G B R Y

slide-7
SLIDE 7

Input to Redescription Mining (contd.)

France Brazil Chile UK USA Russia Canada Argentina Cuba China

Y B R G

slide-8
SLIDE 8

Input to Redescription Mining (contd.)

France Brazil Chile UK USA Russia Canada Argentina Cuba China

G R B Y

slide-9
SLIDE 9

Basic Problem

Given

  • a set O of objects (e.g., countries)
  • a collection of subsets (descriptors) of O

Find

  • subsets of O that can be defined in at least two ways
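A minimal brute-force sketch of this problem statement, in Python. It is an illustration only, not the authors' algorithm: it enumerates single descriptors, pairwise intersections (AND), and pairwise differences (EXCEPT), then reports pairs of expressions over disjoint descriptors that pick out the same objects. The descriptor names (`large_area` and so on) are hypothetical labels, and the memberships are read off the country example used in this deck.

```python
from itertools import combinations, permutations

# Descriptor name -> set of objects (names are hypothetical labels).
descriptors = {
    "large_area": {"Canada", "Russia", "China", "USA"},
    "americas_tourism": {"Argentina", "Canada", "Brazil", "Chile", "USA"},
    "un_permanent": {"France", "China", "Russia", "USA", "UK"},
    "communist_history": {"China", "Cuba", "Russia"},
}


def candidate_expressions(descs):
    """Yield (label, descriptor names used, object set) for simple expressions."""
    for name, objs in descs.items():
        yield name, {name}, objs
    for a, b in combinations(descs, 2):
        yield f"{a} AND {b}", {a, b}, descs[a] & descs[b]
    for a, b in permutations(descs, 2):
        yield f"{a} EXCEPT {b}", {a, b}, descs[a] - descs[b]


def find_redescriptions(descs):
    """Group expressions by the object set they define; keep disjoint pairs."""
    by_objects = {}
    for label, used, objs in candidate_expressions(descs):
        if objs:  # ignore expressions that select nothing
            by_objects.setdefault(frozenset(objs), []).append((label, used))
    found = []
    for objs, exprs in by_objects.items():
        for (label1, used1), (label2, used2) in combinations(exprs, 2):
            if used1.isdisjoint(used2):
                found.append((label1, label2, set(objs)))
    return found


for left, right, objs in find_redescriptions(descriptors):
    print(f"{left}  <=>  {right}  : {sorted(objs)}")
```

On this data the only pair reported is `un_permanent AND communist_history` with `large_area EXCEPT americas_tourism`, both defining {China, Russia}. Exhaustive enumeration like this grows quickly with the number of descriptors, which is why the deck turns to lattice-based mining.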
slide-10
SLIDE 10

A Redescription

{Canada, Russia, China, USA} EXCEPT {Argentina, Canada, Brazil, Chile, USA}

=

{France, China, Russia, USA, UK} AND {China, Cuba, Russia}

‘Countries with land area > 3,000,000 sq. miles’ − ‘Tourist Destinations in the Americas’ ⇔ ‘Permanent members of U.N. Security Council’ ∩ ‘Countries with history of communism’

slide-11
SLIDE 11

Redescription is sort of like ...

association rule mining

  • generalize from implications to equivalences

conceptual clustering

  • find clusters with dual characterizations

constructive induction

  • build features that mutually reinforce each other
slide-12
SLIDE 12

Applications in Bioinformatics

(Gene) subsets galore!

  • Genes localized in the mitochondrion
  • Genes up-expressed two-fold or more in heat stress
  • Genes encoding for proteins forming the immunoglobin complex
  • Genes involved in glucose biosynthesis
  • Genes handpicked by Prof. Genie for further study
  • Genes clustered together by your favorite algorithm
  • · · ·
slide-13
SLIDE 13

How do redescriptions happen?

France Brazil Chile UK USA Russia Canada Argentina Cuba China

G R B Y

[Figure: Karnaugh map over the descriptor pairs B, Y and R, G]

slide-15
SLIDE 15

A game on Karnaugh maps

[Figure: Karnaugh map over the descriptor pairs B, Y and R, G]

slide-20
SLIDE 20

Reading off a redescription

[Figure: Karnaugh map over the descriptor pairs B, Y and R, G]

slide-21
SLIDE 21

Reading off a redescription

[Figure: Karnaugh map over the descriptor pairs B, Y and R, G]

(B¬Y ¬R¬G ∨ B¬Y ¬RG ∨ B¬Y RG ∨ B¬Y R¬G)

slide-22
SLIDE 22

Reading off a redescription

[Figure: Karnaugh map over the descriptor pairs B, Y and R, G]

(B¬Y ¬R¬G ∨ B¬Y ¬RG ∨ B¬Y RG ∨ B¬Y R¬G) ⇔ (¬B¬Y RG ∨ ¬BY RG ∨ BY RG ∨ B¬Y RG)

slide-23
SLIDE 23

Reading off a redescription

[Figure: Karnaugh map over the descriptor pairs B, Y and R, G]

(B¬Y) ⇔ (RG)
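The simplification step can be checked with a small truth table. The sketch below is in Python and rests on an assumption: the left-hand side is read as the four cells of the "B and not Y" row of the map (one cell per combination of R and G), and the right-hand side as the four cells of the "R and G" column. That reading follows the later statement that Y is subtracted from B and G is intersected with R; the function names are hypothetical.

```python
from itertools import product

BOOLS = [False, True]


def row_disjunction(b, y, r, g):
    """OR of the four minterms in the (B, not Y) row, one per R/G combination."""
    return any(
        b and not y and r == r_val and g == g_val
        for r_val, g_val in product(BOOLS, repeat=2)
    )


def column_disjunction(b, y, r, g):
    """OR of the four minterms in the (R, G) column, one per B/Y combination."""
    return any(
        r and g and b == b_val and y == y_val
        for b_val, y_val in product(BOOLS, repeat=2)
    )


def row_simplifies():
    """True if the row disjunction equals 'B and not Y' on every assignment."""
    return all(
        row_disjunction(b, y, r, g) == (b and not y)
        for b, y, r, g in product(BOOLS, repeat=4)
    )


def column_simplifies():
    """True if the column disjunction equals 'R and G' on every assignment."""
    return all(
        column_disjunction(b, y, r, g) == (r and g)
        for b, y, r, g in product(BOOLS, repeat=4)
    )


print(row_simplifies(), column_simplifies())
```

Both checks hold for every assignment, so each four-term disjunction collapses to a two-literal conjunction. Whether the two conjunctions are equivalent is a property of the data, not of Boolean algebra.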

slide-24
SLIDE 24

Redescriptions help reason about sets

France Brazil Chile UK USA Russia Canada Argentina Cuba China

G R B Y

Q: How can B be made equal to R?
Ans: Subtract Y from B; intersect G with R, yielding B¬Y ⇔ RG.

slide-25
SLIDE 25

Some Definitions

Given a collection of objects O and descriptors D:

  • A redescription X ⇔ Y (X, Y ⊆ D) holds when
    – X ∩ Y = ∅ and
    – X and Y induce the same set of objects.

slide-26
SLIDE 26

Some Definitions

Given a collection of objects O and descriptors D:

  • A redescription X ⇔ Y (X, Y ⊆ D) holds when
    – X ∩ Y = ∅ and
    – X and Y induce the same set of objects.

  • A conditional redescription X ⇔ Y | Z (Z ⊆ D) holds when
    – X ∩ Y = X ∩ Z = Y ∩ Z = ∅ and
    – X ∩ Z and Y ∩ Z induce the same set of objects.

slide-27
SLIDE 27

Some Definitions

Given a collection of objects O and descriptors D:

  • A redescription X ⇔ Y (X, Y ⊆ D) holds when
    – X ∩ Y = ∅ and
    – X and Y induce the same set of objects.

  • A conditional redescription X ⇔ Y | Z (Z ⊆ D) holds when
    – X ∩ Y = X ∩ Z = Y ∩ Z = ∅ and
    – X ∩ Z and Y ∩ Z induce the same set of objects.

  • A redescription X ⇔ Y is a non-redundant redescription iff there does not exist another redescription X′ ⇔ Y′ for the same set of objects, such that X′ ⊆ X and Y′ ⊆ Y.
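The first two definitions can be sketched as Python predicates. Two assumptions are made here and are not stated on the slide: a descriptor set "induces" the objects that carry every descriptor in it, and the conditional form compares the objects carrying all of X and Z with those carrying all of Y and Z. The function names are hypothetical, and the small table is the six-object example that appears later in this deck.

```python
# Object -> descriptors it carries (the six-object example from this deck).
table = {
    "o1": {"d1", "d2", "d4", "d5", "d6"},
    "o2": {"d2", "d3", "d5", "d7"},
    "o3": {"d1", "d2", "d4", "d5", "d6"},
    "o4": {"d1", "d2", "d3", "d5", "d6", "d7"},
    "o5": {"d1", "d2", "d3", "d4", "d5", "d6", "d7"},
    "o6": {"d2", "d3", "d4"},
}


def induced(descs, table):
    """Objects that carry every descriptor in descs (assumed meaning of 'induce')."""
    return {obj for obj, carried in table.items() if set(descs) <= carried}


def is_redescription(x, y, table):
    """X <=> Y: disjoint descriptor sets inducing the same objects."""
    x, y = set(x), set(y)
    return x.isdisjoint(y) and induced(x, table) == induced(y, table)


def is_conditional_redescription(x, y, z, table):
    """X <=> Y | Z: pairwise disjoint sets; X with Z and Y with Z induce the same objects."""
    x, y, z = set(x), set(y), set(z)
    pairwise_disjoint = x.isdisjoint(y) and x.isdisjoint(z) and y.isdisjoint(z)
    return pairwise_disjoint and induced(x | z, table) == induced(y | z, table)


print(is_redescription({"d1"}, {"d6"}, table))                        # True
print(is_redescription({"d3"}, {"d7"}, table))                        # False
print(is_conditional_redescription({"d3"}, {"d7"}, {"d5"}, table))    # True
```

In this table d3 and d7 differ only on o6, which lacks d5, so the equivalence fails on its own but holds once both sides are restricted by d5.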
slide-28
SLIDE 28

Connections to Association Rule Mining

[Figure: Karnaugh map over the descriptor pairs B, Y and R, G]

Objects = Transactions
Descriptors = Items

slide-29
SLIDE 29

Connections to Association Rule Mining

[Figure: Karnaugh map over the descriptor pairs B, Y and R, G]

Objects = Transactions
Descriptors = Items
Colored cell = closed itemset (e.g., BY RG)

slide-30
SLIDE 30

Connections to Association Rule Mining

[Figure: Karnaugh map over the descriptor pairs B, Y and R, G]

Objects = Transactions
Descriptors = Items
Reducible cluster of colored cells = closed itemset (e.g., BY R)

slide-31
SLIDE 31

Connections to Association Rule Mining

[Figure: Karnaugh map over the descriptor pairs B, Y and R, G]

Objects = Transactions
Descriptors = Items
Reducible cluster of mixed cells = non-closed itemset (e.g., BY )

slide-32
SLIDE 32

Adapting association mining algorithms

Mining redescriptions reduces to:

  • mining closed itemsets (descriptor sets)
  • obtain submatrices reducible to these closed sets (generators)

Object   Descriptors
o1       d1 d2 d4 d5 d6
o2       d2 d3 d5 d7
o3       d1 d2 d4 d5 d6
o4       d1 d2 d3 d5 d6 d7
o5       d1 d2 d3 d4 d5 d6 d7
o6       d2 d3 d4
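For a table this small, both steps can be carried out by exhaustive enumeration. The Python sketch below is a naive illustration, not the mining algorithm the deck refers to: it computes the closure of every non-empty descriptor set and records, for each closed set, the minimal descriptor sets that generate it. Function names are hypothetical.

```python
from itertools import combinations

# Object -> descriptors, copied from the table above.
table = {
    "o1": {"d1", "d2", "d4", "d5", "d6"},
    "o2": {"d2", "d3", "d5", "d7"},
    "o3": {"d1", "d2", "d4", "d5", "d6"},
    "o4": {"d1", "d2", "d3", "d5", "d6", "d7"},
    "o5": {"d1", "d2", "d3", "d4", "d5", "d6", "d7"},
    "o6": {"d2", "d3", "d4"},
}
all_descs = sorted(set().union(*table.values()))


def objects_of(descs):
    """Objects that carry every descriptor in descs."""
    return frozenset(o for o, carried in table.items() if set(descs) <= carried)


def closure(objs):
    """Descriptors shared by all objects in a non-empty object set."""
    return frozenset(set.intersection(*(table[o] for o in objs)))


def closed_sets_with_generators():
    """Map each closed descriptor set to its minimal generators."""
    generators = {}
    # Visit candidates by increasing size, so any smaller generator of the
    # same closed set has already been recorded when a superset is reached.
    for size in range(1, len(all_descs) + 1):
        for combo in combinations(all_descs, size):
            objs = objects_of(combo)
            if not objs:
                continue  # no object supports this descriptor set
            known = generators.setdefault(closure(objs), [])
            if not any(gen <= set(combo) for gen in known):
                known.append(frozenset(combo))
    return generators


for closed, gens in closed_sets_with_generators().items():
    print(sorted(closed), "<-", [sorted(g) for g in gens])
```

On this table the enumeration yields ten closed sets. The closed set d1 d2 d5 d6, for example, has the two minimal generators d1 and d6.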

slide-33
SLIDE 33

Lattice of Closed Sets

dset                    mingen                                  objset
d1 d2 d3 d4 d5 d6 d7    d1 d3 d4, d3 d4 d5, d3 d4 d6, d4 d7     o5
d2 d3 d4                d3 d4                                   o5 o6
d1 d2 d3 d5 d6 d7       d1 d3, d1 d7, d3 d6, d6 d7              o4 o5
d1 d2 d4 d5 d6          d1 d4, d4 d5, d4 d6                     o1 o3 o5
d2 d3 d5 d7             d3 d5, d7                               o2 o4 o5
d1 d2 d5 d6             d1, d6                                  o1 o3 o4 o5
d2 d3                   d3                                      o2 o4 o5 o6
d2 d5                   d5                                      o1 o2 o3 o4 o5
d2 d4                   d4                                      o1 o3 o5 o6
d2                      d2                                      o1 o2 o3 o4 o5 o6
slide-34
SLIDE 34

Lattice of Closed Sets

dset                    mingen                                  objset
d1 d2 d3 d4 d5 d6 d7    d1 d3 d4, d3 d4 d5, d3 d4 d6, d4 d7     o5
d2 d3 d4                d3 d4                                   o5 o6
d1 d2 d3 d5 d6 d7       d1 d3, d1 d7, d3 d6, d6 d7              o4 o5
d1 d2 d4 d5 d6          d1 d4, d4 d5, d4 d6                     o1 o3 o5
d2 d3 d5 d7             d3 d5, d7                               o2 o4 o5
d1 d2 d5 d6             d1, d6                                  o1 o3 o4 o5
d2 d3                   d3                                      o2 o4 o5 o6
d2 d5                   d5                                      o1 o2 o3 o4 o5
d2 d4                   d4                                      o1 o3 o5 o6
d2                      d2                                      o1 o2 o3 o4 o5 o6

d1 => d5; d6 => d5

slide-36
SLIDE 36

Up Closed and Personal

Descriptor sets that close to d1 d2 d5 d6:

  • size 1: d1, d6
  • size 2: d1 d2, d1 d5, d1 d6, d2 d6, d5 d6
  • size 3: d1 d2 d5, d1 d2 d6, d1 d5 d6, d2 d5 d6
  • size 4: d1 d2 d5 d6

slide-37
SLIDE 37

Up Closed and Personal

Descriptor sets that close to d1 d2 d5 d6:

  • size 1: d1, d6
  • size 2: d1 d2, d1 d5, d1 d6, d2 d6, d5 d6
  • size 3: d1 d2 d5, d1 d2 d6, d1 d5 d6, d2 d5 d6
  • size 4: d1 d2 d5 d6

d1 ⇔ d6
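One way to read redescriptions off the lattice, sketched in Python: any two minimal generators of the same closed set induce the same objects, so a disjoint pair gives a redescription such as d1 ⇔ d6. Treating an overlapping pair as a conditional redescription, with the shared descriptors as the condition, is an assumption of this sketch and not something the slides state. The `mingens` dictionary copies the dset/mingen listing from the lattice slides; the function name is hypothetical.

```python
from itertools import combinations

# Closed descriptor set -> minimal generators, as listed on the lattice slides.
mingens = {
    "d1 d2 d3 d4 d5 d6 d7": ["d1 d3 d4", "d3 d4 d5", "d3 d4 d6", "d4 d7"],
    "d2 d3 d4": ["d3 d4"],
    "d1 d2 d3 d5 d6 d7": ["d1 d3", "d1 d7", "d3 d6", "d6 d7"],
    "d1 d2 d4 d5 d6": ["d1 d4", "d4 d5", "d4 d6"],
    "d2 d3 d5 d7": ["d3 d5", "d7"],
    "d1 d2 d5 d6": ["d1", "d6"],
    "d2 d3": ["d3"],
    "d2 d5": ["d5"],
    "d2 d4": ["d4"],
    "d2": ["d2"],
}


def redescriptions_from_generators(mingens):
    """Pair up minimal generators of each closed set.

    Returns (exact, conditional): exact holds (X, Y) pairs with X and Y
    disjoint; conditional holds (X, Y, Z) triples where Z is the part the
    two generators share (assumed reading of the conditional form).
    """
    exact, conditional = [], []
    for gens in mingens.values():
        gen_sets = [frozenset(g.split()) for g in gens]
        for a, b in combinations(gen_sets, 2):
            shared = a & b
            if not shared:
                exact.append((a, b))
            else:
                conditional.append((a - shared, b - shared, shared))
    return exact, conditional


exact, conditional = redescriptions_from_generators(mingens)
for x, y in exact:
    print(" ".join(sorted(x)), "<=>", " ".join(sorted(y)))
```

With this listing the exact pairs are d1 d3 ⇔ d6 d7, d1 d7 ⇔ d3 d6, d3 d5 ⇔ d7, and d1 ⇔ d6. Overlapping generators, such as d1 d4 and d4 d5, yield conditional candidates like d1 ⇔ d5 | d4.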

slide-38
SLIDE 38

Finding Minimal Generators

dset                    mingen                                  objset
d1 d2 d3 d4 d5 d6 d7    d1 d3 d4, d3 d4 d5, d3 d4 d6, d4 d7     o5
d2 d3 d4                d3 d4                                   o5 o6
d1 d2 d3 d5 d6 d7       d1 d3, d1 d7, d3 d6, d6 d7              o4 o5
d1 d2 d4 d5 d6          d1 d4, d4 d5, d4 d6                     o1 o3 o5
d2 d3 d5 d7             d3 d5, d7                               o2 o4 o5
d1 d2 d5 d6             d1, d6                                  o1 o3 o4 o5
d2 d3                   d3                                      o2 o4 o5 o6
d2 d5                   d5                                      o1 o2 o3 o4 o5
d2 d4                   d4                                      o1 o3 o5 o6
d2                      d2                                      o1 o2 o3 o4 o5 o6
slide-40
SLIDE 40

Finding Minimal Generators

dset                    mingen                                  objset
d1 d2 d3 d4 d5 d6 d7    d1 d3 d4, d3 d4 d5, d3 d4 d6, d4 d7     o5
d2 d3 d4                d3 d4                                   o5 o6
d1 d2 d3 d5 d6 d7       d1 d3, d1 d7, d3 d6, d6 d7              o4 o5
d1 d2 d4 d5 d6          d1 d4, d4 d5, d4 d6                     o1 o3 o5
d2 d3 d5 d7             d3 d5, d7                               o2 o4 o5
d1 d2 d5 d6             d1, d6                                  o1 o3 o4 o5
d2 d3                   d3                                      o2 o4 o5 o6
d2 d5                   d5                                      o1 o2 o3 o4 o5
d2 d4                   d4                                      o1 o3 o5 o6
d2                      d2                                      o1 o2 o3 o4 o5 o6

Diff = {d1,d6}
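The "Diff" annotation suggests one way to find minimal generators without enumerating all subsets; the Python sketch below is a reading of that idea and may differ from the authors' implementation. For a closed set C, take the difference between C and each closed set immediately below it in the lattice; a subset of C generates C exactly when it hits every such difference, so the minimal generators are the minimal hitting sets. For d1 d2 d5 d6 the only set immediately below is d2 d5, giving Diff = {d1, d6}. Function names are hypothetical, and the empty set is used as a stand-in lower neighbour for the bottom node so that generators are non-empty.

```python
from itertools import combinations

# The ten closed descriptor sets listed on the lattice slides.
closed_sets = [frozenset(s.split()) for s in [
    "d1 d2 d3 d4 d5 d6 d7", "d2 d3 d4", "d1 d2 d3 d5 d6 d7", "d1 d2 d4 d5 d6",
    "d2 d3 d5 d7", "d1 d2 d5 d6", "d2 d3", "d2 d5", "d2 d4", "d2",
]]


def lower_neighbours(closed, closed_sets):
    """Maximal closed proper subsets of `closed` (its children in the lattice)."""
    below = [s for s in closed_sets if s < closed] or [frozenset()]
    return [s for s in below if not any(s < other for other in below)]


def minimal_generators(closed, closed_sets):
    """Minimal hitting sets of the differences to each lower neighbour."""
    diffs = [closed - s for s in lower_neighbours(closed, closed_sets)]
    found = []
    items = sorted(closed)
    # Increasing size guarantees that a superset of a recorded generator
    # is rejected as non-minimal.
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            hits_all = all(candidate & diff for diff in diffs)
            if hits_all and not any(gen < candidate for gen in found):
                found.append(candidate)
    return found


for closed in closed_sets:
    gens = minimal_generators(closed, closed_sets)
    print(" ".join(sorted(closed)), "<-", [" ".join(sorted(g)) for g in gens])
```

The hitting-set search here is still brute force within each closed set, but it only looks at the descriptors of one lattice node at a time and reproduces the mingen entries listed above.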

slide-41
SLIDE 41

Not so fast...

Datasets are 50% dense!

  • cannot rely on pruning to help handle large datasets

Solution approach

  • CHARM-L: Mining with constraints

– Only expand lattice around objects/descriptors of interest

slide-42
SLIDE 42

Exploring Gene Sets in Bioinformatics

Vocabularies

  • GO functional categories (BIO, CEL, and MOL)
  • Expression range buckets in specific microarray experiments
  • Gene clusters
slide-43
SLIDE 43

Interactive Exploration w/ CHARM-L

What is the relationship between ...

  • d183 (ORFs ≥ 5 expressed in 15 minutes of heat shock)
  • d184 (ORFs ≥ 5 expressed in 20 minutes of heat shock)

Answer:

  • d183 − d388 − d460 − d515 ⇔ d184 − d309

– d388: (GO MOL mannose transporter)
– d460: (GO CEL external protective structure)
– d515: (GO BIO fructose metabolism)
– d309: (GO MOL molecular function unknown)

slide-44
SLIDE 44

Another example

What is the relationship between ...

  • d141 (ORFs ≥ 2 expressed in 10 minutes of heat shock)
  • d184 (ORFs ≥ 5 expressed in 20 minutes of heat shock)

Answer:

  • d141 − d515 − d608 ⇔ d184|d183

– d515: (GO BIO fructose metabolism)
– d608: (ORFs ≥ 4 expressed in histone depletion)
– d183: (ORFs ≥ 5 expressed in 15 minutes of heat shock)

slide-45
SLIDE 45

Performance Results

[Figure: four plots. G1: time (s) vs. minimum support (%) for Total, Lattice, Mingen, and Rules, and Dset length vs. minimum support (%), for supports from 0.5 down to 0.2. G3: the same two plots for supports from 0.2 down to 0.05.]

slide-46
SLIDE 46

Recap

Redescriptions help reason about set collections

  • Conjunctive forms handle set intersections and negations
  • Empower biologists to create and work with vocabularies

Algorithmic innovations

  • Lattice mining, finding minimal generators, constraint propagation
  • Established connections to boolean formula manipulation
slide-47
SLIDE 47

Future Work

Story telling

  • Find a sequence of redescriptions connecting disjoint sets X and Y

Schema matching

  • X ⊆ O1, Y ⊆ O2, O1 and O2 are related by relation R

Generalized boolean expressions

  • Mine redescriptions in more expressive forms
slide-48
SLIDE 48

Acknowledgements

Collaborators

  • Deept Kumar (Virginia Tech)
  • Laxmi Parida (IBM TJ Watson)

Funding

  • NSF CAREER IIS-0092978, DOE Career DE-FG02-02ER25538, NSF grants EIA-0103708 and EMT-0432098 (Zaki)

  • NSF grants IBN-0219332 and EIA-0103660 (Ramakrishnan)
slide-49
SLIDE 49

Questions? For related work, see:

  • N. Ramakrishnan et al., Turning CARTwheels: An Alternating Algorithm for Mining Redescriptions, in Proceedings of KDD’04, pages 266–275, 2004.
  • L. Parida and N. Ramakrishnan, Redescription Mining: Structure Theory and Algorithms, in Proceedings of AAAI’05, pages 837–844, July 2005.

Contact:

Naren Ramakrishnan
Department of Computer Science
Virginia Tech, Blacksburg, VA 24061
naren@cs.vt.edu
http://www.cs.vt.edu/~naren