

SLIDE 1

"Concurrence Topology": A Tool for Describing High-Order Statistical Dependence in Data

Steven P. Ellis (mostly joint work with Arno Klein), 6/9/14

SLIDE 2

Abstract

Data analytic methods possessing the following three features are desirable: (1) the method describes "high-order dependence" among variables; (2) it does so with few preconceptions; and (3) it can handle at least dozens, maybe hundreds, of variables. However, approached naively, data analysis with these three features triggers a "combinatorial explosion": the output from the analysis can include thousands, maybe millions, of numbers. Few methods possess all three features while avoiding the combinatorial explosion. Ellis has devised a data analytic method he calls "Concurrence Topology" (CT) which does so.

SLIDE 3

Abstract, continued

CT takes an apparently radically new approach to solving this problem. It starts by translating data into a "filtration", a series of "shapes". The shapes in the series are called "frames". A filtration is like a building. The frames are like floors of the building. But while the floors of a building are two-dimensional, the frames of a filtration can have dimension much higher than two. A filtration can have holes that are like elevator shafts in a building. Such holes indicate relatively weak or negative association among the variables. CT uses computational algebraic topology to describe the pattern of holes. Normally there are no more than a few dozen holes, so CT avoids the combinatorial explosion. Often one can identify small groups of variables that are closely associated with a given hole. This facilitates interpretation of the hole.

SLIDE 4

Abstract, continued

A limitation of CT is that, so far, it only works with binary data. But quantitative data can always be binarized. Ellis wrote software in R (available upon request) implementing CT. A paper, written by Arno Klein and Ellis, introducing CT and demonstrating it on fMRI data has been accepted by a topology journal.

SLIDE 5

◮ Free R code exists that implements the procedures described in this talk.
◮ Reference: S.P. Ellis, A. Klein (2014) "Describing high-order statistical dependence using 'concurrence topology', with application to functional MRI brain data," Homology, Homotopy, and Applications, 16, 245–264.

SLIDE 6

CONCERNED WITH DATA ANALYSIS CHARACTERIZED BY THREE FEATURES

SLIDE 7

INGREDIENT 1: HIGH-ORDER DEPENDENCE

◮ A statistic that can be computed from a multivariate sample by looking at only k variables at a time, but which cannot be obtained by looking at fewer than k variables at a time, reflects "kth order dependence" among the variables.
◮ "High-order dependence" means dependence of order at least 3.

SLIDE 8

Examples:

◮ The list of means of 10 variables reflects "first order dependence".
◮ A correlation matrix of 10 variables reflects second order dependence.
◮ A simple network model reflects second order dependence.
◮ Factor analysis reflects second order dependence.

SLIDE 9

Regression

◮ Least squares estimates of the coefficients in the regression model Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + error reflect second order dependence.
◮ The least squares estimate of the interaction β12 in the regression model Y = β0 + β1X1 + β2X2 + β12X1:X2 + error reflects third order dependence.
◮ Interactions in regression models can be important.
◮ This suggests that looking at dependence of order higher than 2 might be useful in general.

SLIDE 10

Three data sets identical (statistically) up to 2nd order, but not at third order.

[Table: three data sets, I, II, and III, each recording binary variables x, y, z.]
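The idea can be made concrete with a stand-in example (Python; the data are hypothetical, not the slide's own): two binary data sets whose first- and second-order mixed moments all agree, while a third-order moment differs.

```python
import math
from itertools import combinations, product

# Two stand-in binary data sets on variables x, y, z, over all four (x, y) pairs:
# in A, z = XOR(x, y); in B, z = XNOR(x, y).
A = [(x, y, x ^ y) for x, y in product((0, 1), repeat=2)]
B = [(x, y, 1 - (x ^ y)) for x, y in product((0, 1), repeat=2)]

def mixed_moments(data, order):
    """All mixed moments E[v_i * v_j * ...] of the given order -- a kth-order statistic."""
    n = len(data)
    return {idx: sum(math.prod(row[i] for i in idx) for row in data) / n
            for idx in combinations(range(3), order)}

# Identical at orders 1 and 2, yet different at order 3:
assert mixed_moments(A, 1) == mixed_moments(B, 1)
assert mixed_moments(A, 2) == mixed_moments(B, 2)
assert mixed_moments(A, 3) != mixed_moments(B, 3)
```

No statistic computed from pairs of these variables can tell A from B; only a statistic that looks at all three at once can.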

SLIDE 11

SUMMARIZING

◮ Typically, there are very many ways variables can be dependent.
◮ Any summary of dependence cannot capture every sort of dependence.
◮ Example: correlation.

SLIDE 12

INGREDIENT 2: “AGNOSTIC” STATISTICS

◮ Typically, formulating a regression model involves choices.
◮ Which variable should be the response (dependent) variable?
◮ Which variables should be the predictors (independent variables)?
◮ Which variables should be included in interactions?
◮ Ditto for path analysis.
◮ If you have prior knowledge to guide you, then regression modeling (or path analysis) is a powerful way to learn from data.
◮ A more data-driven approach is "agnostic analysis": treating all variables the same a priori.
◮ Example: factor analysis is a second order agnostic analysis method.

SLIDE 13

Group variables

◮ I'm mostly interested in "unsupervised" methods.
◮ But if there is a variable that specifies classes or groups that the data come from, then one might not want to treat it like any old variable.
◮ The output of unsupervised methods can be used as part of the input to supervised methods.
◮ Examples are given later.

SLIDE 14

INGREDIENT 3: “LARGE” NUMBER OF VARIABLES

◮ In this talk, "large number" means "dozens", maybe a hundred or so.

SLIDE 15

“COMBINATORIAL EXPLOSION”

◮ The three features constitute an "explosive mixture".
◮ Prima facie, agnostically describing kth-order dependence in a data set means examining all combinations of k variables at a time.
◮ If there are many variables and k > 2, the number of combinations can be huge.
◮ Sometimes the collection of all these combinations can be regarded as a "haystack" in which we're searching for "needles".

SLIDE 16

“COMBINATORIAL EXPLOSION:” EXAMPLE

Analysis of seventh-order dependence among the regions of the brain's "default mode network" in an fMRI data set.

◮ 32 variables.
◮ A naive agnostic seventh order analysis of 32 variables means looking at 32-choose-7 = 3,365,856 combinations of 7 variables.
◮ E.g., ≥ 3,365,856 terms in a "log linear model".
◮ The data contained only 6,144 numbers.
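The counts above are easy to check (a quick sketch; the 6,144 numbers are presumably the 32 regions times the 192 time points described later):

```python
import math

# Number of ways to look at 7 of 32 variables at a time
print(math.comb(32, 7))   # 3365856
# Size of the data set itself: 32 regions x 192 time points
print(32 * 192)           # 6144
```

So a naive seventh-order analysis would need far more parameters than there are numbers in the data.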

SLIDE 17

“COMBINATORIAL EXPLOSION,” continued

◮ Computing and interpreting many, many combinations is very challenging.
◮ With many combinations, looking at individual combinations of k variables is not helpful:
◮ Difficult to interpret a torrent of numbers.
◮ When there are many groups of k variables, the behavior of individual groups is unlikely to be reproducible.
◮ (Multiple comparisons.)

SLIDE 18

SOME METHODS THAT AGNOSTICALLY CAPTURE HIGH ORDER DEPENDENCE IN MANY VARIABLES

SLIDE 19

“Unsupervised” methods:

◮ There seem to be few established unsupervised methods that capture high order dependence.
◮ Independent Components Analysis.
◮ Tensor based methods:
  ◮ "Parallel factor analysis"
  ◮ "Tucker 3"
  ◮ Only go up to third order dependence?

SLIDE 20

“Supervised” methods

◮ Many machine learning classification methods tap into high order dependence.
SLIDE 21

Experimental methods.

SLIDE 22

CONCURRENCE TOPOLOGY (CT)

◮ An apparently new "unsupervised" method for high-order agnostic analysis of dependence among dozens (hundreds?) of variables.
◮ CT is radically different from the methods mentioned above.
◮ Since there are few such methods, there's no need to choose among them: "Use all of them."
◮ So comparing methods to see which is best is not urgent.

SLIDE 23

◮ The germ of the idea for CT came from a theoretical neuroscience talk I heard by the mathematician Carina Curto.

SLIDE 24

CONCURRENCE TOPOLOGY (CT), continued

◮ CT is often able to extract a moderate number of high order statistics from a combinatorial explosion.
◮ CT detects certain forms of negative or weak association among the variables.

SLIDE 25

TOPOLOGY

SLIDE 26

TOPOLOGY

SLIDE 27

TOPOLOGY, continued

◮ Topology is the study of qualitative aspects of shapes.
◮ Quantitative aspects of shapes, such as length, angle, area, volume, and curvature, are only loosely connected to topology.
◮ Famously, topology can't tell the difference between a donut and a coffee cup.
◮ Topology does pay attention to holes in shapes (like the hole in a donut or in the handle of a coffee cup).
◮ Topology ignores details.
◮ That's good: there's a combinatorial explosion of details. We have to ignore practically all of them.
◮ That's bad: sometimes the details are important.
◮ "Needles" are details.
◮ But often we can recover details from a CT analysis.

SLIDE 28

ANALOGY FOR CT

◮ Consider this hypothetical histogram.

[Figure: "Persistence in a histogram": a histogram with bars labeled A, B, and C.]

SLIDE 29

ANALOGY, continued

◮ The Y axis is "count" or "frequency".
◮ It's discrete: 1, 2, 3, ....
◮ Cut the histogram at various heights.
◮ It suffices to do this at whole number heights ("frequency levels").
◮ Dark line segments show the intersections of the horizontal lines with the histogram.
◮ As the horizontal line moves downward, sometimes a gap appears in the intersection.
◮ A gap is "born".
◮ At a lower level the gap might be filled in.
◮ The gap "dies".
◮ The difference between the 2 heights is the "lifespan" of the gap.
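The birth/death bookkeeping just described can be sketched in code (Python rather than the talk's R; the histogram counts used in the example are hypothetical):

```python
def gaps_at(counts, h):
    """Gaps at height h: maximal runs of bins with count < h lying strictly
    between bins with count >= h (where the cutting line misses the histogram)."""
    above = [i for i, c in enumerate(counts) if c >= h]
    return [(a + 1, b - 1) for a, b in zip(above, above[1:]) if b > a + 1]

def gap_persistence(counts):
    """Sweep the cutting line from the top down.  A gap is 'born' at the highest
    level where it appears and 'dies' at the level where it is filled in.
    Gaps are matched across levels by interval overlap; on a split, the first
    child keeps the old birth level."""
    alive = []     # list of (interval, birth_level)
    pairs = []     # finished (birth, death) pairs
    for h in range(max(counts), 0, -1):
        matched, next_alive = set(), []
        for iv in gaps_at(counts, h):
            # each current gap overlaps at most one gap alive at the previous level
            parent = next((k for k, (piv, _) in enumerate(alive)
                           if not (iv[1] < piv[0] or piv[1] < iv[0])), None)
            if parent is not None and parent not in matched:
                matched.add(parent)
                next_alive.append((iv, alive[parent][1]))  # keep old birth level
            else:
                next_alive.append((iv, h))                 # newly born gap
        pairs += [(b, h) for k, (_, b) in enumerate(alive) if k not in matched]
        alive = next_alive
    return pairs + [(b, 0) for _, b in alive]

print(sorted(gap_persistence([3, 1, 4, 1, 5])))   # [(3, 1), (4, 1)]
```

For counts [3, 1, 4, 1, 5], gaps are born at heights 4 and 3, and both are filled in at height 1, giving lifespans 3 and 2.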

SLIDE 30

PERSISTENCE

◮ The phenomenon of birth and death of gaps is "persistence".
◮ Can plot it (a "persistence plot").

[Figure: "Persistence plot in dimension 0 for histogram": birth vs. death, with points labeled A, B, and C.]

SLIDE 31

CONCURRENCE TOPOLOGY RUDIMENTS

◮ In CT a group of dichotomous, i.e. "0-1", variables is represented as a series of shapes (a "filtration").
◮ A filtration can be thought of as a "building".
◮ The various shapes in the filtration are like "floors" in the building.
◮ But the floors of a building are 2-D, while the shapes in a filtration can be high-D.
◮ The building is made out of "bricks" (simplices), one brick per "observation":
  ◮ a time point in fMRI data;
  ◮ a subject in psychological scale data.
◮ Each observation (subject or time point) contributes one "brick".
◮ A "brick" represents a "concurrence": the variables that are "1" in the observation.
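A minimal sketch of this construction (Python, not the talk's R software; the frequency-level convention used here, that a simplex is present at level k when its variables are simultaneously 1 in at least k observations, is my reading of these slides, and only tiny simplices are enumerated):

```python
from itertools import combinations

def concurrences(data):
    """One 'brick' (concurrence) per observation: the set of variables that are 1."""
    return [frozenset(j for j, v in enumerate(row) if v == 1) for row in data]

def frame(data, k, max_dim=2):
    """Frame ('floor') at frequency level k: a simplex is included when its
    variables are simultaneously 1 in at least k observations.  Brute-force
    enumeration for illustration only; real CT software avoids this."""
    bricks = concurrences(data)
    variables = sorted(set().union(*bricks)) if bricks else []
    return {s for d in range(1, max_dim + 2)
              for s in combinations(variables, d)
              if sum(set(s) <= b for b in bricks) >= k}
```

Since raising k can only remove simplices, the frames are nested, which is what makes the series a filtration.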

SLIDE 32

[Figure: "Concurrence plot for sub84371": concurrence vs. ROI.]

SLIDE 33

HOLES

◮ Holes in the filtration indicate relative weakness or negativity in the joint distribution of the variables.
◮ Holes in the filtration are like "stairwells" or "atriums" in the building.
◮ They can span several "floors".
◮ The number of "floors" spanned by a hole is the "lifespan" or "persistence" of the hole.
◮ In concurrence topology one learns about data by studying the pattern of holes in the data's filtration.

SLIDE 34

CANNOT ANALYZE HOLES BY VISUAL INSPECTION

◮ The dimension of the filtration is usually too high.
◮ Use computational topology to study the pattern of holes.
◮ The branch of topology that studies holes in shapes is "homology theory".
◮ The technical word for "hole" is "homology class".

SLIDE 35

DIMENSION OF HOLES

◮ Homology classes come in different "dimensions": 0, 1, 2, ....
◮ Examples:
  ◮ The gap between the Earth and the Moon.
  ◮ A bagel.
  ◮ A basketball.
  ◮ Higher dimensional holes.
◮ Later we'll look at homology in fMRI data in dimensions 0 through 5.

SLIDE 36

“PERSISTENT HOMOLOGY”

◮ Finding "stairwells" and their spans means computing the "persistent homology" of the filtration ("building").
◮ Have done this for up to 74 variables (fMRI data).
◮ May be possible for a few hundred, but not thousands, of variables.
◮ A hole of dimension d has to do with statistical dependence of order at least d + 2.
SLIDE 37

SYNTHETIC EXAMPLE

◮ Test pattern for software.

[Figure: "Test filtration 'yh52'": frames at frequency levels 1 through 8 on vertices 1–7.]

SLIDE 38

PERSISTENCE PLOT FOR SYNTHETIC EXAMPLE

[Figure: "Persistence plot for the filtration 'yh52' in dimension 1": birth vs. death.]

SLIDE 39

CT GOES BEYOND NETWORK OR GRAPHICAL MODELS

◮ A simple network or graphical model connects pairs of nodes by lines.
◮ A solid triangle connects three nodes.
◮ In real data examples, a filtration might include "triangles" ("simplices") connecting 60 or more nodes.
◮ It is not useful to try to interpret CT as a generalized network method.

SLIDE 40

SECRET (?) OF CT’S SUCCESS

◮ CT finds interesting structure in data without getting overwhelmed by a combinatorial explosion. How is it able to do this?
◮ A homology class (hole) is a global phenomenon that requires the "cooperation" of all the variables.
◮ This makes holes in the filtration very rare.
◮ "Very rare" compared, not to the "population" of all data sets, but to the size of the combinatorial explosion.
◮ So data sets with holes appear to be rather common.
◮ But the number of holes one gets is manageable: maybe a dozen or so.

SLIDE 41

EXAMPLE: DMN IN DIMENSION 4

◮ fMRI data, default mode network (32 regions).
◮ Arno Klein and I looked at homology in dimension 4 (corresponding to 6th- or higher-order dependence).
◮ Combinatorial explosion: there are 32-choose-6 = 906,192 ways to choose 6 regions out of 32.
◮ The median number of holes in dimension 4 is 2; the max is 18.
◮ This represents a tremendous reduction compared to the size of the combinatorial explosion.
◮ (It turns out that the presence of homology in dimension 4, i.e., 6th- and higher-order dependence, discriminates an ADHD group from controls.)

SLIDE 42

fMRI

◮ "fMRI" stands for "functional magnetic resonance imaging".
◮ It images the functioning of the brain of a living person.
◮ Contrast this with a "structural MRI", which images (at higher resolution) the anatomy of the brain.
◮ Active areas of the brain require more oxygen than do inactive ones.
◮ This generates a "blood-oxygen-level-dependent" (BOLD) signal that an MR machine can detect.
◮ An fMRI image of the brain can be taken about once every 2 seconds.
◮ Spatial resolution is about 3 × 3 × 5 mm³.
◮ The presumption is that high BOLD values in a brain region indicate that the region is active.
◮ So the activity of different parts of the brain over time can be recorded.

SLIDE 43

“This exciting technology has revolutionized the scientific study of the mind.”

SLIDE 44

“FUNCTIONAL CONNECTIVITY”

◮ Means coordination of activity in different brain regions.
◮ Abnormal functional connectivity is believed to be important in Attention Deficit Hyperactivity Disorder (ADHD).
◮ True functional connectivity is probably reflected in observed joint variation in BOLD among various brain regions.
◮ Ergo, one can learn about functional connectivity by studying the statistical dependence of BOLD among brain regions.
◮ The interplay of activity in two regions is like a telephone call.
◮ Mightn't the brain make use of "conference calls" involving more than two regions?
◮ It makes sense to study the joint variation of groups of not just 2, but 3, 4, etc., regions in fMRI data.

SLIDE 45

DATA SET

◮ Publicly available fMRI data of NYU provenance. (Arno found it.)
◮ Resting state.
◮ 25 ADHD subjects.
◮ 41 healthy controls.
◮ Once processed by Arno, the data included, for every subject, BOLD values in 92 brain regions at 192 time points.

SLIDE 46

WE APPLIED CT TO EACH SUBJECT’S fMRI DATA SEPARATELY.

◮ Looked at homology up to dimension 5.
◮ This pertains to dependence (connectivity) of order seven or more.
◮ Like fitting an LS regression model with one or more sixth-order interactions.

SLIDE 47

“TIME DOMAIN”

◮ Dichotomize BOLD in each region separately, at the 80th percentile.
◮ Take the dichotomized BOLD values in all regions at each time point to be an "observation".
◮ Time dependence is ignored in doing this.
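The dichotomization step can be sketched as follows (Python/NumPy rather than R; the shape convention, time points by regions, is an assumption):

```python
import numpy as np

def binarize_bold(bold, pct=80.0):
    """Dichotomize each region's BOLD series at that region's own pct-th
    percentile: 1 where the signal exceeds the threshold, else 0.
    bold: array of shape (time points, regions)."""
    thresholds = np.percentile(bold, pct, axis=0)   # one threshold per region
    return (bold > thresholds).astype(int)          # each row is one "observation"
```

With the 80th percentile, roughly 20% of each region's time points come out as 1.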

SLIDE 48

FOURIER DOMAIN

◮ Brings temporal dependence back into the picture.

[Figure: an excerpt from the 'Left-Pallidum' BOLD time series, and its Fourier components at frequencies 0.63, 1.26, 1.88, 2.51, and 3.14.]

SLIDE 49

PERIODOGRAM

SLIDE 50

THIS OPERATION CAPTURES THE TEMPORAL STRUCTURE OF THE BOLD TIME SERIES.

SLIDE 51

SIMILAR PERIODOGRAMS SUGGEST FUNCTIONAL CONNECTIVITY

◮ Suppose the BOLD activity curves of regions A and B are the same, EXCEPT that they are shifted relative to each other in time.
◮ E.g., at all time points t, the BOLD of region B at time t is the same as that of A at time t − 2.
◮ This strong relationship would not be apparent in a time domain analysis.
◮ In the time domain we only look at simultaneous activity.
◮ But the periodograms of A and B would be exactly the same!
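This shift-invariance is easy to check numerically (Python/NumPy sketch; np.roll gives a circular shift, the clean case in which the periodograms agree exactly, and the series here is synthetic):

```python
import numpy as np

def periodogram(x):
    """Squared magnitudes of the discrete Fourier transform (unnormalized)."""
    return np.abs(np.fft.rfft(x)) ** 2

rng = np.random.default_rng(1)
a = rng.standard_normal(192)   # stand-in for region A's BOLD series
b = np.roll(a, 2)              # region B: the same curve, shifted 2 time points

assert np.corrcoef(a, b)[0, 1] < 0.5                # the shift hides the link in the time domain
assert np.allclose(periodogram(a), periodogram(b))  # but the periodograms match
```

A time shift only changes the phases of the Fourier coefficients, and the periodogram discards phase, keeping only magnitude.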

SLIDE 52

IN FOURIER DOMAIN DEFINE CONCURRENCES WITHIN EACH FREQUENCY.

◮ In each region, classify each frequency as "active" or not, depending on whether the periodogram for that region exceeds a given threshold at that frequency.
◮ Take the dichotomized periodograms at the same Fourier frequency to be an "observation".
◮ Define concurrences in the Fourier domain just as in the time domain, but with frequencies instead of time points.
◮ So in the Fourier domain, A and B will be in exactly the same concurrences, reflecting their tight association.
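A sketch of the Fourier-domain dichotomization (Python/NumPy; thresholding each region's periodogram at a percentile is my illustrative choice, since the slide only says "a given threshold"):

```python
import numpy as np

def fourier_concurrences(pgrams, pct=80.0):
    """pgrams: array of shape (frequencies, regions) of periodogram values.
    Threshold each region's periodogram at its pct-th percentile, then read
    off one concurrence (set of 'active' regions) per Fourier frequency."""
    active = pgrams > np.percentile(pgrams, pct, axis=0)
    return [frozenset(np.flatnonzero(row)) for row in active]
```

Regions with identical periodograms, like the time-shifted A and B above, then land in exactly the same concurrences.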

SLIDE 53

AVERAGE PERSISTENCE PLOTS

SLIDE 54

Whole Brain, Time Domain

[Figure: average persistence plots (birth vs. death) for the whole brain, time domain: ADHD and control groups in dimensions 0, 1, and 2.]

SLIDE 55

Whole Brain, Fourier Domain

[Figure: average persistence plots (birth vs. death) for the whole brain, Fourier domain: ADHD and control groups in dimensions 0, 1, and 2.]

SLIDE 56

Default Mode Network, Time Domain

[Figure: average persistence plots (birth vs. death) for the default mode network, time domain: ADHD and control groups in dimensions 0 through 5.]

SLIDE 57

Default Mode Network, Fourier Domain

[Figure: average persistence plots (birth vs. death) for the default mode network, Fourier domain: ADHD and control groups in dimensions 0 through 5.]

SLIDE 58

SOME FINDINGS BASED ON PERSISTENCE PLOTS

◮ Differences between groups in the whole brain, Fourier domain, dimensions 1 and 2 (especially dimension 1).
◮ Differences between groups in the DMN, time domain, dimensions 4 and 5 (especially dimension 4).
◮ 64.0% of ADHD subjects had any homology in the time domain in the DMN in dimension 4, compared to 92.6% of controls.

SLIDE 59

“LOCALIZATION”

◮ Homology classes ("holes") involve all the variables.
◮ Holes can often be "localized" by identifying groups of variables ("short cycles") most closely associated with them.
◮ Short cycles can be examined to see if they're interesting, i.e., if they're "needles" in the combinatorial "haystack".

SLIDE 60

[Figure: the test filtration 'yh52' again: frames at frequency levels 1 through 8.]

SLIDE 61

CAVEAT

◮ I make no claim that with CT one can find all the "needles", or even the most important ones.
◮ I only suggest that CT might find some of them, in itself an important contribution.
◮ CT apparently provides a view of high order dependence unavailable using any other method.
◮ And vice versa: other methods see things CT does not.

SLIDE 62

LOCALIZATION: EXAMPLE

◮ In the fMRI data we found interesting short cycles in dimension 1, time domain, DMN.
◮ These are 3rd-order.
◮ A "short cycle" consists of a triple of regions.
◮ Each subject has a few hundred short cycles.
◮ Combinatorial explosion: there are 9,880 different possible triplets of regions.

SLIDE 63

PERSISTENCE PLOT FOR ONE SUBJECT IN DIMENSION 1, TIME DOMAIN, DMN

[Figure: persistence plot (birth vs. death) with one class marked by an asterisk.]

SLIDE 64

SPECIAL SHORT CYCLES

◮ One short cycle in dimension 1 is found in 13 subjects: ctx-lh-parsorbitalis + ctx-lh-rostralanteriorcingulate + ctx-rh-medialorbitofrontal.
◮ This is statistically significant.
◮ (Null hypothesis: all short cycles are equally likely to appear in a subject's data.)
◮ 16 short cycles associated with the same class distinguish ADHD from controls.

SLIDE 65

PRODUCTS

◮ A persistence plot treats every persistent class (hole) in isolation, whether between or within dimensions.
◮ In topology there are "products" that provide possible ways holes can be combined to produce other holes.
◮ If α and β are holes of dimensions p and q, respectively, then what I call the "join" of the two holes, if it exists, is a hole of dimension p + q + 1.
◮ Joining thus gives a relationship among up to three dimensions.
◮ The join of two holes may or may not be present in a shape.
◮ The presence or absence of joins might be another part of the homological signature of a particular phenomenon, like disease group.

SLIDE 66

INDEPENDENCE

◮ Surprisingly, joining is connected with statistical independence, a fundamental notion in data analysis.
◮ Suppose one has nonoverlapping groups of variables in the left hand and in the right hand.
◮ Suppose there's negative association among the left hand variables that shows up as a hole α.
◮ Similarly, the right hand variables produce a hole β.
◮ So there's dependence within each hand, but suppose the two groups of variables are independent of each other.
◮ Then, over many observations of these variables, the join of α and β will emerge!
◮ This fact generalizes to any number of groups of variables.

SLIDE 67

SIMULATION RESULTS

◮ Two assumptions:
  1. Each group produces its own homology.
  2. The two groups are independent of each other.
◮ In a preliminary simulation I found that if both assumptions hold, then one frequently observes joining.
◮ But if either assumption fails, then one hardly ever observes joining.
◮ I have preliminary findings of joining in real data.

SLIDE 68

SOFTWARE

◮ I wrote R code that implements CT.
◮ It's free and documented.
◮ I didn't set out to write such software; I began tinkering and eventually tinkered my way up to free-standing CT code.
◮ It was a good exercise for understanding computational homology.
◮ Advantage: since I have the source code and understand it, it's not too hard for me to implement nonstandard functionality like short cycle and join detection.
◮ Disadvantage: I believe other people have written faster programs for computing persistent homology.
◮ (But there are lots of ways of speeding up my code!)

SLIDE 69

COMPUTING HOMOLOGY CAN BE VERY INTENSIVE!

◮ The combinatorial explosion is a problem for the computation.
◮ Fortunately, topology ignores details.
◮ One tries to replace the filtration with a simpler one that theory shows has the same persistence plots.