

SLIDE 1

Foundations of Causal Discovery

Frederick Eberhardt

KDD Causality Workshop 2016

SLIDES 2-7

Causal Discovery

truth (unknown): a causal graph over the variables x, y, z, w
➟ data sample: samples of x, y, z, w
➟ inference algorithm, with assumptions, e.g.
  • causal Markov
  • causal faithfulness
  • functional form
  • etc.
➟ equivalence classes and model specifications: which direct edges and confounders are present

[Figure: two candidate graphs over x, y, z, w, together with 4×4 model-specification matrices whose entries are 0 (absent), ? (undetermined), or identified coefficients a, b]

SLIDES 8-14

Constraint-based Causal Discovery

truth (unknown): a causal graph over x, y, z, w ➟ distribution P(W, X, Y, Z) ➟ data sample (samples of x, y, z, w)

statistical inference on the sample yields probabilistic independence constraints, e.g.

    x ⊥⊥ y | {z, w}

the graphical counterpart is d-separation, e.g.

    x ⊥ y | {z, w}    (x is d-separated from y given {z, w})

conditions linking the two:
  • Markov
  • faithfulness

output: equivalence classes of causal graphs
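The statistical-inference step can be made concrete. Below is a minimal sketch, assuming linear-Gaussian data: test x ⊥⊥ y | {z, w} with a partial-correlation (Fisher z) test. The helper names and the example graph are illustrative, not from the slides.

```python
import numpy as np
from scipy import stats

def partial_corr(data, i, j, cond):
    """Partial correlation of columns i and j given the columns in cond."""
    idx = [i, j] + list(cond)
    prec = np.linalg.inv(np.corrcoef(data[:, idx], rowvar=False))
    return -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])

def ci_test(data, i, j, cond, alpha=0.05):
    """True if 'x_i independent of x_j given cond' is not rejected."""
    n = data.shape[0]
    r = partial_corr(data, i, j, cond)
    z = 0.5 * np.log((1 + r) / (1 - r))           # Fisher z-transform
    stat = np.sqrt(n - len(cond) - 3) * abs(z)    # ~ N(0, 1) under H0
    return 2 * (1 - stats.norm.cdf(stat)) > alpha

# Toy data from the collider x -> z <- y, with w a child of z:
rng = np.random.default_rng(0)
x, y = rng.normal(size=2000), rng.normal(size=2000)
z = x + y + 0.5 * rng.normal(size=2000)
w = z + 0.5 * rng.normal(size=2000)
data = np.column_stack([x, y, z, w])
print(ci_test(data, 0, 1, []))    # expected True:  x independent of y
print(ci_test(data, 0, 1, [2]))   # expected False: conditioning on collider z opens the path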

SLIDES 15-16

Causal Markov

x is independent of its non-descendants given its parents in the causal graph.

[Figure: example graph over u, v, w, x, y, z]

Violations of Causal Markov
  • quantum mechanics
  • [unmeasured common causes]
  • [mixtures of populations]
  • [variables are not distinct, or too coarsely grained]
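A quick numeric check of the Markov property, on a linear chain u → x → y of my own construction (not from the slides): given its parent x, the variable y is independent of its non-descendant u.

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.normal(size=50_000)
x = 0.9 * u + rng.normal(size=50_000)
y = 0.7 * x + rng.normal(size=50_000)

# Partial correlation via residualization on the conditioning set {x}:
ry = y - np.polyval(np.polyfit(x, y, 1), x)
ru = u - np.polyval(np.polyfit(x, u, 1), x)
print(np.corrcoef(u, y)[0, 1])     # clearly nonzero: u and y are dependent
print(np.corrcoef(ru, ry)[0, 1])   # ~ 0: y independent of u given x
```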
SLIDES 17-18

Causal Faithfulness

If x is independent of y given C in the probability distribution, then x is d-separated from y given C in the graph.

Violations of Causal Faithfulness
  • canceling pathways
  • matching pennies cases
  • [small sample sizes and near violations of faithfulness]

[Figure: canceling pathways x → y → z with coefficients a and b plus a direct edge x → z with coefficient −ab, so that x ⊥⊥ z; matching pennies: x, y, z related by x-or]
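A small numeric illustration of the canceling-pathways violation above (the coefficient values are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 0.8, 0.5
x = rng.normal(size=100_000)
y = a * x + rng.normal(size=100_000)
z = b * y - a * b * x + rng.normal(size=100_000)   # direct edge weight -ab

# The two x-to-z pathways cancel exactly: cov(x, z) = ab + (-ab) = 0,
# so x and z are uncorrelated even though x is a cause of z.
print(np.corrcoef(x, z)[0, 1])   # ~ 0
print(np.corrcoef(y, z)[0, 1])   # clearly nonzero
```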

SLIDES 19-22

Assumptions

  • causal Markov: permits inference from probabilistic dependence to causal connection
  • causal faithfulness: permits inference from probabilistic independence to causal separation
  • causal sufficiency: there are no unmeasured common causes
  • acyclicity: no variable is an (indirect) cause of itself

[Figure: example graphs over x, y, z; one with unmeasured common causes l1, l2 and one with a cycle]

SLIDES 23-25

Assumptions (continued)

  • Markov • faithfulness • acyclicity • causal sufficiency

Under these assumptions, all graphs in an equivalence class have:
  • the same adjacencies ("skeleton")
  • the same unshielded colliders

[Figure: an unshielded collider over x, y, z]

[Verma & Pearl 1990, Frydenberg 1990]
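A brute-force check of this characterization (a sketch of my own, not from the slides): enumerate all DAGs over three labeled variables and group them by (skeleton, unshielded colliders). The groups are exactly the Markov equivalence classes, 25 DAGs in 11 classes.

```python
from itertools import product

pairs = [(0, 1), (0, 2), (1, 2)]   # variables x=0, y=1, z=2

def dags():
    """All 25 DAGs on three labeled nodes."""
    for states in product([None, 0, 1], repeat=3):
        edges = frozenset((a, b) if s == 0 else (b, a)
                          for (a, b), s in zip(pairs, states) if s is not None)
        # with one orientation per pair, the only cycles are the two 3-cycles
        if {(0, 1), (1, 2), (2, 0)} <= edges or {(1, 0), (2, 1), (0, 2)} <= edges:
            continue
        yield edges

def skeleton(edges):
    return frozenset(frozenset(e) for e in edges)

def unshielded_colliders(edges):
    adj = skeleton(edges)
    return frozenset((x, z, y)
                     for z in range(3)
                     for x in range(3) for y in range(3)
                     if x < y and (x, z) in edges and (y, z) in edges
                     and frozenset((x, y)) not in adj)

classes = {}
for g in dags():
    classes.setdefault((skeleton(g), unshielded_colliders(g)), []).append(g)

print(sum(len(v) for v in classes.values()), "DAGs in", len(classes), "classes")
# expected: 25 DAGs in 11 classes
```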
SLIDES 26-36

Worked example

[Figure: enumeration of all 25 acyclic graphs over x, y, z, progressively crossed out as constraints arrive]

Observed constraints:

    x ⊥⊥ y        x ⊥̸⊥ z        y ⊥̸⊥ z

} sufficient to determine the equivalence class; in this case a unique causal graph, the unshielded collider x → z ← y.

For linear Gaussian and for multinomial causal relations, an algorithm that identifies the Markov equivalence class of the true model is complete.

(Pearl & Geiger 1988, Meek 1995)

SLIDES 37-40

Staying in business

  • Weaken the assumptions (and increase the equivalence class)
      • allow for unmeasured common causes
      • allow for cycles
      • weaken faithfulness
  • Exclude the limitations (and reduce the equivalence class)
      • restrict to non-Gaussian error distributions
      • restrict to non-linear causal relations
      • restrict to specific discrete parameterizations
  • Include more general data collection set-ups (and see how assumptions can be adjusted and what equivalence class results)
      • experimental evidence
      • multiple (overlapping) data sets
      • relational data

[Zhalama talk; Tank talk]

SLIDE 41

Limitations

For linear Gaussian and for multinomial causal relations, an algorithm that identifies the Markov equivalence class of the true model is complete.

(Pearl & Geiger 1988, Meek 1995)

SLIDES 42-43

Linear non-Gaussian method (LiNGaM)

  • Linear causal relations:

        x_i = Σ_{x_j ∈ Pa(x_i)} b_ij x_j + ε_i,    ε_i ∼ non-Gaussian

    (the coefficient glyph was lost in extraction; b_ij is the standard LiNGaM notation)

  • Assumptions:
      • causal Markov
      • causal sufficiency
      • acyclicity
  • If the errors are non-Gaussian, then the true graph is uniquely identifiable from the joint distribution.

[Shimizu et al., 2006]

SLIDES 44-49

Two variable case

True model (x → y):

    y = x + ε_y,    x = ε_x,    with x ⊥⊥ ε_y

Backwards model (y → x):

    x = θ y + ε̃_x,    which would require y ⊥⊥ ε̃_x

Substituting the true model into the backwards residual:

    ε̃_x = x − θ y = x − θ(x + ε_y) = (1 − θ) x − θ ε_y

SLIDES 50-51

Why Normals are unusual

Forwards model: y = x + ε_y. For the backwards model, can y and ε̃_x = (1 − θ) x − θ ε_y be independent?

Theorem 1 (Darmois-Skitovich). Let X_1, ..., X_n be independent, non-degenerate random variables. If two linear combinations

    l_1 = a_1 X_1 + ... + a_n X_n,    a_i ≠ 0
    l_2 = b_1 X_1 + ... + b_n X_n,    b_i ≠ 0

are independent, then each X_i is normally distributed.

Both y and ε̃_x are linear combinations of x and ε_y with nonzero coefficients, so they can only be independent if x and ε_y are Gaussian.
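A sketch of the resulting asymmetry (my own construction, not the LiNGaM algorithm itself), assuming uniform noise: regress each way and probe whether the residual is independent of the regressor, here with a crude fourth-order dependence score. With Gaussian noise both scores would be near zero, exactly as Darmois-Skitovich predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(-1, 1, n)          # non-Gaussian cause
y = x + rng.uniform(-1, 1, n)      # y = x + eps_y

def residual_dependence(u, v):
    """Regress v on u; score the dependence of u and the residual."""
    theta = np.cov(u, v)[0, 1] / np.var(u)
    resid = v - theta * u
    # corr of squares: zero when residual and regressor are independent
    return abs(np.corrcoef(u ** 2, resid ** 2)[0, 1])

print("x -> y:", residual_dependence(x, y))   # ~ 0
print("y -> x:", residual_dependence(y, x))   # clearly > 0
```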

SLIDES 52-54

| algorithm | Markov | faithfulness | causal sufficiency | acyclicity | parametric assumption | output |
|---|---|---|---|---|---|---|
| PC / GES | ✓ | ✓ | ✓ | ✓ | ✗ | Markov equivalence |
| FCI | ✓ | ✓ | ✗ | ✓ | ✗ | PAG |
| CCD | ✓ | ✓ | ✓ | ✗ | ✗ | PAG |
| LiNGaM | ✓ | ✗ | ✓ | ✓ | linear non-Gaussian | unique DAG |
| lvLiNGaM | ✓ | ✓ | ✗ | ✓ | linear non-Gaussian | set of DAGs |
| cyclic LiNGaM | ✓ | ~ | ✓ | ✗ | linear non-Gaussian | set of graphs |

SLIDES 55-56

Limitations

For linear Gaussian and for multinomial causal relations, an algorithm that identifies the Markov equivalence class of the true model is complete.

(Pearl & Geiger 1988, Meek 1995)

SLIDES 57-58

Bivariate Linear Gaussian case

True model:

    y = x + ε_y,    x = ε_x,    ε_x, ε_y ∼ independent Gaussian

[Figure, panels a-c: scatter plot with the fitted conditionals p(y | x) and p(x | y) for the forwards (true) model and the backwards model]

(graphics from Hoyer et al. 2009)

SLIDES 59-62

Continuous additive noise models

    x_j = f_j(pa(x_j)) + ε_j

  • If f_j(.) is linear, then non-Gaussian errors are required for identifiability
  ➡ What if the errors are Gaussian, but f_j(.) is non-linear?
  ➡ More generally, under what circumstances is the causal structure represented by this class of models identifiable?

SLIDES 63-66

Bivariate non-linear Gaussian additive noise model

True model:

    y = x + x³ + ε_y,    x = ε_x,    ε_x, ε_y ∼ independent Gaussian

Backwards model:

    x = g(y) + ε̃_x,    but now y ⊥̸⊥ ε̃_x

[Figure, panels d-f: p(y | x) and p(x | y) for the forwards (true) model and the backwards model]

(graphics from Hoyer et al. 2009)
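The same residual-independence probe as before, applied to this non-linear Gaussian example (a sketch of my own; the polynomial fit and the dependence score are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)
y = x + x ** 3 + rng.normal(size=n)

def anm_dependence(u, v, degree=5):
    """Fit v = f(u) + resid with a polynomial; score u-resid dependence."""
    resid = v - np.polyval(np.polyfit(u, v, degree), u)
    return abs(np.corrcoef(u ** 2, resid ** 2)[0, 1])

print("x -> y:", anm_dependence(x, y))   # ~ 0: residual independent of x
print("y -> x:", anm_dependence(y, x))   # > 0: no valid backwards model
```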

SLIDES 67-72

General non-linear additive noise models

Hoyer Condition (HC): a technical condition on the relation between the function, the noise distribution and the parent distribution that, if satisfied, permits a backward model.

  • If the error terms are Gaussian, then the only functional form that satisfies HC is linearity; otherwise the model is identifiable.
  • If the errors are non-Gaussian, then there are (rather contrived) functions that satisfy HC, but in general identifiability is guaranteed.
  • This generalizes to multiple variables (assuming minimality*)!
  • There is an extension to discrete additive noise models.
  • If the function is linear but the error terms are non-Gaussian, then one cannot fit a linear backwards model (LiNGaM), but there are cases where one can fit a non-linear backwards model.

SLIDES 73-74

| algorithm | Markov | faithfulness | causal sufficiency | acyclicity | parametric assumption | output |
|---|---|---|---|---|---|---|
| PC / GES | ✓ | ✓ | ✓ | ✓ | ✗ | Markov equivalence |
| FCI | ✓ | ✓ | ✗ | ✓ | ✗ | PAG |
| CCD | ✓ | ✓ | ✓ | ✗ | ✗ | PAG |
| LiNGaM | ✓ | ✗ | ✓ | ✓ | linear non-Gaussian | unique DAG |
| lvLiNGaM | ✓ | ✓ | ✗ | ✓ | linear non-Gaussian | set of DAGs |
| cyclic LiNGaM | ✓ | ~ | ✓ | ✗ | linear non-Gaussian | set of graphs |
| non-linear additive noise | ✓ | minimality | ✓ | ✓ | non-linear additive noise | unique DAG |

SLIDES 75-86

Experiments, Background Knowledge and all the other Jazz

  • how to integrate data from experiments?
    [Figure: graph over x, y, z with unmeasured common causes l1, l2]
  • how to include background knowledge?
      • pathways (e.g. a directed pathway from x to w)
      • tier orderings (e.g. some variables known to precede others)
      • "priors" (e.g. over the edges among x, y, z, w)
  • specific search space restrictions
      • biological settings
      • subsampled time series

[Tank talk]

SLIDES 87-94

High-Level

data samples, possibly several and over overlapping variable sets (e.g. {x, y, z, w} and {x, y, w})

assumptions, e.g.
  • causal Markov
  • causal faithfulness
  • etc.

background knowledge, e.g.
  • pathways
  • tier ordering
  • "priors"
  • etc.

setting, e.g.
  • time series
  • internal latent structures
  • etc.

}  all yield (in)dependence constraints of the form  x ⊥̸⊥ y | C || J
   (x dependent on y given conditioning set C, in data set / experimental setting J)

Encode these as logical constraints on the underlying graph structure, then hand them to a (max) SAT-solver.

SLIDES 95-99

SAT-based Causal Discovery

  • Formulate the independence constraints in propositional logic:

        x ⊥⊥ y  ⟺  ¬A ∧ ¬B ∧ ...,    where A = 'x → y is present'

  • Encode the constraints into one formula:

        ¬A ∧ ¬B ∧ ¬(C ∧ D) ∧ ¬...

  • Find satisfying assignments using a SAT-solver:

        A = false, B = false, ...   ⟺   a graph over x, y, z

➡ very general setting (allows for cycles and latents) and trivially complete
➡ BUT: erroneous test results induce conflicting constraints: UNsatisfiable
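A toy version of the encoding (my own; the slide only labels A, so the readings of B, C, D below are illustrative): with A = 'x → y', B = 'y → x', C = 'x → z', D = 'z → y', the constraint x ⊥⊥ y rules out a direct edge and the pathway through z, and the satisfying assignments can be found by enumeration in place of a SAT-solver.

```python
from itertools import product

def formula(A, B, C, D):
    # mirrors the slide's clause: no edge between x and y,
    # and no directed pathway x -> z -> y
    return (not A) and (not B) and not (C and D)

models = [bits for bits in product([False, True], repeat=4) if formula(*bits)]
print(len(models), "satisfying assignments; e.g.", models[0])
```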

SLIDES 100-102

Conflicts and Errors

  • Statistical independence tests produce errors
  ➡ Conflict: no graph can produce the set of constraints

    constraint        weight
    x ⊥⊥ y            3000
    x ⊥̸⊥ z            2500
    y ⊥̸⊥ z             500
    x ⊥⊥ y | z         250

[Figure: the collider x → z ← y satisfies the first three constraints but entails x ⊥̸⊥ y | z; a chain through z satisfies x ⊥⊥ y | z but entails x ⊥̸⊥ y; no single graph satisfies all four]

[Sridhar talk]

SLIDES 103-105

Constraint Satisfaction Approach

  • INPUT: (in)dependence constraints weighted according to reliability
  • OUTPUT: a graph G that minimizes the cost

        min_G  Σ_{k : constraint k is not satisfied by G}  w(k)

  • Answer Set Programming (ASP)
      • solver used: Clingo
      • finds the globally optimal weighted maxSAT solution

What are suitable weights? (A brute-force sketch of the objective follows below.)

[Hyttinen et al. 2014]
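Before turning to the weights, here is a brute-force sketch of this objective on the three-variable conflict from the previous slide (my own construction; real systems use ASP/maxSAT solvers rather than enumeration):

```python
from itertools import product

X, Y, Z = 0, 1, 2
pairs = [(X, Y), (X, Z), (Y, Z)]

def all_dags():
    for states in product([None, 0, 1], repeat=3):
        edges = {(a, b) if s == 0 else (b, a)
                 for (a, b), s in zip(pairs, states) if s is not None}
        if {(0, 1), (1, 2), (2, 0)} <= edges or {(1, 0), (2, 1), (0, 2)} <= edges:
            continue   # skip the two directed 3-cycles
        yield edges

def d_connected(edges, a, b, cond):
    """d-connection for 3-node DAGs: direct edge, or a path via the third node."""
    adj = {frozenset(e) for e in edges}
    if frozenset((a, b)) in adj:
        return True
    (c,) = {X, Y, Z} - {a, b}
    if frozenset((a, c)) in adj and frozenset((c, b)) in adj:
        collider = (a, c) in edges and (b, c) in edges
        return collider == (c in cond)   # collider path open iff conditioned on
    return False

# (a, b, cond, independent?, weight): the conflicting set from the slide
constraints = [
    (X, Y, (),   True,  3000),   # x independent of y
    (X, Z, (),   False, 2500),   # x dependent on z
    (Y, Z, (),   False,  500),   # y dependent on z
    (X, Y, (Z,), True,   250),   # x independent of y given z
]

def cost(edges):
    # a constraint is violated when the graph disagrees with it
    return sum(w for a, b, cond, indep, w in constraints
               if d_connected(edges, a, b, cond) == indep)

best = min(all_dags(), key=cost)
print(sorted(best), "cost:", cost(best))   # collider x->z, y->z; cost 250
```

The optimum sacrifices exactly the cheapest constraint (weight 250), which is the intended behavior of the weighted scheme.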

SLIDES 106-108

Weighting Schemes

  • Constant weights
      • unit weights for all constraints
  • Hard dependencies
      • only treat rejections of the null hypothesis as hard constraints, in line with classical statistics
      • give dependences infinite weight, and maximize the independences (unit weight) in light of these dependences
  • Log weights
      • obtain the probability of an (in)dependence and weigh it according to the log of that probability
      • model selection with Bayes rule:

            x ⊥̸⊥ y | C:  P(x | C) P(y | x, C)    vs.    x ⊥⊥ y | C:  P(x | C) P(y | C)

[Hyttinen et al. 2014]
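One way to realize the log-weight scheme, as a sketch of my own (assuming linear-Gaussian data and a BIC approximation to the model probabilities; Hyttinen et al. use proper local scores):

```python
import numpy as np

def bic(y, X):
    """BIC of a Gaussian linear regression of y on the columns of X."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X]) if X.size else np.ones((n, 1))
    resid = y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]
    return n * np.log(resid @ resid / n) + X1.shape[1] * np.log(n)

def log_weight(data, i, j, cond):
    """Return (independent?, weight) for the constraint x_i indep x_j | cond."""
    y = data[:, j]
    Xc = data[:, list(cond)]
    Xcx = np.column_stack([Xc, data[:, i]]) if Xc.size else data[:, [i]]
    b_indep, b_dep = bic(y, Xc), bic(y, Xcx)           # lower BIC = better model
    return b_dep > b_indep, abs(b_dep - b_indep) / 2   # ~ |log Bayes factor|

rng = np.random.default_rng(3)
x = rng.normal(size=5000)
z = x + rng.normal(size=5000)
data = np.column_stack([x, z])
print(log_weight(data, 0, 1, []))   # (False, large weight): clearly dependent
```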

SLIDE 109

Simulation 1: no cycles, no latents, linear Gaussian

  • cPC returns a fully determined output only 58/200 times at its optimum

[Figure: TPR vs. FPR of all d-separation constraints of the true graph for a varying p-value cut-off; curves for exact Bayesian score-based, SAT-based, cPC, PC, and tests-only. 1 observational data set, 6 observed variables, average degree 2; 500 samples, 200 models, linear Gaussian parameterization]

[Hyttinen et al. 2014]

SLIDE 110

Simulation 2: no cycles, but latents

  • cFCI only returns unambiguous results 61/200 times at its optimum

[Figure: TPR vs. FPR; curves for SAT with log weights, cFCI, FCI, and tests-only]

[Hyttinen et al. 2014]

SLIDE 111

Simulation 3: cycles and latents

[Figure: TPR vs. FPR; curves for SAT with log weights (sec. 4.3), hard deps (sec. 4.1), constant weights (sec. 4.2), and tests-only]

[Hyttinen et al. 2014]

SLIDES 112-117

Background Knowledge

  • pathways: a directed pathway x ⇝ w  ⟺  x ⊥̸⊥ w || x  (w depends on x under an intervention on x); can be added as a soft constraint, e.g. weight = 0.8
  • tier orderings: z and w before x and y  ⟺  (x > z) ∧ (x > w) ∧ (y > z) ∧ (y > w)
  • "priors"  ⟺  specific probabilities for each graph, or a soft sparsity constraint

SLIDES 118-119

| algorithm | Markov | faithfulness | causal sufficiency | acyclicity | parametric assumption | output |
|---|---|---|---|---|---|---|
| PC / GES | ✓ | ✓ | ✓ | ✓ | ✗ | Markov equivalence |
| FCI | ✓ | ✓ | ✗ | ✓ | ✗ | PAG |
| CCD | ✓ | ✓ | ✓ | ✗ | ✗ | PAG |
| LiNGaM | ✓ | ✗ | ✓ | ✓ | linear non-Gaussian | unique DAG |
| lvLiNGaM | ✓ | ✓ | ✗ | ✓ | linear non-Gaussian | set of DAGs |
| cyclic LiNGaM | ✓ | ~ | ✓ | ✗ | linear non-Gaussian | set of graphs |
| non-linear additive noise | ✓ | minimality | ✓ | ✓ | non-linear additive noise | unique DAG |
| maxSAT | ✓ | ✓ | ✗ | ✗* | ✗ | query based |

SLIDE 120

Simulation 4: Scalability

  • up to 10 variables and only a few overlapping data sets for now

[Figure: solving time per instance (s) over 100 instances (sorted for each line), for log weights, constant weights, and hard deps]

[Hyttinen et al. 2014]

SLIDES 121-127

Output of Causal Search Algorithms

(max) SAT-solver ➟ equivalence class? (e.g. a set of graphs over x, y, z, w, etc.)

Query:
  • list the structures in the equivalence class
  • which structural features are determined?
      • edges, confounders
      • ancestral relations
      • pathways
  • what are the highest scoring equivalence classes?

Response:
  • enumeration of solutions
  • the "backbone" of the SAT instance

SLIDES 128-129

Computing Causal Effects

(max) SAT-solver ➟ equivalence class (a set of graphs over x, y, z, w, etc.) ➟ P(y | do(x)) ?

[Grant talk]

SLIDES 130-132

Computing P(y | do(x)) from an equivalence class

  • search in the equivalence class over the possible applications of the do-calculus rules by querying the satisfaction of their d-separation conditions

do-calculus:

    Rule 1 (insertion/deletion of observations):  P(y | do(x), z, w) = P(y | do(x), w)         if Y ⊥⊥ Z | X, W || X
    Rule 2 (action/observation exchange):         P(y | do(x), do(z), w) = P(y | do(x), z, w)  if Y ⊥⊥ I_Z | X, Z, W || X
    Rule 3 (insertion/deletion of actions):       P(y | do(x), do(z), w) = P(y | do(x), w)     if Y ⊥⊥ I_Z | X, W || X

[Hyttinen et al. 2015]
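For orientation, a minimal worked instance in Pearl's graph-known formulation (my own example, not from the slides): in the graph x → y with no latent confounder, Rule 2 licenses exchanging the action for an observation.

```latex
% Rule 2 (action/observation exchange) in Pearl's formulation:
% P(y | do(x)) = P(y | x) holds if Y and X are d-separated in the
% graph with the edges *leaving* X removed. In x -> y, removing
% the outgoing edge of x disconnects x from y, so the rule applies:
\[
  P(y \mid \mathrm{do}(x)) \;=\; P(y \mid x)
  \qquad\text{since }\; (Y \perp\!\!\!\perp X)_{G_{\underline{X}}} .
\]
```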

SLIDES 133-136

High-Level (recap)

data samples + assumptions + background knowledge + setting
➟ (in)dependence constraints, e.g.  x ⊥̸⊥ y | C || J
➟ encode these as logical constraints on the underlying graph structure
➟ (max) SAT-solver
➟ QUERY?  e.g.  P(y | do(x)),  or: is there an edge or pathway from x to w?

SLIDES 137-142

Just getting started…

  • application [Stekhoven et al. 2012]
  • multi-scale causal analysis: micro- to macro-variables [Chalupka et al. 2016]
  • time-series and dynamics
  • violations of the Markov property: non-causal relations [Maier et al. 2013]

[Sokolova talk; Blondel talk]

SLIDE 143

References

Limitations
  • Verma & Pearl, Equivalence and synthesis of causal models, UAI 1990.
  • Frydenberg, The chain graph Markov property, Scandinavian Journal of Statistics 1990.
  • Geiger & Pearl, On the logic of influence diagrams, UAI 1988.
  • Meek, Strong completeness and faithfulness in Bayesian networks, UAI 1995.

LiNGaM
  • Shimizu et al., A linear non-Gaussian acyclic model for causal discovery, JMLR 2006.
  • Hoyer et al., Estimation of causal effects using linear non-Gaussian causal models with hidden variables, IJAR 2008.
  • Lacerda et al., Discovering cyclic causal models by Independent Component Analysis, UAI 2008.

Additive noise models
  • Hoyer et al., Nonlinear causal discovery with additive noise models, NIPS 2009.
  • Mooij et al., Regression by dependence minimization and its application to causal inference, ICML 2009.
  • Peters et al., Causal inference on discrete data using additive noise models, IEEE…, 2011.
  • Peters et al., Identifiability of causal graphs using functional models, UAI 2011.

SAT-based approaches
  • Triantafillou et al., Learning causal structure from overlapping variable sets, AISTATS 2010.
  • Claassen & Heskes, A logical characterization of constraint-based causal discovery, UAI 2011.
  • Hyttinen et al., Discovering cyclic causal models with latent variables: A SAT-based approach, UAI 2013.
  • Hyttinen et al., Constraint-based causal discovery: Conflict resolution with Answer Set Programming, UAI 2014.
  • Hyttinen et al., Do-calculus when the true graph is unknown, UAI 2015.
  • Triantafillou & Tsamardinos, Constraint-based causal discovery from multiple interventions over overlapping variable sets, JMLR 2015.

Other references
  • Maier et al., A sound and complete algorithm for learning causal models from relational data, UAI 2013.
  • Chalupka et al., Unsupervised discovery of El Niño using causal feature learning on microlevel climate data, UAI 2016.
  • Stekhoven et al., Causal stability ranking, Bioinformatics 2012.

Thank you!
