

SLIDE 1

Foundations of Causal Discovery

Frederick Eberhardt

KDD Causality Workshop 2016

SLIDES 2-7

Causal Discovery

truth (unknown): a causal graph over the variables x, y, z, w
➟ data sample: samples of x, y, z, w
➟ inference algorithm, with assumptions, e.g.
  • causal Markov
  • causal faithfulness
  • functional form
  • etc.
➟ equivalence classes and model specifications: which direct edges and confounders are present

[Figure: two candidate graphs over x, y, z, w, together with 4×4 model-specification matrices whose entries are 0 (absent), ? (undetermined), or identified coefficients a, b]

SLIDES 8-14

Constraint-based Causal Discovery

truth (unknown): a causal graph over x, y, z, w ➟ distribution P(W, X, Y, Z) ➟ data sample (samples of x, y, z, w)

statistical inference on the sample yields probabilistic independence constraints, e.g.

    x ⊥⊥ y | {z, w}

the graphical counterpart is d-separation, e.g.

    x ⊥ y | {z, w}    (x is d-separated from y given {z, w})

conditions linking the two:
  • Markov
  • faithfulness

output: equivalence classes of causal graphs
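The statistical-inference step can be made concrete. Below is a minimal sketch, assuming linear-Gaussian data: test x ⊥⊥ y | {z, w} with a partial-correlation (Fisher z) test. The helper names and the example graph are illustrative, not from the slides.

```python
import numpy as np
from scipy import stats

def partial_corr(data, i, j, cond):
    """Partial correlation of columns i and j given the columns in cond."""
    idx = [i, j] + list(cond)
    prec = np.linalg.inv(np.corrcoef(data[:, idx], rowvar=False))
    return -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])

def ci_test(data, i, j, cond, alpha=0.05):
    """True if 'x_i independent of x_j given cond' is not rejected."""
    n = data.shape[0]
    r = partial_corr(data, i, j, cond)
    z = 0.5 * np.log((1 + r) / (1 - r))           # Fisher z-transform
    stat = np.sqrt(n - len(cond) - 3) * abs(z)    # ~ N(0, 1) under H0
    return 2 * (1 - stats.norm.cdf(stat)) > alpha

# Toy data from the collider x -> z <- y, with w a child of z:
rng = np.random.default_rng(0)
x, y = rng.normal(size=2000), rng.normal(size=2000)
z = x + y + 0.5 * rng.normal(size=2000)
w = z + 0.5 * rng.normal(size=2000)
data = np.column_stack([x, y, z, w])
print(ci_test(data, 0, 1, []))    # expected True:  x independent of y
print(ci_test(data, 0, 1, [2]))   # expected False: conditioning on collider z opens the path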

SLIDES 15-16

Causal Markov

x is independent of its non-descendants given its parents in the causal graph.

[Figure: example graph over u, v, w, x, y, z]

Violations of Causal Markov
  • quantum mechanics
  • [unmeasured common causes]
  • [mixtures of populations]
  • [variables are not distinct, or too coarsely grained]
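A quick numeric check of the Markov property, on a linear chain u → x → y of my own construction (not from the slides): given its parent x, the variable y is independent of its non-descendant u.

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.normal(size=50_000)
x = 0.9 * u + rng.normal(size=50_000)
y = 0.7 * x + rng.normal(size=50_000)

# Partial correlation via residualization on the conditioning set {x}:
ry = y - np.polyval(np.polyfit(x, y, 1), x)
ru = u - np.polyval(np.polyfit(x, u, 1), x)
print(np.corrcoef(u, y)[0, 1])     # clearly nonzero: u and y are dependent
print(np.corrcoef(ru, ry)[0, 1])   # ~ 0: y independent of u given x
```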
SLIDES 17-18

Causal Faithfulness

If x is independent of y given C in the probability distribution, then x is d-separated from y given C in the graph.

Violations of Causal Faithfulness
  • canceling pathways
  • matching pennies cases
  • [small sample sizes and near violations of faithfulness]

[Figure: canceling pathways x → y → z with coefficients a and b plus a direct edge x → z with coefficient −ab, so that x ⊥⊥ z; matching pennies: x, y, z related by x-or]
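A small numeric illustration of the canceling-pathways violation above (the coefficient values are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 0.8, 0.5
x = rng.normal(size=100_000)
y = a * x + rng.normal(size=100_000)
z = b * y - a * b * x + rng.normal(size=100_000)   # direct edge weight -ab

# The two x-to-z pathways cancel exactly: cov(x, z) = ab + (-ab) = 0,
# so x and z are uncorrelated even though x is a cause of z.
print(np.corrcoef(x, z)[0, 1])   # ~ 0
print(np.corrcoef(y, z)[0, 1])   # clearly nonzero
```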

SLIDES 19-22

Assumptions

  • causal Markov: permits inference from probabilistic dependence to causal connection
  • causal faithfulness: permits inference from probabilistic independence to causal separation
  • causal sufficiency: there are no unmeasured common causes
  • acyclicity: no variable is an (indirect) cause of itself

[Figure: example graphs over x, y, z; one with unmeasured common causes l1, l2 and one with a cycle]

SLIDES 23-25

Assumptions (continued)

  • Markov • faithfulness • acyclicity • causal sufficiency

Under these assumptions, all graphs in an equivalence class have:
  • the same adjacencies ("skeleton")
  • the same unshielded colliders

[Figure: an unshielded collider over x, y, z]

[Verma & Pearl 1990, Frydenberg 1990]
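A brute-force check of this characterization (a sketch of my own, not from the slides): enumerate all DAGs over three labeled variables and group them by (skeleton, unshielded colliders). The groups are exactly the Markov equivalence classes, 25 DAGs in 11 classes.

```python
from itertools import product

pairs = [(0, 1), (0, 2), (1, 2)]   # variables x=0, y=1, z=2

def dags():
    """All 25 DAGs on three labeled nodes."""
    for states in product([None, 0, 1], repeat=3):
        edges = frozenset((a, b) if s == 0 else (b, a)
                          for (a, b), s in zip(pairs, states) if s is not None)
        # with one orientation per pair, the only cycles are the two 3-cycles
        if {(0, 1), (1, 2), (2, 0)} <= edges or {(1, 0), (2, 1), (0, 2)} <= edges:
            continue
        yield edges

def skeleton(edges):
    return frozenset(frozenset(e) for e in edges)

def unshielded_colliders(edges):
    adj = skeleton(edges)
    return frozenset((x, z, y)
                     for z in range(3)
                     for x in range(3) for y in range(3)
                     if x < y and (x, z) in edges and (y, z) in edges
                     and frozenset((x, y)) not in adj)

classes = {}
for g in dags():
    classes.setdefault((skeleton(g), unshielded_colliders(g)), []).append(g)

print(sum(len(v) for v in classes.values()), "DAGs in", len(classes), "classes")
# expected: 25 DAGs in 11 classes
```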
SLIDES 26-36

Worked example

[Figure: enumeration of all 25 acyclic graphs over x, y, z, progressively crossed out as constraints arrive]

Observed constraints:

    x ⊥⊥ y        x ⊥̸⊥ z        y ⊥̸⊥ z

} sufficient to determine the equivalence class; in this case a unique causal graph, the unshielded collider x → z ← y.

For linear Gaussian and for multinomial causal relations, an algorithm that identifies the Markov equivalence class of the true model is complete.

(Pearl & Geiger 1988, Meek 1995)

SLIDES 37-40

Staying in business

  • Weaken the assumptions (and increase the equivalence class)
      • allow for unmeasured common causes
      • allow for cycles
      • weaken faithfulness
  • Exclude the limitations (and reduce the equivalence class)
      • restrict to non-Gaussian error distributions
      • restrict to non-linear causal relations
      • restrict to specific discrete parameterizations
  • Include more general data collection set-ups (and see how assumptions can be adjusted and what equivalence class results)
      • experimental evidence
      • multiple (overlapping) data sets
      • relational data

[Zhalama talk; Tank talk]

SLIDE 41

Limitations

For linear Gaussian and for multinomial causal relations, an algorithm that identifies the Markov equivalence class of the true model is complete.

(Pearl & Geiger 1988, Meek 1995)

SLIDES 42-43

Linear non-Gaussian method (LiNGaM)

  • Linear causal relations:

        x_i = Σ_{x_j ∈ Pa(x_i)} b_ij x_j + ε_i,    ε_i ∼ non-Gaussian

    (the coefficient glyph was lost in extraction; b_ij is the standard LiNGaM notation)

  • Assumptions:
      • causal Markov
      • causal sufficiency
      • acyclicity
  • If the errors are non-Gaussian, then the true graph is uniquely identifiable from the joint distribution.

[Shimizu et al., 2006]

SLIDES 44-49

Two variable case

True model (x → y):

    y = x + ε_y,    x = ε_x,    with x ⊥⊥ ε_y

Backwards model (y → x):

    x = θ y + ε̃_x,    which would require y ⊥⊥ ε̃_x

Substituting the true model into the backwards residual:

    ε̃_x = x − θ y = x − θ(x + ε_y) = (1 − θ) x − θ ε_y

SLIDES 50-51

Why Normals are unusual

Forwards model: y = x + ε_y. For the backwards model, can y and ε̃_x = (1 − θ) x − θ ε_y be independent?

Theorem 1 (Darmois-Skitovich). Let X_1, ..., X_n be independent, non-degenerate random variables. If two linear combinations

    l_1 = a_1 X_1 + ... + a_n X_n,    a_i ≠ 0
    l_2 = b_1 X_1 + ... + b_n X_n,    b_i ≠ 0

are independent, then each X_i is normally distributed.

Both y and ε̃_x are linear combinations of x and ε_y with nonzero coefficients, so they can only be independent if x and ε_y are Gaussian.
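A sketch of the resulting asymmetry (my own construction, not the LiNGaM algorithm itself), assuming uniform noise: regress each way and probe whether the residual is independent of the regressor, here with a crude fourth-order dependence score. With Gaussian noise both scores would be near zero, exactly as Darmois-Skitovich predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(-1, 1, n)          # non-Gaussian cause
y = x + rng.uniform(-1, 1, n)      # y = x + eps_y

def residual_dependence(u, v):
    """Regress v on u; score the dependence of u and the residual."""
    theta = np.cov(u, v)[0, 1] / np.var(u)
    resid = v - theta * u
    # corr of squares: zero when residual and regressor are independent
    return abs(np.corrcoef(u ** 2, resid ** 2)[0, 1])

print("x -> y:", residual_dependence(x, y))   # ~ 0
print("y -> x:", residual_dependence(y, x))   # clearly > 0
```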

SLIDES 52-54

| algorithm | Markov | faithfulness | causal sufficiency | acyclicity | parametric assumption | output |
|---|---|---|---|---|---|---|
| PC / GES | ✓ | ✓ | ✓ | ✓ | ✗ | Markov equivalence |
| FCI | ✓ | ✓ | ✗ | ✓ | ✗ | PAG |
| CCD | ✓ | ✓ | ✓ | ✗ | ✗ | PAG |
| LiNGaM | ✓ | ✗ | ✓ | ✓ | linear non-Gaussian | unique DAG |
| lvLiNGaM | ✓ | ✓ | ✗ | ✓ | linear non-Gaussian | set of DAGs |
| cyclic LiNGaM | ✓ | ~ | ✓ | ✗ | linear non-Gaussian | set of graphs |

SLIDES 55-56

Limitations

For linear Gaussian and for multinomial causal relations, an algorithm that identifies the Markov equivalence class of the true model is complete.

(Pearl & Geiger 1988, Meek 1995)

SLIDES 57-58

Bivariate Linear Gaussian case

True model:

    y = x + ε_y,    x = ε_x,    ε_x, ε_y ∼ independent Gaussian

[Figure, panels a-c: scatter plot with the fitted conditionals p(y | x) and p(x | y) for the forwards (true) model and the backwards model]

(graphics from Hoyer et al. 2009)

SLIDES 59-62

Continuous additive noise models

    x_j = f_j(pa(x_j)) + ε_j

  • If f_j(.) is linear, then non-Gaussian errors are required for identifiability
  ➡ What if the errors are Gaussian, but f_j(.) is non-linear?
  ➡ More generally, under what circumstances is the causal structure represented by this class of models identifiable?

SLIDES 63-66

Bivariate non-linear Gaussian additive noise model

True model:

    y = x + x³ + ε_y,    x = ε_x,    ε_x, ε_y ∼ independent Gaussian

Backwards model:

    x = g(y) + ε̃_x,    but now y ⊥̸⊥ ε̃_x

[Figure, panels d-f: p(y | x) and p(x | y) for the forwards (true) model and the backwards model]

(graphics from Hoyer et al. 2009)
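The same residual-independence probe as before, applied to this non-linear Gaussian example (a sketch of my own; the polynomial fit and the dependence score are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)
y = x + x ** 3 + rng.normal(size=n)

def anm_dependence(u, v, degree=5):
    """Fit v = f(u) + resid with a polynomial; score u-resid dependence."""
    resid = v - np.polyval(np.polyfit(u, v, degree), u)
    return abs(np.corrcoef(u ** 2, resid ** 2)[0, 1])

print("x -> y:", anm_dependence(x, y))   # ~ 0: residual independent of x
print("y -> x:", anm_dependence(y, x))   # > 0: no valid backwards model
```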

SLIDES 67-72

General non-linear additive noise models

Hoyer Condition (HC): a technical condition on the relation between the function, the noise distribution and the parent distribution that, if satisfied, permits a backward model.

  • If the error terms are Gaussian, then the only functional form that satisfies HC is linearity; otherwise the model is identifiable.
  • If the errors are non-Gaussian, then there are (rather contrived) functions that satisfy HC, but in general identifiability is guaranteed.
  • This generalizes to multiple variables (assuming minimality*)!
  • There is an extension to discrete additive noise models.
  • If the function is linear but the error terms are non-Gaussian, then one cannot fit a linear backwards model (LiNGaM), but there are cases where one can fit a non-linear backwards model.

SLIDES 73-74

| algorithm | Markov | faithfulness | causal sufficiency | acyclicity | parametric assumption | output |
|---|---|---|---|---|---|---|
| PC / GES | ✓ | ✓ | ✓ | ✓ | ✗ | Markov equivalence |
| FCI | ✓ | ✓ | ✗ | ✓ | ✗ | PAG |
| CCD | ✓ | ✓ | ✓ | ✗ | ✗ | PAG |
| LiNGaM | ✓ | ✗ | ✓ | ✓ | linear non-Gaussian | unique DAG |
| lvLiNGaM | ✓ | ✓ | ✗ | ✓ | linear non-Gaussian | set of DAGs |
| cyclic LiNGaM | ✓ | ~ | ✓ | ✗ | linear non-Gaussian | set of graphs |
| non-linear additive noise | ✓ | minimality | ✓ | ✓ | non-linear additive noise | unique DAG |

SLIDES 75-86

Experiments, Background Knowledge and all the other Jazz

  • how to integrate data from experiments?
    [Figure: graph over x, y, z with unmeasured common causes l1, l2]
  • how to include background knowledge?
      • pathways (e.g. a directed pathway from x to w)
      • tier orderings (e.g. some variables known to precede others)
      • "priors" (e.g. over the edges among x, y, z, w)
  • specific search space restrictions
      • biological settings
      • subsampled time series

[Tank talk]

SLIDES 87-94

High-Level

data samples, possibly several and over overlapping variable sets (e.g. {x, y, z, w} and {x, y, w})

assumptions, e.g.
  • causal Markov
  • causal faithfulness
  • etc.

background knowledge, e.g.
  • pathways
  • tier ordering
  • "priors"
  • etc.

setting, e.g.
  • time series
  • internal latent structures
  • etc.

}  all yield (in)dependence constraints of the form  x ⊥̸⊥ y | C || J
   (x dependent on y given conditioning set C, in data set / experimental setting J)

Encode these as logical constraints on the underlying graph structure, then hand them to a (max) SAT-solver.

SLIDES 95-99

SAT-based Causal Discovery

  • Formulate the independence constraints in propositional logic:

        x ⊥⊥ y  ⟺  ¬A ∧ ¬B ∧ ...,    where A = 'x → y is present'

  • Encode the constraints into one formula:

        ¬A ∧ ¬B ∧ ¬(C ∧ D) ∧ ¬...

  • Find satisfying assignments using a SAT-solver:

        A = false, B = false, ...   ⟺   a graph over x, y, z

➡ very general setting (allows for cycles and latents) and trivially complete
➡ BUT: erroneous test results induce conflicting constraints: UNsatisfiable
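A toy version of the encoding (my own; the slide only labels A, so the readings of B, C, D below are illustrative): with A = 'x → y', B = 'y → x', C = 'x → z', D = 'z → y', the constraint x ⊥⊥ y rules out a direct edge and the pathway through z, and the satisfying assignments can be found by enumeration in place of a SAT-solver.

```python
from itertools import product

def formula(A, B, C, D):
    # mirrors the slide's clause: no edge between x and y,
    # and no directed pathway x -> z -> y
    return (not A) and (not B) and not (C and D)

models = [bits for bits in product([False, True], repeat=4) if formula(*bits)]
print(len(models), "satisfying assignments; e.g.", models[0])
```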

SLIDES 100-102

Conflicts and Errors

  • Statistical independence tests produce errors
  ➡ Conflict: no graph can produce the set of constraints

    constraint        weight
    x ⊥⊥ y            3000
    x ⊥̸⊥ z            2500
    y ⊥̸⊥ z             500
    x ⊥⊥ y | z         250

[Figure: the collider x → z ← y satisfies the first three constraints but entails x ⊥̸⊥ y | z; a chain through z satisfies x ⊥⊥ y | z but entails x ⊥̸⊥ y; no single graph satisfies all four]

[Sridhar talk]

SLIDES 103-105

Constraint Satisfaction Approach

  • INPUT: (in)dependence constraints weighted according to reliability
  • OUTPUT: a graph G that minimizes the cost

        min_G  Σ_{k : constraint k is not satisfied by G}  w(k)

  • Answer Set Programming (ASP)
      • solver used: Clingo
      • finds the globally optimal weighted maxSAT solution

What are suitable weights? (A brute-force sketch of the objective follows below.)

[Hyttinen et al. 2014]
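Before turning to the weights, here is a brute-force sketch of this objective on the three-variable conflict from the previous slide (my own construction; real systems use ASP/maxSAT solvers rather than enumeration):

```python
from itertools import product

X, Y, Z = 0, 1, 2
pairs = [(X, Y), (X, Z), (Y, Z)]

def all_dags():
    for states in product([None, 0, 1], repeat=3):
        edges = {(a, b) if s == 0 else (b, a)
                 for (a, b), s in zip(pairs, states) if s is not None}
        if {(0, 1), (1, 2), (2, 0)} <= edges or {(1, 0), (2, 1), (0, 2)} <= edges:
            continue   # skip the two directed 3-cycles
        yield edges

def d_connected(edges, a, b, cond):
    """d-connection for 3-node DAGs: direct edge, or a path via the third node."""
    adj = {frozenset(e) for e in edges}
    if frozenset((a, b)) in adj:
        return True
    (c,) = {X, Y, Z} - {a, b}
    if frozenset((a, c)) in adj and frozenset((c, b)) in adj:
        collider = (a, c) in edges and (b, c) in edges
        return collider == (c in cond)   # collider path open iff conditioned on
    return False

# (a, b, cond, independent?, weight): the conflicting set from the slide
constraints = [
    (X, Y, (),   True,  3000),   # x independent of y
    (X, Z, (),   False, 2500),   # x dependent on z
    (Y, Z, (),   False,  500),   # y dependent on z
    (X, Y, (Z,), True,   250),   # x independent of y given z
]

def cost(edges):
    # a constraint is violated when the graph disagrees with it
    return sum(w for a, b, cond, indep, w in constraints
               if d_connected(edges, a, b, cond) == indep)

best = min(all_dags(), key=cost)
print(sorted(best), "cost:", cost(best))   # collider x->z, y->z; cost 250
```

The optimum sacrifices exactly the cheapest constraint (weight 250), which is the intended behavior of the weighted scheme.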

SLIDES 106-108

Weighting Schemes

  • Constant weights
      • unit weights for all constraints
  • Hard dependencies
      • only treat rejections of the null hypothesis as hard constraints, in line with classical statistics
      • give dependences infinite weight, and maximize the independences (unit weight) in light of these dependences
  • Log weights
      • obtain the probability of an (in)dependence and weigh it according to the log of that probability
      • model selection with Bayes rule:

            x ⊥̸⊥ y | C:  P(x | C) P(y | x, C)    vs.    x ⊥⊥ y | C:  P(x | C) P(y | C)

[Hyttinen et al. 2014]
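One way to realize the log-weight scheme, as a sketch of my own (assuming linear-Gaussian data and a BIC approximation to the model probabilities; Hyttinen et al. use proper local scores):

```python
import numpy as np

def bic(y, X):
    """BIC of a Gaussian linear regression of y on the columns of X."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X]) if X.size else np.ones((n, 1))
    resid = y - X1 @ np.linalg.lstsq(X1, y, rcond=None)[0]
    return n * np.log(resid @ resid / n) + X1.shape[1] * np.log(n)

def log_weight(data, i, j, cond):
    """Return (independent?, weight) for the constraint x_i indep x_j | cond."""
    y = data[:, j]
    Xc = data[:, list(cond)]
    Xcx = np.column_stack([Xc, data[:, i]]) if Xc.size else data[:, [i]]
    b_indep, b_dep = bic(y, Xc), bic(y, Xcx)           # lower BIC = better model
    return b_dep > b_indep, abs(b_dep - b_indep) / 2   # ~ |log Bayes factor|

rng = np.random.default_rng(3)
x = rng.normal(size=5000)
z = x + rng.normal(size=5000)
data = np.column_stack([x, z])
print(log_weight(data, 0, 1, []))   # (False, large weight): clearly dependent
```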

SLIDE 109

Simulation 1: no cycles, no latents, linear Gaussian

  • cPC returns a fully determined output only 58/200 times at its optimum

[Figure: TPR vs. FPR of all d-separation constraints of the true graph for a varying p-value cut-off; curves for exact Bayesian score-based, SAT-based, cPC, PC, and tests-only. 1 observational data set, 6 observed variables, average degree 2; 500 samples, 200 models, linear Gaussian parameterization]

[Hyttinen et al. 2014]

SLIDE 110

Simulation 2: no cycles, but latents

  • cFCI only returns unambiguous results 61/200 times at its optimum

[Figure: TPR vs. FPR; curves for SAT with log weights, cFCI, FCI, and tests-only]

[Hyttinen et al. 2014]

SLIDE 111

Simulation 3: cycles and latents

[Figure: TPR vs. FPR; curves for SAT with log weights (sec. 4.3), hard deps (sec. 4.1), constant weights (sec. 4.2), and tests-only]

[Hyttinen et al. 2014]

SLIDES 112-117

Background Knowledge

  • pathways: a directed pathway x ⇝ w  ⟺  x ⊥̸⊥ w || x  (w depends on x under an intervention on x); can be added as a soft constraint, e.g. weight = 0.8
  • tier orderings: z and w before x and y  ⟺  (x > z) ∧ (x > w) ∧ (y > z) ∧ (y > w)
  • "priors"  ⟺  specific probabilities for each graph, or a soft sparsity constraint

SLIDES 118-119

| algorithm | Markov | faithfulness | causal sufficiency | acyclicity | parametric assumption | output |
|---|---|---|---|---|---|---|
| PC / GES | ✓ | ✓ | ✓ | ✓ | ✗ | Markov equivalence |
| FCI | ✓ | ✓ | ✗ | ✓ | ✗ | PAG |
| CCD | ✓ | ✓ | ✓ | ✗ | ✗ | PAG |
| LiNGaM | ✓ | ✗ | ✓ | ✓ | linear non-Gaussian | unique DAG |
| lvLiNGaM | ✓ | ✓ | ✗ | ✓ | linear non-Gaussian | set of DAGs |
| cyclic LiNGaM | ✓ | ~ | ✓ | ✗ | linear non-Gaussian | set of graphs |
| non-linear additive noise | ✓ | minimality | ✓ | ✓ | non-linear additive noise | unique DAG |
| maxSAT | ✓ | ✓ | ✗ | ✗* | ✗ | query based |

SLIDE 120

Simulation 4: Scalability

  • up to 10 variables and only a few overlapping data sets for now

[Figure: solving time per instance (s) over 100 instances (sorted for each line), for log weights, constant weights, and hard deps]

[Hyttinen et al. 2014]

SLIDES 121-127

Output of Causal Search Algorithms

(max) SAT-solver ➟ equivalence class? (e.g. a set of graphs over x, y, z, w, etc.)

Query:
  • list the structures in the equivalence class
  • which structural features are determined?
      • edges, confounders
      • ancestral relations
      • pathways
  • what are the highest scoring equivalence classes?

Response:
  • enumeration of solutions
  • the "backbone" of the SAT instance

SLIDES 128-129

Computing Causal Effects

(max) SAT-solver ➟ equivalence class (a set of graphs over x, y, z, w, etc.) ➟ P(y | do(x)) ?

[Grant talk]

SLIDES 130-132

Computing P(y | do(x)) from an equivalence class

  • search in the equivalence class over the possible applications of the do-calculus rules by querying the satisfaction of their d-separation conditions

do-calculus:

    Rule 1 (insertion/deletion of observations):  P(y | do(x), z, w) = P(y | do(x), w)         if Y ⊥⊥ Z | X, W || X
    Rule 2 (action/observation exchange):         P(y | do(x), do(z), w) = P(y | do(x), z, w)  if Y ⊥⊥ I_Z | X, Z, W || X
    Rule 3 (insertion/deletion of actions):       P(y | do(x), do(z), w) = P(y | do(x), w)     if Y ⊥⊥ I_Z | X, W || X

[Hyttinen et al. 2015]
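For orientation, a minimal worked instance in Pearl's graph-known formulation (my own example, not from the slides): in the graph x → y with no latent confounder, Rule 2 licenses exchanging the action for an observation.

```latex
% Rule 2 (action/observation exchange) in Pearl's formulation:
% P(y | do(x)) = P(y | x) holds if Y and X are d-separated in the
% graph with the edges *leaving* X removed. In x -> y, removing
% the outgoing edge of x disconnects x from y, so the rule applies:
\[
  P(y \mid \mathrm{do}(x)) \;=\; P(y \mid x)
  \qquad\text{since }\; (Y \perp\!\!\!\perp X)_{G_{\underline{X}}} .
\]
```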

SLIDES 133-136

High-Level (recap)

data samples + assumptions + background knowledge + setting
➟ (in)dependence constraints, e.g.  x ⊥̸⊥ y | C || J
➟ encode these as logical constraints on the underlying graph structure
➟ (max) SAT-solver
➟ QUERY?  e.g.  P(y | do(x)),  or: is there an edge or pathway from x to w?

SLIDES 137-142

Just getting started…

  • application [Stekhoven et al. 2012]
  • multi-scale causal analysis: micro- to macro-variables [Chalupka et al. 2016]
  • time-series and dynamics
  • violations of the Markov property: non-causal relations [Maier et al. 2013]

[Sokolova talk; Blondel talk]

SLIDE 143

References

Limitations
  • Verma & Pearl, Equivalence and synthesis of causal models, UAI 1990.
  • Frydenberg, The chain graph Markov property, Scandinavian Journal of Statistics 1990.
  • Geiger & Pearl, On the logic of influence diagrams, UAI 1988.
  • Meek, Strong completeness and faithfulness in Bayesian networks, UAI 1995.

LiNGaM
  • Shimizu et al., A linear non-Gaussian acyclic model for causal discovery, JMLR 2006.
  • Hoyer et al., Estimation of causal effects using linear non-Gaussian causal models with hidden variables, IJAR 2008.
  • Lacerda et al., Discovering cyclic causal models by Independent Component Analysis, UAI 2008.

Additive noise models
  • Hoyer et al., Nonlinear causal discovery with additive noise models, NIPS 2009.
  • Mooij et al., Regression by dependence minimization and its application to causal inference, ICML 2009.
  • Peters et al., Causal inference on discrete data using additive noise models, IEEE…, 2011.
  • Peters et al., Identifiability of causal graphs using functional models, UAI 2011.

SAT-based approaches
  • Triantafillou et al., Learning causal structure from overlapping variable sets, AISTATS 2010.
  • Claassen & Heskes, A logical characterization of constraint-based causal discovery, UAI 2011.
  • Hyttinen et al., Discovering cyclic causal models with latent variables: A SAT-based approach, UAI 2013.
  • Hyttinen et al., Constraint-based causal discovery: Conflict resolution with Answer Set Programming, UAI 2014.
  • Hyttinen et al., Do-calculus when the true graph is unknown, UAI 2015.
  • Triantafillou & Tsamardinos, Constraint-based causal discovery from multiple interventions over overlapping variable sets, JMLR 2015.

Other references
  • Maier et al., A sound and complete algorithm for learning causal models from relational data, UAI 2013.
  • Chalupka et al., Unsupervised discovery of El Niño using causal feature learning on microlevel climate data, UAI 2016.
  • Stekhoven et al., Causal stability ranking, Bioinformatics 2012.

Thank you!
