High-dimensional causal inference, graphical modeling and structural equation models
Peter Bühlmann, Seminar für Statistik, ETH Zürich

cannot do confirmatory causal inference without randomized intervention experiments... but we can do better than proceeding naively
Goal
in genomics: if we made an intervention at a single gene, what would be its effect on a phenotype of interest? we want to infer/predict such effects without actually doing the intervention, i.e. from observational data (observations of a "steady-state system"); it doesn't need to be genes, and it generalizes to interventions at more than one variable/gene
Policy making James Heckman: Nobel Prize Economics 2000 e.g.: “Pritzker Consortium on Early Childhood Development identifies when and how child intervention programs can be most influential”
Genomics
- 1. Flowering of Arabidopsis Thaliana
phenotype/response variable of interest: Y = days to bolting (flowering); "covariates": X = gene expressions from p = 21′326 genes
remark: "gene expression" is the process by which information from a gene is used in the synthesis of a functional gene product (e.g. a protein)
question: infer/predict the effect of knocking out/knocking down (or enhancing) a single gene (expression) on the phenotype/response variable Y
- 2. Gene expressions of yeast
p = 5360 genes; phenotype of interest: Y = expression of the first gene, "covariates" X = gene expressions from all other genes; then phenotype of interest: Y = expression of the second gene, "covariates" X = gene expressions from all other genes; and so on
❀ infer/predict the effects of a single gene knock-down on all other genes
❀ consider the framework of an intervention effect = causal effect (mathematically defined ❀ see later)
Regression – the "statistical workhorse": the wrong approach
we could use a linear model (fitted from n observational data points):

Y = ∑_{j=1}^p βj X(j) + ε,  Var(X(j)) ≡ 1 for all j
|βj| measures the effect of variable X(j) in terms of "association", i.e. the change of Y as a function of X(j) when keeping all other variables X(k) fixed
❀ not very realistic for the intervention problem: if we change e.g. one gene, some others will change as well, and these others are not (cannot be) kept fixed
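The gap between a regression coefficient and an intervention effect can be seen in a tiny simulation (a hypothetical three-variable linear SEM with invented coefficients, not the gene data): the regression coefficient of X1 holds X2 fixed, while intervening on X1 also changes X2 downstream.

```python
import random

random.seed(0)
n = 50_000

# Hypothetical linear SEM: X1 <- e1, X2 <- 0.7*X1 + e2, Y <- 0.5*X1 + 0.6*X2 + eY
def draw(x1=None):
    e1 = random.gauss(0, 1)
    x1 = e1 if x1 is None else x1        # do(X1 = x1) replaces the assignment
    x2 = 0.7 * x1 + random.gauss(0, 1)
    y = 0.5 * x1 + 0.6 * x2 + random.gauss(0, 1)
    return x1, x2, y

data = [draw() for _ in range(n)]

# OLS of Y on (X1, X2): solve the 2x2 normal equations by hand (zero-mean data)
s11 = sum(x1 * x1 for x1, _, _ in data)
s12 = sum(x1 * x2 for x1, x2, _ in data)
s22 = sum(x2 * x2 for _, x2, _ in data)
t1 = sum(x1 * y for x1, _, y in data)
t2 = sum(x2 * y for _, x2, y in data)
det = s11 * s22 - s12 * s12
b1 = (s22 * t1 - s12 * t2) / det         # regression coefficient of X1

# intervention effect: E[Y | do(X1=1)] - E[Y | do(X1=0)], by actually intervening
y1 = sum(draw(x1=1.0)[2] for _ in range(n)) / n
y0 = sum(draw(x1=0.0)[2] for _ in range(n)) / n
causal = y1 - y0

print(f"regression coefficient of X1: {b1:.2f}")    # ~0.50 (direct effect only)
print(f"intervention effect of X1  : {causal:.2f}") # ~0.92 = 0.5 + 0.7*0.6
```

The regression coefficient only captures the direct effect 0.5; the intervention effect also includes the pathway through X2.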
and indeed:
[Figure: true positives vs. false positives for IDA, Lasso, Elastic-net and random guessing]
❀ can do much better than (penalized) regression!
Effects of single gene knock-downs on all other genes (yeast) (Maathuis, Colombo, Kalisch & PB, 2010)
- p = 5360 genes (expression of genes)
- 231 gene knock downs ❀ 1.2 · 106 intervention effects
- the truth is “known in good approximation”
(thanks to intervention experiments)
goal: prediction of the true large intervention effects based on observational data with no knock-downs, n = 63 observational data points
[Figure: true positives vs. false positives for IDA, Lasso, Elastic-net and random guessing]
A bit more specifically
◮ univariate response Y ◮ p-dimensional covariate X
question: what is the effect of setting the jth component of X to a certain value x: do(X (j) = x) ❀ this is a question of intervention type not the effect of X (j) on Y when keeping all other variables fixed (regression effect)
Reichenbach, 1956; Suppes, 1970; Rubin, Dawid, Holland, Pearl, Glymour, Scheines, Spirtes,...
Intervention calculus (a review)
"dynamic" notion of an effect: if we set a variable X(j) to a value x (intervention) ❀ some other variables X(k) (k ≠ j) and maybe Y will change
we want to quantify the "total" effect of X(j) on Y, including the effects of "all changed" X(k) on Y
❀ a graph or influence diagram will be very useful
[Figure: DAG on the nodes X(1), X(2), X(3), X(4) and Y]
for simplicity: just consider DAGs (Directed Acyclic Graphs)
[with hidden variables (Spirtes, Glymour & Scheines (1993); Colombo et al. (2012)): much more complicated and not validated with real data]
random variables are represented as nodes in the DAG
assume a Markov condition, saying that X(j), given its parents X(pa(j)), is conditionally independent of its non-descendant variables
❀ recursive factorization of the joint distribution:

P(Y, X(1), . . . , X(p)) = P(Y | X(pa(Y))) ∏_{j=1}^p P(X(j) | X(pa(j)))

for intervention calculus: use the truncated factorization (e.g. Pearl)
assume the Markov property for the causal DAG

non-intervention DAG on X(1), X(2), X(3), X(4), Y:
P(Y, X(1), X(2), X(3), X(4)) = P(Y | X(1), X(3)) P(X(1) | X(2)) P(X(2) | X(3), X(4)) P(X(3)) P(X(4))

intervention do(X(2) = x): delete the edges into X(2) and use the truncated factorization:
P(Y, X(1), X(3), X(4) | do(X(2) = x)) = P(Y | X(1), X(3)) P(X(1) | X(2) = x) P(X(3)) P(X(4))
P(Y | do(X(2) = x)) = ∫ P(Y, X(1), X(3), X(4) | do(X(2) = x)) dX(1) dX(3) dX(4)
the truncated factorization is a mathematical consequence of the Markov condition (with respect to the causal DAG) for the observational probability distribution P (plus the assumption that the structural equations are "autonomous")
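The truncated factorization can be checked numerically on a binary version of this five-variable DAG; the conditional probability tables below are invented for illustration:

```python
from itertools import product

# Binary version of the slides' DAG: X3 -> X2 <- X4, X2 -> X1, X1 -> Y <- X3,
# with P(X3=1) = P(X4=1) = 0.5 and invented conditional probability tables.
def p2(x3, x4): return 0.2 + 0.5 * x3 + 0.2 * x4   # P(X2 = 1 | x3, x4)
def p1(x2):     return 0.9 if x2 == 1 else 0.1     # P(X1 = 1 | x2)
def py(x1, x3): return 0.1 + 0.6 * x1 + 0.2 * x3   # P(Y  = 1 | x1, x3)

def bern(p, v):  # P(V = v) for V ~ Bernoulli(p)
    return p if v == 1 else 1 - p

def p_y_do_x2(x2):
    """P(Y=1 | do(X2 = x2)) via the truncated factorization:
    drop the factor P(X2 | X3, X4) and plug X2 = x2 into the rest."""
    return sum(py(x1, x3) * bern(p1(x2), x1) * bern(0.5, x3) * bern(0.5, x4)
               for x1, x3, x4 in product([0, 1], repeat=3))

def p_y_given_x2(x2):
    """P(Y=1 | X2 = x2) by ordinary conditioning in the full factorization."""
    num = den = 0.0
    for x1, x3, x4 in product([0, 1], repeat=3):
        w = bern(p1(x2), x1) * bern(p2(x3, x4), x2) * bern(0.5, x3) * bern(0.5, x4)
        num += py(x1, x3) * w
        den += w
    return num / den

print(p_y_do_x2(1))     # 0.74 (up to float rounding)
print(p_y_given_x2(1))  # ~0.785: conditioning is not intervening
```

With these numbers P(Y=1 | do(X2=1)) = 0.74 while P(Y=1 | X2=1) ≈ 0.785; the gap is exactly the confounding through the common cause X3.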
the intervention distribution P(Y|do(X (2) = x)) can be calculated from
◮ observational data distribution
❀ need to estimate conditional distributions
◮ an influence diagram (causal DAG)
❀ need to estimate the structure of a graph/influence diagram

intervention effect: E[Y | do(X(2) = x)] = ∫ y P(y | do(X(2) = x)) dy
intervention effect at x0: (∂/∂x) E[Y | do(X(2) = x)] |_{x=x0}
in the Gaussian case (Y, X(1), . . . , X(p)) ∼ N_{p+1}(µ, Σ): (∂/∂x) E[Y | do(X(2) = x)] ≡ θ2 for all x
The backdoor criterion (Pearl, 1993)
we only need to know the local parental set: for Z = X(pa(X)), if Y ∉ pa(X):

P(Y | do(X = x)) = ∫ P(Y | X = x, Z = z) dP(z)

the parental set might not be the minimal adjustment set, but it always suffices
this is a consequence of the global Markov property: X_A independent of X_B given X_S whenever A and B are d-separated by S
Gaussian case
for the Gaussian case: (∂/∂x) E[Y | do(X(j) = x)] ≡ θj for all x
for Y ∉ pa(j): θj is the regression parameter in

Y = θj X(j) + ∑_{k∈pa(j)} θk X(k) + error

❀ only need the parental set and regression

[Figure: DAG on X(1), . . . , X(4), Y; e.g. j = 2, pa(2) = {3, 4}]
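A minimal simulation sketch of this Gaussian parental adjustment (all coefficients invented): regressing Y on X(2) together with its parents X(3), X(4) recovers the intervention effect θ2, while the unadjusted regression is confounded.

```python
import random

random.seed(1)
n = 20_000

# Linear Gaussian SEM mimicking the slides' DAG (invented coefficients):
# X3, X4 source nodes; X2 <- 0.8*X3 + 0.5*X4; X1 <- 0.7*X2; Y <- 0.6*X1 + 0.9*X3
rows = []
for _ in range(n):
    x3 = random.gauss(0, 1); x4 = random.gauss(0, 1)
    x2 = 0.8 * x3 + 0.5 * x4 + random.gauss(0, 1)
    x1 = 0.7 * x2 + random.gauss(0, 1)
    y = 0.6 * x1 + 0.9 * x3 + random.gauss(0, 1)
    rows.append((x2, x3, x4, y))

def ols(rows):
    """Least squares of the last column on the others (zero-mean data),
    via the normal equations and Gaussian elimination."""
    k = len(rows[0]) - 1
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * r[k] for r in rows) for i in range(k)]
    for i in range(k):                       # forward elimination
        for j in range(i + 1, k):
            f = A[j][i] / A[i][i]
            A[j] = [aj - f * ai for aj, ai in zip(A[j], A[i])]
            b[j] -= f * b[i]
    beta = [0.0] * k
    for i in reversed(range(k)):             # back substitution
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    return beta

# adjust for pa(2) = {X3, X4}: the X2-coefficient equals the intervention effect
theta2 = ols(rows)[0]
# no adjustment: the marginal regression is biased by the common cause X3
naive = ols([(r[0], r[3]) for r in rows])[0]

print(f"adjusted for parents: {theta2:.2f}")  # ~0.42 = 0.7 * 0.6
print(f"marginal regression : {naive:.2f}")   # noticeably larger (confounded)
```

The adjusted coefficient matches the total effect θ2 = 0.7 · 0.6 = 0.42 transmitted through X(1); the marginal coefficient is inflated by the backdoor path through X(3).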
when having no unmeasured confounder (variable): intervention effect (as defined) = causal effect recap: causal effect = effect from a randomized trial (but we want to infer it without a randomized study... because often we cannot do it, or it is too expensive) structural equation models provide a different (but closely related) route for quantifying intervention effects
Inferring intervention effects from observational distribution
main problem: inferring DAG from observational data impossible! can only infer equivalence class of DAGs (several DAGs can encode exactly the same conditional independence relationships) Example:
X → Y ("X causes Y")   vs.   X ← Y ("Y causes X")

a lot of work about identifiability:
Verma & Pearl (1991); Spirtes, Glymour & Scheines (1993); Tian & Pearl (2000–2002); Lauritzen & Richardson (2002); Shpitser & Pearl (2006–2011); vanderWeele & Robins (2007–2011); Drton, Foygel & Sullivant (2011);...
Markov equivalence class of DAGs P a family of probability distributions on Rp+1 Definition: M(D) = {P ∈ P; P is Markov w.r.t. DAG D} D ∼ D′ ⇔ M(D) = M(D′) the definition depends on the family P: typical models:
- P = Gaussian distributions
- P = nonparametric model/set of all distributions
- P = set of distributions from
additive structural equation model (see later)
A graphical characterization (Frydenberg, 1990; Verma & Pearl, 1991): for P = {Gaussian distributions} or P = {nonparametric distributions} it holds that D ∼ D′ ⇔ D and D′ have the same skeleton and the same v-structures
for a DAG D, we write its corresponding Markov equivalence class E(D) = {D′; D′ ∼ D}
Equivalence class of DAGs
- Several DAGs can encode exactly the same conditional
independence relationships. Such DAGs form an equivalence class.
- Example: unshielded triple
for the skeleton X1 − X2 − X3:

  X1 → X2 → X3 :  X1 ⊥ X3 false,  X1 ⊥ X3 | X2 true   (no v-structure)
  X1 ← X2 ← X3 :  X1 ⊥ X3 false,  X1 ⊥ X3 | X2 true   (no v-structure)
  X1 ← X2 → X3 :  X1 ⊥ X3 false,  X1 ⊥ X3 | X2 true   (no v-structure)
  X1 → X2 ← X3 :  X1 ⊥ X3 true,   X1 ⊥ X3 | X2 false  (v-structure)
- All DAGs in an equivalence class have the same skeleton and the same v-structures
- An equivalence class can be uniquely represented by a completed partially directed acyclic graph (CPDAG)

[Figure: a CPDAG and the four DAGs in its equivalence class]
we cannot estimate causal/intervention effects from the observational distribution, but we will be able to estimate lower bounds of causal effects
conceptual "procedure":
◮ probability distribution P from a DAG, generating the data
❀ true underlying equivalence class of DAGs (CPDAG)
◮ find all DAG-members of true equivalence class (CPDAG):
D1, . . . , Dm
◮ for every DAG-member Dr, and every variable X (j):
single intervention effect θr,j; summarize them by Θ = {θr,j; r = 1, . . . , m; j = 1, . . . , p}
❀ an identifiable parameter
IDA (oracle version)
[Diagram: oracle CPDAG (via the PC-algorithm) → DAG 1, DAG 2, . . . , DAG m → do-calculus → effect 1, effect 2, . . . , effect m → multi-set Θ]
If you want a single number for every variable ... instead of the multi-set Θ = {θr,j; r = 1, . . . , m; j = 1, . . . , p}:

minimal absolute value αj = min_r |θr,j| (j = 1, . . . , p), with |θtrue,j| ≥ αj

❀ the minimal absolute effect αj is a lower bound for the true absolute intervention effect
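Computing these lower bounds from the multi-set is a one-liner; here with an invented Θ for m = 3 DAG members and p = 4 variables:

```python
# Hypothetical multi-set of effects theta[r][j] for m = 3 DAG members and
# p = 4 variables (numbers invented for illustration):
theta = [
    [0.0, 0.9, -0.3, 0.5],   # effects computed in DAG D1
    [0.4, 0.9,  0.0, 0.5],   # effects computed in DAG D2
    [0.4, 1.1, -0.2, 0.5],   # effects computed in DAG D3
]

# alpha_j = min_r |theta_{r,j}| is a valid lower bound for |theta_true,j|,
# because the true DAG is one of the members of the equivalence class.
alpha = [min(abs(row[j]) for row in theta) for j in range(len(theta[0]))]
print(alpha)  # [0.0, 0.9, 0.0, 0.5]
```

Note that a variable with a large effect in some DAG members can still get the uninformative bound 0 (variables 1 and 3 here); only effects that are large in every member of the class are detected.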
Computationally tractable algorithm
searching all DAGs is computationally infeasible if p is large (we actually can do this up to p ≈ 15 − 20) instead of finding all m DAGs within an equivalence class ❀ compute all intervention effects without finding all DAGs (Maathuis, Kalisch & PB, 2009) key idea: exploring local aspects of the graph is sufficient
[Diagram: data → CPDAG (via the PC-algorithm) → do-calculus → effect 1, effect 2, . . . , effect q → multi-set ΘL]
the local ΘL = Θ up to multiplicities (Maathuis, Kalisch & PB, 2009)
Estimation from finite samples
notation: drop the Y-notation (Y = X(1), X(2), . . . , X(p))
difficult part: estimation of the CPDAG (equivalence class of DAGs) ❀ "structural learning": P ⇒ CPDAG (equiv. class of DAGs)

[Figure: estimated graph from pcAlgo(dm = d, alpha = 0.05) on 10 nodes]
two main approaches:
◮ multiple testing of conditional dependencies:
PC-algorithm as prime example
◮ score-based methods: MLE as prime example
Faithfulness assumption (necessary for conditional-dependence testing approaches): a distribution P is called faithful to a DAG G if all conditional independencies of P can be inferred from the graph (the Markov assumption lets us infer some conditional independencies from the graph; faithfulness requires that these are all of them)
assuming faithfulness ❀ can infer the CPDAG from a list of conditional (in-)dependence relations
What does it mean?

[Figure: DAG 1 → 2, 1 → 3, 2 → 3]

X(1) ← ε(1), X(2) ← α X(1) + ε(2), X(3) ← β X(1) + γ X(2) + ε(3), with ε(1), ε(2), ε(3) i.i.d. ∼ N(0, 1)
enforce marginal independence of X(1) and X(3): β + αγ = 0, e.g. α = β = 1, γ = −1

Σ = ( 1  1  0 ;  1  2  −1 ;  0  −1  2 ),   Σ^{−1} = ( 3  −2  −1 ;  −2  2  1 ;  −1  1  1 )

failure of faithfulness due to cancellation of coefficients
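The cancellation can be verified numerically from the structural coefficients (a sketch in the slides' Gaussian setting):

```python
# With alpha = beta = 1, gamma = -1 the SEM is
# X1 <- e1, X2 <- X1 + e2, X3 <- X1 - X2 + e3, with unit-variance errors.
# In matrix form X = B X + e, so X = (I - B)^{-1} e and Sigma = A A^T, A = (I - B)^{-1}.
B = [[0,  0, 0],
     [1,  0, 0],
     [1, -1, 0]]   # row j holds the parent coefficients of X_{j+1}

# (I - B) is lower triangular with unit diagonal -> invert by forward substitution:
p = 3
A = [[0.0] * p for _ in range(p)]
for j in range(p):     # solve (I - B) a = e_j, column by column
    for i in range(p):
        A[i][j] = (1.0 if i == j else 0.0) + sum(B[i][k] * A[k][j] for k in range(i))

Sigma = [[sum(A[i][k] * A[j][k] for k in range(p)) for j in range(p)]
         for i in range(p)]
print(Sigma)  # [[1, 1, 0], [1, 2, -1], [0, -1, 2]]
```

Cov(X1, X3) = 0 even though the edge 1 → 3 is present: the conditional-independence pattern lies about the graph, which is exactly the failure of faithfulness.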
failure of exact faithfulness is "rare" (Lebesgue measure zero)
but for statistical estimation (in the Gaussian case) one "often" requires strong faithfulness (Robins, Scheines, Spirtes & Wasserman, 2003):

min{ |ρ(i, j | S)| ; ρ(i, j | S) ≠ 0, i ≠ j, |S| ≤ d } ≥ τ,  τ ≍ √(d log(p)/n)

(d is the maximal degree of the skeleton of the DAG)
... strong faithfulness can be rather severe (Uhler, Raskutti, PB & Yu, 2013) 3 nodes, full graph unfaithful distributions due to exact cancellation imagine a strip around it ❀ large volume! ⇒ strong faithfulness is restrictive in high dimensions
The PC-algorithm (Spirtes & Glymour, 1991)
◮ crucial assumption:
distribution P is faithful to the true underlying DAG
◮ less crucial but convenient:
Gaussian assumption for Y, X (1), . . . , X (p) ❀ can work with partial correlations
◮ input: Σ̂_MLE, but we only need to consider many small sub-matrices of it (assuming sparsity of the graph)
◮ output: an estimated CPDAG, based on a clever data-dependent (random) sequence of multiple tests
PC-algorithm: a rough outline for estimating the skeleton of the underlying DAG
- 1. start with the full graph
- 2. correlation screening: remove edge i − j if Cor(X(i), X(j)) is small (Fisher's Z-transform and null-distribution of zero correlation)
- 3. partial correlations of order 1: remove edge i − j if Parcor(X(i), X(j) | X(k)) is small for some k in the current neighborhood of i or j (thanks to faithfulness)
- 4. move up to partial correlations of order 2: remove edge i − j if Parcor(X(i), X(j) | X(k), X(ℓ)) is small for some k, ℓ in the current neighborhood of i or j (thanks to faithfulness)
- 5. continue until removal of edges is not possible anymore, i.e. stop at the minimal order of partial correlation where edge-removal becomes impossible

[Figure: full graph → correlation screening → partial correlations of order 1 → stopped]
an additional step of the algorithm is needed for estimating directions; this yields an estimate of the CPDAG (equivalence class of DAGs)
one tuning parameter: the cut-off parameter α for truncation of the estimated Z-transformed partial correlations
if the graph is "sparse" (few neighbors) ❀ only few iterations, and only low-order partial correlations play a role; thus the estimation algorithm works for p ≫ n problems
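A population ("oracle") sketch of steps 1–3 for a three-node chain (toy coefficients; a real implementation such as the pcalg package tests sample partial correlations with Fisher's Z instead of an exact-zero threshold):

```python
import math

# Population covariance of the chain X1 -> X2 -> X3 (toy coefficients):
# X1 = e1, X2 = 0.8 X1 + e2, X3 = 0.8 X2 + e3, Var(ej) = 1
Sigma = [[1.0,  0.8,   0.64],
         [0.8,  1.64,  1.312],
         [0.64, 1.312, 2.0496]]
p = 3

def corr(i, j):
    return Sigma[i][j] / math.sqrt(Sigma[i][i] * Sigma[j][j])

def parcorr(i, j, k):
    """Order-1 partial correlation via the recursion used by the PC-algorithm."""
    r_ij, r_ik, r_jk = corr(i, j), corr(i, k), corr(j, k)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik**2) * (1 - r_jk**2))

EPS = 1e-9  # population version: test "exactly zero"; with data use Fisher's Z
edges = {(i, j) for i in range(p) for j in range(i + 1, p)}    # 1. full graph
edges = {e for e in edges if abs(corr(*e)) > EPS}              # 2. corr. screening
edges = {(i, j) for (i, j) in edges                            # 3. order-1 parcorr
         if all(abs(parcorr(i, j, k)) > EPS
                for k in range(p) if k not in (i, j))}
print(sorted(edges))  # [(0, 1), (1, 2)]
```

All three marginal correlations are nonzero, so correlation screening removes nothing; the spurious edge X1 − X3 disappears only at order 1, since ρ(1, 3 | 2) = 0, which is exactly the staged behavior of the algorithm.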
Computational complexity: crudely bounded, polynomial in p; a sparser underlying structure ❀ faster algorithm
we can easily do the computations for sparse cases with p ≈ 10^4 (≈ 1–2 hrs CPU time)

[Figure: log10 processor time vs. log10(p), for expected neighborhood sizes E[N] = 2 and E[N] = 8]
IDA (Intervention calculus when DAG is Absent) (Maathuis, Colombo, Kalisch & PB, 2010)
- 1. PC-algorithm ❀ estimated CPDAG
- 2. local algorithm ❀ Θ̂_local
- 3. lower bounds for absolute causal effects ❀ α̂j

R-package: pcalg
this is what we used in the yeast example to score the importance of the genes according to the size of α̂j
Statistical theory (Kalisch & PB, 2007; Maathuis, Kalisch & PB, 2009) n i.i.d. observational data points; p variables high-dimensional setting where p ≫ n assumptions:
◮ X(1), . . . , X(p) ∼ Np(0, Σ), Markov and faithful to the true DAG
◮ high-dimensionality: log(p) ≪ n
◮ sparsity: maximal degree d = maxj |ne(j)| ≪ n
◮ "coherence": maximal (partial) correlations bounded away from one: max{ |ρ(i, j | S)| ; i ≠ j, |S| ≤ d } ≤ C < 1
◮ signal strength/strong faithfulness: min{ |ρ(i, j | S)| ; ρ(i, j | S) ≠ 0, i ≠ j, |S| ≤ d } ≫ √(d log(p)/n)

Then, for suitable tuning parameters and 0 < δ < 1:
P[estimated CPDAG = true CPDAG] = 1 − O(exp(−C n^{1−δ}))
P[Θ̂L = Θ as sets] = 1 − O(exp(−C n^{1−δ}))
(i.e. consistency of the lower bounds for causal effects)
Main strategy of a proof we have to analyze an algorithm... ❀
◮ understand the population version:
in particular: can show that algorithm stops in d or d − 1 steps
◮ analysis of the noisy version: additional errors
use a union bound argument to control the overall error (seems very rough but leads to asymptotically “optimal” regime, e.g., for signal strength)
The role of “sparsity” in causal inference as usual: sparsity is necessary for accurate estimation in presence of noise but here: “sparsity” (so-called protectedness) is crucial for identifiability as well
X → Y vs. X ← Y: cannot tell from observational data the direction of the arrow
the same situation arises with a full graph on more than 2 nodes ❀ causal identification really needs "sparsity"; the sparser the graph, the tighter the bounds for causal effects
Maximum likelihood estimation
a Gaussian DAG model is a Gaussian linear structural equation model:

[Figure: DAG 1 → 2, 1 → 3, 2 → 3]

X(1) ← ε(1); X(2) ← β21 X(1) + ε(2); X(3) ← β31 X(1) + β32 X(2) + ε(3)
in general: X(j) ← ∑_{k=1}^p βjk X(k) + ε(j) (j = 1, . . . , p), with βjk ≠ 0 ⇔ edge k → j
in matrix notation: X = B X + ε, ε ∼ Np(0, diag(σ1², . . . , σp²))
❀ reparameterization
(Σ̂, D̂) = argmin_{Σ; D a DAG} −ℓ(Σ, D; data) + λ |D|
        = argmin_{B, {σj²; j}} −ℓ(B, {σj²; j}; data) + λ ‖B‖0,  ‖B‖0 = ∑_{i,j} I(Bij ≠ 0)

under the non-convex constraint that B corresponds to "no directed cycles"
a severe non-convex problem due to the "no directed cycle" constraint (the ‖·‖0-penalty rather than e.g. ‖·‖1 doesn't make the problem much harder)
toy example: X(1) ← β1 X(2) + ε1, X(2) ← β2 X(1) + ε2

[Figure: the (β1, β2) parameter space; acyclicity forces β1 β2 = 0, i.e. the union of the two axes]

non-convex parameter space! (no straightforward way to do a convex relaxation)
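The two-node non-identifiability can also be seen directly in the score: both orientations attain the same maximized Gaussian likelihood (a sketch with invented coefficients):

```python
import math, random

random.seed(2)
n = 5_000
# data from X1 -> X2 (invented coefficient): X1 = e1, X2 = 0.8 X1 + e2
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [0.8 * x + random.gauss(0, 1) for x in xs]

def gauss_dag_loglik(src, dst):
    """Maximized Gaussian log-likelihood of the DAG src -> dst:
    src is modelled marginally, dst by linear regression on src (no intercept,
    zero-mean data); up to an additive constant common to both DAGs."""
    v_src = sum(s * s for s in src) / n
    b = sum(s * d for s, d in zip(src, dst)) / sum(s * s for s in src)
    v_res = sum((d - b * s) ** 2 for s, d in zip(src, dst)) / n
    return -0.5 * n * (math.log(v_src) + math.log(v_res))

l_fwd = gauss_dag_loglik(xs, ys)  # DAG X1 -> X2
l_rev = gauss_dag_loglik(ys, xs)  # DAG X2 -> X1
print(l_fwd, l_rev)               # the same value (up to float rounding)
```

Both factorizations profile down to the product "marginal variance × residual variance", which equals the determinant of the sample covariance in either direction, so the Gaussian score cannot orient the edge; this is why the score-based search operates on equivalence classes.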
computation:
◮ dynamic programming algorithm, up to p ≈ 20 (Silander & Myllymäki, 2006)
◮ greedy algorithms on equivalence classes
(Chickering, 2002; Hauser & PB, 2012)
statistical properties of the penalized MLE for "large" p: (van de Geer & PB, 2013); no faithfulness assumption is required to obtain the minimal-edge I-MAP!
Successes in biology
Effects of single gene knock-downs on all other genes in yeast (Maathuis, Colombo, Kalisch & PB, 2010); n = 63 observational data points

[Figure: true positives vs. false positives for IDA, Lasso, Elastic-net and random guessing]
Arabidopsis thaliana (Stekhoven, Moraes, Sveinbjörnsson, Hennig, Maathuis & PB, 2012)
response Y: days to bolting (flowering) of the plant (aim: fast-flowering plants); covariates X: gene-expression profile; observational data with n = 47 and p = 21′326
❀ lower bound estimates α̂j for the causal effect of every gene/variable on Y (using the PC-algorithm)
apply stability selection (Meinshausen & PB, 2010) ❀ assigning uncertainties via control of the per-comparison error rate: PCER = E[V]/p ≤ q² / ((2π_thr − 1) p²)
Causal gene ranking

rank  gene       summary rank  median effect  expression  error (PCER)  name
 1    AT2G45660   1            0.60            5.07       0.0017        AGL20 (SOC1)
 2    AT4G24010   2            0.61            5.69       0.0021        ATCSLG1
 3    AT1G15520   2            0.58            5.42       0.0017        PDR12
 4    AT3G02920   5            0.58            7.44       0.0024        replication protein-related
 5    AT5G43610   5            0.41            4.98       0.0101        ATSUC6
 6    AT4G00650   7            0.48            5.56       0.0020        FRI
 7    AT1G24070   8            0.57            6.13       0.0026        ATCSLA10
 8    AT1G19940   9            0.53            5.13       0.0019        AtGH9B5
 9    AT3G61170   9            0.51            5.12       0.0034        protein coding
10    AT1G32375  10            0.54            5.21       0.0031        protein coding
11    AT2G15320  10            0.50            5.57       0.0027        protein coding
12    AT2G28120  10            0.49            6.45       0.0026        protein coding
13    AT2G16510  13            0.50           10.7        0.0023        AVAP5
14    AT3G14630  13            0.48            4.87       0.0039        CYP72A9
15    AT1G11800  15            0.51            6.97       0.0028        protein coding
16    AT5G44800  16            0.32            6.55       0.0704        CHR4
17    AT3G50660  17            0.40            7.60       0.0059        DWF4
18    AT5G10140  19            0.30           10.3        0.0064        FLC
19    AT1G24110  20            0.49            4.66       0.0059        peroxidase, putative
20    AT1G27030  20            0.45           10.1        0.0059        unknown protein
- biological validation by gene knockout experiments in progress.
red: biologically known genes responsible for flowering
performed validation experiment with mutants corresponding to these top 20 - 3 = 17 genes
◮ 14 mutants easily available ❀ only test for 14 genes
◮ more than usual: mutants showed low germination or survival...
◮ 9 among the 14 mutants survived (sufficiently strongly), i.e. 9 mutants for which we have an outcome
◮ 3 among the 9 mutants (genes) showed a significant effect for Y relative to the wildtype (non-mutated plant)
❀ that is: besides the three known genes, we find three additional genes which exhibit a significant difference in terms of "time to flowering"
in short: bounds on causal effects (ˆ αj’s) based on observational data lead to interesting predictions for interventions in genomics (i.e. which genes would exhibit a large intervention effect) and these predictions have been validated using experiments
Fully identifiable cases
recap: if P = {Gaussian distributions}, the cardinality |E(D)| of the Markov equivalence class of a DAG D is often larger than 1; the same is true for P = {nonparametric distributions}
but under some additional constraints the Markov equivalence class becomes smaller, or even consists of one DAG only (i.e., identifiable)
structural equation model (SEM): Xj ← fj(X(pa(j)), εj) (j = 1, . . . , p)
e.g. additive: Xj ← ∑_{k∈pa(j)} fjk(Xk) + εj (j = 1, . . . , p)
three types of identifiable SEMs where true DAG is identifiable from observational distribution:
◮ LiNGAM (Shimizu et al., 2006): Xj ← ∑_{k∈pa(j)} Bjk Xk + εj (j = 1, . . . , p), with the εj's non-Gaussian; as with independent component analysis (ICA) ❀ identifying Gaussian components is the hardest
◮ linear Gaussian with equal error variances (Peters & PB, 2014): Xj ← ∑_{k∈pa(j)} Bjk Xk + εj (j = 1, . . . , p), Var(εj) ≡ σ² for all j
◮ nonlinear SEM with additive noise (Schölkopf and co-workers, 2008–2014; Hoyer et al., 2009; Peters et al., 2013): Xj ← fj(X(pa(j))) + εj (j = 1, . . . , p)

toy example: p = 2 with DAG X(1) = X → Y = X(2)
[Figure: scatter plots for the toy example; forward model Y = X^3 + ε with residuals ε independent of x, and reverse model X = Y^{1/3} + η with residuals η clearly dependent on y]
truth: Y = X^3 + ε, with ε independent of X; reverse model: X = Y^{1/3} + η, with η NOT independent of Y
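A rough numerical sketch of the toy example: residuals in the causal direction look independent of the regressor, while the residuals of the reverse model do not. The dependence score below (correlation of squared residuals with the regressor) is a crude stand-in for a proper independence test such as HSIC, and the forward residuals are taken as the true errors for simplicity:

```python
import math, random

random.seed(3)
n = 5_000
# the slides' toy model: Y = X^3 + eps, with X ~ Uniform(0, 3), eps ~ N(0, 1)
xs = [random.uniform(0, 3) for _ in range(n)]
eps = [random.gauss(0, 1) for _ in range(n)]
ys = [x ** 3 + e for x, e in zip(xs, eps)]

def cbrt(v):  # real cube root, also for negative arguments
    return math.copysign(abs(v) ** (1 / 3), v)

# reverse-model residuals: eta = X - Y^{1/3}
eta = [x - cbrt(y) for x, y in zip(xs, ys)]

def dep_score(resid, reg):
    """Crude dependence measure: |corr(resid^2, regressor)|.
    Near zero when the residuals are independent of the regressor."""
    r2 = [r * r for r in resid]
    mr, mg = sum(r2) / n, sum(reg) / n
    cov = sum((a - mr) * (b - mg) for a, b in zip(r2, reg)) / n
    vr = sum((a - mr) ** 2 for a in r2) / n
    vg = sum((b - mg) ** 2 for b in reg) / n
    return abs(cov) / math.sqrt(vr * vg)

print(dep_score(eps, xs))                     # close to 0: correct direction
print(dep_score(eta, [cbrt(y) for y in ys]))  # clearly positive: wrong direction
```

The asymmetry is what makes the additive-noise DAG identifiable: only in the causal direction can the data be written as a function of the parent plus an independent error.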
strategy to infer structure and model parameters (for identifiable models): order search (order among the variables) & subsequent (sparse) regression

order search: MLE for the best permutation/order
◮ candidate order/permutation π (of {1, . . . , p})
◮ regressions Xπ(j) versus Xπ(j−1), Xπ(j−2), . . . , Xπ(1) (when sparse: only some of the regressors are selected)
evaluation score for π: likelihood of all the regressions ❀ ∏_{j=1}^p p(Xπ(j) | Xπ(j−1), . . . , Xπ(1))
need a model for the regressions:
◮ functional form of regressions (e.g. additive, linear) ◮ distribution of error terms (e.g. Gaussian, nonparametric)
❀ can compute the likelihood with estimated parameters and search over the orderings/permutations
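For the equal-error-variance Gaussian case, this order search can be sketched at the population level (toy three-node chain, invented coefficients): score each permutation by the sum of its sequential residual variances; the product of these variances is det(Σ) for every order, but the sum is minimized by a true ordering.

```python
from itertools import permutations

# Equal-error-variance linear Gaussian SEM (identifiable, Peters & PB 2014):
# X1 = e1, X2 = 0.8 X1 + e2, X3 = 0.8 X2 + e3, Var(ej) = 1 (toy coefficients)
# population covariance of (X1, X2, X3):
S = [[1.0,  0.8,   0.64],
     [0.8,  1.64,  1.312],
     [0.64, 1.312, 2.0496]]
p = 3

def cond_var(j, given):
    """Var(Xj | X_given) for a Gaussian: Schur complement S_jj - s^T S_gg^{-1} s."""
    g = list(given)
    if not g:
        return S[j][j]
    k = len(g)
    A = [[S[a][b] for b in g] + [S[a][j]] for a in g]   # augmented system S_gg a = s
    for i in range(k):                                  # Gaussian elimination
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            A[r] = [x - f * y for x, y in zip(A[r], A[i])]
    a = [0.0] * k
    for i in reversed(range(k)):                        # back substitution
        a[i] = (A[i][k] - sum(A[i][r] * a[r] for r in range(i + 1, k))) / A[i][i]
    return S[j][j] - sum(ai * S[gi][j] for ai, gi in zip(a, g))

def score(pi):
    """Sum of sequential residual variances along the order pi; under equal
    error variances a true ordering minimizes this sum."""
    return sum(cond_var(pi[j], pi[:j]) for j in range(p))

best = min(permutations(range(p)), key=score)
print(best, score(best))   # the true order (0, 1, 2), with score p * sigma^2 = 3
```

In practice the population covariance is replaced by regressions on data, the p! permutations are not enumerated, and the search is restricted to a pre-estimated superset-skeleton, as described next.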
search over the orderings/permutations: there are p! permutations...
trick: preliminary neighborhood selection, aiming for an undirected graph containing the skeleton of the DAG (i.e. the Markov blanket or a superset thereof)

[Figure: undirected "superset skeleton" on 5 nodes]

❀ do the order search and estimation restricted to the superset-skeleton
◮ much reduced search space
◮ can use unpenalized MLE (for the best order/permutation restricted to the superset-skeleton)
that is: regularization in the neighborhood selection suffices!
CAM: Causal Additive Model (PB, Peters & Ernest, 2013)

Xj ← ∑_{k∈pa(j)} fjk(Xk) + εj (j = 1, . . . , p), εj ∼ N(0, σj²)

◮ the underlying DAG is identifiable from the joint distribution
◮ statistically "feasible" for estimation due to the additive functions
◮ good empirical performance

the negative log-likelihood equals (up to a constant term):

∑_{j=1}^p log(σ̂_j^π),  with (σ̂_j^π)² = n^{−1} ∑_{i=1}^n (X_{i,π(j)} − ∑_{k<j} f̂_{π(j),π(k)}(X_{i,π(k)}))²
fitting method
- 1. preliminary neighborhood selection (for a superset of the skeleton of the DAG, or of the Markov blanket): "nodewise" sparse additive regression of one variable Xj against all others {Xk; k ≠ j}; easy (using CV), works well and is established (Ravikumar et al. (2009), Meier, van de Geer & PB (2009), ...)
- 2. estimate a best order by unpenalized MLE restricted to the superset-skeleton
- 3. based on the estimated order π̂: sparse additive model fitting restricted to the superset-skeleton ❀ edge pruning

[Figure: the three steps on a 5-node example: undirected superset-skeleton, ordered graph, pruned DAG]
for step 2: greedy search restricted to superset-skeleton
– exhaustive computation is possible for trees (Bürge, MSc thesis 2014)
– greedy methods are shown to find the optimum for restricted classes of DAGs (Peters, Balakrishnan & Wainwright, in progress)
unpenalized MLE for best order search: very convenient regularization is decoupled and only used in sparse additive regression
statistical consistency: high-dimensional p ≫ n setting (PB, Peters & Ernest, 2013) main assumptions:
◮ maximal degree of DAG is O(1)
(“sparse”)
◮ sufficiently large identifiability constant (w.r.t. Kullback-Leibler divergence) between the true and any wrong ordering:

ξp := p^{−1} min_{π ∉ Π0} ( Eθ0[−log(p^π_{θ0}(X))] − Eθ0[−log(p_{θ0}(X))] ),  ξp ≫ √(log(p)/n)

◮ exponential moments for the Xj's and certain smooth functions thereof
then: P[π̂ ∈ Π0 (the set of true permutations)] → 1 (n → ∞)
◮ consistent recovery of the true functions f⁰jk and the DAG D⁰
◮ consistent estimation of E[Y | do(X = x)]
remark: preliminary neighborhood selection
◮ yields empirical and computational improvements
◮ seems to improve the theoretical results; e.g. in the linear Gaussian case ❀ less stringent conditions than for ℓ0-regularized MLE (van de Geer & PB, 2013)
Empirical results to illustrate what can be achieved with CAM p = 100, n = 200 true model is CAM (additive SEM) with Gaussian error SHD: Structural Hamming Distance SID: Structural Intervention Distance (Peters & PB, 2013)
[Figure: boxplots of SHD to the true DAG for CAM, RESIT, LiNGAM, PC, CPC, GES (p = 100), and of SID to the true DAG for CAM, RESIT, LiNGAM and the lower/upper versions of PC, CPC, GES (p = 100)]
RESIT (Mooij et al. 2009) cannot be used for p = 100
CAM method is impressive where true functions are non-monotone and nonlinear (sampled from Gaussian proc.); for monotone functions: still good but less impressive gains
Gene expressions from isoprenoid pathways in Arabidopsis thaliana (Wille et al., 2004): p = 39, n = 118; top 20 edges from CAM

[Figure: MEP pathway (chloroplast), MVA pathway (cytoplasm) and mitochondrion, with the top 20 estimated edges]

stability selection

[Figure: the same pathways with the edges surviving stability selection]

solid edges: estimated from data; stability selection: expected number of false positives ≤ 2 (Meinshausen & PB, 2010)
When knowing the order of the variables: a new approach...

Theorem (Ernest & PB, 2014)
For a general nonlinear SEM with functions fj ∈ L2(P) and a regularity condition on the joint density/distribution (for all I ⊂ {1, . . . , p}: p(x) ≥ M p(x_I) p(x_{I^c}) for some 0 < M ≤ 1):

E[Y | do(X = x)] = g(x), where g(·) is the component for X of the best additive L2-approximation of Y by X and {Xk; k ∈ S(X)}, e.g. S(X) = {k; k < jX} with jX the index corresponding to X:

g_app = argmin over additive functions of E[(Y − f_{jX}(X) − ∑_{k∈S(X)} fk(Xk))²],  g(·) = (g_app)_{jX}(·)
implication: only need to run additive model fitting, even for an "arbitrary nonlinear SEM", e.g. even if fj(X(pa(j)), εj) is very complicated...! call it ord-additive modeling/fitting

❀ very robust against model misspecification! (misspecification could be dependent εj's, i.e. due to hidden variables)
if we were to consider E[Y | do(X1 = x1, X2 = x2)] ❀ would need to fit a first-order interaction model
Swiss Air Ltd. ticket pricing and revenue (Ghielmetti et al., in progress) do not need to worry about complicated non-linearities!
estimation – in practice: additive model fitting of Y versus X, {Xk; k ∈ S(X)} (Hastie & Tibshirani (≈ 1990); Wood (2006); ...)
high-dimensional scenario when |S(X)| is large, e.g. |S(X)| ≈ 100 − 5′000 and sample size n ≈ 50: consistency and optimal convergence rate results for sparse additive model fitting if
– the problem is sparse: the maximal degree of the true DAG is small
– identifiability aka restricted eigenvalue conditions hold
(Ravikumar et al. (2009); Meier, van de Geer & PB (2009))
additive model fitting is well-established and works well, including in the high-dimensional setting
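A numpy-only sketch in the spirit of sparse additive model fitting (SpAM, Ravikumar et al., 2009): each covariate gets its own basis expansion and a group-lasso penalty selects whole variables. All sizes, bases and the penalty level are invented for illustration; a real analysis would use a dedicated implementation (e.g. mgcv in R or a SpAM package).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 30  # invented sizes

# Only variables 0 and 1 act on y, both nonlinearly.
X = rng.normal(size=(n, p))
y = X[:, 0] ** 2 + 2 * np.sin(2 * X[:, 1]) + 0.3 * rng.normal(size=n)
y = y - y.mean()

# Per-variable cubic polynomial basis: groups of 3 standardized columns.
B = np.concatenate(
    [np.column_stack([X[:, j] ** d for d in (1, 2, 3)]) for j in range(p)], axis=1
)
B = (B - B.mean(axis=0)) / B.std(axis=0)
groups = [np.arange(3 * j, 3 * j + 3) for j in range(p)]

# Group lasso via proximal gradient (ISTA): gradient step on the
# squared-error loss, then blockwise soft-thresholding per group.
lam = 0.3
L = np.linalg.norm(B, 2) ** 2 / n  # Lipschitz constant of the gradient
beta = np.zeros(B.shape[1])
for _ in range(500):
    z = beta - (B.T @ (B @ beta - y) / n) / L
    for g in groups:
        norm = np.linalg.norm(z[g])
        z[g] = 0.0 if norm <= lam / L else (1 - lam / (L * norm)) * z[g]
    beta = z

active = [j for j, g in enumerate(groups) if np.linalg.norm(beta[g]) > 1e-8]
print(active)  # should contain the truly active variables 0 and 1
```

The group penalty zeroes out entire variables at once, which is what makes the fit interpretable as variable selection in the additive model.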
MEP pathway in Arabidopsis thaliana (Wille et al., 2004): p = 14 gene expressions, n = 118
- order of variables: biological information w.r.t. up-/downstream position
[pathway diagram, chloroplast (MEP pathway): DXPS1, DXPS2, DXPS3, DXR, MCT, CMK, MECPS, HDS, HDR, IPPI1, GPPS, GGPPS (2,6,8,10,11,12), PPDS1, PPDS2]
◮ rank directed gene pairs (X causes Y) with the score
∫ (Ê[Y|do(X = x)] − Ê[Y])² dx
◮ take the top 10 scoring directed gene pairs and check their stability ❀ stability selection: E[false positives] ≤ 1 (Meinshausen & PB, 2010)
do not need to worry about complicated nonlinearities!
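The ranking score above is straightforward to approximate numerically once Ê[Y|do(X = x)] is available from an ord-additive fit. A hedged sketch (the toy pair, basis and integration grid are all invented):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Toy directed pair (invented): X upstream of Y, nonlinear effect;
# X has no parents here, so E[Y|do(X=x)] is just the regression of Y on X.
x = rng.normal(size=n)
y = np.tanh(2 * x) + 0.5 * rng.normal(size=n)

# Cubic-basis fit of the do-effect curve.
B = np.column_stack([x**d for d in range(4)])
coef, *_ = np.linalg.lstsq(B, y, rcond=None)

grid = np.linspace(-2.0, 2.0, 401)
G = np.column_stack([grid**d for d in range(4)])
do_hat = G @ coef

# Score = integral of (E-hat[Y|do(X=x)] - E-hat[Y])^2 dx, approximated
# by a Riemann sum; a larger score means a stronger intervention effect.
score = float(np.sum((do_hat - y.mean()) ** 2) * (grid[1] - grid[0]))
print(score)
```

Ranking all candidate pairs by this score, and then keeping only pairs that survive stability selection, is the pattern the slide describes.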
- ord-additive regression: a local operation
inferring E[Y|do(X = x)] when the DAG (or order) is known:
◮ ord-additive regression of Y versus X, pa(X): local
◮ integration over all directed paths from X to Y: global
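The local/global distinction can be made concrete with a toy SEM (invented for this sketch): the local route is one regression of Y on X and its predecessors; the global route re-simulates the whole mutilated system under do(X = x). When the order is known and correct, the two agree.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50000

# Known SEM (invented), order X -> M -> Y:
#   X = e1,  M = 0.8*X + e2,  Y = M^2 + 0.5*X + e3
X = rng.normal(size=n)
M = 0.8 * X + rng.normal(size=n)
Y = M**2 + 0.5 * X + rng.normal(size=n)

x0 = 1.0

# Local: one ord-additive regression of Y on X (X is a root, so S(X)
# is empty); no propagation through the graph is needed.
B = np.column_stack([np.ones(n), X, X**2])
coef, *_ = np.linalg.lstsq(B, Y, rcond=None)
local = coef @ np.array([1.0, x0, x0**2])

# Global: re-simulate all directed paths from X to Y under do(X = x0).
Xd = np.full(n, x0)
Md = 0.8 * Xd + rng.normal(size=n)
Yd = Md**2 + 0.5 * Xd + rng.normal(size=n)
glob = Yd.mean()

print(local, glob)  # both approximate E[Y|do(X=1)] = 0.64 + 1 + 0.5 = 2.14
```

The global route needs every edge function on every directed path to be right, which is why it degrades faster than the local route when the DAG is only approximately known.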
[simulation figure: squared error of Entire Path, Partial Path and Parent Adjustment versus SHD to the true DAG (percentage of correct edges); left panel: nonsparse DAG, right panel: sparse DAG]
- ord-additive regression is much more reliable and "robust" when having imperfect knowledge of the DAG or of the order of the variables, and it is computationally much faster
Conclusions
- 1. Beware of over-interpretation!
so far, based on current data, we cannot reliably infer a causal network, despite theorems... (perturbation of the data yields unstable networks)
- 2. Causal inference relies on subtle, uncheckable(!) assumptions ❀ experimental validations are important (simple organisms in biology are great for pursuing this!); statistical (and other) inference is often not confirmatory
- 3. many technical issues in identifiability, high-dimensional
statistical inference and optimization
- 4. but there is potential:
for stable ranking/prediction of intervention/causal effects ... “causal inference from purely observed data could have practical value in the prioritization and design of perturbation experiments”
Editorial in Nature Methods (April 2010)
this can be very useful in computational biology and in this sense: “causal inference from observational data is much further developed than 30 years ago when it was thought to be impossible”
Thank you!
R-package: pcalg (Kalisch, Mächler, Colombo, Maathuis & PB, 2012)
References:
◮ Ernest, J. and Bühlmann, P. (2014). On the role of additive regression for (high-dimensional) causal inference. Preprint arXiv:1405.1868.
◮ Bühlmann, P., Peters, J. and Ernest, J. (2013). CAM: Causal Additive Models, high-dimensional order search and penalized regression. Preprint arXiv:1310.1533.
◮ Peters, J. and Bühlmann, P. (2014). Identifiability of Gaussian structural equation models with equal error variances. Biometrika 101, 219-228.
◮ Uhler, C., Raskutti, G., Bühlmann, P. and Yu, B. (2013). Geometry of faithfulness assumption in causal inference. Annals of Statistics 41, 436-463.
◮ van de Geer, S. and Bühlmann, P. (2013). ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. Annals of Statistics 41, 536-567.
◮ Kalisch, M., Mächler, M., Colombo, D., Maathuis, M.H. and Bühlmann, P. (2012). Causal inference using graphical models with the R package pcalg. Journal of Statistical Software 47 (11), 1-26.
◮ Stekhoven, D.J., Moraes, I., Sveinbjörnsson, G., Hennig, L., Maathuis, M.H. and Bühlmann, P. (2011). Causal stability ranking. Bioinformatics 28, 2819-2823.
◮ Hauser, A. and Bühlmann, P. (2012). Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research 13, 2409-2464.
◮ Maathuis, M.H., Colombo, D., Kalisch, M. and Bühlmann, P. (2010). Predicting causal effects in large-scale systems from observational data. Nature Methods 7, 247-248.
◮ Maathuis, M.H., Kalisch, M. and Bühlmann, P. (2009). Estimating high-dimensional intervention effects from observational data. Annals of Statistics 37, 3133-3164.