High-dimensional causal inference, graphical modeling and structural equation models
Peter Bühlmann, Seminar für Statistik, ETH Zürich

cannot do confirmatory causal inference without randomized intervention experiments... but we can do better than proceeding naively
Goal
in genomics: if we made an intervention at a single gene, what would be its effect on a phenotype of interest? we want to infer/predict such effects without actually doing the intervention, i.e. from observational data (observations of a "steady-state system"); it doesn't need to be genes, and it generalizes to interventions at more than one variable/gene
Policy making James Heckman: Nobel Prize Economics 2000 e.g.: “Pritzker Consortium on Early Childhood Development identifies when and how child intervention programs can be most influential”
Genomics
- 1. Flowering of Arabidopsis Thaliana
phenotype/response variable of interest: Y = days to bolting (flowering); "covariates": X = gene expressions from p = 21′326 genes
remark: "gene expression" is the process by which information from a gene is used in the synthesis of a functional gene product (e.g. a protein)
question: infer/predict the effect of knocking out/knocking down (or enhancing) a single gene (expression) on the phenotype/response variable Y
- 2. Gene expressions of yeast
p = 5360 genes; phenotype of interest: Y = expression of the first gene, "covariates" X = gene expressions from all other genes; then phenotype of interest: Y = expression of the second gene, "covariates" X = gene expressions from all other genes; and so on
❀ infer/predict the effects of a single gene knock-down on all other genes
❀ consider the framework of an intervention effect = causal effect (mathematically defined ❀ see later)
Regression – the "statistical workhorse": the wrong approach
we could use a linear model (fitted from n observational data points):

Y = ∑_{j=1}^p βj X(j) + ε,  Var(X(j)) ≡ 1 for all j
|βj| measures the effect of variable X(j) in terms of "association", i.e. the change of Y as a function of X(j) when keeping all other variables X(k) fixed
❀ not very realistic for the intervention problem: if we change e.g. one gene, some others will change as well, and these others are not (cannot be) kept fixed
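The gap between a regression coefficient and an intervention effect can be seen in a tiny simulation (a hypothetical three-variable linear SEM with invented coefficients, not the gene data): the regression coefficient of X1 holds X2 fixed, while intervening on X1 also changes X2 downstream.

```python
import random

random.seed(0)
n = 50_000

# Hypothetical linear SEM: X1 <- e1, X2 <- 0.7*X1 + e2, Y <- 0.5*X1 + 0.6*X2 + eY
def draw(x1=None):
    e1 = random.gauss(0, 1)
    x1 = e1 if x1 is None else x1        # do(X1 = x1) replaces the assignment
    x2 = 0.7 * x1 + random.gauss(0, 1)
    y = 0.5 * x1 + 0.6 * x2 + random.gauss(0, 1)
    return x1, x2, y

data = [draw() for _ in range(n)]

# OLS of Y on (X1, X2): solve the 2x2 normal equations by hand (zero-mean data)
s11 = sum(x1 * x1 for x1, _, _ in data)
s12 = sum(x1 * x2 for x1, x2, _ in data)
s22 = sum(x2 * x2 for _, x2, _ in data)
t1 = sum(x1 * y for x1, _, y in data)
t2 = sum(x2 * y for _, x2, y in data)
det = s11 * s22 - s12 * s12
b1 = (s22 * t1 - s12 * t2) / det         # regression coefficient of X1

# intervention effect: E[Y | do(X1=1)] - E[Y | do(X1=0)], by actually intervening
y1 = sum(draw(x1=1.0)[2] for _ in range(n)) / n
y0 = sum(draw(x1=0.0)[2] for _ in range(n)) / n
causal = y1 - y0

print(f"regression coefficient of X1: {b1:.2f}")    # ~0.50 (direct effect only)
print(f"intervention effect of X1  : {causal:.2f}") # ~0.92 = 0.5 + 0.7*0.6
```

The regression coefficient only captures the direct effect 0.5; the intervention effect also includes the pathway through X2.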
and indeed:
[Figure: true positives vs. false positives for IDA, Lasso, Elastic-net and random guessing]
❀ can do much better than (penalized) regression!
Effects of single gene knock-downs on all other genes (yeast) (Maathuis, Colombo, Kalisch & PB, 2010)
- p = 5360 genes (expression of genes)
- 231 gene knock downs ❀ 1.2 · 106 intervention effects
- the truth is “known in good approximation”
(thanks to intervention experiments)
goal: prediction of the true large intervention effects based on observational data with no knock-downs, n = 63 observational data points
[Figure: true positives vs. false positives for IDA, Lasso, Elastic-net and random guessing]
A bit more specifically
◮ univariate response Y ◮ p-dimensional covariate X
question: what is the effect of setting the jth component of X to a certain value x: do(X (j) = x) ❀ this is a question of intervention type not the effect of X (j) on Y when keeping all other variables fixed (regression effect)
Reichenbach, 1956; Suppes, 1970; Rubin, Dawid, Holland, Pearl, Glymour, Scheines, Spirtes,...
Intervention calculus (a review)
"dynamic" notion of an effect: if we set a variable X(j) to a value x (intervention) ❀ some other variables X(k) (k ≠ j) and maybe Y will change
we want to quantify the "total" effect of X(j) on Y, including the effects of "all changed" X(k) on Y
❀ a graph or influence diagram will be very useful
[Figure: DAG on the nodes X(1), X(2), X(3), X(4) and Y]
for simplicity: just consider DAGs (Directed Acyclic Graphs)
[with hidden variables (Spirtes, Glymour & Scheines (1993); Colombo et al. (2012)): much more complicated and not validated with real data]
random variables are represented as nodes in the DAG
assume a Markov condition, saying that X(j), given its parents X(pa(j)), is conditionally independent of its non-descendant variables
❀ recursive factorization of the joint distribution:

P(Y, X(1), . . . , X(p)) = P(Y | X(pa(Y))) ∏_{j=1}^p P(X(j) | X(pa(j)))

for intervention calculus: use the truncated factorization (e.g. Pearl)
assume the Markov property for the causal DAG

non-intervention DAG on X(1), X(2), X(3), X(4), Y:
P(Y, X(1), X(2), X(3), X(4)) = P(Y | X(1), X(3)) P(X(1) | X(2)) P(X(2) | X(3), X(4)) P(X(3)) P(X(4))

intervention do(X(2) = x): delete the edges into X(2) and use the truncated factorization:
P(Y, X(1), X(3), X(4) | do(X(2) = x)) = P(Y | X(1), X(3)) P(X(1) | X(2) = x) P(X(3)) P(X(4))
P(Y | do(X(2) = x)) = ∫ P(Y, X(1), X(3), X(4) | do(X(2) = x)) dX(1) dX(3) dX(4)
the truncated factorization is a mathematical consequence of the Markov condition (with respect to the causal DAG) for the observational probability distribution P (plus the assumption that the structural equations are "autonomous")
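The truncated factorization can be checked numerically on a binary version of this five-variable DAG; the conditional probability tables below are invented for illustration:

```python
from itertools import product

# Binary version of the slides' DAG: X3 -> X2 <- X4, X2 -> X1, X1 -> Y <- X3,
# with P(X3=1) = P(X4=1) = 0.5 and invented conditional probability tables.
def p2(x3, x4): return 0.2 + 0.5 * x3 + 0.2 * x4   # P(X2 = 1 | x3, x4)
def p1(x2):     return 0.9 if x2 == 1 else 0.1     # P(X1 = 1 | x2)
def py(x1, x3): return 0.1 + 0.6 * x1 + 0.2 * x3   # P(Y  = 1 | x1, x3)

def bern(p, v):  # P(V = v) for V ~ Bernoulli(p)
    return p if v == 1 else 1 - p

def p_y_do_x2(x2):
    """P(Y=1 | do(X2 = x2)) via the truncated factorization:
    drop the factor P(X2 | X3, X4) and plug X2 = x2 into the rest."""
    return sum(py(x1, x3) * bern(p1(x2), x1) * bern(0.5, x3) * bern(0.5, x4)
               for x1, x3, x4 in product([0, 1], repeat=3))

def p_y_given_x2(x2):
    """P(Y=1 | X2 = x2) by ordinary conditioning in the full factorization."""
    num = den = 0.0
    for x1, x3, x4 in product([0, 1], repeat=3):
        w = bern(p1(x2), x1) * bern(p2(x3, x4), x2) * bern(0.5, x3) * bern(0.5, x4)
        num += py(x1, x3) * w
        den += w
    return num / den

print(p_y_do_x2(1))     # 0.74 (up to float rounding)
print(p_y_given_x2(1))  # ~0.785: conditioning is not intervening
```

With these numbers P(Y=1 | do(X2=1)) = 0.74 while P(Y=1 | X2=1) ≈ 0.785; the gap is exactly the confounding through the common cause X3.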
the intervention distribution P(Y|do(X (2) = x)) can be calculated from
◮ observational data distribution
❀ need to estimate conditional distributions
◮ an influence diagram (causal DAG)
❀ need to estimate the structure of a graph/influence diagram

intervention effect: E[Y | do(X(2) = x)] = ∫ y P(y | do(X(2) = x)) dy
intervention effect at x0: (∂/∂x) E[Y | do(X(2) = x)] |_{x=x0}
in the Gaussian case (Y, X(1), . . . , X(p)) ∼ N_{p+1}(µ, Σ): (∂/∂x) E[Y | do(X(2) = x)] ≡ θ2 for all x
The backdoor criterion (Pearl, 1993)
we only need to know the local parental set: for Z = X(pa(X)), if Y ∉ pa(X):

P(Y | do(X = x)) = ∫ P(Y | X = x, Z = z) dP(z)

the parental set might not be the minimal adjustment set, but it always suffices
this is a consequence of the global Markov property: X_A independent of X_B given X_S whenever A and B are d-separated by S
Gaussian case
for the Gaussian case: (∂/∂x) E[Y | do(X(j) = x)] ≡ θj for all x
for Y ∉ pa(j): θj is the regression parameter in

Y = θj X(j) + ∑_{k∈pa(j)} θk X(k) + error

❀ only need the parental set and regression

[Figure: DAG on X(1), . . . , X(4), Y; e.g. j = 2, pa(2) = {3, 4}]
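A minimal simulation sketch of this Gaussian parental adjustment (all coefficients invented): regressing Y on X(2) together with its parents X(3), X(4) recovers the intervention effect θ2, while the unadjusted regression is confounded.

```python
import random

random.seed(1)
n = 20_000

# Linear Gaussian SEM mimicking the slides' DAG (invented coefficients):
# X3, X4 source nodes; X2 <- 0.8*X3 + 0.5*X4; X1 <- 0.7*X2; Y <- 0.6*X1 + 0.9*X3
rows = []
for _ in range(n):
    x3 = random.gauss(0, 1); x4 = random.gauss(0, 1)
    x2 = 0.8 * x3 + 0.5 * x4 + random.gauss(0, 1)
    x1 = 0.7 * x2 + random.gauss(0, 1)
    y = 0.6 * x1 + 0.9 * x3 + random.gauss(0, 1)
    rows.append((x2, x3, x4, y))

def ols(rows):
    """Least squares of the last column on the others (zero-mean data),
    via the normal equations and Gaussian elimination."""
    k = len(rows[0]) - 1
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * r[k] for r in rows) for i in range(k)]
    for i in range(k):                       # forward elimination
        for j in range(i + 1, k):
            f = A[j][i] / A[i][i]
            A[j] = [aj - f * ai for aj, ai in zip(A[j], A[i])]
            b[j] -= f * b[i]
    beta = [0.0] * k
    for i in reversed(range(k)):             # back substitution
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    return beta

# adjust for pa(2) = {X3, X4}: the X2-coefficient equals the intervention effect
theta2 = ols(rows)[0]
# no adjustment: the marginal regression is biased by the common cause X3
naive = ols([(r[0], r[3]) for r in rows])[0]

print(f"adjusted for parents: {theta2:.2f}")  # ~0.42 = 0.7 * 0.6
print(f"marginal regression : {naive:.2f}")   # noticeably larger (confounded)
```

The adjusted coefficient matches the total effect θ2 = 0.7 · 0.6 = 0.42 transmitted through X(1); the marginal coefficient is inflated by the backdoor path through X(3).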
when having no unmeasured confounder (variable): intervention effect (as defined) = causal effect recap: causal effect = effect from a randomized trial (but we want to infer it without a randomized study... because often we cannot do it, or it is too expensive) structural equation models provide a different (but closely related) route for quantifying intervention effects
Inferring intervention effects from observational distribution
main problem: inferring DAG from observational data impossible! can only infer equivalence class of DAGs (several DAGs can encode exactly the same conditional independence relationships) Example:
X → Y ("X causes Y")   vs.   X ← Y ("Y causes X")

a lot of work about identifiability:
Verma & Pearl (1991); Spirtes, Glymour & Scheines (1993); Tian & Pearl (2000–2002); Lauritzen & Richardson (2002); Shpitser & Pearl (2006–2011); vanderWeele & Robins (2007–2011); Drton, Foygel & Sullivant (2011);...
Markov equivalence class of DAGs P a family of probability distributions on Rp+1 Definition: M(D) = {P ∈ P; P is Markov w.r.t. DAG D} D ∼ D′ ⇔ M(D) = M(D′) the definition depends on the family P: typical models:
- P = Gaussian distributions
- P = nonparametric model/set of all distributions
- P = set of distributions from
additive structural equation model (see later)
A graphical characterization (Frydenberg, 1990; Verma & Pearl, 1991): for P = {Gaussian distributions} or P = {nonparametric distributions} it holds that D ∼ D′ ⇔ D and D′ have the same skeleton and the same v-structures
for a DAG D, we write its corresponding Markov equivalence class E(D) = {D′; D′ ∼ D}
Equivalence class of DAGs
- Several DAGs can encode exactly the same conditional
independence relationships. Such DAGs form an equivalence class.
- Example: unshielded triple
for the skeleton X1 − X2 − X3:

  X1 → X2 → X3 :  X1 ⊥ X3 false,  X1 ⊥ X3 | X2 true   (no v-structure)
  X1 ← X2 ← X3 :  X1 ⊥ X3 false,  X1 ⊥ X3 | X2 true   (no v-structure)
  X1 ← X2 → X3 :  X1 ⊥ X3 false,  X1 ⊥ X3 | X2 true   (no v-structure)
  X1 → X2 ← X3 :  X1 ⊥ X3 true,   X1 ⊥ X3 | X2 false  (v-structure)
- All DAGs in an equivalence class have the same skeleton and the same v-structures
- An equivalence class can be uniquely represented by a completed partially directed acyclic graph (CPDAG)

[Figure: a CPDAG and the four DAGs in its equivalence class]
we cannot estimate causal/intervention effects from the observational distribution, but we will be able to estimate lower bounds of causal effects
conceptual "procedure":
◮ probability distribution P from a DAG, generating the data
❀ true underlying equivalence class of DAGs (CPDAG)
◮ find all DAG-members of true equivalence class (CPDAG):
D1, . . . , Dm
◮ for every DAG-member Dr, and every variable X (j):
single intervention effect θr,j; summarize them by Θ = {θr,j; r = 1, . . . , m; j = 1, . . . , p}
❀ an identifiable parameter
IDA (oracle version)
[Diagram: oracle CPDAG (via the PC-algorithm) → DAG 1, DAG 2, . . . , DAG m → do-calculus → effect 1, effect 2, . . . , effect m → multi-set Θ]
If you want a single number for every variable ... instead of the multi-set Θ = {θr,j; r = 1, . . . , m; j = 1, . . . , p}:

minimal absolute value αj = min_r |θr,j| (j = 1, . . . , p), with |θtrue,j| ≥ αj

❀ the minimal absolute effect αj is a lower bound for the true absolute intervention effect
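Computing these lower bounds from the multi-set is a one-liner; here with an invented Θ for m = 3 DAG members and p = 4 variables:

```python
# Hypothetical multi-set of effects theta[r][j] for m = 3 DAG members and
# p = 4 variables (numbers invented for illustration):
theta = [
    [0.0, 0.9, -0.3, 0.5],   # effects computed in DAG D1
    [0.4, 0.9,  0.0, 0.5],   # effects computed in DAG D2
    [0.4, 1.1, -0.2, 0.5],   # effects computed in DAG D3
]

# alpha_j = min_r |theta_{r,j}| is a valid lower bound for |theta_true,j|,
# because the true DAG is one of the members of the equivalence class.
alpha = [min(abs(row[j]) for row in theta) for j in range(len(theta[0]))]
print(alpha)  # [0.0, 0.9, 0.0, 0.5]
```

Note that a variable with a large effect in some DAG members can still get the uninformative bound 0 (variables 1 and 3 here); only effects that are large in every member of the class are detected.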
Computationally tractable algorithm
searching all DAGs is computationally infeasible if p is large (we actually can do this up to p ≈ 15 − 20) instead of finding all m DAGs within an equivalence class ❀ compute all intervention effects without finding all DAGs (Maathuis, Kalisch & PB, 2009) key idea: exploring local aspects of the graph is sufficient
[Diagram: data → CPDAG (via the PC-algorithm) → do-calculus → effect 1, effect 2, . . . , effect q → multi-set ΘL]
the local ΘL = Θ up to multiplicities (Maathuis, Kalisch & PB, 2009)
Estimation from finite samples
notation: drop the Y-notation (Y = X(1), X(2), . . . , X(p))
difficult part: estimation of the CPDAG (equivalence class of DAGs) ❀ "structural learning": P ⇒ CPDAG (equiv. class of DAGs)

[Figure: estimated graph from pcAlgo(dm = d, alpha = 0.05) on 10 nodes]
two main approaches:
◮ multiple testing of conditional dependencies:
PC-algorithm as prime example
◮ score-based methods: MLE as prime example
Faithfulness assumption (necessary for conditional-dependence testing approaches): a distribution P is called faithful to a DAG G if all conditional independencies of P can be inferred from the graph (the Markov assumption lets us infer some conditional independencies from the graph; faithfulness requires that these are all of them)
assuming faithfulness ❀ can infer the CPDAG from a list of conditional (in-)dependence relations
What does it mean?

[Figure: DAG 1 → 2, 1 → 3, 2 → 3]

X(1) ← ε(1), X(2) ← α X(1) + ε(2), X(3) ← β X(1) + γ X(2) + ε(3), with ε(1), ε(2), ε(3) i.i.d. ∼ N(0, 1)
enforce marginal independence of X(1) and X(3): β + αγ = 0, e.g. α = β = 1, γ = −1

Σ = ( 1  1  0 ;  1  2  −1 ;  0  −1  2 ),   Σ^{−1} = ( 3  −2  −1 ;  −2  2  1 ;  −1  1  1 )

failure of faithfulness due to cancellation of coefficients
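The cancellation can be verified numerically from the structural coefficients (a sketch in the slides' Gaussian setting):

```python
# With alpha = beta = 1, gamma = -1 the SEM is
# X1 <- e1, X2 <- X1 + e2, X3 <- X1 - X2 + e3, with unit-variance errors.
# In matrix form X = B X + e, so X = (I - B)^{-1} e and Sigma = A A^T, A = (I - B)^{-1}.
B = [[0,  0, 0],
     [1,  0, 0],
     [1, -1, 0]]   # row j holds the parent coefficients of X_{j+1}

# (I - B) is lower triangular with unit diagonal -> invert by forward substitution:
p = 3
A = [[0.0] * p for _ in range(p)]
for j in range(p):     # solve (I - B) a = e_j, column by column
    for i in range(p):
        A[i][j] = (1.0 if i == j else 0.0) + sum(B[i][k] * A[k][j] for k in range(i))

Sigma = [[sum(A[i][k] * A[j][k] for k in range(p)) for j in range(p)]
         for i in range(p)]
print(Sigma)  # [[1, 1, 0], [1, 2, -1], [0, -1, 2]]
```

Cov(X1, X3) = 0 even though the edge 1 → 3 is present: the conditional-independence pattern lies about the graph, which is exactly the failure of faithfulness.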
failure of exact faithfulness is "rare" (Lebesgue measure zero)
but for statistical estimation (in the Gaussian case) one "often" requires strong faithfulness (Robins, Scheines, Spirtes & Wasserman, 2003):

min{ |ρ(i, j | S)| ; ρ(i, j | S) ≠ 0, i ≠ j, |S| ≤ d } ≥ τ,  τ ≍ √(d log(p)/n)

(d is the maximal degree of the skeleton of the DAG)
... strong faithfulness can be rather severe (Uhler, Raskutti, PB & Yu, 2013) 3 nodes, full graph unfaithful distributions due to exact cancellation imagine a strip around it ❀ large volume! ⇒ strong faithfulness is restrictive in high dimensions
The PC-algorithm (Spirtes & Glymour, 1991)
◮ crucial assumption:
distribution P is faithful to the true underlying DAG
◮ less crucial but convenient:
Gaussian assumption for Y, X (1), . . . , X (p) ❀ can work with partial correlations
◮ input: Σ̂_MLE, but we only need to consider many small sub-matrices of it (assuming sparsity of the graph)
◮ output: an estimated CPDAG, based on a clever data-dependent (random) sequence of multiple tests
PC-algorithm: a rough outline for estimating the skeleton of the underlying DAG
- 1. start with the full graph
- 2. correlation screening: remove edge i − j if Cor(X(i), X(j)) is small (Fisher's Z-transform and null-distribution of zero correlation)
- 3. partial correlations of order 1: remove edge i − j if Parcor(X(i), X(j) | X(k)) is small for some k in the current neighborhood of i or j (thanks to faithfulness)
- 4. move up to partial correlations of order 2: remove edge i − j if Parcor(X(i), X(j) | X(k), X(ℓ)) is small for some k, ℓ in the current neighborhood of i or j (thanks to faithfulness)
- 5. continue until removal of edges is not possible anymore, i.e. stop at the minimal order of partial correlation where edge-removal becomes impossible

[Figure: full graph → correlation screening → partial correlations of order 1 → stopped]
an additional step of the algorithm is needed for estimating directions; this yields an estimate of the CPDAG (equivalence class of DAGs)
one tuning parameter: the cut-off parameter α for truncation of the estimated Z-transformed partial correlations
if the graph is "sparse" (few neighbors) ❀ only few iterations, and only low-order partial correlations play a role; thus the estimation algorithm works for p ≫ n problems
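A population ("oracle") sketch of steps 1–3 for a three-node chain (toy coefficients; a real implementation such as the pcalg package tests sample partial correlations with Fisher's Z instead of an exact-zero threshold):

```python
import math

# Population covariance of the chain X1 -> X2 -> X3 (toy coefficients):
# X1 = e1, X2 = 0.8 X1 + e2, X3 = 0.8 X2 + e3, Var(ej) = 1
Sigma = [[1.0,  0.8,   0.64],
         [0.8,  1.64,  1.312],
         [0.64, 1.312, 2.0496]]
p = 3

def corr(i, j):
    return Sigma[i][j] / math.sqrt(Sigma[i][i] * Sigma[j][j])

def parcorr(i, j, k):
    """Order-1 partial correlation via the recursion used by the PC-algorithm."""
    r_ij, r_ik, r_jk = corr(i, j), corr(i, k), corr(j, k)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik**2) * (1 - r_jk**2))

EPS = 1e-9  # population version: test "exactly zero"; with data use Fisher's Z
edges = {(i, j) for i in range(p) for j in range(i + 1, p)}    # 1. full graph
edges = {e for e in edges if abs(corr(*e)) > EPS}              # 2. corr. screening
edges = {(i, j) for (i, j) in edges                            # 3. order-1 parcorr
         if all(abs(parcorr(i, j, k)) > EPS
                for k in range(p) if k not in (i, j))}
print(sorted(edges))  # [(0, 1), (1, 2)]
```

All three marginal correlations are nonzero, so correlation screening removes nothing; the spurious edge X1 − X3 disappears only at order 1, since ρ(1, 3 | 2) = 0, which is exactly the staged behavior of the algorithm.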
Computational complexity: crudely bounded, polynomial in p; a sparser underlying structure ❀ faster algorithm
we can easily do the computations for sparse cases with p ≈ 10^4 (≈ 1–2 hrs CPU time)

[Figure: log10 processor time vs. log10(p), for expected neighborhood sizes E[N] = 2 and E[N] = 8]
IDA (Intervention calculus when DAG is Absent) (Maathuis, Colombo, Kalisch & PB, 2010)
- 1. PC-algorithm ❀ estimated CPDAG
- 2. local algorithm ❀ Θ̂_local
- 3. lower bounds for absolute causal effects ❀ α̂j

R-package: pcalg
this is what we used in the yeast example to score the importance of the genes according to the size of α̂j
Statistical theory (Kalisch & PB, 2007; Maathuis, Kalisch & PB, 2009) n i.i.d. observational data points; p variables high-dimensional setting where p ≫ n assumptions:
◮ X(1), . . . , X(p) ∼ Np(0, Σ), Markov and faithful to the true DAG
◮ high-dimensionality: log(p) ≪ n
◮ sparsity: maximal degree d = maxj |ne(j)| ≪ n
◮ "coherence": maximal (partial) correlations bounded away from one: max{ |ρ(i, j | S)| ; i ≠ j, |S| ≤ d } ≤ C < 1
◮ signal strength/strong faithfulness: min{ |ρ(i, j | S)| ; ρ(i, j | S) ≠ 0, i ≠ j, |S| ≤ d } ≫ √(d log(p)/n)

Then, for suitable tuning parameters and 0 < δ < 1:
P[estimated CPDAG = true CPDAG] = 1 − O(exp(−C n^{1−δ}))
P[Θ̂L = Θ as sets] = 1 − O(exp(−C n^{1−δ}))
(i.e. consistency of the lower bounds for causal effects)
Main strategy of a proof we have to analyze an algorithm... ❀
◮ understand the population version:
in particular: can show that algorithm stops in d or d − 1 steps
◮ analysis of the noisy version: additional errors
use a union bound argument to control the overall error (seems very rough but leads to asymptotically “optimal” regime, e.g., for signal strength)
The role of “sparsity” in causal inference as usual: sparsity is necessary for accurate estimation in presence of noise but here: “sparsity” (so-called protectedness) is crucial for identifiability as well
X → Y vs. X ← Y: cannot tell from observational data the direction of the arrow
the same situation arises with a full graph on more than 2 nodes ❀ causal identification really needs "sparsity"; the sparser the graph, the tighter the bounds for causal effects
Maximum likelihood estimation
a Gaussian DAG model is a Gaussian linear structural equation model:

[Figure: DAG 1 → 2, 1 → 3, 2 → 3]

X(1) ← ε(1); X(2) ← β21 X(1) + ε(2); X(3) ← β31 X(1) + β32 X(2) + ε(3)
in general: X(j) ← ∑_{k=1}^p βjk X(k) + ε(j) (j = 1, . . . , p), with βjk ≠ 0 ⇔ edge k → j
in matrix notation: X = B X + ε, ε ∼ Np(0, diag(σ1², . . . , σp²))
❀ reparameterization
(Σ̂, D̂) = argmin_{Σ; D a DAG} −ℓ(Σ, D; data) + λ |D|
        = argmin_{B, {σj²; j}} −ℓ(B, {σj²; j}; data) + λ ‖B‖0,  ‖B‖0 = ∑_{i,j} I(Bij ≠ 0)

under the non-convex constraint that B corresponds to "no directed cycles"
a severe non-convex problem due to the "no directed cycle" constraint (the ‖·‖0-penalty rather than e.g. ‖·‖1 doesn't make the problem much harder)
toy example: X(1) ← β1 X(2) + ε1, X(2) ← β2 X(1) + ε2

[Figure: the (β1, β2) parameter space; acyclicity forces β1 β2 = 0, i.e. the union of the two axes]

non-convex parameter space! (no straightforward way to do a convex relaxation)
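The two-node non-identifiability can also be seen directly in the score: both orientations attain the same maximized Gaussian likelihood (a sketch with invented coefficients):

```python
import math, random

random.seed(2)
n = 5_000
# data from X1 -> X2 (invented coefficient): X1 = e1, X2 = 0.8 X1 + e2
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [0.8 * x + random.gauss(0, 1) for x in xs]

def gauss_dag_loglik(src, dst):
    """Maximized Gaussian log-likelihood of the DAG src -> dst:
    src is modelled marginally, dst by linear regression on src (no intercept,
    zero-mean data); up to an additive constant common to both DAGs."""
    v_src = sum(s * s for s in src) / n
    b = sum(s * d for s, d in zip(src, dst)) / sum(s * s for s in src)
    v_res = sum((d - b * s) ** 2 for s, d in zip(src, dst)) / n
    return -0.5 * n * (math.log(v_src) + math.log(v_res))

l_fwd = gauss_dag_loglik(xs, ys)  # DAG X1 -> X2
l_rev = gauss_dag_loglik(ys, xs)  # DAG X2 -> X1
print(l_fwd, l_rev)               # the same value (up to float rounding)
```

Both factorizations profile down to the product "marginal variance × residual variance", which equals the determinant of the sample covariance in either direction, so the Gaussian score cannot orient the edge; this is why the score-based search operates on equivalence classes.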
computation:
◮ dynamic programming algorithm, up to p ≈ 20 (Silander & Myllymäki, 2006)
◮ greedy algorithms on equivalence classes
(Chickering, 2002; Hauser & PB, 2012)
statistical properties of the penalized MLE for "large" p: (van de Geer & PB, 2013); no faithfulness assumption is required to obtain the minimal-edge I-MAP!
Successes in biology
Effects of single gene knock-downs on all other genes in yeast (Maathuis, Colombo, Kalisch & PB, 2010); n = 63 observational data points

[Figure: true positives vs. false positives for IDA, Lasso, Elastic-net and random guessing]
Arabidopsis thaliana (Stekhoven, Moraes, Sveinbjörnsson, Hennig, Maathuis & PB, 2012)
response Y: days to bolting (flowering) of the plant (aim: fast-flowering plants); covariates X: gene-expression profile; observational data with n = 47 and p = 21′326
❀ lower bound estimates α̂j for the causal effect of every gene/variable on Y (using the PC-algorithm)
apply stability selection (Meinshausen & PB, 2010) ❀ assigning uncertainties via control of the per-comparison error rate: PCER = E[V]/p ≤ q² / ((2π_thr − 1) p²)
Causal gene ranking

rank  gene       summary rank  median effect  expression  error (PCER)  name
 1    AT2G45660   1            0.60            5.07       0.0017        AGL20 (SOC1)
 2    AT4G24010   2            0.61            5.69       0.0021        ATCSLG1
 3    AT1G15520   2            0.58            5.42       0.0017        PDR12
 4    AT3G02920   5            0.58            7.44       0.0024        replication protein-related
 5    AT5G43610   5            0.41            4.98       0.0101        ATSUC6
 6    AT4G00650   7            0.48            5.56       0.0020        FRI
 7    AT1G24070   8            0.57            6.13       0.0026        ATCSLA10
 8    AT1G19940   9            0.53            5.13       0.0019        AtGH9B5
 9    AT3G61170   9            0.51            5.12       0.0034        protein coding
10    AT1G32375  10            0.54            5.21       0.0031        protein coding
11    AT2G15320  10            0.50            5.57       0.0027        protein coding
12    AT2G28120  10            0.49            6.45       0.0026        protein coding
13    AT2G16510  13            0.50           10.7        0.0023        AVAP5
14    AT3G14630  13            0.48            4.87       0.0039        CYP72A9
15    AT1G11800  15            0.51            6.97       0.0028        protein coding
16    AT5G44800  16            0.32            6.55       0.0704        CHR4
17    AT3G50660  17            0.40            7.60       0.0059        DWF4
18    AT5G10140  19            0.30           10.3        0.0064        FLC
19    AT1G24110  20            0.49            4.66       0.0059        peroxidase, putative
20    AT1G27030  20            0.45           10.1        0.0059        unknown protein
- biological validation by gene knockout experiments in progress.
red: biologically known genes responsible for flowering
performed validation experiment with mutants corresponding to these top 20 - 3 = 17 genes
◮ 14 mutants easily available ❀ only test for 14 genes
◮ more than usual: mutants showed low germination or survival...
◮ 9 among the 14 mutants survived (sufficiently strongly), i.e. 9 mutants for which we have an outcome
◮ 3 among the 9 mutants (genes) showed a significant effect for Y relative to the wildtype (non-mutated plant)
❀ that is: besides the three known genes, we find three additional genes which exhibit a significant difference in terms of "time to flowering"
in short: bounds on causal effects (ˆ αj’s) based on observational data lead to interesting predictions for interventions in genomics (i.e. which genes would exhibit a large intervention effect) and these predictions have been validated using experiments
Fully identifiable cases
recap: if P = {Gaussian distributions}, the cardinality |E(D)| of the Markov equivalence class of a DAG D is often larger than 1; the same is true for P = {nonparametric distributions}
but under some additional constraints the Markov equivalence class becomes smaller, or even consists of one DAG only (i.e., identifiable)
structural equation model (SEM): Xj ← fj(X(pa(j)), εj) (j = 1, . . . , p)
e.g. additive: Xj ← ∑_{k∈pa(j)} fjk(Xk) + εj (j = 1, . . . , p)
three types of identifiable SEMs where true DAG is identifiable from observational distribution:
◮ LiNGAM (Shimizu et al., 2006): Xj ← ∑_{k∈pa(j)} Bjk Xk + εj (j = 1, . . . , p), with the εj's non-Gaussian; as with independent component analysis (ICA) ❀ identifying Gaussian components is the hardest
◮ linear Gaussian with equal error variances (Peters & PB, 2014): Xj ← ∑_{k∈pa(j)} Bjk Xk + εj (j = 1, . . . , p), Var(εj) ≡ σ² for all j
◮ nonlinear SEM with additive noise (Schölkopf and co-workers, 2008–2014; Hoyer et al., 2009; Peters et al., 2013): Xj ← fj(X(pa(j))) + εj (j = 1, . . . , p)

toy example: p = 2 with DAG X(1) = X → Y = X(2)
[Figure: scatter plots for the toy example; forward model Y = X^3 + ε with residuals ε independent of x, and reverse model X = Y^{1/3} + η with residuals η clearly dependent on y]
truth: Y = X^3 + ε, with ε independent of X; reverse model: X = Y^{1/3} + η, with η NOT independent of Y
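A rough numerical sketch of the toy example: residuals in the causal direction look independent of the regressor, while the residuals of the reverse model do not. The dependence score below (correlation of squared residuals with the regressor) is a crude stand-in for a proper independence test such as HSIC, and the forward residuals are taken as the true errors for simplicity:

```python
import math, random

random.seed(3)
n = 5_000
# the slides' toy model: Y = X^3 + eps, with X ~ Uniform(0, 3), eps ~ N(0, 1)
xs = [random.uniform(0, 3) for _ in range(n)]
eps = [random.gauss(0, 1) for _ in range(n)]
ys = [x ** 3 + e for x, e in zip(xs, eps)]

def cbrt(v):  # real cube root, also for negative arguments
    return math.copysign(abs(v) ** (1 / 3), v)

# reverse-model residuals: eta = X - Y^{1/3}
eta = [x - cbrt(y) for x, y in zip(xs, ys)]

def dep_score(resid, reg):
    """Crude dependence measure: |corr(resid^2, regressor)|.
    Near zero when the residuals are independent of the regressor."""
    r2 = [r * r for r in resid]
    mr, mg = sum(r2) / n, sum(reg) / n
    cov = sum((a - mr) * (b - mg) for a, b in zip(r2, reg)) / n
    vr = sum((a - mr) ** 2 for a in r2) / n
    vg = sum((b - mg) ** 2 for b in reg) / n
    return abs(cov) / math.sqrt(vr * vg)

print(dep_score(eps, xs))                     # close to 0: correct direction
print(dep_score(eta, [cbrt(y) for y in ys]))  # clearly positive: wrong direction
```

The asymmetry is what makes the additive-noise DAG identifiable: only in the causal direction can the data be written as a function of the parent plus an independent error.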
strategy to infer structure and model parameters (for identifiable models): order search (order among the variables) & subsequent (sparse) regression

order search: MLE for the best permutation/order
◮ candidate order/permutation π (of {1, . . . , p})
◮ regressions Xπ(j) versus Xπ(j−1), Xπ(j−2), . . . , Xπ(1) (when sparse: only some of the regressors are selected)
evaluation score for π: likelihood of all the regressions ❀ ∏_{j=1}^p p(Xπ(j) | Xπ(j−1), . . . , Xπ(1))
need a model for the regressions:
◮ functional form of regressions (e.g. additive, linear) ◮ distribution of error terms (e.g. Gaussian, nonparametric)
❀ can compute the likelihood with estimated parameters and search over the orderings/permutations
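For the equal-error-variance Gaussian case, this order search can be sketched at the population level (toy three-node chain, invented coefficients): score each permutation by the sum of its sequential residual variances; the product of these variances is det(Σ) for every order, but the sum is minimized by a true ordering.

```python
from itertools import permutations

# Equal-error-variance linear Gaussian SEM (identifiable, Peters & PB 2014):
# X1 = e1, X2 = 0.8 X1 + e2, X3 = 0.8 X2 + e3, Var(ej) = 1 (toy coefficients)
# population covariance of (X1, X2, X3):
S = [[1.0,  0.8,   0.64],
     [0.8,  1.64,  1.312],
     [0.64, 1.312, 2.0496]]
p = 3

def cond_var(j, given):
    """Var(Xj | X_given) for a Gaussian: Schur complement S_jj - s^T S_gg^{-1} s."""
    g = list(given)
    if not g:
        return S[j][j]
    k = len(g)
    A = [[S[a][b] for b in g] + [S[a][j]] for a in g]   # augmented system S_gg a = s
    for i in range(k):                                  # Gaussian elimination
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            A[r] = [x - f * y for x, y in zip(A[r], A[i])]
    a = [0.0] * k
    for i in reversed(range(k)):                        # back substitution
        a[i] = (A[i][k] - sum(A[i][r] * a[r] for r in range(i + 1, k))) / A[i][i]
    return S[j][j] - sum(ai * S[gi][j] for ai, gi in zip(a, g))

def score(pi):
    """Sum of sequential residual variances along the order pi; under equal
    error variances a true ordering minimizes this sum."""
    return sum(cond_var(pi[j], pi[:j]) for j in range(p))

best = min(permutations(range(p)), key=score)
print(best, score(best))   # the true order (0, 1, 2), with score p * sigma^2 = 3
```

In practice the population covariance is replaced by regressions on data, the p! permutations are not enumerated, and the search is restricted to a pre-estimated superset-skeleton, as described next.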
search over the orderings/permutations: there are p! permutations...
trick: preliminary neighborhood selection, aiming for an undirected graph containing the skeleton of the DAG (i.e. the Markov blanket or a superset thereof)

[Figure: undirected "superset skeleton" on 5 nodes]

❀ do the order search and estimation restricted to the superset-skeleton
◮ much reduced search space
◮ can use unpenalized MLE (for the best order/permutation restricted to the superset-skeleton)
that is: regularization in the neighborhood selection suffices!
CAM: Causal Additive Model (PB, Peters & Ernest, 2013)

Xj ← ∑_{k∈pa(j)} fjk(Xk) + εj (j = 1, . . . , p), εj ∼ N(0, σj²)

◮ the underlying DAG is identifiable from the joint distribution
◮ statistically "feasible" for estimation due to the additive functions
◮ good empirical performance

the negative log-likelihood equals (up to a constant term):

∑_{j=1}^p log(σ̂_j^π),  with (σ̂_j^π)² = n^{−1} ∑_{i=1}^n (X_{i,π(j)} − ∑_{k<j} f̂_{π(j),π(k)}(X_{i,π(k)}))²
fitting method
- 1. preliminary neighborhood selection (for a superset of the skeleton of the DAG, or of the Markov blanket): "nodewise" sparse additive regression of one variable Xj against all others {Xk; k ≠ j}; easy (using CV), works well and is established (Ravikumar et al. (2009), Meier, van de Geer & PB (2009), ...)
- 2. estimate a best order by unpenalized MLE restricted to the superset-skeleton
- 3. based on the estimated order π̂: sparse additive model fitting restricted to the superset-skeleton ❀ edge pruning

[Figure: the three steps on a 5-node example: undirected superset-skeleton, ordered graph, pruned DAG]
for step 2: greedy search restricted to superset-skeleton
– exhaustive computation is possible for trees (Bürge, MSc thesis 2014)
– greedy methods are shown to find the optimum for restricted classes of DAGs (Peters, Balakrishnan & Wainwright, in progress)
unpenalized MLE for best order search: very convenient regularization is decoupled and only used in sparse additive regression
statistical consistency: high-dimensional p ≫ n setting (PB, Peters & Ernest, 2013) main assumptions:
◮ maximal degree of DAG is O(1)
(“sparse”)
◮ sufficiently large identifiability constant (w.r.t. Kullback-Leibler divergence) between the true and any wrong ordering:

ξp := p^{−1} min_{π ∉ Π0} ( Eθ0[−log(p^π_{θ0}(X))] − Eθ0[−log(p_{θ0}(X))] ),  ξp ≫ √(log(p)/n)

◮ exponential moments for the Xj's and certain smooth functions thereof
then: P[π̂ ∈ Π0 (the set of true permutations)] → 1 (n → ∞)
◮ consistent recovery of the true functions f⁰jk and the DAG D⁰
◮ consistent estimation of E[Y | do(X = x)]
remark: preliminary neighborhood selection
◮ yields empirical and computational improvements
◮ seems to improve the theoretical results; e.g. in the linear Gaussian case ❀ less stringent conditions than for ℓ0-regularized MLE (van de Geer & PB, 2013)
Empirical results to illustrate what can be achieved with CAM p = 100, n = 200 true model is CAM (additive SEM) with Gaussian error SHD: Structural Hamming Distance SID: Structural Intervention Distance (Peters & PB, 2013)
[Figure: boxplots of SHD to the true DAG for CAM, RESIT, LiNGAM, PC, CPC, GES (p = 100), and of SID to the true DAG for CAM, RESIT, LiNGAM and the lower/upper versions of PC, CPC, GES (p = 100)]
RESIT (Mooij et al. 2009) cannot be used for p = 100
CAM method is impressive where true functions are non-monotone and nonlinear (sampled from Gaussian proc.); for monotone functions: still good but less impressive gains
Gene expressions from isoprenoid pathways in Arabidopsis thaliana (Wille et al., 2004): p = 39, n = 118; top 20 edges from CAM

[Figure: MEP pathway (chloroplast), MVA pathway (cytoplasm) and mitochondrion, with the top 20 estimated edges]

stability selection

[Figure: the same pathways with the edges surviving stability selection]

solid edges: estimated from data; stability selection: expected number of false positives ≤ 2 (Meinshausen & PB, 2010)
When knowing the order of the variables: a new approach...

Theorem (Ernest & PB, 2014)
For a general nonlinear SEM with functions fj ∈ L2(P) and a regularity condition on the joint density/distribution (for all I ⊂ {1, . . . , p}: p(x) ≥ M p(x_I) p(x_{I^c}) for some 0 < M ≤ 1):

E[Y | do(X = x)] = g(x), where g(·) is the component for X of the best additive L2-approximation of Y by X and {Xk; k ∈ S(X)}, e.g. S(X) = {k; k < jX} with jX the index corresponding to X:

g_app = argmin over additive functions of E[(Y − f_{jX}(X) − ∑_{k∈S(X)} fk(Xk))²],  g(·) = (g_app)_{jX}(·)
implication: only need to run additive model fitting, even for an "arbitrary nonlinear SEM", e.g. even if fj(X(pa(j)), εj) is very complicated...! call it ord-additive modeling/fitting

❀ very robust against model misspecification! (misspecification could be dependent εj's, i.e. due to hidden variables)
if we were to consider E[Y | do(X1 = x1, X2 = x2)] ❀ would need to fit a first-order interaction model
Swiss Air Ltd. ticket pricing and revenue (Ghielmetti et al., in progress) do not need to worry about complicated non-linearities!
estimation – in practice: additive model fitting of Y versus X, {Xk; k ∈ S(X)} (Hastie & Tibshirani (≈ 1990); Wood (2006); ...)
high-dimensional scenario when |S(X)| is large, e.g. |S(X)| ≈ 100 − 5′000 and sample size n ≈ 50: consistency and optimal convergence rate results for sparse additive model fitting if
– the problem is sparse: the maximal degree of the true DAG is small
– identifiability aka restricted eigenvalue conditions hold
(Ravikumar et al. (2009); Meier, van de Geer & PB (2009))
additive model fitting is well-established and works well, including in the high-dimensional setting
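A numpy-only sketch in the spirit of sparse additive model fitting (SpAM, Ravikumar et al., 2009): each covariate gets its own basis expansion and a group-lasso penalty selects whole variables. All sizes, bases and the penalty level are invented for illustration; a real analysis would use a dedicated implementation (e.g. mgcv in R or a SpAM package).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 30  # invented sizes

# Only variables 0 and 1 act on y, both nonlinearly.
X = rng.normal(size=(n, p))
y = X[:, 0] ** 2 + 2 * np.sin(2 * X[:, 1]) + 0.3 * rng.normal(size=n)
y = y - y.mean()

# Per-variable cubic polynomial basis: groups of 3 standardized columns.
B = np.concatenate(
    [np.column_stack([X[:, j] ** d for d in (1, 2, 3)]) for j in range(p)], axis=1
)
B = (B - B.mean(axis=0)) / B.std(axis=0)
groups = [np.arange(3 * j, 3 * j + 3) for j in range(p)]

# Group lasso via proximal gradient (ISTA): gradient step on the
# squared-error loss, then blockwise soft-thresholding per group.
lam = 0.3
L = np.linalg.norm(B, 2) ** 2 / n  # Lipschitz constant of the gradient
beta = np.zeros(B.shape[1])
for _ in range(500):
    z = beta - (B.T @ (B @ beta - y) / n) / L
    for g in groups:
        norm = np.linalg.norm(z[g])
        z[g] = 0.0 if norm <= lam / L else (1 - lam / (L * norm)) * z[g]
    beta = z

active = [j for j, g in enumerate(groups) if np.linalg.norm(beta[g]) > 1e-8]
print(active)  # should contain the truly active variables 0 and 1
```

The group penalty zeroes out entire variables at once, which is what makes the fit interpretable as variable selection in the additive model.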
MEP pathway in Arabidopsis thaliana (Wille et al., 2004): p = 14 gene expressions, n = 118
- order of variables: biological information w.r.t. up-/downstream position
[pathway diagram, chloroplast (MEP pathway): DXPS1, DXPS2, DXPS3, DXR, MCT, CMK, MECPS, HDS, HDR, IPPI1, GPPS, GGPPS (2,6,8,10,11,12), PPDS1, PPDS2]
◮ rank directed gene pairs (X causes Y) with the score
∫ (Ê[Y|do(X = x)] − Ê[Y])² dx
◮ take the top 10 scoring directed gene pairs and check their stability ❀ stability selection: E[false positives] ≤ 1 (Meinshausen & PB, 2010)
do not need to worry about complicated nonlinearities!
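The ranking score above is straightforward to approximate numerically once Ê[Y|do(X = x)] is available from an ord-additive fit. A hedged sketch (the toy pair, basis and integration grid are all invented):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Toy directed pair (invented): X upstream of Y, nonlinear effect;
# X has no parents here, so E[Y|do(X=x)] is just the regression of Y on X.
x = rng.normal(size=n)
y = np.tanh(2 * x) + 0.5 * rng.normal(size=n)

# Cubic-basis fit of the do-effect curve.
B = np.column_stack([x**d for d in range(4)])
coef, *_ = np.linalg.lstsq(B, y, rcond=None)

grid = np.linspace(-2.0, 2.0, 401)
G = np.column_stack([grid**d for d in range(4)])
do_hat = G @ coef

# Score = integral of (E-hat[Y|do(X=x)] - E-hat[Y])^2 dx, approximated
# by a Riemann sum; a larger score means a stronger intervention effect.
score = float(np.sum((do_hat - y.mean()) ** 2) * (grid[1] - grid[0]))
print(score)
```

Ranking all candidate pairs by this score, and then keeping only pairs that survive stability selection, is the pattern the slide describes.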
- ord-additive regression: a local operation
inferring E[Y|do(X = x)] when the DAG (or order) is known:
◮ ord-additive regression of Y versus X, pa(X): local
◮ integration over all directed paths from X to Y: global
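The local/global distinction can be made concrete with a toy SEM (invented for this sketch): the local route is one regression of Y on X and its predecessors; the global route re-simulates the whole mutilated system under do(X = x). When the order is known and correct, the two agree.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50000

# Known SEM (invented), order X -> M -> Y:
#   X = e1,  M = 0.8*X + e2,  Y = M^2 + 0.5*X + e3
X = rng.normal(size=n)
M = 0.8 * X + rng.normal(size=n)
Y = M**2 + 0.5 * X + rng.normal(size=n)

x0 = 1.0

# Local: one ord-additive regression of Y on X (X is a root, so S(X)
# is empty); no propagation through the graph is needed.
B = np.column_stack([np.ones(n), X, X**2])
coef, *_ = np.linalg.lstsq(B, Y, rcond=None)
local = coef @ np.array([1.0, x0, x0**2])

# Global: re-simulate all directed paths from X to Y under do(X = x0).
Xd = np.full(n, x0)
Md = 0.8 * Xd + rng.normal(size=n)
Yd = Md**2 + 0.5 * Xd + rng.normal(size=n)
glob = Yd.mean()

print(local, glob)  # both approximate E[Y|do(X=1)] = 0.64 + 1 + 0.5 = 2.14
```

The global route needs every edge function on every directed path to be right, which is why it degrades faster than the local route when the DAG is only approximately known.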
[simulation figure: squared error of Entire Path, Partial Path and Parent Adjustment versus SHD to the true DAG (percentage of correct edges); left panel: nonsparse DAG, right panel: sparse DAG]
- ord-additive regression is much more reliable and "robust" when having imperfect knowledge of the DAG or of the order of the variables, and it is computationally much faster
Conclusions
- 1. Beware of over-interpretation!
so far, based on current data, we cannot reliably infer a causal network, despite theorems... (perturbation of the data yields unstable networks)
- 2. Causal inference relies on subtle, uncheckable(!) assumptions ❀ experimental validations are important (simple organisms in biology are great for pursuing this!); statistical (and other) inference is often not confirmatory
- 3. many technical issues in identifiability, high-dimensional
statistical inference and optimization
- 4. but there is potential:
for stable ranking/prediction of intervention/causal effects ... “causal inference from purely observed data could have practical value in the prioritization and design of perturbation experiments”
Editorial in Nature Methods (April 2010)
this can be very useful in computational biology and in this sense: “causal inference from observational data is much further developed than 30 years ago when it was thought to be impossible”
Thank you!
R-package: pcalg (Kalisch, Mächler, Colombo, Maathuis & PB, 2012)
References:
◮ Ernest, J. and Bühlmann, P. (2014). On the role of additive regression for (high-dimensional) causal inference. Preprint arXiv:1405.1868.
◮ Bühlmann, P., Peters, J. and Ernest, J. (2013). CAM: Causal Additive Models, high-dimensional order search and penalized regression. Preprint arXiv:1310.1533.
◮ Peters, J. and Bühlmann, P. (2014). Identifiability of Gaussian structural equation models with equal error variances. Biometrika 101, 219-228.
◮ Uhler, C., Raskutti, G., Bühlmann, P. and Yu, B. (2013). Geometry of faithfulness assumption in causal inference. Annals of Statistics 41, 436-463.
◮ van de Geer, S. and Bühlmann, P. (2013). ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. Annals of Statistics 41, 536-567.
◮ Kalisch, M., Mächler, M., Colombo, D., Maathuis, M.H. and Bühlmann, P. (2012). Causal inference using graphical models with the R package pcalg. Journal of Statistical Software 47 (11), 1-26.
◮ Stekhoven, D.J., Moraes, I., Sveinbjörnsson, G., Hennig, L., Maathuis, M.H. and Bühlmann, P. (2011). Causal stability ranking. Bioinformatics 28, 2819-2823.
◮ Hauser, A. and Bühlmann, P. (2012). Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research 13, 2409-2464.
◮ Maathuis, M.H., Colombo, D., Kalisch, M. and Bühlmann, P. (2010). Predicting causal effects in large-scale systems from observational data. Nature Methods 7, 247-248.
◮ Maathuis, M.H., Kalisch, M. and Bühlmann, P. (2009). Estimating high-dimensional intervention effects from observational data. Annals of Statistics 37, 3133-3164.