Missing data and net survival analysis (Bernard Rachet)


SLIDE 1

Workshop on Flexible Models for Longitudinal and Survival Data with Applications in Biostatistics Warwick, 27 - 29 July 2015

Missing data and net survival analysis

Bernard Rachet

SLIDE 2

Population-based, routine data
Cancer registry data
Clinical data – tumour, treatment, comorbidity
Cancer survival and roles played by patient, tumour and health-care factors
(Very) large data sets, but incomplete information, which we have handled using a multiple imputation procedure with Rubin’s rules

General context

SLIDE 3

Preliminary results of on-going work

SLIDE 4

Under the Missing At Random (MAR) assumption:
1. Impute the missing data from $f(\mathbf{Y}_M \mid \mathbf{Y}_O)$ (missing data given observed data) to give K ‘complete’ data sets
2. Fit the substantive model to each of the K data sets, to obtain K estimates of the parameters and estimates of their variance
3. Combine them using Rubin’s rules

Multiple imputation procedure

SLIDE 5

Incomplete data → Imputation → K completed data sets → Analysis → K analysis results → Pooling → Final results

Multiple imputation steps

SLIDE 6

Pooling K estimates – Rubin’s rules

Given K completed data sets, there are:

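K estimates $\hat{\theta}_k$, $k = 1, \dots, K$, with variance $\hat{\sigma}^2_k$, $k = 1, \dots, K$

Pooled estimate: $\hat{\theta}_{MI} = \frac{1}{K} \sum_{k=1}^{K} \hat{\theta}_k$

Total variance: $\hat{V}_{MI} = \hat{W} + \left(1 + \frac{1}{K}\right) \hat{B}$

Within-imputation variance: $\hat{W} = \frac{1}{K} \sum_{k=1}^{K} \hat{\sigma}^2_k$

Between-imputation variance: $\hat{B} = \frac{1}{K-1} \sum_{k=1}^{K} \left(\hat{\theta}_k - \hat{\theta}_{MI}\right)^2$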
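The pooling step can be written in a few lines. The sketch below is only an illustration of Rubin's rules as stated on this slide: the function name `pool_rubin` and the example numbers are hypothetical, not taken from the study.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool K point estimates and their variances with Rubin's rules."""
    k = len(estimates)
    if k < 2 or k != len(variances):
        raise ValueError("need K >= 2 estimates with matching variances")
    pooled = mean(estimates)        # pooled estimate: average of the K estimates
    within = mean(variances)        # W: within-imputation variance
    between = variance(estimates)   # B: sample variance, divisor K - 1
    total = within + (1 + 1 / k) * between
    return pooled, total


# Hypothetical log excess hazard ratios from K = 3 imputed data sets
theta = [0.90, 1.00, 1.10]
var = [0.040, 0.050, 0.060]
pooled, total = pool_rubin(theta, var)
```

The between-imputation term is inflated by (1 + 1/K) because only a finite number of imputations is drawn.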

SLIDE 7

Congeniality
1. Imputation model congenial with substantive model
2. Given the substantive model $f(\mathbf{Y} \mid \mathbf{X})$, $f(\mathbf{Y} \mid \mathbf{X})\, g(\mathbf{X})$ is a congenial imputation model if both $f$ and $g$ are correctly specified
3. Valid inference (under MAR) if $f(\mathbf{Y} \mid \mathbf{X})\, g(\mathbf{X})$ (approximately) represents data structure and substantive model

Multiple imputation procedure

SLIDE 8

Aims

Prognosis of a cancer and impact at population level

Concepts

Excess hazard

Excess hazard ratio
Net survival
Crude probabilities of death from cancer and other causes

Relative survival data setting

Population-based data
Expected mortality hazard from life tables

By single year age and sex, and calendar year, geography, deprivation

Concepts and measures of interest

SLIDE 9

Population-based cohort of colorectal cancer patients
Complete information on age, sex, follow-up time, vital status, deprivation, comorbidity, surgical treatment
Tumour stage, morphology and grade: 45% incomplete data
Relative survival data setting: $\lambda(x) = \lambda_P(x) + \exp(x\beta)$
Substantive model: generalised linear model (Dickman et al, Stat Med 2005)
$\log(\mu_j - d_{Pj}) = \log(y_j) + x\beta$, with link function $\log(\mu_j - d_{Pj})$ and offset $\log(y_j)$
$d_j \sim \mathrm{Poisson}(\mu_j)$; $\mu_j = \lambda_j y_j$
$y_j$: person-time at risk
$d_{Pj}$: expected number of deaths – life tables
Excess hazard ratio (+ Ederer-2 relative survival)

Nur et al, 2009 - Settings
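To make the Poisson formulation concrete, here is a minimal sketch that fits it by Fisher scoring, assuming observed deaths are Poisson with mean equal to the expected deaths plus person-time times exp(x·beta). It is an illustration only: the function names and the two-group data are hypothetical, and it is not the software used in the tutorial.

```python
import math


def solve(a, b):
    """Solve the small linear system a @ x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_excess_hazard(deaths, expected, person_time, covariates, n_iter=50):
    """Fisher scoring for d_j ~ Poisson(mu_j), mu_j = expected_j + y_j * exp(x_j . beta)."""
    p = len(covariates[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        score = [0.0] * p
        info = [[0.0] * p for _ in range(p)]
        for d, d_exp, y, x in zip(deaths, expected, person_time, covariates):
            excess = y * math.exp(sum(b * v for b, v in zip(beta, x)))
            mu = d_exp + excess  # expected deaths plus excess deaths
            for r in range(p):
                score[r] += (d / mu - 1.0) * excess * x[r]
                for c in range(p):
                    info[r][c] += excess ** 2 / mu * x[r] * x[c]
        beta = [b + s for b, s in zip(beta, solve(info, score))]
    return beta


# Hypothetical data: two covariate patterns (reference group, comparison group)
beta = fit_excess_hazard(
    deaths=[30, 80],
    expected=[10.0, 20.0],
    person_time=[1000.0, 1000.0],
    covariates=[[1, 0], [1, 1]],
)
excess_hazard_ratio = math.exp(beta[1])
```

With one parameter per group the fit reproduces the simple ratio of (observed minus expected deaths) per person-time between the two groups, which gives a quick sanity check on the link and offset.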

SLIDE 10

Missing information associated with:

Older ages
More deprived categories
Less treatment with curative intent
Higher probability of death

Data description

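| Variable | Category | Patients: No. | % |
|---|---|---|---|
| All patients | | 29 563 | 100.0 |
| Stage | I | 2 193 | 12.3 |
| | II | 7 326 | 41.0 |
| | III | 7 726 | 43.2 |
| | IV | 643 | 3.6 |
| | Missing | 11 684 | (39.5) |
| Morphology | Adenocarcinoma | 23 693 | 90.7 |
| | Mucinous and serous | 2 314 | 8.9 |
| | Other | 128 | 0.5 |
| | Neoplasm, NOS | 3 428 | (11.6) |
| Grade | I | 3 212 | 14.5 |
| | II | 16 047 | 72.4 |
| | III/IV | 2 907 | 13.1 |
| | Missing | 7 397 | (25.0) |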

SLIDE 11

Multiple imputation using Full Conditional Specification (chained equations – van Buuren, 1999)

Same basic assumptions as in multiple imputation
Assumes a joint (multivariate) distribution exists without specifying its form
Imputation model (joint model for the data)
Gibbs sampler to:

1. Estimate the parameters in the joint imputation model
2. Impute the missing data

Multivariate problem split into a series of univariate problems

Missing information in several variables

$f(Y_{i,1}, Y_{i,2}, \dots, Y_{i,p}) = f(Y_{i,1})\, f(Y_{i,2} \mid Y_{i,1}) \cdots f(Y_{i,p-1} \mid Y_{i,1}, \dots, Y_{i,p-2})\, f(Y_{i,p} \mid Y_{i,1}, \dots, Y_{i,p-1})$

$\mathbf{Y} \sim N(\boldsymbol{\beta}, \boldsymbol{\Omega})$
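The idea of splitting the multivariate problem into a series of univariate problems can be sketched as a loop over the incomplete variables. The toy below is an assumption-laden illustration: it handles two continuous variables with linear regressions, whereas the imputation models in these slides are ordinal and polytomous regressions, and it does not draw the regression parameters from their posterior distribution as a proper imputation procedure would. All names and data are hypothetical.

```python
import random


def regress(xs, ys):
    """Least-squares intercept, slope and residual SD of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (n - 2)) ** 0.5


def chained_equations(rows, n_cycles=10, seed=1):
    """Return one completed copy of two-column data with None for missing values."""
    rng = random.Random(seed)
    missing = [[v is None for v in row] for row in rows]
    data = [list(row) for row in rows]
    # Start from a crude fill-in: the observed mean of each variable
    for j in range(2):
        observed = [row[j] for row in rows if row[j] is not None]
        fill = sum(observed) / len(observed)
        for row in data:
            if row[j] is None:
                row[j] = fill
    # Cycle through the variables, re-imputing each from the other
    for _ in range(n_cycles):
        for j in range(2):
            other = 1 - j
            fit_rows = [row for row, m in zip(data, missing) if not m[j]]
            a, b, sd = regress([r[other] for r in fit_rows], [r[j] for r in fit_rows])
            for row, m in zip(data, missing):
                if m[j]:
                    row[j] = a + b * row[other] + rng.gauss(0.0, sd)
    return data


# Hypothetical data: 20% of each variable missing, never both in the same row
gen = random.Random(0)
rows = []
for _ in range(200):
    first = gen.gauss(0.0, 1.0)
    second = 0.5 * first + gen.gauss(0.0, 1.0)
    u = gen.random()
    rows.append((None if u < 0.2 else first, None if u > 0.8 else second))
completed = chained_equations(rows)
```

Running the function K times with different seeds would give K completed data sets for the analysis and pooling steps.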

SLIDE 12

Outcomes

Ordinal regression for stage and grade
Polytomous regression for morphology

Covariables

Other two covariables with incomplete information
Sex, age, deprivation, comorbidity, treatment, cancer site
Vital status
Follow-up time (years): piecewise function (0, 0.5, 1, 2, 3, 4, 5, 5+)
Time-dependent effects (categorical) for deprivation and age

Substantive (excess hazard) model includes

all these variables
(binary) time-dependent effects

Imputation models

SLIDE 13


Results

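| Variable | Category | Patients: No. | % | Data after imputation: % |
|---|---|---|---|---|
| All patients | | 29 563 | 100.0 | |
| Stage | I | 2 193 | 12.3 | 10.1 |
| | II | 7 326 | 41.0 | 36.1 |
| | III | 7 726 | 43.2 | 47.4 |
| | IV | 643 | 3.6 | 6.2 |
| | Missing | 11 684 | (39.5) | |
| Morphology | Adenocarcinoma | 23 693 | 90.7 | 90.5 |
| | Mucinous and serous | 2 314 | 8.9 | 8.9 |
| | Other | 128 | 0.5 | 0.5 |
| | Neoplasm, NOS | 3 428 | (11.6) | |
| Grade | I | 3 212 | 14.5 | 13.6 |
| | II | 16 047 | 72.4 | 72.0 |
| | III/IV | 2 907 | 13.1 | 14.4 |
| | Missing | 7 397 | (25.0) | |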

SLIDE 14

Other results – Indicator approach

Systematically underestimates variance of EHRs
Overestimates EHRs for tumour morphology
Underestimates EHRs for age and deprivation
Does not identify time-dependent effects

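Excess hazard ratios (EHR) with 95% CI, by period since diagnosis over which the EHR was estimated

| Stage | Complete-case analysis (16 223 cases), five years**: EHR (95% CI) | Multiple imputation (29 563 cases), five years**: EHR (95% CI) |
|---|---|---|
| I | 1.0 | 1.0 |
| II | 3.6 (2.7 to 4.7) | 2.6 (2.2 to 3.0) |
| III | 10.2 (7.7 to 13.5) | 7.0 (5.9 to 8.4) |
| IV | 26.4 (19.6 to 35.5) | 16.5 (13.8 to 19.8) |
| Missing | | |

| Age group (years) | Complete-case, first year: EHR (95% CI) | Complete-case, second to fifth years: EHR (95% CI) | Multiple imputation, first year: EHR (95% CI) | Multiple imputation, second to fifth years: EHR (95% CI) |
|---|---|---|---|---|
| 15 to 44 | 1.0 | 1.0 | 1.0 | 1.0 |
| 45 to 54 | 1.1 (0.8 to 1.5) | 1.3 (1.0 to 1.6) | 1.3 (1.0 to 1.6) | 1.3 (1.1 to 1.5) |
| 55 to 64 | 1.4 (1.0 to 1.9) | 1.2 (1.0 to 1.5) | 1.7 (1.4 to 2.1) | 1.3 (1.1 to 1.5) |
| 65 to 74 | 2.0 (1.5 to 2.7) | 1.2 (1.0 to 1.5) | 2.4 (2.0 to 2.9) | 1.3 (1.1 to 1.6) |
| 75 to 84 | 2.7 (2.0 to 3.7) | 1.1 (0.9 to 1.4) | 3.6 (2.9 to 4.3) | 1.4 (1.2 to 1.6) |
| 85 to 99 | 4.0 (2.9 to 5.5) | 0.9 (0.7 to 1.3) | 5.4 (4.4 to 6.6) | 1.5 (1.2 to 1.9) |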

Results

SLIDE 15

[Figure: stage-specific survival curves against years since diagnosis (1 to 5; vertical axis 20 to 100), in two panels. Before imputation: stages I, II, III, IV and missing. After imputation: stages I, II, III, IV.]

Stage-specific survival

SLIDE 16

Tutorial paper – no systematic evaluation
Relatively simple substantive model
piecewise model
categorical variables

Further recent methodological developments in:
multiple imputation
net survival, flexible modelling

More systematic evaluation – simulations

Limitations

SLIDE 17

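Excess hazard: $\lambda_E(t) = \lambda_O(t) - \lambda_P(t)$

$\lambda_O(t)\,dt = \dfrac{dN^W(t)}{Y^W(t)}$; $\lambda_P(t)\,dt = \dfrac{\sum_{i=1}^{n} Y_i^W(t)\, \lambda_{Pi}(t)\,dt}{Y^W(t)}$

Weights: $W_i(t) = 1 / S_{Pi}(t)$, where $S_{Pi}(t)$ is the expected probability of surviving up to $t$

Net survival: $S_E(t) = \exp\left(-\int_0^t \lambda_E(u)\,du\right)$

Crude mortality: $F_C(t) = \int_0^t S_O(u-)\, \lambda_E(u)\,du$

Concepts and measures of interest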
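The difference between net survival and the crude probability of death from cancer can be checked numerically. The sketch below assumes both hazards are known functions of time and integrates them with the midpoint rule; the function name and the constant hazards in the example are hypothetical.

```python
import math


def net_survival_and_crude(excess_hazard, expected_hazard, t, n_steps=10000):
    """Return (net survival at t, crude probability of death from cancer by t)."""
    h = t / n_steps
    cum_excess = 0.0    # running integral of the excess hazard
    cum_overall = 0.0   # running integral of excess + expected hazard
    crude = 0.0
    for i in range(n_steps):
        u = (i + 0.5) * h
        lam_e = excess_hazard(u)
        lam_p = expected_hazard(u)
        # Overall survival at the midpoint of the current step
        s_overall = math.exp(-(cum_overall + 0.5 * h * (lam_e + lam_p)))
        crude += s_overall * lam_e * h
        cum_excess += lam_e * h
        cum_overall += (lam_e + lam_p) * h
    return math.exp(-cum_excess), crude


# Hypothetical constant hazards (per year), evaluated at 5 years
s_net, f_crude = net_survival_and_crude(lambda u: 0.2, lambda u: 0.05, t=5.0)
```

With constant hazards the crude probability has the closed form (excess / total hazard) × (1 − exp(−total hazard × t)), which is smaller than 1 minus net survival because some patients die of other causes first.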

SLIDE 18

Flexible multivariable excess hazard model

Excess hazard

Time-dependent and non-linear effects (splines)

Variables affecting both mortality processes (cancer and other causes of death) included in the model

Net survival is the mean of individual net survival functions predicted by the model

Modelling approach
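As a small illustration of the last point, net survival for a group is obtained by averaging the individual predicted net survival functions, not by plugging average covariates into the model. The cumulative excess hazards below are hypothetical values standing in for model predictions.

```python
import math


def population_net_survival(cumulative_excess_hazards):
    """Average the individual net survival predictions exp(-Lambda_E,i(t))."""
    individual = [math.exp(-h) for h in cumulative_excess_hazards]
    return sum(individual) / len(individual)


# Hypothetical predicted cumulative excess hazards at 5 years for four patients
cum_hazards = [0.05, 0.20, 0.80, 1.50]
net_surv = population_net_survival(cum_hazards)

# For comparison only: survival evaluated at the average cumulative hazard
at_mean = math.exp(-sum(cum_hazards) / len(cum_hazards))
```

Because exp(−x) is convex, the mean of the individual survival probabilities is at least as large as the survival at the mean cumulative hazard, so the two quantities should not be confused.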

SLIDE 19

Congeniality
1. Imputation model congenial with substantive model
2. Given the substantive model $f(\mathbf{Y} \mid \mathbf{X})$, $f(\mathbf{Y} \mid \mathbf{X})\, g(\mathbf{X})$ is a congenial imputation model if both $f$ and $g$ are correctly specified
3. Valid inference (under MAR) if $f(\mathbf{Y} \mid \mathbf{X})\, g(\mathbf{X})$ (approximately) represents data structure and substantive model
4. Problematic within net survival setting and with non-linear and time-dependent effects

Multiple imputation procedure

SLIDE 20

Data

44,461 men diagnosed with a colorectal cancer in 1998-2006, followed up to 2009
Age at diagnosis (continuous), tumour stage (4 categories), deprivation (5 categories)

Missing stage: 30%

MCAR: $\operatorname{logit} \Pr(R_i = 1 \mid \mathbf{Z}_i) = \delta_0$
MAR on X: $\operatorname{logit} \Pr(R_i = 1 \mid \mathbf{Z}_i) = \alpha_0 + \alpha_1(\text{age}_i - 60)$
MAR: $\operatorname{logit} \Pr(R_i = 1 \mid \mathbf{Z}_i) = \gamma_0 + \gamma_1(\text{age}_i - 60) + \gamma_2 T_i + \gamma_3 D_i$

$R = 1$ if stage missing
100 simulated data sets per scenario

Falcaro et al, 2015 – Study settings
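The three missingness mechanisms can be simulated with a logistic model for the missingness indicator. In the sketch below every coefficient is a hypothetical placeholder (the slides do not give the values used), and the two extra terms in the MAR scenario are read here as survival time and event indicator, which is an assumption.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def draw_missing(age, time, died, mechanism, rng):
    """Draw R = 1 (stage missing) under one mechanism; coefficients are hypothetical."""
    if mechanism == "MCAR":
        linear = math.log(0.3 / 0.7)  # constant 30% chance of missing stage
    elif mechanism == "MAR_X":
        linear = -0.9 + 0.03 * (age - 60)
    elif mechanism == "MAR":
        linear = -1.2 + 0.03 * (age - 60) - 0.1 * time + 0.8 * died
    else:
        raise ValueError("unknown mechanism: " + mechanism)
    return 1 if rng.random() < expit(linear) else 0


# Hypothetical patients: age 40 to 90, follow-up 0 to 10 years, random vital status
rng = random.Random(2015)
flags = [
    draw_missing(
        age=rng.uniform(40, 90),
        time=rng.uniform(0, 10),
        died=rng.randint(0, 1),
        mechanism="MCAR",
        rng=rng,
    )
    for _ in range(20000)
]
share_missing = sum(flags) / len(flags)
```

In a real simulation the intercepts of the two MAR scenarios would be tuned so that the overall proportion of missing stage matches the 30% target.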

SLIDE 21

Distribution on fully observed data and empirical expected distribution in remaining complete records

SLIDE 22

Flexible log cumulative excess hazard model
$\ln \Lambda_E(t \mid \mathbf{x}_i) = s_1(\ln t; \boldsymbol{\gamma}_1, \mathbf{k}_1) + \boldsymbol{\beta}'\mathbf{x}_i + s_2(\text{age}_i; \boldsymbol{\gamma}_2, \mathbf{k}_2)$

Flexible functions: restricted cubic splines
Baseline excess hazard: 5 df, 4 internal knots and 2 boundary knots
Age (continuous): 3 df, 2 internal knots
Covariables: deprivation and stage
Aims: estimate effect of stage (log EHR) and stage-specific net survival at 1, 5 and 10 years since diagnosis

Substantive model
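A restricted cubic spline basis of the kind used for the baseline and age effects can be sketched as follows. The basis is the variable itself plus one term per internal knot, so 2 internal knots give 3 df and 4 internal knots give 5 df, in line with the counts on this slide. The knot positions in the example are hypothetical.

```python
def rcs_basis(x, internal_knots, k_min, k_max):
    """Restricted cubic spline basis: x itself plus one term per internal knot."""

    def cube_plus(v):
        return max(v, 0.0) ** 3

    basis = [x]
    for k in internal_knots:
        lam = (k_max - k) / (k_max - k_min)
        basis.append(
            cube_plus(x - k)
            - lam * cube_plus(x - k_min)
            - (1.0 - lam) * cube_plus(x - k_max)
        )
    return basis


# Hypothetical knots for age: boundary knots at 30 and 90, internal knots at 50 and 70
age_terms = rcs_basis(65.0, internal_knots=[50.0, 70.0], k_min=30.0, k_max=90.0)
```

Each term is constructed so that the cubic and quadratic parts cancel beyond the boundary knots, which forces the fitted function to be linear in the tails.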

SLIDE 23

Outcome (stage)

Ordinal or multinomial logistic regression

Covariables

Survival time and log(survival time), or Nelson-Aalen estimate of the cumulative hazard
Event indicator
Age – splines defined as in the substantive model
Deprivation – dummy variables

30 imputations
Net survival: Rubin’s rules applied on $\log(-\log S_E(t))$ to obtain approximate normality, then back-transformed

Imputation models
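Pooling net survival on the log(−log) scale and back-transforming can be sketched as below. The standard errors are moved to the transformed scale with the delta method, which is an assumption of this sketch rather than a detail given in the slides; the function name and inputs are hypothetical.

```python
import math
from statistics import mean, variance


def pool_net_survival(survivals, std_errors, z=1.96):
    """Pool K net survival estimates on the log(-log) scale, then back-transform."""
    k = len(survivals)
    g = [math.log(-math.log(s)) for s in survivals]
    # Delta method: SE of log(-log S) is SE(S) / (S * |log S|)
    g_var = [(se / (s * abs(math.log(s)))) ** 2 for s, se in zip(survivals, std_errors)]
    pooled = mean(g)
    total = mean(g_var) + (1 + 1 / k) * variance(g)
    half_width = z * math.sqrt(total)

    def back(v):
        return math.exp(-math.exp(v))

    # log(-log S) decreases as S increases, so the bounds swap when back-transformed
    return back(pooled), back(pooled + half_width), back(pooled - half_width)


# Hypothetical 5-year net survival estimates from K = 3 imputed data sets
estimate, lower, upper = pool_net_survival([0.58, 0.60, 0.62], [0.02, 0.02, 0.02])
```

Working on the transformed scale keeps the back-transformed interval inside (0, 1).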

SLIDE 24

Multiple imputation strategy

| Multiple Imputation Strategy | Functional Form | How Survival Is Modeled in the Imputation |
|---|---|---|
| MI_ologit_surv | Ordinal logistic | Survival time and log survival time |
| MI_ologit_na | Ordinal logistic | Nelson-Aalen estimate of cumulative hazard |
| MI_mlogit_surv | Multinomial logistic | Survival time and log survival time |
| MI_mlogit_na | Multinomial logistic | Nelson-Aalen estimate of cumulative hazard |
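For the strategies that use the Nelson-Aalen estimate, each patient needs the cumulative hazard evaluated at their own survival time before the imputation model is fitted. A minimal sketch, with a hypothetical function name and a quadratic-time loop chosen for readability rather than speed:

```python
def nelson_aalen(times, events):
    """Nelson-Aalen cumulative hazard evaluated at each subject's own time."""
    cumulative = 0.0
    hazard_at = {}
    for t in sorted(set(times)):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        cumulative += deaths / at_risk
        hazard_at[t] = cumulative
    return [hazard_at[t] for t in times]


# Hypothetical follow-up times (years) and event indicators (1 = died)
cum_hazard = nelson_aalen([1.0, 2.0, 3.0], [1, 0, 1])
```

The resulting column would then enter the ordinal or multinomial imputation model for stage alongside the event indicator.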

SLIDE 25

Poor results with ordered logit even under MCAR scenario

Results

Bias in log excess hazard ratio estimates for stage (reference stage 1), 100 replications

SLIDE 26

Stage-specific net survival at 1 year, 100 replications

SLIDE 27

Bias in stage-specific net survival estimates at 1 year, 100 replications

Results

SLIDE 28

Promising results, even though the parameter estimated in the substantive model (here excess hazard) does not correspond to the final outcome of interest (net survival)

Limitations
No time-dependent effects of stage
Which joint model?
Which variables in the imputation models?

Vital status
Nelson-Aalen estimates of cumulative hazard
Interactions with time since diagnosis (age at diagnosis, deprivation…)
Other relevant interactions (tumour stage, region…)
Other factors (treatment variables, co-morbidities, hospital volume, surgeon’s experience…)

Comments

SLIDE 29

Simulated data set – colon cancer, 12,048 men followed up at least 5 years

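Baseline excess hazard: 5 df, 4 internal knots
Covariables: stage, deprivation, age
Time-dependent effects of stage: 2 df, 1 internal knot for each higher stage
Non-linear effects of age: 3 df, 2 internal knots

Substantive model

$\ln \Lambda_E(t \mid \mathbf{x}_i) = s_1(\ln t; \boldsymbol{\gamma}_1, \mathbf{k}_1) + \boldsymbol{\beta}'\mathbf{x}_i + s_2(\text{age}_i; \boldsymbol{\gamma}_2, \mathbf{k}_2) + s_{3j}(\text{stage}_j, t; \boldsymbol{\gamma}_3, \mathbf{k}_3)$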

Missing stage simulated as in previous example – 100 data sets per scenario, with 30% missing stage

Focus on MAR here

Limitations and challenges: preliminary study

SLIDE 30

Simulation of missingness mechanisms as in previous example Same imputation model was applied (multinomial, Nelson-Aalen)

Limitations and challenges: preliminary study

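| Stage | Time (year) | Net survival function: Complete | Net survival function: MAR |
|---|---|---|---|
| 1 | 1 | 0.95 | 0.99 |
| 1 | 5 | 0.91 | 0.99 |
| 2 | 1 | 0.90 | 0.97 |
| 2 | 5 | 0.78 | 0.90 |
| 3 | 1 | 0.77 | 0.86 |
| 3 | 5 | 0.46 | 0.59 |
| 4 | 1 | 0.32 | 0.41 |
| 4 | 5 | 0.06 | 0.09 |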

SLIDE 31

Results – Excess hazard ratios for stage

[Figure: excess hazard ratio (vertical axis 0.5 to 3.5) against time since diagnosis (years, 1 to 5); curves: true EHR, complete-case EHRs, imputed EHRs]

Tumour stage 2 (reference stage 1)

SLIDE 32

Results – Excess hazard ratios for stage

[Figure: excess hazard ratio (vertical axis 1 to 14) against time since diagnosis (years, 1 to 5); curves: true EHR, complete-case EHRs, imputed EHRs]

Tumour stage 3 (reference stage 1)

SLIDE 33

Results – Excess hazard ratios for stage

[Figure: excess hazard ratio (vertical axis 5 to 60) against time since diagnosis (years, 1 to 5); curves: true EHR, complete-case EHRs, imputed EHRs]

Tumour stage 4 (reference stage 1)

SLIDE 34

Results – Stage-specific net survival

[Figure: net survival (vertical axis 0.1 to 1) against time since diagnosis (years, 1 to 5)]

Tumour stage 1

SLIDE 35

Results – Stage-specific net survival

[Figure: net survival (vertical axis 0.1 to 1) against time since diagnosis (years, 1 to 5)]

Tumour stage 2

SLIDE 36

Results – Stage-specific net survival

[Figure: net survival (vertical axis 0.1 to 1) against time since diagnosis (years, 1 to 5)]

Tumour stage 3

SLIDE 37

Results – Stage-specific net survival

[Figure: net survival (vertical axis 0.1 to 1) against time since diagnosis (years, 1 to 5)]

Tumour stage 4

SLIDE 38

Why MI?

Strength: clear division between imputation and analysis stages; both efficiency and MAR plausibility increased
Challenge: incompatibility between imputation and substantive models; asymptotically biased estimates

Define joint model for flexible excess hazard models
Multiple imputation by fully conditional specification with substantive model compatible algorithm (SMC-FCS)

Bartlett JW et al. Statistical Methods in Medical Research 2015

Conclusion and development

SLIDE 39

Little RJA, Rubin DB. Statistical Analysis with Missing Data. New York: John Wiley & Sons; 1987.
Van Buuren S, Boshuizen HC, Knook DL. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med 1999; 18: 681-94.
White IR, Royston P. Imputing missing covariate values for the Cox model. Stat Med 2009; 28: 1982-98.
Nur U, Shack LG, Rachet B, Carpenter JR, Coleman MP. Modelling relative survival in the presence of incomplete data: a tutorial. Int J Epidemiol 2010; 39: 118-28.
Carpenter JR, Kenward MG. Multiple imputation and its application. Chichester: John Wiley & Sons; 2013.
Falcaro M, Nur U, Rachet B, Carpenter JR. Estimating excess hazard ratios and net survival when covariate data are missing: strategies for multiple imputation. Epidemiology 2015; 26: 421-8.
Bartlett JW, Seaman SR, White IR, Carpenter JR. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Stat Methods Med Res 2015; 24: 462-97.