The need for clinical (and trialist) commonsense in AI algorithm - - PowerPoint PPT Presentation



SLIDE 1

The need for clinical (and trialist) commonsense in AI algorithm design

Samuel Finlayson

MD-PhD Candidate, Harvard-MIT

SLIDE 2

We’re all really excited about machine learning, and we should be.

Source: eyediagnosis.net en.wikipedia.org/wiki/File:ImageNet_error_rate_history_(just_systems).svg

SLIDE 3

For all the excitement, the clinical benefit of AI is still largely hypothetical

  • Very few prospective trials of medical AI have been reported in any specialty
  • Per Eric Topol's review, only 4 as of 1/2019
  • Good news: 2 of the 4 were in ophthalmology!
  • Many models struggle to reproduce their findings in new patient populations
  • No trials, to my knowledge, have demonstrated improved clinical outcomes

CC-Cruiser: 98.87% accuracy in small trial 1; 87.4% (vs. physician 99.1%) in trial 2

SLIDE 4

Goal for this tutorial: Equip attendees to identify common pitfalls in medical AI that make informed clinical experts essential to development and deployment

SLIDE 5

[Diagram: the ML development pipeline — Retrospective Data (Train/Val Data and Labels; Test Data and Labels) → Model Design → Training → Model Evaluation → Model Deployment → Model Impact, with Clinical Integration, Prospective Data, and Decision-making feeding the deployed model]

Review: The ML development pipeline

SLIDE 6

What do we need clinical experts to be asking?

SLIDE 7

[Diagram: the ML development pipeline (repeated from Slide 5)]

Key questions to ask during dataset curation

SLIDE 8

How might our model be tainted with information from the future?

Hypothetical example #1:

  • Plan: train an ML algorithm to detect DR
  • Postdoc downloads all fundus images from your clinical database, using discharge diagnoses to gather DR cases and healthy controls

What could go wrong? (Hint: see figure)

Source: endotext.com

SLIDE 9

How might our model be tainted with information from the future?

Source: endotext.com

Answer:

  • Laser scars are present!
  • The model may learn to “diagnose” the treatment instead of the disease
  • This is one example of label leakage, a very common problem
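The fix in this hypothetical is a study-design step, not a modeling step: restrict each patient's images to dates before any treatment. A minimal sketch, assuming an illustrative record schema (the `img_date`/`treat_date` fields are hypothetical, not from the talk):

```python
from datetime import date

# Hypothetical exam records: each image has a date; treated patients
# also have a treatment date.
exams = [
    {"patient": "A", "img_date": date(2018, 1, 5),
     "treat_date": date(2018, 3, 1), "label": 1},   # pre-treatment: keep
    {"patient": "A", "img_date": date(2018, 4, 2),
     "treat_date": date(2018, 3, 1), "label": 1},   # post-treatment: laser scars!
    {"patient": "B", "img_date": date(2018, 2, 9),
     "treat_date": None, "label": 0},               # untreated control: keep
]

def pre_treatment_only(exams):
    """Drop any image taken on or after its patient's treatment date,
    so the model cannot learn treatment artifacts as a proxy for disease."""
    return [e for e in exams
            if e["treat_date"] is None or e["img_date"] < e["treat_date"]]

clean = pre_treatment_only(exams)
```

The same date-comparison discipline applies to any label drawn from discharge diagnoses: the label is assigned after the image, so every feature must be checked for post-label contamination.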

SLIDE 10

How might our test set be contaminated with information from our training set?

Hypothetical example #1 (cont’d):

  • Postdoc tries again, limiting images to exams prior to treatment
  • All case and control images are split randomly into a train and test set

What could go wrong? (Hint: see figure)

Image source: wikipedia

[Figure panels: “Training Image 1”, “Test Image 1”]

SLIDE 11

How might our test set be contaminated with information from our training set?

Image source: wikipedia

Answer:

  • Images from the same patients are in both the train and test sets!
  • Test-set metrics will overestimate model accuracy, providing limited evidence of accuracy on unseen patients
  • This is one example of train-test leakage

[Figure panels: “Training Image 1”, “Test Image 1”]
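The standard guard against this leakage is to split at the patient level rather than the image level, so every image from a given patient lands on one side of the split. A minimal sketch over a hypothetical record layout (in practice, tools such as scikit-learn's `GroupShuffleSplit` do the same job):

```python
import random

def split_by_patient(records, test_frac=0.3, seed=0):
    """Assign whole patients, not individual images, to train or test."""
    patients = sorted({r["patient"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, int(len(patients) * test_frac))
    test_ids = set(patients[:n_test])
    train = [r for r in records if r["patient"] not in test_ids]
    test = [r for r in records if r["patient"] in test_ids]
    return train, test

# Three images per patient: an image-level random split would almost
# certainly place the same patient on both sides.
records = [{"patient": p, "image": f"{p}_{i}.png"}
           for p in "ABCDE" for i in range(3)]
train, test = split_by_patient(records)
```

The same grouping logic extends to any unit of correlation: eyes of the same patient, visits to the same clinic, or scans from the same device.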

SLIDE 12

How might our model be confounded?

Hypothetical example #2:

  • You build an ML classifier to detect optic disk edema for neurologic screening
  • Images are gathered from the ED and the outpatient clinic with no regard to their site of origin

How could this data acquisition process lead to confounding?

SLIDE 13

How might our model be confounded?

[Figure: model performance under three matching schemes — no matching, matching on patient features, matching on patient + healthcare process]

(One) Answer:

  • Imaging models have been shown to depend on “non-imaging” variables
  • In ophthalmology, we know that age, sex, etc. are trivially predicted by models from images
  • The problem is especially acute with drug, billing, and text data

Source: Badgeley et al, 2018
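A cheap first diagnostic for this kind of confounding is to check whether the label is entangled with a healthcare-process variable such as acquisition site. A minimal sketch over a hypothetical set of image metadata records (the counts are invented for illustration):

```python
# Hypothetical image metadata: if case rates differ sharply by site,
# a model can score well by recognizing the site (camera model,
# acquisition protocol) rather than the disease.
images = (
    [{"site": "ED", "label": 1}] * 8 + [{"site": "ED", "label": 0}] * 2 +
    [{"site": "clinic", "label": 1}] * 1 + [{"site": "clinic", "label": 0}] * 9
)

def label_rate_by_site(images):
    """Fraction of positive labels at each acquisition site."""
    rates = {}
    for site in {im["site"] for im in images}:
        subset = [im for im in images if im["site"] == site]
        rates[site] = sum(im["label"] for im in subset) / len(subset)
    return rates

rates = label_rate_by_site(images)
# 80% of ED images are cases vs. 10% of clinic images: site predicts label,
# so matching or stratifying on site is needed before training.
```

This is the intuition behind the matching schemes in the figure above: balancing cases and controls on patient and healthcare-process variables removes the shortcut.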

SLIDE 14

[Diagram: the ML development pipeline (repeated from Slide 5)]

Key questions to ask during model evaluation

SLIDE 15

Is our model performance consistent across patient subpopulations?

Hypothetical example #3:

  • At the request of reviewer #2, your team evaluates its model performance stratified by race, finding large differences (see plot on right)
  • You gather more cases from underrepresented groups and retrain the model, but it doesn’t improve the situation

What could be happening?

[Figure: model error vs. race]

Source: Chen et al, NeurIPS ‘18

SLIDE 16

Is our model performance consistent across subpopulations?

Answer:

  • Not all model bias is created equal
  • Different biases require different solutions
  • Could require: more data, more features, or different models
  • See the brilliant Chen et al, NeurIPS 2018

[Figure: model error vs. race — Source: Chen et al, NeurIPS ‘18]
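The stratified evaluation reviewer #2 asked for is easy to make routine: report error per subgroup rather than a single pooled number. A minimal sketch with toy predictions (invented for illustration):

```python
def error_by_group(y_true, y_pred, groups):
    """Per-subgroup error rate; a pooled metric can hide large gaps."""
    out = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        out[g] = sum(y_true[i] != y_pred[i] for i in idx) / len(idx)
    return out

# Toy predictions: the pooled error (2/8 = 0.25) looks tolerable,
# but all of the mistakes fall on one subgroup.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
groups = ["g1", "g1", "g1", "g1", "g2", "g2", "g2", "g2"]
errors = error_by_group(y_true, y_pred, groups)
# errors["g1"] == 0.0, errors["g2"] == 0.5
```

As the slide notes, the stratified numbers only diagnose the gap; whether the remedy is more data, more features, or a different model depends on the source of the bias (see Chen et al).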

SLIDE 17

[Diagram: the ML development pipeline (repeated from Slide 5)]

Key questions to ask during model deployment

SLIDE 18

How might the data we feed our model change over time?

Hypothetical example #4:

  • Your highly accurate ML tool suddenly begins to fail several years after clinical deployment
  • The IT team insists the model has not changed

What might be going on?

SLIDE 19

How might the data we feed our model change over time?

Answer:

  • Clinical performance is not fixed!
  • Changes in the input data can disrupt model performance: dataset shift
  • Model evaluation and development must be an ongoing process

[Figure: model performance over time, dropping after “New EHR System Installed”]

Source: Nestor et al, 2018
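Catching dataset shift requires ongoing monitoring, not a one-time evaluation. One very simple monitor compares an input feature's recent mean against its training-era distribution; the two-standard-deviation alarm threshold and the lab-value scenario below are illustrative assumptions, not from the talk:

```python
import statistics

def drift_score(reference, current):
    """How many reference standard deviations the current mean has moved."""
    mu = statistics.mean(reference)
    sd = statistics.stdev(reference)
    return abs(statistics.mean(current) - mu) / sd

# Hypothetical numeric feature: a new EHR system changes units or coding,
# silently shifting the inputs the deployed model sees.
training_era = [float(x) for x in range(100)]
post_ehr_change = [x + 100.0 for x in training_era]   # shifted inputs

stable = drift_score(training_era, training_era)      # 0.0: no alarm
shifted = drift_score(training_era, post_ehr_change)  # > 2: alarm
```

Real deployments would track many features with distribution-level tests (e.g., population stability or KS statistics), but even a crude mean-shift alarm would have flagged the EHR changeover in the figure above.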

SLIDE 20

[Diagram: the ML development pipeline (repeated from Slide 5)]

Key questions to ask as we assess model impact

SLIDE 21

Can we anticipate any unintended consequences?

Diagnosis does not equal outcomes! (Welch, 2017)
Mismatched incentives -> adversarial behavior (Finlayson et al, 2019)

SLIDE 22

Conclusions

  • Many of the most pernicious challenges of medical machine learning are study design problems
      • What sources of leakage, bias, and confounding might be baked into the design?
      • How does the target population compare with the study population?
      • How might populations evolve over time, and how should they be monitored?
      • Can we anticipate any unintended consequences of deployment?
  • Clinicians and clinical researchers (trialists, epidemiologists, biostatisticians) have been asking similar questions for decades
  • Delivering on the promise of medical ML requires a true partnership between clinical research and machine learning expertise

SLIDE 23

Thank you

Invitation to speak: Michael Abramoff
Feedback on presentation: Lab team of Isaac Kohane, DBMI at Harvard