

SLIDE 1

Toward Comprehensive Syntactic and Semantic Annotations of the Clinical Narrative

Guergana K. Savova, PhD

Boston Children's Hospital and Harvard Medical School

SLIDE 2

Albright, Daniel; Lanfranchi, Arrick; Fredriksen, Anwen; Styler, William; Warner, Collin; Hwang, Jena; Choi, Jinho; Dligach, Dmitriy; Nielsen, Rodney; Martin, James; Ward, Wayne; Palmer, Martha; Savova, Guergana. 2013. Towards comprehensive syntactic and semantic annotations of the clinical narrative. Journal of the American Medical Informatics Association. 2013;0:1–9. doi:10.1136/amiajnl-2012-001317
http://jamia.bmj.com/cgi/rapidpdf/amiajnl-2012-001317?ijkey=z3pXhpyBzC7S1wC&keytype=ref

SLIDE 3

JAMIA, 2013

SLIDE 4

Acknowledgments

• NIH
  • Multi-source Integrated Platform for Answering Clinical Questions (MiPACQ) (NLM RC1LM010608)
  • Temporal Histories of Your Medical Events (THYME) (NLM 10090)
• Office of the National Coordinator for Health Information Technology (ONC)
  • Strategic Health IT Advanced Research Projects: Area 4, Secondary Use of EMR Data (SHARPn) (ONC 90TR0002)
• Institutions contributing data
  • Mayo Clinic
  • Seattle Group Health Cooperative

SLIDE 5

Overview

• Motivation
• Layers of annotations
  • TreeBank
  • PropBank
  • UMLS
• Component development
• Discussion and future directions

SLIDE 6

Computable Annotations: Why

• Developing algorithms
• System evaluation
• Community-wide training and test sets
  • Compare results and establish the state of the art
  • Establish standards (ISO TC37)
• Long tradition in the general NLP domain
  • Linguistic Data Consortium and the Penn Treebank (PTB)
• Layers of annotations on the same text

SLIDE 7

Goals

• Combine annotation types developed for general-domain syntactic and semantic parsing with medical domain-specific annotations
• Create annotations accessible to a variety of methods of analysis, including algorithm and component development
• Evaluate the quality of the annotations by training components to perform the same annotations automatically
• Distribute resources (corpus, guidelines, methods: Apache cTAKES; ctakes.apache.org)

SLIDE 8

Background

• MiPACQ project
• Previous work
  • Ogren et al., 2008
  • Roberts et al., 2009 (CLEF corpus)
  • i2b2/VA challenges
  • BioScope corpus (Vincze et al., 2008)
  • ODIE
• Contributions
  • Layers of annotations
  • Adherence to community standards and conventions (PTB, PropBank, UMLS)

SLIDE 9

Corpus

SLIDE 10

Description

• MiPACQ
  • ~130K words of clinical narrative
  • cf. 901,673 tokens of the Wall Street Journal (WSJ)
• Annotation guidelines
  • Syntactic tree (TreeBank): http://clear.colorado.edu/compsem/documents/treebank_guidelines.pdf
  • Semantic role (PropBank): http://clear.colorado.edu/compsem/documents/propbank_guidelines.pdf
  • UMLS: http://clear.colorado.edu/compsem/documents/umls_guidelines.pdf
  • Clinical coreference: http://clear.colorado.edu/compsem

SLIDE 11

Treebank Annotations

SLIDE 12

Treebank Annotations

• Consist of part-of-speech tags, phrasal and function tags, and empty categories organized in a tree-like structure
• Adapted Penn's POS tagging guidelines, bracketing guidelines, and associated addenda
• Extended the guidelines to account for domain-specific characteristics

http://clear.colorado.edu/compsem/documents/treebank_guidelines.pdf

SLIDE 13

Treebank Review

Tokenization, sentence segmentation, and part-of-speech labels (in brown) are all done in an initial pass.

The patient underwent a radical tonsillectomy (with additional right neck dissection) for metastatic squamous cell carcinoma.

SLIDE 14

Treebank Review

Phrase labels (in green) and grammatical function tags (in blue) are added by a parser and then manually corrected.

The patient underwent a radical tonsillectomy (with additional right neck dissection) for metastatic squamous cell carcinoma.

SLIDE 15

Treebank Review

In the second pass, new tokens are added for implicit and empty arguments (in red), and grammatically linked elements are indexed (in yellow).

Patient was seen 2/18/2001

SLIDE 16

Clinical Additions – S-RED

Clinical language is highly reduced and often elides the copula (to be). The -RED tag was introduced to mark clauses with elided copulas.

Patient (was) seen 2/18/2001

SLIDE 17

Clinical Additions – S-RED

-RED tags are used for all elisions of the copula, including passive voice, progressive (top example), and equational clauses (bottom example).

Patient (is) having hot flashes
Elderly patient (is) in care center with cough

SLIDE 18

Clinical Additions – Null Arguments

Dropped subjects are very common in this data, and *PRO* tags are added to represent them.

(*PRO*) (was) Seen 2/18/2001
(*PRO*) (is) Obese
(*PRO*) Complains of nausea
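The first of these reduced sentences can be pictured in PTB-style bracket notation; this is an illustrative sketch combining the *PRO* subject and the S-RED clause label, not an official tree from the guidelines:

```
(S-RED (NP-SBJ (-NONE- *PRO*))
       (VP (VBN Seen)
           (NP-TMP (CD 2/18/2001))))
```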

SLIDE 19

Clinical Additions – FRAG

Use of the FRAG label for fragmentary text was increased to accommodate the various kinds of non-clausal structures in the data.

Discussion and recommendations: We discussed the registry objectives and procedures.

SLIDE 20

Inter-annotator Agreement

• F-score (EvalB)
  • Constituent match: constituents match if they share the same node label and span (punctuation placement, function tags, trace and gap indices, and empty categories are ignored)
• 0.926
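The constituent-match criterion can be sketched as a small function over sets of labeled spans; this is a simplified illustration, not the actual EvalB implementation:

```python
# EvalB-style constituent F-score sketch: constituents agree when node
# label and span coincide; function tags and empty categories are
# assumed to have been stripped beforehand.

def constituent_f1(gold, test):
    """gold/test: sets of (label, start, end) constituents."""
    if not gold or not test:
        return 0.0
    matched = len(gold & test)
    precision = matched / len(test)
    recall = matched / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two annotators' trees for "Patient seen 2/18/2001" as labeled spans:
a1 = {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 2, 3)}
a2 = {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3)}
print(round(constituent_f1(a1, a2), 3))  # 0.857
```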

SLIDE 21

Propbank Annotations

SLIDE 22

What is Propbank?

• A database of syntactically parsed trees annotated with semantic role labels
• All arguments are annotated with semantic roles in relation to their predicate structure
• This provides training data that can identify predicate-argument structures for individual verbs

SLIDE 23

Propbank Labels

• Labels do not change with the predicate
• Meanings of core arguments 2–5 change with the predicate
• Arg0: proto-agent for transitive verbs
• Arg1: proto-patient for transitive verbs
• Meanings of adjunctive args do not change

SLIDE 24

Propbank Labels

• Arg0 = agent
• Arg1 = theme / patient
• Arg2 = benefactive / instrument / attribute / end state
• Arg3 = start point / benefactive / attribute
• Arg4 = end point
• ArgM = modifier

SLIDE 25

Propbank Labels

Numbered arguments: ARG0 (agent), ARG1 (patient), ARG2, ARG3, ARG4
ArgM modifiers: Adverbial, Cause, Direction, Discourse, Extent, Location, Manner, Modal, Negation, Purpose, Temporal, Predication

SLIDE 26

Why Propbank?

• Identifying commonalities in predicate-argument structures (roles: agent diagnosing, person diagnosed, disease):

[Dr. Z] diagnosed [Jack's bronchitis]
[Jack] was diagnosed [with bronchitis] [by Dr. Z]
[Dr. Z's] diagnosis [of Jack's bronchitis] allowed her to treat him with the proper antibiotics.
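The shared structure across these surface forms can be made concrete in a small sketch; the role names and fillers below mirror this slide's example and are illustrative, not taken from official PropBank frame files:

```python
# Three syntactic variants of "diagnose" filling the same role slots.
variants = [
    # [Dr. Z] diagnosed [Jack's bronchitis]
    {"form": "active verb",
     "agent": "Dr. Z", "disease": "Jack's bronchitis"},
    # [Jack] was diagnosed [with bronchitis] [by Dr. Z]
    {"form": "passive verb",
     "agent": "Dr. Z", "person_diagnosed": "Jack", "disease": "bronchitis"},
    # [Dr. Z's] diagnosis [of Jack's bronchitis] ...
    {"form": "nominalization",
     "agent": "Dr. Z", "disease": "Jack's bronchitis"},
]

# Every variant fills the same "agent" and "disease" slots, which is
# what lets a trained labeler generalize across surface forms.
shared = set(variants[0]) & set(variants[1]) & set(variants[2])
print(sorted(shared - {"form"}))  # ['agent', 'disease']
```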

SLIDE 27

Stages of the Propbank Process

• Frame creation

SLIDE 28

Stages of Propbank

• Annotation
  • Data is double annotated
  • Annotators:
    1. Determine and select the sense of the predicate
    2. Annotate the arguments for the selected predicate sense
• Adjudication
  • After data is annotated, it is passed to an adjudicator who resolves differences between the two annotators
  • This creates the gold standard: corrected, finished training data

SLIDE 29

Annotation Example

SLIDE 30

Results

• The Propbank layer included 1,772 distinct predicate lemmas
  • 1,006 had existing frames
  • 74 new verb frames were created
  • 692 noun frames were created
• Of the numbered arguments, Arg0 was the most common at 48.47%, followed by Arg1 at 14.58%

SLIDE 31

SLIDE 32

Inter-annotator Agreement

• Agreement was calculated three ways:
  • Exact: annotations needed to match on constituent boundaries and roles
  • Core-argument: constituent boundaries matched, numbered arguments were the same, and ArgMs were used with exact boundaries
  • Constituent: annotators marked the same constituent
• Results:
  • PropBank, exact: 0.891
  • PropBank, core-arg: 0.917
  • PropBank, constituent: 0.931
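The three criteria can be sketched as set comparisons over (start, end, role) argument triples; this is a simplified illustration using a plain overlap ratio, not the exact metric used in the paper:

```python
# Three agreement modes over two annotators' argument annotations.

def agreement(a, b, mode):
    if mode == "exact":          # boundaries and role must both match
        key = lambda t: (t[0], t[1], t[2])
    elif mode == "core":         # ArgM subtypes collapsed to one label
        key = lambda t: (t[0], t[1],
                         "ArgM" if t[2].startswith("ArgM") else t[2])
    else:                        # "constituent": same span, any role
        key = lambda t: (t[0], t[1])
    sa, sb = {key(t) for t in a}, {key(t) for t in b}
    return len(sa & sb) / len(sa | sb)

ann1 = [(0, 1, "Arg0"), (2, 5, "Arg1"), (6, 8, "ArgM-TMP")]
ann2 = [(0, 1, "Arg0"), (2, 5, "Arg2"), (6, 8, "ArgM-LOC")]
for mode in ("exact", "core", "constituent"):
    print(mode, round(agreement(ann1, ann2, mode), 3))
```

As expected, agreement rises as the criterion loosens, mirroring the 0.891 / 0.917 / 0.931 ordering reported above.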

SLIDE 33

UMLS Annotations

SLIDE 34

UMLS Semantic Types, Groups, and Relations Annotation

• UMLS (Unified Medical Language System) was developed to help with cross-linguistic translation of medical concepts
• We mark semantic groups (similar to named entity types) using UMLS, with attributes:
  • Negation (true/false)
  • Status (none (= confirmed), possible, historyOf, and familyHistoryOf)
• Added a Person category

SLIDE 35

UMLS Example

The patient underwent a radical tonsillectomy (with additional right neck dissection) for metastatic squamous cell carcinoma. He returns with a recent history of active bleeding from his oropharynx.
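The scheme from the previous slide can be sketched as mentions over this example; the semantic groups and attribute values below are illustrative guesses, not taken from the released annotations:

```python
# Mention-with-attributes sketch for the UMLS annotation layer.
from dataclasses import dataclass

@dataclass
class UmlsMention:
    text: str
    semantic_group: str        # e.g. Disorder, Procedure, Anatomy, Person
    negated: bool = False
    status: str = "none"       # none (= confirmed), possible,
                               # historyOf, familyHistoryOf

mentions = [
    UmlsMention("patient", "Person"),
    UmlsMention("radical tonsillectomy", "Procedure", status="historyOf"),
    UmlsMention("right neck dissection", "Procedure", status="historyOf"),
    UmlsMention("metastatic squamous cell carcinoma", "Disorder"),
    UmlsMention("active bleeding", "Disorder"),
    UmlsMention("oropharynx", "Anatomy"),
]
print(sum(m.status == "historyOf" for m in mentions))  # prints 2
```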

SLIDE 36

Inter-annotator Agreement

• F1 measure
  • Exact-match boundaries: 0.697
  • Partial-match boundaries: 0.75

SLIDE 37

SLIDE 38

Development and Evaluation of NLP Components

SLIDE 39

Development of NLP Components

[Pipeline figure: Treebank and PropBank annotations feed part-of-speech tagging, dependency conversion, dependency parsing, and semantic role labeling, producing automatic output]

SLIDE 40

Development of NLP Components

• ClearNLP dependency converter
  • Generates the Stanford dependency labels (and more)
  • Unlike the Stanford dependency converter, our approach generates non-projective dependencies
  • Adapts to the MiPACQ Treebank guidelines
  • http://clearnlp.googlecode.com
• OpenNLP part-of-speech tagger
  • One-pass, left-to-right part-of-speech tagging approach
  • Uses maximum entropy for machine learning
  • http://opennlp.apache.org
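The one-pass, left-to-right strategy can be sketched as a greedy decoder; the score table below is a hypothetical stand-in for OpenNLP's trained maximum-entropy model, not its real API:

```python
# Greedy left-to-right tagging: each word is tagged using the word and
# the previously assigned tag as features, in a single pass.

TAGS = ["DT", "NN", "VBD", "VBN"]

def score(word, prev_tag, tag):
    # Stand-in for a MaxEnt model's P(tag | word, previous tag).
    table = {
        ("the", None, "DT"): 0.9,
        ("patient", "DT", "NN"): 0.9,
        ("was", "NN", "VBD"): 0.9,
        ("seen", "VBD", "VBN"): 0.9,
    }
    return table.get((word, prev_tag, tag), 0.1)

def greedy_tag(words):
    tags, prev = [], None
    for w in words:
        best = max(TAGS, key=lambda t: score(w, prev, t))
        tags.append(best)
        prev = best
    return tags

print(greedy_tag(["the", "patient", "was", "seen"]))
```

Because earlier tags feed later decisions, a single early error can propagate; the trained model's feature richness is what keeps accuracy high in practice.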

SLIDE 41

Development of NLP Components

• ClearNLP dependency parser
  • Transition-based, non-projective dependency parsing approach using bootstrapping
  • Takes about 2 milliseconds per sentence to parse
  • Showed state-of-the-art performance on the CoNLL'09 data for English and Czech (Choi & Palmer, 2011a)
• ClearNLP semantic role labeler
  • Transition-based semantic role labeling approach using rich semantic features from the dependency structure
  • Showed state-of-the-art performance on the CoNLL'09 data for English (Choi & Palmer, 2011b)

SLIDE 42

Evaluation of NLP Components

• Training data distribution
  • WSJ-9: the Penn Treebank (Wall Street Journal), 901K tokens
  • WSJ-1: the Penn Treebank (Wall Street Journal), 148K tokens
  • MiPACQ: the MiPACQ Treebank, 148K tokens
  • WSJ-1 + MiPACQ: 295K tokens
  • WSJ-9 + MiPACQ: 1.05M tokens

             WSJ-9 (901K)  WSJ-1 (148K)  MiPACQ (148K)  WSJ-1+MiPACQ  WSJ-9+MiPACQ
Sentences    37,015        6,006         11,435         17,441        43,021
Tokens       901,673       147,710       147,698        295,408       1,049,383
Predicates   96,159        15,695        16,776         32,471        111,854

SLIDE 43

Evaluation of NLP Components

• Evaluation data distribution
  • MiPACQ CN: the MiPACQ Treebank, clinical notes (colon cancer)
  • MiPACQ PA: the MiPACQ Treebank, pathology notes
  • SHARP: the SHARP Treebank, radiology notes
  • THYME: the THYME Treebank, clinical/pathology notes (colon cancer)

             MiPACQ CN  MiPACQ PA  SHARP    THYME
Sentences    893        203        9,070    9,107
Tokens       10,865     2,701      119,912  102,745
Predicates   1,355      145        8,573    8,866

SLIDE 44

Evaluation of Part-of-speech Tagging

[Chart: accuracies]

SLIDE 45

Evaluation of Dependency Parsing

[Chart: labeled attachment scores]

SLIDE 46

Evaluation of Semantic Role Labeling

[Chart: F1-scores of argument identification + classification]

SLIDE 47

Constituency Parser (not in the paper; preliminary results)

• A wrapper around the OpenNLP parser implementing Ratnaparkhi's MaxEnt parser (Ratnaparkhi, 1997)
  • Labeled F1: 0.81 (WSJ and MiPACQ)

SLIDE 48

Discussion

SLIDE 49

Discussion I

• All resources are available
  • Corpus (DUA required; send email to Guergana.Savova@childrens.harvard.edu)
  • Guidelines
  • NLP components (ctakes.apache.org)
• UMLS IAA is similar to previously reported results
• Component results can be used as baselines for developing future, improved parsers
• Domain-specific data (even in small quantities) improves performance
• The need for a community-adopted annotation scheme (UMLS?)

SLIDE 50

Discussion II

• Rich syntactic and semantic annotations are building blocks for more complex tasks (discovery of implicit arguments, inferencing)
• Addition of coreference and temporal relations (SHARPn, THYME)
• Active learning (Settles, 2010; Miller et al., 2012)

SLIDE 51

References

• Jinho D. Choi, Martha Palmer, "Getting the Most out of Transition-based Dependency Parsing", Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 687–692, Portland, Oregon, 2011a.
• Jinho D. Choi, Martha Palmer, "Transition-based Semantic Role Labeling Using Predicate Argument Clustering", Proceedings of the ACL Workshop on Relational Models of Semantics (RELMS'11), 37–45, Portland, Oregon, 2011b.
• Jinho D. Choi, Martha Palmer, "Guidelines for the Clear Style Constituent to Dependency Conversion", Technical Report 01-12, Institute of Cognitive Science, University of Colorado Boulder, Boulder, CO, 2012.
• More references are in the published manuscript…