SLIDE 1
Toward Comprehensive Syntactic and Semantic Annotations of the Clinical Narrative
Guergana K. Savova, PhD
Boston Children's Hospital, Harvard Medical School
SLIDE 2
Albright, Daniel; Lanfranchi, Arrick; Fredriksen, Anwen; Styler, William; Warner, Collin; Hwang, Jena; Choi, Jinho; Dligach, Dmitriy; Nielsen, Rodney; Martin, James; Ward, Wayne; Palmer, Martha; Savova, Guergana. 2013. Towards syntactic and semantic annotations of the clinical narrative. Journal of the American Medical Informatics Association. 2013;0:1–9. doi:10.1136/amiajnl-2012-001317
http://jamia.bmj.com/cgi/rapidpdf/amiajnl-2012-001317?ijkey=z3pXhpyBzC7S1wC&keytype=ref
SLIDE 3
JAMIA, 2013
SLIDE 4
Acknowledgments
!
NIH
! Multi-source integrated platform for answering clinical questions
(MiPACQ) (NLM RC1LM010608)
! Temporal Histories of Your Medical Event (THYME) (NLM 10090)
!
Office of the National Coordinator of Healthcare Technologies (ONC)
! Strategic Healthcare Advanced Research Project: Area 4, Secondary
Use of the EMR data (SHARPn) (ONC 90TR0002)
!
Institutions contributing data
! Mayo Clinic ! Seattle Group Health Cooperative
SLIDE 5
Overview
!
Motivation
!
Layers of annotations
! TreeBank ! PropBank ! UMLS
!
Component development
!
Discussion and future directions
SLIDE 6
Computable Annotations: Why
!
Developing algorithms
!
System evaluation
!
Community-wide training and test sets
! Compare results and establish state-of-the-art ! Establishing standards (ISO TC37)
!
Long tradition in the general NLP domain
! Linguistic Data Consortium and PTB
!
Layers of annotations on the same text
SLIDE 7
Goals
!
Combine annotation types developed for general domain syntactic and semantic parsing with medical domain-specific annotations
!
Create accessible annotations for a variety of methods of analysis, including algorithm and component development
!
Evaluate the quality of the annotations by training components to perform the same annotations automatically
!
Distribute resources (corpus, guidelines, methods - Apache cTAKES; ctakes.apache.org)
SLIDE 8
Background
!
MiPACQ project
!
Previous work
! Ogren et al, 2008 ! Roberts et al, 2009 (CLEF corpus) ! i2b2/VA challenges ! BioScope corpus (Vincze et al, 2008) ! ODIE
!
Contributions
! Layers of annotations ! Adherence to community standards and conventions (PTB,
PropBank, UMLS)
SLIDE 9
Corpus
SLIDE 10
Description
!
MiPACQ
! ~130K words of clinical narrative ! cf. 901,673 tokens of the Wall Street Journal (WSJ)
!
Annotation guidelines
! Syntactic tree (TreeBank):
http://clear.colorado.edu/compsem/documents/ treebank_guidelines.pdf
! Semantic role (PropBank):
http://clear.colorado.edu/compsem/documents/ propbank_guidelines.pdf
! UMLS:
http://clear.colorado.edu/compsem/documents/ umls_guidelines.pdf
! Clinical coreference: http://clear.colorado.edu/compsem
SLIDE 11
Treebank Annotations
SLIDE 12
Treebank Annotations
!
Consist of part-of-speech tags, phrasal and function tags, and empty categories organized in a tree-like structure
!
Adapted Penn's POS tagging guidelines, bracketing guidelines, and associated addenda
!
Extended the guidelines to account for domain- specific characteristics
http://clear.colorado.edu/compsem/documents/treebank_guidelines.pdf
SLIDE 13
Treebank Review
Tokenization, sentence segmentation, and part-of-speech labels (in brown) are all done in an initial pass.
The patient underwent a radical tonsillectomy (with additional right neck dissection) for metastatic squamous cell carcinoma .
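As a concrete data structure, that first pass can be represented as token/POS pairs; the tags below are plausible PTB tags for this sentence, not the corpus's gold annotation:

```python
# Illustrative first-pass representation: tokens paired with PTB POS tags.
# The tags are plausible guesses, not the MiPACQ gold annotation.
sentence = [
    ("The", "DT"), ("patient", "NN"), ("underwent", "VBD"),
    ("a", "DT"), ("radical", "JJ"), ("tonsillectomy", "NN"),
    ("(", "-LRB-"), ("with", "IN"), ("additional", "JJ"),
    ("right", "JJ"), ("neck", "NN"), ("dissection", "NN"),
    (")", "-RRB-"), ("for", "IN"), ("metastatic", "JJ"),
    ("squamous", "JJ"), ("cell", "NN"), ("carcinoma", "NN"),
    (".", "."),
]

def detokenize(tagged):
    """Recover the raw token stream from the tagged pairs."""
    return " ".join(tok for tok, _ in tagged)
```

Later passes (phrase labels, function tags, empty categories) layer tree structure over exactly this token sequence.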
SLIDE 14
Treebank Review
Phrase labels (in green) and grammatical function tags (in blue) are added by a parser and then manually corrected.
The patient underwent a radical tonsillectomy (with additional right neck dissection) for metastatic squamous cell carcinoma .
SLIDE 15
Treebank Review
In that second pass, new tokens are added for implicit and empty arguments (in red), and grammatically linked elements are indexed (in yellow).
Patient was seen 2/18/2001
SLIDE 16
Clinical Additions – S-RED
Clinical language is highly reduced and often elides the copula (to be). The -RED tag was introduced to mark clauses with an elided copula.
Patient (was) seen 2/18/2001
SLIDE 17
Clinical Additions – S-RED
The -RED tag is used for all elisions of the copula, including passive voice, progressive (top example), and equational clauses (bottom example).
Patient (is) having hot flashes Elderly patient (is) in care center with cough
SLIDE 18
Clinical Additions – Null Arguments
Dropped subjects are very common in this data, and *PRO* tags are added to represent them.
(*PRO*) (was) Seen 2/18/2001 (*PRO*) (is) Obese (*PRO*) Complains of nausea
SLIDE 19
Clinical Additions – FRAG
Use of the FRAG label for fragmentary text was increased to accommodate the various kinds of non-clausal structures in the data.
Discussion and recommendations: We discussed the registry objectives and procedures.
SLIDE 20
Inter-annotator Agreement
!
F-score (EvalB)
! Constituent match – if they share the same node label and
span (punctuation placement, function tags, trace and gap indices, and empty categories are ignored)
!
0.926
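The constituent-match F-score can be sketched as follows: extract labeled spans from two bracketed trees, strip function tags (which the metric ignores), and compute F1 over matching constituents. A minimal sketch, not the official EvalB scorer (which also handles empty categories and punctuation placement):

```python
import re
from collections import Counter

def parse(tree):
    """Parse a PTB-style bracketed string into (label, children) tuples;
    leaves are plain token strings."""
    toks = re.findall(r"\(|\)|[^\s()]+", tree)
    def helper(i):
        label, children, i = toks[i + 1], [], i + 2
        while toks[i] != ")":
            if toks[i] == "(":
                child, i = helper(i)
            else:
                child, i = toks[i], i + 1
            children.append(child)
        return (label, children), i + 1
    return helper(0)[0]

def constituent_spans(node, start=0):
    """Collect (label, start, end) for internal nodes, skipping POS
    preterminals; function tags (e.g. NP-SBJ) are stripped, since the
    metric ignores them. Returns (spans, width)."""
    label, children = node
    spans, pos = [], start
    for child in children:
        if isinstance(child, str):
            pos += 1
        else:
            child_spans, width = constituent_spans(child, pos)
            spans.extend(child_spans)
            pos += width
    if not all(isinstance(c, str) for c in children):
        spans.append((label.split("-")[0], start, pos))
    return spans, pos - start

def evalb_f1(gold, test):
    """Labeled bracketing F1 over matching constituents."""
    g = Counter(constituent_spans(parse(gold))[0])
    t = Counter(constituent_spans(parse(test))[0])
    matched = sum(min(g[s], t[s]) for s in g)
    p, r = matched / sum(t.values()), matched / sum(g.values())
    return 2 * p * r / (p + r) if p + r else 0.0
```

Two annotators who differ only in one attachment decision thus lose exactly the constituents whose spans shift.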
SLIDE 21
Propbank Annotations
SLIDE 22
What is Propbank?
!
A database of syntactically parsed trees annotated with semantic role labels
!
All arguments are annotated with semantic roles in relation to their predicate structure
!
This provides training data for systems that identify predicate-argument structures for individual verbs.
SLIDE 23
Propbank Labels
!
Labels do not change with predicate
!
Meanings of core arguments 2-5 change with predicate
!
Arg0 proto-agent for transitive verbs
!
Arg1 proto-patient for transitive verbs
!
Meanings of Adjunctive args do not change
SLIDE 24
Propbank Labels
!
Arg0 = agent
!
Arg1 = theme / patient
!
Arg2 = benefactive / instrument / attribute / end state
!
Arg3 = start point / benefactive / attribute
!
Arg4 = end point
!
ArgM = modifier
SLIDE 25
Propbank Labels
Numbered arguments: ARG0 (agent), ARG1 (patient), ARG2, ARG3, ARG4
ArgM modifier types: Adverbial, Cause, Direction, Discourse, Extent, Location, Manner, Modal, Negation, Purpose, Temporal, Predication
SLIDE 26
Why Propbank?
!
Identifying commonalities in predicate-argument structures (Agent, Person diagnosed, Disease):
[Dr. Z] diagnosed [Jack's bronchitis]
[Jack] was diagnosed [with bronchitis] [by Dr. Z]
[Dr. Z's] diagnosis [of Jack's bronchitis] allowed her to treat him with the proper antibiotics.
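These shared structures are what a PropBank roleset captures. A hedged sketch in code, with an illustrative roleset (the id and role glosses are plausible but not copied from the official frame files):

```python
# Illustrative roleset; id and glosses are assumptions, not the official frame.
roleset = {
    "id": "diagnose.01",
    "roles": {"Arg0": "agent doing the diagnosing",
              "Arg1": "disease diagnosed",
              "Arg2": "person diagnosed"},
}

# The same numbered roles hold across active, passive, and nominalized forms:
annotations = [
    {"form": "active",      "Arg0": "Dr. Z", "Arg1": "Jack's bronchitis"},
    {"form": "passive",     "Arg0": "Dr. Z", "Arg1": "bronchitis",
     "Arg2": "Jack"},
    {"form": "nominalized", "Arg0": "Dr. Z", "Arg1": "Jack's bronchitis"},
]

# Regardless of surface syntax, Arg0 is always the diagnoser:
agents = {a["Arg0"] for a in annotations}
```

This stability of labels across syntactic alternations is exactly what makes the layer useful as training data.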
SLIDE 27
Stages of the Propbank process
!
Frame Creation
SLIDE 28
Stages of Propbank
!
Annotation
! Data is double annotated ! Annotators
- 1. Determine and select the sense of the predicate
- 2. Annotate the arguments for the selected predicate sense
!
Adjudication
! After data is annotated, it is passed to an adjudicator who
resolves differences between the two annotators
! This creates the gold standard – corrected, finished training
data
SLIDE 29
Annotation Example
SLIDE 30
Results
!
Propbank layer included 1772 distinct predicate lemmas
! 1,006 had existing frames ! 74 new verb frames were created ! 692 noun frames were created
!
Of numbered arguments, Arg0 was the most common, at 48.47%, followed by Arg1 at 14.58%
SLIDE 31
SLIDE 32
Inter-annotator Agreement
!
Agreement was calculated 3 ways:
! Exact -- annotation needed to match on constituent
boundaries and roles
! Core-Argument -- constituent boundaries matched,
numbered arguments were the same, and ArgMs were used with exact boundaries
! Constituent -- annotators marked the same constituent
!
Results:
! PropBank, exact 0.891 ! PropBank, Core-arg 0.917 ! PropBank, Constituent 0.931
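The three agreement levels can be sketched over two annotators' argument sets, each a set of ((start, end), label) pairs; percent agreement here is matched pairs over the average count, a simplification of the paper's scorer:

```python
def propbank_agreement(a, b):
    """Sketch of the three agreement levels: exact (span + label),
    core-argument (ArgM subtypes collapsed), and constituent (span only).
    a and b are sets of ((start, end), label) pairs."""
    def score(x, y):
        matched = len(x & y)
        return 2 * matched / (len(x) + len(y)) if x or y else 1.0
    exact = score(a, b)
    # Core-argument: numbered args must match; ArgM subtype is ignored.
    collapse = lambda s: {(sp, "ArgM" if lb.startswith("ArgM") else lb)
                          for sp, lb in s}
    core = score(collapse(a), collapse(b))
    # Constituent: annotators marked the same span, labels ignored.
    spans = lambda s: {sp for sp, _ in s}
    constituent = score(spans(a), spans(b))
    return {"exact": exact, "core": core, "constituent": constituent}
```

By construction the three scores are monotone (exact ≤ core ≤ constituent), mirroring the 0.891 / 0.917 / 0.931 ordering above.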
SLIDE 33
UMLS Annotations
SLIDE 34
UMLS Semantic Types, Groups and Relations annotation
!
UMLS (Unified Medical Language System) was developed to enable translation across the many biomedical vocabularies and coding systems
!
We mark semantic groups (similar to Named Entity Types) using UMLS with attributes:
! Negation (true/false) ! Status (none (=confirmed), possible, historyOf, and
familyHistoryOf)
!
Added Person category
SLIDE 35
UMLS Example
!
The patient underwent a radical tonsillectomy (with additional right neck dissection) for metastatic squamous cell carcinoma. He returns with a recent history of active bleeding from his oropharynx.
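One way to represent such annotations is a record per mention carrying the semantic group and the two attributes; the field names and the specific status values assigned below are illustrative, not the corpus's actual schema:

```python
from dataclasses import dataclass

@dataclass
class UmlsMention:
    """One UMLS-layer annotation: a text span assigned a semantic group
    plus the negation and status attributes. Illustrative schema."""
    text: str
    semantic_group: str          # e.g. Procedures, Disorders, Person
    negated: bool = False
    status: str = "none"         # none (=confirmed), possible,
                                 # historyOf, familyHistoryOf

# A plausible (not gold) annotation of the example above:
mentions = [
    UmlsMention("radical tonsillectomy", "Procedures", status="historyOf"),
    UmlsMention("right neck dissection", "Procedures", status="historyOf"),
    UmlsMention("metastatic squamous cell carcinoma", "Disorders",
                status="historyOf"),
    UmlsMention("He", "Person"),
    UmlsMention("active bleeding", "Disorders"),   # current, confirmed
    UmlsMention("oropharynx", "Anatomy"),
]
```

Note how the added Person category covers "He", which no standard UMLS semantic group captures cleanly.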
SLIDE 36
Inter-annotator Agreement
!
F1 measure
! Boundaries are exact match (0.697) ! Boundaries are partial match (0.75)
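Exact versus partial boundary matching can be sketched as two scoring modes of one span-F1 function (greedy one-to-one pairing; a sketch, not the paper's exact matcher):

```python
def span_f1(gold, pred, partial=False):
    """F1 over annotated (start, end) spans. With partial=True any
    overlap counts as a match; spans are paired greedily one-to-one."""
    def overlap(a, b):
        return max(a[0], b[0]) < min(a[1], b[1])
    unmatched, matched = list(pred), 0
    for gs in gold:
        for ps in unmatched:
            if gs == ps or (partial and overlap(gs, ps)):
                matched += 1
                unmatched.remove(ps)
                break
    p = matched / len(pred) if pred else 0.0
    r = matched / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0
```

Relaxing to partial match can only add matches, which is why the partial-match IAA (0.75) sits above the exact-match IAA (0.697).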
SLIDE 37
SLIDE 38
Development and Evaluation of NLP Components
SLIDE 39
Development of NLP Components
[Pipeline diagram: the Treebank and PropBank layers feed part-of-speech tagging, dependency conversion, dependency parsing, and semantic role labeling, producing the automatic output.]
SLIDE 40
Development of NLP Components
!
ClearNLP dependency converter
! Generates the Stanford dependency labels (and more). ! Unlike the Stanford dependency converter, our approach
generates non-projective dependencies.
! Adapts to the MiPACQ Treebank guidelines. ! http://clearnlp.googlecode.com
!
OpenNLP part-of-speech tagger
! One-pass, left-to-right part-of-speech tagging approach. ! Uses maximum entropy for machine learning. ! http://opennlp.apache.org
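The one-pass, left-to-right decoding shape can be illustrated with a most-frequent-tag baseline; this stands in for OpenNLP's maximum entropy model, which conditions on much richer context:

```python
from collections import Counter, defaultdict

def train_baseline_tagger(tagged_sentences):
    """Train a most-frequent-tag baseline from [(word, tag), ...]
    sentences. A stand-in for the maxent model: same train/decode
    flow, far weaker features."""
    counts, tag_counts = defaultdict(Counter), Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
            tag_counts[tag] += 1
    default = tag_counts.most_common(1)[0][0]   # tag for unseen words
    model = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return model, default

def tag(model, default, words):
    """One left-to-right pass assigning each token its most frequent tag."""
    return [(w, model.get(w.lower(), default)) for w in words]
```

Swapping the per-token lookup for a classifier over contextual features yields the actual maxent tagger; the decoding loop is unchanged.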
SLIDE 41
Development of NLP Components
!
ClearNLP dependency parser
! Transition-based, non-projective dependency parsing
approach using bootstrapping.
! Takes about 2 milliseconds per sentence to parse. ! Showed state-of-the-art performance on the CoNLL’09 data
for English and Czech (Choi & Palmer, 2011a).
!
ClearNLP semantic role labeler
! Transition-based semantic role labeling approach using rich-
semantic features from dependency structure.
! Showed state-of-the-art performance for the CoNLL’09 data
for English (Choi & Palmer, 2011b).
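The transition-based idea behind both components can be illustrated with a minimal projective arc-standard parser, where a gold-derived oracle stands in for the trained classifier (ClearNLP's actual system is richer and handles non-projective trees):

```python
def arc_standard_parse(n_tokens, decide):
    """Minimal arc-standard transition system: SHIFT pushes the next
    token; LEFT/RIGHT attach the top two stack items. `decide` plays
    the role of the classifier, returning "shift"/"left"/"right"."""
    stack, buf, arcs = [], list(range(n_tokens)), []
    while buf or len(stack) > 1:
        action = decide(stack, buf, arcs) if len(stack) >= 2 else "shift"
        if action == "shift":
            if not buf:          # safety guard; unreachable with a sound oracle
                break
            stack.append(buf.pop(0))
        elif action == "left":           # stack[-1] heads stack[-2]
            arcs.append((stack[-1], stack.pop(-2)))
        else:                            # "right": stack[-2] heads stack[-1]
            dep = stack.pop()
            arcs.append((stack[-1], dep))
    return arcs

def make_oracle(gold_heads):
    """Static oracle deriving the gold transitions; gold_heads[i] is the
    head index of token i (-1 for the root). In a real parser this is
    replaced by a model trained over rich features."""
    def deps_attached(tok, arcs):
        gold_deps = {d for d, h in enumerate(gold_heads) if h == tok}
        return gold_deps <= {d for _, d in arcs}
    def decide(stack, buf, arcs):
        s1, s0 = stack[-2], stack[-1]
        if gold_heads[s1] == s0 and deps_attached(s1, arcs):
            return "left"
        if gold_heads[s0] == s1 and deps_attached(s0, arcs):
            return "right"
        return "shift"
    return decide
```

Each sentence is parsed in a bounded number of transitions, which is what makes the millisecond-per-sentence speed quoted above achievable.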
SLIDE 42
Evaluation of NLP Components
!
Training data distribution
! WSJ-9, the Penn Treebank (Wall Street Journal): 901K tokens. ! WSJ-1, a Penn Treebank (Wall Street Journal) subset: 148K tokens. ! The MiPACQ Treebank: 148K tokens. ! WSJ-1 + MiPACQ Treebank: 295K tokens. ! WSJ-9 + MiPACQ Treebank: 1.05M tokens.
            WSJ-9 (901K)   WSJ-1 (148K)   MiPACQ (148K)   WSJ-1 + MiPACQ   WSJ-9 + MiPACQ
Sentences   37,015         6,006          11,435          17,441           43,021
Tokens      901,673        147,710        147,698         295,408          1,049,383
Predicates  96,159         15,695         16,776          32,471           111,854
SLIDE 43
Evaluation of NLP Components
!
Evaluation data distribution
! The MiPACQ Treebank: clinical notes (colon cancer). ! The MiPACQ Treebank: pathology notes. ! The SHARP Treebank: radiology notes. ! The THYME Treebank: clinical/pathology notes (colon cancer).
            MiPACQ CN   MiPACQ PA   SHARP     THYME
Sentences   893         203         9,070     9,107
Tokens      10,865      2,701       119,912   102,745
Predicates  1,355       145         8,573     8,866
SLIDE 44
Evaluation of Part-of-speech Tagging
Accuracies
SLIDE 45
Evaluation of Dependency Parsing
Labeled attachment scores
SLIDE 46
Evaluation of Semantic Role Labeling
F1-scores of argument identification + classification
SLIDE 47
Constituency Parser (not in the paper; preliminary results)
!
A wrapper around the OpenNLP parser implementing Ratnaparkhi's MaxEnt parser (Ratnaparkhi, 1997)
! Labeled F1 0.81 (WSJ and MiPACQ)
SLIDE 48
Discussion
SLIDE 49
Discussion I
!
All resources are available
! Corpus (DUA required; send email to
Guergana.Savova@childrens.harvard.edu)
! Guidelines ! NLP components (ctakes.apache.org)
!
UMLS IAA similar to previously reported
!
Component results can be used as baselines to develop future improved parsers
!
Domain specific data (even in small quantities) improves performance
!
The need for a community-adopted annotation scheme (UMLS?)
SLIDE 50
Discussion II
!
Rich syntactic and semantic annotations are building blocks to more complex tasks (discovery of implicit arguments, inferencing)
!
Addition of coreference and temporal relations (SHARPn, THYME)
!
Active learning (Settles, 2010; Miller et al, 2012)
SLIDE 51
References
!
Jinho D. Choi, Martha Palmer, “Getting the Most out of Transition-based Dependency Parsing”, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 687-692, Portland, Oregon, 2011a.
!
Jinho D. Choi, Martha Palmer, “Transition-based Semantic Role Labeling Using Predicate Argument Clustering”, Proceedings of ACL workshop on Relational Models of Semantics (RELMS'11), 37-45, Portland, Oregon, 2011b.
!
Jinho D. Choi, Martha Palmer, “Guidelines for the Clear Style Constituent to Dependency Conversion”, Technical report 01-12: Institute of Cognitive Science, University of Colorado Boulder, Boulder, CO, 2012.
!
More references are in the published manuscript…