Using NLP approaches on clinical and biomedical textual data - - PowerPoint PPT Presentation



SLIDE 1

Using NLP approaches on clinical and biomedical textual data

Thierry Hamon

Institut Galilée - Université Paris 13, Villetaneuse, France & LIMSI-CNRS, Orsay, France
hamon@limsi.fr
http://perso.limsi.fr/hamon/

March 2014
ERASMUS Mobility - Mälardalen University (MDH) - Västerås - Sweden

Thierry Hamon (LIMSI & Paris Nord) Using NLP approaches on clinical and biomedical textual data March 2014 1 / 66

SLIDE 2

Presentation of three applications

Mining literature to identify relations between risk factors and their pathologies
Exploring graph structure to acquire synonym relations from terminological resources
Mining patients’ Electronic Health Records (discharge summaries)

SLIDE 3

Risk factors

Mining literature to identify relations between risk factors and their pathologies

Work with Martin Graña (a), Víctor Raggio (a), Hugo Naya (a) and Natalia Grabar (b)

(a) Unidad de Bioinformática, Institut Pasteur de Montevideo, Mataojo 2020, Montevideo 11400, Uruguay

(b) UMR 8163 Savoirs, Textes, Langage (STL), Université Lille 3, France

SLIDE 4

Risk factors

Mining literature to identify relations between risk factors and their pathologies

Risk factors: a complex notion

  behaviour, environmental condition, disease, genetics...
  increase people’s chance to develop a given disease

⇒ Discover risk factors and design prevention strategies
Research from biology, epidemiology, medicine, public health
Despite intensive activity, the knowledge is not complete

  coronary heart disease: only 50% of risks known (Allen, 2000)

Information on risk factors is widespread over the web:

  websites, bibliographical databases, ...

⇒ Difficulties: reliability and access

SLIDE 5

Risk factors

Previous work

Intensive recent activity in text mining:

  scientific literature: BioCreAtIvE, TREC Genomics
  clinical records: I2B2 NLP challenges (specific task in 2014)

Risk factor studies: data mining

  managing a large number of variables (Ahmad & Bath, 2005)
  groups with similar risks/ICD-9 codes (16)
  claim costs in insurance companies (17)
  KDD challenge 2004 (http://lisp.vse.cz/challenge)
    identify atherosclerosis risk factors
    monitor the evolution of these risks and their impacts

Processing of narratives (18):

  breast cancer risk factors
  combination of manual and automatic meta-analysis
  findings consistent with known studies
    positive association with alcohol consumption
    negative association with former smoking

SLIDE 6

Risk factors

Objectives

Massive exploitation of Medline bibliographical database

text mining methods applied to full text

Extraction of risk factors and their associations to health conditions

SLIDE 7

Risk factors

Material

Bibliographical database Medline

  over 18 M citations
  ⇒ titles, abstracts, MeSH indexing

MeSH

  thesaurus for information storage and retrieval
  ⇒ MeSH headings

Snomed CT

  nomenclature for organizing and exchanging clinical data
  rich semantic network: terms and relations
  three relations explicit on risk factors and health conditions:
    has causative agent: direct cause of the disorder or finding
      bacterial endocarditis has causative agent bacterium
    due to: relates a clinical finding directly to its cause
      acute pancreatitis due to infection
    associated with: clinically relevant association between concepts without either asserting or excluding a causal or sequential relationship between the two

SLIDE 8

Risk factors

Bibliographical database Medline

Automated detection of potentially relevant citations

risk factors, factor of risk

Annotation of Medline citations with linguistic information

Ogmios NLP platform (Hamon et al., 2007)
  Segmentation, POS-tagging & lemmatization: Genia Tagger (Tsuruoka et al., 2005)
  Term extraction and recognition: YaTeA (Aubin & Hamon, 2006)

SLIDE 9

Risk factors

Information extraction

Corresponding pathologies and health conditions

Semantico-syntactic patterns

5 patterns for risk factors and pathologies
12 patterns for handling enumerations
3 patterns for pathologies

Example: <NP-RF> as a risk factor for <NP-P>, where

  as a risk factor for: trigger sequence
  <NP-RF>: noun phrases corresponding to risk factors
  <NP-P>: noun phrases corresponding to pathologies
  ? and *: optional and recurrent elements
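The trigger-sequence pattern can be illustrated with a minimal sketch. It assumes plain strings and approximates the noun phrases <NP-RF> and <NP-P> with crude regular-expression groups, whereas the approach described here relies on the linguistic annotations produced by the NLP platform. The names `PATTERN` and `extract_pair` are hypothetical.

```python
import re

# Trigger sequence "as a risk factor for"; the two named groups are rough
# stand-ins for <NP-RF> and <NP-P> (assumption: no POS tags or term
# boundaries are available in this sketch).
PATTERN = re.compile(
    r"(?P<rf>[\w\s-]+?)\s+as\s+a\s+risk\s+factor\s+for\s+(?P<p>[\w\s-]+)",
    re.IGNORECASE,
)


def extract_pair(sentence):
    """Return (risk factor, pathology) if the trigger sequence matches, else None."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return match.group("rf").strip(), match.group("p").strip()


example = (
    "Passive smoking at work as a risk factor for coronary heart disease "
    "in Chinese women who have never smoked"
)
print(extract_pair(example))
```

Without noun-phrase boundaries the second group swallows the rest of the sentence, which is one reason a real system would need syntactic information rather than raw strings.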

MeSH descriptors of citations

Descriptors belonging to C heading of diseases

SLIDE 10

Risk factors

Information extraction

Risk factors

Coordination and enumeration:

Risk factors for survival were age and severity of aortic stenosis ... (PMID 8705769)

...a high intake of calcium and phosphorus is a risk factor for the development of metabolic acidosis. (PMID 1435825)

...had more than one of the common risk factors for cerebrovascular accidents, including hypertension, advanced age, hyperfibrinogenemia, diabetes mellitus, and past history of cerebrovascular accident. (PMID 1560589)

SLIDE 11

Risk factors

Evaluation

1. Quality and exhaustiveness of risk factors for a given pathology
2. Associations risk factor/pathology, by comparison between:
     text mining results
     MeSH indexing
3. Comparison between:
     text mining results
     Snomed CT causal and associative relations

Evaluation of precision:

  ratio of correct extractions among the results

Manual evaluation:

  no dedicated and comprehensive gold standard is available

SLIDE 12

Risk factors

Building and preparing the material

Medline material:

  187,544 citations selected: over 42 M word occurrences
  processed through the Ogmios platform

Snomed CT

  accessed through UMLS (2008AB)
  154,130 pairs pathology/causative agent, pathology/pathology:
    92,807 relations has causative agent
    25,309 relations due to
    36,134 relations associated with
    (120 relations provided by several Snomed CT relationships)

SLIDE 13

Risk factors

Extraction of information on risk factors and pathologies

Application of three kinds of patterns

(1) {risk factor, pathology}, (2) risk factors, (3) pathologies

Definition of relations:

  direct relations with patterns {risk factor, pathology}
  combination of information provided by (2) and (3)

10,445 PMIDs provide information
313 pairs {risk factor, pathology}
15,398 pairs by combination of (2) and (3)
5,873 risk factors (2) not associated with any pathology
MeSH indexing: 5,106 pathologies and health conditions
21,584 triplets {risk factor, pathology_text?, pathology_MeSH?}

  17,620 (14,895) pairs provided only by information extraction patterns
  5,717 (4,412) pairs contain MeSH descriptors as pathology

SLIDE 14

Risk factors

Evaluation

Risk factors for coronary heart disease (CHD)

CHD: the most common heart disease
Important cause of premature death all around the world
Evaluation by a medical doctor
1,102 risk factors extracted:

  128 (11.62%) rejected = 88.38% precision
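The precision figure follows directly from the two counts. A minimal sketch of the arithmetic, where the function name `precision` is only illustrative:

```python
def precision(extracted, rejected):
    """Precision = correct extractions / all extractions."""
    return (extracted - rejected) / extracted


# Counts reported for coronary heart disease: 1,102 extracted, 128 rejected.
chd_precision = precision(1102, 128)
print(f"{chd_precision:.2%}")      # share of correct extractions
print(f"{1 - chd_precision:.2%}")  # share of rejected extractions
```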

Well known risk factors found

hypertension, smoking, diabetes, age, obesity, hypercholesterolemia, hyperlipidemia, family history ...

Detection of synonyms

  {smoking; cigarette smoking; smoking history; importance of total life consumption of cigarettes}
  {hyperhomocysteinemia; hyperhomocysteinaemia; homocysteine; plasma homocysteine}

Error (?): {CHD, work}

Passive smoking at work as a risk factor for coronary heart disease in Chinese women who have never smoked

SLIDE 15

Risk factors

Evaluation

Comparison between MeSH-indexed and extracted pathologies

291 triplets {risk factor, pathology_text?, pathology_MeSH?}

Pairs {pathology_text, pathology_MeSH}:

  42 identical
  32 synonyms ({breast cancer, breast neoplasms}, {coronary artery disease, coronary disease})
  28 lexically included ({alzheimer, alzheimer disease}, {unsuspected anaphylaxis, anaphylaxis})
  101 close semantics ({poor pregnancy outcome, fetal growth retardation}, {development of alcohol disorders, alcoholism})
  7 broad semantics ({tardive dyskinesia, dyskinesia, drug-induced})
  91 not related semantically

Among 291 generated triplets:

  91 extracted pathologies (31%) not relevant
  nearly 70% are identical, or have a close or broad semantic relation with the MeSH indexing

MeSH indexing very often relevant to the addressed pathology

SLIDE 16

Risk factors

Evaluation

Comparison of extracted risk factors with three Snomed CT associative relations

Comparison with three Snomed CT causative relationships

has causative agent, due to, associated with

Evaluation performed by a computer scientist
22,730 propositions related to 168 various pathologies
Analysis of 20 pathologies (3,100 extractions, about 25%):

  19 extractions (0.6%) considered as already recorded in Snomed CT

  ...how patients with abundant alcohol consumption as a risk factor develop the chronic alcohol abuse episode of care...
    extracted: {abundant alcohol consumption, alcoholism}
    Snomed CT: drinking alcohol (C0589068) has causative agent alcoholism (C0001973)

Snomed CT is not dedicated to risk factors, but they may occur

  extracted for acquired immunodeficiency syndrome: bisexuality, bisexual, blood transfusion, intravenous drug abuse
  these are Snomed CT terms, but there is no relation with acquired immunodeficiency syndrome

⇒ Creation of a dedicated resource for risk factors

SLIDE 17

Risk factors

Conclusion

Extraction of information related to risk factors
Relation with associated pathologies
Text mining approach based on semantico-syntactic patterns
Evaluation by medical doctor and computer scientist

  88.38% of risk factors related to coronary heart disease are correct
  about 70% of extracted pathologies are equivalent to the MeSH indexing
  Snomed CT is not dedicated to the recording of risk factors, although they may occur

⇒ Creation of a dedicated resource for risk factors is suitable

SLIDE 18

Risk factors

Perspectives

Other patterns, e.g. predictor, precursor ...
Machine learning methods
Knowledge representation:

  homogeneous groups of risk factors
  environmental, social, clinical, behavioral ...

Characterization of this information

  modal, negative contexts
  geographical, demographic variation

SLIDE 19

Exploring graph structure

Exploring graph structure for detection of reliability zones within synonym resources

Experiment with the Gene Ontology
Work with Natalia Grabar (CNRS - Lille 3)
Objectives: exploring reliable indicators for profiling synonymy relations

Linguistic indicators:

  Productivity of the relations between terms
  Cooccurrences with other relationships (is-a, part-of) and lexical inclusion

Graph theory notions:

  Connectivity of the graphs
  Measures associated with the nodes and edges

SLIDE 20

Exploring graph structure

Semantic similarity between words or terms

Knowledge on semantic similarity between words or terms is useful within several applications:

  Information retrieval (query expansion)
  Knowledge extraction
  Terminology matching
  Terminology/ontology building
  ...

SLIDE 21

Exploring graph structure

Gene Ontology (GO)

Goal: propose a structured and controlled vocabulary for describing the roles of genes and their products in any organism
Three hierarchies: biological processes, molecular functions and cellular components
Relationships: synonymy, hyperonymy (is-a) and meronymy (part-of) relations
Used version:

  79,994 terms
  459,834 synonyms
  269,339 is-a relations
  29,573 part-of relations

⇒ Original resource for inducing elementary semantic relations

SLIDE 22

Exploring graph structure

Observations on compositionality of GO terms

Compositionality: the meaning of a complex expression is fully determined by its syntactic structure, the meaning of its parts and the composition function (Partee, 1984)

GO terms are often coined on the same scheme

GO:0009073 (original synonyms):

  aromatic amino acid family biosynthesis
  aromatic amino acid family anabolism
  aromatic amino acid family formation
  aromatic amino acid family synthesis

It is possible to induce the set of elementary synonyms: biosynthesis, anabolism, formation, synthesis

SLIDE 23

Exploring graph structure

Observations on compositionality of GO terms

Work based on the compositionality of GO (string matching approaches):

  Deriving simple graphs from relations between complex GO terms (Verspoor et al. 2003)
  Consistency checking of the GO (Mungall 2004)
  Enriching the GO with missing synonym terms (Ogren et al. 2004)

Our approach: exploiting syntactic invariants of terms to induce semantic relations

SLIDE 24

Exploring graph structure

Acquisition of elementary semantic relations

Hypothesis: compositionality preserves relations within complex terms

Meaning M of two complex terms A rel B and A′ rel B:

  M(A rel B) = f(M(A), M(B), M(rel))
  M(A′ rel B) = f(M(A′), M(B), M(rel))

If terms A rel B and A′ rel B are semantically related
⇒ the same semantic relation between their components A and A′ can be induced

Semantic relations: synonymy, hyperonymy and meronymy relations

SLIDE 25

Exploring graph structure

Acquisition of synonymy relations

Elementary synonym relation between components of two terms can be induced iff:

1. the terms are synonymous;
2. these components are located at the same syntactic position (head or expansion);
3. these components have the same syntactic category;
4. the other components within the terms are either synonymous or identical.

Example:

  replication (of) mitochondrial DNA
    head component: replication
    expansion component: mitochondrial DNA
  replication mtDNA
    head component: replication
    expansion component: mtDNA
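The four conditions can be sketched in a few lines. This is a simplified illustration, assuming each term has already been parsed into a (head, expansion) pair and that the two terms are known to be synonymous (condition 1). The syntactic-category check (condition 3) is assumed to hold and is not modelled. The function name `induce_elementary_synonym` and the `known_synonyms` parameter are hypothetical.

```python
def induce_elementary_synonym(term_a, term_b, known_synonyms=frozenset()):
    """Induce an elementary synonym pair from two synonymous parsed terms.

    Each term is a (head, expansion) tuple, a simplified stand-in for the
    syntactic analysis. Returns a frozenset of two components, or None.
    """
    head_a, exp_a = term_a
    head_b, exp_b = term_b

    def equivalent(x, y):
        # Condition 4: the other components are identical or already synonymous.
        return x == y or frozenset((x, y)) in known_synonyms

    # Condition 2: compare components at the same syntactic position only.
    if equivalent(head_a, head_b) and exp_a != exp_b:
        return frozenset((exp_a, exp_b))    # both in expansion position
    if equivalent(exp_a, exp_b) and head_a != head_b:
        return frozenset((head_a, head_b))  # both in head position
    return None


pair = induce_elementary_synonym(
    ("replication", "mitochondrial DNA"),
    ("replication", "mtDNA"),
)
print(sorted(pair))
```

Applied to two synonymous terms sharing the head "replication", the sketch returns the two expansion components as an elementary synonym pair.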

SLIDE 28

Exploring graph structure

Acquisition of hyperonymy and meronymy relations

Hyperonymy:

  cell activation
    head component: activation
    expansion component: cell
  astrocyte activation
    head component: activation
    expansion component: astrocyte

Meronymy:

  development (of) cerebral cortex
    head component: development
    expansion component: cerebral cortex
  regionalization (of) cerebral cortex
    head component: regionalization
    expansion component: cerebral cortex

SLIDE 29

Exploring graph structure

Profiling synonym resource

Expected problems:

  Contextuality of the synonymy relations (Cruse, 1986): two terms or words are considered as synonyms if they can occur within the same context
  Ability of automatic tools to detect and characterize the relations: two terms or words taken out of their context can convey different relations than the one expected

SLIDE 30

Exploring graph structure

Exploiting linguistic indicators

Elementary is-a relations
Elementary part-of relations
Lexical inclusion: nested terms convey a hierarchical relation {DNA binding, binding}
Productivity: number of original GO relations from which an elementary relation is inferred

  relation                  productivity
  {binding, DNA binding}    1
  {cell, lymphocyte}        1
  {T-cell, T-lymphocyte}    8

Such factors can help to profile synonymy relations but they are not sufficient (Grabar & Hamon 2008)
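Productivity amounts to counting how often the same elementary relation is induced. A minimal sketch, assuming the induced pairs are available as a list with one entry per original GO relation; the input list below is made-up toy data, not the counts reported here, and the function name `productivity` is hypothetical.

```python
from collections import Counter


def productivity(induced_pairs):
    """Count how many original relations yielded each elementary relation."""
    # frozenset makes the pair unordered: {a, b} and {b, a} count together.
    return Counter(frozenset(pair) for pair in induced_pairs)


# Toy input (hypothetical): one entry per original relation that produced the pair.
toy_pairs = [
    ("T-cell", "T-lymphocyte"),
    ("T-lymphocyte", "T-cell"),
    ("cell", "lymphocyte"),
]
counts = productivity(toy_pairs)
print(counts[frozenset(("T-cell", "T-lymphocyte"))])
```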

SLIDE 31

Exploring graph structure

Exploiting graph theory notions

Representation of the words and relations as a graph structure

  words and terms: nodes
  elementary relations: edges

Linguistic indicators (elementary relations and productivity) are associated with the relations

→ Analysis of the connectivity of subgraphs

  Connected components (CC) and cliques
  Density: ratio between the number of edges of the CC and the number of edges of the corresponding clique
  Bridge: edge whose removal increases the number of connected components
  Centrality of a node: number of shortest paths passing through it
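The density and bridge definitions can be checked on a small example. The sketch below uses only the standard library and a hypothetical four-node graph (a triangle plus one pendant node); the node names and function names are illustrative, and centrality is left out for brevity.

```python
def connected_components(nodes, edges):
    """Group nodes into connected components with a depth-first search."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return components


def density(n_nodes, n_edges):
    """Edges of the component divided by edges of the clique on the same nodes.

    Assumes n_nodes >= 2, since a component here always links two terms.
    """
    return n_edges / (n_nodes * (n_nodes - 1) / 2)


def bridges(nodes, edges):
    """Edges whose removal increases the number of connected components."""
    base = len(connected_components(nodes, edges))
    return [
        edge for edge in edges
        if len(connected_components(nodes, [e for e in edges if e != edge])) > base
    ]


# Hypothetical graph: triangle a-b-c with a pendant node d attached to c.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(round(density(len(nodes), len(edges)), 3))
print(bridges(nodes, edges))
```

The brute-force bridge search recomputes the components once per edge, which is acceptable for components of a few dozen nodes; a linear-time algorithm would be preferable on larger graphs.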

SLIDE 32

Exploring graph structure

Preprocessing of GO terms

Ogmios NLP platform

  Word segmentation
  POS-tagging, lemmatisation: TreeTagger (Schmid, 1994)
  Syntactic parsing of terms: YaTeA (Aubin & Hamon, 2006)

[Figure: syntactic analysis (head and expansion components) of the terms "acetone anabolism" and "gamma-aminobutyric acid secretion"]

SLIDE 33

Exploring graph structure

Elementary semantic relation acquisition on GO

79,994 GO terms fully parsed through the NLP platform
Original semantic GO relations are used for inducing elementary relations

  459,834 synonym relations ⇒ 3,019 relations
  269,339 hyperonymy relations ⇒ 3,243 relations
  29,537 meronymy relations ⇒ 1,205 relations

SLIDE 34

Exploring graph structure

Analysis of the results

Linguistic indicators

2,533 synonymy relations are free of the relation indicators
486 synonymy relations (16%) cooccur with other relations

  142 synonym relations are also labelled as is-a
  34 synonym relations are also labelled as part-of
  40 synonym relations are also labelled as both is-a and part-of
  309 synonym relations are also labelled as incl (lexical inclusion)

Productivity of the elementary synonyms is between 1 and 422

SLIDE 35

Exploring graph structure

Analysis of the results

Graph theory notions

3,019 elementary synonyms grouped into 1,018 connected components
Connected component size: between 2 and 69 nodes, between 1 and 132 edges
914 cliques: 708 with 2 nodes, 66 with 3 nodes, 88 with 4 nodes, 44 with 5 nodes, 8 with 6 nodes
Density between 0.0467 and 1
104 connected components with density < 1
249 bridges among the 104 CC (between 0 and 35)
Analysis of CCs: identify clues that could be useful for profiling relations

SLIDE 36

Exploring graph structure

Analysis of a connected component

[Figure: connected component with 4 nodes (IMS, envelope lumen, membrane lumen, intermembrane space) and 4 SYN edges with productivity 2, 2, 2 and 1; density = 0.667; annotated with one bridge and one articulation vertex (centrality = 4)]

Density: information about the semantic cohesiveness of the relations (connected component)
the higher the density, the stronger the semantic cohesion

SLIDE 37

Exploring graph structure

Analysis of a connected component

[Figure: connected component with 7 nodes (cycling, uptake, regeneration, ion import, import, salvage, recycling) and 8 edges; density = 0.381; annotated with two bridges and three articulation vertices (centrality = 18, 10 and 16). Edge labels (productivity : relation types): 1|2 : SYN, HIER; 1 : SYN; 1 : SYN; 1 : SYN; 1|1 : SYN, INCL; 7|7 : SYN, HIER; 1|2 : SYN, HIER; 2|3 : SYN, HIER]

Bridge: information about the weakness of the relation, especially when synonymy co-occurs with other relationships (is-a, part-of, lexical inclusion)

SLIDE 38

Exploring graph structure

Analysis of a connected component

[Figure: connected component with 10 nodes (cell cycle modulation, cell cycle regulator, cell cycle regulation, regulation, ion homeostasis, control, regulator, modulation, cell cycle control, homeostasis) and 15 edges; density = 0.333; annotated with three bridges and three articulation vertices (centrality = 15, 52 and 37). Edge labels (productivity : relation types): eight edges 1 : SYN; five edges 5 : SYN; one edge 6 : SYN; one edge 4|1|1 : SYN, HIER, INCL]

Centrality: identification of polysemic words or terms

SLIDE 39

Exploring graph structure

Conclusion

Method for inferring elementary semantic relations:

  Exploiting the compositionality principle
  Applying a set of rules based on syntactic dependency analysis
  Based on the use of structured terminologies
  Application to Gene Ontology

Exploration of factors for profiling synonymy relations

  Linguistic factors: combination of elementary relationships, lexical inclusion, productivity
  Graph theory notions: connectivity (connected components, cliques), density, bridge, centrality

Helpfulness of the indicators for:

  Profiling acquired synonymy
  Preparing the validation of the lexicon by experts

SLIDE 40

Exploring graph structure

Perspectives

→ (done) Use of PageRank (Hamon et al. 2012)
Formalisation of the approach: assignment of numeric weights to edges
Taking into account the nature of the synonymy relations in Gene Ontology (exact, broad, narrow, related)
Cross-validation of the relationships (synonymy, is-a, part-of)
Enriching and extending the Gene Ontology

  Connected component with high density: indication about missing relations

SLIDE 41

Medication and Assertion

Introduction

Patients’ Electronic Health Records (discharge summaries)

  description of the hospitalisation
  plenty of (personal) information about a patient:
    Problems
    Therapies (treatments, drugs, etc.)
    Tests and analysis (lab data, etc.)
    Assertions regarding facts (certainty, hypothesis, etc.)
    ...
  the best way to record information (databases are difficult to maintain)

Application:

  enrichment of the information extraction
  databases become richer and more precise
  queries with better results

SLIDE 42

Medication and Assertion

Introduction

Mining of the discharge summaries

Identification and extraction of

  medication names administered to patients
  related information (dosage, duration, frequency, mode of administration, reason for prescription)
  assertion: certainty and uncertainty of the information in the medical texts
    focus on the relation {patient / medical problem}

BUT the texts are written by practitioners: in a hurry, with mistakes, with little or incorrect syntactic structure, etc.

Work with Natalia Grabar (CNRS - Lille3) and Amandine Périnet (LIM&BIO)

SLIDE 43

Medication and Assertion

Examples (excerpt 1)

medication name and associated information

The patient is currently off diuretics at this time. Daily weights should be checked and if her weight increases by more than 3 pounds Dr. Bockoven should be notified. The patient was also started on calcitriol given elevation of parathyroid hormone. Cardiovascular: Rate and rhythm: The patient has a history of atrial fibrillation with a slow ventricular response. The patient was started on metoprolol 12.5 mg p.o. q.6 h. for rate control , however , this dose was decreased to 12.5 mg p.o. twice a day, given some bradycardia on her telemetry. The patient was also started on Flecainide 75 mg p.o. q.12 h. She will continue on these two medications upon discharge.

SLIDE 44

Medication and Assertion

Examples (excerpt 2)

Assertion {medical problem – patient}

FAMILY HISTORY : Noncontributory . REVIEW OF SYSTEMS : The review of systems is negative , except as above . ALLERGIES : The patient has no known drug allergies . PHYSICAL EXAMINATION : The physical examination reveals a pleasant 70 year old male in no acute distress . The blood pressure was 113/70 , pulse 66 , respirations 20 , and a temperature of 96.8 . HEENT showed pupils equal , round , reactive to light and accommodation , and extra ocular movements intact . The sclerae were anicteric . The neck was supple , without jugular venous distention or increased thyroid . Nodes : There was no appreciable cervical , supraclavicular , axillary , or inguinal adenopathy .

The chest was clear to auscultation and percussion . Cardiac : regular rate and rhythm , with a I / VI systolic ejection murmur . The abdomen showed a 10 cm liver with the edge palpable 2 fingerbreadths below the right costal margin . The extremity exam was notable for 2+ pitting edema in the right lower extremity extending to the knee . The neurological examination was nonfocal .

SLIDE 45

Medication and Assertion

Examples (excerpt 3)

ADMISSION MEDICATIONS: Coumadin 2.5 mg p.o. daily , Zocor 20 mg p.o. daily , atenolol 100 mg p.o. b.i.d. , Lasix 80 mg p.o. b.i.d. , lisinopril 40 mg p.o. daily , Medrol 10 mg daily , glyburide 10 mg p.o. b.i.d. , potassium chloride 10 mg p.o. b.i.d. , and multivitamin 1 tab p.o. daily.

SLIDE 46

Medication and Assertion

Examples (excerpt 4)

DISCHARGE MEDICATIONS:
HYDROCORTISONE 2.5% -RECTAL CREAM TP BID Instructions: Apply to hemorrhoids
BEN-GAY TOPICAL TP BID Instructions: Apply liberally to legs Alert overridden: Override added on 9/8/03 by FACK , PASQUALE DIEGO , M.D. DEFINITE ALLERGY ( OR SENSITIVITY ) to SALICYLATES Reason for override: aware
PREMARIN ( CONJUGATED ESTROGENS ) 1.25 MG PO QD
LASIX ( FUROSEMIDE ) 60 MG qam; 40 MG qpm PO BID 60 MG qam 40 MG qpm Starting Today ( 0/29 )
METAMUCIL SUGAR FREE ( PSYLLIUM ( METAMUCIL ) SU... ) 1 PACKET PO TID Instructions: With meals
NORVASC ( AMLODIPINE ) 10 MG PO QD Food/Drug Interaction Instruction Avoid grapefruit unless MD instructs otherwise.
AMBIEN ( ZOLPIDEM TARTRATE ) 10 MG PO QHS PRN insomnia

SLIDE 47

Medication and Assertion

Examples (excerpt 5)

RRR , lots of BS’s , neuro nonfocal , ext with 1+ edema. On atenolol , zestril , norvasc , premarin , detrol , lasix 60 qd , nebs prn at home. Labs sig for Cr 0.7 , CK 48 , TnI .05 , QBC 9.5 , Hct 41.3. From CV point of view , thought to be CHF exac. ROMI’d without events on monitor and diuresed 2L/day. IV Lasix 80 bid to start transitioned to 60 po bid. BNP>assay. 6/17 dobut MIBI with mod sized ant septal wall defect c/w diagonal lesion , 3/22 Echo with EF 55-60% , mild LAE/RAE , no WMA , mod large RV. No further CV studies. Cont previously meds on d/c. From FEN point of view , 2 L fluid restriction , 2 g Na restriction. Nutrition consult , but pt very resistant to diet changes. From GI point of view , GERD; nexium started. From pulm point of view , CXR c/w sl fluid overload , no focal findings , no pulm edema. Given NC O2 and BiPAP at night.

SLIDE 48

Medication and Assertion

Context

I2B2 shared tasks 2009 and 2010
Challenges in Natural Language Processing for Clinical Data
https://www.i2b2.org/NLP/Main.php

  Obesity Challenge (Who’s obese and what co-morbidities do they (definitely/likely) have?), 2008
  Medication Extraction Challenge, 2009
  Relations Challenge, 2010
  Coreference Challenge, 2011
  Temporal relations Challenge, 2012

3 months to adapt the system

slide-49
SLIDE 49

Medication and Assertion

Objectives

Medication task

Identification of:
  Medication names: Premarin, LASIX, potassium chloride
  Related information:
    frequency (f): qam, qpm, half hour before meals
    dosage (do): 1.25 MG, 60 MG, home dose, large dose, 1 TAB 04 mg
    mode of administration (mo): PO, orally, iv, applied to face
    duration (du): constantly, for 7 to 10 days, x10 days, for an additional two week course
    reason (r): pain, jugular venous distention
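
The fields listed above can be pictured as one record per medication mention. The following is only a minimal sketch: the `MedicationEntry` class and its `filled_fields` helper are hypothetical names introduced for illustration, not part of the challenge tooling; the field abbreviations follow the slide, and the values come from the PREMARIN line of the discharge summary excerpt.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MedicationEntry:
    """One medication mention with its related information (hypothetical structure)."""
    m: str                      # medication name
    do: Optional[str] = None    # dosage
    mo: Optional[str] = None    # mode of administration
    f: Optional[str] = None     # frequency
    du: Optional[str] = None    # duration
    r: Optional[str] = None     # reason

    def filled_fields(self) -> List[str]:
        """Return the names of the fields that carry a value, in declaration order."""
        return [name for name, value in vars(self).items() if value is not None]


# "PREMARIN ( CONJUGATED ESTROGENS ) 1.25 MG PO QD" from the excerpt
entry = MedicationEntry(m="PREMARIN", do="1.25 MG", mo="PO", f="QD")
```

Duration and reason stay empty here because the excerpt line does not state them.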

slide-50
SLIDE 50

Medication and Assertion

Assertion task

Degree of certainty

Assertion
  Certainty
    Positive certainty: The patient suffers from abdominal pain
    Negative certainty: The patient denies suffering from abdominal pain
  Possibility: It was thought that the patient might suffer from abdominal pain
  Hypothesis: The patient is to call the hospital if he suffers from abdominal pain
  Condition: With shrimps, the patient suffers from abdominal pain

slide-51
SLIDE 51

Medication and Assertion

Enriching documents with linguistic information

[Figure: annotation pipeline, from an XML document with structural annotations to an XML document with linguistic and structural annotations. Steps: tokenisation, word and sentence segmentation, named entity tagging, part-of-speech tagging, lemmatisation, extraction of terms, tagging of the terms, semantic tagging. Resources: dictionary of named entities, specialised lexicon, terminology, ontology.]

Symbolic approach: use of NLP methods
  Terminological resources and disambiguation rules
  Concurrent annotations and annotation selection

Design of post-processing modules for
  Annotation disambiguation
  Establishment of dependency relations between patient, medication names and related information, or assertion

Annotation based on the Ogmios NLP platform (developed during the EU Project Alvis)

slide-52
SLIDE 52

Medication and Assertion

Enriching document with linguistic information

Identification of the sentences

[Figure: example sentence "The promoter region of each of these genes has two GerE binding sites"]

slide-53
SLIDE 53

Medication and Assertion

Enriching document with linguistic information

Identification of the sentences, words

[Figure: example sentence "The promoter region of each of these genes has two GerE binding sites"]

slide-54
SLIDE 54

Medication and Assertion

Enriching document with linguistic information

Identification of the sentences, words, named entities

[Figure: example sentence "The promoter region of each of these genes has two GerE binding sites"]

slide-55
SLIDE 55

Medication and Assertion

Enriching document with linguistic information

Identification of the sentences, words, named entities, and terms

[Figure: example sentence "The promoter region of each of these genes has two GerE binding sites"]

slide-56
SLIDE 56

Medication and Assertion

Enriching document with linguistic information

Identification of the sentences, words, named entities, and terms

Semantic enrichment of the units

[Figure: example sentence "The promoter region of each of these genes has two GerE binding sites", annotated with part-of-speech tags (DT, N, P, V, NP, Adj, CN) and semantic tags (promoter region, binding site, gene, gene name)]

slide-57
SLIDE 57

Medication and Assertion

Preparing material for document annotation

Medication task:
  Discharge summaries: 1,249 documents (training set 649, test set 553)
  17 manually annotated documents (for illustrating the annotation guidelines)

Assertion task:
  Discharge summaries: 1,249 documents (training set 349 annotated documents + 827 raw documents, test set 477)
  Assertions: 11,968 in the training set, 18,550 in the test set

slide-58
SLIDE 58

Medication and Assertion

Preparing material for document annotation

Medication tasks

Terminological resources and nomenclatures:

Medication names:
  RxNorm: 243,869 entries
  Ambiguous medication (red blood cells, magnesium, iron): specific status during the annotation process
  Therapeutic classes and groups of medication (FDA website)

Reasons:
  45,898 terms (Diagnosis and Morphology axes of the Snomed International)
  476 terms from the training set documents

slide-59
SLIDE 59

Medication and Assertion

Preparing material for document annotation

Medication tasks

Negation: 284 markers of the NegEx resource (differentiation of pre- and post-negation)

Named entity recognition and contextual rules:
  Automata for frequency, dosage, duration and mode of administration
  52 identification rules for reasons: characterization of Snomed Int NP and/or extracted terms as reasons
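
The automata themselves are not shown in the slides. As a rough illustration of the idea only, the sketch below approximates three of them with regular expressions; the patterns, the unit and abbreviation lists, and the `tag_related_information` function are all assumptions made for this example, and real discharge summaries would need far broader coverage.

```python
import re

# Illustrative patterns only; the actual automata are not given in the slides.
DOSAGE = re.compile(r"\b\d+(?:\.\d+)?\s?(?:mg|mcg|ml|tab|packet)\b", re.IGNORECASE)
FREQUENCY = re.compile(r"\b(?:qd|bid|tid|qid|qhs|qam|qpm|prn)\b", re.IGNORECASE)
MODE = re.compile(r"\b(?:po|iv|tp|orally)\b", re.IGNORECASE)


def tag_related_information(text):
    """Return (type, matched text, start, end) tuples sorted by position."""
    found = []
    for label, pattern in (("do", DOSAGE), ("f", FREQUENCY), ("mo", MODE)):
        for match in pattern.finditer(text):
            found.append((label, match.group(0), match.start(), match.end()))
    return sorted(found, key=lambda item: item[2])


# Line taken from the discharge medication excerpt
annotations = tag_related_information("NORVASC ( AMLODIPINE ) 10 MG PO QD")
```

Each recogniser runs independently, so overlapping matches of different types are possible and would still need a disambiguation step.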

slide-60
SLIDE 60

Medication and Assertion

Preparing material for document annotation

Assertion tasks

Problems are given
Only one known resource: NegEx
Definition of our own resources:

Term structure (33 markers)
  Lexical clues: on exertion (condition)
  Morphological clues: afebrile (negative certainty)

Context and document structure (342 markers)
  Clues in the sentence: ... could represent a multifocal pneumonic process (possible)
  Section headings: ALLERGIES, SOCIAL HISTORY, lists

Lexico-syntactic patterns (137 patterns)
  Hypothesis:
    be to (address | request | notify) @DT (office | clinic | hospital) if @PB
  Possibility:
    @TE to (evaluate | check | eval | consult) (from | if | with | against) @PB
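
To make the pattern notation concrete, the hypothesis pattern can be transcribed as a regular expression. This is a sketch under stated assumptions: the slides do not define the placeholders, so it assumes that `be` stands for any form of the verb, that `@DT` is a determiner, and that `@PB` marks a problem mention already replaced by the literal token `@PB`. The word lists for `BE` and `DT` and the `is_hypothesis` function are hypothetical.

```python
import re

# Assumptions: "be" = any form of the verb, @DT = a determiner,
# @PB = a problem mention already replaced by the literal token "@PB".
BE = r"(?:is|are|was|were|be)"
DT = r"(?:the|a|an|his|her|their)"
HYPOTHESIS = re.compile(
    rf"\b{BE} to (?:address|request|notify) {DT} (?:office|clinic|hospital) if @PB",
    re.IGNORECASE,
)


def is_hypothesis(sentence):
    """True when the sentence matches the transcribed hypothesis pattern."""
    return HYPOTHESIS.search(sentence) is not None
```

For instance, `is_hypothesis("The patient is to notify the clinic if @PB occurs")` matches, while a sentence such as "The patient denies @PB" does not.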

slide-61
SLIDE 61

Medication and Assertion

Concurrent annotation of documents

Named Entity Recognition (frequency, duration, dosage, mode of administration)
  + internal disambiguation (avoid nested annotations of different types and merge annotations of the same type)

Term and semantic tagging (medication and reasons, negation and reason marker, assertion) based on linguistic information (word and sentence segmentation, lemmatization)
  + internal disambiguation (nested terms, parenthesized medication names, etc.)
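
The internal disambiguation of named entities can be sketched over character spans. This is a minimal illustration, not the platform's implementation: annotations are assumed to be `(start, end, type)` tuples with an exclusive end, and the policy of keeping the longer span when types differ is an assumption of this example. The `disambiguate` function is a hypothetical name.

```python
def disambiguate(annotations):
    """Merge same-type overlaps, then drop spans nested in a longer span of another type.

    annotations: list of (start, end, type) tuples, end exclusive.
    """
    # 1. Merge overlapping or touching annotations of the same type.
    merged = []
    for start, end, kind in sorted(annotations, key=lambda a: (a[2], a[0], a[1])):
        if merged and merged[-1][2] == kind and start <= merged[-1][1]:
            prev_start, prev_end, _ = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), kind)
        else:
            merged.append((start, end, kind))

    # 2. Drop annotations nested inside a strictly longer annotation of a different type.
    kept = []
    for a in merged:
        nested = any(
            b[2] != a[2]
            and b[0] <= a[0]
            and a[1] <= b[1]
            and (b[1] - b[0]) > (a[1] - a[0])
            for b in merged
        )
        if not nested:
            kept.append(a)
    return sorted(kept)


result = disambiguate([(0, 5, "do"), (3, 8, "do"), (4, 6, "f"), (10, 12, "f")])
```

Here the two overlapping dosage spans are merged into one, and the frequency span that falls inside the merged dosage span is discarded.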

slide-62
SLIDE 62

Medication and Assertion

Annotation selection

Medication names

Several modules for selection and disambiguation of annotations

Processing of ambiguous medication names: laboratory data or medication
  1. annotation with special status
  2. in a list section: status changed to medication

HOME MEDS: methadone 20 bid, imdur 120 bid, hydral taking 25 bid, lasix 20 bid, coumadin, colace, iron, nexium 40 bid, doxazosin 2 qd, allopurinol 100 qod

Rejection of medication names if they occur in allergy sections (not prescribed medication)

ALLERGY: prednisone, penicillins, tamsulosin, simvastatin, spironolactone
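
The two selection rules on this slide can be sketched as a simple filter. This is an illustration under assumptions: each mention is represented as a hypothetical `(name, section_heading, in_list_section)` tuple, the ambiguous names are the examples quoted in the slides, and the `LABS` heading in the usage example is invented for the demonstration.

```python
AMBIGUOUS = {"iron", "magnesium", "red blood cells"}  # examples quoted in the slides


def select_medications(mentions):
    """Keep the mentions that should be output as medications.

    mentions: list of (name, section_heading, in_list_section) tuples.
    """
    selected = []
    for name, section, in_list in mentions:
        if section.upper().startswith("ALLERG"):
            continue  # allergy section: not a prescribed medication
        if name.lower() in AMBIGUOUS and not in_list:
            continue  # special status kept: may be laboratory data
        selected.append(name)
    return selected


mentions = [
    ("iron", "HOME MEDS", True),       # ambiguous name inside a list section
    ("iron", "LABS", False),           # hypothetical section heading
    ("prednisone", "ALLERGY", False),
    ("lasix", "HOME MEDS", True),
]
selected = select_medications(mentions)
```

Only the list-section occurrence of "iron" and the "lasix" mention survive the filter.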

slide-63
SLIDE 63

Medication and Assertion

Annotation selection

Medication names

Processing of negative contexts: medication names within negative contexts are rejected

Enrichment of the medication list
  Identification of medication names with specific semantic patterns (m do mo? f)
  Medication names:
    1. Nouns and noun phrases recognized by the term extractor YATEA
    2. Stopwords rejected
    3. Filtering with typical suffixes of the medication names

Diovan 160mg PO BID, HCTZ 25mg PO QD, Imdur ER 60mg PO QD, NTG .4mg PRN CP, Norvasc 10mg PO QD, Pavachol 80mg PO QD.
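
The semantic pattern `m do mo? f` (a name followed by a dosage, an optional mode of administration, and a frequency) can be approximated on raw text. This is only a sketch: in the system the pattern presumably applies to annotations rather than to characters, and the regular expression, its abbreviation lists, and the `guess_medications` function are assumptions made for this example.

```python
import re

# Rough character-level stand-in for the semantic pattern "m do mo? f".
PATTERN = re.compile(
    r"(?P<m>[A-Z][A-Za-z]*(?: [A-Z]+)?)\s+"   # candidate medication name
    r"(?P<do>\d*\.?\d+\s?mg)\s+"               # dosage
    r"(?:(?P<mo>PO|IV)\s+)?"                   # optional mode of administration
    r"(?P<f>QD|BID|TID|QHS|PRN)\b"             # frequency
)


def guess_medications(text):
    """Return one dict per match with the keys m, do, mo (may be None) and f."""
    return [match.groupdict() for match in PATTERN.finditer(text)]


text = ("Diovan 160mg PO BID, HCTZ 25mg PO QD, Imdur ER 60mg PO QD, "
        "NTG .4mg PRN CP, Norvasc 10mg PO QD, Pavachol 80mg PO QD.")
guessed = guess_medications(text)
```

Candidates found this way would still go through the stopword and suffix filters listed above before being accepted.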

slide-64
SLIDE 64

Medication and Assertion

Annotation selection

Post-processing of medication names

Segmentation of medication names containing other types of information
  Lisinopril 5 mg p.o. q. day → medication name Lisinopril and its dosage 5 mg

Processing enumerated or coordinated medication names
  CV: 3VD s/p CABG x 3 in 2002; continued ASA, plavix, lisinopril, lopressor, statin

Dependency relations between medication names and the related information

Generation of the I2B2 offsets

slide-65
SLIDE 65

Medication and Assertion

Results

Medication task

Focus on various parameters for reason identification and guessing medication names

         RUN2     RUN1               RUN3
System   0.7801   0.7681 (-0.0120)   0.7719 (-0.0082)
m        0.8142   0.8093 (-0.0049)   0.808  (-0.0062)
do       0.8234   0.8172 (-0.0062)   0.821  (-0.0024)
f        0.837    0.8304 (-0.0066)   0.8345 (-0.0025)
mo       0.8655   0.8577 (-0.0078)   0.8624 (-0.0031)
du       0.3575   0.3516 (-0.0059)   0.3505 (-0.0070)
r        0.2867   0.2759 (-0.0108)   0.2666 (-0.0201)

RUN1 (1-2 minutes per document): All reasons
RUN2 (1-2 minutes per document): All reasons without semantic tagging and reason markers
RUN3 (4-5 minutes per document): All reasons without semantic tagging and use of reason markers
  Guessing medication names
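
The values in parentheses are differences with respect to RUN2, the best run. As a quick check, the sketch below recomputes them from the scores on this slide; the `scores` dictionary layout and the `deltas_to_run2` function are hypothetical names used only for this illustration.

```python
# Scores copied from the slide, as (RUN1, RUN2, RUN3) per row.
scores = {
    "System": (0.7681, 0.7801, 0.7719),
    "m": (0.8093, 0.8142, 0.808),
    "do": (0.8172, 0.8234, 0.821),
    "f": (0.8304, 0.837, 0.8345),
    "mo": (0.8577, 0.8655, 0.8624),
    "du": (0.3516, 0.3575, 0.3505),
    "r": (0.2759, 0.2867, 0.2666),
}


def deltas_to_run2(run1, run2, run3):
    """Differences of RUN1 and RUN3 with respect to RUN2, rounded to 4 decimals."""
    return round(run1 - run2, 4), round(run3 - run2, 4)


deltas = {row: deltas_to_run2(*values) for row, values in scores.items()}
```

The largest drop is on reasons in RUN3, which is also the slowest configuration.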

slide-66
SLIDE 66

Medication and Assertion

Results

Medication task

         exact                      inexact
         F        P        R        F        P        R
System   0.7801   0.7997   0.7614   0.7792   0.8111   0.7497
m        0.8142   0.8448   0.7858   0.8304   0.8666   0.7971
do       0.8234   0.8728   0.7793   0.8503   0.8799   0.8226
f        0.837    0.8306   0.8435   0.8411   0.8436   0.8386
mo       0.8655   0.8543   0.877    0.863    0.844    0.8828
du       0.3575   0.3483   0.3673   0.3607   0.3669   0.3546
r        0.2867   0.3047   0.2708   0.3386   0.4386   0.2757

Reason: difficult to identify the exact noun phrases (-13% between inexact and exact precision)
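
The F column is consistent with the balanced F-measure, the harmonic mean of precision and recall, although the slide does not state the formula. A minimal sketch, assuming that definition, reproduces the System score for exact matching and the precision gap quoted for reasons; `f_measure` is a hypothetical helper name.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# System row, exact matching: P = 0.7997, R = 0.7614
system_exact_f = f_measure(0.7997, 0.7614)

# Reason row: inexact precision minus exact precision
reason_precision_gap = 0.4386 - 0.3047
```

Because the published precision and recall are themselves rounded, recomputed values can differ from the slide in the last digit.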

slide-67
SLIDE 67

Medication and Assertion

Results

Assertion task

List of markers + section headings (with or without patterns)

                               Training             Test
Categories                     P      R      F      P      R      F
Associated to somebody else    0.96   0.80   0.88   0.84   0.74   0.79
Hypothesis                     0.71   0.31   0.43   0.63   0.24   0.35
Condition                      0.08   0.40   0.14   0.08   0.33   0.12
Possibility                    0.46   0.57   0.51   0.51   0.47   0.49
Absent                         0.92   0.75   0.82   0.87   0.75   0.81
Present                        0.86   0.90   0.88   0.84   0.87   0.86
Assertions                     0.82   0.82   0.82   0.80   0.80   0.80

slide-68
SLIDE 68

Medication and Assertion

Conclusion

Resources

  375 markers (33 + 342)
  137 patterns

F-measure of the system: 0.800

Analysis of the resource contribution:
  Importance of the markers
  Poor pattern contribution
  Need to include syntactic structures

slide-69
SLIDE 69

Medication and Assertion

Conclusion

Design of an NLP system based on resources and contextual rules

Medication tasks:
  Concurrent linguistic annotations of the discharge summaries
  Post-processing modules for selecting medication names and related information
  Stable results between the 17 annotated documents

Assertion task:
  Importance of the markers
  Poor pattern contribution
  Need to include syntactic structures
  Difficulty in identifying certainty degrees: few examples for condition and hypothesis
  Stable results on a 10% corpus sample

slide-70
SLIDE 70

Medication and Assertion

Further improvements

Medication tasks:

  Guessing new medication names: additional evaluation needed
  Duration extraction: identification of specific prepositional phrases based on parsing
  Reason identification: development of a specific reasoning module

Assertion task:

Enrich resources
  Using synonyms (Wordnet)
  Improving the patterns:
    using syntactic dependencies
    integrating semantic classes (verbs of evidence, verbs to get in touch with somebody, etc.)

Exploit resources
  On another application or another corpus (abstracts of scientific articles)

slide-71
SLIDE 71

Medication and Assertion

slide-72
SLIDE 72

Medication and Assertion

System architecture

[Figure: system architecture. Input: EHR corpus, converted to an XML format with sections (section and list identification). Ogmios platform: Named Entity Recognition (frequency, duration, dosage, mode of administration), sentence segmentation, word segmentation, POS tagging and lemmatization, term tagging (medications, reasons + SNOMED/DM terms, reason markers, negation), term extraction, semantic tagging. Post-processing: medication selection (ambiguous medications, list sections, missing medications, stopwords, medication suffixes), identification of reasons (extracted terms between medication names), identification of dependency relations between medication names and the related information. Output: cleaning, I2B2 / Ogmios offset conversion, token index generation, generation of the text format, valid I2B2 output.]
