SLIDE 1

Truecasing Clinical Narratives

(Full Paper)

Markus Kreuzthaler¹, Stefan Schulz¹,²

¹ Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria

² Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg, Germany

MIE Conference, August 30, 2011, Oslo

Markus Kreuzthaler (IMI) Truecasing MIE 2011 1 / 12

SLIDE 2

Motivation (1)

Original text:

CHRONISCHE HEPATITIS MIT GERING BIS MITTELGRADIGER AKTIVITAET (HEPATISCHER AKTIVITAETSINDEX 6 VON 18) UND MITTELGRADIGER BIS HOEHERGRADIGER PORTALER UND MITTELGRADIGER INKOMPLETTER UND KOMPLETTER PORTOPORTALER UND PORTOZENTRALER FIBROSE (FIBROSESCORE 4 VON 6)

Corrected text:

Chronische Hepatitis mit gering bis mittelgradiger Aktivität (hepatischer Aktivitätsindex 6 von 18) und mittelgradiger bis höhergradiger portaler und mittelgradiger inkompletter und kompletter portoportaler und portozentraler Fibrose (Fibrosescore 4 von 6).

SLIDE 6

Motivation (2)

Clinical Center Graz, Pathology Legacy Data Repository:

~1.8 million texts (1984 - 2005) are still of interest for clinical and scientific inquiries.

Legacy data example:

MITTELGRADIGE CHRONISCHE GASTRITS (MAGENMUCOSA VOM CORPUSTYP , UEBERGANGSTYP) MIT MITTELGRADIGER AKTIVITAET, KOMPLETTER UND INKOMPLETTER (TYP III) INTESTINALER METAPLASIE, MITTELGRADIGER ATROPHIE DER TIEFEN DRUESEN. ANTEIL EINES TUBULAEREN MAGENSCHLEIMHAUTADENOMS (INTESTINALER TYP; MITTELGRADIGE DYSPLASIE; WHO: GERINGGRADIGE INTRAEPITHELIALE NEOPLASIE). HP NICHT NACHWEISBAR.

Upper case, no diacritics (e.g. "Füße" → "FUESSE"), occasional typing errors.

Acronyms are not easy to identify (e.g. "WHO", "HP").

German-language-specific spelling variants (e.g. "colon", "kolon"; "cerebral", "zerebral").

SLIDE 8

Motivation (3)

"More data usually beats better algorithms." (Anand Rajaraman, blog post, March 2008, http://anand.typepad.com/datawocky/2008/03/more-data-usual.html)

Sophisticated algorithms using little data versus less sophisticated algorithms using big data.

"The good news is that Big Data is here." (T. White. Hadoop: The Definitive Guide. O’Reilly Media, Inc., June 2009.)

We will use Big Data for "Truecasing" Clinical Narratives.

SLIDE 10

Corpus Description (1)

Corpus: 3,542 German-language pathology reports. 7-bit ASCII text. 83,818 tokens.

Very low lexical coverage of 51%:

Of 7,500 word types in the text corpus, only 3,808 match any word token of a standard medical dictionary (Pschyrembel).
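
The coverage figure can be reproduced from the two counts on this slide. The following is a minimal sketch, assuming coverage means the share of distinct corpus word types that also occur in the dictionary and that matching is case-insensitive; the function name and the toy word lists are hypothetical and are not taken from the Graz corpus or from Pschyrembel.

```python
def lexical_coverage(corpus_types, dictionary_types):
    """Fraction of distinct corpus word types that also occur in the dictionary."""
    # Assumption: case-insensitive matching, since the corpus is all upper case.
    corpus = {t.lower() for t in corpus_types}
    dictionary = {t.lower() for t in dictionary_types}
    if not corpus:
        return 0.0
    return len(corpus & dictionary) / len(corpus)


# Toy, made-up word lists for illustration only.
toy_corpus = ["GASTRITIS", "FIBROSE", "PORTOPORTALER", "HP"]
toy_dictionary = ["Gastritis", "Fibrose", "Hepatitis"]
print(lexical_coverage(toy_corpus, toy_dictionary))  # 2 of 4 types match: 0.5

# The percentage on the slide follows from the two reported counts.
print(round(100 * 3808 / 7500))  # 51
```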

SLIDE 13

Corpus Description (2)

Gold standard for formative evaluation:

Sampling: 100 sentences; delimiters (.;:!?); 9.3 tokens per sentence (SD=7.9, MIN=2, MAX=38, Median=7).

Correction: Manual spelling and grammar correction according to:

◮ 1996 German orthography reform.

◮ Medical spelling rules in accordance with German medical publishers.

◮ Pschyrembel Clinical Dictionary.
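
The sampling statistics could be computed along the following lines. This is a sketch under assumptions: sentences are split at the listed delimiters, tokens are whitespace-separated, and the standard deviation is the sample SD (the slide does not say which variant was used). The function names and the sample string are hypothetical.

```python
import re
import statistics

DELIMITERS = r"[.;:!?]"  # the sentence delimiters listed on the slide


def sentence_lengths(text):
    """Split on the delimiters; return the token count of each non-empty sentence."""
    sentences = [s.strip() for s in re.split(DELIMITERS, text)]
    return [len(s.split()) for s in sentences if s]


def summarize(lengths):
    """Descriptive statistics over tokens per sentence (needs at least 2 sentences)."""
    return {
        "mean": statistics.mean(lengths),
        "sd": statistics.stdev(lengths),  # assumption: sample standard deviation
        "min": min(lengths),
        "max": max(lengths),
        "median": statistics.median(lengths),
    }


# Made-up sample in the style of the legacy reports.
sample = "HP NICHT NACHWEISBAR. MITTELGRADIGE CHRONISCHE GASTRITIS; KEINE DYSPLASIE."
lengths = sentence_lengths(sample)
print(lengths)  # [3, 3, 2]
print(summarize(lengths))
```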

Reference N-gram corpus:

All tokens in the World Wide Web indexed by Google.

SLIDE 15

Algorithm (1)

Scraping Google with JDOM, TagSoup and XPath.

Markus Kreuzthaler (IMI) Truecasing MIE 2011 7 / 12

slide-16
SLIDE 16

Algorithm (2)

Example: "GERINGGRADIGE CHRONISCHE GASTRITIS"

Bigram 1 "GERINGGRADIGE CHRONISCHE" | Frequency
Geringgradige | 7
chronische | 15
geringgradige | 6
geringgradige" | 2

Bigram 2 "CHRONISCHE GASTRITIS" | Frequency
Chronische | 9
Gastritis | 14
chronische | 5

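The example splits the input into overlapping bigrams, each of which is looked up separately. A minimal sketch of that windowing step, assuming whitespace tokenization; the function name `ngrams` and its parameter `n` are hypothetical, and `n` is included only to show how the same window would extend beyond bigrams.

```python
def ngrams(sentence, n=2):
    """Return all overlapping n-token windows of a whitespace-tokenized sentence."""
    tokens = sentence.split()
    # An input shorter than n tokens yields an empty list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams("GERINGGRADIGE CHRONISCHE GASTRITIS"))
# ['GERINGGRADIGE CHRONISCHE', 'CHRONISCHE GASTRITIS']
```

Because the windows overlap, an inner token such as CHRONISCHE appears in two bigrams, so its case variants are counted in both frequency tables.
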
SLIDE 18

Algorithm (3)

Merged "GERINGGRADIGE CHRONISCHE GASTRITIS" | Frequency
Chronische | 9
Gastritis | 14
Geringgradige | 7
chronische | 20
geringgradige | 6
geringgradige" | 2

Decision: chronische

Decision according to weighting:

$$w_i = \frac{\mathrm{frequency}(t_i)}{\mathrm{levenshtein}(t, t_i) + 1}$$
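
The merge and the weighting w_i = frequency(t_i) / (levenshtein(t, t_i) + 1) can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: t is taken to be the original upper-case token, t_i a candidate from the merged table, and the edit distance is computed on lowercased strings so that case alone does not count as an edit. All function names are hypothetical; the frequencies are the ones from the example.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def merge_counts(*tables):
    """Sum the per-bigram frequency tables into one table."""
    merged = Counter()
    for table in tables:
        merged.update(table)
    return merged


def decide(token, merged):
    """Pick the candidate with the highest weight frequency / (distance + 1)."""
    def weight(candidate):
        # Assumption: compare lowercased forms so only spelling differences count.
        distance = levenshtein(token.lower(), candidate.lower())
        return merged[candidate] / (distance + 1)
    return max(merged, key=weight)


bigram1 = {"Geringgradige": 7, "chronische": 15, "geringgradige": 6, 'geringgradige"': 2}
bigram2 = {"Chronische": 9, "Gastritis": 14, "chronische": 5}
merged = merge_counts(bigram1, bigram2)

print(merged["chronische"])            # 20
print(decide("CHRONISCHE", merged))    # chronische
```

Under these assumptions the same function selects "Geringgradige" for GERINGGRADIGE and "Gastritis" for GASTRITIS, so the sketch would output "Geringgradige chronische Gastritis" for the full example.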

SLIDE 19

Results of Truecasing and Spelling Variant Correction

Correction Phenomenon | Count | Total | Units
Right case correction of normal tokens | 896 | 909 | tokens
Right case correction of acronyms | 13 | 16 | tokens
Correction of diacritics ("ä", "ö", "ü", "ß") | 73 | 80 | occurrences
"c", "k", "z" variants corrected | 4 | 21 | occurrences
Meaning of sentence affected by correction | 3 | 100 | sentences
Spelling / grammar error corrected | 1 | 5 | sentences
New grammar error after processing | 1 | 100 | sentences
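
The counts in the first four rows can be turned into rates with a few lines of arithmetic. This sketch assumes the unlabeled first number in each row is the number of correctly handled cases out of the total; the function name `rate` is hypothetical.

```python
# (phenomenon, count, total) for the first four rows of the results table.
results = [
    ("Right case correction of normal tokens", 896, 909),
    ("Right case correction of acronyms", 13, 16),
    ("Correction of diacritics", 73, 80),
    ("c/k/z variants corrected", 4, 21),
]


def rate(count, total):
    """Percentage of cases handled, assuming count is a subset of total."""
    return 100.0 * count / total


for label, count, total in results:
    print(f"{label}: {count}/{total} = {rate(count, total):.1f}%")
```

Read this way, case correction of normal tokens succeeds in almost all cases, while the "c", "k", "z" spelling variants are corrected in fewer than one in five occurrences.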

SLIDE 24

Outlook

Problem:

Meaning of sentence is affected by correction ("minimaler" → "maximaler").

Avoiding Google black box.

Use of open Web N-gram services:

◮ Yahoo! N-Grams, version 2.0:

12,000 news-oriented sites, February 2006 to December 2006, language: English.

◮ Google/LDC Web 1T 5-gram, 10 European Languages Version 1:

Web pages from October 2008 to December 2008, 10 European Languages.

◮ Microsoft Web N-gram Services:

N-gram models based on Web snapshot taken in June 2009, EN-US market.

◮ Google Books N-grams:

Available on Amazon S3 in a Hadoop-friendly file format.

Going beyond bigrams, using N-grams.

Truecasing → Spelling correction.

SLIDE 25

Literature

L.V. Lita, A. Ittycheriah, S. Roukos, and N. Kambhatla. tRuEcasIng. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1, pages 152–159. Association for Computational Linguistics, 2003.

C. Zhai, K. Wang, D. Yarowsky, S. Vogel, and E. Viegas. Web N-gram workshop 2010. In ACM SIGIR Forum, volume 44, pages 59–63. ACM, 2011.
