SLIDE 13
Comp Ling Issues
As Gina pointed out in Week 6, “biomedical texts are not really English”! The POS tag X accounts for 28.4% of tokens, nearly 30%. Punctuation is also very heavy (11.8%), owing to abbreviations.
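A minimal sketch of how a tag tally like this can be produced, assuming the tagger output is available as (token, tag) pairs. The function name pos_distribution and the toy tokens are hypothetical illustrations, not taken from the i2b2 data or from the authors' pipeline.

```python
from collections import Counter


def pos_distribution(tagged_tokens):
    """Return (tag, count, percentage) rows, most frequent tag first."""
    # Count how often each POS tag occurs, ignoring the token text.
    counts = Counter(tag for _token, tag in tagged_tokens)
    total = sum(counts.values())
    # Percentages are relative to all tagged tokens in the input.
    return [
        (tag, n, round(100.0 * n / total, 1))
        for tag, n in counts.most_common()
    ]


# Hypothetical toy input: abbreviation-heavy clinical shorthand,
# with the tag X standing in for tokens the tagger could not classify.
toy = [
    ("pt", "X"), ("c/o", "X"), ("chest", "NN"), ("pain", "NN"), (".", "PUNC"),
    ("h/o", "X"), ("obesity", "NN"), (",", "PUNC"), ("HTN", "NNP"), (".", "PUNC"),
]

for tag, n, pct in pos_distribution(toy):
    print(f"{tag:<6}{n:>4}{pct:>7.1f}")
```

Run over a full corpus, the same tally makes a large share of X and PUNC tags easy to spot before any downstream modelling.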
Table: Part-of-Speech Counts
POS     Count     Percentage     POS     Count     Percentage
X       354,165   28.4           CC      28,902    2.3
NN      198,815   15.9           VBN     28,441    2.3
PUNC    147,095   11.8           RB      28,031    2.2
NNP     124,185    9.9           VB      20,515    1.6
JJ       93,352    7.5           PRP     18,060    1.4
IN       91,270    7.3           TO      17,915    1.4
CD       66,893    5.4           VBZ     16,474    1.3
DT       54,860    4.4           PRP$    12,653    1.0
VBD      46,635    3.7           VBP     10,895    0.9
NNS      46,234    3.7           VBG      9,972    0.8

Michael Roylance and Nicholas Waltner. Looking for Subjectivity in Medical Discharge Summaries: The Obesity NLP i2b2 Challenge (2008). Tuesday 3rd June, 2014. 13 / 16