Exploring the application of deep learning techniques on medical text corpora
José Antonio Miñarro-Giménez a, Oscar Marín-Alonso a,b and Matthias Samwald a
a Section for Medical Expert and Knowledge-Based Systems Center for Medical
Introduction
Problem: Increasingly difficult to find relevant information
ARTIFICIAL INTELLIGENCE IN MEDICINE; COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE; IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS; INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS; INTERNATIONAL JOURNAL OF TECHNOLOGY ASSESSMENT IN HEALTH CARE; JOURNAL OF BIOMEDICAL INFORMATICS; MEDICAL & BIOLOGICAL ENGINEERING & COMPUTING; JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION; MEDICAL DECISION MAKING; METHODS OF INFORMATION IN MEDICINE; STATISTICAL METHODS IN MEDICAL RESEARCH; STATISTICS IN MEDICINE; BRIEFINGS IN BIOINFORMATICS; BMC BIOINFORMATICS; MEDICAL IMAGE ANALYSIS; ARTIFICIAL INTELLIGENCE; NEUROINFORMATICS; BIOINFORMATICS
Introduction
- Challenge:
Automatically process biomedical literature.
- Approaches:
Data mining. Information extraction methods. Natural language processing. ...
- Tools:
Word2vec (https://code.google.com/p/word2vec/)
Word2vec toolkit
word2vec: text (ABC) -> vector models
Options
- Type of architecture: Skip-gram or continuous bag-of-words.
- Vector space dimension.
- Size of the context window.
- Training algorithms: hierarchical softmax and / or negative sampling.
- Threshold for downsampling the frequent words.
- ...
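As a rough illustration of how the options above map onto the original word2vec command-line tool, the sketch below assembles an argument list. The flags (-train, -output, -cbow, -size, -window, -hs, -negative, -sample) are those of the word2vec C tool; the helper name build_word2vec_command, the "./word2vec" path, the file names, and all default values are assumptions made for this example, not the settings used in the study.

```python
def build_word2vec_command(corpus_path, output_path, architecture="skip-gram",
                           size=200, window=5, hierarchical_softmax=False,
                           negative=5, sample=1e-4):
    """Return an argument list for the word2vec C tool (hypothetical helper).

    All default values are illustrative assumptions, not the study's settings.
    """
    if architecture not in ("skip-gram", "cbow"):
        raise ValueError("architecture must be 'skip-gram' or 'cbow'")
    return [
        "./word2vec",                       # assumed location of the binary
        "-train", corpus_path,              # pre-processed text corpus
        "-output", output_path,             # file that receives the vectors
        "-cbow", "1" if architecture == "cbow" else "0",  # 0 selects skip-gram
        "-size", str(size),                 # vector space dimension
        "-window", str(window),             # size of the context window
        "-hs", "1" if hierarchical_softmax else "0",      # hierarchical softmax
        "-negative", str(negative),         # negative samples (0 disables)
        "-sample", str(sample),             # downsampling threshold
    ]


if __name__ == "__main__":
    print(" ".join(build_word2vec_command("corpus.txt", "vectors.bin")))
```

The list can be handed to subprocess.run once the binary has been compiled; building it separately keeps each parameter combination easy to log and compare.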
Word2vec toolkit
word2vec search tools: distance and analogy
Word2vec Analogy method
Distance vs Analogy
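To make the difference between the two search methods concrete, here is a minimal sketch in plain Python. It assumes both methods rank candidates by cosine similarity: the distance search ranks against the vector of a single term, while the analogy search ranks against the offset vector b - a + c. The four-dimensional toy vectors and the function names are invented for illustration and do not come from any trained model.

```python
import math

# Hand-made toy vectors (assumption): [drug, disease, coagulation, glucose].
TOY_VECTORS = {
    "warfarin":         [1.0, 0.0, 1.0, 0.0],
    "thrombophlebitis": [0.0, 1.0, 1.0, 0.0],
    "metformin":        [1.0, 0.0, 0.0, 1.0],
    "diabetes":         [0.0, 1.0, 0.0, 1.0],
    "insulin":          [1.0, 0.0, 0.0, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank(vectors, target, excluded, top_n):
    """Return the top_n words most similar to target, skipping excluded words."""
    scored = [(cosine(target, vec), word)
              for word, vec in vectors.items() if word not in excluded]
    return [word for _, word in sorted(scored, reverse=True)[:top_n]]


def distance_search(vectors, term, top_n=3):
    """Nearest neighbours of a single term."""
    return rank(vectors, vectors[term], {term}, top_n)


def analogy_search(vectors, a, b, c, top_n=3):
    """Terms that relate to c as b relates to a (target = b - a + c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    return rank(vectors, target, {a, b, c}, top_n)


if __name__ == "__main__":
    print(distance_search(TOY_VECTORS, "metformin", top_n=1))
    print(analogy_search(TOY_VECTORS, "warfarin", "thrombophlebitis",
                         "metformin", top_n=1))
```

With these toy vectors the distance search for "metformin" ranks another drug first, whereas the analogy seeded with the pair warfarin / thrombophlebitis ranks a disease first. Real vector models are far noisier than this.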
Corpus
Corpora                                                                                    Word count     Vocabulary size
Clinically relevant subset of PubMed, full abstracts                                       161.428.286    204.096
Conclusion sections from clinically relevant subset of PubMed, “pubmed_key_assertions”     17.342.158     47.703
Merck Manuals                                                                              12.667.064     49.174
Medscape                                                                                   25.854.998     63.600
Clinically relevant subset of Wikipedia, “wikipedia”                                       10.945.677     65.875
Combined corpus (including all corpora above), “combined”                                  236.835.672    261.353
NDF-RT ontology
NDF-RT relationship: description (example)
- may_treat: Provides the association between drugs and the diseases they may treat. (Warfarin -> may_treat -> “Thrombophlebitis”)
- may_prevent: Provides the list of diseases that a drug may prevent. (Warfarin -> may_prevent -> “Myocardial Infarction”)
- has_PE: Relates drugs to their corresponding physiological effects. (Warfarin -> has_PE -> “Decreased Coagulation Factor Concentration”)
- has_MoA: The mechanisms of action of each drug. (Warfarin -> has_MoA -> “Vitamin K Epoxide Reductase Inhibitors”)
Testing system
Components: RESTful client, NDF-RT ontology, results, query module, matching module, word2vec train tool, RESTful server, analogy service, distance service, word2vec analogy tool, word2vec distance tool, trained corpus.
Pre-processing corpus
- Gathering text corpora: raw text and list of terms.
- Processing corpora: remove punctuation signs, avoid capitalized words, group multiword terms.
- Output: processed text.
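The processing steps above (removing punctuation, avoiding capitalized words, grouping multiword terms) could be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: "avoid capitalized words" is read here as lowercasing, multiword terms are joined with underscores so that word2vec treats them as single tokens, and the function name preprocess and the term list are invented for the example.

```python
import re


def strip_punctuation(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(no_punct.split())


def preprocess(text, multiword_terms):
    """Normalize text and join known multiword terms with underscores.

    Lowercasing and the underscore convention are assumptions of this sketch.
    """
    normalized = strip_punctuation(text)
    # Handle longer terms first so a shorter term cannot split a longer one.
    for term in sorted(multiword_terms, key=len, reverse=True):
        words = strip_punctuation(term).split()
        if len(words) < 2:
            continue  # single words need no grouping
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b"
        normalized = re.sub(pattern, "_".join(words), normalized)
    return normalized


if __name__ == "__main__":
    terms = ["myocardial infarction"]  # hypothetical list of terms
    print(preprocess("Warfarin may prevent Myocardial Infarction.", terms))
```

Grouping matters because word2vec builds one vector per whitespace-separated token; without it, a term such as “Myocardial Infarction” would only exist as two unrelated word vectors.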
Statistics
NDF-RT ontology
1. The number of resulting vectors of words with at least one correct term from the relationships of the NDF-RT ontology.
2. The evaluation of window size and the type of architecture.
3. The evaluation of vector dimension in the vector model.
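The first statistic can be read as a hit rate: the share of queries whose result list contains at least one term that the NDF-RT ontology gives as correct for that relationship. The sketch below shows that computation under explicit assumptions: the source does not state the exact denominator, so the number of queried terms is used here, and the function name hit_rate and the sample data are invented for illustration.

```python
def hit_rate(results, gold):
    """Percentage of queries with at least one correct term in their results.

    results: dict mapping a query term to the list of terms returned for it.
    gold:    dict mapping a query term to the set of correct terms for one
             relationship (e.g. may_treat), as taken from a reference ontology.
    The denominator (number of queries) is an assumption of this sketch.
    """
    if not results:
        return 0.0
    hits = sum(
        1 for query, returned in results.items()
        if gold.get(query, set()) & set(returned)
    )
    return 100.0 * hits / len(results)


if __name__ == "__main__":
    # Invented example data, not results from the study.
    results = {
        "warfarin": ["thrombophlebitis", "aspirin"],
        "metformin": ["insulin", "glucose"],
    }
    gold = {
        "warfarin": {"thrombophlebitis"},
        "metformin": {"diabetes"},
    }
    print(f"{hit_rate(results, gold):.2f}%")
```

Counting a query as a hit when any returned term is correct is a lenient measure; it says nothing about how many of the returned terms are wrong.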
Results
Corpus                   Tool       may_treat   may_prevent   has_PE   has_MoA
combined                 Analogy    27,37%      10,59%        2,49%    6,91%
combined                 Distance   3,21%       6,32%         0,67%    6,81%
PubMed key assertions    Analogy    15,74%      5,09%         0,84%    1,51%
PubMed key assertions    Distance   2,13%       4,07%         0,37%    3,60%
wikipedia                Analogy    14,9%       5,35%         2,22%    2,69%
wikipedia                Distance   1,3%        3,34%         0,32%    3,34%
Hit rate
Results: window size (figure)
Results: vector dimension (figure)
Conclusions
- Word2vec is very efficient at generating vector models and executing the
different search methods.
- Pre-processing the corpus content is needed to improve the resulting vector
models.
- The analogy method retrieves better related terms than the distance search method.
- The generated vector models provide the best results when searching for
information related to “may_treat” relationship.
- However, a hit rate of only 27% is a poor result compared to other
approaches.
- The customization of vector dimension has more impact than other training
parameters such as the size of the context window.
- The number of indexed terms is a better indicator of corpus quality than the
number of words in the corpus.
- Test the word2vec toolkit with even larger medical corpora (> 10GB).
- Investigate the use of contextual knowledge to improve precision and