

SLIDE 1

Exploring the application of deep learning techniques on medical text corpora

José Antonio Miñarro-Giménez a, Oscar Marín-Alonso a,b and Matthias Samwald a

MIE 2014, 1st September 2014, Istanbul, Turkey

a Section for Medical Expert and Knowledge-Based Systems, Center for Medical Statistics, Informatics, and Intelligent Systems, Medical University of Vienna, Austria & Vienna University of Technology, Austria
b Dept. of Computer Technology, University of Alicante, Alicante, Spain

SLIDE 2

Introduction

Problem: Increasingly difficult to find relevant information

  • ARTIFICIAL INTELLIGENCE IN MEDICINE
  • COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE
  • IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS
  • INTERNATIONAL JOURNAL OF MEDICAL INFORMATICS
  • INTERNATIONAL JOURNAL OF TECHNOLOGY ASSESSMENT IN HEALTH CARE
  • JOURNAL OF BIOMEDICAL INFORMATICS
  • MEDICAL & BIOLOGICAL ENGINEERING & COMPUTING
  • JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION
  • MEDICAL DECISION MAKING
  • METHODS OF INFORMATION IN MEDICINE
  • STATISTICAL METHODS IN MEDICAL RESEARCH
  • STATISTICS IN MEDICINE
  • BRIEFINGS IN BIOINFORMATICS
  • BMC BIOINFORMATICS
  • MEDICAL IMAGE ANALYSIS
  • ARTIFICIAL INTELLIGENCE
  • NEUROINFORMATICS
  • BIOINFORMATICS

SLIDE 3

Introduction

  • Challenge: Automatically process biomedical literature.
  • Approaches: Data mining, information extraction methods, natural language processing, ...
  • Tools: Word2vec (https://code.google.com/p/word2vec/)

SLIDE 4

Word2vec toolkit

[Figure: word2vec diagram. Labels: ABC, vector models.]

Options:

  • Type of architecture: Skip-gram or continuous bag-of-words.
  • Vector space dimension.
  • Size of the context window.
  • Training algorithms: hierarchical softmax and/or negative sampling.
  • Threshold for downsampling the frequent words.
  • ...
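
The options listed on this slide map onto command-line flags of the word2vec tool. The following is a minimal sketch, assuming the flag names of the original C implementation (`-cbow`, `-size`, `-window`, `-hs`, `-negative`, `-sample`); check them against the version you use. The function name, the file names, the `./word2vec` path, and all parameter values are hypothetical and are not the settings used by the authors.

```python
def build_word2vec_command(train_file, output_file, *, cbow=False, size=100,
                           window=5, hierarchical_softmax=False, negative=5,
                           sample=1e-3, binary="./word2vec"):
    """Build an argument list for the word2vec training tool.

    Each keyword corresponds to one option named on the slide:
    architecture, vector dimension, context window, training
    algorithm(s), and the downsampling threshold for frequent words.
    """
    return [
        binary,
        "-train", train_file,          # pre-processed text corpus
        "-output", output_file,        # resulting vector model
        "-cbow", "1" if cbow else "0",  # 1 = continuous bag-of-words, 0 = skip-gram
        "-size", str(size),            # vector space dimension
        "-window", str(window),        # size of the context window
        "-hs", "1" if hierarchical_softmax else "0",  # hierarchical softmax on/off
        "-negative", str(negative),    # negative samples (0 disables)
        "-sample", str(sample),        # downsampling threshold
    ]


# Illustrative values only (assumptions, not taken from the slides).
command = build_word2vec_command("corpus.txt", "vectors.bin",
                                 cbow=False, size=200, window=10)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the word2vec binary is available at the assumed path.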
SLIDE 5

Word2vec toolkit

[Figure: word2vec search tools. Labels: distance, analogy.]

SLIDE 6

Word2vec Analogy method

SLIDE 7

Distance vs Analogy
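
The difference between the two search methods can be illustrated with a small sketch. It assumes the usual word2vec formulation: the distance search ranks words by cosine similarity to a query vector, while the analogy search ranks them by similarity to `b - a + c` for a query "a is to b as c is to ?". The vectors and the vocabulary (`drug_a`, `disease_a`, ...) are made-up toy values chosen for illustration, not output of a trained model.

```python
import math

# Hypothetical 3-dimensional toy vectors (not from a trained model).
TOY_VECTORS = {
    "drug_a":    [1.0, 0.0, 0.0],
    "disease_a": [1.0, 1.0, 0.0],
    "drug_b":    [0.0, 0.0, 1.0],
    "disease_b": [0.0, 1.0, 1.0],
    "drug_c":    [0.0, 0.1, 1.0],
    "other":     [-1.0, 0.0, 0.0],
}


def unit(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


UNIT_VECTORS = {word: unit(vec) for word, vec in TOY_VECTORS.items()}


def rank(target, exclude, top_n):
    """Rank vocabulary words by cosine similarity to the target vector."""
    target = unit(target)
    scored = [
        (word, sum(a * b for a, b in zip(target, vec)))
        for word, vec in UNIT_VECTORS.items()
        if word not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def distance(word, top_n=3):
    """Distance search: nearest neighbours of a single word."""
    return rank(UNIT_VECTORS[word], {word}, top_n)


def analogy(a, b, c, top_n=3):
    """Analogy search: words closest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in
              zip(UNIT_VECTORS[a], UNIT_VECTORS[b], UNIT_VECTORS[c])]
    return rank(target, {a, b, c}, top_n)


print(distance("drug_b")[0][0])                        # nearest word to drug_b
print(analogy("drug_a", "disease_a", "drug_b")[0][0])  # drug_a:disease_a :: drug_b:?
```

With these toy vectors the distance search returns the most similar word regardless of its role (`drug_c`), while the analogy search uses the example pair to steer the result towards the related disease (`disease_b`).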

SLIDE 8

Corpus

| Corpora | Word count | Vocabulary size |
|---|---|---|
| Clinically relevant subset of PubMed, full abstracts | 161,428,286 | 204,096 |
| Conclusion sections from clinically relevant subset of PubMed, "pubmed_key_assertions" | 17,342,158 | 47,703 |
| Merck Manuals | 12,667,064 | 49,174 |
| Medscape | 25,854,998 | 63,600 |
| Clinically relevant subset of Wikipedia, "wikipedia" | 10,945,677 | 65,875 |
| Combined corpus (including all corpora above), "combined" | 236,835,672 | 261,353 |
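
The two corpus measures reported here, word count and vocabulary size, can be computed along the following lines. This is a minimal sketch, assuming whitespace-separated tokens in an already pre-processed corpus; the `min_count` cut-off is an assumption (word2vec-style tools typically drop rare words from the vocabulary), and the slides do not state which threshold was used. The sample lines are invented.

```python
from collections import Counter


def corpus_stats(lines, min_count=1):
    """Return (word_count, vocabulary_size) for an iterable of text lines.

    word_count counts every token; vocabulary_size counts distinct tokens
    that occur at least min_count times.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    word_count = sum(counts.values())
    vocabulary_size = sum(1 for n in counts.values() if n >= min_count)
    return word_count, vocabulary_size


# Invented sample lines for illustration.
sample = [
    "warfarin may_treat thrombophlebitis",
    "warfarin may_prevent myocardial_infarction",
]
print(corpus_stats(sample))
print(corpus_stats(sample, min_count=2))
```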

SLIDE 9

NDF-RT ontology

| NDF-RT relationship | Description | Example |
|---|---|---|
| may_treat | Provides the association between drugs and the diseases they may treat. | Warfarin -> may_treat -> "Thrombophlebitis" |
| may_prevent | Provides the list of diseases that a drug may prevent. | Warfarin -> may_prevent -> "Myocardial Infarction" |
| has_PE | Relates drugs to their corresponding physiological effects. | Warfarin -> has_PE -> "Decreased Coagulation Factor Concentration" |
| has_MoA | The mechanisms of action of each drug. | Warfarin -> has_MoA -> "Vitamin K Epoxide Reductase Inhibitors" |
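
The relationships above are subject, relation, object statements, so they can be held as triples and indexed for lookup. The sketch below uses only the four Warfarin examples from this slide; the function names and the in-memory index are hypothetical and do not reflect how NDF-RT is actually distributed or queried.

```python
from collections import defaultdict

# The four example statements from the slide, as (drug, relation, term) triples.
TRIPLES = [
    ("Warfarin", "may_treat", "Thrombophlebitis"),
    ("Warfarin", "may_prevent", "Myocardial Infarction"),
    ("Warfarin", "has_PE", "Decreased Coagulation Factor Concentration"),
    ("Warfarin", "has_MoA", "Vitamin K Epoxide Reductase Inhibitors"),
]


def build_index(triples):
    """Map each (drug, relation) pair to the set of related terms."""
    index = defaultdict(set)
    for drug, relation, term in triples:
        index[(drug, relation)].add(term)
    return index


INDEX = build_index(TRIPLES)


def related_terms(drug, relation):
    """Return the terms linked to a drug by a relation (empty set if none)."""
    return INDEX.get((drug, relation), set())


print(related_terms("Warfarin", "may_treat"))
```

A lookup of this kind gives the set of correct terms against which search results can be matched.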

SLIDE 10

Testing system

[Figure: testing system architecture. Components: RESTful client, NDF-RT ontology, results, query module, matching module, word2vec train tool, RESTful server, analogy service, distance service, word2vec analogy tool, word2vec distance tool, trained corpus.]

SLIDE 11

Pre-processing corpus

SLIDE 12

Pre-processing corpus

[Figure: pre-processing pipeline. Labels: gathering text corpora, raw text, list of terms, NDF-RT ontology, processing corpora, processed text.]

Processing corpora:

  • Remove punctuation signs.
  • Avoid capitalized words.
  • Group multiword terms.
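
The processing steps named on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: "avoid capitalized words" is read as lowercasing, punctuation is replaced by spaces, and multiword terms from a supplied term list are joined with underscores so that each is treated as a single token. The function names and the sample sentence are hypothetical.

```python
import re


def normalize(text):
    """Lowercase the text and replace punctuation with spaces."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def preprocess(text, multiword_terms):
    """Normalize text and group known multiword terms into single tokens."""
    processed = normalize(text)
    # Longest terms first, so a long term is grouped before any shorter
    # term contained in it.
    terms = sorted({normalize(t) for t in multiword_terms}, key=len, reverse=True)
    for term in terms:
        if " " not in term:
            continue  # single-word terms need no grouping
        pattern = r"\b" + re.escape(term) + r"\b"
        processed = re.sub(pattern, term.replace(" ", "_"), processed)
    return processed


print(preprocess("Warfarin may prevent Myocardial Infarction.",
                 ["Myocardial Infarction"]))
```

Grouping multiword terms before training matters because the vector model assigns one vector per token; without it, a term such as "Myocardial Infarction" would only exist as two unrelated words.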
SLIDE 13

Statistics

1. The number of resulting vectors of words with at least one correct term from the relationships of the NDF-RT ontology.
2. The evaluation of window size and the type of architecture.
3. The evaluation of vector dimension in the vector model.
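
The first statistic can be turned into a hit rate as in the sketch below. It assumes that a query counts as a hit when its result list contains at least one correct term from the ontology, and that the rate is taken over all evaluated queries; the slides do not spell out the denominator, so this is an interpretation. The function name and the toy data are hypothetical.

```python
def hit_rate(candidates_by_query, gold_by_query):
    """Fraction of queries whose candidate list contains a correct term.

    candidates_by_query: query -> list of terms returned by a search method
    gold_by_query:       query -> set of correct terms from the ontology
    """
    if not gold_by_query:
        return 0.0
    hits = 0
    for query, gold_terms in gold_by_query.items():
        candidates = candidates_by_query.get(query, [])
        if any(term in gold_terms for term in candidates):
            hits += 1
    return hits / len(gold_by_query)


# Toy data for illustration only.
gold = {
    "drug_a": {"disease_a"},
    "drug_b": {"disease_b"},
    "drug_c": {"disease_c"},
    "drug_d": {"disease_d"},
}
candidates = {
    "drug_a": ["disease_x", "disease_a"],  # hit
    "drug_b": ["disease_y"],               # miss
    "drug_c": [],                          # miss
}                                          # drug_d has no results: miss
print(hit_rate(candidates, gold))
```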

SLIDE 14

Results

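Hit rate

| Corpus | Tool | may_treat | may_prevent | has_PE | has_MoA |
|---|---|---|---|---|---|
| combined | Analogy | 27.37% | 10.59% | 2.49% | 6.91% |
| combined | Distance | 3.21% | 6.32% | 0.67% | 6.81% |
| PubMed key assertions | Analogy | 15.74% | 5.09% | 0.84% | 1.51% |
| PubMed key assertions | Distance | 2.13% | 4.07% | 0.37% | 3.60% |
| wikipedia | Analogy | 14.9% | 5.35% | 2.22% | 2.69% |
| wikipedia | Distance | 1.3% | 3.34% | 0.32% | 3.34% |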

SLIDE 15

Results

Window size

SLIDE 16

Results

Vector dimension

SLIDE 17

Conclusions

  • Word2vec is very efficient at generating vector models and executing the different search methods.
  • Pre-processing the corpus content is needed to improve the resulting vector models.
  • The analogy method retrieves better related terms than the distance search method.
  • The generated vector models provide the best results when searching for information related to the "may_treat" relationship.
  • However, a hit rate of only 27% is a poor result compared to other approaches.
  • The customization of the vector dimension has more impact than other training parameters such as the size of the context window.
  • The number of indexed terms is a better indicator of corpus quality than the number of words in the corpus.

SLIDE 18

Future work

  • Test the word2vec toolkit with even larger medical corpora (> 10GB).
  • Investigate the use of contextual knowledge to improve the precision and recall of word2vec search methods.
    – Medical terminologies and ontologies.

SLIDE 19

QUESTIONS?