10th Workshop on Building and Using Comparable Corpora at ACL17



slide-1
SLIDE 1

10th Workshop on Building and Using Comparable Corpora at ACL'17

Vancouver, Canada

A parallel collection of clinical trials in Portuguese and English

Mariana Neves August 3rd, 2017

slide-2
SLIDE 2

Parallel corpora of documents

  • Parallel collections of documents are valuable resources for training, tuning and evaluating machine translation (MT) tools.
  • However, these are not available for some domains, e.g., biomedicine, and manually creating such collections is an expensive task.

slide-3
SLIDE 3

Parallel corpora for news vs. biomedical domains

(https://ufal.mff.cuni.cz/ufal_medical_corpus)

slide-4
SLIDE 4

Sources for biomedical documents

  • Clinical discharge summaries

Not available due to privacy issues

Usually monolingual

(http://home.iprimus.com.au/callanders/Image-21.jpg)

slide-5
SLIDE 5

Sources for biomedical documents

  • Scientific publications

Many are freely available, but frequently monolingual (English)

There are some exceptions, e.g., Scielo, EDP

(http://www.scielo.br/scielo.php?script=sci_abstract&pid=S1676-06032017000100201&lng=pt&nrm=iso&tlng=en http://www.scielo.br/scielo.php?script=sci_abstract&pid=S1676-06032017000100201&lng=pt&nrm=iso&tlng=pt)

slide-6
SLIDE 6

Sources for biomedical documents

  • Scientific publications

Scielo corpus: the first parallel (comparable) corpus of scientific publications

(Neves M, Jimeno-Yepes A and Névéol A. The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine, International Conference on Language Resources and Evaluation (LREC), 2016, Portoroz, Slovenia. )

slide-7
SLIDE 7

Sources for biomedical documents

  • Scientific publications

Datasets from Scielo and EDP are currently being used in the WMT Biomedical Translation Task

(http://www.statmt.org/wmt17/biomedical-translation-task.htm)

slide-8
SLIDE 8

Sources for biomedical documents

  • Clinical trials

Freely (publicly) available, but usually monolingual (English)

(http://clinicaltrials.gov/)

slide-9
SLIDE 9

Sources for biomedical documents

  • Clinical trials

Sometimes this is the case even in countries whose native language isn't English...

(http://reec.aemps.es)

slide-10
SLIDE 10

Deutschen Register Klinischer Studien (DRKS)

(http://www.drks.de)

  • Clinical trials

Sometimes parallel documents are available, but the license doesn't allow their distribution

slide-11
SLIDE 11

Brazilian Clinical Trials Registry / Registro Brasileiro de Ensaios Clínicos (ReBEC)

(http://www.ensaiosclinicos.gov.br/)

slide-12
SLIDE 12

Overview of a clinical trial

(http://www.ensaiosclinicos.gov.br/rg/RBR-48pb9h/)

slide-13
SLIDE 13

Overview of a clinical trial

(http://www.ensaiosclinicos.gov.br/rg/RBR-48pb9h/)

slide-14
SLIDE 14

Overview of a clinical trial

(http://www.ensaiosclinicos.gov.br/rg/RBR-48pb9h/)

slide-15
SLIDE 15

Overview of a clinical trial

(http://www.ensaiosclinicos.gov.br/rg/RBR-48pb9h/)

slide-16
SLIDE 16

Overview of a clinical trial

(http://www.ensaiosclinicos.gov.br/rg/RBR-48pb9h/)

slide-17
SLIDE 17

Corpus construction: Pipeline

  • Data download
  • OpenXML Trials parsing
  • Sentence splitting
  • Sentence alignment
  • Quality checking

Similar to: Neves M, Jimeno-Yepes A and Névéol A. The Scielo Corpus: a Parallel Corpus of Scientific Publications for Biomedicine, International Conference on Language Resources and Evaluation (LREC), 2016, Portoroz, Slovenia.

slide-18
SLIDE 18

Data download

slide-19
SLIDE 19

Data download

(120 links as of January 4th)

slide-20
SLIDE 20

OpenXML parsing

slide-21
SLIDE 21

OpenXML parsing

slide-22
SLIDE 22

OpenXML parsing

slide-23
SLIDE 23

OpenXML parsing

slide-24
SLIDE 24

OpenXML parsing

  • Fields that have been considered:

(a) trial identifier
(b) public title
(c) scientific title
(d) interventions to be carried out
(e) inclusion criteria for taking part
(f) exclusion criteria for not participating
(g) primary outcome
(h) secondary outcome

Final documents are based on the concatenation of the various fields, following the order in the OpenXML Trials file
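The field extraction and concatenation described on this slide could look roughly like the sketch below. It is only an illustration: the element names, the `lang` attribute and the sample XML are assumptions made for this example and are not taken from the actual OpenXML Trials schema used by ReBEC.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names; the real OpenXML Trials schema may differ.
FIELDS = {
    "trial_id", "public_title", "scientific_title", "intervention",
    "inclusion_criteria", "exclusion_criteria",
    "primary_outcome", "secondary_outcome",
}

def build_document(xml_text, lang):
    """Concatenate the selected fields for one language, in file order."""
    root = ET.fromstring(xml_text)
    parts = []
    # iter() walks the tree in document order, so the fields keep
    # the order in which they appear in the file.
    for elem in root.iter():
        # Elements without a (hypothetical) lang attribute, such as the
        # trial identifier, are included for every language.
        if elem.tag in FIELDS and elem.get("lang", lang) == lang:
            text = (elem.text or "").strip()
            if text:
                parts.append(text)
    return "\n".join(parts)

# Made-up sample record, for illustration only.
SAMPLE = """<trial>
  <trial_id>RBR-48pb9h</trial_id>
  <public_title lang="en">Example public title</public_title>
  <public_title lang="pt">Título público de exemplo</public_title>
  <primary_outcome lang="en">Example outcome</primary_outcome>
  <primary_outcome lang="pt">Desfecho de exemplo</primary_outcome>
  <contact lang="en">not used</contact>
</trial>"""

if __name__ == "__main__":
    print(build_document(SAMPLE, "en"))
    print(build_document(SAMPLE, "pt"))
```

Because the text is collected in file order, any difference in field order between the two language versions carries over into the final documents.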

slide-25
SLIDE 25

Sentence splitting

  • "Sentence Detector" models for EN and PT

(https://opennlp.apache.org/)

slide-26
SLIDE 26

Sentence alignment

  • Default parameters of GMA
  • List of stopwords

EN: http://www.textfixer.com/tutorials/common-english-words.txt

PT: http://www.linguateca.pt/chave/stopwords/chave.MF300.txt and English

(http://nlp.cs.nyu.edu/GMA/)

slide-27
SLIDE 27

Quality checking

(https://github.com/cfedermann/Appraise)

(Sample of 50 trials)

slide-28
SLIDE 28

Results

  • Clinical trials corpus: 1188 documents

EN: 23,843 sentences, 625,881 tokens

PT: 23,666 sentences, 665,325 tokens
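A few averages follow directly from these counts. The short sketch below derives them; the class and function names are chosen for this example only.

```python
from dataclasses import dataclass

@dataclass
class CorpusSide:
    sentences: int
    tokens: int

# Counts as reported on the slide.
DOCUMENTS = 1188
EN = CorpusSide(sentences=23_843, tokens=625_881)
PT = CorpusSide(sentences=23_666, tokens=665_325)

def sentences_per_document(side, documents=DOCUMENTS):
    """Average number of sentences per trial document."""
    return side.sentences / documents

def tokens_per_sentence(side):
    """Average sentence length in tokens."""
    return side.tokens / side.sentences

if __name__ == "__main__":
    for name, side in (("EN", EN), ("PT", PT)):
        print(name,
              round(sentences_per_document(side), 2),
              round(tokens_per_sentence(side), 2))
```

Both sides average about 20 sentences per document, with roughly 26 tokens per sentence in English and 28 in Portuguese.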

slide-29
SLIDE 29

Results

  • Manual validation of 50 trials

67% of the sentences were correctly aligned

28% of the sentences were not aligned

5% of the sentences had some overlap

  • In contrast, our results for the Scielo corpus showed >80% correct alignment
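Percentages such as those above can be computed from per-sentence judgments along the following lines. This is a minimal sketch: the label names are assumptions, and the list in the usage example is made up to reproduce the reported proportions. It does not reflect the actual number of sentences that were validated.

```python
from collections import Counter

# Hypothetical label names for the three outcomes reported on the slide.
LABELS = ("correct", "not_aligned", "overlap")

def label_shares(judgments):
    """Return the percentage of judgments per label."""
    counts = Counter(judgments)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments given")
    return {label: 100 * counts[label] / total for label in LABELS}

if __name__ == "__main__":
    # Made-up judgments, for illustration only.
    example = ["correct"] * 67 + ["not_aligned"] * 28 + ["overlap"] * 5
    print(label_shares(example))
```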

slide-30
SLIDE 30

Discussion

  • Many of the wrong alignments were due to shifted sentences

Mainly caused by fields being placed in a different order when there are multiple instances of the same type
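The effect of differing field order can be illustrated with a toy example. The sketch below is not the method used to build the corpus: it compares pairing fields by position, which shifts when the order differs, with pairing by field type and occurrence index, one possible way to avoid such shifts. The field names and texts are invented.

```python
from collections import defaultdict

def pair_by_position(en_fields, pt_fields):
    """Pair the i-th EN field with the i-th PT field, ignoring field types."""
    return [(e_text, p_text)
            for (_, e_text), (_, p_text) in zip(en_fields, pt_fields)]

def pair_by_field(en_fields, pt_fields):
    """Pair the n-th EN field of a type with the n-th PT field of that type."""
    pt_by_type = defaultdict(list)
    for ftype, text in pt_fields:
        pt_by_type[ftype].append(text)
    seen = defaultdict(int)  # occurrences of each type seen so far on the EN side
    pairs = []
    for ftype, text in en_fields:
        candidates = pt_by_type.get(ftype, [])
        index = seen[ftype]
        seen[ftype] += 1
        if index < len(candidates):
            pairs.append((text, candidates[index]))
    return pairs

# Invented example: the same three fields, listed in a different order.
EN_FIELDS = [("primary_outcome", "Outcome A"),
             ("secondary_outcome", "Outcome B"),
             ("secondary_outcome", "Outcome C")]
PT_FIELDS = [("secondary_outcome", "Desfecho B"),
             ("secondary_outcome", "Desfecho C"),
             ("primary_outcome", "Desfecho A")]

if __name__ == "__main__":
    print(pair_by_position(EN_FIELDS, PT_FIELDS))  # every pair is shifted
    print(pair_by_field(EN_FIELDS, PT_FIELDS))     # pairs match by type
```

Pairing by type only helps if repeated fields of the same type appear in the same relative order in both languages, which would still need to be checked on real data.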

slide-31
SLIDE 31

Discussion

  • A few errors were due to wrong sentence splitting

Subject must be at least 18 years of age; males and females with a documented diagnosis of ulcerative colitis (UC) at least 4 months prior to entry into the study; subjects with moderately to severely active UC based on Mayo score criteria; subjects must have failed or be intolerant of at least one of the following treatments for UC: corticosteroids (oral ou intravenous), azathioprine or 6 mercaptopurine (6MP), anti TNF alpha therapy (infliximab ou adalimumab).

slide-32
SLIDE 32

Discussion

  • Some errors were due to splitting of (very) long sentences

Diseases which cause damage in the intestinal mucosa, diseases that significantly increase the gastrointestinal transit as infectious enteritis, celiac disease, inflammatory bowel disease (Chron), drug-induced enteritis or radiation, diverticular disease of the colon; History of surgery: heart (whatever), renal (exercises kidney or renal agenesis), intestinal (partial or total removal of the esophagus, stomach, duodenum, jejunum, ileum, ascending colon, transverse colon, descending colon, sigmoid colon or rectum) , liver or pancreas; Volunteers smoking more than five cigarettes a day; different eating habits of the population standard, eg vegetarianism, veganism; History of alcohol consumption or use of drugs of abuse; Made use of antibiotics as regular medication (continuous use) within the 4 weeks preceding the valuation date and / or the start of the breath test; This examination Colonoscopy one month before the breath test H2 expired.

slide-33
SLIDE 33

Conclusions

  • A novel comparable/parallel corpus of clinical trials for EN/PT

Reasonable size, easy to obtain and freely available

  • However, further processing is necessary to improve the quality of the corpus.

  • Experiments still pending to evaluate its suitability for MT.
  • Available at:

https://github.com/biomedical-translation-corpora/wmt-task

slide-34
SLIDE 34

Thank you!

Looking forward to answering your questions!

Mariana Neves

Current email: Mariana.Lara-Neves@bfr.bund.de

Current affiliation: German Federal Institute for Risk Assessment