Using LOD to crowdsource Dutch WW2 underground newspapers on Wikipedia

Olaf Janssen, National Library of the Netherlands & Wikipedia; Gerard Kuys, DBpedia & Wikimedia Nederland. olaf.janssen@kb.nl - @ookgezellig


slide-1
SLIDE 1

Using LOD to crowdsource Dutch WW2 underground newspapers on Wikipedia

Olaf Janssen, National Library of the Netherlands & Wikipedia
Gerard Kuys, DBpedia & Wikimedia Nederland

olaf.janssen@kb.nl - @ookgezellig - slideshare.net/OlafJanssenNL

SWIB 2016, Bonn, 29-11-2016

slide-2
SLIDE 2

http://www.4en5meiamsterdam.nl/attachment/47454

slide-3
SLIDE 3

During WW2 the Dutch resistance issued many underground newspapers. In every shape & form…

http://www.4en5meiamsterdam.nl/attachment/47454

slide-4
SLIDE 4

http://resolver.kb.nl/resolve?urn=ddd:010436323 http://resolver.kb.nl/resolve?urn=ddd:010442948 http://resolver.kb.nl/resolve?urn=ddd:010447825 http://resolver.kb.nl/resolve?urn=ddd:010450508

From well-organized, ‘professional’ big titles…

(including Parool, Vrij Nederland, Trouw, De Waarheid)

slide-5
SLIDE 5

…to very small, amateur, home-made, pamphlet-like issues

slide-6
SLIDE 6

After the war, 1,300 newspaper titles were (physically) preserved at the NIOD…

https://commons.wikimedia.org/wiki/File:Verzetskrant_in_archiefdozen_bij_het_NIOD.jpg – CC-BY-SA - OlafJanssen

The national Institute for War, Holocaust and Genocide Studies in Amsterdam

slide-7
SLIDE 7

http://opac-gonext.oclc.org:8180/DB=8/XMLPRS=Y/PPN?PPN=107123223

… and were described in formal library catalogues

(1,300 titles)

Bibliographic metadata

Underground students’ newspaper from The Hague

slide-8
SLIDE 8

In 2010 these WW2 newspapers were digitized…

slide-9
SLIDE 9

www.delpher.nl/kranten

…into full-texts in Delpher …

(1,300 titles)

The Dutch national aggregator for historic full-texts

  • Newspapers
  • Books
  • Magazines
slide-10
SLIDE 10

In Delpher you can read and search these newspapers…

  • Scans
  • Full-text OCR
  • ALTO
slide-11
SLIDE 11

But say, I want to know more about this newspaper

  • What sort of illegal newspaper was it?
  • What is the history of this newspaper?
  • Who wrote it?
  • Where was this newspaper printed?
  • How was it distributed?
  • Were there any relations with other underground newspapers?
  • Etc…
slide-12
SLIDE 12

But say, I want to know more about this newspaper

  • What sort of illegal newspaper was it?
  • What is the history of this newspaper?
  • Who wrote it?
  • Where was this newspaper printed?
  • How was it distributed?
  • Were there any relations with other underground newspapers or resistance groups?

  • Etc…
slide-13
SLIDE 13

But say, I want to know more about this newspaper

  • What sort of illegal newspaper was it?
  • What is the history of this newspaper?
  • Who wrote it?
  • Where was this newspaper printed?
  • How was it distributed?
  • Were there any relations with other underground newspapers?
  • Etc…

You can’t answer these questions from Delpher

slide-14
SLIDE 14

Big drawback of Delpher: no contextual information about WW2 underground newspapers

https://thejungleisneutral.files.wordpress.com/2013/11/lost.jpg

slide-15
SLIDE 15

http://nl.wikipedia.org/wiki/De_Geus_onder_studenten_(verzetsblad)

Where would many people go to find contextual information about historic newspapers? Probably Wikipedia (via Google)

slide-16
SLIDE 16

http://2.bp.blogspot.com/_BWzuYwiS6-I/TMgeRsFd3mI/AAAAAAAAElw/3cvgbZSPWcs/s1600/doctor+macro+judy+scared.jpg

slide-17
SLIDE 17

http://2.bp.blogspot.com/_BWzuYwiS6-I/TMgeRsFd3mI/AAAAAAAAElw/3cvgbZSPWcs/s1600/doctor+macro+judy+scared.jpg

slide-18
SLIDE 18

http://2.bp.blogspot.com/_BWzuYwiS6-I/TMgeRsFd3mI/AAAAAAAAElw/3cvgbZSPWcs/s1600/doctor+macro+judy+scared.jpg

Information on underground newspapers is distributed across multiple, unconnected sources

  1. Descriptions (metadata in library catalogue, 1,300 titles)
  2. Content (full-text in Delpher, 1,300 titles)
  3. Context (in Wikipedia… at least…)
slide-19
SLIDE 19
slide-20
SLIDE 20

This Wikipedia article is a carefully chosen exception

slide-21
SLIDE 21
  1. There are very few illegal newspapers with their own WP articles
  2. The inventory of these newspapers on WP is far from complete (far fewer than the 1,300 titles)

slide-22
SLIDE 22

We can tackle both problems!

slide-23
SLIDE 23

Wikiproject

Systematically and uniformly describe & interlink all 1,300 Dutch underground newspapers from WW2 on Wikipedia

tinyurl.com/verzetskranten

slide-24
SLIDE 24

Wikiproject

Systematically and uniformly describe & interlink all 1,300 Dutch underground newspapers from WW2 on Wikipedia

tinyurl.com/verzetskranten

1) Reach big audiences
2) Automatically make data available for other open purposes: Wikidata, DBpedia, Dataviz

slide-25
SLIDE 25

https://thejungleisneutral.files.wordpress.com/2013/11/lost.jpg

We badly need contextual information about the newspapers. Where do we get it?

De Ondergrondse Pers 1940-1945

Lydia E. Winkel, H. de Vries, 1989, ISBN 9021837463, Veen Uitgevers

This printed book contains entries about all 1,300 illegal newspapers

slide-26
SLIDE 26

Entry 199 – De Geus; (onder studenten)

Unique ID (within the book)

slide-27
SLIDE 27

Entry 199 – De Geus; (onder studenten)

Place of publication

Newspaper → Place name

slide-28
SLIDE 28

Entry 199 – De Geus; (onder studenten)

Context

Raw material for Wikipedia article!

slide-29
SLIDE 29

Entry 199 – De Geus; (onder studenten)

Person names

Newspaper → Persons

slide-30
SLIDE 30

Entry 199 – De Geus; (onder studenten)

IDs of related students’ newspapers

This newspaper → Other newspapers

slide-31
SLIDE 31

We OCRed this book into PDF

(CC-BY-SA)

http://www.niod.nl/nl/de-ondergrondse-pers-1940-1945 (PDF)

slide-32
SLIDE 32

We OCRed this book into PDF

(CC-BY-SA)

http://www.niod.nl/nl/de-ondergrondse-pers-1940-1945 (PDF)

Available online (PDF, flat file)

Open license (CC-BY-SA)

Convert PDF into structured database.

  • Link: titles → places, persons, other titles
  • Link: titles → library catalogue (metadata) and Delpher (full-text)
  • Link: titles, persons and places → external sources

slide-33
SLIDE 33

Convert PDF into structured database.

  • Link: titles → places, persons, other titles
  • Link: titles → library catalogue (metadata) and Delpher (full-text)
  • Link: titles, persons and places → external sources

My co-author Gerard Kuys

slide-34
SLIDE 34

Convert PDF into structured database.

  • Link: titles → places, persons, other titles
  • Link: titles → library catalogue (metadata) and Delpher (full-text)
  • Link: titles, persons and places → external sources

VIAF

slide-35
SLIDE 35
slide-36
SLIDE 36

Technical appendix

from slide 48 onwards

slide-37
SLIDE 37

We OCRed this book into PDF

(CC-BY-SA)

http://www.niod.nl/nl/de-ondergrondse-pers-1940-1945 (PDF)

Available online (PDF, flat file)

Open license (CC-BY-SA)

Convert PDF into structured database.

  • Link: titles → places, persons, other titles
  • Link: titles → library catalogue (metadata) and Delpher (full-text)
  • Link: titles, persons and places → external sources

slide-38
SLIDE 38

Summer 2016: this LOD triple store (Virtuoso) is unique in the Netherlands. For the first time, data about underground newspapers is systematically collected and linked online!

https://www.pinterest.com/freethewronged/world-war-ii/

1) For Wikipedia
2) For other open reuse purposes: Wikidata, DBpedia, Dataviz

slide-39
SLIDE 39

Wikiproject

Systematically and uniformly describe & interlink all 1,300 Dutch underground newspapers from WW2 on Wikipedia
slide-40
SLIDE 40

We have: LOD-database

Using an article template we generated 1,300 uniform and interlinked Wikipedia stubs

https://c1.staticflickr.com/9/8281/7699231918_11a7356c38_b.jpg

slide-41
SLIDE 41

https://nl.wikipedia.org/wiki/De_Geus_onder_studenten_(verzetsblad)

Non-grey = Wikipedia article stub, automatically generated from the database using a template
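Generating a stub from a database record with a template can be pictured roughly as below. This is only a sketch: the slides do not show the actual article template, so the wikitext layout, the field names (`title`, `place`, `related`) and the sample record are all hypothetical.

```python
from string import Template

# Hypothetical stub layout; the real template used for the stubs is not shown in the slides.
STUB_TEMPLATE = Template(
    "'''$title''' was an underground newspaper published in [[$place]] "
    "during World War II.\n"
    "\n"
    "== Related newspapers ==\n"
    "$related\n"
)


def render_stub(record: dict) -> str:
    """Fill the stub template from one database record (field names are assumptions)."""
    related = "\n".join(f"* [[{name}]]" for name in record["related"])
    return STUB_TEMPLATE.substitute(
        title=record["title"], place=record["place"], related=related
    )


# Placeholder record, for illustration only.
record = {
    "title": "Example title",
    "place": "Example place",
    "related": ["Related title A", "Related title B"],
}
stub = render_stub(record)
print(stub)
```

Because every stub comes from the same template, titles, places and related newspapers are linked in the same way in each article, which is what makes the resulting stubs uniform and interlinked.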

slide-42
SLIDE 42

This bit was added manually to expand the stub into a full article → crowdsourcing by the Dutch Wikipedia community

https://nl.wikipedia.org/wiki/De_Geus_onder_studenten_(verzetsblad)

slide-43
SLIDE 43

A group of Wikipedia volunteers is currently working to expand the 1,300 stubs… gradually creating more and more full articles.

By Sebastiaan ter Burg [CC BY 2.0 (http://creativecommons.org/licenses/by/2.0)], via Wikimedia Commons

slide-44
SLIDE 44

Before the project

slide-45
SLIDE 45

The number of articles is growing steadily…

slide-46
SLIDE 46

… making many Dutch people happy!

http://www.formerdays.com/2011/05/dutch-liberation.html

slide-47
SLIDE 47

Thanks!

olaf.janssen@kb.nl - @ookgezellig

tinyurl.com/verzetskranten

slide-48
SLIDE 48

Slides by Gerard Kuys

Technical appendix

http://www.ilord.com/vintage.html - http://www.ilord.com/images/enigma-8-rotors-1000px.jpg

slide-49
SLIDE 49

Transforming Descriptive Data into Linked Open Data - Locations

slide-50
SLIDE 50

Transforming Descriptive Data into Linked Open Data - Persons

slide-51
SLIDE 51

Transforming Descriptive Data into Linked Open Data - interlinking

slide-52
SLIDE 52
Things yet to come

  • Interlinked descriptions in Lydia Winkel’s annotations (‘see also’) can be put to use in order to construct an affiliation chain for underground publications
  • Right now, the model of people involved with one or more underground publications is very flat indeed: either someone is involved or not mentioned in this context at all. The consequences are devastating:
    – No distinction between people writing and people distributing, or doing both
    – Hardly a clue as to the people who did the illegal multiplying of copies, and how they organised their logistics (labour, machines, paper, ink, stencil sheets or lead slugs, etc.)
    – And, worst of all: no way to distinguish resistance people from snitches and agents provocateurs
  • We need an event model in order to connect people to the things that happened to an underground publication, and be at least a bit precise about their role in a particular event
  • More often than not, new editions sprang up as a result of collaborators holding gradually differing opinions; we would like to create an overview of evolving points of view by way of some kind of representation of categorizations of political beliefs

slide-53
SLIDE 53
How did we do the linking?

  • Forget about a fully automated process: it is 80 / 20 all the time
  • But what we can do in an automated way is Named Entity Recognition
  • In order to do Named Entity Recognition, we need reference lists of people or things (‘gazetteers’) that strings within descriptive text fragments can be matched against
  • We have two excellent reference lists at our disposal:
    – The Index of Places (already in the 1954 edition of Lydia Winkel’s book)
    – The Index of Persons (added to the 1989 edition of the same work)
    – With only slight manual corrections (e.g., ‘Ferwerderadeel’ where Winkel has ‘Ferweradeel’)
    – Linking to the site gemeentegeschiedenis.nl, providing data on Dutch municipality boundaries, which kept on changing during World War II
  • And, of course, there is DBpedia:
    – Currently identifying 402 Dutch resistance people, apart from people who became better known as a writer, politician, sportsman, etc.
    – Identifying and linking to all of the locations mentioned in Lydia Winkel’s text
    – Inviting everyone to improve the list by adding entries or list items to Wikipedia
  • Once digitized, Lydia Winkel’s texts become very malleable and searchable, so we could easily locate all candidate references to other underground periodicals for interlinking
    – Find ‘(Zie nr. 270)’, ‘(Zie nr. 270, xxxx )’, ‘(Zie nrs. xxxx, nr. 270)’, ‘(Zie nrs. xxxx, 270, yyyy)’
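The ‘(Zie nr. …)’ patterns listed on this slide could be located with a regular expression along the following lines. This is a minimal sketch, not the method actually used for the project: it assumes every cross-reference sits inside one pair of parentheses starting with ‘Zie nr.’ or ‘Zie nrs.’, and that every integer inside those parentheses is an entry number. The sample sentence is made up.

```python
import re

# A parenthesised cross-reference such as "(Zie nr. 270)" or "(Zie nrs. 12, 270, 845)".
CROSS_REF = re.compile(r"\(Zie nrs?\.\s*([^)]*)\)")
ENTRY_NUMBER = re.compile(r"\d+")


def find_cross_references(text: str) -> list[int]:
    """Return every entry number mentioned inside a '(Zie nr. ...)' reference."""
    numbers: list[int] = []
    for match in CROSS_REF.finditer(text):
        numbers.extend(int(n) for n in ENTRY_NUMBER.findall(match.group(1)))
    return numbers


# Made-up sample text; the year outside the parentheses is ignored.
sample = "Opgericht in 1940 (Zie nr. 270). Later samengegaan (Zie nrs. 12, nr. 270, 845)."
print(find_cross_references(sample))
```

The numbers found this way are only candidates: each one still has to be resolved to the entry with that ID in the book before it can become a link between two publications.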

slide-54
SLIDE 54

How did we do the linking?

slide-55
SLIDE 55

How did we do the linking?

slide-56
SLIDE 56

Named Entity Recognition using SILK Workbench

slide-57
SLIDE 57

Generating References

  • The general idea is that a Reference is a resource in its own right
    – It is not the resource pointed to
    – It has properties of its own, like source, page number, connected resource
    – It could also be the place where an event is linked to the object that is referenced, because we have a context here
  • A single Reference resource for each occasion the subject is mentioned in a text
    – In this way, we can point to the exact place of a reference within a larger text fragment
  • A Reference is not a Link
    – A Reference is a real-world thing itself: it is a place in a text saying something about something else
    – owl:sameAs links should be bound to the real-world object or, better still, be stored in a LinkSet
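The idea of a Reference as a resource with its own properties, created once per mention, might be sketched as follows. The class, its field names and the example.org URIs are assumptions made for illustration; only the ‘-Ref’ suffix echoes the URI pattern in the SPARQL query shown later in the deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """One mention of a subject inside a text fragment (a resource in its own right)."""
    uri: str         # identifier of the Reference itself, not of the thing referenced
    source: str      # the publication whose description contains the mention
    referenced: str  # the resource the mention points to
    page: int        # page number of the mention in the printed book
    offset: int      # character position of the mention within the fragment


def make_references(publication_uri: str, text: str, label: str,
                    target_uri: str, page: int) -> list[Reference]:
    """Create one Reference per occurrence of label in text."""
    refs: list[Reference] = []
    if not label:
        return refs
    start = text.find(label)
    while start != -1:
        refs.append(Reference(
            uri=f"{publication_uri}-Ref{len(refs) + 1}",
            source=publication_uri,
            referenced=target_uri,
            page=page,
            offset=start,
        ))
        start = text.find(label, start + len(label))
    return refs


# Made-up fragment, page number and URIs, for illustration only.
refs = make_references(
    "http://example.org/publication/199",
    "Gedrukt in Leiden; verspreid in Leiden en Delft.",
    "Leiden",
    "http://example.org/place/Leiden",
    page=1,
)
for ref in refs:
    print(ref.uri, ref.offset)
```

Keeping the offset on the Reference rather than on the referenced place is what lets two mentions of the same place in one fragment stay distinguishable.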

slide-58
SLIDE 58

Matching text fragments against Linked Data resources

Approaches:

  • Brute force with SPARQL: a query with the ‘Contains’ keyword
  • Using the existing data with SPARQL: a query connecting Persons from the Persons’ Index to References generated from the text

  • Matching against DBpedia: DBpedia Spotlight
  • Fine-grained comparison: GATE scripting
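The brute-force ‘contains’ approach amounts to a plain substring test of every gazetteer label against every text fragment. The sketch below shows that test in Python next to a stricter whole-word variant. The stricter variant is an addition for comparison and is not taken from the slides; the gazetteer entries and the sentence are made up.

```python
import re


def contains_match(text: str, gazetteer: list[str]) -> list[str]:
    """Brute force: keep every label that occurs anywhere in the text."""
    return [label for label in gazetteer if label in text]


def whole_word_match(text: str, gazetteer: list[str]) -> list[str]:
    """Stricter variant: keep a label only if it occurs as a whole word."""
    return [
        label for label in gazetteer
        if re.search(rf"\b{re.escape(label)}\b", text)
    ]


# Made-up gazetteer and fragment, for illustration only.
gazetteer = ["Epe", "Leiden"]
text = "Het blad werd in Epen gestencild en in Leiden verspreid."

print(contains_match(text, gazetteer))    # "Epe" matches inside "Epen"
print(whole_word_match(text, gazetteer))  # only "Leiden" remains
```

The false hit on ‘Epe’ is the kind of noise that makes a substring test a first pass only, with finer-grained comparison needed afterwards.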
slide-59
SLIDE 59

Generating References

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX bf: <http://bibframe.org/vocab/>
PREFIX ns0: <http://almere.pilod.nl/LydiaWinkel/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dbo: <http://dbpedia.org/ontology/>

CONSTRUCT {
  ?URI a dbo:Reference ;
       dct:references ?ts ;
       dct:source ?comm ;
       dbo:connectsReferencedTo ?subject
}
FROM <http://almere.pilod.nl/LydiaWinkel/>
WHERE {
  ?ts a ns0:UndergroundPublication
  # Mint a Reference URI by appending "-Ref1" to the publication IRI
  BIND (IRI(CONCAT(STR(?ts), "-Ref1")) AS ?URI) .
  ?ts ns0:winkelSummary ?comm .
  ?comm bf:annotationBody ?ann .
  ?ref dct:references ?subject .
  ?subject rdfs:label ?ond
  # Keep only subjects whose label occurs literally in the annotation text
  FILTER (contains(?ann, ?ond))
}

slide-60
SLIDE 60

The Data Model: Library of Congress’ BibFrame

slide-61
SLIDE 61

The Data Model: Interlinking Underground Publications

slide-62
SLIDE 62

The Data Model: Interlinking Underground Publications