SLIDE 1

TRACER TUTORIAL: TEXT REUSE DETECTION INTRODUCTION TO HISTORICAL TEXT REUSE DETECTION

Marco Büchler, Emily Franzini and Greta Franzini

SLIDE 2

TABLE OF CONTENTS

  • 1. Who am I?
  • 2. What is text reuse?
  • 3. Aspects of text reuse
  • 4. ACID for the Digital Humanities
  • 5. Big (Humanities) Data
  • 6. Language Model

SLIDE 3

WHO AM I?

SLIDE 4

WHO AM I?

  • 2001-2002: Head of Quality Assurance department in a software company;
  • 2006: Diploma in Computer Science on large-scale co-occurrence analysis;
  • 2007: Consultant for several SMEs in the IT sector;
  • 2008: Technical project management of the eAQUA project;
  • 2011: PI and project manager of the eTRACES project;
  • 2013: PhD in Digital Humanities on Text Reuse;
  • 2014: Head of Early Career Research Group eTRAP at the University of Göttingen.

SLIDE 5

MY INTERESTS :)

SLIDE 6

WHAT IS TEXT REUSE?

SLIDE 7

WHAT DO YOU ASSOCIATE WITH TEXT REUSE AND INTERTEXTUALITY?

SLIDE 8

ASPECTS OF TEXT REUSE

SLIDE 9

EXPECTATIONS OF A COMPUTER SCIENTIST: OVERSIMPLIFICATION

SLIDE 10

EXPECTATIONS OF A HUMANIST: OVERSIMPLIFICATION

SLIDE 11

TEXT REUSE FOR HUMANITIES AND COMPUTER SCIENCE

Question: Why is text reuse so relevant for Humanities and Computer Science?

Premise: The amount of digitally available data is growing exponentially (Big Data).

  • Humanities:
      • Lines of transmission and textual criticism.
      • Transmissions of ideas/thoughts under different circumstances and conditions.
  • Computer Science:
      • Text decontamination for stylometry and authorship attribution, dating of texts.
      • General text mining, corpus linguistics.

SLIDE 12

TEMPERATURE MAP

SLIDE 13

ACID FOR THE DIGITAL HUMANITIES

SLIDE 14

ACID PARADIGM

ACID for the Digital Humanities:

  • Acceptance
  • Complexity
  • Interoperability
  • Diversity

SLIDE 15

ACID FOR THE DIGITAL HUMANITIES: ACCEPTANCE I

SLIDE 16

ACID FOR THE DIGITAL HUMANITIES: ACCEPTANCE II

How to be accepted by humanists if text mining is a black box we can’t look into?

SLIDE 17

ACID FOR THE DIGITAL HUMANITIES: ACCEPTANCE III

Transparency: How to provide user-friendly insights into complex mining techniques and machine learning?

SLIDE 18

BIG (HUMANITIES) DATA

SLIDE 19

WHAT IS BIG DATA?

Ulrike Rieß (Big Data bestimmt die IT-Welt):

  • Large amounts of data that can’t be processed and analysed manually;
  • Less structured data, e.g. in comparison to databases and data warehouse systems;
  • Linked data between heterogeneous and distributed resources.

Information overload = large amounts of data (Big Data).

Information poverty = noisy, missing, fragmentary, oral data (Humanities Data).

COMPLEXITY

SLIDE 20

CURRENT APPROACH: TRACER

SLIDE 21

ACID FOR THE DIGITAL HUMANITIES: ACCEPTANCE IV

SLIDE 22

ACID FOR THE DIGITAL HUMANITIES: ACCEPTANCE V

SLIDE 23

ACID FOR THE DIGITAL HUMANITIES: ACCEPTANCE VI

SLIDE 24

ACID FOR THE DIGITAL HUMANITIES: ACCEPTANCE VII

SLIDE 25

ACID FOR THE DIGITAL HUMANITIES: COMPLEXITY

SLIDE 26

ACID FOR THE DIGITAL HUMANITIES: INTEROPERABILITY

SLIDE 27

ACID FOR THE DIGITAL HUMANITIES: DIVERSITY (REUSE TYPES)

  • Stability (yellow)
  • Purpose (green)
  • Size of text reuse (blue)
  • Classification (light blue)
  • Degree of distribution (purple)
  • Written and oral transmission

SLIDE 28

ACID FOR THE DIGITAL HUMANITIES: DIVERSITY (REUSE STYLES)

SLIDE 29

LANGUAGE MODEL

SLIDE 30

KEY PROBLEM

Question: The distribution of Reuse Types and Reuse Styles is often unknown - which model(s) should be chosen?

SLIDE 31

OUTLINE

SLIDE 32

FINITO!

SLIDE 33

CONTACT

Team: Marco Büchler, Greta Franzini and Emily Franzini.

Visit us: http://www.etrap.eu

Email: contact@etrap.eu

SLIDE 34

LICENCE

The theme this presentation is based on is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Changes to the theme are the work of eTRAP.
