Review, Access, and Triage of Mail (RATOM)


Jamie Patrick-Burns, Digital Archivist, State Archives of North Carolina; Christopher (Cal) Lee, University of North Carolina at Chapel Hill. Best Practices Exchange, Columbus, Ohio, April 30, 2019.


SLIDE 1

Review, Access, and Triage of Mail (RATOM)

Jamie Patrick-Burns, Digital Archivist, State Archives of North Carolina
Christopher (Cal) Lee, University of North Carolina at Chapel Hill
Best Practices Exchange, Columbus, Ohio
April 30, 2019

SLIDE 2

Motivation – Selection/Appraisal

  • Despite progress on various technologies to support data management and digital preservation, there has been relatively little progress on software support for the core activities of selection and appraisal
  • Selection/appraisal decisions are based on various patterns
  • When patterns can be identified algorithmically, software can assist the process
  • LAMs (libraries, archives, and museums) frequently want to take actions that reflect contextual relationships
  • Timeline representations and visualizations can also provide useful, high-level views of materials

SLIDE 3

Motivation - Email

  • 48 years of email creation
  • Hundreds of billions of messages generated every day
  • Most has little long-term retention value, but some absolutely does
  • Despite presence of numerous other modalities, email is still deeply embedded in activities, serving as a massive source of evidence and information
  • Often found in collections and acquisitions with other types of materials

http://hci.stanford.edu/~jheer/projects/enron/v1/

SLIDE 4

Background – BitCurator (2011-2014)

  • BitCurator environment allows LAMs to:
      • acquire data from media
      • characterize and triage data
      • expose numerous data points that can inform selection and appraisal decisions, including file types, file sizes, timestamps, original directory structures, potentially sensitive features
  • Output is generally static
  • Users have expressed interest in additional ways to iteratively make judgements
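
As a rough illustration of the kinds of data points listed above (file types, file sizes, timestamps, original directory structure), the following Python sketch walks a directory and records them per file. It is only an assumption-laden stand-in using the standard library; it is not how the BitCurator environment itself gathers this information, and the function name `characterize` and the record field names are hypothetical.

```python
import mimetypes
import os
from datetime import datetime, timezone


def characterize(root_dir):
    """Collect basic appraisal data points for every file under root_dir.

    Hypothetical helper for illustration only; not part of BitCurator.
    """
    records = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            # Guess the type from the file extension only (no content sniffing).
            mime, _encoding = mimetypes.guess_type(name)
            records.append({
                # The relative path preserves the original directory structure.
                "path": os.path.relpath(path, root_dir),
                "size_bytes": info.st_size,
                "modified": datetime.fromtimestamp(
                    info.st_mtime, tz=timezone.utc
                ).isoformat(),
                "mime_type": mime or "unknown",
            })
    return records
```

Calling `characterize("/path/to/extracted/files")` returns a list of dictionaries that could be sorted or filtered during triage. Because the result is an ordinary list, it can be regenerated or re-filtered as decisions change, unlike a static report.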

SLIDE 5

http://bitcurator.github.io/

SLIDE 6

Background – BitCurator Access and BitCurator NLP (2014-2018)

  • Developed and repurposed software (topic modelling and named entity extraction) that can facilitate appraisal/selection

SLIDE 7

TOMES and the State Archives of NC

State highway system of NC, 1936, NC State Highway Commission (MC.150.1936na). NC Maps, https://dc.lib.unc.edu/cdm/ref/collection/ncmaps/id/760

SLIDE 8

What was Transforming Online Mail with Embedded Semantics (TOMES)?

  • NHPRC-funded grant, 2015-2018
  • Appraisal, preservation, and processing challenges of email in state government
  • Partners: Utah State Archives and Kansas Historical Society
  • Building on EMCAPP (EAXS XML)
  • More information:
      • https://www.ncdcr.gov/resources/records-management/tomes
      • https://github.com/StateArchivesOfNorthCarolina/tomes-project

SLIDE 9

TOMES objectives

  • Identify email accounts of public officials with records of enduring value (Capstone methodology)
  • Produce cross-platform .pst to EAXS XML parser
  • Publish NLP dictionary designed to flag named entities unique to government at the state and local level
  • Process set of test email accounts

SLIDE 10

Results: Capstone

  • Methodology for managing/accessing archival email
  • NARA Bulletin 2013-02
  • Email appraised at account level

Diagram labels: Archival, Non-permanent

SLIDE 11

Results: Software

PST → EML → EAXS → Tagged EAXS → AIP

  1. TOMES PST Extractor: converts PST to EML
  2. TOMES DarcMail: converts EML or MBOX to EAXS
  3. TOMES Entities: converts Microsoft Excel files to a valid entity dictionary file
  4. TOMES Tagger: converts EAXS to a tagged EAXS file
  5. TOMES Packager: creates an AIP structure consisting of source and derivative files as well as basic METS files
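
To make the EML-to-XML step more concrete, here is a minimal Python sketch that parses one EML message with the standard library and writes its main headers and plain-text body into a small XML record. It is an illustration under stated assumptions only: it does not use TOMES DarcMail, the element names (`Message`, `Body`, and the header names) are hypothetical, and the output is not valid EAXS.

```python
import xml.etree.ElementTree as ET
from email import message_from_string, policy

# Made-up sample message used only for this illustration.
SAMPLE_EML = (
    "From: alice@example.gov\n"
    "To: bob@example.gov\n"
    "Date: Tue, 30 Apr 2019 09:00:00 -0400\n"
    "Subject: Budget meeting\n"
    "\n"
    "Please review the attached agenda.\n"
)


def eml_to_xml(eml_text):
    """Convert one EML message into a simplified XML element.

    The element names are hypothetical and do NOT follow the EAXS schema.
    """
    msg = message_from_string(eml_text, policy=policy.default)
    root = ET.Element("Message")
    for header in ("From", "To", "Date", "Subject"):
        # Missing headers become empty elements rather than raising errors.
        ET.SubElement(root, header).text = str(msg.get(header, ""))
    # Prefer the plain-text part; get_body returns None if there is none.
    body = msg.get_body(preferencelist=("plain",))
    ET.SubElement(root, "Body").text = (
        body.get_content() if body is not None else ""
    )
    return root


record = eml_to_xml(SAMPLE_EML)
xml_text = ET.tostring(record, encoding="unicode")
```

A real conversion would also need to handle attachments, nested multipart messages, folder hierarchy, and the full EAXS schema, all of which this sketch leaves out.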

SLIDE 12

Building on the BitCurator/TOMES foundation

  • We have XML output with lots of metadata and tags; now what?
  • Iterative processing
  • Archivist-assisted review and machine learning
  • Record/non-record
  • PII/redaction
  • Reaching beyond state governments
  • Integration with other datasets and tools (BitCurator)
  • Open-source iterative access tool to facilitate processing and access to historically significant email accounts

  • Review and approve tags
  • Redact sensitive information
  • Make reviewed emails viewable to researchers

SLIDE 13

Review, Appraisal and Triage of Mail

  • Funded by Andrew W. Mellon Foundation (2019-2020)
  • Developing and repurposing software (including NLP and machine learning) for selection/appraisal in BitCurator environment with hooks and enhancements to TOMES output
  • Support iterative processing: information discovered at various points in the processing workflow can support further selection, redaction or description actions
  • Mapping of timestamp, entity, sensitive features and other elements across the tools

Ray Tomlinson

https://upload.wikimedia.org/wikipedia/commons/0/01/Ray_Tomlinson_%28cropped%29.jpg

SLIDE 14

RATOM Project Team at UNC

  • Christopher (Cal) Lee, Principal Investigator
  • Kam Woods, Co-PI and Technical Lead
  • Antoine de Torcy, Software Developer
  • Anusha Suresh, Project Manager

SLIDE 15

RATOM Project Team at State Archives of NC

  • Camille Tyndall Watson, Co-Principal Investigator
  • Jamie Patrick-Burns, Investigator
  • Nitin Arora, Software Developer

SLIDE 16

RATOM Goals

  1. Explore the incorporation of software into an iterative processing approach
  2. Create a module that would allow email items approved for release to be reviewed/released
  3. Investigate machine learning applications to support automated identification of records and materials that require redaction or closure

SLIDE 17

Cal Lee, University of North Carolina
https://ils.unc.edu/callee/

Jamie Patrick-Burns, Digital Archivist, State Archives of North Carolina
Jamie.patrickburns@ncdcr.gov
(919) 814-6905

State Archives Twitter: @NCArchives
State Archives Facebook: https://www.facebook.com/State-Archives-of-North-Carolina-119904548024750/


http://ratom.web.unc.edu/

SLIDE 18

Discussion Questions

  • What are your most pressing needs related to email?

  • How are you addressing those needs now?
  • What would you like software to do?
