QATIP: An Optical Character Recognition System for Arabic Heritage Collections in Libraries


QCRI, Arabic Language Technologies. Felix Stahlberg, Stephan Vogel. DAS 2016, April 14, 2016.


SLIDE 1

QATIP

An Optical Character Recognition System for Arabic Heritage Collections in Libraries

QCRI, Arabic Language Technologies

Felix Stahlberg, Stephan Vogel

DAS 2016, April 14, 2016

SLIDE 2

OCR at the Qatar National Library

Document classes, ordered by increasing difficulty:

  • Modern prints (known font): basic image processing, standard OCR engines, minor error correction
  • Modern prints (unknown font): train e.g. Sakhr to the new font, which takes several man-months
  • Modern prints handled this way reach very low error rates
  • Historical documents (early prints) and historical documents (handwritten): left untouched after scanning; addressed by the QATIP system (this work)

SLIDE 3

Challenges in Historical Documents

  • Many ligatures
  • Document aging effects
  • Non-uniform background
  • Overlapping characters/words
  • Curved text lines

SLIDE 4

Challenges in Historical Documents (cont’d)

  • Many ligatures
  • Ink erosions
  • Gaps between connecting characters

SLIDE 5

Challenges in Historical Documents (cont’d)

SLIDE 6

Core Technologies

  • Kaldi Speech Recognition Toolkit (http://kaldi.sourceforge.net/)
      • Open source framework widely used in the speech recognition community
      • Segmentation-free (HMM+LSTM)
  • PrepOCRessor (http://alt.qcri.org/tools/prepocressor/)
      • Image processing tool for (Arabic) OCR
      • Feature extraction for Kaldi
      • Developed at QCRI
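A segmentation-free recognizer does not cut a text line into characters first. One common way to feed a line image to a speech-style decoder is to slide a narrow window across it and treat each window as one "frame", much like a short slice of audio. The sketch below illustrates only that idea. The function name `sliding_frames`, the window width, and the use of raw pixel values as features are assumptions for illustration; the slides do not describe the features PrepOCRessor actually computes.

```python
def sliding_frames(image, width=3, shift=1):
    """Cut a line image into overlapping vertical windows.

    image: list of rows, each row a list of pixel values (hypothetical input
    format). Returns one flat feature vector per window, column by column.
    """
    height = len(image)
    n_cols = len(image[0])
    frames = []
    for start in range(0, n_cols - width + 1, shift):
        # Flatten the window column-wise so each frame has width * height values.
        frame = [image[r][c]
                 for c in range(start, start + width)
                 for r in range(height)]
        frames.append(frame)
    return frames


if __name__ == "__main__":
    tiny_line = [[0, 1, 2, 3],
                 [4, 5, 6, 7]]
    for frame in sliding_frames(tiny_line, width=2, shift=1):
        print(frame)
```

For Arabic, a real pipeline would also have to decide whether frames are read right to left; this sketch leaves the order as stored in the image.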

SLIDE 7

Specializing ASR Technology to Arabic OCR

  • Position dependency
  • Glyph dependent model lengths
  • Dedicated sil and conn states between characters
  • Extended question set for decision tree generation

(Dreuw et al., 2009) (Likforman-Sulem et al., 2012) (Ahmad et al., 2014)

SLIDE 8

Arabic OCR at QCRI

  • Ligature modeling using “pronunciation variants”

Problem 1: Dictionary size

  • Allowing all writing variants for each word blows up the pronunciation dictionary
  • → Restrict to only one ligature per word

Problem 2: Data sparseness

  • Not enough training examples for all possible ligatures
  • → Model ligatures without dots

Pronunciation dictionary:

Word      Pronunciation
الوجيز    aaA laB waE jaB yaM zyE
الوجيز    aaA laB waE hjLM zyE
…         …
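The glyph names in the dictionary carry a position suffix that reads as A (alone), B (beginning), M (middle), and E (end). A minimal sketch of how such a sequence could be derived from a word is given below. It is an illustration, not the QATIP implementation: the function name `glyph_units` is hypothetical, the name table covers only letters that appear in the slides, the set of non-joining letters is simplified to six, and ligature units such as hjLM are not handled.

```python
# Glyph names as they appear in the slides; other letters are not covered.
GLYPH_NAMES = {
    "\u0627": "aa",  # alif
    "\u0644": "la",  # lam
    "\u0648": "wa",  # waw
    "\u062c": "ja",  # jim
    "\u064a": "ya",  # ya
    "\u0632": "zy",  # zay
    "\u0628": "ba",  # ba
    "\u0647": "he",  # he
    "\u0630": "dh",  # dhal
}

# Letters that never join to the following letter (simplified assumption):
# alif, dal, dhal, ra, zay, waw.
NON_CONNECTING = set("\u0627\u062f\u0630\u0631\u0632\u0648")

SUFFIX = {
    (False, False): "A",  # joined to neither neighbour
    (False, True): "B",   # joined to the next letter only
    (True, True): "M",    # joined on both sides
    (True, False): "E",   # joined to the previous letter only
}


def glyph_units(word):
    """Return the position-dependent glyph sequence for one word."""
    units = []
    for i, ch in enumerate(word):
        joins_prev = i > 0 and word[i - 1] not in NON_CONNECTING
        joins_next = i < len(word) - 1 and ch not in NON_CONNECTING
        units.append(GLYPH_NAMES[ch] + SUFFIX[(joins_prev, joins_next)])
    return " ".join(units)


if __name__ == "__main__":
    # alif lam waw jim ya zay, the example word from the dictionary
    print(glyph_units("\u0627\u0644\u0648\u062c\u064a\u0632"))
```

For the example word this reproduces the first dictionary entry, aaA laB waE jaB yaM zyE.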

SLIDE 9

Arabic OCR at QCRI

  • Morphological language model (ATB)
  • Problem: Connecting characters across morpheme boundaries

Standard ATB tokenization scheme: بهذه → ب+ هذه

Pronunciation dictionary:

Word    Pronunciation
ب+      baB
هذه     heM dhE heA OR: heB dhE heA

Shape of ه (he) depends on previous morpheme

SLIDE 10

Arabic OCR at QCRI

  • Extended ATB tokenization scheme with “=” marker

Extended ATB tokenization scheme: بهذه → ب+ =هذه

Pronunciation dictionary:

Word    Pronunciation
ب+      baB
هذه     heB dhE heA
=هذه    heM dhE heA
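The effect of the "=" marker can be sketched as a small preprocessing step: a morpheme gets the marker when the preceding prefix joins to it, so the dictionary can hold one unambiguous pronunciation per token. The code below is a toy illustration under stated assumptions. The names `mark_connections` and `pronounce` are hypothetical, prefixes are assumed to end in "+", only the three dictionary entries from the slide are included, and the non-joining letter set is simplified.

```python
# Toy lexicon with the three entries shown on the slide.
LEXICON = {
    "\u0628+": "baB",                      # ba prefix
    "\u0647\u0630\u0647": "heB dhE heA",   # word after a non-joining context
    "=\u0647\u0630\u0647": "heM dhE heA",  # word joined to the previous prefix
}

# Letters that never join to the following letter (simplified assumption).
NON_CONNECTING = set("\u0627\u062f\u0630\u0631\u0632\u0648")


def mark_connections(morphemes):
    """Prefix a morpheme with "=" when the previous prefix joins to it."""
    marked = []
    for i, morpheme in enumerate(morphemes):
        prev = morphemes[i - 1] if i > 0 else ""
        prev_is_joining_prefix = (
            len(prev) >= 2
            and prev.endswith("+")
            and prev[-2] not in NON_CONNECTING
        )
        marked.append("=" + morpheme if prev_is_joining_prefix else morpheme)
    return marked


def pronounce(morphemes):
    """Look up each marked morpheme in the toy lexicon."""
    return [LEXICON[m] for m in mark_connections(morphemes)]


if __name__ == "__main__":
    print(pronounce(["\u0628+", "\u0647\u0630\u0647"]))
    print(pronounce(["\u0647\u0630\u0647"]))
```

With the marker, the language model sees two distinct tokens for the same morpheme, and each token has exactly one glyph sequence.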

SLIDE 11

Arabic OCR at QCRI

  • Text image normalization

(Stahlberg and Vogel, 2015)

SLIDE 12

The QATIP System

  • URL to the document (PDF, ZIP, …)
  • Historic or general content
  • Three different output formats:
      • txt: Plain text files, one per page
      • xml: XML file with OCR plus page layout information
      • image: OCR results rendered into original images
  • Automatic translation into English
  • Accuracy/Runtime tradeoff

SLIDE 13

Compatibility with Aletheia

http://www.primaresearch.org/tools/Aletheia

SLIDE 14

Job Monitoring

SLIDE 15

Job Monitoring

SLIDE 16

The QATIP Architecture

SLIDE 17

QATIP Training Data

  • ALTEC corpus (http://www.altec-center.org/)
      • Modern printed books
  • IFN/ENIT database (Pechwitz et al., 2002)
      • Handwritten Tunisian town names (modern)
  • KHATT corpus (Mahmoud et al., 2012)
      • Handwritten forms (modern)
  • HADARA corpus (Pantke et al., 2014)
      • Historic handwritten Arabic

Corpus      #Lines    #Word Tokens
ALTEC        2,110          23,239
IFN/ENIT    42,736          42,736
KHATT       13,363         185,321
HADARA       1,319          16,587
Sum         59,528         267,883

SLIDE 18

Results (QNL)

Early Print

System       Word Error Rate    Character Error Rate
Tesseract     99.6%             51.8%
ABBYY         99.2%             54.8%
Sakhr        126.8%             65.0%
QATIP         37.5%             12.6%

Manuscript

System       Word Error Rate    Character Error Rate
Tesseract     99.4%             78.9%
ABBYY        100.0%             85.2%
Sakhr         99.4%             65.8%
QATIP         84.6%             53.3%

Systems compared: Tesseract 3.03; ABBYY FineReader 12 Professional; Sakhr Automatic Reader Platinum Edition 11
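The slides do not say how the error rates were computed. Under the usual definition, assumed here, an error rate is the edit distance between the reference and the OCR output divided by the reference length, counted in words for the word error rate and in characters for the character error rate. Because insertions count as errors, the rate can exceed 100%, which is how a value such as 126.8% can arise. A minimal sketch, with hypothetical function names:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Two reference words, three wrong output words: 3 errors / 2 words.
    print(word_error_rate("a b", "x y z"))
    print(char_error_rate("abc", "abd"))
```

Whether spaces count as characters and how text is normalized before scoring are further choices that this sketch does not settle.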

SLIDE 19

Results (Modern Print – ALTEC)

Book 1 (last 5 pages)

System       Word Error Rate    Character Error Rate
Tesseract    29.2%               8.2%
ABBYY        39.6%              10.9%
Sakhr        27.1%               8.1%
QATIP        40.8%              10.3%

Book 8 (last 5 pages)

System       Word Error Rate    Character Error Rate
Tesseract    38.3%              12.3%
ABBYY        66.1%              24.0%
Sakhr        57.1%              19.2%
QATIP        40.5%               9.7%

Systems compared: Tesseract 3.03; ABBYY FineReader 12 Professional; Sakhr Automatic Reader Platinum Edition 11

SLIDE 20

Current Runtime of QATIP

  • 2.1 GHz, 8 core, 10 GB RAM
  • Throughput values shown in the figure: ~8.7 images per hour, 30 images per hour, 14.5 images per hour

SLIDE 21

How Fast is Fast Enough?

  • Scanning: 2,500 pages in 8 hours (= 1 working day)
  • 12,500 pages in 5 working days (= 1 week)
  • The OCR system needs to process $\frac{12500}{7 \cdot 24} \approx 74.4$ pages per hour to keep up with a single operator

Required machines per operator at the current runtime of the QATIP system:

Document Complexity    Required #Machines* per Operator
Simple                 2.5
Complex                5.1

*2.1 GHz, 8 core, 10 GB RAM
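The arithmetic behind these figures can be checked in a few lines. One operator scans 12,500 pages per week, and the OCR machines run around the clock, so the required rate is 12,500 / (7 · 24), about 74.4 pages per hour. The sketch assumes that the 30 and 14.5 images per hour from the runtime slide correspond to simple and complex documents; the slides do not state this mapping, but it reproduces the 2.5 and 5.1 in the table.

```python
PAGES_PER_WEEK = 5 * 2_500   # 5 working days at 2,500 scanned pages each
HOURS_PER_WEEK = 7 * 24      # OCR machines run 24 hours a day, 7 days a week

required_pages_per_hour = PAGES_PER_WEEK / HOURS_PER_WEEK


def machines_per_operator(images_per_hour):
    """Machines needed so OCR keeps pace with one scanning operator."""
    return required_pages_per_hour / images_per_hour


if __name__ == "__main__":
    print(round(required_pages_per_hour, 1))
    # Assumed mapping: 30 images/hour = simple, 14.5 images/hour = complex.
    print(round(machines_per_operator(30), 1))
    print(round(machines_per_operator(14.5), 1))
```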

SLIDE 22

Thank You