QATIP An Optical Character Recognition System for Arabic Heritage


SLIDE 1

QATIP

An Optical Character Recognition System for Arabic Heritage Collections in Libraries

QCRI, Arabic Language Technologies Felix Stahlberg Stephan Vogel

DAS 2016, April 14, 2016

SLIDE 2

OCR at the Qatar National Library

Difficulty
Modern Prints (known font)
Modern Prints (unknown font)
Historical Documents (early prints)
Historical Documents (handwritten)

Basic image processing
Standard OCR engines
Minor error correction
Train e.g. Sakhr to the new font
Several man-months
Very low error rates
Left untouched after scanning

The QATIP system (this work)

SLIDE 3

Challenges in Historical Documents

Many ligatures
Document aging effects
Non-uniform background
Overlapping characters/words
Curved text lines

SLIDE 4

Challenges in Historical Documents (cont’d)

Many ligatures
Ink erosions
Gaps between connecting characters

SLIDE 5

Challenges in Historical Documents (cont’d)

SLIDE 6

Core Technologies

Kaldi Speech Recognition Toolkit
Open source framework widely used in the speech recognition community
Segmentation-free (HMM+LSTM)
http://kaldi.sourceforge.net/

PrepOCRessor
Image processing tool for (Arabic) OCR
Feature extraction for Kaldi
Developed at QCRI
http://alt.qcri.org/tools/prepocressor/

SLIDE 7

Specializing ASR Technology to Arabic OCR

Position dependency
Glyph dependent model lengths
Dedicated sil and conn states between characters
Extended question set for decision tree generation

(Dreuw et al., 2009) (Likforman-Sulem et al., 2012) (Ahmad et al., 2014)

SLIDE 8

Arabic OCR at QCRI

Ligature modeling using “pronunciation variants”

Problem 1: Dictionary size

Allowing all writing variants for each word blows up the pronunciation dictionary
→ Restrict to only one ligature per word

Problem 2: Data sparseness

Not enough training examples for all possible ligatures
→ Model ligatures without dots

Pronunciation dictionary

Word      Pronunciation
الوجيز    aaA laB waE jaB yaM zyE
الوجيز    aaA laB waE hjLM zyE
…         …
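The first dictionary entry can be reproduced from Arabic joining rules alone. The sketch below is not the QATIP code. It assumes the suffixes mean A = isolated, B = beginning, M = middle and E = end of a connected stroke, and it reuses the letter names from the slide. The names "da" and "ra" are hypothetical additions for letters that the slides do not show.

```python
# Letter names follow the romanization on the slide. "da" and "ra" are
# hypothetical names for dal and ra, which do not appear in the slides.
# These letters never join to the letter that follows them.
NON_CONNECTING = {"aa", "wa", "zy", "dh", "da", "ra"}


def position_units(letters):
    """Attach a position suffix (A, B, M or E) to each letter name.

    Assumed reading of the suffixes: A = isolated, B = beginning,
    M = middle, E = end of a connected stroke.
    """
    units = []
    prev_joins = False  # does the previous letter join onto this one?
    for i, name in enumerate(letters):
        has_next = i < len(letters) - 1
        joins_next = has_next and name not in NON_CONNECTING
        if prev_joins and joins_next:
            suffix = "M"
        elif prev_joins:
            suffix = "E"
        elif joins_next:
            suffix = "B"
        else:
            suffix = "A"
        units.append(name + suffix)
        prev_joins = joins_next
    return units


print(" ".join(position_units(["aa", "la", "wa", "ja", "ya", "zy"])))
# prints: aaA laB waE jaB yaM zyE
```

Ligature variants such as hjLM are not covered here; they would need extra dictionary entries on top of this default spelling.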

SLIDE 9

Arabic OCR at QCRI

Morphological language model (ATB)
Problem: Connecting characters across morpheme boundaries

Standard ATB tokenization scheme:

بهذه → ب+ هذه

Pronunciation dictionary

Word    Pronunciation
ب+      baB
هذه     heM dhE heA OR: heB dhE heA

Shape of ه (he) depends on previous morpheme

SLIDE 10

Arabic OCR at QCRI

Extended ATB tokenization scheme with “=” marker

بهذه → ب+ =هذه

Pronunciation dictionary

Word    Pronunciation
ب+      baB
هذه     heB dhE heA
=هذه    heM dhE heA
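One way to read the “=” marker: a morpheme gets the prefix when the preceding clitic joins onto it, so each token has exactly one pronunciation. The sketch below illustrates that reading only. The function name and the restriction to prefix clitics written with a trailing "+" are assumptions, not details taken from the slides.

```python
# Arabic letters that do not join to the following letter.
NON_CONNECTING = set("اأإآدذرزوؤء")


def mark_connections(morphemes):
    """Prefix a morpheme with '=' when the previous clitic joins onto it.

    Only prefix clitics written with a trailing '+' (ATB style) are
    handled; this restriction is an assumption made for the sketch.
    """
    tokens = []
    prev_joins = False
    for morpheme in morphemes:
        tokens.append("=" + morpheme if prev_joins else morpheme)
        letters = morpheme.rstrip("+")
        prev_joins = (
            morpheme.endswith("+")
            and bool(letters)
            and letters[-1] not in NON_CONNECTING
        )
    return tokens


print(mark_connections(["ب+", "هذه"]))  # ba joins forward, so the stem is marked
print(mark_connections(["و+", "هذه"]))  # waw does not join, so no marker
```

With the marker in place, the language model sees =هذه and هذه as different tokens, and each one maps to a single row of the dictionary.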

SLIDE 11

Arabic OCR at QCRI

Text image normalization

(Stahlberg and Vogel, 2015)

SLIDE 12

The QATIP System

URL to the document (PDF, ZIP, …)
Historic or general content
Three different output formats:

txt: Plain text files, one per page
xml: XML file with OCR plus page layout information
image: OCR results rendered into original images

Automatic translation into English
Accuracy/Runtime tradeoff

SLIDE 13

Compatibility with Aletheia

http://www.primaresearch.org/tools/Aletheia

SLIDE 14

Job Monitoring

SLIDE 15

Job Monitoring

SLIDE 16

The QATIP Architecture

SLIDE 17

QATIP Training Data

ALTEC corpus (http://www.altec-center.org/)
Modern printed books
IFN/ENIT database (Pechwitz et al., 2002)
Handwritten Tunisian town names (modern)
KHATT corpus (Mahmoud et al., 2012)
Handwritten forms (modern)
HADARA corpus (Pantke et al., 2014)
Historic handwritten Arabic

Corpus      #Lines    #Word Tokens
ALTEC       2,110     23,239
IFN/ENIT    42,736    42,736
KHATT       13,363    185,321
HADARA      1,319     16,587
Sum         59,528    267,883

SLIDE 18

Results (QNL)

Early Print

System       Word Error Rate    Character Error Rate
Tesseract    99.6%              51.8%
ABBYY        99.2%              54.8%
Sakhr        126.8%             65.0%
QATIP        37.5%              12.6%

Manuscript

System       Word Error Rate    Character Error Rate
Tesseract    99.4%              78.9%
ABBYY        100.0%             85.2%
Sakhr        99.4%              65.8%
QATIP        84.6%              53.3%

Tesseract 3.03
ABBYY FineReader 12 Professional
Sakhr Automatic Reader Platinum Edition 11
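The slides do not say how the error rates are computed. Assuming the usual definition (edit distance between recognizer output and reference, divided by the reference length), a rate above 100% such as Sakhr's word error rate on early prints is possible, because inserted words count as errors too. A minimal sketch under that assumption, with all function names chosen for illustration:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance divided by the reference length (reference must be non-empty)."""
    return edit_distance(ref, hyp) / len(ref)


def word_error_rate(ref_text, hyp_text):
    return error_rate(ref_text.split(), hyp_text.split())


def character_error_rate(ref_text, hyp_text):
    # Whether the reported figures count spaces as characters is not stated
    # in the slides; here they are counted.
    return error_rate(list(ref_text), list(hyp_text))


# Two reference words, four hypothesis words: 2 substitutions + 2 insertions.
print(word_error_rate("a b", "x y z w"))  # 2.0, i.e. 200%
```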

SLIDE 19

Results (Modern Print – ALTEC)

Book 1 (last 5 pages)

System       Word Error Rate    Character Error Rate
Tesseract    29.2%              8.2%
ABBYY        39.6%              10.9%
Sakhr        27.1%              8.1%
QATIP        40.8%              10.3%

Book 8 (last 5 pages)

System       Word Error Rate    Character Error Rate
Tesseract    38.3%              12.3%
ABBYY        66.1%              24.0%
Sakhr        57.1%              19.2%
QATIP        40.5%              9.7%

Tesseract 3.03
ABBYY FineReader 12 Professional
Sakhr Automatic Reader Platinum Edition 11

SLIDE 20

Current Runtime of QATIP

2.1 GHz, 8 core, 10 GB RAM

~8.7 images per hour
30 images per hour
14.5 images per hour

SLIDE 21

How Fast is Fast Enough?

Scanning: 2,500 pages in 8 hours (= 1 working day)
12,500 pages in 5 working days (= 1 week)
The OCR system needs to process

$\frac{12500}{7 \cdot 24} \approx 74.4$ pages per hour

to keep up with a single operator

Current Runtime of the QATIP system

Document Complexity    Required #Machines* per Operator
Simple                 2.5
Complex                5.1

*2.1 GHz, 8 core, 10 GB RAM
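The machine counts follow from simple arithmetic: 12,500 pages spread over the 7 · 24 = 168 hours of a week is about 74.4 pages per hour, and dividing that by the per-machine throughput gives the number of machines. The sketch assumes that the measured 30 and 14.5 images per hour belong to simple and complex documents respectively; the slides do not label them, but this pairing reproduces the 2.5 and 5.1 in the table.

```python
PAGES_PER_WEEK = 12_500   # 5 working days x 2,500 scanned pages
HOURS_PER_WEEK = 7 * 24   # the OCR machines run around the clock

required_rate = PAGES_PER_WEEK / HOURS_PER_WEEK  # pages per hour

# Assumed mapping of the measured runtimes to document complexity.
throughput = {"Simple": 30.0, "Complex": 14.5}   # images per hour per machine

machines = {kind: required_rate / rate for kind, rate in throughput.items()}

print(f"required: {required_rate:.1f} pages/hour")
for kind, count in machines.items():
    print(f"{kind}: {count:.1f} machines per operator")
# prints:
# required: 74.4 pages/hour
# Simple: 2.5 machines per operator
# Complex: 5.1 machines per operator
```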

SLIDE 22

Thank You