QATIP
An Optical Character Recognition System for Arabic Heritage Collections in Libraries
Felix Stahlberg, Stephan Vogel
QCRI, Arabic Language Technologies
DAS 2016, April 14, 2016
OCR at the Qatar National Library
Difficulty:
Modern Prints (known font)
Modern Prints (unknown font)
Historical Documents (early prints)
Historical Documents (handwritten)
processing
engines
correction
to the new font
months
Very low error rates
The QATIP system (this work)
Many ligatures
Document aging effects
Non-uniform background
Overlapping characters/words
Curved text lines
Many ligatures
Ink erosions
Gaps between connecting characters
used in the speech recognition community
OCR
http://kaldi.sourceforge.net/
http://alt.qcri.org/tools/prepocressor/
(Dreuw et al., 2009) (Likforman-Sulem et al., 2012) (Ahmad et al., 2014)
Problem 1: Dictionary size
pronunciation dictionary
ligatures
Word      Pronunciation
الوجيز    aaA laB waE jaB yaM zyE
          aaA laB waE hjLM zyE
…         …
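The dictionary entry above can be read as one model per letter, with a suffix for the letter's contextual shape. The following is a minimal sketch of how such labels could be derived, not the QATIP implementation. It assumes that A, B, M and E stand for the isolated, beginning, middle and end shapes, and that the two-letter codes are the ones visible in the slide entries. The names `RIGHT_JOINING_ONLY`, `LETTER_CODES` and `shape_labels` are hypothetical. Ligature models such as `hjLM` are not covered.

```python
# Letters that connect only to the preceding letter, never to the following one.
RIGHT_JOINING_ONLY = set("اأإآدذرزوؤة")

# Hypothetical letter codes, inferred from the dictionary entries on the slides.
LETTER_CODES = {
    "ا": "aa", "ل": "la", "و": "wa", "ج": "ja", "ي": "ya",
    "ز": "zy", "ب": "ba", "ه": "he", "ذ": "dh",
}


def shape_labels(word):
    """Return one position-dependent label per letter of an Arabic word."""
    labels = []
    for i, letter in enumerate(word):
        # A letter is joined to its predecessor only if the predecessor
        # is able to connect forward.
        joins_prev = i > 0 and word[i - 1] not in RIGHT_JOINING_ONLY
        # A letter is joined to its successor only if it can connect forward
        # itself and a successor exists.
        joins_next = i < len(word) - 1 and letter not in RIGHT_JOINING_ONLY
        if joins_prev and joins_next:
            shape = "M"  # middle
        elif joins_prev:
            shape = "E"  # end
        elif joins_next:
            shape = "B"  # beginning
        else:
            shape = "A"  # isolated
        labels.append(LETTER_CODES[letter] + shape)
    return labels


print(" ".join(shape_labels("الوجيز")))
# aaA laB waE jaB yaM zyE
```

Because every letter can take up to four shapes and frequent letter pairs may additionally be modelled as ligatures, a single word can have several pronunciations, which is one way the dictionary grows.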
Standard ATB tokenization scheme:
Pronunciation dictionary
Word    Pronunciation
ب +     baB
هذه     heM dhE heA   OR: heB dhE heA
Shape of ه (he) depends on previous morpheme
Pronunciation dictionary
Word    Pronunciation
ب +     baB
هذه     heB dhE heA
هذه     heM dhE heA
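The two entries for the same token can be generated mechanically. The sketch below is an illustration under assumptions, not the authors' code: it labels a token once as if it stood alone and once as if a connecting morpheme preceded it, and keeps the distinct results. The function names, the `attached_to_previous` and `attached_to_next` flags, and the letter codes (read off the slide entries) are hypothetical.

```python
# Letters that connect only to the preceding letter, never to the following one.
RIGHT_JOINING_ONLY = set("اأإآدذرزوؤة")

# Hypothetical letter codes, inferred from the dictionary entries on the slides.
LETTER_CODES = {"ب": "ba", "ه": "he", "ذ": "dh"}


def shape_labels(token, attached_to_previous=False, attached_to_next=False):
    """Label each letter of a token with its shape (A, B, M or E).

    attached_to_previous: a preceding morpheme connects to the first letter.
    attached_to_next: a following morpheme is written directly after the token.
    """
    labels = []
    last = len(token) - 1
    for i, letter in enumerate(token):
        if i == 0:
            joins_prev = attached_to_previous
        else:
            joins_prev = token[i - 1] not in RIGHT_JOINING_ONLY
        has_next = i < last or attached_to_next
        joins_next = has_next and letter not in RIGHT_JOINING_ONLY
        if joins_prev and joins_next:
            shape = "M"
        elif joins_prev:
            shape = "E"
        elif joins_next:
            shape = "B"
        else:
            shape = "A"
        labels.append(LETTER_CODES[letter] + shape)
    return labels


def dictionary_variants(token):
    """Return the distinct pronunciations of a token, standalone first."""
    variants = []
    for attached in (False, True):
        pronunciation = " ".join(shape_labels(token, attached_to_previous=attached))
        if pronunciation not in variants:
            variants.append(pronunciation)
    return variants


print(shape_labels("ب", attached_to_next=True))  # ['baB']
print(dictionary_variants("هذه"))  # ['heB dhE heA', 'heM dhE heA']
```

Listing both variants lets the decoder pick the shape that matches the image, at the cost of extra dictionary entries for every token whose first letter can connect backwards.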
(Stahlberg and Vogel, 2015)
URL to the document (PDF, ZIP, …)
Historic or general content
Three different output formats:
layout information
Automatic translation into English
Accuracy/Runtime tradeoff
http://www.primaresearch.org/tools/Aletheia
(Pantke et al., 2014) (Mahmoud et al., 2012) (Pechwitz et al., 2002) (http://www.altec-center.org/)
Corpus      #Lines    #Word Tokens
ALTEC       2,110     23,239
IFN/ENIT    42,736    42,736
KHATT       13,363    185,321
HADARA      1,319     16,587
Sum         59,528    267,883
Early Print
System       Word Error Rate    Character Error Rate
Tesseract    99.6%              51.8%
ABBYY        99.2%              54.8%
Sakhr        126.8%             65.0%
QATIP        37.5%              12.6%

Manuscript
System       Word Error Rate    Character Error Rate
Tesseract    99.4%              78.9%
ABBYY        100.0%             85.2%
Sakhr        99.4%              65.8%
QATIP        84.6%              53.3%
Tesseract 3.03
ABBYY FineReader 12 Professional
Sakhr Automatic Reader Platinum Edition 11
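A word error rate above 100%, as in the Sakhr row for early prints, is possible under the usual definition, where the number of substitutions, deletions and insertions is divided by the number of reference words. The sketch below uses that standard edit-distance definition. Whether the figures on the slides were computed exactly this way is an assumption, and the function names and the sample strings are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference unit
                cur[j - 1] + 1,          # insertion of a hypothesis unit
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edits needed to turn the hypothesis into the reference, per reference unit."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def word_error_rate(ref, hyp):
    return error_rate(ref.split(), hyp.split())


def character_error_rate(ref, hyp):
    return error_rate(list(ref), list(hyp))


# An OCR output that breaks each word in two: every reference word is wrong
# and two extra words are inserted, so the WER is 4 / 2 = 200%.
print(word_error_rate("old book", "o ld bo ok"))       # 2.0
print(character_error_rate("old book", "o ld bo ok"))  # 0.25
```

The example also shows why the two metrics can diverge: a few spurious spaces barely affect the character error rate but invalidate every word they touch.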
Book 1 (last 5 pages)
System       Word Error Rate    Character Error Rate
Tesseract    29.2%              8.2%
ABBYY        39.6%              10.9%
Sakhr        27.1%              8.1%
QATIP        40.8%              10.3%

Book 8 (last 5 pages)
System       Word Error Rate    Character Error Rate
Tesseract    38.3%              12.3%
ABBYY        66.1%              24.0%
Sakhr        57.1%              19.2%
QATIP        40.5%              9.7%
~8.7 images per hour
30 images per hour
14.5 images per hour
12500 7⋅24 ≈ 85.5 pages per hour
Current Runtime of the QATIP system
Document Complexity    Required #Machines* per Operator
Simple                 2.5
Complex                5.1
*2.1 GHz, 8 core, 10 GB RAM