SLIDE 1

Similarity and Correction of Strings and Trees : Towards a Correction of XML Documents

Agata SAVARY Université-François Rabelais de Tours, Campus de Blois, Laboratoire d’Informatique

IPIPAN Seminar, 24 April 2006

SLIDE 2

String-to-string correction

SLIDE 3

Traditional string-to-string correction

(Wagner & Fischer 1974, Lowrance & Wagner 1975, …)

CONTEXT:

– Finite set of symbols (alphabet)
– Elementary operations on symbols (editing operations, e.g. deletion, insertion, or replacement of a letter, inversion of two adjacent letters) with their costs (usually 1 per operation)
– Sequences of editing operations (edit sequences; each operation applies to a word resulting from the previous operations) with their costs (sums of costs of editing operations involved)
– Measure of similarity between words A and B (edit distance or error distance): minimum cost of all edit sequences transforming A to B

INPUT:

– Two words A and B

OUTPUT:

– Distance between A and B

SLIDE 4

Examples of elementary edit operations

Insertion of a letter

monter → montaer, monter → montrer

Deletion of a letter

monter → montr, monter → monte

Replacement of a letter by another

monter → ponter, monter → conter

Transposition of two adjacent letters

monter → mnoter, monter → montre

Each elementary operation has a non-negative cost. From now on we assume cost 1 for each elementary operation.
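A minimal Python sketch of the four elementary operations, assuming words are plain strings and the alphabet is passed in explicitly. The function name `one_edit_neighbours` is illustrative and does not come from the talk; it lists every word reachable at cost 1.

```python
def one_edit_neighbours(word, alphabet):
    """Return all words obtained from `word` by exactly one elementary operation."""
    results = set()
    # Insertion of a letter at every position (including both ends).
    for i in range(len(word) + 1):
        for c in alphabet:
            results.add(word[:i] + c + word[i:])
    for i in range(len(word)):
        # Deletion of the letter at position i.
        results.add(word[:i] + word[i + 1:])
        # Replacement of the letter at position i by a different letter.
        for c in alphabet:
            if c != word[i]:
                results.add(word[:i] + c + word[i + 1:])
    # Transposition of two adjacent, distinct letters.
    for i in range(len(word) - 1):
        if word[i] != word[i + 1]:
            results.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    return results


neighbours = one_edit_neighbours("monter", "abcdefghijklmnopqrstuvwxyz")
print("montre" in neighbours, "ponter" in neighbours)
```

All eight example words of this slide (montaer, montrer, montr, monte, ponter, conter, mnoter, montre) belong to the returned set.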

SLIDE 5

Edit sequence

Edit sequence = sequence of elementary edit operations

For each pair of words X and Y, many edit sequences exist that transform X into Y.

Example 1: transforming sorting into string:

– sorting → srting → sting → string (3 operations)
– sorting → sotring → string (2 operations)
– sorting → srting → string (2 operations)
– sorting → strting → string (2 operations)
– sorting → srting → sting → sing → sring → string (5 operations)
– ...

Example 2: transforming abc into ca:

– abc → ac → ca (2 operations)
– abc → cabc → cac → ca (3 operations)

From now on, we’ll be interested in linear edit sequences (Du & Chang 1992), i.e. such that the operations are performed from left to right, and no further operation may alter the result of a previous operation.

SLIDE 6

Edit (error) distance

Cost of an edit sequence = sum of costs of all elementary operations included in the sequence

– sorting → srting → sting → string (3 operations) ⇒ cost = 3
– sorting → sotring → string (2 operations) ⇒ cost = 2
– sorting → srting → sting → sing → sring → string (5 operations) ⇒ cost = 5

Edit distance (error distance) between two words X and Y, ed(X,Y) = minimal cost of all edit sequences transforming X into Y:

– ed(sorting, string) = 2
– ed(abc,ca) = 2, if all edit sequences are taken into account
– ed(abc,ca) = 3, if only the linear edit sequences are taken into account
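The value ed(abc,ca) = 2 for unrestricted sequences can be checked by brute force. The sketch below is an illustration only, not the method of the talk: it runs a breadth-first search over words, applying the four elementary operations at any position, so non-linear sequences are allowed. The function names and the `max_len` bound (needed to keep the search finite) are assumptions.

```python
from collections import deque


def neighbours(word, alphabet):
    """All words reachable from `word` by one elementary operation of cost 1."""
    out = set()
    for i in range(len(word) + 1):
        for c in alphabet:
            out.add(word[:i] + c + word[i:])              # insertion
    for i in range(len(word)):
        out.add(word[:i] + word[i + 1:])                  # deletion
        for c in alphabet:
            out.add(word[:i] + c + word[i + 1:])          # replacement
    for i in range(len(word) - 1):
        out.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])  # transposition
    out.discard(word)  # drop no-op results
    return out


def unrestricted_distance(source, target, alphabet, max_len):
    """Minimal number of operations over all edit sequences (linear or not)."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        word, cost = queue.popleft()
        if word == target:
            return cost
        for nxt in neighbours(word, alphabet):
            if len(nxt) <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, cost + 1))
    return None  # target not reachable within the length bound


print(unrestricted_distance("abc", "ca", "abc", 4))  # 2
```

The search finds abc → ac → ca, where the transposition alters the result of the earlier deletion; this is exactly the kind of sequence that the restriction to linear sequences excludes.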

SLIDE 7

Calculating the edit distance (1/4)

Notation: word X = x_1 x_2 ... x_i ... x_n; the prefix of length i of X is X[i] = x_1 x_2 ... x_i

It is possible to calculate the distance between two prefixes X[i+1] and Y[j+1] on the basis of the distances between shorter prefixes: 3 cases

Case 1: if x_{i+1} = y_{j+1} then ed(X[i+1],Y[j+1]) = ed(X[i],Y[j])

SLIDE 8

Calculating the edit distance (2/4)

Case 2: if x_i = y_{j+1} and x_{i+1} = y_j (the last 2 characters may be inverted) then 4 sub-cases are possible:

– Transposition’s cost: the cheapest sequence transforming X[i+1] into Y[j+1] contains a transposition of x_i and x_{i+1}: ed(X[i+1],Y[j+1]) = ed(X[i-1],Y[j-1]) + 1
– Replacement’s cost: the cheapest sequence transforming X[i+1] into Y[j+1] contains the replacement of x_{i+1} by y_{j+1}: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1
– Insertion’s cost: the cheapest sequence transforming X[i+1] into Y[j+1] contains the insertion of y_{j+1}: ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1
– Deletion’s cost: the cheapest sequence transforming X[i+1] into Y[j+1] contains the deletion of x_{i+1}: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1

SLIDE 9

Calculating the edit distance (3/4)

Case 3: otherwise (if x_{i+1} ≠ y_{j+1}, and (x_i ≠ y_{j+1} or x_{i+1} ≠ y_j)) 3 sub-cases are possible:

– Replacement’s cost: the cheapest sequence transforming X[i+1] into Y[j+1] contains the replacement of x_{i+1} by y_{j+1}: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1
– Insertion’s cost: the cheapest sequence transforming X[i+1] into Y[j+1] contains the insertion of y_{j+1}: ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1
– Deletion’s cost: the cheapest sequence transforming X[i+1] into Y[j+1] contains the deletion of x_{i+1}: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1

SLIDE 10

Calculating the edit distance (4/4)

Edit distance between X[i] and Y[j], recursive definition. For i = 0,...,m, j = 0,...,n:

1° ed(X[-1],Y[j]) = ed(X[i],Y[-1]) = max(m,n)

2° ed(X[0],Y[j]) = j
   ed(X[i],Y[0]) = i

3° ed(X[i+1],Y[j+1]) =
   ed(X[i],Y[j])   if x_{i+1} = y_{j+1}
   1 + min{ ed(X[i],Y[j]), ed(X[i+1],Y[j]), ed(X[i],Y[j+1]), ed(X[i-1],Y[j-1]) }   if x_i = y_{j+1} and x_{i+1} = y_j
   1 + min{ ed(X[i],Y[j]), ed(X[i+1],Y[j]), ed(X[i],Y[j+1]) }   otherwise
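The recursive definition can be sketched in Python as follows. This is a minimal illustration under unit costs, not code from the talk: the name `edit_distance` is an assumption, prefixes are indexed by their length, and an explicit index check replaces the boundary values ed(X[-1],Y[j]) = ed(X[i],Y[-1]) = max(m,n).

```python
def edit_distance(x, y):
    """Edit distance with insertion, deletion, replacement and adjacent transposition."""
    m, n = len(x), len(y)
    # d[i][j] = ed(X[i], Y[j]), the distance between the prefixes of length i and j.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete the i letters of X[i]
    for j in range(n + 1):
        d[0][j] = j          # insert the j letters of Y[j]
    for i in range(m):
        for j in range(n):
            if x[i] == y[j]:                      # case 1: x_{i+1} = y_{j+1}
                d[i + 1][j + 1] = d[i][j]
                continue
            # Replacement, insertion, deletion (cases 2 and 3).
            best = min(d[i][j], d[i + 1][j], d[i][j + 1])
            # Case 2 only: the last two letters are inverted.
            if i > 0 and j > 0 and x[i - 1] == y[j] and x[i] == y[j - 1]:
                best = min(best, d[i - 1][j - 1])
            d[i + 1][j + 1] = best + 1
    return d[m][n]


print(edit_distance("sorting", "string"))  # 2
print(edit_distance("abc", "ca"))          # 3
```

The second result is 3 rather than 2 because the recurrence only accounts for linear edit sequences.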

SLIDE 11

Calculating the edit distance: dynamic programming

– Cell [i,j] contains the edit distance between the prefix [1,...,i] of one word and the prefix [1,...,j] of the other word
– Cell [n,m] contains the edit distance between the 2 words

      ε  s  o  r  t  i  n  g
  ε   0  1  2  3  4  5  6  7
  s   1  0  1  2  3  4  5  6
  t   2  1  1  2  2  3  4  5
  r   3  2  2  1  2  3  4  5
  i   4  3  3  2  2  2  3  4
  n   5  4  4  3  3  3  2  3
  g   6  5  5  4  4  4  3  2

SLIDE 12

Dynamic programming: case 1

x_{i+1} = y_{j+1} (cell [i+1,j+1]: row t, column t)

      ε  s  o  r  t  i  n  g
  ε   0  1  2  3  4  ?  ?  ?
  s   1  0  1  2  3  ?  ?  ?
  t   2  1  1  2  2  ?  ?  ?
  r   ?  ?  ?  ?  ?  ?  ?  ?
  i   ?  ?  ?  ?  ?  ?  ?  ?
  n   ?  ?  ?  ?  ?  ?  ?  ?
  g   ?  ?  ?  ?  ?  ?  ?  ?

SLIDE 13

Dynamic programming: case 2

x_i = y_{j+1} and x_{i+1} = y_j (cell [i+1,j+1]: row r, column t)

      ε  s  o  r  t  i  n  g
  ε   0  1  2  3  4  ?  ?  ?
  s   1  0  1  2  3  ?  ?  ?
  t   2  1  1  2  2  ?  ?  ?
  r   3  2  2  1  2  ?  ?  ?
  i   ?  ?  ?  ?  ?  ?  ?  ?
  n   ?  ?  ?  ?  ?  ?  ?  ?
  g   ?  ?  ?  ?  ?  ?  ?  ?

SLIDE 14

Dynamic programming: case 3

x_{i+1} ≠ y_{j+1} and (x_i ≠ y_{j+1} or x_{i+1} ≠ y_j) (cell [i+1,j+1]: row i, column t)

      ε  s  o  r  t  i  n  g
  ε   0  1  2  3  4  ?  ?  ?
  s   1  0  1  2  3  ?  ?  ?
  t   2  1  1  2  2  ?  ?  ?
  r   3  2  2  1  2  ?  ?  ?
  i   4  3  3  2  2  ?  ?  ?
  n   ?  ?  ?  ?  ?  ?  ?  ?
  g   ?  ?  ?  ?  ?  ?  ?  ?

SLIDE 15

String-to-language correction

SLIDE 16

String-to-language correction: problem definition

CONTEXT:

– Finite set of symbols (alphabet)
– Elementary edit operations on symbols (as before) with their costs (1 per operation)
– Edit sequences (as before)
– Edit distance (error distance) between words: as before

INPUT:

– Regular grammar describing words (a finite set of words in particular)
– Incorrect word A (unrecognizable by the grammar)
– Threshold t

OUTPUT:

– A set of correct words B1, B2, …, Bn whose distance from A stays within t (the nearest neighbors of A)

SLIDE 17

String-to-language correction: simplistic approach

METHOD:

– For each word B recognizable by the grammar, calculate the edit distance matrix between A and B.
– Propose candidates whose distance from A does not exceed the threshold t (ed(A,B) ≤ t).

FEASIBILITY:

– Impossible in case of infinite languages

COMPLEXITY:

O(n * m * |D|)
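For a finite language the simplistic approach amounts to one full distance computation per word. A hedged Python sketch, assuming the language is given as a plain list of words; `edit_distance` and `simplistic_correction` are illustrative names, and the distance keeps only the last two matrix rows instead of the full matrix.

```python
def edit_distance(x, y):
    """Unit-cost distance: insertion, deletion, replacement, adjacent transposition."""
    prev2, prev = None, list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                cur.append(prev[j - 1])
                continue
            best = min(prev[j - 1], prev[j], cur[j - 1])
            # Transposition: the last two letters of both prefixes are inverted.
            if prev2 is not None and j > 1 and x[i - 2] == cy and cx == y[j - 2]:
                best = min(best, prev2[j - 2])
            cur.append(best + 1)
        prev2, prev = prev, cur
    return prev[len(y)]


def simplistic_correction(word, dictionary, threshold):
    """Compare `word` with every dictionary word; keep those within the threshold."""
    return sorted(b for b in dictionary if edit_distance(word, b) <= threshold)


print(simplistic_correction("aply", ["apple", "apples", "apply", "banana"], 2))
# ['apple', 'apply']
```

Every dictionary word costs a complete matrix, even when it shares a long prefix with a word already examined.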

SLIDE 18

String-to-language correction: threshold-controlled depth-first exploration of an FSA

(Oflazer 1996, …)

SLIDE 19

String correction with respect to a deterministic FSA (1/4)

Word to be corrected: *aply, threshold 2

[Figure: deterministic FSA with states 1 to 9; its paths include the valid words apple and apply]

– Each time a transition is followed, a new column is calculated in the edit distance matrix
– The part of the matrix for the prefix appl is calculated only once for all valid words sharing that prefix
– If we get to a final state and the edit distance remains within the threshold, a new candidate has been found

      ε  a  p  p  l  e
  ε   0  1  2  3  4  5
  a   1  0  1  2  3  4
  p   2  1  0  1  2  3
  l   3  2  1  1  1  2
  y   4  3  2  2  2  2

Candidate found: apple

SLIDE 20

String correction with respect to a deterministic FSA (2/4)

Word to be corrected: *aply, threshold 2

      ε  a  p  p  l  e  s
  ε   0  1  2  3  4  5  6
  a   1  0  1  2  3  4  5
  p   2  1  0  1  2  3  4
  l   3  2  1  1  1  2  3
  y   4  3  2  2  2  2  3

Following the transition s adds a column whose last value, 3, exceeds the threshold.

Candidate found so far: apple

SLIDE 21

String correction with respect to a deterministic FSA (3/4)

Word to be corrected: *aply, threshold 2

Backtracking results in deleting the current column (here the column s: 6 5 4 3 3).

      ε  a  p  p  l  e
  ε   0  1  2  3  4  5
  a   1  0  1  2  3  4
  p   2  1  0  1  2  3
  l   3  2  1  1  1  2
  y   4  3  2  2  2  2

Candidate found so far: apple

SLIDE 22

String correction with respect to a deterministic FSA (4/4)

Word to be corrected: *aply, threshold 2

Backtracking results in deleting the current column. Following the transition y from the shared prefix appl adds a new column.

      ε  a  p  p  l  y
  ε   0  1  2  3  4  5
  a   1  0  1  2  3  4
  p   2  1  0  1  2  3
  l   3  2  1  1  1  2
  y   4  3  2  2  2  1

Candidates found: apple, apply

SLIDE 23

Controlling the search space by the threshold

Word to be corrected: abcbb, t = 2

[Figure: FSA with a path labelled a b b b b b b, and the edit distance matrix between abcbb and that path]

If the current column exceeds the threshold, the whole path is cut off.
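The threshold-controlled depth-first exploration can be sketched as follows. This is an illustration in the spirit of Oflazer (1996), not his exact algorithm: the automaton is assumed to be a trie stored as nested dictionaries, `"$"` is assumed to be an end-of-word marker that never occurs as a letter, and a path is cut off as soon as every value of its column exceeds the threshold. All names are hypothetical.

```python
def build_trie(words):
    """Acyclic deterministic automaton as nested dicts; "$" marks a final state."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def correct(word, trie, threshold):
    """Return {valid word: distance} for all words within `threshold` of `word`."""
    m = len(word)
    found = {}

    def explore(node, path, prev2, prev):
        # prev is the column for `path`, prev2 the column for path[:-1].
        if node.get("$") and prev[m] <= threshold:
            found[path] = prev[m]
        for ch, child in node.items():
            if ch == "$":
                continue
            col = [prev[0] + 1]              # new column for path + ch
            for i in range(1, m + 1):
                if word[i - 1] == ch:
                    cost = prev[i - 1]
                else:
                    cost = 1 + min(prev[i - 1], prev[i], col[i - 1])
                    # Transposition of the last two letters.
                    if (prev2 is not None and i > 1
                            and word[i - 2] == ch and word[i - 1] == path[-1]):
                        cost = min(cost, 1 + prev2[i - 2])
                col.append(cost)
            if min(col) <= threshold:        # otherwise the whole path is cut off
                explore(child, path + ch, prev, col)
            # Returning from explore() is the backtracking step: col is dropped.

    explore(trie, "", None, list(range(m + 1)))
    return found


trie = build_trie(["apple", "apples", "apply", "banana"])
print(correct("aply", trie, 2))
```

The columns for the shared prefix appl are computed once and reused for apple, apples and apply, which is the saving over the word-by-word approach.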

SLIDE 24

Tree-to-tree correction

SLIDE 25

Tree-to-tree correction

(Selkow 1977,…)

CONTEXT:

– Finite set of node symbols (alphabet)
– Elementary edit operations on trees:
  – Insertion of a leaf
  – Deletion of a leaf
  – Renaming of a node (leaf or internal node)
– Non-negative cost for each elementary operation
– Edit sequences (sequences of edit operations) with their costs (sums of costs of editing operations involved)
– Edit distance between two trees A and B: minimum cost of all edit sequences transforming A into B

INPUT:

– Two trees A and B

OUTPUT:

– Distance between A and B

SLIDE 26

Comparing two trees

(Selkow 1977,…)

– A partial tree A0:i consists of the root of A and its subtrees A0,...,Ai
– The comparison is based on comparing roots, and then recursively comparing the roots’ subtrees

[Figure: tree A with root(A) and subtrees A0, A1, A2; tree B with root(B) and subtrees B0, B1, B2, B3; the partial trees A0:1 and B0:2 are marked]

SLIDE 27

Edit distance matrix between two trees

(Selkow 1977,…)

– Cell [-1,-1] contains the cost of renaming root(A) into root(B)
– Cell [i,j] contains the edit distance between the partial trees A0:i and B0:j
– Cell [n,m] contains the edit distance between the 2 trees

        -1    0    1    2    3
  -1     1    4   14   15   16
   0     4    2   12   13   14
   1    15   13    3    4    5
   2    16   14    4    4    4

SLIDE 28

Calculation of the tree matrix

(Selkow 1977,…)

        -1    0    1    2    3
  -1     1    4   14   15   16
   0     4    2   12   13   14
   1    15   13    3    4    5
   2    16   14    4    4    ?

To fill the cell [i,j] (here [2,3]):

– Add the edit distance between Ai and Bj to the cell [i-1,j-1] (here 4 + 0)
– Add the cost of deleting Ai to the cell [i-1,j] (here 5 + 1)
– Add the cost of inserting Bj to the cell [i,j-1] (here 4 + 1)
– Take the minimum (here min(4+0, 5+1, 4+1) = 4)
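A minimal Python sketch of this matrix calculation, under assumptions that are not stated on the slides: trees are `(label, children)` tuples, every elementary operation costs 1, and inserting or deleting a whole subtree therefore costs its number of nodes. Indices are shifted by one with respect to the slides, so `d[0][0]` plays the role of cell [-1,-1].

```python
def size(tree):
    """Number of nodes: the unit cost of inserting or deleting the whole subtree."""
    _, children = tree
    return 1 + sum(size(c) for c in children)


def tree_distance(a, b):
    """Edit distance between two ordered labelled trees (Selkow-style operations)."""
    (label_a, kids_a), (label_b, kids_b) = a, b
    n, m = len(kids_a), len(kids_b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0 if label_a == label_b else 1        # renaming root(A) into root(B)
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + size(kids_a[i - 1])  # delete subtree A_{i-1}
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + size(kids_b[j - 1])  # insert subtree B_{j-1}
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + tree_distance(kids_a[i - 1], kids_b[j - 1]),
                d[i - 1][j] + size(kids_a[i - 1]),   # deletion
                d[i][j - 1] + size(kids_b[j - 1]),   # insertion
            )
    return d[n][m]


a = ("a", [("b", []), ("c", [])])
b = ("a", [("b", []), ("d", []), ("c", [])])
print(tree_distance(a, b))  # 1: insertion of the leaf d
```

The recursive call on the subtrees is what distinguishes this from the string case, where comparing two symbols costs either 0 or 1.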

SLIDE 29

Extension to the correction of XML documents

– The validity of a node is described by a set of regular expressions, e.g. E = ab*c + db*
– The "horizontal" correction on the siblings’ level is similar to the string-to-language correction (Oflazer 1996)
– The "vertical" correction is inspired by the tree-to-tree correction (Selkow 1977)
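The validity test for one node can be illustrated with Python's `re` module. This is a sketch under simplifying assumptions: each child label is a single character, the word formed by the children is matched as a whole, and the union written + in E becomes | in Python syntax. The names are hypothetical.

```python
import re

# Content model E = ab*c + db* from the slide, in Python regex syntax.
CONTENT_MODEL = re.compile(r"ab*c|db*")


def children_are_valid(child_labels):
    """True if the sequence of child labels forms a word of the content model."""
    return CONTENT_MODEL.fullmatch("".join(child_labels)) is not None


print(children_are_valid(["a", "b", "b", "c"]))  # True
print(children_are_valid(["a", "b", "d"]))       # False
```

A node whose children fail this test is where the horizontal correction would look for the nearest valid words.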

SLIDE 30

Main idea

– String-to-string (Wagner & Fischer 1974)
– String-to-(regular) language (Oflazer 1996)
– Tree-to-tree (Selkow 1977)
– Tree-to-(regular) tree language (Cheriat, Savary, Bouchou, Halfeld, to be continued)

SLIDE 31

Edit distance matrix with edit sequences

Cell [i,j] contains the edit distance between the partial trees A0:i and B0:j, and the edit sequence necessary to transform A0:i into B0:j

[Figure: edit distance matrix between two trees, with one cell expanded to show its content:]

[3, <(R,0.1,f),(D,1.1,/),(I,2,e)>]

SLIDE 32

Bibliography

Clarke, G., Barnard, D. T., Duncan, N. (1995): Tree-to-tree Correction for Document Trees. Technical Report 95-372, Department of Computing and Information Science, Queen’s University, Kingston, Ontario

Du, M. W., Chang, S. C. (1992): A model and a fast algorithm for multiple errors spelling correction. Acta Informatica, Vol. 29. Springer Verlag, pp. 281-302

Hall, P., Dowling, G. (1980): Approximate String Matching. ACM Computing Surveys, Vol. 12(4). ACM, New York, pp. 381-402

Lowrance, R., Wagner, R. A. (1975): An Extension of the String-to-String Correction Problem. Journal of the ACM, Vol. 22(2), pp. 177-183

Mihov, S., Schulz, K. (2004): Fast approximate search in large dictionaries. Computational Linguistics, Vol. 30(4). MIT Press, Cambridge, Massachusetts, pp. 451-477

Oflazer, K. (1996): Error-tolerant finite state recognition with applications to morphological analysis and spelling correction. Computational Linguistics, Vol. 22(1). MIT Press, Cambridge, Massachusetts, pp. 73-89

Selkow, S. (1977): The tree-to-tree editing problem. Information Processing Letters 6(6), pp. 184-186

Wagner, R. A. (1974): Order-n Correction for Regular Languages. Communications of the ACM, 17(5), pp. 265-268

Wagner, R. A., Fischer, M. J. (1974): The String-to-String Correction Problem. Journal of the ACM, Vol. 21(1), pp. 168-173

SLIDE 33

Some details of the state of the art

Wagner & Fischer (1974):

– Elegant and solid theoretical definition of the string-to-string correction problem
– 3 elementary operations on single letters admitted (insertion, deletion, replacement)
– Model of a trace describing the edit distance between two strings
– Dynamic programming method

Lowrance & Wagner (1975):

– Additional elementary operation: inversion of two adjacent letters
– Restriction of the cost function

Du & Chang (1992):

– Cost 1 for each elementary operation
– Restriction to linear editing sequences
– Application to the nearest neighbor search in a dictionary, with a threshold

Oflazer (1996):

– Nearest-neighbor search in finite-state automata
– Application to large natural-language dictionaries

Selkow (1977), Tai (1979), Zhang & Shasha (1989), Clarke, Barnard & Duncan (1995), de Rougemont (2003):

– Tree-to-tree correction problem

Mihov & Schulz (2004):

– Levenshtein automaton
– Backward dictionary

Bouchou, B. & Halfeld Ferrari Alves, M. (2003):

– Incremental validation of XML documents resulting from updates: human-computer interaction