SLIDE 1

Similarity and Correction of Strings and Trees : Towards a Correction of XML Documents

Agata SAVARY Université-François Rabelais de Tours, Campus de Blois, Laboratoire d’Informatique

IPIPAN Seminar, 24 April 2006

SLIDE 2

String-to-string correction

SLIDE 3

Traditional string-to-string correction

(Wagner & Fischer 1974, Lowrance & Wagner 1975, …)

  • CONTEXT:

– Finite set of symbols (alphabet)
– Elementary operations on symbols (editing operations, e.g. deletion, insertion, or replacement of a letter, transposition of two adjacent letters) with their costs (usually 1 per operation)
– Sequences of editing operations (edit sequences; each operation applies to the word resulting from the previous operations) with their costs (the sum of the costs of the editing operations involved)
– Measure of similarity between words A and B (edit distance or error distance): the minimum cost over all edit sequences transforming A into B

  • INPUT:

– Two words A and B

  • OUTPUT:

– Distance between A and B

SLIDE 4

Examples of elementary edit operations

  • Insertion of a letter

monter → montaer, monter → montrer

  • Deletion of a letter

monter → montr, monter → monte

  • Replacement of a letter by another

monter → ponter, monter → conter

  • Transposition of two adjacent letters

monter → mnoter, monter → montre

Each elementary operation has a non-negative cost. From now on we assume cost 1 for each elementary operation.

SLIDE 5

Edit sequence

  • Edit sequence = sequence of elementary edit operations
  • For each pair of words X and Y, many edit sequences exist that transform X into Y.
  • Example 1: transforming sorting into string:

– sorting → srting → sting → string (3 operations)
– sorting → sotring → string (2 operations)
– sorting → srting → string (2 operations)
– sorting → strting → string (2 operations)
– sorting → srting → sting → sing → sring → string (5 operations)
– ...

  • Example 2: transforming abc into ca:

– abc → ac → ca (2 operations)
– abc → cabc → cac → ca (3 operations)

  • From now on, we’ll be interested in linear edit sequences (Du & Chang 1992), i.e. such that the operations are performed from left to right, and no further operation may alter the result of a previous operation.

[Slide annotations: four of the sequences above are labelled "Linear sequence".]

SLIDE 6

Edit (error) distance

  • Cost of an edit sequence = sum of the costs of all elementary operations included in the sequence

– sorting → srting → sting → string (3 operations) ⇒ cost = 3
– sorting → sotring → string (2 operations) ⇒ cost = 2
– sorting → srting → sting → sing → sring → string (5 operations) ⇒ cost = 5

  • Edit distance (error distance) between two words X and Y, ed(X,Y) = minimal cost over all edit sequences transforming X into Y:

– ed(sorting, string) = 2
– ed(abc, ca) = 2, if all edit sequences are taken into account
– ed(abc, ca) = 3, if only the linear edit sequences are taken into account
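The definition "minimum over all edit sequences" can be checked directly on small words by a breadth-first search over single operations. The following is only an illustrative sketch, not a method from the slides: the names `neighbours` and `ed_all_sequences` are hypothetical, every operation is assumed to cost 1, and the alphabet is assumed to be the letters of the two words.

```python
from collections import deque


def neighbours(word, alphabet):
    """All words reachable from `word` by one elementary operation."""
    out = set()
    for i in range(len(word) + 1):
        for c in alphabet:
            out.add(word[:i] + c + word[i:])                      # insertion
    for i in range(len(word)):
        out.add(word[:i] + word[i + 1:])                          # deletion
        for c in alphabet:
            out.add(word[:i] + c + word[i + 1:])                  # replacement
    for i in range(len(word) - 1):
        out.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])  # transposition
    out.discard(word)
    return out


def ed_all_sequences(x, y):
    """Minimal number of unit-cost operations over all edit sequences."""
    alphabet = set(x) | set(y)  # assumption: letters of x and y are enough
    seen, queue = {x}, deque([(x, 0)])
    while queue:
        word, cost = queue.popleft()
        if word == y:
            return cost
        for nxt in neighbours(word, alphabet):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, cost + 1))
```

Because the search is not restricted to linear sequences, `ed_all_sequences("abc", "ca")` finds the two-step sequence abc → ac → ca. The search space grows quickly with word length, so this is useful only as a cross-check on short examples.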

SLIDE 7

Calculating the edit distance (1/4)

Notation: word X = x_1 x_2 ... x_i ... x_n; the prefix of length i of X: X[i] = x_1 x_2 ... x_i

It is possible to calculate the distance between two prefixes X[i+1] and Y[j+1] on the basis of the distances between shorter prefixes. There are 3 cases.

If x_{i+1} = y_{j+1} then ed(X[i+1],Y[j+1]) = ed(X[i],Y[j])

SLIDE 8

Calculating the edit distance (2/4)

If x_i = y_{j+1} and x_{i+1} = y_j (the last 2 characters may be transposed) then 4 sub-cases are possible:

  • The cheapest sequence transforming X[i+1] into Y[j+1] contains a transposition of x_i and x_{i+1}: ed(X[i+1],Y[j+1]) = ed(X[i-1],Y[j-1]) + 1 (transposition cost)
  • The cheapest sequence transforming X[i+1] into Y[j+1] contains the replacement of x_{i+1} by y_{j+1}: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1 (replacement cost)
  • The cheapest sequence transforming X[i+1] into Y[j+1] contains the insertion of y_{j+1}: ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1 (insertion cost)
  • The cheapest sequence transforming X[i+1] into Y[j+1] contains the deletion of x_{i+1}: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1 (deletion cost)

SLIDE 9

Calculating the edit distance (3/4)

OTHERWISE (if x_{i+1} ≠ y_{j+1}, and (x_i ≠ y_{j+1} or x_{i+1} ≠ y_j)) 3 sub-cases are possible:

  • The cheapest sequence transforming X[i+1] into Y[j+1] contains the replacement of x_{i+1} by y_{j+1}: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1 (replacement cost)
  • The cheapest sequence transforming X[i+1] into Y[j+1] contains the insertion of y_{j+1}: ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1 (insertion cost)
  • The cheapest sequence transforming X[i+1] into Y[j+1] contains the deletion of x_{i+1}: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1 (deletion cost)

SLIDE 10

Calculating the edit distance (4/4)

Edit distance between X[i] and Y[j], recursive definition. For i=0,...,m, j=0,...,n:

1° ed(X[-1],Y[j]) = ed(X[i],Y[-1]) = max(m,n)

2° ed(X[0],Y[j]) = j, ed(X[i],Y[0]) = i

3° ed(X[i+1],Y[j+1]) =

– ed(X[i],Y[j]), if x_{i+1} = y_{j+1}
– 1 + min{ ed(X[i],Y[j]), ed(X[i+1],Y[j]), ed(X[i],Y[j+1]), ed(X[i-1],Y[j-1]) }, if x_i = y_{j+1} and x_{i+1} = y_j
– 1 + min{ ed(X[i],Y[j]), ed(X[i+1],Y[j]), ed(X[i],Y[j+1]) }, otherwise
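The recursive definition translates almost line by line into a table-filling routine. The sketch below is one possible rendering under stated assumptions: unit costs, 0-based Python strings, and an explicit bounds check (`i > 1 and j > 1`) in place of the boundary rule 1°. The function names are illustrative, not taken from the slides.

```python
def edit_distance_matrix(x, y):
    """d[i][j] = edit distance between the prefixes x[:i] and y[:j]."""
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # rule 2: delete all i letters
    for j in range(m + 1):
        d[0][j] = j                      # rule 2: insert all j letters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:     # rule 3, first case: last letters match
                d[i][j] = d[i - 1][j - 1]
                continue
            best = min(d[i - 1][j - 1],  # replacement
                       d[i][j - 1],      # insertion
                       d[i - 1][j])      # deletion
            # rule 3, second case: the last two letters are swapped
            if i > 1 and j > 1 and x[i - 2] == y[j - 1] and x[i - 1] == y[j - 2]:
                best = min(best, d[i - 2][j - 2])
            d[i][j] = 1 + best
    return d


def ed(x, y):
    """Edit distance between the full words."""
    return edit_distance_matrix(x, y)[-1][-1]
```

With this recurrence a transposed pair cannot be edited again, so `ed("abc", "ca")` evaluates to 3 rather than 2, in line with the value quoted for linear edit sequences.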

SLIDE 11

Calculating the edit distance: dynamic programming

Cell [n,m] contains the edit distance between the 2 words. Cell [i,j] contains the edit distance between the prefix [1,...,i] of one word and the prefix [1,...,j] of the other word.

      ε  s  o  r  t  i  n  g
  ε   0  1  2  3  4  5  6  7
  s   1  0  1  2  3  4  5  6
  t   2  1  1  2  2  3  4  5
  r   3  2  2  1  2  3  4  5
  i   4  3  3  2  2  2  3  4
  n   5  4  4  3  3  3  2  3
  g   6  5  5  4  4  4  3  2

SLIDE 12

Dynamic programming: case 1

      ε  s  o  r  t  i  n  g
  ε   0  1  2  3  4  ?  ?  ?
  s   1  0  1  2  3  ?  ?  ?
  t   2  1  1  2 [2] ?  ?  ?
  r   ?  ?  ?  ?  ?  ?  ?  ?
  i   ?  ?  ?  ?  ?  ?  ?  ?
  n   ?  ?  ?  ?  ?  ?  ?  ?
  g   ?  ?  ?  ?  ?  ?  ?  ?

Cell being computed: row i+1 (t), column j+1 (t), shown in brackets.

x_{i+1} = y_{j+1}

SLIDE 13

Dynamic programming: case 2

      ε  s  o  r  t  i  n  g
  ε   0  1  2  3  4  ?  ?  ?
  s   1  0  1  2  3  ?  ?  ?
  t   2  1  1  2  2  ?  ?  ?
  r   3  2  2  1 [2] ?  ?  ?
  i   ?  ?  ?  ?  ?  ?  ?  ?
  n   ?  ?  ?  ?  ?  ?  ?  ?
  g   ?  ?  ?  ?  ?  ?  ?  ?

Cell being computed: row i+1 (r), column j+1 (t), shown in brackets.

x_i = y_{j+1} and x_{i+1} = y_j

SLIDE 14

Dynamic programming: case 3

      ε  s  o  r  t  i  n  g
  ε   0  1  2  3  4  ?  ?  ?
  s   1  0  1  2  3  ?  ?  ?
  t   2  1  1  2  2  ?  ?  ?
  r   3  2  2  1  2  ?  ?  ?
  i   4  3  3  2 [2] ?  ?  ?
  n   ?  ?  ?  ?  ?  ?  ?  ?
  g   ?  ?  ?  ?  ?  ?  ?  ?

Cell being computed: row i+1 (i), column j+1 (t), shown in brackets.

x_{i+1} ≠ y_{j+1} and (x_i ≠ y_{j+1} or x_{i+1} ≠ y_j)

SLIDE 15

String-to-language correction

SLIDE 16

String-to-language correction: problem definition

  • CONTEXT:

– Finite set of symbols (alphabet)
– Elementary edit operations on symbols (as before) with their costs (1 per operation)
– Edit sequences (as before)
– Edit distance (error distance) between words: as before

  • INPUT:

– Regular grammar describing words (a finite set of words in particular)
– Incorrect word A (unrecognizable by the grammar)
– Threshold t

  • OUTPUT:

– A set of correct words B1, B2, …, Bn whose distance from A stays within t (the nearest neighbors of A)

SLIDE 17

String-to-language correction: simplistic approach

  • METHOD:

– For each word B recognizable by the grammar, calculate the edit distance matrix between A and B.
– Propose candidates whose distance from A does not exceed the threshold t (ed(A,B) ≤ t).

  • FEASIBILITY:

– Impossible in case of infinite languages

  • COMPLEXITY:

O(n * m * |D|)
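For a finite word list the simplistic method is a plain loop: one full distance computation per dictionary word, followed by a threshold test. The sketch below assumes unit costs and a dictionary given as a Python list; `ed` and `nearest_neighbours` are illustrative names, and `ed` keeps only the last two matrix rows rather than the whole matrix.

```python
def ed(a, b):
    """Edit distance with unit costs, adjacent transpositions included."""
    prev2, prev = None, list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i]
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur.append(prev[j - 1])
                continue
            best = min(prev[j - 1], cur[j - 1], prev[j])
            if i > 1 and j > 1 and a[i - 2] == b[j - 1] and a[i - 1] == b[j - 2]:
                best = min(best, prev2[j - 2])
            cur.append(1 + best)
        prev2, prev = prev, cur
    return prev[-1]


def nearest_neighbours(word, dictionary, t):
    """Map each dictionary word within distance t of `word` to its distance."""
    result = {}
    for candidate in dictionary:          # one full matrix per dictionary word
        distance = ed(word, candidate)
        if distance <= t:
            result[candidate] = distance
    return result
```

Every dictionary word is compared from scratch, even when many words share a long prefix, which is the redundancy the FSA-based exploration is designed to avoid.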

SLIDE 18

String-to-language correction: threshold-controlled depth-first exploration of an FSA

(Oflazer 1996, …)

SLIDE 19

String correction with respect to a deterministic FSA (1/4)

Word to be corrected: *aply, threshold 2

[Figure: deterministic FSA with states 1 to 9 and transitions labelled a, p, p, l, y, e, s, p, l, y, e, a]

      ε  a  p  p  l  e
  ε   0  1  2  3  4  5
  a   1  0  1  2  3  4
  p   2  1  0  1  2  3
  l   3  2  1  1  1  2
  y   4  3  2  2  2  2

  • Part of the matrix is calculated only once for all valid words sharing the same prefix appl
  • Each time a transition is followed, a new column is calculated in the edit distance matrix
  • If we get to a final state and the edit distance remains within the threshold, a new candidate has been found

Candidates found: apple

SLIDE 20

String correction with respect to a deterministic FSA (2/4)

Word to be corrected: *aply, threshold 2

      ε  a  p  p  l  e  s
  ε   0  1  2  3  4  5  6
  a   1  0  1  2  3  4  5
  p   2  1  0  1  2  3  4
  l   3  2  1  1  1  2  3
  y   4  3  2  2  2  2  3

Candidates found: apple

SLIDE 21

String correction with respect to a deterministic FSA (3/4)

Word to be corrected: *aply, threshold 2

      ε  a  p  p  l  e
  ε   0  1  2  3  4  5
  a   1  0  1  2  3  4
  p   2  1  0  1  2  3
  l   3  2  1  1  1  2
  y   4  3  2  2  2  2

  • Backtracking results in deleting the current column (here the column for s: 6 5 4 3 3)

Candidates found: apple

SLIDE 22

String correction with respect to a deterministic FSA (4/4)

Word to be corrected: *aply, threshold 2

      ε  a  p  p  l  y
  ε   0  1  2  3  4  5
  a   1  0  1  2  3  4
  p   2  1  0  1  2  3
  l   3  2  1  1  1  2
  y   4  3  2  2  2  1

Candidates found: apple, apply
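The walk-through on slides 19 to 22 can be imitated on a trie, which is a simple deterministic acyclic automaton. The sketch below is written in the spirit of the threshold-controlled exploration and is not Oflazer's published procedure: `build_trie`, `correct`, and the `END` marker are hypothetical names, costs are 1 per operation, and a path is abandoned when the smallest value in its newest column exceeds the threshold.

```python
END = ""  # marker key for final states; it can never clash with a letter


def build_trie(words):
    """Nested dictionaries: one level per letter, END marks a complete word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def correct(word, trie, t):
    """Map each dictionary word within distance t of `word` to its distance."""
    n = len(word)
    found = {}

    def dfs(node, path, col, prev_col):
        # col[i] = distance between word[:i] and path; prev_col is the
        # column computed one transition earlier (needed for transpositions).
        if END in node and col[n] <= t:
            found[path] = col[n]
        for ch, child in node.items():
            if ch == END:
                continue
            new = [col[0] + 1]                     # new column for path + ch
            for i in range(1, n + 1):
                if word[i - 1] == ch:
                    new.append(col[i - 1])
                    continue
                best = min(col[i - 1], new[i - 1], col[i])
                if (i > 1 and prev_col is not None
                        and word[i - 2] == ch and word[i - 1] == path[-1]):
                    best = min(best, prev_col[i - 2])
                new.append(1 + best)
            if min(new) <= t:                      # otherwise the path is cut off
                dfs(child, path + ch, new, col)
            # returning from dfs discards `new`: backtracking deletes the column

    dfs(trie, "", list(range(n + 1)), None)
    return found
```

For example, `correct("aply", build_trie(["apple", "apples", "apply"]), 2)` computes the columns for a, p, p, l once, accepts apple at distance 2, abandons the transition to s, and then accepts apply at distance 1.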

SLIDE 23

Controlling the search space by the threshold

Word to be corrected: abcbb, t=2

[Figure: FSA with states 1, 2, 8, 9 and transitions labelled a, c, d, b, b]

            -2  -1   0   1   2   3   4   5   6
                 ε   a   b   b   b   b   b   b
  -2         +   +   +   +   +   +   +   +   +
  -1   ε     +   0   1   2   3   4   5   6   7
   0   a     +   1   0   1   2   3   4   5   6
   1   b     +   2   1   0   1   2   3   4   5
   2   c     +   3   2   1   1   2   3   4   5
   3   b     +   4   3   2   1   1   2   3   4
   4   b     +   5   4   3   2   1   1   2   3

  • If the current column exceeds the threshold, the whole path is cut off

SLIDE 24

Tree-to-tree correction

SLIDE 25

Tree-to-tree correction

(Selkow 1977, …)

  • CONTEXT:

– Finite set of node symbols (alphabet)
– Elementary edit operations on trees:

  • Insertion of a leaf
  • Deletion of a leaf
  • Renaming of a node (leaf or internal node)

– Non-negative cost for each elementary operation
– Edit sequences (sequences of edit operations) with their costs (the sum of the costs of the editing operations involved)
– Edit distance between two trees A and B: the minimum cost over all edit sequences transforming A into B

  • INPUT:

– Two trees A and B

  • OUTPUT:

– Distance between A and B

SLIDE 26

Comparing two trees

(Selkow 1977, …)

  • A partial tree A0:i is the root of A and its subtrees A0,...,Ai
  • The comparison is based on comparing roots, and then recursively comparing the roots’ subtrees

[Figure: tree A with root(A) and subtrees A0, A1, A2; tree B with root(B) and subtrees B0, B1, B2, B3; the partial trees A0:1 and B0:2 are marked]

SLIDE 27

Edit distance matrix between two trees

(Selkow 1977, …)

Cell [-1,-1] contains the cost of renaming root(A) into root(B). Cell [n,m] contains the edit distance between the 2 trees. Cell [i,j] contains the edit distance between the partial trees A0:i and B0:j.

        -1    0    1    2    3
  -1     1    4   14   15   16
   0     4    2   12   13   14
   1    15   13    3    4    5
   2    16   14    4    4    4

Rows are indexed by i (up to n), columns by j (up to m).

SLIDE 28

Calculation of the tree matrix

(Selkow 1977, …)

        -1    0    1    2    3
  -1     1    4   14   15   16
   0     4    2   12   13   14
   1    15   13    3    4    5
   2    16   14    4    4    ?

The cell [i,j] marked ? is obtained by:

  • Adding the cost of inserting Bj (here +1)
  • Adding the edit distance between Ai and Bj (here +0)
  • Adding the cost of deleting Ai (here +1)
  • Taking the minimum (here min(4+0, 5+1, 4+1) = 4)
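The matrix computation described above can be written as a short recursive routine. The sketch below follows the three-way minimum from the slide under assumptions of its own: a tree is a `(label, children)` tuple, every elementary operation costs 1, so inserting or deleting a whole subtree costs its number of nodes, and the slide's index -1 becomes index 0. The names `size` and `tree_ed` are illustrative.

```python
def size(tree):
    """Number of nodes: the unit-cost price of inserting or deleting
    the whole subtree one leaf at a time."""
    _, children = tree
    return 1 + sum(size(c) for c in children)


def tree_ed(a, b):
    """Edit distance between trees a and b, compared root to root."""
    (label_a, kids_a), (label_b, kids_b) = a, b
    n, m = len(kids_a), len(kids_b)
    # M[i][j] compares the roots plus the first i subtrees of a and the
    # first j subtrees of b; M[0][0] is the slide's cell [-1,-1].
    M = [[0] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0 if label_a == label_b else 1          # renaming the root
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0] + size(kids_a[i - 1])   # delete subtree
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] + size(kids_b[j - 1])   # insert subtree
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = min(
                M[i - 1][j - 1] + tree_ed(kids_a[i - 1], kids_b[j - 1]),
                M[i][j - 1] + size(kids_b[j - 1]),    # insert B's subtree
                M[i - 1][j] + size(kids_a[i - 1]),    # delete A's subtree
            )
    return M[n][m]
```

For instance, with `A = ("a", [("b", []), ("c", [])])` and `B = ("a", [("b", []), ("d", []), ("c", [])])`, the call `tree_ed(A, B)` accounts for the single inserted leaf d. The recursion recomputes subtree distances, so a practical version would cache them.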

SLIDE 29

Extension to the correction of XML documents

  • The validity of a node is described by a set of regular expressions, e.g. E = ab*c + db*
  • The "horizontal" correction on a siblings’ level is similar to the string-to-language correction (Oflazer 1996)
  • The "vertical" correction is inspired by the tree-to-tree correction (Selkow 1977)

[Figure: example XML document tree built from the elements <root>, <x>, <y>, <z>, <a>, <b>, <c>, <b>, <b>]
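The validity condition in the first bullet can be illustrated with an ordinary regular-expression library. This is only a sketch under simplifying assumptions: element names are single letters, the `+` in E = ab*c + db* is read as union (written `|` in Python), and `children_valid` is a hypothetical helper name.

```python
import re

# E = ab*c + db*, with "+" meaning union in formal-language notation.
E = re.compile(r"ab*c|db*")


def children_valid(labels):
    """True if the left-to-right sequence of child labels is a word of E."""
    return E.fullmatch("".join(labels)) is not None
```

A node whose children are a, b, b, c passes the check, while a node with children a, b, b does not; the second node is the kind of input that the horizontal correction would have to repair.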

SLIDE 30

Main idea

  • String-to-string (Wagner & Fischer 1974)
  • String-to-(regular) language (Oflazer 1996)
  • Tree-to-tree (Selkow 1977)
  • Tree-to-(regular) tree language (Cheriat, Savary, Bouchou, Halfeld, to be continued)

SLIDE 31

Edit distance matrix with edit sequences

Cell [i,j] contains the edit distance between the partial trees A0:i and B0:j, and the edit sequence necessary to transform A0:i into B0:j

[Figure: matrix with rows indexed by i and columns by j, starting at -1; one cell is expanded to show its content:]

[3, <(R,0.1,f),(D,1.1,/),(I,2,e)>]

SLIDE 32

Bibliography

  • Clarke, G., Barnard, D. T., Duncan, N. (1995): Tree-to-tree Correction for Document Trees. Technical Report 95-372, Department of Computing and Information Science, Queen’s University, Kingston, Ontario
  • Du, M. W., Chang, S. C. (1992): A model and a fast algorithm for multiple errors spelling correction. Acta Informatica, Vol. 29. Springer Verlag, pp. 281-302
  • Hall, P., Dowling, G. (1980): Approximate String Matching. ACM Computing Surveys, Vol. 12(4). ACM, New York, pp. 381-402
  • Lowrance, R., Wagner, R. A. (1975): An Extension of the String-to-String Correction Problem. Journal of the ACM, Vol. 22(2), pp. 177-183
  • Mihov, S., Schulz, K. (2004): Fast approximate search in large dictionaries. Computational Linguistics, Vol. 30(4). MIT Press, Cambridge, Massachusetts, pp. 451-477
  • Oflazer, K. (1996): Error-tolerant finite state recognition with applications to morphological analysis and spelling correction. Computational Linguistics, Vol. 22(1). MIT Press, Cambridge, Massachusetts, pp. 73-89
  • Selkow, S. (1977): The tree-to-tree editing problem. Information Processing Letters, 6(6), pp. 184-186
  • Wagner, R. A. (1974): Order-n Correction for Regular Languages. Communications of the ACM, 17(5), pp. 265-268
  • Wagner, R. A., Fischer, M. J. (1974): The String-to-String Correction Problem. Journal of the ACM, Vol. 21(1), pp. 168-173

SLIDE 33

Some details of the state of the art

  • Wagner & Fischer (1974):

– Elegant and solid theoretical definition of the string-to-string correction problem
– 3 elementary operations on single letters admitted (insertion, deletion, replacement)
– Model of a trace describing the edit distance between two strings
– Dynamic programming method

  • Lowrance & Wagner (1975):

– Additional elementary operation: transposition of two adjacent letters
– Restriction of the cost function

  • Du & Chang (1992):

– Cost 1 for each elementary operation
– Restriction to linear editing sequences
– Application to the nearest neighbor search in a dictionary, with a threshold

  • Oflazer (1996):

– Nearest-neighbor search in finite-state automata
– Application to large natural-language dictionaries

  • Selkow (1977), Tai (1979), Zhang & Shasha (1989), Clarke, Barnard & Duncan (1995), de Rougemont (2003):

– Tree-to-tree correction problem

  • Mihov & Schulz (2004):

– Levenshtein automaton
– Backward dictionary

  • Bouchou, B. & Halfeld Ferrari Alves, M. (2003):

– Incremental validation of XML documents resulting from updates: human-computer interaction