Similarity and Correction of Strings and Trees : Towards a Correction of XML Documents
Agata SAVARY Université-François Rabelais de Tours, Campus de Blois, Laboratoire d’Informatique
IPIPAN Seminar, 24 April 2006
– Finite set of symbols (alphabet)
– Elementary operations on symbols (editing operations, e.g. deletion, insertion, or replacement of a letter, inversion of two adjacent letters) with their costs (usually 1 per operation)
– Sequences of editing operations (edit sequences; each operation applies to a word resulting from the previous operations) with their costs (sums of the costs of the editing operations involved)
– Measure of similarity between words A and B (edit distance or error distance): minimum cost of all edit sequences transforming A into B
– Two words A and B
– Distance between A and B
Each elementary operation has a non-negative cost. From now on we assume cost 1 for each elementary operation.
X into Y.
– sorting → srting → sting → string (3 operations)
– sorting → sotring → string (2 operations)
– sorting → srting → string (2 operations)
– sorting → strting → string (2 operations)
– sorting → srting → sting → sing → sring → string (5 operations)
– .................
– abc → ac → ca (2 operations)
– abc → cabc → cac → ca (3 operations)
1992), i.e. such that the operations are performed from left to right, and no further operation may alter the result of a previous operation.
Linear sequence
– sorting → srting → sting → string (3 operations), cost = 3
– sorting → sotring → string (2 operations), cost = 2
– sorting → srting → sting → sing → sring → string (5 operations), cost = 5
ed(sorting, string) = 2
ed(abc, ca) = 2, if all edit sequences are taken into account
ed(abc, ca) = 3, if only the linear edit sequences are taken into account
Notation: word X = x_1 x_2 ... x_i ... x_n; the prefix of length i of X: X[i] = x_1 x_2 ... x_i
The distance between two prefixes X[i+1] and Y[j+1] can be calculated from the distances between shorter prefixes. There are 3 cases:
If x_{i+1} = y_{j+1} then ed(X[i+1],Y[j+1]) = ed(X[i],Y[j])
If x_i = y_{j+1} and x_{i+1} = y_j (the last 2 characters may be transposed) then 4 sub-cases are possible:
– the edit sequence transforming X[i+1] into Y[j+1] contains a transposition of x_i and x_{i+1}: ed(X[i+1],Y[j+1]) = ed(X[i-1],Y[j-1]) + 1 (transposition's cost)
– x_{i+1} is replaced by y_{j+1}: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1 (replacement's cost)
– y_{j+1} is inserted: ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1 (insertion's cost)
– x_{i+1} is deleted: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1 (deletion's cost)
OTHERWISE (if x_{i+1} ≠ y_{j+1}, and (x_i ≠ y_{j+1} or x_{i+1} ≠ y_j)) 3 sub-cases are possible:
– x_{i+1} is replaced by y_{j+1}: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j]) + 1 (replacement's cost)
– y_{j+1} is inserted: ed(X[i+1],Y[j+1]) = ed(X[i+1],Y[j]) + 1 (insertion's cost)
– x_{i+1} is deleted: ed(X[i+1],Y[j+1]) = ed(X[i],Y[j+1]) + 1 (deletion's cost)
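2° $\mathrm{ed}(X[0], Y[j]) = j$, $\mathrm{ed}(X[i], Y[0]) = i$

3°
$$
\mathrm{ed}(X[i+1], Y[j+1]) =
\begin{cases}
\mathrm{ed}(X[i], Y[j]) & \text{if } x_{i+1} = y_{j+1} \\
1 + \min\{\mathrm{ed}(X[i], Y[j]),\ \mathrm{ed}(X[i+1], Y[j]),\ \mathrm{ed}(X[i], Y[j+1]),\ \mathrm{ed}(X[i-1], Y[j-1])\} & \text{if } x_i = y_{j+1} \text{ and } x_{i+1} = y_j \\
1 + \min\{\mathrm{ed}(X[i], Y[j]),\ \mathrm{ed}(X[i+1], Y[j]),\ \mathrm{ed}(X[i], Y[j+1])\} & \text{otherwise}
\end{cases}
$$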
Cell [n,m] contains the edit distance between the 2 words.
Cell [i,j] contains the edit distance between the prefix [1..i] of one word and the prefix [1..j] of the other word.
x_{i+1} = y_{j+1}
x_i = y_{j+1} and x_{i+1} = y_j
x_{i+1} ≠ y_{j+1} and (x_i ≠ y_{j+1} or x_{i+1} ≠ y_j)
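The three cases of the recurrence can be sketched as a dynamic-programming routine. This is a minimal Python sketch, assuming unit costs for insertion, deletion, replacement and transposition; the function name `edit_distance` and the table name `d` are illustrative and do not come from the slides. Because the transposition case only inspects the last two characters of each prefix, the value computed is the distance over linear edit sequences.

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance over linear edit sequences (sketch)."""
    n, m = len(x), len(y)
    # d[i][j] holds ed(X[i], Y[j]) for the prefixes of length i and j.
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i letters
    for j in range(m + 1):
        d[0][j] = j  # insert all j letters
    for i in range(n):
        for j in range(m):
            # x[i] is x_{i+1} and y[j] is y_{j+1} in the 1-based notation.
            if x[i] == y[j]:
                d[i + 1][j + 1] = d[i][j]
                continue
            # replacement, insertion, deletion
            best = min(d[i][j], d[i + 1][j], d[i][j + 1])
            # transposition: x_i = y_{j+1} and x_{i+1} = y_j
            if i > 0 and j > 0 and x[i - 1] == y[j] and x[i] == y[j - 1]:
                best = min(best, d[i - 1][j - 1])
            d[i + 1][j + 1] = best + 1
    return d[n][m]
```

With this sketch, `edit_distance("sorting", "string")` evaluates to 2 and `edit_distance("abc", "ca")` to 3, which agrees with the values given for linear edit sequences.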
– Finite set of symbols (alphabet)
– Elementary edit operations on symbols (as before) with their costs (1 per operation)
– Edit sequences (as before)
– Edit distance (error distance) between words: as before

– Regular grammar describing words (a finite set of words in particular)
– Incorrect word A (unrecognizable by the grammar)
– Threshold t
– A set of correct words B1, B2, …, Bn whose distance from A stays within t (the nearest neighbors of A)
– For each word B recognizable by the grammar, calculate the edit distance matrix between A and B.
– Propose the candidates whose distance from A does not exceed the threshold t (ed(A,B) ≤ t).
– Impossible in case of infinite languages
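For a finite dictionary, the exhaustive method described above can be sketched as follows. This is a hedged illustration, not the author's implementation: the names `ed` and `nearest_neighbors` are hypothetical, the dictionary is assumed to be a plain Python list, and all operations are assumed to cost 1.

```python
def ed(x: str, y: str) -> int:
    """Unit-cost edit distance over linear edit sequences (with transposition)."""
    d = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        d[i][0] = i
    for j in range(len(y) + 1):
        d[0][j] = j
    for i in range(len(x)):
        for j in range(len(y)):
            if x[i] == y[j]:
                d[i + 1][j + 1] = d[i][j]
                continue
            best = min(d[i][j], d[i + 1][j], d[i][j + 1])
            if i > 0 and j > 0 and x[i - 1] == y[j] and x[i] == y[j - 1]:
                best = min(best, d[i - 1][j - 1])
            d[i + 1][j + 1] = best + 1
    return d[-1][-1]


def nearest_neighbors(a: str, dictionary: list, t: int) -> list:
    """Compute one full matrix per dictionary word and keep those within t."""
    scored = [(ed(a, b), b) for b in dictionary]
    return sorted((dist, b) for dist, b in scored if dist <= t)
```

The cost is one full matrix per dictionary word, so the work grows with the size of the dictionary, and the loop cannot terminate at all if the set of valid words is infinite.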
Word to be corrected: *aply, threshold 2
[Figure: automaton with states 1 to 9 and letter-labeled transitions, shown next to the edit distance matrix with the current column marked]
– Part of the matrix is calculated only once for all valid words sharing the same prefix appl.
– A new column is calculated in the edit distance matrix.
– The distance remains within the threshold: a new candidate has been found (apple, then apply).
Word to be corrected: abcbb, t = 2
[Figure: automaton path explored column by column in the edit distance matrix]
The distance exceeds the threshold: the whole path is cut.
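The two ideas illustrated here, sharing matrix columns among words with a common prefix and cutting a path once the threshold is exceeded, can be sketched in Python. Several things are assumptions made for the illustration: the dictionary is stored as a trie built from nested dicts (a simple special case of an automaton), the key `"$"` marks the end of a word, the names `build_trie` and `neighbors` are hypothetical, and the cut rule used is the simplest one (abandon a path when every entry of its current column exceeds t), which may differ from the cut-off criterion shown on the slides.

```python
def build_trie(words):
    """Nested-dict trie; "$" marks end of word (assumed not to be a letter)."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def neighbors(trie, word, t):
    """Return {candidate: distance} for trie words within distance t of word."""
    found = {}

    def visit(node, prefix, prev2, prev):
        # prev: column for `prefix`; prev2: column for `prefix` minus its last letter.
        if node.get("$") and prev[-1] <= t:
            found[prefix] = prev[-1]
        for ch, child in node.items():
            if ch == "$":
                continue
            col = [prev[0] + 1]  # one new column per transition followed
            for i in range(1, len(word) + 1):
                if word[i - 1] == ch:
                    cost = prev[i - 1]
                else:
                    cost = 1 + min(prev[i - 1], prev[i], col[i - 1])
                    if (prev2 is not None and i > 1
                            and word[i - 2] == ch and word[i - 1] == prefix[-1]):
                        cost = min(cost, 1 + prev2[i - 2])  # transposition
                col.append(cost)
            if min(col) <= t:  # otherwise the whole path is cut
                visit(child, prefix + ch, prev, col)

    visit(trie, "", None, list(range(len(word) + 1)))
    return found
```

For example, with the hypothetical dictionary `["apple", "apples", "apply", "ape", "play", "banana"]`, calling `neighbors(build_trie(dictionary), "aply", 2)` computes the columns for the prefix `appl` once and reuses them for `apple`, `apples` and `apply`.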
– Finite set of node symbols (alphabet)
– Elementary edit operations on trees:
– Non-negative cost for each elementary operation
– Edit sequences (sequences of edit operations) with their costs (sums of the costs of the editing operations involved)
– Edit distance between two trees A and B: minimum cost of all edit sequences transforming A into B
– Two trees A and B
– Distance between A and B
[Figure: tree A with root(A) and subtrees A0, A1, A2; tree B with root(B) and subtrees B0, B1, B2, B3; the partial trees A0:1 and B0:2 are marked]
Cell [-1,-1] contains the cost of renaming root(A) into root(B).
Cell [n,m] contains the edit distance between the 2 trees.
Cell [i,j] contains the edit distance between the partial trees A0:i and B0:j.
To fill a cell:
– Add the cost of inserting Bj (here +1)
– Add the edit distance between Ai and Bj (here +0)
– Add the cost of deleting Ai (here +1)
– Take the minimum (here min(4+0, 5+1, 4+1) = 4)
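The matrix filling described above can be sketched recursively in Python. This is an illustration under stated assumptions, not the method as presented in the talk: the `Node` class and the functions `size` and `tree_distance` are hypothetical names, renaming a node costs 1, and inserting or deleting a whole subtree is assumed to cost its number of nodes. Index 0 of the table plays the role of cell [-1,-1].

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def size(t: Node) -> int:
    """Number of nodes, used as the assumed cost of inserting or deleting t."""
    return 1 + sum(size(c) for c in t.children)


def tree_distance(a: Node, b: Node) -> int:
    n, m = len(a.children), len(b.children)
    # c[i + 1][j + 1] corresponds to cell [i, j]; c[0][0] to cell [-1, -1].
    c = [[0] * (m + 1) for _ in range(n + 1)]
    c[0][0] = 0 if a.label == b.label else 1  # renaming root(A) into root(B)
    for i in range(n):
        c[i + 1][0] = c[i][0] + size(a.children[i])  # delete A_0 .. A_i
    for j in range(m):
        c[0][j + 1] = c[0][j] + size(b.children[j])  # insert B_0 .. B_j
    for i in range(n):
        for j in range(m):
            c[i + 1][j + 1] = min(
                c[i][j] + tree_distance(a.children[i], b.children[j]),
                c[i + 1][j] + size(b.children[j]),  # insert B_j
                c[i][j + 1] + size(a.children[i]),  # delete A_i
            )
    return c[n][m]
```

The sketch recomputes subtree distances on every call; a practical version would cache them, which is omitted here for brevity.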
[Figure: XML document tree with elements <root>, <x>, <y>, <z>, <a>, <b>, <c>]
– String-to-string (Wagner & Fischer 1974)
– String-to-(regular) language (Oflazer 1996)
– Tree-to-tree (Selkow 1977)
– Tree-to-(regular) tree language (Cheriat, Savary, Bouchou, Halfeld, to be continued)
Cell [i,j] contains the edit distance between the partial trees A0:i and B0:j, and the edit sequence necessary to transform A0:i into B0:j
[3, <(R,0.1,f),(D,1.1,/),(I,2,e)>]
Queen’s University, Kingston, Ontario.
spelling correction. Acta Informatica, Vol. 29. Springer Verlag, pp. 281-302
Surveys, Vol. 12(4). ACM, New York, pp. 381-402
Computational Linguistics, Vol. 30(4). MIT Press, Cambridge, Massachusetts, pp. 451-477
morphological analysis and spelling correction. Computational Linguistics, Vol. 22(1). MIT Press, Cambridge, Massachusetts, pp. 73-89
6(6), pp. 184-186
Journal of the ACM, Vol. 21(1), pp. 168-173
– Elegant and solid theoretical definition of the string-to-string correction problem
– 3 elementary operations on single letters admitted (insertion, deletion, replacement)
– Model of a trace describing the edit distance between two strings
– Dynamic programming method

– Additional elementary operation: inversion of two adjacent letters
– Restriction of the cost function

– Cost 1 for each elementary operation
– Restriction to linear editing sequences
– Application to the nearest neighbor search in a dictionary, with a threshold

– Nearest-neighbor search in finite-state automata
– Application to large natural-language dictionaries
Rougemont (2003):
– Tree-to-tree correction problem
– Levenshtein automaton – Backward dictionary
– Incremental validation of XML documents resulting from updates: human-computer interaction