
SLIDE 1

Approximate Search of Regular Expressions Using Bit-Parallel Algorithms

Kristo Tammeoja, Jaak Vilo

Teooriapäevad (Theory Days), Rõuge, 2007

SLIDE 2

Regular expression

Syntax

Grouping and alternation: ( ) |

Quantifiers: *, +, ?, {m,n}, {m,}

Character classes (for example, [a-z])

3

SLIDE 4

Regular expression

Syntax

Grouping and alternation: ( ) |

Quantifiers: *, +, ?, {m,n}, {m,}

Character classes (for example, [a-z])

Matching as used in this presentation: the expression must match the whole string

Regular expression: A*

AAAAA: match

BAAAC: no match

4
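The two examples above (AAAAA matches, BAAAC does not) are consistent with whole-string matching rather than substring search. A minimal sketch with Python's standard `re` module, used here only as an illustration and not as the tool from the talk; the helper name `matches_whole` is hypothetical:

```python
import re

def matches_whole(pattern: str, text: str) -> bool:
    """True only if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

print(matches_whole("A*", "AAAAA"))  # True
print(matches_whole("A*", "BAAAC"))  # False

# A substring search would accept BAAAC, because A+ occurs inside it.
print(re.search("A+", "BAAAC") is not None)  # True
```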

SLIDE 5

Regular expression

R(E|G)(EX)* with 1 error allowed:

1:R(E|G)(EX)*

1:R<E|G>(EX)*

1:R(E|G)<EX>*

5

SLIDE 6

Regular expression

R(E|G)(EX)* with 1 error allowed:

1:R(E|G)(EX)*

1:R<E|G>(EX)*

1:R(E|G)<EX>*

Exact matches of R(E|G)(EX)*: RE, RG, REEX, RGEX, REEXEX

6

SLIDE 7

Regular expression

R(E|G)(EX)* with 1 error allowed:

1:R(E|G)(EX)*

1:R<E|G>(EX)*

1:R(E|G)<EX>*

Exact matches of R(E|G)(EX)*: RE, RG, REEX, RGEX, REEXEX

Strings with one error: RR, R, RREX, REGEX, REEEXEX, REERXEX

[Figure labels: subst., del. (G, E), ins.]

7

SLIDE 8

Regular expression

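R(E|G)(EX)* with 1 error allowed:

1:R(E|G)(EX)*

1:R<E|G>(EX)*

1:R(E|G)<EX>*

Exact matches of R(E|G)(EX)*: RE, RG, REEX, RGEX, REEXEX

Strings with one error: RR, R, RREX, REGEX, REEEXEX, REERXEX

[Figure labels: subst., del. (G, E), ins.; the first strings are marked match or no match, but the assignment of results to strings is not recoverable from the extracted text]

8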

SLIDE 9

Regular expression

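R(E|G)(EX)* with 1 error allowed:

1:R(E|G)(EX)*

1:R<E|G>(EX)*

1:R(E|G)<EX>*

Exact matches of R(E|G)(EX)*: RE, RG, REEX, RGEX, REEXEX

Strings with one error: RR, R, RREX, REGEX, REEEXEX, REERXEX

[Figure labels: subst., del. (G, E), ins.; every string is marked match or no match under the restricted patterns, but the assignment of results to strings is not recoverable from the extracted text]

9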

SLIDE 10

Glushkov’s automaton

R ( E | G ) ( E X ) *

10

SLIDE 11

Glushkov’s automaton

Character in RE = state in automaton

R E G E X

R ( E | G ) ( E X ) *

11

SLIDE 12

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

R ( E | G ) ( E X ) *

R E G E X

12

SLIDE 13

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

Transitions show which characters/positions can follow one another

R ( E | G ) ( E X ) *

[Figure: Glushkov automaton under construction; text read so far: R...]

13

SLIDE 14

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

Transitions show which characters/positions can follow one another

R ( E | G ) ( E X ) *

[Figure: Glushkov automaton under construction; text read so far: R...]

14

SLIDE 15

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

Transitions show which characters/positions can follow one another

R ( E | G ) ( E X ) *

[Figure: Glushkov automaton under construction; text read so far: R...]

15

SLIDE 16

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

Transitions show which characters/positions can follow one another

R ( E | G ) ( E X ) *

[Figure: Glushkov automaton under construction; texts read so far: RE... and RG...]

16

SLIDE 17

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

Transitions show which characters/positions can follow one another

R ( E | G ) ( E X ) *

[Figure: Glushkov automaton under construction; text read so far: RE...]

17

SLIDE 18

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

Transitions show which characters/positions can follow one another

R ( E | G ) ( E X ) *

[Figure: Glushkov automaton under construction; text read so far: REE...]

18

SLIDE 19

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

Transitions show which characters/positions can follow one another

R ( E | G ) ( E X ) *

[Figure: Glushkov automaton under construction; text read so far: RGE...]

19

SLIDE 20

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

Transitions show which characters/positions can follow one another

R ( E | G ) ( E X ) *

[Figure: Glushkov automaton under construction; text read so far: RGEX...]

20

SLIDE 21

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

Transitions show which characters/positions can follow one another

R ( E | G ) ( E X ) *

[Figure: Glushkov automaton under construction; text read so far: RGEXE...]

21

SLIDE 22

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

Transitions show which characters/positions can follow one another

R ( E | G ) ( E X ) *

[Figure: completed Glushkov automaton with states R, E, G, E, X and labeled transitions]

22

SLIDE 23

Glushkov’s automaton

All transitions entering a state are labeled by the same character

R ( E | G ) ( E X ) *

[Figure: completed Glushkov automaton with states R, E, G, E, X and labeled transitions]

23

SLIDE 24

Glushkov’s automaton

All transitions entering a state are labeled by the same character

R ( E | G ) ( E X ) *

[Figure: completed Glushkov automaton with states R, E, G, E, X and labeled transitions]

24

SLIDE 25

Glushkov’s automaton

All transitions entering a state are labeled by the same character

For example, after reading character ‘E’ only states with label ‘E’ can be active

[Figure: completed Glushkov automaton with states R, E, G, E, X and labeled transitions]

25
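The construction above can be sketched for R(E|G)(EX)* as a small set-based simulation. The follow sets and final states below were worked out by hand from the expression and are assumptions of this sketch, as are the names `LABELS`, `FOLLOW`, `FINAL`, `step` and `accepts`; none of them come from the slides.

```python
# State 0 is the extra initial state; states 1..5 are the characters
# R, E, G, E, X of R(E|G)(EX)* in left-to-right order.
LABELS = {1: "R", 2: "E", 3: "G", 4: "E", 5: "X"}

# FOLLOW[s] = states that can come directly after state s (derived by hand).
FOLLOW = {
    0: {1},      # the expression starts with R
    1: {2, 3},   # R is followed by E or G
    2: {4},      # (E|G) may be followed by the E of (EX)*
    3: {4},
    4: {5},      # E is followed by X
    5: {4},      # X may loop back to E because of the star
}
FINAL = {2, 3, 5}  # a match may end after E, G or X

def step(active, ch):
    """Move along every transition whose target state is labeled ch."""
    nxt = set()
    for s in active:
        for t in FOLLOW[s]:
            if LABELS[t] == ch:
                nxt.add(t)
    return nxt

def accepts(text):
    active = {0}
    for ch in text:
        active = step(active, ch)
    return bool(active & FINAL)

for word in ["RE", "RG", "REEX", "RGEX", "REEXEX", "RR", "REGEX"]:
    print(word, accepts(word))
```

Because a transition is taken only when the target state's label equals the character read, every state that is active after reading ‘E’ carries the label ‘E’, which is the property stated on the slide.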

SLIDE 26

Exact search

Simulation of NFA = changing the active states based on the character read from the text

We use bit-vectors (one bit for each state) to hold the active states

δ(D, a)

D – bit-vector of active states

a – character read

Returns the new bit-vector

2^|D| · |Σ| different sets of parameters

|D| – number of states in the automaton

|Σ| – size of the alphabet

26

SLIDE 27

Exact search

“After reading character ‘E’ only states with label ‘E’ can be active”, so ...

δ(D, a) = T[D] & B[a]

T[D] – states that can be reached from states in D by any character

B[a] – states that can be reached by character a

27

SLIDE 28

Exact search

δ(D, a) = T[D] & B[a]

Regular expression: AA|AB|AC

[Figure: automaton with an initial state and states A, A, A, B, A, C]

a      B[a]
‘A’    0111010
‘B’
‘C’

D          T[D]
1000000
0100000
...
0101010
...

28

SLIDE 29

Exact search

δ(D, a) = T[D] & B[a]

Regular expression: AA|AB|AC

[Figure: automaton with an initial state and states A, A, A, B, A, C]

a      B[a]
‘A’    0111010
‘B’    0000100
‘C’

D          T[D]
1000000
0100000
...
0101010
...

29

SLIDE 30

Exact search

δ(D, a) = T[D] & B[a]

Regular expression: AA|AB|AC

[Figure: automaton with an initial state and states A, A, A, B, A, C]

a      B[a]
‘A’    0111010
‘B’    0000100
‘C’    0000001

D          T[D]
1000000
0100000
...
0101010
...

30

SLIDE 31

Exact search

δ(D, a) = T[D] & B[a]

Regular expression: AA|AB|AC

[Figure: automaton with an initial state and states A, A, A, B, A, C]

a      B[a]
‘A’    0111010
‘B’    0000100
‘C’    0000001

D          T[D]
1000000    0101010
0100000
...
0101010
...

31

SLIDE 32

Exact search

δ(D, a) = T[D] & B[a]

Regular expression: AA|AB|AC

[Figure: automaton with an initial state and states A, A, A, B, A, C]

a      B[a]
‘A’    0111010
‘B’    0000100
‘C’    0000001

D          T[D]
1000000    0101010
0100000    0010000
...
0101010
...

32

SLIDE 33

Exact search

δ(D, a) = T[D] & B[a]

Regular expression: AA|AB|AC

[Figure: automaton with an initial state and states A, A, A, B, A, C]

a      B[a]
‘A’    0111010
‘B’    0000100
‘C’    0000001

D          T[D]
1000000    0101010
0100000    0010000
...
0101010    0010101
...

33

SLIDE 34

Exact search

δ(D, a) = T[D] & B[a]

Regular expression: AA|AB|AC

[Figure: automaton with an initial state and states A, A, A, B, A, C]

a      B[a]
‘A’    0111010
‘B’    0000100
‘C’    0000001

D          T[D]
1000000    0101010
0100000    0010000
...
0101010    0010101
...

δ(0101010, ‘A’)

34

SLIDE 35

Exact search

δ(D, a) = T[D] & B[a]

Regular expression: AA|AB|AC

[Figure: automaton with an initial state and states A, A, A, B, A, C]

a      B[a]
‘A’    0111010
‘B’    0000100
‘C’    0000001

D          T[D]
1000000    0101010
0100000    0010000
...
0101010    0010101
...

δ(0101010, ‘A’):

  0010101    T[D]
& 0111010    B[a]
= 0010000

35
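The worked example above can be reproduced with a short sketch that builds both tables for AA|AB|AC. The state numbering (0 for the initial state, 1 to 6 for the characters in order) and the names `LABELS`, `FOLLOW`, `bit`, `show` and `delta` are assumptions of this sketch; state 0 is kept as the leftmost bit so that the printed vectors read like those on the slides.

```python
N_STATES = 7          # initial state + positions A A | A B | A C
LABELS = "?AAABAC"    # LABELS[s] is the character of state s; state 0 has none
FOLLOW = {0: [1, 3, 5], 1: [2], 2: [], 3: [4], 4: [], 5: [6], 6: []}

def bit(state):
    # State 0 is the leftmost bit, as on the slides.
    return 1 << (N_STATES - 1 - state)

def show(mask):
    return format(mask, "0{}b".format(N_STATES))

# B[a]: states whose label is a.
B = {}
for s in range(1, N_STATES):
    B[LABELS[s]] = B.get(LABELS[s], 0) | bit(s)

# T[D]: states reachable from any state in D, for every possible D.
T = [0] * (1 << N_STATES)
for D in range(1 << N_STATES):
    for s in range(N_STATES):
        if D & bit(s):
            for t in FOLLOW[s]:
                T[D] |= bit(t)

def delta(D, a):
    return T[D] & B.get(a, 0)

print(show(B["A"]), show(B["B"]), show(B["C"]))
print(show(T[0b1000000]), show(T[0b0100000]), show(T[0b0101010]))
print(show(delta(0b0101010, "A")))  # 0010000
```

The table T has 2^7 = 128 entries here; precomputing it trades memory for a single lookup per text character.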

SLIDE 36

Exact search

D ← 100..00                // initial state active
F ← bit-vector of final states
For pos ∈ 1 ... n Do       // scanning text
    D ← T[D] & B[t_pos]
    If D & F ≠ 000..00 Then match
End of For

36
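A runnable sketch of the loop above, again for AA|AB|AC. The function name `search`, the returned list of end positions and the `anywhere` flag are assumptions of this sketch: the pseudocode as written only follows matches that start at the first text character, and `anywhere=True` re-activates the initial state after each character, which is one common way to search at every text position and is not taken from the slides.

```python
N = 7                                   # states: initial + A A | A B | A C
LABELS = "?AAABAC"
FOLLOW = {0: [1, 3, 5], 1: [2], 3: [4], 5: [6]}
FINAL_STATES = [2, 4, 6]

def bit(s):
    return 1 << (N - 1 - s)             # state 0 is the leftmost bit

B = {}
for s in range(1, N):
    B[LABELS[s]] = B.get(LABELS[s], 0) | bit(s)

T = [0] * (1 << N)
for D in range(1 << N):
    for s, targets in FOLLOW.items():
        if D & bit(s):
            for t in targets:
                T[D] |= bit(t)

def search(text, anywhere=False):
    """Return the 1-based text positions at which a match ends."""
    init = bit(0)
    F = 0
    for s in FINAL_STATES:
        F |= bit(s)
    D = init                            # initial state active
    ends = []
    for pos, ch in enumerate(text, start=1):
        D = T[D] & B.get(ch, 0)
        if D & F:
            ends.append(pos)
        if anywhere:
            D |= init                   # assumption: restart at every position
    return ends

print(search("AB"))                  # [2]
print(search("CAB"))                 # []
print(search("CAB", anywhere=True))  # [3]
```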

SLIDE 37

Approximate search

Errors

Insertion

Deletion

Substitution

37

SLIDE 38

Approximate search

When searching with k errors we make k+1 replicas of the automaton, one for each error level

Plus we need transitions for errors

[Figure: two copies of the automaton, labeled “No errors” and “Up to 1 error”; the transitions between the copies are still marked “?”]

38

SLIDE 39

Approximate search

R0, R1 – current bit-vectors

R0’, R1’ – bit-vectors after processing character c

R0’ = T[R0] & B[c]

R1’ = ?

39

SLIDE 40

Approximate search

R1’ = T[R1] & B[c] | ...

(terms in order: no errors)

No errors: same as in exact search

[Figure: “No errors” and “Up to 1 error” copies of the automaton; example text: EGEX]

40

SLIDE 41

Approximate search

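R1’ = T[R1] & B[c] | R0 | ...

(terms in order: no errors, del)

Del: active states remain the same

[Figure: “No errors” and “Up to 1 error” copies of the automaton R E G E X, with Σ-transitions leading from the first copy to the second; example text: RAEGEX]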

41

SLIDE 42

Approximate search

R1’ = T[R1] & B[c] | R0 | T[R0’] | ...

(terms in order: no errors, del, ins)

Ins: insert a new character after the current one; just one step in the automaton

[Figure: “No errors” and “Up to 1 error” copies of the automaton R E G E X, with Σ- and ε-transitions leading from the first copy to the second; example text: REEX]

42

SLIDE 43

Approximate search

R1’ = T[R1] & B[c] | R0 | T[R0’] | T[R0]

(terms in order: no errors, del, ins, subst)

Subst: similar to exact matching, except that we don’t care about the character read

[Figure: “No errors” and “Up to 1 error” copies of the automaton R E G E X, with two kinds of Σ-transitions and the ε-transitions leading from the first copy to the second; example text: RAGEX]

43
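The recurrence above can be tried out on the literal pattern REGEX that the figures appear to use. This is a minimal sketch under stated assumptions: for a literal pattern T is a plain shift, so it is written as a function `T(D)` instead of a table; the starting value of R1 (initial state plus one skipped pattern character) is not given on the slides and is an assumption; and the function checks whole-string matches. The name `matches_with_one_error` is hypothetical.

```python
PATTERN = "REGEX"            # literal pattern, as the figures appear to show
N = len(PATTERN) + 1         # state 0 is the initial state

def bit(s):
    return 1 << s            # in this sketch bit s (from the right) = state s

B = {}
for s, ch in enumerate(PATTERN, start=1):
    B[ch] = B.get(ch, 0) | bit(s)

FINAL = bit(N - 1)
ALL = (1 << N) - 1

def T(D):
    # For a literal pattern every state s leads to s + 1, so T is a shift.
    return (D << 1) & ALL

def matches_with_one_error(text):
    R0 = bit(0)
    R1 = R0 | T(R0)          # assumption: one pattern character may be skipped up front
    for c in text:
        Bc = B.get(c, 0)
        new_R0 = T(R0) & Bc
        #         no errors      del   ins          subst
        new_R1 = (T(R1) & Bc) | R0 | T(new_R0) | T(R0)
        R0, R1 = new_R0, new_R1
    return bool((R0 | R1) & FINAL)

for text in ["REGEX", "RAEGEX", "REEX", "RAGEX", "RAAGEX"]:
    print(text, matches_with_one_error(text))
```

RAEGEX, REEX and RAGEX are the example texts from the del, ins and subst slides; each is one edit away from REGEX, while RAAGEX needs two edits and is rejected.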

SLIDE 44

Error-free regions

Approximate search

R1’ = T[R1] & B[c] | R0 | T[R0’] | T[R0]

With no-error regions

R1’ = T[R1] & B[c] | R0 & I | Te[R0’] | Te[R0]

(terms in order: no errors, del, ins, subst)

44

SLIDE 45

Error-free regions

Deletion from text

Make two copies of ‘C’

Those characters that error-free regions can end with must be duplicated

Results with 1 error allowed:

Text       A(BC)+B    A<(BC)+>B    A<BC>+B
AXBCBCB    match      match        match
ABXCBCB    match      no match     no match
ABCBCXB    match      match        match
ABCXBCB    match      no match     match

45

SLIDE 46

Error-free regions

[Figure: automaton for A(BC)+B with states A, B, C, B, and automaton for A<(BC)+>B with states A, B, C1, C2, B, in which C is duplicated and the error-free region is marked]

46

SLIDE 47

Error-free regions

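[Figure: automaton for A<(BC)+>B with states A, B, C1, C2, B in the “No errors” and “Up to 1 error” copies; the error-free region is marked]

Without error-free regions: R1’ = T[R1] & B[c] | R0 | T[R0’] | T[R0]

With error-free regions: R1’ = T[R1] & B[c] | ...

(terms in order: no errors)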

47

SLIDE 48

Error-free regions

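[Figure: automaton for A<(BC)+>B with states A, B, C1, C2, B in the “No errors” and “Up to 1 error” copies; the error-free region is marked; Σ-transitions lead from every state of the first copy to the second]

Without error-free regions: R1’ = T[R1] & B[c] | R0 | T[R0’] | T[R0]

With error-free regions: R1’ = T[R1] & B[c] | ???

(terms in order: no errors, del)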

48

SLIDE 49

Error-free regions

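[Figure: automaton for A<(BC)+>B with states A, B, C1, C2, B in the “No errors” and “Up to 1 error” copies; the error-free region is marked; Σ-transitions remain only for four of the six states]

Without error-free regions: R1’ = T[R1] & B[c] | R0 | T[R0’] | T[R0]

With error-free regions: R1’ = T[R1] & B[c] | R0 & I | ...

I = 110011

(terms in order: no errors, del)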

49

SLIDE 50

Error-free regions

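[Figure: automaton for A<(BC)+>B with states A, B, C1, C2, B in the “No errors” and “Up to 1 error” copies; the error-free region is marked; ε-transitions shown]

Without error-free regions: R1’ = T[R1] & B[c] | R0 | T[R0’] | T[R0]

With error-free regions: R1’ = T[R1] & B[c] | R0 & I | ???

(terms in order: no errors, del, ins; ins uses ε-transitions)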

50

SLIDE 51

Error-free regions

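[Figure: automaton for A<(BC)+>B with states A, B, C1, C2, B in the “No errors” and “Up to 1 error” copies; the error-free region is marked; ε-transitions shown]

Without error-free regions: R1’ = T[R1] & B[c] | R0 | T[R0’] | T[R0]

With error-free regions: R1’ = T[R1] & B[c] | R0 & I | Te[R0’] | ...

(terms in order: no errors, del, ins; ins uses ε-transitions)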

51

SLIDE 52

Error-free regions

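[Figure: automaton for A<(BC)+>B with states A, B, C1, C2, B in the “No errors” and “Up to 1 error” copies; the error-free region is marked; Σ-transitions shown]

Without error-free regions: R1’ = T[R1] & B[c] | R0 | T[R0’] | T[R0]

With error-free regions: R1’ = T[R1] & B[c] | R0 & I | Te[R0’] | Te[R0]

(terms in order: no errors, del, ins, subst; subst uses Σ-transitions)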

52
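The final recurrence can be checked against the A<(BC)+>B column of the results table from the earlier slide. This is a sketch under stated assumptions: the transition sets `FOLLOW` (C1 loops back to B, C2 leads to the final B) and the error transitions `ERR_FOLLOW` (only the transitions outside the region) are one reading of the figure and are not spelled out in the slides; the mask I = 110011 is taken from the slide; the starting value of R1 and whole-string matching are assumptions as well. All names are hypothetical.

```python
# States for A<(BC)+>B with C duplicated: 0=start, 1=A, 2=B, 3=C1, 4=C2, 5=B.
LABELS = {1: "A", 2: "B", 3: "C", 4: "C", 5: "B"}
FOLLOW = {0: [1], 1: [2], 2: [3, 4], 3: [2], 4: [5], 5: []}
# Assumed error transitions: only those that stay outside the region <(BC)+>.
ERR_FOLLOW = {0: [1], 4: [5]}
N = 6

def bit(s):
    return 1 << (N - 1 - s)        # state 0 is the leftmost bit

def mask(states):
    m = 0
    for s in states:
        m |= bit(s)
    return m

def image(D, follow):
    """States reachable in one step from D (plays the role of T or Te)."""
    out = 0
    for s, targets in follow.items():
        if D & bit(s):
            out |= mask(targets)
    return out

B = {}
for s, ch in LABELS.items():
    B[ch] = B.get(ch, 0) | bit(s)

I = 0b110011                       # from the slide: states 0, 1, 4, 5
FINAL = bit(5)

def matches(text, region=True):
    te = ERR_FOLLOW if region else FOLLOW
    keep = I if region else mask(range(N))
    R0 = bit(0)
    R1 = R0 | image(R0, te)        # assumption: starting value of R1
    for c in text:
        Bc = B.get(c, 0)
        new_R0 = image(R0, FOLLOW) & Bc
        new_R1 = ((image(R1, FOLLOW) & Bc) | (R0 & keep)
                  | image(new_R0, te) | image(R0, te))
        R0, R1 = new_R0, new_R1
    return bool((R0 | R1) & FINAL)

for text in ["AXBCBCB", "ABXCBCB", "ABCBCXB", "ABCXBCB"]:
    print(text, matches(text, region=False), matches(text, region=True))
```

With `region=False` the sketch falls back to the unrestricted recurrence and accepts all four texts, as in the A(BC)+B column; with `region=True` it rejects ABXCBCB and ABCXBCB, as in the A<(BC)+>B column.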
