Approximate Search of Regular Expressions Using Bit-Parallel Algorithms


SLIDE 1

Approximate Search of Regular Expressions Using Bit-Parallel Algorithms

Kristo Tammeoja, Jaak Vilo
Teooriapäevad (Theory Days), Rõuge, 2007

SLIDE 2

Contents

Regular expression (RE) syntax
Glushkov’s automaton
Existing bit-parallel algorithms
    Exact matching
    Approximate matching
New feature added
    Error-free regions

SLIDE 3

Regular expression

Syntax
    Grouping: ( )
    Alternation: |
    Quantifiers: *, +, ?, {m,n}, {m,}
    Character classes (example: [a-z])

SLIDE 4

Regular expression

Syntax
    Grouping: ( )
    Alternation: |
    Quantifiers: *, +, ?, {m,n}, {m,}
    Character classes (example: [a-z])

Matching as used in this presentation: the whole string must match
    Regular expression: A*
    AAAAA    match
    BAAAC    no match

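The matching convention above corresponds to whole-string matching rather than substring search. A minimal sketch in Python, assuming the standard `re` module as a stand-in for the RE syntax of the slides (the helper name `matches` is hypothetical):

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Whole-string matching: the entire text must be generated by the RE."""
    return re.fullmatch(pattern, text) is not None

print(matches("A*", "AAAAA"))  # True
print(matches("A*", "BAAAC"))  # False: B and C are not covered by A*
```

With substring search (`re.search`) the second text would also report a match, which is why the distinction matters for the examples that follow.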
SLIDE 5

Regular expression

R(E|G)(EX)*
1 error allowed:
    1:R(E|G)(EX)*
    1:R<E|G>(EX)*
    1:R(E|G)<EX>*
< > encloses an error-free region

SLIDE 6

Regular expression

R(E|G)(EX)*
1 error allowed:
    1:R(E|G)(EX)*
    1:R<E|G>(EX)*
    1:R(E|G)<EX>*

Example strings: RE, RG, REEX, RGEX, REEXEX

SLIDE 7

Regular expression

R(E|G)(EX)*
1 error allowed:
    1:R(E|G)(EX)*
    1:R<E|G>(EX)*
    1:R(E|G)<EX>*

[Figure: example strings, including ones with a substitution (subst.), a deletion (del.), or an insertion (ins.)]

SLIDE 8

Regular expression

R(E|G)(EX)*
1 error allowed:
    1:R(E|G)(EX)*
    1:R<E|G>(EX)*
    1:R(E|G)<EX>*

[Figure: example strings with a substitution (subst.), a deletion (del.), or an insertion (ins.), partly marked match / no match under the patterns]

SLIDE 9

Regular expression

R(E|G)(EX)*
1 error allowed:
    1:R(E|G)(EX)*
    1:R<E|G>(EX)*
    1:R(E|G)<EX>*

[Figure: example strings with a substitution (subst.), a deletion (del.), or an insertion (ins.), each marked match / no match under the patterns]

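To make "1 error allowed" concrete for the plain pattern 1:R(E|G)(EX)* (without error-free regions), the following brute-force sketch tries every string within one insertion, deletion, or substitution of the text and checks it against the RE. This is only a reference check, not the bit-parallel method of the talk; the function names and the restriction to the alphabet "REGX" are assumptions made for the illustration.

```python
import re

def neighbours(text: str, alphabet: str) -> set[str]:
    """All strings at edit distance at most 1 from text."""
    out = {text}
    for i in range(len(text) + 1):
        for ch in alphabet:
            out.add(text[:i] + ch + text[i:])          # insert ch at position i
        if i < len(text):
            out.add(text[:i] + text[i + 1:])           # delete text[i]
            for ch in alphabet:
                out.add(text[:i] + ch + text[i + 1:])  # substitute text[i] by ch
    return out

def matches_with_one_error(pattern: str, text: str, alphabet: str = "REGX") -> bool:
    """True if some string within one edit of text matches the whole pattern."""
    return any(re.fullmatch(pattern, s) for s in neighbours(text, alphabet))

print(matches_with_one_error("R(E|G)(EX)*", "REGEX"))  # True (drop the G)
print(matches_with_one_error("R(E|G)(EX)*", "RXXX"))   # False
```

Only characters that occur in the RE need to be tried for insertions and substitutions, because the edited string has to match the RE completely.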
SLIDE 10

Glushkov’s automaton

R ( E | G ) ( E X ) *

SLIDE 11

Glushkov’s automaton

Character in RE = state in automaton

[Figure: states R, E, G, E, X, one per character of the RE]

R ( E | G ) ( E X ) *

SLIDE 12

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

R ( E | G ) ( E X ) *

[Figure: initial state plus states R, E, G, E, X]

SLIDE 13

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

Transitions show which characters/positions can precede each other

R ( E | G ) ( E X ) *

[Figure: automaton with states R, E, G, E, X; example prefix: R...]

SLIDE 14

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

Transitions show which characters/positions can precede each other

R ( E | G ) ( E X ) *

[Figure: automaton with states R, E, G, E, X; example prefix: R...]

SLIDE 16

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

Transitions show which characters/positions can precede each other

R ( E | G ) ( E X ) *

[Figure: automaton with states R, E, G, E, X; example prefixes: RE..., RG...]

SLIDE 17

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

Transitions show which characters/positions can precede each other

R ( E | G ) ( E X ) *

[Figure: automaton with states R, E, G, E, X; example prefix: RE...]

SLIDE 18

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

Transitions show which characters/positions can precede each other

R ( E | G ) ( E X ) *

[Figure: automaton with states R, E, G, E, X; example prefix: REE...]

SLIDE 19

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

Transitions show which characters/positions can precede each other

R ( E | G ) ( E X ) *

[Figure: automaton with states R, E, G, E, X; example prefix: RGE...]

SLIDE 20

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

Transitions show which characters/positions can precede each other

R ( E | G ) ( E X ) *

[Figure: automaton with states R, E, G, E, X; example prefix: RGEX...]

SLIDE 21

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

Transitions show which characters/positions can precede each other

R ( E | G ) ( E X ) *

[Figure: automaton with states R, E, G, E, X; example prefix: RGEXE...]

SLIDE 22

Glushkov’s automaton

Character in RE = state in automaton

+ one state for the beginning of the RE

Transitions show which characters/positions can precede each other

R ( E | G ) ( E X ) *

[Figure: complete automaton with states R, E, G, E, X and labeled transitions]

SLIDE 23

Glushkov’s automaton

All arrows entering a node are labeled by the same character

R ( E | G ) ( E X ) *

[Figure: complete automaton with states R, E, G, E, X and labeled transitions]

SLIDE 25

Glushkov’s automaton

All arrows entering a node are labeled by the same character
For example, after reading character ‘E’ only states with label ‘E’ can be active

[Figure: complete automaton with states R, E, G, E, X and labeled transitions]

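The construction above can be checked with a small set-based simulation. The sketch below hard-codes the position automaton for R(E|G)(EX)* instead of deriving it from the RE; the state numbering (0 for the beginning, 1 to 5 for R, E, G, E, X) and all names are assumptions made for the illustration.

```python
# State 0 is the extra state for the beginning of the RE;
# states 1..5 correspond to the characters R, E, G, E, X.
LABEL = {1: "R", 2: "E", 3: "G", 4: "E", 5: "X"}
FOLLOW = {0: {1}, 1: {2, 3}, 2: {4}, 3: {4}, 4: {5}, 5: {4}}
FINAL = {2, 3, 5}

def step(active: set[int], ch: str) -> set[int]:
    """Follow all transitions from the active states, keep targets labeled ch."""
    reachable = set().union(*(FOLLOW[s] for s in active))
    return {s for s in reachable if LABEL[s] == ch}

def accepts(text: str) -> bool:
    active = {0}
    for ch in text:
        active = step(active, ch)
    return bool(active & FINAL)

print(accepts("RGEXEX"))        # True
print(step({1, 2, 3}, "E"))     # {2, 4}: only states labeled E stay active
```

Because every arrow into a state carries that state's own character, `step` can first collect everything reachable and then filter by label, which is the property the bit-parallel version exploits.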
SLIDE 26

Exact search

Simulation of NFA = changing active states based on the character read from the text
We use bit-vectors (one bit for each state) to hold active states

δ(D, a)
    D: bit-vector of active states
    a: character read
    Returns new bit-vector

2^|D| · |Σ| different sets of parameters
    |D|: number of states in automaton
    |Σ|: alphabet size

SLIDE 27

Exact search

“After reading character ‘E’ only states with label ‘E’ can be active”, so ...

δ(D, a) = T[D] & B[a]

T[D]: states that can be reached from states in D by any character
B[a]: states that can be reached by character a

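As a rough illustration of why the split into T and B helps, the sketch below compares the number of table entries for a direct δ table with the two separate tables. The figures use 7 states and 3 letters, as in the AA|AB|AC example of the following slides; the function name is hypothetical.

```python
def table_entries(num_states: int, alphabet_size: int) -> tuple[int, int]:
    """Entries in a direct delta table versus separate T and B tables."""
    direct = 2 ** num_states * alphabet_size  # one entry per (D, a) pair
    split = 2 ** num_states + alphabet_size   # T indexed by D, B indexed by a
    return direct, split

print(table_entries(7, 3))  # (384, 131)
```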
SLIDE 28

Exact search

δ(D, a) = T[D] & B[a]

Example: AA|AB|AC

[Figure: automaton for AA|AB|AC with an initial state and states A, A, A, B, A, C]

a      B[a]
‘A’    0111010
‘B’
‘C’

D          T[D]
1000000
0100000
...
0101010
...

SLIDE 29

Exact search

δ(D, a) = T[D] & B[a]

Example: AA|AB|AC

[Figure: automaton for AA|AB|AC with an initial state and states A, A, A, B, A, C]

a      B[a]
‘A’    0111010
‘B’    0000100
‘C’

D          T[D]
1000000
0100000
...
0101010
...

SLIDE 30

Exact search

δ(D, a) = T[D] & B[a]

Example: AA|AB|AC

[Figure: automaton for AA|AB|AC with an initial state and states A, A, A, B, A, C]

a      B[a]
‘A’    0111010
‘B’    0000100
‘C’    0000001

D          T[D]
1000000
0100000
...
0101010
...

SLIDE 31

Exact search

δ(D, a) = T[D] & B[a]

Example: AA|AB|AC

[Figure: automaton for AA|AB|AC with an initial state and states A, A, A, B, A, C]

a      B[a]
‘A’    0111010
‘B’    0000100
‘C’    0000001

D          T[D]
1000000    0101010
0100000
...
0101010
...

SLIDE 32

Exact search

δ(D, a) = T[D] & B[a]

Example: AA|AB|AC

[Figure: automaton for AA|AB|AC with an initial state and states A, A, A, B, A, C]

a      B[a]
‘A’    0111010
‘B’    0000100
‘C’    0000001

D          T[D]
1000000    0101010
0100000    0010000
...
0101010
...

SLIDE 33

Exact search

δ(D, a) = T[D] & B[a]

Example: AA|AB|AC

[Figure: automaton for AA|AB|AC with an initial state and states A, A, A, B, A, C]

a      B[a]
‘A’    0111010
‘B’    0000100
‘C’    0000001

D          T[D]
1000000    0101010
0100000    0010000
...
0101010    0010101
...

SLIDE 34

Exact search

δ(D, a) = T[D] & B[a]

Example: AA|AB|AC

[Figure: automaton for AA|AB|AC with an initial state and states A, A, A, B, A, C]

a      B[a]
‘A’    0111010
‘B’    0000100
‘C’    0000001

D          T[D]
1000000    0101010
0100000    0010000
...
0101010    0010101
...

δ(0101010, ‘A’)

SLIDE 35

Exact search

δ(D, a) = T[D] & B[a]

Example: AA|AB|AC

[Figure: automaton for AA|AB|AC with an initial state and states A, A, A, B, A, C]

a      B[a]
‘A’    0111010
‘B’    0000100
‘C’    0000001

D          T[D]
1000000    0101010
0100000    0010000
...
0101010    0010101
...

δ(0101010, ‘A’)
      0010101    T[D]
    & 0111010    B[a]
    = 0010000

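The tables of this example can be recomputed mechanically. The sketch below builds B and T for AA|AB|AC and evaluates δ(0101010, ‘A’); the follow lists, the helper names, and the choice to store state 0 in the leftmost bit (to match the way the vectors are printed above) are assumptions made for the illustration.

```python
N = 7                  # initial state plus one state per character of AA|AB|AC
LABELS = "AAABAC"      # labels of states 1..6
FOLLOW = {0: [1, 3, 5], 1: [2], 2: [], 3: [4], 4: [], 5: [6], 6: []}

def bit(state: int) -> int:
    """State 0 is the leftmost bit, as the vectors are written on the slides."""
    return 1 << (N - 1 - state)

# B[a]: states whose label is a.
B = {ch: sum(bit(i + 1) for i, lab in enumerate(LABELS) if lab == ch)
     for ch in "ABC"}

# T[D]: states reachable from any state in D, whatever the character.
T = [0] * (1 << N)
for d in range(1 << N):
    for s in range(N):
        if d & bit(s):
            for t in FOLLOW[s]:
                T[d] |= bit(t)

def delta(d: int, ch: str) -> int:
    return T[d] & B[ch]

def show(v: int) -> str:
    return format(v, "07b")

print(show(B["A"]), show(B["B"]), show(B["C"]))  # 0111010 0000100 0000001
print(show(T[0b1000000]))                        # 0101010
print(show(delta(0b0101010, "A")))               # 0010000
```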
SLIDE 36

Exact search

D ← 100..00                      // initial state active
F ← bit-vector of final states
For pos ∈ 1 ... n Do             // scanning text
    D ← T[D] & B[t_pos]
    If D & F ≠ 000..00 Then match
End of For

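The loop above translates almost line by line into Python. The sketch below follows the pseudocode as written, so a match has to start at the first text character; keeping the initial state active at every step would be one way to search inside a longer text, but that variant is not shown on the slide. Table construction, the bit numbering (state i in bit i), and all names are assumptions made for the illustration.

```python
def build_tables(labels: str, follow: dict[int, list[int]]):
    """B[a] and T[D] for an automaton with state 0 plus one state per label."""
    n = len(labels) + 1
    B: dict[str, int] = {}
    for state, ch in enumerate(labels, start=1):
        B[ch] = B.get(ch, 0) | (1 << state)
    T = [0] * (1 << n)
    for d in range(1, 1 << n):
        low = (d & -d).bit_length() - 1    # index of the lowest active state
        T[d] = T[d & (d - 1)]              # reuse the entry without that state
        for t in follow.get(low, ()):
            T[d] |= 1 << t
    return B, T

def search(text: str, B, T, final_mask: int) -> list[int]:
    """Return the 1-based text positions at which a final state is active."""
    D = 1                                  # only the initial state is active
    ends = []
    for pos, ch in enumerate(text, start=1):
        D = T[D] & B.get(ch, 0)
        if D & final_mask:
            ends.append(pos)
    return ends

# R(E|G)(EX)*: states 1..5 are R, E, G, E, X; final states are 2, 3 and 5.
B, T = build_tables("REGEX", {0: [1], 1: [2, 3], 2: [4], 3: [4], 4: [5], 5: [4]})
FINAL = (1 << 2) | (1 << 3) | (1 << 5)
print(search("REEXEXR", B, T, FINAL))  # [2, 4, 6]
```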
SLIDE 37

Approximate search

Errors
    Insertion
    Deletion
    Substitution

SLIDE 38

Approximate search

When searching with k errors we make k+1 replicas of the automaton, one for each error level
Plus we need transitions for errors

[Figure: two copies of the automaton with states R, E, G, E, X, labeled "No errors" and "Up to 1 error"; the transitions between the copies are marked "?"]

SLIDE 39

Approximate search

R0, R1: current bit-vectors
R0’, R1’: bit-vectors after processing character c

R0’ = T[R0] & B[c]
R1’ = ?

SLIDE 40

Approximate search

R1’ = T[R1] & B[c] | ...

T[R1] & B[c]: no errors. Same as in exact search

[Figure: automaton copies "No errors" and "Up to 1 error" with states R, E, G, E, X; example text: EGEX]

SLIDE 41

Approximate search

R1’ = T[R1] & B[c] | R0 | ...

T[R1] & B[c]: no errors
R0: del. Active states remain the same

[Figure: automaton copies "No errors" and "Up to 1 error" with states R, E, G, E, X, connected by Σ-transitions; example text: RAEGEX]

SLIDE 42

Approximate search

R1’ = T[R1] & B[c] | R0 | T[R0’] | ...

T[R1] & B[c]: no errors
R0: del
T[R0’]: ins. Insert new character after the current one. Just one step in automaton

[Figure: automaton copies "No errors" and "Up to 1 error" with states R, E, G, E, X, connected by Σ- and ε-transitions; example text: REEX]

SLIDE 43

Approximate search

R1’ = T[R1] & B[c] | R0 | T[R0’] | T[R0]

T[R1] & B[c]: no errors
R0: del
T[R0’]: ins
T[R0]: subst. Similar to exact matching except that we don’t care about the character read

[Figure: automaton copies "No errors" and "Up to 1 error" with states R, E, G, E, X, connected by Σ- and ε-transitions; example text: RAGEX]

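The update rule above generalizes to k error levels by applying the same four terms between each pair of neighbouring levels. The sketch below does this for R(E|G)(EX)* with whole-string matching. The initialization of the higher levels (letting level i start with up to i pattern characters already skipped) is not shown on the slides and is an assumption here, as are the bit numbering (state i in bit i) and all names.

```python
def build_tables(labels: str, follow: dict[int, list[int]]):
    """B[a] and T[D] for an automaton with state 0 plus one state per label."""
    n = len(labels) + 1
    B: dict[str, int] = {}
    for state, ch in enumerate(labels, start=1):
        B[ch] = B.get(ch, 0) | (1 << state)
    T = [0] * (1 << n)
    for d in range(1, 1 << n):
        low = (d & -d).bit_length() - 1
        T[d] = T[d & (d - 1)]
        for t in follow.get(low, ()):
            T[d] |= 1 << t
    return B, T

def approx_match(text: str, k: int, B, T, final_mask: int) -> bool:
    """Whole-string match with at most k errors, one bit-vector per error level."""
    R = [1]                                  # level 0: only the initial state
    for i in range(1, k + 1):
        R.append(R[i - 1] | T[R[i - 1]])     # assumed initialization, see lead-in
    for ch in text:
        b = B.get(ch, 0)
        new = [T[R[0]] & b]
        for i in range(1, k + 1):
            new.append((T[R[i]] & b)         # no errors
                       | R[i - 1]            # del: active states remain the same
                       | T[new[i - 1]]       # ins: one extra step in the automaton
                       | T[R[i - 1]])        # subst: step regardless of character
        R = new
    return bool(R[k] & final_mask)

B, T = build_tables("REGEX", {0: [1], 1: [2, 3], 2: [4], 3: [4], 4: [5], 5: [4]})
FINAL = (1 << 2) | (1 << 3) | (1 << 5)
print(approx_match("REGEX", 1, B, T, FINAL))  # True
print(approx_match("REGEX", 0, B, T, FINAL))  # False
```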
SLIDE 44

Error-free regions

Approximate search
    R1’ = T[R1] & B[c] | R0 | T[R0’] | T[R0]

With no-error regions
    R1’ = T[R1] & B[c] | R0 & I | Te[R0’] | Te[R0]

Terms, left to right: no errors, del, ins, subst

SLIDE 45

Error-free regions

Deletion from text
    Make two copies of ‘C’
    Those characters that error-free regions can end with must be duplicated

Text       A(BC)+B    A<(BC)+>B    A<BC>+B
AXBCBCB    match      match        match
ABXCBCB    match      no match     no match
ABCBCXB    match      match        match
ABCXBCB    match      no match     match

SLIDE 46

Error-free regions

[Figure: automaton for A(BC)+B with states A, B, C, B, and automaton for A<(BC)+>B with states A, B, C1, C2, B, where C is duplicated and the error-free region is marked]

SLIDE 47

Error-free regions

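[Figure: "No errors" and "Up to 1 error" copies of the automaton for A<(BC)+>B with states A, B, C1, C2, B; error-free region marked]

Approximate search
    R1’ = T[R1] & B[c] | R0 | T[R0’] | T[R0]
With error-free regions
    R1’ = T[R1] & B[c] | ...

T[R1] & B[c]: no errors
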
SLIDE 48

Error-free regions

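[Figure: "No errors" and "Up to 1 error" copies of the automaton for A<(BC)+>B with states A, B, C1, C2, B, connected by Σ-transitions; error-free region marked]

Approximate search
    R1’ = T[R1] & B[c] | R0 | T[R0’] | T[R0]
With error-free regions
    R1’ = T[R1] & B[c] | ???

Terms so far: no errors, del
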
SLIDE 49

Error-free regions

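[Figure: "No errors" and "Up to 1 error" copies of the automaton for A<(BC)+>B with states A, B, C1, C2, B; Σ-transitions only outside the error-free region]

Approximate search
    R1’ = T[R1] & B[c] | R0 | T[R0’] | T[R0]
With error-free regions
    R1’ = T[R1] & B[c] | R0 & I | ...

I = 1 1 0 0 1 1
Terms so far: no errors, del
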
SLIDE 50

Error-free regions

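[Figure: "No errors" and "Up to 1 error" copies of the automaton for A<(BC)+>B with states A, B, C1, C2, B; ε-transitions; error-free region marked]

Approximate search
    R1’ = T[R1] & B[c] | R0 | T[R0’] | T[R0]
With error-free regions
    R1’ = T[R1] & B[c] | R0 & I | ???

Terms so far: no errors, del, ins (ε-transitions)
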
SLIDE 51

Error-free regions

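[Figure: "No errors" and "Up to 1 error" copies of the automaton for A<(BC)+>B with states A, B, C1, C2, B; ε-transitions; error-free region marked]

Approximate search
    R1’ = T[R1] & B[c] | R0 | T[R0’] | T[R0]
With error-free regions
    R1’ = T[R1] & B[c] | R0 & I | Te[R0’] | ...

Terms so far: no errors, del, ins (ε-transitions)
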
SLIDE 52

Error-free regions

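[Figure: "No errors" and "Up to 1 error" copies of the automaton for A<(BC)+>B with states A, B, C1, C2, B; Σ-transitions; error-free region marked]

Approximate search
    R1’ = T[R1] & B[c] | R0 | T[R0’] | T[R0]
With error-free regions
    R1’ = T[R1] & B[c] | R0 & I | Te[R0’] | Te[R0]

Terms: no errors, del, ins, subst (Σ-transitions)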
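The final formula can be tried out on the A<(BC)+>B example with one error. In the sketch below, state i is stored in bit i, the states are 0 (beginning), A, B, C1, C2, B, and I allows text deletions only in states 0, 1, 4 and 5, which corresponds to I = 1 1 0 0 1 1 read from state 0 to state 5. The exact contents of Te are not spelled out on the slides; the sketch assumes that Te keeps only the transitions whose target lies outside the error-free region. The initialization of R1 and all names are likewise assumptions.

```python
N = 6                                   # states: 0 (beginning), A, B, C1, C2, B
LABELS = "ABCCB"                        # labels of states 1..5
FOLLOW = {0: [1], 1: [2], 2: [3, 4], 3: [2], 4: [5]}
ERR_FOLLOW = {0: [1], 4: [5]}           # assumed Te: targets outside the region
I_MASK = (1 << 0) | (1 << 1) | (1 << 4) | (1 << 5)
FINAL = 1 << 5

def transition_table(follow: dict[int, list[int]]) -> list[int]:
    table = [0] * (1 << N)
    for d in range(1, 1 << N):
        low = (d & -d).bit_length() - 1
        table[d] = table[d & (d - 1)]
        for t in follow.get(low, ()):
            table[d] |= 1 << t
    return table

T = transition_table(FOLLOW)
TE = transition_table(ERR_FOLLOW)
B: dict[str, int] = {}
for state, ch in enumerate(LABELS, start=1):
    B[ch] = B.get(ch, 0) | (1 << state)

def match_one_error(text: str) -> bool:
    """Whole-string match of A<(BC)+>B with at most one error outside < >."""
    R0 = 1
    R1 = R0 | TE[R0]                    # assumed initialization
    for ch in text:
        b = B.get(ch, 0)
        new0 = T[R0] & b
        R1 = (T[R1] & b) | (R0 & I_MASK) | TE[new0] | TE[R0]
        R0 = new0
    return bool(R1 & FINAL)

for text in ["AXBCBCB", "ABXCBCB", "ABCBCXB", "ABCXBCB"]:
    print(text, match_one_error(text))
# AXBCBCB True, ABXCBCB False, ABCBCXB True, ABCXBCB False
```

Under these assumptions the four results agree with the A<(BC)+>B column of the table on slide 45. The duplicated C is what lets I tell "C inside the region" (C1, no deletion allowed) apart from "C that ends the region" (C2, deletion allowed).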