SLIDE 1

Methods of Analysis of Textual Data (MATD)

Jiří Dvorský October 11, 2019

Department of Computer Science VŠB – TU Ostrava

SLIDE 2

Lectures Outline

1. Pattern Matching

   Exact pattern matching
     Searching for a finite set of patterns
     Searching for a (regular) infinite set of patterns in text

   Approximate pattern matching

SLIDE 3

Pattern Matching


SLIDE 4

Pattern Matching

Exact pattern matching

SLIDE 5

Searching for a (Regular) Infinite Set of Patterns in Text

1. How do we describe an infinite set of patterns, i.e. strings?

   Regular expressions

2. What shall we use to perform the matching?

   Finite automata

SLIDE 6

Regular Expressions and Languages

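Regular expression $R$        Value of expression $h(R)$

Atomic expressions
$\emptyset$                   $\emptyset$
$\varepsilon$                 $\{\varepsilon\}$
$a$, $a \in \Sigma$           $\{a\}$

Operations
$U \cdot V$                   $\{uv \mid u \in h(U) \wedge v \in h(V)\}$
$U + V$                       $h(U) \cup h(V)$

$V^k = \underbrace{V \cdot V \cdot \ldots \cdot V}_{k \text{ times}}$
$V^+ = V^1 + V^2 + V^3 + \ldots$
$V^* = V^0 + V^1 + V^2 + \ldots$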

SLIDE 7

Regular Expression Features

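$U + (V + W) = (U + V) + W$
$U \cdot (V \cdot W) = (U \cdot V) \cdot W$
$U + V = V + U$
$(U + V) \cdot W = (U \cdot W) + (V \cdot W)$
$U \cdot (V + W) = (U \cdot V) + (U \cdot W)$
$U + U = U$
$\varepsilon \cdot U = U$
$\emptyset \cdot U = \emptyset$
$U + \emptyset = U$
$U^* = \varepsilon + U^+$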

SLIDE 8

Deterministic Finite Automaton

Definition
A Deterministic Finite Automaton (DFA) is a quintuple $A = (Q, \Sigma, q_0, \delta, F)$, where

$Q$ is a finite set of states
$\Sigma$ is an alphabet
$q_0 \in Q$ is the initial state
$\delta : Q \times \Sigma \to Q$ is a transition function
$F \subseteq Q$ is a set of final states

SLIDE 9

Deterministic Finite Automaton (cont.)

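Configuration of a finite automaton: $(q, w) \in Q \times \Sigma^*$

Transition of a finite automaton is a relation $\mapsto \subseteq (Q \times \Sigma^*) \times (Q \times \Sigma^*)$ such that
\[ (q, aw) \mapsto (q', w) \iff \delta(q, a) = q' \]

The automaton accepts a word $w$ if $(q_0, w) \mapsto^* (q, \varepsilon)$ for some $q \in F$.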
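The acceptance condition can be tried out on a small table-driven automaton. The following is only a minimal sketch: the three-state DFA over {0, 1} (accepting words that end in 01), the array names and the function `dfa_accepts` are assumptions made for illustration, not part of the lecture.

```c
#include <stdbool.h>
#include <stddef.h>

enum { NUM_STATES = 3, NUM_SYMBOLS = 2 };

/* Hypothetical example DFA over {0, 1} accepting words that end in "01".
 * delta[q][a] is the state reached from state q on symbol a. */
static const int delta[NUM_STATES][NUM_SYMBOLS] = {
    /* q0 */ {1, 0},
    /* q1 */ {1, 2},
    /* q2 */ {1, 0},
};
static const bool is_final[NUM_STATES] = {false, false, true};

/* Returns true if the DFA accepts the word w (a string of '0' and '1'). */
bool dfa_accepts(const char *w)
{
    int q = 0;                          /* start in the initial state q0 */
    for (size_t i = 0; w[i] != '\0'; i++) {
        if (w[i] != '0' && w[i] != '1')
            return false;               /* symbol outside the alphabet */
        q = delta[q][w[i] - '0'];       /* one transition step: consume one symbol */
    }
    return is_final[q];                 /* accept iff the input is consumed in a final state */
}
```

Each loop iteration performs exactly one step of the transition relation, so a word of length n is decided in n table lookups.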

SLIDE 10

Nondeterministic Finite Automaton

Definition
A Nondeterministic Finite Automaton (NFA) is a quintuple $A = (Q, \Sigma, q_0, \delta, F)$, where

$Q$ is a finite set of states
$\Sigma$ is an alphabet
$q_0 \in Q$ is the initial state
$\delta : Q \times \Sigma \to \mathcal{P}(Q)$ is a transition function ($\mathcal{P}(Q)$ is the power set of $Q$)
$F \subseteq Q$ is a set of final states

Alternatively, an NFA can be defined as $A = (Q, \Sigma, S, \delta, F)$, where

$S \subseteq Q$ is a set of initial states.

For each NFA, there is a DFA that recognizes the same formal language.

SLIDE 11

Nondeterministic Finite Automaton – example

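Set of patterns $P = \{he, her, she\}$

[Figure: two NFAs for $P$. The first consists of three separate chains, one per pattern, over states $q_1, \ldots, q_{11}$: each of the start states $q_1$, $q_4$, $q_8$ has a $\Sigma$ self-loop and is followed by the transitions h e, h e r, and s h e, respectively. The second is a single automaton over states $q_1, \ldots, q_7$ with start state $q_1$, a $\Sigma$ self-loop on $q_1$, and two branches labelled h e r and s h e.]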
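An NFA can be run directly by keeping the set of currently active states. The sketch below does this for the single seven-state automaton, storing the set as a bit mask. It assumes that the final states are the ends of the three pattern branches ($q_3$, $q_4$, $q_7$), which the extracted figure does not show explicitly; the names `nfa_step` and `count_occurrences` are illustrative.

```c
#include <stddef.h>

/* State q_i is stored as bit (i - 1) of an unsigned mask. */
#define BIT(i) (1u << ((i) - 1))

/* One NFA step: the set of states reachable from `active` on symbol c. */
static unsigned nfa_step(unsigned active, char c)
{
    unsigned next = 0;
    if (active & BIT(1)) {
        next |= BIT(1);                     /* Sigma self-loop on q1 */
        if (c == 'h') next |= BIT(2);       /* q1 -h-> q2 */
        if (c == 's') next |= BIT(5);       /* q1 -s-> q5 */
    }
    if ((active & BIT(2)) && c == 'e') next |= BIT(3);  /* "he"  */
    if ((active & BIT(3)) && c == 'r') next |= BIT(4);  /* "her" */
    if ((active & BIT(5)) && c == 'h') next |= BIT(6);
    if ((active & BIT(6)) && c == 'e') next |= BIT(7);  /* "she" */
    return next;
}

/* Counts pattern occurrences in text (one per final state reached). */
int count_occurrences(const char *text)
{
    /* Assumption: final states are the ends of the three branches. */
    const unsigned final_mask = BIT(3) | BIT(4) | BIT(7);
    unsigned active = BIT(1);               /* start in q1 */
    int count = 0;
    for (size_t i = 0; text[i] != '\0'; i++) {
        active = nfa_step(active, text[i]);
        /* count the final states in the active set, one bit at a time */
        for (unsigned hits = active & final_mask; hits != 0; hits &= hits - 1)
            count++;
    }
    return count;
}
```

For the text "ushers" this reports three occurrences: "she" and "he" both end at the fourth symbol, and "her" ends at the fifth.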

SLIDE 12

NFA ⟶ DFA Conversion

The DFA can be constructed using the powerset construction.

NFA $A = (Q, \Sigma, S, \delta, F)$ ⟶ DFA $A' = (Q', \Sigma', q'_0, \delta', F')$

$Q' \subseteq \mathcal{P}(Q)$
$\Sigma' = \Sigma$
$q'_0 = S$
$\delta'(q', x) = \bigcup_{q \in q'} \delta(q, x)$
$F' = \{q' \in Q' \mid q' \cap F \neq \emptyset\}$
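The construction can be sketched as a worklist over sets of NFA states, generating only the reachable subsets. The example below hard-codes the seven-state NFA for {he, her, she} and groups the alphabet into the classes e, h, r, s and "other". All identifiers (`nfa`, `dfa_set`, `dfa_delta`, `build_dfa`) are hypothetical names chosen for this illustration.

```c
#include <stddef.h>

#define Q(i) (1u << ((i) - 1))   /* NFA state q_i as a bit mask */

enum { NFA_STATES = 7, NUM_CLASSES = 5, MAX_DFA = 1 << NFA_STATES };
/* Symbol classes: 0 = e, 1 = h, 2 = r, 3 = s, 4 = any other symbol. */

/* nfa[i][c]: set of NFA states reachable from q_(i+1) on class c. */
static const unsigned nfa[NFA_STATES][NUM_CLASSES] = {
    /* q1 */ {Q(1), Q(1) | Q(2), Q(1), Q(1) | Q(5), Q(1)},
    /* q2 */ {Q(3), 0, 0, 0, 0},
    /* q3 */ {0, 0, Q(4), 0, 0},
    /* q4 */ {0, 0, 0, 0, 0},
    /* q5 */ {0, Q(6), 0, 0, 0},
    /* q6 */ {Q(7), 0, 0, 0, 0},
    /* q7 */ {0, 0, 0, 0, 0},
};

unsigned dfa_set[MAX_DFA];            /* dfa_set[i]: NFA-state set of DFA state i */
int dfa_delta[MAX_DFA][NUM_CLASSES];  /* DFA transition table */

/* delta'(q', x): union of delta(q, x) over all q in the set q'. */
static unsigned step_set(unsigned set, int c)
{
    unsigned out = 0;
    for (int q = 0; q < NFA_STATES; q++)
        if (set & (1u << q))
            out |= nfa[q][c];
    return out;
}

/* Builds the reachable part of the DFA; returns the number of DFA states. */
int build_dfa(unsigned start_set)
{
    int count = 0;
    dfa_set[count++] = start_set;            /* q'_0 = S */
    for (int i = 0; i < count; i++) {        /* dfa_set doubles as the work queue */
        for (int c = 0; c < NUM_CLASSES; c++) {
            unsigned target = step_set(dfa_set[i], c);
            int j = 0;
            while (j < count && dfa_set[j] != target)
                j++;                         /* look for an existing DFA state */
            if (j == count)
                dfa_set[count++] = target;   /* new subset discovered */
            dfa_delta[i][c] = j;
        }
    }
    return count;
}
```

Calling `build_dfa(Q(1))` yields seven DFA states. A DFA state `i` would be final when `dfa_set[i]` shares a bit with the NFA's final-state mask, following the definition of $F'$.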

SLIDE 13

NFA ⟶ DFA Conversion I

State                Label   e                    h                    r             s             other
{1, 4, 8}            $q'_1$  {1, 4, 8}            {1, 2, 4, 5, 8}      {1, 4, 8}     {1, 4, 8, 9}  {1, 4, 8}
{1, 2, 4, 5, 8}      $q'_2$  {1, 3, 4, 6, 8}      {1, 2, 4, 5, 8}      {1, 4, 8}     {1, 4, 8, 9}  {1, 4, 8}
{1, 4, 8, 9}         $q'_3$  {1, 4, 8}            {1, 2, 4, 5, 8, 10}  {1, 4, 8}     {1, 4, 8, 9}  {1, 4, 8}
{1, 3, 4, 6, 8}      $q'_4$  {1, 4, 8}            {1, 2, 4, 5, 8}      {1, 4, 7, 8}  {1, 4, 8, 9}  {1, 4, 8}
{1, 2, 4, 5, 8, 10}  $q'_5$  {1, 3, 4, 6, 8, 11}  {1, 2, 4, 5, 8}      {1, 4, 8}     {1, 4, 8, 9}  {1, 4, 8}
{1, 4, 7, 8}         $q'_6$  {1, 4, 8}            {1, 2, 4, 5, 8}      {1, 4, 8}     {1, 4, 8, 9}  {1, 4, 8}
{1, 3, 4, 6, 8, 11}  $q'_7$  {1, 4, 8}            {1, 2, 4, 5, 8}      {1, 4, 7, 8}  {1, 4, 8, 9}  {1, 4, 8}

[Figure: the NFA with three separate chains (states $q_1, \ldots, q_{11}$) and the transition diagram of the resulting DFA with states $q'_1$ (start) to $q'_7$.]

Only reachable states are listed; transitions to state $q'_1$ are not shown in the diagram.

SLIDE 14

NFA ⟶ DFA Conversion II

State      Label   e          h          r       s       other
{1}        $q'_1$  {1}        {1, 2}     {1}     {1, 5}  {1}
{1, 2}     $q'_2$  {1, 3}     {1, 2}     {1}     {1, 5}  {1}
{1, 5}     $q'_3$  {1}        {1, 2, 6}  {1}     {1, 5}  {1}
{1, 3}     $q'_4$  {1}        {1, 2}     {1, 4}  {1, 5}  {1}
{1, 2, 6}  $q'_5$  {1, 3, 7}  {1, 2}     {1}     {1, 5}  {1}
{1, 4}     $q'_6$  {1}        {1, 2}     {1}     {1, 5}  {1}
{1, 3, 7}  $q'_7$  {1}        {1, 2}     {1, 4}  {1, 5}  {1}

[Figure: the single NFA (states $q_1, \ldots, q_7$) and the transition diagram of the resulting DFA with states $q'_1$ (start) to $q'_7$.]

SLIDE 15

Derivation of Regular Expression

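For a given regular expression $R$, the derivation of $R$ with respect to a string $x$ is defined by
\[ h\left(\frac{dR}{dx}\right) = \{y \mid xy \in h(R)\} \]

Example
For $R = a + shell + stop + plot$ and its value $h(R) = \{a, shell, stop, plot\}$, the derivations are
\[ h\left(\frac{dR}{da}\right) = \{\varepsilon\}, \quad h\left(\frac{dR}{ds}\right) = \{hell, top\}, \quad h\left(\frac{dR}{dt}\right) = \emptyset \]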
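For a finite language the definition can be applied literally: keep the words that start with the given symbol and drop that symbol. The function `derive` below is a hypothetical helper that sketches this for a single symbol; it only covers finite sets of words, not general regular expressions.

```c
#include <stddef.h>

/* Derivation of a finite language by one symbol a.
 * Stores in out[] every suffix y such that a followed by y is in words[],
 * and returns the number of suffixes found. An empty suffix "" stands for
 * the empty word. out[] must have room for n entries. */
size_t derive(const char *const words[], size_t n, char a, const char *out[])
{
    size_t found = 0;
    for (size_t i = 0; i < n; i++)
        if (words[i][0] == a)
            out[found++] = words[i] + 1;   /* skip the leading symbol */
    return found;
}
```

With the words a, shell, stop and plot, deriving by s gives hell and top, deriving by a gives the empty word, and deriving by t gives nothing, matching the example.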

SLIDE 16

Derivation of Regular Expression – properties

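$\frac{d\emptyset}{da} = \emptyset, \quad \forall a \in \Sigma$
$\frac{d\varepsilon}{da} = \emptyset, \quad \forall a \in \Sigma$
$\frac{da}{da} = \varepsilon, \quad \forall a \in \Sigma$
$\frac{db}{da} = \emptyset, \quad \forall b \neq a$
$\frac{d(U + V)}{da} = \frac{dU}{da} + \frac{dV}{da}$
$\frac{d(U \cdot V)}{da} = \frac{dU}{da} \cdot V, \quad \varepsilon \notin h(U)$
$\frac{d(U \cdot V)}{da} = \frac{dU}{da} \cdot V + \frac{dV}{da}, \quad \varepsilon \in h(U)$
$\frac{dV^*}{da} = \frac{dV}{da} \cdot V^*$
$\frac{dV}{dx} = \frac{d}{da_n}\left(\frac{d}{da_{n-1}}\left(\cdots \frac{d}{da_2}\left(\frac{dV}{da_1}\right)\right)\right), \quad \text{for } x = a_1 a_2 \ldots a_n$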

SLIDE 17

Construction of DFA by Derivations of RE

Derivations of regular expressions allow us to build a DFA for any regular expression directly and algorithmically.

Let $V$ be a given regular expression over alphabet $\Sigma$.
Each state of the DFA defines a set of words: those that move the DFA from this state to any of the final states. So every state can be associated with a regular expression defining this set of words:

$q_0 = V$
$\delta(q, x) = \frac{dq}{dx}$
$F = \{q \in Q \mid \varepsilon \in h(q)\}$

SLIDE 18

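Construction of DFA by Derivations of RE – example

Let $V = (0 + 1)^* \cdot 01$ over alphabet $\Sigma = \{0, 1\}$. Then
\[ q_0 = (0 + 1)^* \cdot 01 \]

Example of derivations:
\begin{align*}
\frac{d((0 + 1)^* \cdot 01)}{d0} &= \frac{d((0 + 1)^*)}{d0} \cdot 01 + \frac{d(01)}{d0} \\
&= \frac{d(0 + 1)}{d0} \cdot (0 + 1)^* \cdot 01 + 1 \\
&= \left(\frac{d0}{d0} + \frac{d1}{d0}\right) \cdot (0 + 1)^* \cdot 01 + 1 \\
&= (\varepsilon + \emptyset) \cdot (0 + 1)^* \cdot 01 + 1 \\
&= (0 + 1)^* \cdot 01 + 1
\end{align*}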

SLIDE 19

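Construction of DFA by Derivations of RE – example (cont.)

\begin{align*}
\frac{d((0 + 1)^* \cdot 01)}{d1} &= \frac{d((0 + 1)^*)}{d1} \cdot 01 + \frac{d(01)}{d1} \\
&= \frac{d(0 + 1)}{d1} \cdot (0 + 1)^* \cdot 01 + \emptyset \\
&= \left(\frac{d0}{d1} + \frac{d1}{d1}\right) \cdot (0 + 1)^* \cdot 01 \\
&= (\emptyset + \varepsilon) \cdot (0 + 1)^* \cdot 01 \\
&= (0 + 1)^* \cdot 01
\end{align*}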

SLIDE 20

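Construction of DFA by Derivations of RE – example (cont.)

Regular expression                  State   0                         1
$(0 + 1)^* \cdot 01$                $q_0$   $(0 + 1)^* \cdot 01 + 1$  $(0 + 1)^* \cdot 01$
$(0 + 1)^* \cdot 01 + 1$            $q_1$   $(0 + 1)^* \cdot 01 + 1$  $(0 + 1)^* \cdot 01 + \varepsilon$
$(0 + 1)^* \cdot 01 + \varepsilon$  $q_2$   $(0 + 1)^* \cdot 01 + 1$  $(0 + 1)^* \cdot 01$

[Figure: transition diagram of the resulting DFA with states $q_0$ (start), $q_1$ and $q_2$, following the table above.]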
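The table rows for the second and third state follow from the same rules. As a sketch, reusing the two derivations of $(0 + 1)^* \cdot 01$ computed on the previous slides, the derivations of the expression associated with the second state work out as follows:

```latex
\begin{align*}
\frac{d\left((0 + 1)^* \cdot 01 + 1\right)}{d0}
  &= \frac{d\left((0 + 1)^* \cdot 01\right)}{d0} + \frac{d1}{d0}
   = \left((0 + 1)^* \cdot 01 + 1\right) + \emptyset
   = (0 + 1)^* \cdot 01 + 1 \\
\frac{d\left((0 + 1)^* \cdot 01 + 1\right)}{d1}
  &= \frac{d\left((0 + 1)^* \cdot 01\right)}{d1} + \frac{d1}{d1}
   = (0 + 1)^* \cdot 01 + \varepsilon
\end{align*}
```

The value of the last expression contains the empty word, so the state it defines is final. No new expressions appear after the third state, which is why the construction stops with three states.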

SLIDE 21

Pattern Matching

Approximate pattern matching

SLIDE 22

Approximate pattern matching

A string metric (string distance function) is a metric that measures the distance between two text strings for approximate string matching.

A string metric can be considered an “inverse similarity”: it expresses how dissimilar two strings are.

There are two classic metrics:
1. Hamming distance
2. Levenshtein distance

String dissimilarity (distance) can therefore be measured. Both distances are metrics in the mathematical sense: they satisfy non-negativity, identity, symmetry, and the triangle inequality.

SLIDE 23

Hamming distance

Definition
The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other.

Example
The Hamming distance of “karolin” and “kathrin” is 3.

k a r o l i n
k a t h r i n
    1 1 1
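The definition translates into a single pass over both strings. A minimal sketch follows; the function name `hamming_distance` and the convention of returning -1 for strings of unequal length (where the distance is undefined) are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Returns the Hamming distance of s and t,
 * or -1 if their lengths differ (the distance is then undefined). */
int hamming_distance(const char *s, const char *t)
{
    if (strlen(s) != strlen(t))
        return -1;
    int distance = 0;
    for (size_t i = 0; s[i] != '\0'; i++)
        if (s[i] != t[i])
            distance++;                 /* one substitution needed here */
    return distance;
}
```

For example, `hamming_distance("karolin", "kathrin")` evaluates to 3.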

SLIDE 24

Levenshtein distance

Definition
The Levenshtein distance (1965) between two strings is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other.

SLIDE 25

Levenshtein distance (cont.)

Example Levenshtein distance between “kitten” and “sitting” is 3:

1. kitten → sitten (substitution of “s” for “k”)
2. sitten → sittin (substitution of “i” for “e”)
3. sittin → sitting (insertion of “g” at the end).

There is no way to do it with fewer than three edits.


SLIDE 26

Levenshtein distance (cont.)

Upper and lower bounds
The Levenshtein distance has several simple upper and lower bounds:

It is at least the difference of the sizes of the two strings.
It is at most the length of the longer string.
It is zero if and only if the strings are equal.
If the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance.
The Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string (triangle inequality).

SLIDE 27

Levenshtein distance (cont.)

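\[
d(i, j) =
\begin{cases}
i, & \text{if } j = 0 \\
j, & \text{if } i = 0 \\
\min\bigl( d(i-1, j) + 1,\; d(i, j-1) + 1,\; d(i-1, j-1) + c(i, j) \bigr), & \text{otherwise}
\end{cases}
\]
where
\[
c(i, j) =
\begin{cases}
0 & \text{if } a_i = b_j \\
1 & \text{otherwise}
\end{cases}
\]

Here $d(i, j)$ is the distance between the first $i$ symbols of $a$ and the first $j$ symbols of $b$. The first element of the minimum corresponds to deletion (from $a$ to $b$), the second to insertion, and the third to a match or mismatch.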

SLIDE 28

Levenshtein distance (cont.)

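    /* Smallest of three ints. */
    static int minimum(int a, int b, int c)
    {
        int m = (a < b) ? a : b;
        return (m < c) ? m : c;
    }

    /* Levenshtein distance between the first len_s characters of s
       and the first len_t characters of t (naive recursion). */
    int LevenshteinDistance(const char *s, int len_s, const char *t, int len_t)
    {
        int cost;

        /* base case: empty strings */
        if (len_s == 0) return len_t;
        if (len_t == 0) return len_s;

        /* test if last characters of the strings match */
        if (s[len_s-1] == t[len_t-1])
            cost = 0;
        else
            cost = 1;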

SLIDE 29

Levenshtein distance (cont.)

        /* return minimum of delete char from s, delete char from t,
           and delete char from both */
        return minimum(
            LevenshteinDistance(s, len_s-1, t, len_t) + 1,
            LevenshteinDistance(s, len_s, t, len_t-1) + 1,
            LevenshteinDistance(s, len_s-1, t, len_t-1) + cost
        );
    }

SLIDE 30

Levenshtein distance (cont.)

          k    i    t    t    e    n
     0    1    2    3    4    5    6
s    1    1a   2    3    4    5    6
i    2    2    1b   2    3    4    5
t    3    3    2    1c   2    3    4
t    4    4    3    2    1d   2    3
i    5    5    4    3    2    2e   3
n    6    6    5    4    3    3    2f
g    7    7    6    5    4    4    3g

a  substitution of k for s
b  i is equal to i
c  t is equal to t
d  t is equal to t
e  substitution of e for i
f  n is equal to n
g  delete g
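A table like this can be filled row by row instead of being explored by recursion, which recomputes the same subproblems many times. The sketch below uses the standard dynamic-programming formulation (often called the Wagner-Fischer algorithm) and keeps only two rows in memory. The names `levenshtein_dp` and `min3`, and the -1 return value on allocation failure, are assumptions made for this example.

```c
#include <stdlib.h>
#include <string.h>

static int min3(int a, int b, int c)
{
    int m = (a < b) ? a : b;
    return (m < c) ? m : c;
}

/* Levenshtein distance of s and t, computed row by row.
 * Returns -1 if memory allocation fails. */
int levenshtein_dp(const char *s, const char *t)
{
    size_t len_s = strlen(s), len_t = strlen(t);
    int *prev = malloc((len_t + 1) * sizeof *prev);   /* row i - 1 */
    int *curr = malloc((len_t + 1) * sizeof *curr);   /* row i     */
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return -1;
    }
    for (size_t j = 0; j <= len_t; j++)
        prev[j] = (int)j;                  /* first row: d(0, j) = j */
    for (size_t i = 1; i <= len_s; i++) {
        curr[0] = (int)i;                  /* first column: d(i, 0) = i */
        for (size_t j = 1; j <= len_t; j++) {
            int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            curr[j] = min3(prev[j] + 1,          /* deletion */
                           curr[j - 1] + 1,      /* insertion */
                           prev[j - 1] + cost);  /* match or mismatch */
        }
        int *tmp = prev;                   /* the finished row becomes "previous" */
        prev = curr;
        curr = tmp;
    }
    int result = prev[len_t];
    free(prev);
    free(curr);
    return result;
}
```

The running time is proportional to the product of the two string lengths, and `levenshtein_dp("kitten", "sitting")` evaluates to 3.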

SLIDE 31

Approximate pattern matching using finite automata

[Figure: NFA for exact string matching ($m = 4$): states $q_0$ (start) to $q_4$, a $\Sigma$ self-loop on $q_0$, and transitions labelled $p_1$, $p_2$, $p_3$, $p_4$.]

SLIDE 32

Approximate pattern matching using finite automata (cont.)

[Figure: NFA for approximate string matching using the Hamming distance ($m = 4$, $k = 3$): states $q_0$ (start) to $q_{13}$ with a $\Sigma$ self-loop on $q_0$. Transitions from left to right are matches; diagonal transitions are replacements.]
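Such an automaton can be simulated by tracking which (errors, position) pairs are active after each text symbol. The sketch below does this with a small boolean grid. The function `find_hamming`, the size limits `MAX_M` and `MAX_K`, and the choice to return the index just past the first approximate occurrence are assumptions made for illustration.

```c
#include <stdbool.h>
#include <string.h>

enum { MAX_M = 32, MAX_K = 8 };   /* illustrative size limits */

/* Returns the text index just past the first occurrence of pattern p
 * with at most k mismatches, or -1 if there is none (or if p is empty
 * or the limits above are exceeded). */
int find_hamming(const char *text, const char *p, int k)
{
    int m = (int)strlen(p);
    if (m == 0 || m > MAX_M || k < 0 || k > MAX_K)
        return -1;

    /* active[e][j]: j pattern symbols consumed with e replacements so far */
    bool active[MAX_K + 1][MAX_M + 1] = {{false}};
    active[0][0] = true;

    for (int i = 0; text[i] != '\0'; i++) {
        bool next[MAX_K + 1][MAX_M + 1] = {{false}};
        next[0][0] = true;                      /* Sigma self-loop on the start state */
        for (int e = 0; e <= k; e++) {
            for (int j = 0; j < m; j++) {
                if (!active[e][j])
                    continue;
                if (text[i] == p[j])
                    next[e][j + 1] = true;      /* match: move right */
                else if (e < k)
                    next[e + 1][j + 1] = true;  /* replace: move diagonally */
            }
        }
        memcpy(active, next, sizeof active);
        for (int e = 0; e <= k; e++)
            if (active[e][m])
                return i + 1;                   /* whole pattern consumed */
    }
    return -1;
}
```

With k = 0 only the top row is ever active, and the simulation reduces to exact matching.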

SLIDE 33

Approximate pattern matching using finite automata (cont.)

[Figure: NFA for approximate string matching using the Levenshtein distance ($m = 4$, $k = 3$): states $q_0$ (start) to $q_{13}$ with a $\Sigma$ self-loop on $q_0$. Transitions from left to right are matches, diagonal transitions are replacements, downward transitions are insertions, and diagonal $\varepsilon$-transitions are deletions.]
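The same search can be carried out without building the automaton. The sketch below swaps the NFA for a dynamic-programming formulation of approximate search (commonly attributed to Sellers): it is the Levenshtein recurrence with the row for the empty pattern prefix fixed at zero, so that an occurrence may start anywhere in the text. The function name `find_levenshtein` and the limit `MAX_PATTERN` are assumptions made for this example.

```c
#include <string.h>

enum { MAX_PATTERN = 64 };   /* illustrative limit on the pattern length */

/* Returns the text index just past the first position where some substring
 * ending there is within Levenshtein distance k of p, or -1 if none. */
int find_levenshtein(const char *text, const char *p, int k)
{
    int m = (int)strlen(p);
    if (m > MAX_PATTERN || k < 0)
        return -1;

    int col[MAX_PATTERN + 1];
    for (int j = 0; j <= m; j++)
        col[j] = j;                    /* empty text versus the first j pattern symbols */
    if (col[m] <= k)
        return 0;                      /* the empty substring is already close enough */

    for (int i = 0; text[i] != '\0'; i++) {
        int diag = col[0];             /* previous column, row j - 1 */
        col[0] = 0;                    /* an occurrence may start at any text position */
        for (int j = 1; j <= m; j++) {
            int up = col[j];           /* previous column, row j */
            int best = diag + (text[i] == p[j - 1] ? 0 : 1);  /* match or replace */
            if (up + 1 < best)
                best = up + 1;         /* extra text symbol */
            if (col[j - 1] + 1 < best)
                best = col[j - 1] + 1; /* pattern symbol skipped */
            diag = up;
            col[j] = best;
        }
        if (col[m] <= k)
            return i + 1;
    }
    return -1;
}
```

For instance, `find_levenshtein("abcdef", "cxe", 1)` evaluates to 5, because the substring cde is one substitution away from the pattern.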