Methods of Analysis of Textual Data (MATD)
Jiří Dvorský
October 11, 2019
Department of Computer Science, VŠB – TU Ostrava
Lectures Outline
- 1. Pattern Matching
Exact pattern matching
Searching for a finite set of patterns
Searching for a (regular) infinite set of patterns in text
Approximate pattern matching
Pattern Matching
Jiří Dvorský
Department of Computer Science, VŠB – TU Ostrava
Pattern Matching
Exact pattern matching
Searching for a (Regular) Infinite Set of Patterns in Text
- 1. How do we describe an infinite set of patterns, i.e. strings? With regular expressions.
- 2. What do we use to perform the matching? Finite automata.
Regular Expressions and Languages
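| Regular expression $R$ | Value of expression $h(R)$ |
|---|---|
| Atomic expressions | |
| $\emptyset$ | $\emptyset$ |
| $\varepsilon$ | $\{\varepsilon\}$ |
| $a$, $a \in \Sigma$ | $\{a\}$ |
| Operations | |
| $U \cdot V$ | $\{uv \mid u \in h(U) \wedge v \in h(V)\}$ |
| $U + V$ | $h(U) \cup h(V)$ |

$V^k = \underbrace{V \cdot V \cdot \ldots \cdot V}_{k \text{ times}}$
$V^+ = V^1 + V^2 + V^3 + \ldots$
$V^* = V^0 + V^1 + V^2 + \ldots$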
Regular Expression Features
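- $U + (V + W) = (U + V) + W$
- $U \cdot (V \cdot W) = (U \cdot V) \cdot W$
- $U + V = V + U$
- $(U + V) \cdot W = (U \cdot W) + (V \cdot W)$
- $U \cdot (V + W) = (U \cdot V) + (U \cdot W)$
- $U + U = U$
- $\varepsilon \cdot U = U$
- $\emptyset \cdot U = \emptyset$
- $U + \emptyset = U$
- $U^* = \varepsilon + U^+$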
Deterministic Finite Automaton
Definition. A deterministic finite automaton (DFA) is a quintuple $A = (Q, \Sigma, q_0, \delta, F)$, where
- $Q$ is a finite set of states
- $\Sigma$ is an alphabet
- $q_0 \in Q$ is an initial state
- $\delta : Q \times \Sigma \to Q$ is a transition function
- $F \subseteq Q$ is a set of final states
Deterministic Finite Automaton (cont.)
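A configuration of a finite automaton is a pair $(q, w) \in Q \times \Sigma^*$.
A transition of a finite automaton is a relation $\mapsto \; \subseteq (Q \times \Sigma^*) \times (Q \times \Sigma^*)$ such that $(q, aw) \mapsto (q', w) \iff \delta(q, a) = q'$.
The automaton accepts a word $w$ if $(q_0, w) \mapsto^* (q, \varepsilon)$ for some $q \in F$.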
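The acceptance condition above maps almost directly onto a table-driven loop. The following C sketch is an illustration only: the function name `dfa_accepts`, the table layout, and the example automaton (three states over {0, 1} that accept words ending in 01) are assumptions chosen for the example, not part of the lecture.

```c
#include <stdbool.h>
#include <stddef.h>

enum { NUM_STATES = 3, NUM_SYMBOLS = 2 };

/* Hypothetical example DFA over {0, 1} accepting words that end in "01".
   delta[q][a] is the state reached from state q on symbol a. */
static const int delta[NUM_STATES][NUM_SYMBOLS] = {
    /* q0 */ {1, 0},
    /* q1 */ {1, 2},
    /* q2 */ {1, 0},
};
static const bool is_final[NUM_STATES] = {false, false, true};

/* Returns true iff the DFA moves from (q0, w) to (q, epsilon) with q final.
   Characters other than '0' and '1' are outside the alphabet: reject. */
bool dfa_accepts(const char *w)
{
    int q = 0;                               /* initial state q0 */
    for (size_t i = 0; w[i] != '\0'; i++) {
        if (w[i] != '0' && w[i] != '1')
            return false;
        q = delta[q][w[i] - '0'];            /* one transition step */
    }
    return is_final[q];
}
```

Each input symbol costs one table lookup, so a word of length n is processed in O(n) steps regardless of the number of states.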
Nondeterministic Finite Automaton
Definition. A nondeterministic finite automaton (NFA) is a quintuple $A = (Q, \Sigma, q_0, \delta, F)$, where
- $Q$ is a finite set of states
- $\Sigma$ is an alphabet
- $q_0 \in Q$ is an initial state
- $\delta : Q \times \Sigma \to \mathcal{P}(Q)$ is a transition function, where $\mathcal{P}(Q)$ is the set of all subsets of $Q$
- $F \subseteq Q$ is a set of final states
- Alternatively, an NFA can be defined as $A = (Q, \Sigma, S, \delta, F)$, where $S \subseteq Q$ is a set of initial states.
- For each NFA, there is a DFA that recognizes the same formal language.
Nondeterministic Finite Automaton – example
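Set of patterns $P = \{he, her, she\}$
[Figure: two NFAs for $P$; the last state of each pattern chain is final.
First NFA: three separate chains with initial states $q_1$, $q_4$, $q_8$, each with a self-loop on $\Sigma$: $q_1 \xrightarrow{h} q_2 \xrightarrow{e} q_3$; $q_4 \xrightarrow{h} q_5 \xrightarrow{e} q_6 \xrightarrow{r} q_7$; $q_8 \xrightarrow{s} q_9 \xrightarrow{h} q_{10} \xrightarrow{e} q_{11}$.
Second NFA: a single initial state $q_1$ with a self-loop on $\Sigma$ and shared prefixes: $q_1 \xrightarrow{h} q_2 \xrightarrow{e} q_3 \xrightarrow{r} q_4$; $q_1 \xrightarrow{s} q_5 \xrightarrow{h} q_6 \xrightarrow{e} q_7$.]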
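An NFA can be run directly by tracking the set of all currently active states. The sketch below does this for the second, prefix-sharing automaton for {he, her, she}, storing the state set as a bit mask. The function names, the bit-mask encoding, and the choice to count occurrences are assumptions made for this illustration; states 3, 4 and 7 are taken as final because they end the chains for "he", "her" and "she".

```c
#include <stddef.h>

/* States 1..7 of the prefix-sharing NFA are stored as bits 1..7. */
#define BIT(q) (1u << (q))

/* Set of states reachable from state q on character c. */
static unsigned nfa_step_one(int q, char c)
{
    switch (q) {
    case 1: return BIT(1)                        /* self-loop on Sigma */
                 | (c == 'h' ? BIT(2) : 0u)
                 | (c == 's' ? BIT(5) : 0u);
    case 2: return c == 'e' ? BIT(3) : 0u;
    case 3: return c == 'r' ? BIT(4) : 0u;
    case 5: return c == 'h' ? BIT(6) : 0u;
    case 6: return c == 'e' ? BIT(7) : 0u;
    default: return 0u;                          /* states 4 and 7: no outgoing edges */
    }
}

/* Counts occurrences of the patterns {he, her, she} in text: after each
   character, every active final state (3, 4 or 7) is one occurrence. */
int count_occurrences(const char *text)
{
    unsigned active = BIT(1);                    /* initial state only */
    int count = 0;
    for (size_t i = 0; text[i] != '\0'; i++) {
        unsigned next = 0u;
        for (int q = 1; q <= 7; q++)
            if (active & BIT(q))
                next |= nfa_step_one(q, text[i]);
        active = next;
        for (int q = 1; q <= 7; q++)
            if ((active & BIT(q)) && (q == 3 || q == 4 || q == 7))
                count++;
    }
    return count;
}
```

For the text "ushers" the active set after "ushe" is {1, 3, 7}, so "he" and "she" are reported at the same position, and "her" follows one character later.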
NFA ⟶ DFA Conversion
The DFA can be constructed using the powerset construction.
NFA $A = (Q, \Sigma, S, \delta, F)$ ⟶ DFA $A' = (Q', \Sigma', q'_0, \delta', F')$
- $Q' \subseteq \mathcal{P}(Q)$
- $\Sigma' = \Sigma$
- $q'_0 = S$
- $\delta'(q', x) = \bigcup_{q \in q'} \delta(q, x)$
- $F' = \{q' \in Q' \mid q' \cap F \neq \emptyset\}$
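The construction can be sketched in C by representing each DFA state as a bit mask of NFA states and exploring only the reachable subsets. Everything named here (`nfa_delta`, `subset_step`, `powerset_construction`, the five symbol classes with 'x' standing for "any other character") is an assumption for illustration; the NFA is the prefix-sharing automaton for {he, her, she} with states 1 to 7.

```c
#include <stddef.h>

#define BIT(q) (1u << (q))
enum { NUM_SYMBOLS = 5, MAX_DFA_STATES = 128 };

/* Symbol classes: the four letters used by the patterns and one
   representative ('x') standing for every other character. */
static const char symbols[NUM_SYMBOLS] = {'e', 'h', 'r', 's', 'x'};

/* Result of the construction: dfa_states[i] is the NFA subset of DFA state i,
   dfa_delta[i][a] the index of its successor on symbols[a]. */
static unsigned dfa_states[MAX_DFA_STATES];
static int dfa_delta[MAX_DFA_STATES][NUM_SYMBOLS];

/* delta of the assumed NFA for {he, her, she}; states 1..7 are bits 1..7. */
static unsigned nfa_delta(int q, char c)
{
    switch (q) {
    case 1: return BIT(1) | (c == 'h' ? BIT(2) : 0u) | (c == 's' ? BIT(5) : 0u);
    case 2: return c == 'e' ? BIT(3) : 0u;
    case 3: return c == 'r' ? BIT(4) : 0u;
    case 5: return c == 'h' ? BIT(6) : 0u;
    case 6: return c == 'e' ? BIT(7) : 0u;
    default: return 0u;
    }
}

/* delta'(q', x): union of delta(q, x) over all q in the subset q'. */
static unsigned subset_step(unsigned subset, char c)
{
    unsigned result = 0u;
    for (int q = 1; q <= 7; q++)
        if (subset & BIT(q))
            result |= nfa_delta(q, c);
    return result;
}

/* Builds the reachable part of the DFA and returns its number of states. */
int powerset_construction(void)
{
    int count = 0;
    dfa_states[count++] = BIT(1);                /* q'0 = S = {1} */
    for (int i = 0; i < count; i++) {            /* the array doubles as work list */
        for (int a = 0; a < NUM_SYMBOLS; a++) {
            unsigned target = subset_step(dfa_states[i], symbols[a]);
            int j = 0;
            while (j < count && dfa_states[j] != target)
                j++;                             /* look up a known subset */
            if (j == count)
                dfa_states[count++] = target;    /* newly discovered state */
            dfa_delta[i][a] = j;
        }
    }
    return count;
}
```

Only subsets reachable from the initial subset are generated, which is why the result stays far below the 2^7 subsets that exist in principle.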
NFA ⟶ DFA Conversion I
| State | Label | e | h | r | s | other |
|---|---|---|---|---|---|---|
| {1, 4, 8} | $q'_1$ | {1, 4, 8} | {1, 2, 4, 5, 8} | {1, 4, 8} | {1, 4, 8, 9} | {1, 4, 8} |
| {1, 2, 4, 5, 8} | $q'_2$ | {1, 3, 4, 6, 8} | {1, 2, 4, 5, 8} | {1, 4, 8} | {1, 4, 8, 9} | {1, 4, 8} |
| {1, 4, 8, 9} | $q'_3$ | {1, 4, 8} | {1, 2, 4, 5, 8, 10} | {1, 4, 8} | {1, 4, 8, 9} | {1, 4, 8} |
| {1, 3, 4, 6, 8} | $q'_4$ | {1, 4, 8} | {1, 2, 4, 5, 8} | {1, 4, 7, 8} | {1, 4, 8, 9} | {1, 4, 8} |
| {1, 2, 4, 5, 8, 10} | $q'_5$ | {1, 3, 4, 6, 8, 11} | {1, 2, 4, 5, 8} | {1, 4, 8} | {1, 4, 8, 9} | {1, 4, 8} |
| {1, 4, 7, 8} | $q'_6$ | {1, 4, 8} | {1, 2, 4, 5, 8} | {1, 4, 8} | {1, 4, 8, 9} | {1, 4, 8} |
| {1, 3, 4, 6, 8, 11} | $q'_7$ | {1, 4, 8} | {1, 2, 4, 5, 8} | {1, 4, 7, 8} | {1, 4, 8, 9} | {1, 4, 8} |

[Figure: the source NFA with three separate chains (initial states $q_1$, $q_4$, $q_8$) and the resulting DFA with states $q'_1, \dots, q'_7$.]
Only reachable states are shown; transitions back to the initial state are not shown.
NFA ⟶ DFA Conversion II
| State | Label | e | h | r | s | other |
|---|---|---|---|---|---|---|
| {1} | $q'_1$ | {1} | {1, 2} | {1} | {1, 5} | {1} |
| {1, 2} | $q'_2$ | {1, 3} | {1, 2} | {1} | {1, 5} | {1} |
| {1, 5} | $q'_3$ | {1} | {1, 2, 6} | {1} | {1, 5} | {1} |
| {1, 3} | $q'_4$ | {1} | {1, 2} | {1, 4} | {1, 5} | {1} |
| {1, 2, 6} | $q'_5$ | {1, 3, 7} | {1, 2} | {1} | {1, 5} | {1} |
| {1, 4} | $q'_6$ | {1} | {1, 2} | {1} | {1, 5} | {1} |
| {1, 3, 7} | $q'_7$ | {1} | {1, 2} | {1, 4} | {1, 5} | {1} |

[Figure: the source NFA with a single initial state $q_1$ and shared prefixes, and the resulting DFA with states $q'_1, \dots, q'_7$.]
Derivation of Regular Expression
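For a given regular expression $R$, the derivation with respect to $x$ is defined by
$$h\left(\frac{\mathrm{d}R}{\mathrm{d}x}\right) = \{y \mid xy \in h(R)\}$$
Example. For $R = a + shell + stop + plot$ with value $h(R) = \{a, shell, stop, plot\}$, the derivations are
$$h\left(\frac{\mathrm{d}R}{\mathrm{d}a}\right) = \{\varepsilon\}, \quad h\left(\frac{\mathrm{d}R}{\mathrm{d}s}\right) = \{hell, top\}, \quad h\left(\frac{\mathrm{d}R}{\mathrm{d}t}\right) = \emptyset$$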
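For a finite language the definition can be checked mechanically: keep the words that start with the given prefix and strip that prefix. The C sketch below does exactly that for the example language; the function name `derive` and the arrays `language` and `result` are hypothetical names introduced for this illustration, and an empty remainder "" plays the role of $\varepsilon$.

```c
#include <stddef.h>
#include <string.h>

/* Example data (assumption for illustration): the value of
   a + shell + stop + plot, and a buffer for the derived words. */
static const char *const language[] = {"a", "shell", "stop", "plot"};
static const char *result[4];

/* Derivation of a finite language by the string x: for every word that
   starts with x, the remainder after x is stored in out[]. The remainders
   point into the original words (nothing is copied). Returns the number of
   remainders; out[] must have room for n entries. */
size_t derive(const char *const words[], size_t n, const char *x,
              const char *out[])
{
    size_t len = strlen(x);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (strncmp(words[i], x, len) == 0)
            out[count++] = words[i] + len;
    return count;
}
```

Calling `derive(language, 4, "s", result)` leaves "hell" and "top" in `result`, in agreement with the example above.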
Derivation of Regular Expression – properties
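- $\frac{\mathrm{d}\emptyset}{\mathrm{d}a} = \emptyset, \ \forall a \in \Sigma$
- $\frac{\mathrm{d}\varepsilon}{\mathrm{d}a} = \emptyset, \ \forall a \in \Sigma$
- $\frac{\mathrm{d}a}{\mathrm{d}a} = \varepsilon, \ \forall a \in \Sigma$
- $\frac{\mathrm{d}b}{\mathrm{d}a} = \emptyset, \ \forall b \neq a$
- $\frac{\mathrm{d}(U + V)}{\mathrm{d}a} = \frac{\mathrm{d}U}{\mathrm{d}a} + \frac{\mathrm{d}V}{\mathrm{d}a}$
- $\frac{\mathrm{d}(U \cdot V)}{\mathrm{d}a} = \frac{\mathrm{d}U}{\mathrm{d}a} \cdot V, \ \varepsilon \notin h(U)$
- $\frac{\mathrm{d}(U \cdot V)}{\mathrm{d}a} = \frac{\mathrm{d}U}{\mathrm{d}a} \cdot V + \frac{\mathrm{d}V}{\mathrm{d}a}, \ \varepsilon \in h(U)$
- $\frac{\mathrm{d}V^*}{\mathrm{d}a} = \frac{\mathrm{d}V}{\mathrm{d}a} \cdot V^*$
- $\frac{\mathrm{d}V}{\mathrm{d}x} = \frac{\mathrm{d}}{\mathrm{d}a_n}\left(\frac{\mathrm{d}}{\mathrm{d}a_{n-1}}\left(\cdots \frac{\mathrm{d}}{\mathrm{d}a_2}\left(\frac{\mathrm{d}V}{\mathrm{d}a_1}\right)\right)\right)$, for $x = a_1 a_2 \ldots a_n$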
Construction of DFA Derivations of RE
- Derivations of regular expressions allow a DFA to be built directly and algorithmically for any regular expression.
- Let $V$ be a given regular expression over the alphabet $\Sigma$.
- Each state of the DFA defines the set of words that move the DFA from this state to any of the final states. So every state can be associated with a regular expression defining this set of words:
$$q_0 = V, \qquad \delta(q, x) = \frac{\mathrm{d}q}{\mathrm{d}x}, \qquad F = \{q \in Q \mid \varepsilon \in h(q)\}$$
Construction of DFA Derivations of RE – example
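Let's take $V = (0 + 1)^* \cdot 01$ over the alphabet $\Sigma = \{0, 1\}$. Then $q_0 = (0 + 1)^* \cdot 01$.
Example of derivations:
$$\begin{aligned}
\frac{\mathrm{d}((0 + 1)^* \cdot 01)}{\mathrm{d}0} &= \frac{\mathrm{d}((0 + 1)^*)}{\mathrm{d}0} \cdot 01 + \frac{\mathrm{d}01}{\mathrm{d}0} \\
&= \frac{\mathrm{d}(0 + 1)}{\mathrm{d}0} \cdot (0 + 1)^* \cdot 01 + 1 \\
&= \left(\frac{\mathrm{d}0}{\mathrm{d}0} + \frac{\mathrm{d}1}{\mathrm{d}0}\right) \cdot (0 + 1)^* \cdot 01 + 1 \\
&= (\varepsilon + \emptyset) \cdot (0 + 1)^* \cdot 01 + 1 \\
&= (0 + 1)^* \cdot 01 + 1
\end{aligned}$$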
Construction of DFA Derivations of RE – example (cont.)
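$$\begin{aligned}
\frac{\mathrm{d}((0 + 1)^* \cdot 01)}{\mathrm{d}1} &= \frac{\mathrm{d}((0 + 1)^*)}{\mathrm{d}1} \cdot 01 + \frac{\mathrm{d}01}{\mathrm{d}1} \\
&= \frac{\mathrm{d}(0 + 1)}{\mathrm{d}1} \cdot (0 + 1)^* \cdot 01 + \emptyset \\
&= \left(\frac{\mathrm{d}0}{\mathrm{d}1} + \frac{\mathrm{d}1}{\mathrm{d}1}\right) \cdot (0 + 1)^* \cdot 01 \\
&= (\emptyset + \varepsilon) \cdot (0 + 1)^* \cdot 01 \\
&= (0 + 1)^* \cdot 01
\end{aligned}$$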
Construction of DFA Derivations of RE – example (cont.)
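| Regular Expression | State | 0 | 1 |
|---|---|---|---|
| $(0 + 1)^* \cdot 01$ | $q_0$ | $(0 + 1)^* \cdot 01 + 1$ | $(0 + 1)^* \cdot 01$ |
| $(0 + 1)^* \cdot 01 + 1$ | $q_1$ | $(0 + 1)^* \cdot 01 + 1$ | $(0 + 1)^* \cdot 01 + \varepsilon$ |
| $(0 + 1)^* \cdot 01 + \varepsilon$ | $q_2$ | $(0 + 1)^* \cdot 01 + 1$ | $(0 + 1)^* \cdot 01$ |

[Figure: the resulting DFA with states $q_0$ (initial), $q_1$ and $q_2$; $q_2$ is final because its expression contains $\varepsilon$.]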
Pattern Matching
Approximate pattern matching
Approximate pattern matching
- A string metric (string distance function) is a metric that measures the distance between two text strings for approximate string matching.
- A string metric can be regarded as an "inverse similarity": it measures how dissimilar two strings are.
- There are two classic metrics:
- 1. Hamming distance
- 2. Levenshtein distance
- String dissimilarity (distance) can indeed be measured. Both distances are metrics in the mathematical sense: they satisfy non-negativity, identity, symmetry, and the triangle inequality.
Hamming distance
Definition. The Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other.
Example. The Hamming distance of “karolin” and “kathrin” is 3.
    k a r o l i n
    k a t h r i n
        1 1 1
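The definition amounts to one pass over both strings. A minimal C sketch follows; the function name `hamming_distance` and the convention of returning -1 for strings of unequal length are assumptions made for this example.

```c
#include <string.h>

/* Hamming distance of two strings; returns -1 if the lengths differ,
   because the distance is defined only for strings of equal length. */
int hamming_distance(const char *a, const char *b)
{
    if (strlen(a) != strlen(b))
        return -1;
    int distance = 0;
    for (size_t i = 0; a[i] != '\0'; i++)
        if (a[i] != b[i])
            distance++;          /* one substitution needed at position i */
    return distance;
}
```

For “karolin” and “kathrin” the loop finds mismatches at the third, fourth and fifth positions.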
Levenshtein distance
Definition. The Levenshtein distance (1965) between two strings is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other.
Levenshtein distance (cont.)
Example. The Levenshtein distance between “kitten” and “sitting” is 3:
- 1. kitten → sitten (substitution of “s” for “k”)
- 2. sitten → sittin (substitution of “i” for “e”)
- 3. sittin → sitting (insertion of “g” at the end).
There is no way to do it with fewer than three edits.
Levenshtein distance (cont.)
Upper and lower bounds. The Levenshtein distance has several simple upper and lower bounds:
- It is at least the difference of the lengths of the two strings.
- It is at most the length of the longer string.
- It is zero if and only if the strings are equal.
- If the strings have the same length, the Hamming distance is an upper bound on the Levenshtein distance.
- The Levenshtein distance between two strings is no greater than the sum of their Levenshtein distances from a third string (triangle inequality).
Levenshtein distance (cont.)
$$d(i, j) = \begin{cases}
i, & \text{if } j = 0 \\
j, & \text{if } i = 0 \\
\min \begin{pmatrix} d(i - 1, j) + 1, \\ d(i, j - 1) + 1, \\ d(i - 1, j - 1) + c(i, j) \end{pmatrix} & \text{otherwise}
\end{cases}$$
where
$$c(i, j) = \begin{cases} 0 & \text{if } a_i = b_j \\ 1 & \text{otherwise} \end{cases}$$
The first element in the minimum corresponds to deletion (from $a$ to $b$), the second to insertion and the third to match or mismatch.
Levenshtein distance (cont.)
/* smallest of three integers */
static int minimum(int a, int b, int c)
{
    int m = (a < b) ? a : b;
    return (m < c) ? m : c;
}

/* Naive recursive Levenshtein distance between the first len_s characters
   of s and the first len_t characters of t (exponential running time). */
int LevenshteinDistance(const char *s, int len_s, const char *t, int len_t)
{
    int cost;

    /* base case: empty strings */
    if (len_s == 0) return len_t;
    if (len_t == 0) return len_s;

    /* test if last characters of the strings match */
    if (s[len_s-1] == t[len_t-1])
        cost = 0;
    else
        cost = 1;

    /* return minimum of delete char from s, delete char from t, and delete char from both */
    return minimum(
        LevenshteinDistance(s, len_s-1, t, len_t) + 1,
        LevenshteinDistance(s, len_s, t, len_t-1) + 1,
        LevenshteinDistance(s, len_s-1, t, len_t-1) + cost
    );
}
Levenshtein distance (cont.)
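|   |   | k | i | t | t | e | n |
|---|---|---|---|---|---|---|---|
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
| s | 1 | 1 (a) | 2 | 3 | 4 | 5 | 6 |
| i | 2 | 2 | 1 (b) | 2 | 3 | 4 | 5 |
| t | 3 | 3 | 2 | 1 (c) | 2 | 3 | 4 |
| t | 4 | 4 | 3 | 2 | 1 (d) | 2 | 3 |
| i | 5 | 5 | 4 | 3 | 2 | 2 (e) | 3 |
| n | 6 | 6 | 5 | 4 | 3 | 3 | 2 (f) |
| g | 7 | 7 | 6 | 5 | 4 | 4 | 3 (g) |

- (a) substitution of k for s
- (b) i is equal to i
- (c) t is equal to t
- (d) t is equal to t
- (e) substitution of e for i
- (f) n is equal to n
- (g) deletion of g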
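A table like the one above can be filled row by row, which avoids the repeated subproblems of the recursive version. The following C sketch keeps only two rows at a time; the function names `levenshtein_dp` and `min3` and the error convention (returning `(size_t)-1` when allocation fails) are assumptions for this illustration.

```c
#include <stdlib.h>
#include <string.h>

/* smallest of three values */
static size_t min3(size_t a, size_t b, size_t c)
{
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Levenshtein distance by dynamic programming in O(|a| * |b|) time.
   Rows correspond to prefixes of a, columns to prefixes of b; only the
   previous and the current row are stored.
   Returns (size_t)-1 if memory allocation fails. */
size_t levenshtein_dp(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t *prev = malloc((lb + 1) * sizeof *prev);
    size_t *curr = malloc((lb + 1) * sizeof *curr);
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return (size_t)-1;
    }
    for (size_t j = 0; j <= lb; j++)
        prev[j] = j;                              /* d(0, j) = j */
    for (size_t i = 1; i <= la; i++) {
        curr[0] = i;                              /* d(i, 0) = i */
        for (size_t j = 1; j <= lb; j++) {
            size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = min3(prev[j] + 1,           /* deletion */
                           curr[j - 1] + 1,       /* insertion */
                           prev[j - 1] + cost);   /* match or mismatch */
        }
        size_t *tmp = prev;                       /* current row becomes previous */
        prev = curr;
        curr = tmp;
    }
    size_t result = prev[lb];                     /* d(|a|, |b|) */
    free(prev);
    free(curr);
    return result;
}
```

With "sitting" as the first argument and "kitten" as the second, the rows computed by the loop are the rows of the table, and the final value is the bottom-right entry.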
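Approximate pattern matching using finite automata
[Figure: NFA for exact string matching ($m = 4$). States $q_0$ (initial, with a self-loop on $\Sigma$), $q_1$, $q_2$, $q_3$, $q_4$; the transition from $q_{i-1}$ to $q_i$ is labeled with the pattern symbol $p_i$.]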
Approximate pattern matching using finite automata (cont.)
[Figure: NFA for approximate string matching using the Hamming distance ($m = 4$, $k = 3$), with states $q_0$ (initial, with a self-loop on $\Sigma$) to $q_{13}$ and transitions labeled with the pattern symbols $p_1, \dots, p_4$. Left-to-right transitions: match. Diagonal transitions: replace.]
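One way to run such an automaton is to keep a grid of active states indexed by the number of replacements used so far and the number of pattern symbols consumed. The C sketch below follows that idea; the function name `hamming_search`, the limits `MAX_M` and `MAX_K`, and the choice to return the end position of the first occurrence are assumptions for this illustration. A matching character follows a horizontal edge, any other character follows a diagonal edge.

```c
#include <stdbool.h>
#include <string.h>

enum { MAX_M = 32, MAX_K = 8 };

/* Simulates the NFA for Hamming-distance matching. active[e][i] is true when
   the state "i pattern symbols consumed with e replacements" is active.
   Returns the index of the last character of the first occurrence of pattern
   in text with at most k mismatches, or -1 if there is none (also when the
   pattern is empty or the limits MAX_M / MAX_K are exceeded). */
long hamming_search(const char *text, const char *pattern, int k)
{
    size_t m = strlen(pattern);
    if (m == 0 || m > MAX_M || k < 0 || k > MAX_K)
        return -1;

    bool active[MAX_K + 1][MAX_M + 1] = {{false}};
    active[0][0] = true;                          /* initial state */

    for (size_t t = 0; text[t] != '\0'; t++) {
        bool next[MAX_K + 1][MAX_M + 1] = {{false}};
        next[0][0] = true;                        /* self-loop on Sigma */
        for (int e = 0; e <= k; e++) {
            for (size_t i = 0; i < m; i++) {
                if (!active[e][i])
                    continue;
                if (text[t] == pattern[i])
                    next[e][i + 1] = true;        /* horizontal edge: match */
                else if (e < k)
                    next[e + 1][i + 1] = true;    /* diagonal edge: replace */
            }
        }
        memcpy(active, next, sizeof active);
        for (int e = 0; e <= k; e++)
            if (active[e][m])
                return (long)t;                   /* a final state is active */
    }
    return -1;
}
```

The grid here is rectangular for simplicity, whereas the figure omits states that can never become active; both accept the same occurrences.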
Approximate pattern matching using finite automata (cont.)
[Figure: NFA over the same states $q_0, \dots, q_{13}$, with further transitions labeled $p_2$, $p_3$, $p_4$ and nine $\varepsilon$-transitions. The caption is not preserved in the source.]