SLIDE 1

Faster Longest Common Extension Queries in Strings over General Alphabets

Paweł Gawrychowski (1, 2), Tomasz Kociumaka (1), Wojciech Rytter (1), Tomasz Waleń (1)

(1) University of Warsaw, Poland

[gawry,kociumaka,rytter,walen]@mimuw.edu.pl

(2) University of Haifa, Israel

CPM 2016, Tel Aviv, Israel, 2016-06-27

1/23

SLIDE 2

Introduction

LCE problem

We consider the Longest Common Extension (LCE) problem for a general ordered alphabet, where the only operation allowed on characters is comparison.

Preprocess a given word w of length n for queries LCE(i, j): the length of the longest common factor starting at both position i and position j in w.

2/23

SLIDE 5

Introduction

LCE problem

Example

position:  1  2  3  4  5  6  7  8  9 10 11 12
w:         a  b  a  b  b  a  b  b  a  b  a  a

w[2..5] = b a b b and w[8..11] = b a b a agree on their first three characters and differ on the fourth, so

LCE(2, 8) = 3
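
The definition can be checked directly with a character-by-character scan. The following is a minimal sketch, not the data structure from the talk; the function name `lce_naive` is an illustrative choice, and positions are 1-based as on the slide.

```python
def lce_naive(w, i, j):
    """Length of the longest common factor starting at positions i and j (1-based)."""
    n = len(w)
    length = 0
    # Only equality tests on characters are used, so any ordered alphabet works.
    while i + length <= n and j + length <= n and w[i + length - 1] == w[j + length - 1]:
        length += 1
    return length


w = "ababbabbabaa"          # the word from the example
print(lce_naive(w, 2, 8))   # prints 3
```

A single query costs time proportional to its answer, which is what the preprocessing in the rest of the talk is designed to avoid.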

2/23

SLIDE 6

Results

Naive solution

Answering n LCE queries can be done in: O(n log n) time (reduce alphabet to [1..n] via sorting).

Previous results: Kosolobov (IPL, 2016)

Answering n LCE queries can be done in: O(n log^{2/3} n) time. Conjectured that O(n) time is possible. Motivation: efficient computation of runs (Bannai et al., 2015).

Our result:

Answering n LCE queries can be done in: O(n log log n) time, using O(n) symbol comparisons.

3/23

SLIDE 7

Difference cover

t-Cover / difference cover

A set S(t) ⊆ [1..n] is called a t-cover of [1..n] if:

◮ S(t) is t-periodic: for each i ∈ [1..n − t], i ∈ S(t) ⇔ i + t ∈ S(t),

◮ there is a constant-time computable function h such that for 1 ≤ i, j ≤ n − t: 0 ≤ h(i, j) ≤ t and i + h(i, j), j + h(i, j) ∈ S(t).

Lemma

For each t ≤ n there is a t-cover S(t) of size O(n/√t) which can be constructed in O(n/√t) time.
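
One way to obtain a cover of this size is sketched below. This is a simple textbook-style construction chosen for illustration, and it is an assumption that it differs from the one used in the paper; the names `cover_residues`, `t_cover` and `shift` are hypothetical. It takes r = ⌈√t⌉ and keeps the residues 0..r−1 together with the multiples of r, so every difference modulo t is realised by some pair of kept residues.

```python
import math


def cover_residues(t):
    """Residues mod t such that every difference mod t occurs between two of them."""
    r = math.isqrt(t - 1) + 1            # r = ceil(sqrt(t)) for t >= 1
    return set(range(r)) | set(range(0, t, r))


def t_cover(n, t):
    """Positions of [1..n] whose residue mod t is kept: a t-periodic set."""
    residues = cover_residues(t)
    return [i for i in range(1, n + 1) if i % t in residues]


def shift(i, j, t, residues):
    """Smallest h in [0, t) with i + h and j + h both in the cover.

    This scan takes O(t) time; a table indexed by (j - i) mod t would make it
    constant-time, which is what the definition asks for.
    """
    for h in range(t):
        if (i + h) % t in residues and (j + h) % t in residues:
            return h
    raise ValueError("residues do not form a difference cover")


print(len(cover_residues(100)), len(t_cover(1000, 100)))
```

The number of kept residues is at most 2√t + 2, so the cover has O(n/√t) elements, in line with the lemma.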

4/23

SLIDE 8

t-Cover, example

S(6) = {2, 3, 5, 8, 9, 11, 14, 15, 17, 20, 21, 23}.

[Figure: positions 1..24 split into four periods of length 6, with the elements of S(6) marked.]

For i = 3, j = 10 we have h(3, 10) = 5, since 3 + 5 = 8 and 10 + 5 = 15 both belong to S(6).
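
The example can be reproduced with a few lines. This sketch assumes that S(6) is exactly the set of positions congruent to 2, 3 or 5 modulo 6 (which matches the listed elements) and that h returns the smallest valid shift; the linear scan stands in for the constant-time function of the definition.

```python
T = 6
RESIDUES = {2, 3, 5}   # residues mod 6 read off from the slide's S(6)


def in_cover(i):
    return i % T in RESIDUES


def h(i, j):
    """Smallest shift in [0, T) that moves both i and j into S(6)."""
    for s in range(T):
        if in_cover(i + s) and in_cover(j + s):
            return s
    raise ValueError("no common shift found")


cover = [i for i in range(1, 25) if in_cover(i)]
print(cover)      # [2, 3, 5, 8, 9, 11, 14, 15, 17, 20, 21, 23]
print(h(3, 10))   # 5
```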

5/23

SLIDE 9

ShortLCE_t vs CoarseLCE_t

ShortLCE_t

ShortLCE_t(i, j) = min(LCE(i, j), t)

It gives the length of the LCE, capped at a maximal length of t.

CoarseLCE_t

CoarseLCE_t(i, j) = ⌊LCE(i, j)/t⌋ if i, j ∈ S(t), and ⊥ otherwise.

It gives the length of the LCE up to a precision of t characters.

6/23

SLIDE 10

Generic algorithm

Algorithm 1: GenericLCE_t(i, j)
    ℓ1 = ShortLCE_t(i, j)
    if ℓ1 < t then return ℓ1
    ∆ = h_t(i, j)                ⊲ i + ∆, j + ∆ ∈ S(t)
    ℓ2 = t · CoarseLCE_t(i + ∆, j + ∆)
    ℓ3 = ShortLCE_t(i + ∆ + ℓ2, j + ∆ + ℓ2)
    return ∆ + ℓ2 + ℓ3

[Figure: the extensions starting at i and j, with segments labelled ∆, ℓ1, ℓ2 (CoarseLCE, in blocks of length t) and ℓ3 (ShortLCE).]
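
The way the three pieces combine can be tried out with naive stand-ins. In the sketch below, ShortLCE_t and CoarseLCE_t are computed by direct scanning (an assumption made only to keep the example short; the talk replaces both with faster structures), and the cover is the S(6) from the earlier example. All function names are illustrative.

```python
T = 6
RESIDUES = {2, 3, 5}                       # S(6) from the t-cover example


def lce(w, i, j):
    """Naive LCE with 1-based positions; returns 0 past the end of w."""
    n, length = len(w), 0
    while i + length <= n and j + length <= n and w[i + length - 1] == w[j + length - 1]:
        length += 1
    return length


def short_lce(w, i, j):
    return min(lce(w, i, j), T)


def coarse_lce(w, i, j):
    if i % T not in RESIDUES or j % T not in RESIDUES:
        return None                        # plays the role of ⊥
    return lce(w, i, j) // T


def h(i, j):
    for s in range(T):
        if (i + s) % T in RESIDUES and (j + s) % T in RESIDUES:
            return s
    raise ValueError("no common shift found")


def generic_lce(w, i, j):
    l1 = short_lce(w, i, j)
    if l1 < T:
        return l1                          # short answers are returned directly
    delta = h(i, j)                        # delta < T <= LCE(i, j), so the shift is safe
    l2 = T * coarse_lce(w, i + delta, j + delta)
    l3 = short_lce(w, i + delta + l2, j + delta + l2)
    return delta + l2 + l3


w = "ab" * 20 + "c" + "ab" * 15
print(generic_lce(w, 1, 42), lce(w, 1, 42))
```

The remainder after the coarse step is always shorter than t, which is why a single ShortLCE_t call recovers it exactly.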

7/23

SLIDE 11

CoarseLCE_t

CoarseLCE_t algorithm for t = Ω(log^2 n):

◮ reduce the word w to a new word code(w) that is:
  ◮ shorter (of length O(n/√t)),
  ◮ over a small alphabet [1..n],
◮ use the naive solution (with suffix arrays).

8/23

SLIDE 12

CoarseLCE_t

CoarseLCE_t algorithm for t = Ω(log^2 n):

1. sort all t-blocks starting in S(t) and remove duplicates,
2. encode every t-block with its rank on the sorted list,
3. construct a new string code(w) of length O(n/log n) over alphabet [1..n], such that any CoarseLCE_t query can be reduced to an LCE query on code(w),
4. preprocess code(w) for LCE queries.

[Figure: a word w padded with * symbols at the end; the positions of S(6) = {2, 3, 5, 8, 9, 11, 14, 15, 17, 20, 21, 23} are listed with the block ranks 1 8 6 2 3 5 1 4 6 1 8 7; w is divided into parts α, β, γ.]

code(w) : 1 8 6 2 $ 3 5 1 4 # 6 1 8 7 & (parts α, β, γ separated by the unique symbols $, #, &)
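
The four steps can be sketched as follows. This is an illustration under stated assumptions, not the construction from the paper: blocks are sorted with Python's comparison sort rather than with ShortLCE_t queries, blocks at the end of w are left shorter instead of being padded, the ranks of each residue class of S(6) are concatenated and followed by a unique negative separator, and the final LCE on code(w) is a plain scan. The names `build_code` and `coarse_lce` are hypothetical, and the sketch does not claim to reproduce the exact layout of the figure.

```python
T = 6                       # block length t
RESIDUES = (2, 3, 5)        # S(6): positions congruent to 2, 3 or 5 mod 6


def build_code(w):
    """Return (code, pos): the encoded string and the index in code of each i in S(t)."""
    n = len(w)
    cover = [i for i in range(1, n + 1) if i % T in RESIDUES]
    # Steps 1 and 2: sort the distinct t-blocks and replace each block by its rank.
    block = {i: tuple(w[i - 1:i - 1 + T]) for i in cover}
    rank = {b: r for r, b in enumerate(sorted(set(block.values())), start=1)}
    # Step 3: one run of ranks per residue class, each closed by a unique separator.
    code, pos = [], {}
    for sep, r in enumerate(RESIDUES, start=1):
        for i in range(r, n + 1, T):
            pos[i] = len(code)
            code.append(rank[block[i]])
        code.append(-sep)
    return code, pos


def coarse_lce(w, code, pos, i, j):
    """floor(LCE(i, j) / T) for i, j in S(t), read off code(w); None plays the role of ⊥."""
    if i not in pos or j not in pos:
        return None
    if i == j:
        return (len(w) - i + 1) // T
    a, b, count = pos[i], pos[j], 0
    # Step 4 would answer this with a preprocessed LCE structure; here it is a scan.
    # The unique separators guarantee a mismatch before either index leaves code.
    while code[a + count] == code[b + count]:
        count += 1
    return count


w = "abaab" * 12 + "b" + "abaab" * 5
code, pos = build_code(w)
print(len(w), len(code))
```

Because consecutive symbols within one class correspond to positions i, i + t, i + 2t, and so on, a run of equal ranks of length c means the two suffixes share exactly c full blocks.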

9/23

SLIDE 15

CoarseLCE_t continued

Lemma

For t = Ω(log^2 n) we can lexicographically sort all t-blocks of w starting in S(t) using O(n) ShortLCE_t queries and O(n) additional time.

Lemma

For t = Ω(log^2 n), if we can answer O(n) ShortLCE_t queries in T(n) time (e.g. O(n log t)), then we can preprocess w in O(T(n) + n) time (resp. O(n log t)) so that any CoarseLCE_t query can be answered in constant time.

10/23

SLIDE 16

ShortLCE_t

ShortLCE_t is computed recursively, for t = 2^k:

◮ we have k levels (level h handles queries up to length 2^h),
◮ each level has its own Union-Find structure,
◮ if at level h we find out that two positions i and j have LCE(i, j) ≥ 2^h, then we union those positions,
◮ so if Find_h(i) = Find_h(j) then LCE(i, j) ≥ 2^h; otherwise we have no information about LCE(i, j).

11/23

SLIDE 17

ShortLCE_t

Algorithm 2: ShortLCE_{2^k}(i, j): compute LCE(i, j) up to length 2^k
    if Find_k(i) = Find_k(j) then return 2^k
    if k = 0 then
        if w[i] = w[j] then ℓ = 1 else ℓ = 0
    else
        ℓ = ShortLCE_{2^{k−1}}(i, j)
        if ℓ = 2^{k−1} then
            ℓ = 2^{k−1} + ShortLCE_{2^{k−1}}(i + 2^{k−1}, j + 2^{k−1})
    if ℓ = 2^k then Union_k(i, j)
    return ℓ
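
Algorithm 2 translates almost line by line into runnable code. The sketch below makes a few assumptions that the slide leaves implicit: positions past the end of w yield 0, the case i = j is answered directly, and the Union-Find uses union by size with path halving. The class names are illustrative.

```python
class UnionFind:
    def __init__(self, size):
        self.parent = list(range(size))
        self.size = [1] * size

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx                               # union by size
        self.size[rx] += self.size[ry]


class ShortLCE:
    """Answers min(LCE(i, j), 2**k) on-line, with 1-based positions."""

    def __init__(self, w, k):
        self.w, self.n, self.k = w, len(w), k
        # One Union-Find per level 0..k; index 0 of each structure is unused.
        self.levels = [UnionFind(self.n + 1) for _ in range(k + 1)]

    def query(self, i, j):
        if i == j:                                         # assumption: handled directly
            return min(1 << self.k, self.n - i + 1)
        return self._rec(self.k, i, j)

    def _rec(self, k, i, j):
        if i > self.n or j > self.n:
            return 0
        uf = self.levels[k]
        if uf.find(i) == uf.find(j):
            return 1 << k                                  # already known: LCE >= 2^k
        if k == 0:
            length = 1 if self.w[i - 1] == self.w[j - 1] else 0
        else:
            half = 1 << (k - 1)
            length = self._rec(k - 1, i, j)
            if length == half:
                length = half + self._rec(k - 1, i + half, j + half)
        if length == 1 << k:
            uf.union(i, j)                                 # remember the full match
        return length


s = ShortLCE("ababbabbabaa", 2)
print(s.query(2, 8))   # prints 3, since LCE(2, 8) = 3 < 4
```

Unioning is sound because "LCE at least 2^k" is transitive: if positions a and b share 2^k characters and so do b and c, then a and c share them as well.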

12/23

SLIDE 19

ShortLCE_t, continued

Lemma

For t = 2^k, a sequence of q ShortLCE_t(i, j) queries can be executed on-line in total time O((q + n) · k · α(n)) = O((q + n) · log t · α(n)).

Proof.

We inductively bound the number of recursive calls triggered by ShortLCE_{2^k}(i, j), where #union is the number of Union operations performed during the call:

◮ 2k + 1 + 2 · #union if w[i..i + 2^k − 1] ≠ w[j..j + 2^k − 1],
◮ 1 + 2 · #union if w[i..i + 2^k − 1] = w[j..j + 2^k − 1].

13/23

SLIDE 20

Where are we now?

With those results we currently have:

Current result

Answering n LCE queries can be done in: O(n log log n · α(n)) time, using O(n log log n · α(n)) symbol comparisons. How can we improve it?
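
The bound follows by plugging a concrete block length into the earlier lemmas. The short derivation below is a sketch that assumes t is a power of two with t = Θ(log^2 n), so that the CoarseLCE_t preprocessing applies, and that the total number q of ShortLCE_t queries (preprocessing plus two per LCE query in the generic algorithm) is O(n).

```latex
% Assumptions: t = 2^k = \Theta(\log^2 n), and q = O(n) ShortLCE_t queries in total.
\begin{align*}
O\big((q + n)\cdot \log t \cdot \alpha(n)\big)
  &= O\big(n \cdot \log(\log^2 n) \cdot \alpha(n)\big) \\
  &= O\big(n \cdot 2\log\log n \cdot \alpha(n)\big) \\
  &= O\big(n \log\log n \cdot \alpha(n)\big).
\end{align*}
```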

14/23

SLIDE 21

Faster ShortLCE_t queries

We introduce a new difference cover S(t′) with t′ ≪ t.

Sparse version of ShortLCE queries (queries restricted to positions from S(t′)):

SparseShortLCE_{t,t′}(i, j) = ShortLCE_t(i, j) if i, j ∈ S(t′), and ⊥ otherwise.

15/23

SLIDE 22

SparseShortLCE_{t,t′}

Algorithm 3: SparseShortLCE_{2^k,2^{k′}}(i, j)
    if Find_k(i) = Find_k(j) then return 2^k        ⊲ i, j ∈ S(2^{k′})
    if k = k′ then
        compute naively ℓ = ShortLCE_{2^{k′}}(i, j)
    else
        ℓ = SparseShortLCE_{2^{k−1},2^{k′}}(i, j)
        if ℓ = 2^{k−1} then
            ℓ = 2^{k−1} + SparseShortLCE_{2^{k−1},2^{k′}}(i + 2^{k−1}, j + 2^{k−1})
    if ℓ = 2^k then Union_k(i, j)
    return ℓ

16/23

SLIDE 25

Faster ShortLCE_t queries

Lemma

A sequence of q SparseShortLCE_{2^k,2^{k′}} queries can be executed on-line in total time O(q(k + 2^{k′}) + n/√(2^{k′}) + (nk/2^{k′/2}) log∗ n).

Lemma

For t = 2^k, a sequence of q ShortLCE_t queries can be executed on-line in total time O(qk + n√k log∗ n) = O(q log t + n√(log t) log∗ n).

Proof.

Pick t′ = Θ(log t) = 2^{k′}. For a query (i, j):

◮ compute naively ℓ = ShortLCE_{2^{k′}}(i, j),
◮ if ℓ = 2^{k′}, shift (i, j) by h_{2^{k′}}(i, j) and use SparseShortLCE_{2^k,2^{k′}}.

17/23

SLIDE 27

Where are we now?

Theorem

A sequence of O(n) LCE queries for a string over a general ordered alphabet can be executed on-line in total time O(n log log n) making only O(n) symbol comparisons.

O(n) symbol comparisons

Easy but technical: requires an additional Union-Find data structure to keep track of characters already known to be equal.

18/23

SLIDE 28

Faster solution for sublinear number of queries

We use a stronger notion of t-covers: S(4^0), S(4^1), S(4^2), . . . ⊆ [1, n] is a monotone family of covers if the following conditions hold for every k:

1. S(4^k) is a 4^k-cover.
2. S(4^{k+1}) ⊆ S(4^k).
3. For any i, j ∈ S(4^k) we have that h_{4^{k+1}}(i, j) ∈ {0, 4^k, 2 · 4^k}.
4. |S(4^k)| ≤ (3/4)^k n.

Lemma (Gawrychowski et al. (WADS 2015))

Let S(4^k) be the set of non-negative integers i ∈ [1, n] such that none of the k least significant digits of the base-4 representation of i is zero. Then S(4^0), S(4^1), S(4^2), . . . is a monotone family of covers, which can be constructed in O(n) total time.
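
The base-4 description is easy to experiment with. The sketch below builds the sets by testing digits directly (not the O(n)-time construction of the lemma) and finds the shift of condition 3 by trying the three allowed values; the helper names are illustrative. Adding c · 4^k only changes digit k and higher, and at most two of the four possible values of c are ruled out by the two digits involved, so one of 0, 1, 2 always works.

```python
def in_cover(i, k):
    """True if none of the k least significant base-4 digits of i is zero."""
    for _ in range(k):
        if i % 4 == 0:
            return False
        i //= 4
    return True


def monotone_cover(n, k):
    """S(4^k) restricted to [1..n]."""
    return [i for i in range(1, n + 1) if in_cover(i, k)]


def shift_to_next(i, j, k):
    """h_{4^(k+1)}(i, j) for i, j in S(4^k): a shift in {0, 4^k, 2 * 4^k}."""
    for c in range(3):
        h = c * 4 ** k
        if in_cover(i + h, k + 1) and in_cover(j + h, k + 1):
            return h
    raise ValueError("i and j must both belong to S(4^k)")


n = 256
print([len(monotone_cover(n, k)) for k in range(4)])   # [256, 192, 144, 108]
```

For n = 256 the sizes shrink by exactly a factor of 3/4 per level, matching condition 4.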

19/23

SLIDE 31

Final result

Lemma

For t = 4^k, a sequence of q ShortLCE_t queries can be answered on-line in total time O(qk + n log∗ n) = O(q log t + n log∗ n).

Lemma

For t = Ω(log^6 n) we can preprocess a string of length n in O(n log∗ n) time, so that each CoarseLCE_t query can be answered in constant time.

Theorem

A sequence of q LCE queries for a string over a general ordered alphabet can be executed on-line in total time O(q log log n + n log∗ n) making O(q + n) symbol comparisons.

20/23

SLIDE 32

Summary

Main results

For a given string of length n over a general ordered alphabet:

◮ we can answer q LCE queries in O(q log log n + n log∗ n) time and O(q + n) comparisons,
◮ in particular, for q = O(n) this gives O(n log log n) time,
◮ all runs can be computed in O(n log log n) time making O(n) symbol comparisons.

21/23

SLIDE 33

Follow up

In this paper

All runs can be computed in: O(n log log n) time making O(n) symbol comparisons.

Recent result

All runs can be computed in O(nα(n)) time: "Near-Optimal Computation of Runs over General Alphabet via Non-Crossing LCE Queries", accepted to SPIRE 2016. This improvement is due to a faster data structure for restricted LCE queries, so Kosolobov's conjecture (O(n) time for n LCE queries) is still open.

22/23

SLIDE 34

Thank you for your attention!

23/23