Time-Space Trade-Offs for the Longest Common Substring Problem



slide-1
SLIDE 1

Time-Space Trade-Offs for the Longest Common Substring Problem

Tatiana Starikovskaya (1) and Hjalte Wedel Vildhøj (2)

(1) Moscow State University, Department of Mechanics and Mathematics, tat.starikovskaya@gmail.com
(2) Technical University of Denmark, DTU Compute, hwv@hwv.dk

CPM 2013, Bad Herrenalb, Germany June 17, 2013

1 / 27

slide-4
SLIDE 4

The Longest Common Substring Problem

Definition

Problem: Given strings T1, T2, ..., Tm of total length n, compute the longest substring that appears in at least d of the strings, where 2 ≤ d ≤ m.

Example

pos   1  2  3  4  5  6  7  8  9 10 11 12 13
T1 =  a  g  g  c  t  a  g  c  t  a  c  c  t
T2 =  a  c  a  c  c  t  a  c  c  c  t  a  g
T3 =  a  c  t  a  g  t  a  a  t  g  c  a  t

d = 3 ⇒ LCS = c t a g
d = 2 ⇒ LCS = c t a c c
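The definition can be checked on the example with a brute-force sketch. This is only an illustration of the problem statement, not the method of the talk: it enumerates every substring, so it is far too slow and memory-hungry for real inputs. The function name `longest_common_substring` is an assumption made for this example.

```python
def longest_common_substring(strings, d):
    """Return the longest substring that occurs in at least d of the strings."""
    # Collect every substring of every input string (quadratic per string).
    candidates = set()
    for s in strings:
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                candidates.add(s[i:j])

    best = ""
    for c in candidates:
        hits = sum(1 for s in strings if c in s)
        # Prefer longer substrings; break ties lexicographically for determinism.
        if hits >= d and (len(c) > len(best) or (len(c) == len(best) and c < best)):
            best = c
    return best


T1 = "aggctagctacct"
T2 = "acacctaccctag"
T3 = "actagtaatgcat"

print(longest_common_substring([T1, T2, T3], 3))  # ctag
print(longest_common_substring([T1, T2, T3], 2))  # ctacc
```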

4 / 27

slide-5
SLIDE 5

The Longest Common Substring Problem

A patented solution

5 / 27

slide-8
SLIDE 8

A Textbook Solution

[Figure: generalized suffix tree of T1 and T2]

  • Build Generalized Suffix Tree

Space: Θ(n)

T1 = a g g c t a g c t a c c t
T2 = a c a c c t a c c c t a g
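For two strings, a linear-space baseline of this kind can be sketched in a few lines. Note the swap: the sketch below uses a suffix automaton of T1 instead of a generalized suffix tree, because it is shorter to write down; both need Θ(n) space. The function name `lcs_linear_space` is an assumption made for this example.

```python
def lcs_linear_space(a, b):
    """Longest common substring of a and b via a suffix automaton of a."""
    # Automaton state i: transitions nxt[i], suffix link link[i], longest length length[i].
    nxt, link, length = [{}], [-1], [0]
    last = 0
    for ch in a:
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split state q by cloning it with the shorter length.
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur

    # Stream b through the automaton, tracking the current match length.
    state, cur_len, best_len, best_end = 0, 0, 0, 0
    for i, ch in enumerate(b):
        while state != 0 and ch not in nxt[state]:
            state = link[state]
            cur_len = length[state]
        if ch in nxt[state]:
            state = nxt[state][ch]
            cur_len += 1
        else:
            cur_len = 0
        if cur_len > best_len:
            best_len, best_end = cur_len, i + 1
    return b[best_end - best_len:best_end]


print(lcs_linear_space("aggctagctacct", "acacctaccctag"))  # ctacc
```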

8 / 27

slide-9
SLIDE 9

Our Results

Question

Can the LCS problem be solved (deterministically) in O(n^{1−ε}) space and O(n^{1+ε}) time for 0 ≤ ε ≤ 1?

Our Answer

Yes if 0 ≤ ε ≤ 1/3. More precisely,

For two strings (d = m = 2), the problem can be solved in:
Time: O(n^{1+ε})    Space: O(n^{1−ε})    for any 0 < ε ≤ 1/3.

In the general case (2 ≤ d ≤ m), the problem can be solved in:
Time: O(n^{1+ε} log^2 n (d log^2 n + d^2))    Space: O(n^{1−ε})    for any 0 ≤ ε < 1/3.

9 / 27

slide-11
SLIDE 11

A Solution for Two Strings

When the LCS is long

Idea: Preprocess a sparse sample of the n suffixes for LCP queries.

pos   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
T  =  a  g  g  c  t  a  g  c  t  a  c  c  t $1  a  c  a  c  c  t  a  c  c  c  t  a  g $2

[Figure: copies of DCτ marked along T]

Difference Covers

A difference cover modulo τ is a set of integers DCτ ⊆ {0, 1, ..., τ − 1} such that for any distance d ∈ {0, 1, ..., τ − 1}, DCτ contains two elements separated by distance d modulo τ.

Ex: The set DCτ = {1, 2, 4} is a difference cover modulo 5.

d       0     1     2     3     4
i, j    1, 1  2, 1  1, 4  4, 1  1, 2
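The difference-cover property is easy to verify mechanically. The sketch below checks it by collecting all pairwise differences modulo τ, and lists the sampled positions of a text of length n under the assumption that a 1-based position p is sampled when p mod τ lies in DCτ. Both function names are assumptions made for this example.

```python
def is_difference_cover(dc, tau):
    """True if every distance 0..tau-1 occurs as (i - j) mod tau with i, j in dc."""
    diffs = {(i - j) % tau for i in dc for j in dc}
    return diffs == set(range(tau))


def sampled_positions(n, dc, tau):
    """1-based positions p in 1..n with p mod tau in dc (assumed sampling rule)."""
    return [p for p in range(1, n + 1) if p % tau in dc]


print(is_difference_cover({1, 2, 4}, 5))  # True
print(is_difference_cover({1, 2}, 5))     # False: distances 2 and 3 are missing
print(sampled_positions(28, {1, 2, 4}, 5))
```

For the 28-character example text this samples 17 of the 28 positions.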

11 / 27

slide-16
SLIDE 16

A Solution for Two Strings

When the LCS is long

Idea: Preprocess a sparse sample of the n suffixes for LCP queries.

pos   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
T  =  a  g  g  c  t  a  g  c  t  a  c  c  t $1  a  c  a  c  c  t  a  c  c  c  t  a  g $2

RB(11) = (g c t a c)^R = c a t c g

◮ Number of sampled suffixes: O((n/τ) · |DCτ|) = O(n/√τ).
◮ The LCS is the LCP of two suffixes.
◮ If |LCS| ≥ τ, one of the first τ characters of the LCS is sampled in both strings.
◮ Hence the LCS corresponds to a pair (p1*, p2*) maximizing

    lcp(RB(p1), RB(p2)) + lcp(T[p1..], T[p2..]) − 1
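The maximization can be tried out on the example by brute force over all sampled pairs. This sketch makes several assumptions that the slide only hints at: τ = 5 and DCτ = {1, 2, 4}; a position p is sampled when p mod τ is in DCτ; RB(p) is the reversed block of up to τ characters ending at p (which matches RB(11) above); and '#' and '$' stand in for the sentinels $1 and $2. It checks every pair, which is exactly the quadratic cost the talk goes on to avoid.

```python
T = "aggctagctacct#acacctaccctag$"  # '#' and '$' stand in for $1 and $2
TAU = 5                             # assumed sampling period
DC = {1, 2, 4}                      # difference cover modulo 5


def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def rb(p):
    """Reversed block of up to TAU characters ending at 1-based position p."""
    return T[max(0, p - TAU):p][::-1]


def suffix(p):
    """Suffix of T starting at 1-based position p."""
    return T[p - 1:]


def best_sampled_pair():
    """Return (score, p1, p2) maximizing the slide's expression over sampled pairs."""
    n1 = T.index("#")  # length of T1
    sampled = [p for p in range(1, len(T) + 1) if p % TAU in DC]
    in_t1 = [p for p in sampled if p <= n1]
    in_t2 = [p for p in sampled if p > n1 + 1]
    best = (0, None, None)
    for p1 in in_t1:
        for p2 in in_t2:
            score = lcp(rb(p1), rb(p2)) + lcp(suffix(p1), suffix(p2)) - 1
            if score > best[0]:
                best = (score, p1, p2)
    return best


print(rb(11))               # catcg
print(best_sampled_pair())  # (5, 11, 22)
```

The pair (11, 22) matches 4 characters backwards and 2 forwards, giving 4 + 2 − 1 = 5, the length of the LCS c t a c c.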

16 / 27

slide-17
SLIDE 17

A Solution for Two Strings

When the LCS is long

How to compute the pair (p1*, p2*) faster than O(n^2/τ)?

pos   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
T  =  a  g  g  c  t  a  g  c  t  a  c  c  t $1  a  c  a  c  c  t  a  c  c  c  t  a  g $2

SAτ     = [14, 21, 17, 26, 6, 1, 16, 22, 11, 12, 19, 24, 4, 27, 7, 2, 9]
LCPτ    = [0, 3, 1, 2, 2, 0, 1, 2, 1, 2, 3, 4, 0, 1, 1, 0]
SA^R_τ  = [14, 1, 17, 21, 26, 6, 16, 22, 11, 19, 12, 24, 4, 2, 27, 7, 9]
LCP^R_τ = [0, 1, 1, 4, 3, 0, 2, 4, 1, 3, 2, 1, 0, 2, 4, 0]

Main observation: lcp(T[p1*..], T[p2*..]) ∈ [ℓmax − τ + 1; ℓmax], so we can ignore all pairs with lcp values smaller than ℓmax − τ + 1.
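The four arrays can be reproduced from the example text with a naive sketch. It assumes τ = 5, DCτ = {1, 2, 4}, sampling of positions p with p mod τ in DCτ, '#' and '$' as stand-ins for $1 and $2 (both sorting before the letters), and that the reversed arrays are ordered by the reversed block of up to τ characters ending at p. It sorts by materialized strings, so it does not respect the space bounds of the talk; it only shows what the arrays contain.

```python
T = "aggctagctacct#acacctaccctag$"  # '#' and '$' stand in for $1 and $2
TAU = 5                             # assumed sampling period
DC = {1, 2, 4}                      # difference cover modulo 5


def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def suffix(p):
    """Suffix of T starting at 1-based position p."""
    return T[p - 1:]


def rev_block(p):
    """Reversed block of up to TAU characters ending at 1-based position p."""
    return T[max(0, p - TAU):p][::-1]


def sparse_arrays(key):
    """Sampled positions sorted by key, plus LCPs of neighbours in that order."""
    sampled = [p for p in range(1, len(T) + 1) if p % TAU in DC]
    order = sorted(sampled, key=key)
    lcps = [lcp(key(order[i]), key(order[i + 1])) for i in range(len(order) - 1)]
    return order, lcps


sa, lcp_values = sparse_arrays(suffix)
sa_rev, lcp_rev_values = sparse_arrays(rev_block)
print(sa)
print(lcp_values)
print(sa_rev)
print(lcp_rev_values)
```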

17 / 27

slide-21
SLIDE 21

A Solution for Two Strings

When the LCS is long

How to compute the pair (p1*, p2*) faster than O(n^2/τ)?

pos   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
T  =  a  g  g  c  t  a  g  c  t  a  c  c  t $1  a  c  a  c  c  t  a  c  c  c  t  a  g $2

SAτ  = [14, 21, 17, 26, 6, 1, 16, 22, 11, 12, 19, 24, 4, 27, 7, 2, 9]
LCPτ = [0, 3, 1, 2, 2, 0, 1, 2, 1, 2, 3, 4, 0, 1, 1, 0]

Analysis (sketch): O(τ) rounds, each using O(n/√τ) time and space:

Time: O(n√τ)    Space: O(n/√τ)

With τ = n^{2ε}: Time: O(n^{1+ε})    Space: O(n^{1−ε})    for 0 < ε ≤ 1/2.
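The substitution behind these bounds can be written out as a short derivation. It assumes, as the sketch suggests, that the space of one round is reused by the next, so the total space equals the space of a single round.

```latex
\begin{align*}
\text{Time}  &= O(\tau) \cdot O\!\left(\frac{n}{\sqrt{\tau}}\right)
              = O\!\left(n\sqrt{\tau}\right), &
\text{Space} &= O\!\left(\frac{n}{\sqrt{\tau}}\right), \\
\tau = n^{2\varepsilon}
  \;&\Longrightarrow\; \sqrt{\tau} = n^{\varepsilon}, \\
\text{Time}  &= O\!\left(n \cdot n^{\varepsilon}\right) = O\!\left(n^{1+\varepsilon}\right), &
\text{Space} &= O\!\left(\frac{n}{n^{\varepsilon}}\right) = O\!\left(n^{1-\varepsilon}\right).
\end{align*}
```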

21 / 27

slide-23
SLIDE 23

A Solution for Two Strings

When the LCS is shorter than τ

[Figure: T1 covered by substrings of length 2τ, grouped into batches Si]

◮ The LCS is a substring of one of the strings of length 2τ.
◮ Build the generalized suffix tree for a batch Si of strings of total length O(n/√τ).
◮ Traverse the suffix tree with T2 in O(n) time to find the node of greatest string depth.
◮ Repeat for all O(√τ) batches.

Time: O(n√τ)    Space: O(n/√τ)

With τ = n^{2ε}: Time: O(n^{1+ε})    Space: O(n^{1−ε})    for 0 ≤ ε ≤ 1/3.

The upper bound on ε comes from the requirement τ = O(n/√τ), that is, τ = O(n^{2/3}).
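The batching idea can be sketched as follows. This is a simplified stand-in, not the algorithm of the talk: it assumes the length-2τ strings are windows of T1 starting every τ positions, and it replaces the generalized suffix tree of each batch by a plain dynamic-programming match of each window against T2, so it does not achieve the stated time bound. The names `longest_match`, `lcs_short_case`, and `windows_per_batch` are assumptions made for this example.

```python
def longest_match(w, t):
    """Longest common substring of w and t, using one DP row of size |t| + 1."""
    prev = [0] * (len(t) + 1)
    best_len, best_end = 0, 0
    for i in range(1, len(w) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if w[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return w[best_end - best_len:best_end]


def lcs_short_case(t1, t2, tau, windows_per_batch):
    """Find a common substring of length <= tau by scanning windows of t1 in batches."""
    # Windows of length 2*tau starting every tau positions: any substring of t1
    # of length <= tau lies entirely inside at least one window.
    windows = [t1[i:i + 2 * tau] for i in range(0, len(t1), tau)]
    best = ""
    for start in range(0, len(windows), windows_per_batch):
        batch = windows[start:start + windows_per_batch]  # only this batch is "in memory"
        for w in batch:
            match = longest_match(w, t2)
            if len(match) > len(best):
                best = match
    return best


print(lcs_short_case("aggctagctacct", "acacctaccctag", tau=5, windows_per_batch=2))  # ctacc
```

Processing one batch at a time is what keeps the working space proportional to the batch size rather than to the whole of T1.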

23 / 27

slide-26
SLIDE 26

A General Solution for m Strings

When the LCS is long

Challenge: The difference cover property only holds for pairs.

[Figure: T = T1 T2 T3 T4 with three occurrences of the LCS marked]

Algorithm: Extract the maximum head until we have d − 1 distinct strings. Repeat everything for all n possible positions of the LCS.

Computing the next element in a list can be done in O(log n (log^2 n + d)). Extracting it costs O(√τ). At most O(d√τ) extractions.

Time: O(n√τ log^2 n (log^2 n + d))    Space: O(n/√τ)

26 / 27

slide-27
SLIDE 27

Conclusion

Results

For two strings (d = m = 2), the LCS problem can be solved in:
Time: O(n^{1+ε})    Space: O(n^{1−ε})    for any 0 < ε ≤ 1/3.

In the general case (2 ≤ d ≤ m), the LCS problem can be solved in:
Time: O(n^{1+ε} log^2 n (d log^2 n + d^2))    Space: O(n^{1−ε})    for any 0 ≤ ε < 1/3.

Open Problems

Can the generalized solution be improved? Can the trade-off interval of our solutions be extended to 0 ≤ ε ≤ 1/2? Can the problem be solved in O(n^{1+ε}) time and O(n^{1−ε}) space for any 0 ≤ ε ≤ 1?

27 / 27