slide-1
SLIDE 1

Minimal absent words in a sliding window & applications to on-line pattern matching

Maxime Crochemore (1,2), Alice Héliou (3), Gregory Kucherov (2), Laurent Mouchard (4), Solon Pissis (1), Yann Ramusat (5)

(1) Department of Informatics, King’s College London, London, UK
(2) CNRS & Université Paris-Est
(3) LIX, Ecole Polytechnique, CNRS, INRIA, Université Paris-Saclay
(4) University of Rouen, LITIS EA 4108, TIBS, Rouen
(5) DI ENS, CNRS, PSL Research University & INRIA Paris

11 September 2017 – FCT Bordeaux

Alice Héliou 1 / 25

slide-2
SLIDE 2

Minimal absent words

1. Minimal absent words: Definition, Applications, Computation
2. Minimal absent words over a sliding window

slide-4
SLIDE 4

Minimal absent words Definition

Definition: Minimal Absent Word. A minimal absent word of a sequence is an absent word whose proper factors (in particular its longest proper prefix and its longest proper suffix) all occur in the sequence. An upper bound on the number of minimal absent words is O(σn), where n is the length of the sequence and σ the size of the alphabet.

Crochemore et al. 1998, Mignosi et al. 2002

S = ACACAAGC (positions 0 to 7)

Minimal absent words of S: AAA, AAC, CACAC, CAG, CC, CG, GA, GCA, GG
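The definition can be checked directly on the example sequence. The Python sketch below is a brute-force enumeration written for illustration only, not one of the linear-time algorithms cited in the talk; the function names and the explicit `alphabet` argument are assumptions introduced here.

```python
def factors(s):
    """Return the set of all factors (substrings) of s, including the empty word."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def minimal_absent_words(s, alphabet):
    """Brute-force enumeration of the minimal absent words of s over `alphabet`.

    A word a+u+b is a minimal absent word when a+u and u+b both occur in s
    but a+u+b does not. Checking the longest proper prefix and suffix is
    enough, because every other proper factor is a factor of one of them.
    """
    occ = factors(s)
    # Absent single letters: their only proper factor is the empty word.
    maws = {a for a in alphabet if a not in occ}
    for au in occ:
        if not au:
            continue
        u = au[1:]
        for b in alphabet:
            # a+u occurs by construction; require u+b present and a+u+b absent.
            if u + b in occ and au + b not in occ:
                maws.add(au + b)
    return maws


S = "ACACAAGC"
print(sorted(minimal_absent_words(S, "ACG")))
```

With the alphabet restricted to the letters A, C and G used on the slide, this reproduces the nine words listed above. The enumeration is quadratic in the number of factors, so it is only suitable for small examples.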

slide-13
SLIDE 13

Minimal absent words Applications

Applications

Biology: three sequences (TTTCGCCCGACT, TACGCCCTATCG, CCTACGCGCAAA), found in Ebola genomes as coding for proteins, are absent from the human genome.

Bioinformatics: metric based on minimal absent words → phylogeny (Chairungsee et al., 2012; Crochemore et al., 2016).

Computer science: data compression using antidictionaries (Crochemore et al., 2000; Fiala and Holub, 2008).

slide-19
SLIDE 19

Minimal absent words Computation

Definition: Maximal repeated pair. A maximal repeated pair in a sequence S is a triple (i, j, w) such that:
  w occurs in S at positions i and j,
  S[i − 1] ≠ S[j − 1],
  S[i + |w|] ≠ S[j + |w|].

Lemma. If awb is a minimal absent word of S, then there exist positions i and j such that (i, j, w) is a maximal repeated pair of S.

[Figure: a sequence S and a minimal absent word A = awb of S; the longest prefix and the longest suffix of A both occur in S, which gives two occurrences of w, at positions i and j.]
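The lemma can be illustrated on S = ACACAAGC. The sketch below is a hypothetical helper, not code from the talk: for a minimal absent word awb it takes i as an occurrence of w preceded by a and j as an occurrence of w followed by b, then checks the two context conditions as inequalities (the usual definition of a maximal repeated pair). Positions are 0-based, a position outside S counts as a letter different from all others, and for words of length 2 the factor w is the empty word.

```python
def char_at(s, k):
    """Letter of s at position k, or None outside the sequence."""
    return s[k] if 0 <= k < len(s) else None


def witness(s, maw):
    """Return (i, j, w) for the minimal absent word maw = a+w+b of s.

    i is an occurrence of w preceded by a, j an occurrence of w followed by b.
    maw is assumed to have length at least 2.
    """
    a, w, b = maw[0], maw[1:-1], maw[-1]
    p = s.find(a + w)   # a+w is the longest prefix of maw, so it occurs in s
    j = s.find(w + b)   # w+b is the longest suffix of maw, so it occurs in s
    if p < 0 or j < 0:
        raise ValueError("not a minimal absent word of s")
    return p + 1, j, w


def is_maximal_repeated_pair(s, i, j, w):
    """Check both occurrences of w and the two context inequalities."""
    n = len(w)
    return (s[i:i + n] == w and s[j:j + n] == w
            and char_at(s, i - 1) != char_at(s, j - 1)
            and char_at(s, i + n) != char_at(s, j + n))


S = "ACACAAGC"
for maw in ["AAA", "AAC", "CACAC", "CAG", "CC", "CG", "GA", "GCA", "GG"]:
    i, j, w = witness(S, maw)
    print(maw, (i, j, w), is_maximal_repeated_pair(S, i, j, w))
```

The inequalities hold because an equality on either side would make awb occur in S: at position j the factor w is followed by b, so it cannot be preceded by a, and symmetrically at position i.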

slide-20
SLIDE 20

Minimal absent words Computation

[Figure: suffix tree of S = ACACAAGC# (positions 0 to 8, with # as end marker); each edge label carries its (start, end) positions in S. Successive builds visit the nodes of the tree one by one and then annotate them to derive the minimal absent words of S.]

Minimal absent words of S: AAA, AAC, CACAC, CAG, CC, CG, GA, GCA, GG

slide-48
SLIDE 48

Minimal absent words over a sliding window

1. Minimal absent words
2. Minimal absent words over a sliding window: Problem definition, Ukkonen construction algorithm of the suffix tree, The suffix tree for a sliding window, Algorithm for maws in a sliding window

slide-50
SLIDE 50

Minimal absent words over a sliding window Problem definition

Minimal absent words on a sliding window

Sequence S of size n, over a constant-size alphabet.
Sliding window of size m on S: S[i . . i + m − 1].
For every word x, we denote by M(x) its set of minimal absent words.

[Figure: the window S[i . . i + m − 1] inside S, labelled with M(S[i . . i + m − 1]).]

slide-54
SLIDE 54

Minimal absent words over a sliding window Problem definition

Minimal absent words on a sliding window

Lemma. $\sum_{i=0}^{n-m} |M(y[i\,.\,.\,i+m-1])|$ is upper bounded by O(nm).

→ We cannot output the set of minimal absent words for each factor of size m in time O(n).

Theorem. $\sum_{i=0}^{n-m-1} |M(y[i\,.\,.\,i+m-1]) \,\triangle\, M(y[i+1\,.\,.\,i+m])|$ is O(n).

→ We need a dynamic structure to go from one set to the next efficiently.
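The two quantities compared by the lemma and the theorem can be computed by brute force on a small input. The Python sketch below is an illustration under stated assumptions: the helper names `maws` and `window_totals` are hypothetical, the alphabet is passed explicitly, and every window is recomputed from scratch, which is exactly what the dynamic structure of the talk avoids.

```python
def maws(x, alphabet):
    """Minimal absent words of x (brute force, for illustration only)."""
    occ = {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}
    found = {a for a in alphabet if a not in occ}
    for au in occ - {""}:
        found |= {au + b for b in alphabet
                  if au[1:] + b in occ and au + b not in occ}
    return found


def window_totals(y, m, alphabet):
    """Return (sum of |M(window)|, sum of |M(window_i) △ M(window_{i+1})|)."""
    sets = [maws(y[i:i + m], alphabet) for i in range(len(y) - m + 1)]
    total_size = sum(len(s) for s in sets)
    total_change = sum(len(s ^ t) for s, t in zip(sets, sets[1:]))
    return total_size, total_change


# Two windows of size 8: ACACAAGC and CACAAGCA.
print(window_totals("ACACAAGCA", 8, "ACG"))
```

The first component is the size of the full output, the second the number of insertions and deletions needed to move from each set to the next; only the second is bounded by O(n) according to the theorem.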

slide-55
SLIDE 55

Minimal absent words over a sliding window Ukkonen construction algorithm of the suffix tree

Dynamic construction of the suffix tree:
  From right to left by Weiner, 1973.
  From left to right by McCreight, 1976.
  Ukkonen's algorithm, 1995, provides a more intuitive on-line construction.

slide-56
SLIDE 56

Minimal absent words over a sliding window Ukkonen construction algorithm of the suffix tree

Ukkonen construction algorithm of the suffix tree

[Figure: step-by-step construction of the suffix tree of S = ACACAAGC (positions 0 to 7), one letter at a time. The tree is shown for S = ∅, A, AC, ACA, ACAC, ACACA and ACACAA, then for the complete sequence; each edge label carries its (start, end) positions in S.]
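The letter-by-letter idea behind the construction can be sketched on the uncompressed suffix trie, which is simpler than the suffix tree shown on the slides. The Python code below is a minimal sketch under that assumption: it is the classic on-line suffix trie construction with suffix links, it uses quadratic space, and it is not Ukkonen's linear-time suffix tree algorithm. The class and function names are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # letter -> child node
        self.link = None     # suffix link: node of the word minus its first letter


def build_suffix_trie(text):
    """Build the suffix trie of text on-line, one letter at a time."""
    root = Node()
    top = root               # node spelling the whole text read so far
    for c in text:
        r, prev = top, None
        # Follow suffix links from the longest suffix; add c wherever it is missing.
        while r is not None and c not in r.children:
            new = Node()
            r.children[c] = new
            if prev is not None:
                prev.link = new
            prev = new
            r = r.link
        # Link the last created node to the first suffix that already had c.
        if prev is not None:
            prev.link = root if r is None else r.children[c]
        top = top.children[c]
    return root


def occurs(root, word):
    """True if word is a factor of the text the trie was built from."""
    node = root
    for c in word:
        node = node.children.get(c)
        if node is None:
            return False
    return True


def count_nodes(root):
    """Number of nodes, i.e. number of distinct factors including the empty word."""
    return 1 + sum(count_nodes(child) for child in root.children.values())


trie = build_suffix_trie("ACACAAGC")
print(occurs(trie, "CAAG"), occurs(trie, "GCA"), count_nodes(trie))
```

Each node stands for one distinct factor, so a minimal absent word such as GCA corresponds to a missing child below the node of GC. The suffix tree stores the same information in linear space by merging unary paths into labelled edges.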

slide-67
SLIDE 67

Minimal absent words over a sliding window The suffix tree for a sliding window

The suffix tree for a sliding window

The suffix tree for a sliding window (Senft, 2005):
  Remove the leftmost letter.
  Update edge labels.

slide-68
SLIDE 68

Minimal absent words over a sliding window The suffix tree for a sliding window

[Figure: suffix tree of S = ACACAAGC (positions 0 to 7); each edge label carries its (start, end) positions in S.]

slide-72
SLIDE 72

Minimal absent words over a sliding window The suffix tree for a sliding window

Minimal absent words over a sliding window

We have adapted Senft's algorithm to compute minimal absent words:
  Add to the tree the information of the BWT (the set of letters that precede each factor).
  Add the set of minimal absent words.

The mapping f : M(z) → Σ(z) × V(z), defined by f(aub) = (a, v_ub), where a ∈ Σ and v_ub is the node corresponding to ub, is an injection.
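The role of the two annotations can be sketched without a suffix tree. In the Python code below, which is an illustration and not the authors' implementation, every factor of z plays the part of a node (the talk uses the nodes V(z) of the suffix tree instead), `B[u]` is the set of letters that precede u in z, and `M[ub]` is the set of letters a such that aub is a minimal absent word. All names are hypothetical.

```python
def preceding_letters(z):
    """B[u] = set of letters a such that a+u occurs in z, for every factor u of z."""
    B = {}
    for i in range(len(z) + 1):
        for j in range(i, len(z) + 1):
            preceding = B.setdefault(z[i:j], set())
            if i > 0:
                preceding.add(z[i - 1])
    return B


def maws_by_node(z):
    """M[ub] = letters a such that a+ub is a minimal absent word of z.

    a+u+b is minimal absent exactly when a precedes u somewhere in z
    but never precedes u+b, i.e. a is in B[u] and not in B[u+b].
    """
    B = preceding_letters(z)
    M = {}
    for ub in B:
        if ub:  # the empty word carries no minimal absent word
            M[ub] = B[ub[:-1]] - B[ub]
    return M


z0 = "ACACAAGC"
M = maws_by_node(z0)
pairs = [(a, ub) for ub, letters in M.items() for a in letters]
print(sorted(a + ub for a, ub in pairs))
```

Each minimal absent word aub is stored exactly once, as the letter a in the set attached to ub, which is the injection f in miniature. Letters of the alphabet that do not occur in z at all are minimal absent words of length 1 and are not covered by this sketch.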

slide-73
SLIDE 73

Minimal absent words over a sliding window Algorithm for maws in a sliding window

[Figure: suffix tree of z0; each node is annotated with a set B (letters preceding the corresponding factor) and a set M (letters that turn it into a minimal absent word).]

z0 = ACACAAGC (positions 0 to 7)

M(z0) = {AAA, GA, AAC, CACAC, CAG, CC, GCA, CG, GG}

slide-74
SLIDE 74

Minimal absent words over a sliding window Algorithm for maws in a sliding window

[Figure: suffix tree of z0 · A, with the sets B and M of each node updated after the letter A is appended.]

z0 · A = ACACAAGCA (positions 0 to 8)

M(z0 · A) = {AAA, GA, AAC, CACAC, CAG, CC, GCAA, GCAC, CG, GG}

Appending A removes GCA, which now occurs in the sequence, and adds GCAA and GCAC.

slide-75
SLIDE 75

Minimal absent words over a sliding window Algorithm for maws in a sliding window

[Figure: suffix tree of A · z1, with the sets B and M of each node, before the leftmost letter A is removed.]

A · z1 = ACACAAGCA (positions 0 to 8)

M(A · z1) \ M(z1) = {CACAC}

slide-76
SLIDE 76

Minimal absent words over a sliding window Algorithm for maws in a sliding window

[Figure: suffix tree of z1, with the sets B and M of each node, after the leftmost letter A has been removed.]

z1 = CACAAGCA (positions 1 to 8)

M(z1) = {AAA, GA, AAC, CAG, CC, GCAA, ACAC, GCAC, CG, GG}
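The two phases of one window shift (append a letter on the right, then delete the leftmost letter) can be replayed by brute force on the example. The Python sketch below recomputes each set from scratch, so it only illustrates which words enter and leave; the talk's algorithm obtains the same changes by updating the annotated suffix tree. The names `maws` and `slide_window` and the default alphabet are assumptions.

```python
def maws(x, alphabet="ACG"):
    """Minimal absent words of x (brute force, for illustration only)."""
    occ = {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}
    found = {a for a in alphabet if a not in occ}
    for au in occ - {""}:
        found |= {au + b for b in alphabet
                  if au[1:] + b in occ and au + b not in occ}
    return found


def slide_window(window, letter, alphabet="ACG"):
    """Append `letter`, then drop the leftmost letter; report the changes to M."""
    before = maws(window, alphabet)
    extended = maws(window + letter, alphabet)
    after = maws((window + letter)[1:], alphabet)
    return {
        "added_by_append": extended - before,
        "removed_by_append": before - extended,
        "added_by_deletion": after - extended,
        "removed_by_deletion": extended - after,
    }


for step, words in slide_window("ACACAAGC", "A").items():
    print(step, sorted(words))
```

On z0 = ACACAAGC followed by the letter A, the append phase adds GCAA and GCAC and removes GCA, and the deletion phase adds ACAC and removes CACAC, for five changes in total between M(z0) and M(z1).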

slide-79
SLIDE 79

Minimal absent words over a sliding window Algorithm for maws in a sliding window

Applications to on-line pattern matching

Minimal absent words over a sliding window: for a sequence S of size n and a window of size m, we compute M(S[i . . i + m − 1]) for all i, 0 ≤ i ≤ n − m, in time O(n) and in space O(m).

Length Weighted Index (LWI), introduced by Chairungsee in 2012: a metric based on the symmetric difference of the sets of minimal absent words,

$\mathrm{LWI}(M(x), M(y)) = \sum_{w \in M(x) \triangle M(y)} \frac{1}{|w|^2}$.

→ We obtain the position of minimal distance.
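The Length Weighted Index is straightforward to evaluate once the two sets are known. The Python sketch below takes the sets as plain inputs (here the sets M(z0) and M(z1) of the running example) and uses exact fractions; the function name `lwi` is an assumption.

```python
from fractions import Fraction


def lwi(mx, my):
    """Length Weighted Index: sum of 1/|w|^2 over the symmetric difference."""
    return sum((Fraction(1, len(w) ** 2) for w in mx ^ my), Fraction(0))


M_z0 = {"AAA", "GA", "AAC", "CACAC", "CAG", "CC", "GCA", "CG", "GG"}
M_z1 = {"AAA", "GA", "AAC", "CAG", "CC", "GCAA", "ACAC", "GCAC", "CG", "GG"}

print(lwi(M_z0, M_z1))
print(float(lwi(M_z0, M_z1)))
```

Because only the symmetric difference contributes, the index between a pattern and successive windows can be maintained from the words that enter and leave the set at each shift, and short words weigh more than long ones.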

slide-80
SLIDE 80

Future work

Implement the algorithm over a sliding window.
Compare the results in bioinformatics for read alignment.

slide-82
SLIDE 82

Acknowledgements

LOB: Hubert Becker, Hannu Myllykallio, Roxane Lestini, Yoann Collien and all the others
LIX: Mireille Régnier, Yann Ponty, Philippe Chassignet, Amélie Héliou, Afaf Saaidi, Juraj Michalik and all the others
Université de Rouen: Laurent Mouchard
King’s College London: Solon Pissis, Carl Barton
University of Helsinki: Simon Puglisi
Université Paris-Est: Maxime Crochemore, Gregory Kucherov
ENS Paris: Yann Ramusat

slide-83
SLIDE 83

Acknowledgements

Algorithms to compute minimal absent words

References | Structures | Drawbacks
Crochemore et al., 1998 | Suffix automata | Expensive in space
Belazzougui et al., 2013 | Bidirectional BWT | No implementation available
Ota et al., 2014 | Suffix tree, dynamic approach | Quadratic in time
Barton et al., 2014 | Suffix array | Linear, fastest available
Belazzougui et al., 2015 | BWT & complementary structures | No implementation