
SLIDE 1

CPSC 490 Finale

Heavy-Light Decomposition and Suffix Array

Lucca Siaudzionis and Jack Spalding-Jamieson 2020/04/07

University of British Columbia

SLIDE 2

Announcements

  • First: congrats on [basically] finishing the course!
  • Last reminder that A5 is due Sunday, April 19th. It will not be extended past this day, so please work on it and all the upsolves you want to complete early (before your finals!).

SLIDE 3

Heavy-Light Decomposition: The Goal

Input: A tree (not necessarily binary), with an integer stored at each vertex. We want to handle a bunch of requests online:

  • Updates: Add x to all the vertices on the path between a and b.
  • Queries: What is the sum along the path from a to b?

For now, we can assume that b is always the root (it will be easy to generalize our answer).

Output: The answer to each query.


Figure 1: A tree and a highlighted path with sum 9.

SLIDE 4

Heavy-Light Decomposition: The Worst-Case

Recall why this may be hard: What if we had nodes very deep in the tree? Then the path up may be very long. To add even more complexity, we could also have a broomstick:

Figure 2: A small broomstick.

SLIDE 5

Heavy-Light Decomposition: Decomposition Plan

Our end-goal is going to be to decompose our tree into paths with a special property:

Figure 3: The heavy-light decomposition of some tree.

The special property is that any node has O(log n) distinct paths between it and the root.

SLIDE 6

Heavy-Light Decomposition: Using Paths

The result is: O(log n) distinct heavy paths between each node and the root. Additional properties:

  • Every edge can be considered to be a "light edge" or a "heavy edge".
  • The number of light edges from any vertex to the root is O(log n).
  • Every vertex has at most one heavy edge to its children (it is possible to create a decomposition in which every internal vertex has exactly one heavy edge to a child, too).

Why is this useful? A path is a segment! So we can turn each path into a segment tree, and do range queries/updates on the O(log n) segments up to the root from any node. This means our queries will generally run in O(log² n) time. In practice, we can use one segment tree for the entire tree, generated with a DFS that visits heavy paths first.

SLIDE 7

Heavy-Light Decomposition: How To

The algorithm to compute the heavy-light decomposition is quite simple. For each vertex:

  • Recursively compute the size of each subtree.
  • Compute the sum of the subtree sizes.
  • If any subtree has at least half the total number of nodes among them all, extend or create a heavy edge from the current node.

Alternative definition: Create a heavy edge to the child with the largest size, even if it is not more than half of the total. In our implementation, we will actually use the alternative definition, since it’s easier to implement.

SLIDE 8

Heavy-Light Decomposition: Property Proof Idea

We need to prove something about the heavy-light decomposition:

  • The number of light edges from any vertex to the root is O(log n).

More specifically, we show that there are at most log₂ n such light edges. This will then imply that there are O(log n) distinct heavy paths between each node and the root.

Proof idea: Starting at some vertex v, iteratively move up to the parent. Every time we take a light edge, the size of the subtree we are in must at least double, by the choice of the heavy edge (EXERCISE: check this for both definitions of the heavy edge). This doubling can only happen up to log₂ n times.
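The doubling step can be written out explicitly (a sketch using the alternative definition, where the heavy child is a largest child):

```latex
% Let p be the parent of v, with the edge (v, p) light. Then p has a heavy
% child h with size(h) >= size(v) and h != v, so
\[
  \operatorname{size}(p) \;\ge\; \operatorname{size}(v) + \operatorname{size}(h) + 1 \;>\; 2\,\operatorname{size}(v).
\]
% Hence after taking t light edges on the way to the root, the size of the
% current subtree exceeds 2^t, and since all sizes are bounded by n,
\[
  2^t \le n \quad\Longrightarrow\quad t \le \log_2 n .
\]
```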

SLIDE 9

Heavy-Light Decomposition: Implementation

The implementation is actually extremely simple. Here we will always set the first child to be the heavy child, by swapping:

void compute_size(vector<int>& size, int v, vector<vector<int>>& adj) {
    size[v] = 1;
    for (int& u : adj[v]) {
        compute_size(size, u, adj);
        size[v] += size[u];
        // make the heaviest child the first child
        if (size[u] >= size[adj[v][0]]) swap(u, adj[v][0]);
    }
}

// next stores the next vertex up that is the root or is connected by a light
// edge to its parent
// next[root] must be initialized to root
void hld(int v, vector<vector<int>>& adj, vector<int>& next) {
    for (int u : adj[v]) {
        next[u] = (u == adj[v][0] ? next[v] : u);
        hld(u, adj, next);
    }
}
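As a sanity check, the two routines above can be exercised on a small hypothetical tree (stored as child adjacency lists, rooted at vertex 0). The helper name heads_of_example is mine, not part of the slides; after the DFS, next[] marks the top vertex of each heavy path.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Copies of the slide's routines (adj[v] lists the children of v).
void compute_size(vector<int>& size_, int v, vector<vector<int>>& adj) {
    size_[v] = 1;
    for (int& u : adj[v]) {
        compute_size(size_, u, adj);
        size_[v] += size_[u];
        // make the heaviest child the first child
        if (size_[u] >= size_[adj[v][0]]) swap(u, adj[v][0]);
    }
}
void hld(int v, vector<vector<int>>& adj, vector<int>& next_) {
    for (int u : adj[v]) {
        next_[u] = (u == adj[v][0] ? next_[v] : u);
        hld(u, adj, next_);
    }
}

// Build the heavy-path heads for a small example tree:
//        0
//       / \
//      1   2
//     / \
//    3   4
//    |
//    5
vector<int> heads_of_example() {
    int n = 6;
    vector<vector<int>> adj(n);
    adj[0] = {1, 2};
    adj[1] = {3, 4};
    adj[3] = {5};
    vector<int> size_(n), next_(n);
    compute_size(size_, 0, adj);
    next_[0] = 0;  // next[root] must be initialized to root
    hld(0, adj, next_);
    return next_;  // next_[v] = topmost vertex of v's heavy path
}
```

Here the heavy path is 0-1-3-5, and the light edges go into 2 and 4, so those two vertices are the heads of their own (length-one) paths.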

SLIDE 10

Heavy-Light Decomposition: Implementation Usage

// sg initialized to have length n
// dfs[v] = preorder index of v from a DFS that visits heavy children first,
//          so each heavy path occupies a contiguous range
// query/update the path up to the root from v
// assume parent[root] == -1
void query(int v, vector<int>& next, vector<int>& parent,
           segtree& sg, int q, vector<int>& dfs) {
    while (v != -1) {
        int u = next[v];
        sg.query(dfs[u], dfs[v], q); // add q to the range [dfs[u], dfs[v]]
        v = parent[u];
    }
}

This function could be used to query arbitrary paths in the tree using inverse operations and LCA (we’ve seen this before in the RQ unit). A better way to do these (that would work for min/max) would be to do only two queries up from each of the nodes, halting at the LCA (which we can do by recording depths, and finding the depth of the LCA).
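The "two climbs halting at the LCA" idea can be sketched roughly as follows. This is my own sketch, not the course's code: it only decomposes the path a–b into heavy-path segments (the segment-tree calls are left out), assuming next[], parent[], and depth[] were filled in by the HLD DFS, with parent[root] == -1 and depth[root] == 0.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Decompose the path a--b into O(log n) heavy-path segments {top, bottom}.
vector<pair<int,int>> path_segments(int a, int b, const vector<int>& next_,
                                    const vector<int>& parent,
                                    const vector<int>& depth) {
    vector<pair<int,int>> segs;
    // Climb from whichever endpoint has the deeper heavy-path head,
    // until both endpoints lie on the same heavy path.
    while (next_[a] != next_[b]) {
        if (depth[next_[a]] < depth[next_[b]]) swap(a, b);
        segs.push_back({next_[a], a});  // whole segment from a up to its head
        a = parent[next_[a]];           // take the light edge above the head
    }
    // Same heavy path: one final segment; the shallower endpoint is the LCA.
    if (depth[a] > depth[b]) swap(a, b);
    segs.push_back({a, b});
    return segs;
}

// Hypothetical tree from the earlier example: heavy path 0-1-3-5,
// light children 2 (of 0) and 4 (of 1).
vector<pair<int,int>> example_segments() {
    vector<int> next_{0, 0, 2, 0, 4, 0};
    vector<int> parent{-1, 0, 0, 1, 1, 3};
    vector<int> depth{0, 1, 1, 2, 2, 3};
    return path_segments(5, 4, next_, parent, depth);
}
```

For the path from 5 to 4 this yields the segment {4,4} plus the segment {1,5}, whose top endpoint 1 is exactly the LCA of 5 and 4.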

SLIDE 11

Heavy-Light Decomposition: What You Can Use This For

Now that we’ve turned all trees into a decomposition of segments, you can do many more things:

  • Compute min, max, argmin, argmax, product, sum modulo, product modulo a prime, etc. along paths in the tree (anything we could do easily with segment trees).
  • Query inclusion in tree paths (store sets in the segment trees) in O(log³ n) time.
  • Combine subtree and path queries (keep track of an Euler tour within the same DFS as HLD).
  • Go learn link-cut trees (uses HLD only as part of the proof).

SLIDE 12

Suffix Arrays

Let’s start with a motivational problem.

SLIDE 13

LCS or LCS?

We have solved the Longest Common Subsequence problem with DP. What about Longest Common Substring?

SLIDE 14

Longest Common Substring

Very slow method: run Aho-Corasick to find all substrings of S1 that appear in S2 → at least O(m² + n).

Observation: we only need to match suffixes of S1, because Aho-Corasick can tell you the longest prefix match of any suffix, which is good enough. ⇒ All we need to do is figure out how to build a suffix trie of S1 with all the extra arrows for Aho-Corasick, and then run S2 through it.

A suffix tree can be built in O(n), but the algorithm is quite complicated. Instead we will build a simpler data structure – a Suffix Array.

SLIDE 15

Suffix Array – Definition

A Suffix Array is the representation of all the suffixes of a word S, sorted in lexicographical order.

0: BANANA$          $        (6)
1: ANANA$           A$       (5)
2: NANA$            ANA$     (3)
3: ANA$      →      ANANA$   (1)
4: NA$              BANANA$  (0)
5: A$               NA$      (4)
6: $                NANA$    (2)
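The definition translates directly into a slow but obviously correct construction: sort the suffix start positions by comparing the suffixes character by character, O(n² log n) in the worst case. This is handy as a reference implementation against the faster construction coming up; the function name is mine.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Naive suffix array: sort suffix starting indices by lexicographic order
// of the suffixes themselves. Fine for checking small cases.
vector<int> naive_suffix_array(const string& s) {
    vector<int> sa(s.size());
    iota(sa.begin(), sa.end(), 0);  // sa = {0, 1, ..., n-1}
    sort(sa.begin(), sa.end(), [&](int i, int j) {
        // compare the suffixes s[i..] and s[j..] directly
        return s.compare(i, s.size() - i, s, j, s.size() - j) < 0;
    });
    return sa;
}
```

For "BANANA$" this produces exactly the right-hand column above: {6, 5, 3, 1, 0, 4, 2} (note that '$' sorts before the letters in ASCII).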

SLIDE 16

Longest Common Substring with Suffix Array

Main idea: if we have a sorted list of all suffixes of both S1 and S2, then we can just scan through the list and compare adjacent suffixes. What we will do:

  • Construct a Suffix Array of a string in O(n log² n)
  • O(n log n) if you use radix sort
  • O(n) algorithms exist, but are more complicated
  • At the same time, construct a DP table so that we can find the Longest Common Prefix of any two suffixes in O(log n)

SLIDE 17

Suffix Array Construction

To avoid O(n²) memory, store suffixes by their starting index. Full comparison of 2 suffixes is slow – possibly O(string length) – but comparing only the first character is fast! ⇒ Let’s try sorting suffixes by first character.

SLIDE 18

Suffix Array Construction – Rank

  • Define rank as the “rank” of a string when sorted by something, not breaking ties.
  • The rank must be defined such that rank(a) < rank(b) iff a comes before b in the sorting.

SLIDE 19

Suffix Array Construction – Pass 1

First pass: sort by the first character of the suffix and label with rank:

{B} 0 = BANANA$           {$} -> {0}  6 = $
{A} 1 = ANANA$            {A} -> {1}  1 = ANANA$
{N} 2 = NANA$             {A} -> {1}  3 = ANA$
{A} 3 = ANA$       =>     {A} -> {1}  5 = A$
{N} 4 = NA$               {B} -> {2}  0 = BANANA$
{A} 5 = A$                {N} -> {3}  2 = NANA$
{$} 6 = $                 {N} -> {3}  4 = NA$

Now we know the rank of all suffixes by their first character.

SLIDE 20

Suffix Array Construction – Pass 2

Observation: a suffix of a suffix is a suffix, so if two suffixes share their first character, we know their relative rank by the second character! ⇒ Sort again with pair(rank by char 1, rank by char 2):

{0, 0} 6 = $              {0, 0} -> {0}  6 = $
{1, 3} 1 = ANANA$         {1, 0} -> {1}  5 = A$
{1, 3} 3 = ANA$           {1, 3} -> {2}  1 = ANANA$
{1, 0} 5 = A$       =>    {1, 3} -> {2}  3 = ANA$
{2, 1} 0 = BANANA$        {2, 1} -> {3}  0 = BANANA$
{3, 1} 2 = NANA$          {3, 1} -> {4}  2 = NANA$
{3, 1} 4 = NA$            {3, 1} -> {4}  4 = NA$

Now we know the rank of all suffixes by their first 2 characters.

SLIDE 21

Suffix Array Construction – Pass 3

We know the rank by the first 2 characters, so if two suffixes have the same rank, we know their relative rank by the next 2 characters. ⇒ Sort again with pair(rank of chars 1-2, rank of chars 3-4):

{0, 0} 6 = $              {0, 0} -> {0}  6 = $
{1, 0} 5 = A$             {1, 0} -> {1}  5 = A$
{2, 2} 1 = ANANA$         {2, 1} -> {2}  3 = ANA$
{2, 1} 3 = ANA$     =>    {2, 2} -> {3}  1 = ANANA$
{3, 4} 0 = BANANA$        {3, 4} -> {4}  0 = BANANA$
{4, 4} 2 = NANA$          {4, 0} -> {5}  4 = NA$
{4, 0} 4 = NA$            {4, 4} -> {6}  2 = NANA$

Now we know the rank of all suffixes by their first 4 characters. For “BANANA” we are done, as all ranks are unique. Otherwise, sort again with pair(rank of chars 1-4, rank of chars 5-8), etc.

SLIDE 22

Suffix Array Construction: Summary

Define rank[k][i] = the rank of S[i..n] when sorted by the first 2^k chars; then the previous construction is equivalent to this DP recurrence:

  • rank[0][i] = S[i] for 0 ≤ i < n
  • rank[k][i] = -1 for all i ≥ n
  • rank[k][i] = rank of S[i..n] after sorting all suffixes by {rank[k-1][i], rank[k-1][i+2^(k-1)]}

Our suffix array is then rank[⌈log n⌉], but it is useful to keep the entire array.

Time complexity: we did O(log n) sorts, so O(n log² n).

SLIDE 23

Suffix Array Construction

MAKE_SUFFIX_ARRAY(S of length N > 1):
    initialize array R[1 + log N][N], T[N]
    for i = 0 to N-1:
        R[0][i] = S[i]
    initialize skip = 1, lvl = 1
    while skip < N:
        for i = 0 to N-1:
            T[i] = {{R[lvl-1][i], R[lvl-1][i+skip]}, i}   // treat R[lvl-1][j] as -1 for j >= N
        sort T
        for i = 0 to N-1:
            if i > 0 && T[i]._1 == T[i-1]._1:
                R[lvl][T[i]._2] = R[lvl][T[i-1]._2]
            else:
                R[lvl][T[i]._2] = i
        skip = skip * 2, lvl = lvl + 1
    return R
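A possible C++ translation of the pseudocode above (my own sketch; the helper names are not from the slides). Out-of-range ranks are treated as -1, matching the recurrence's rank[k][i] = -1 for i ≥ n, so shorter suffixes sort first.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Prefix-doubling construction: returns the full rank table R, where
// R[lvl][i] = rank of suffix S[i..] when sorted by its first 2^lvl characters.
vector<vector<int>> make_rank_table(const string& s) {
    int n = s.size();
    int levels = 1;
    while ((1 << levels) < 2 * n) ++levels;  // ensure 2^(levels-1) >= n
    vector<vector<int>> R(levels, vector<int>(n));
    for (int i = 0; i < n; ++i) R[0][i] = s[i];
    vector<tuple<int,int,int>> T(n);  // {rank of first half, rank of second half, start index}
    for (int lvl = 1, skip = 1; lvl < levels; ++lvl, skip *= 2) {
        for (int i = 0; i < n; ++i) {
            int second = (i + skip < n) ? R[lvl-1][i + skip] : -1;
            T[i] = {R[lvl-1][i], second, i};
        }
        sort(T.begin(), T.end());
        for (int i = 0; i < n; ++i) {
            int idx = get<2>(T[i]);
            if (i > 0 && get<0>(T[i]) == get<0>(T[i-1]) &&
                         get<1>(T[i]) == get<1>(T[i-1]))
                R[lvl][idx] = R[lvl][get<2>(T[i-1])];  // tie: copy previous rank
            else
                R[lvl][idx] = i;
        }
    }
    return R;
}

// At the last level all ranks are distinct, so the suffix array is just the
// inverse permutation of the last rank row.
vector<int> suffix_array_from_ranks(const vector<vector<int>>& R) {
    const vector<int>& last = R.back();
    vector<int> sa(last.size());
    for (int i = 0; i < (int)last.size(); ++i) sa[last[i]] = i;
    return sa;
}
```

On "BANANA$" this reproduces the suffix array {6, 5, 3, 1, 0, 4, 2} worked out on the earlier slides.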

SLIDE 24

Longest Common Prefix of Two Suffixes

Notice that we can use the rank array to compute the LCP of two suffixes:

  • If rank[k][i] == rank[k][j], then the LCP of S[i..n] and S[j..n] is ≥ 2^k
  • ⇒ Find the largest k such that rank[k][i] == rank[k][j], and return 2^k + LCP of S[i+2^k..n] and S[j+2^k..n]

Time complexity: similar to LCA, i.e. O(log n) time.

SLIDE 25

Longest Common Prefix of Two Suffixes

LCP(i, j):
    if i == j:
        return N - i
    initialize len = 0
    for k = log N down to 0:
        if i >= N || j >= N:
            break
        if R[k][i] == R[k][j]:
            len = len + 2^k
            i = i + 2^k, j = j + 2^k
    return len
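In C++ the LCP walk might look like the following. To keep the sketch self-contained, the rank table here is built naively by ranking length-2^k windows (the real construction would use MAKE_SUFFIX_ARRAY); both helper names are mine.

```cpp
#include <bits/stdc++.h>
using namespace std;

// LCP of suffixes S[i..] and S[j..] from the rank table, as in the pseudocode:
// walk k from high to low, and whenever the 2^k-ranks agree, jump ahead 2^k.
int lcp(const vector<vector<int>>& R, int n, int i, int j) {
    if (i == j) return n - i;
    int len = 0;
    for (int k = (int)R.size() - 1; k >= 0; --k) {
        if (i >= n || j >= n) break;
        if (R[k][i] == R[k][j]) {
            len += 1 << k;
            i += 1 << k;
            j += 1 << k;
        }
    }
    return len;
}

// Naive rank table for testing: R[k][i] = rank of the length-2^k window
// starting at i (shorter near the end of the string). O(n^2 log n), but
// obviously correct on small inputs.
vector<vector<int>> naive_rank_table(const string& s) {
    int n = s.size(), levels = 1;
    while ((1 << (levels - 1)) < n) ++levels;
    vector<vector<int>> R(levels, vector<int>(n));
    for (int k = 0; k < levels; ++k) {
        vector<string> windows;
        for (int i = 0; i < n; ++i) windows.push_back(s.substr(i, 1 << k));
        sort(windows.begin(), windows.end());
        windows.erase(unique(windows.begin(), windows.end()), windows.end());
        for (int i = 0; i < n; ++i)
            R[k][i] = lower_bound(windows.begin(), windows.end(),
                                  s.substr(i, 1 << k)) - windows.begin();
    }
    return R;
}
```

For example, in "BANANA$" the suffixes ANANA$ (index 1) and ANA$ (index 3) share the prefix "ANA", so their LCP is 3.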

SLIDE 26

Longest Common Substring – Solution

  • Build the suffix array of S1S2 (the concatenation of the two strings)
  • Scan through the suffix array; when suffixes of adjacent rank start in different strings, compute their LCP
  • Make sure to clamp the LCP to the boundary of the 2 strings
  • Take the max of all LCPs you computed
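This plan can be sketched with naive building blocks (names are mine): suffix-sort the concatenation and scan adjacent entries that start in different strings, taking their LCP by direct comparison. I insert a separator '#' between the strings, which is assumed to occur in neither input; it also takes care of the clamping, since no match can run across it.

```cpp
#include <bits/stdc++.h>
using namespace std;

// Longest common substring of a and b via a (naive) suffix array of a + '#' + b.
int longest_common_substring(const string& a, const string& b) {
    string s = a + '#' + b;
    int n = s.size();
    vector<int> sa(n);
    iota(sa.begin(), sa.end(), 0);
    sort(sa.begin(), sa.end(), [&](int i, int j) {
        return s.compare(i, n - i, s, j, n - j) < 0;
    });
    int best = 0;
    for (int k = 1; k < n; ++k) {
        int i = sa[k-1], j = sa[k];
        bool i_in_a = i < (int)a.size(), j_in_a = j < (int)a.size();
        if (i_in_a == j_in_a) continue;   // both suffixes start in the same string
        int len = 0;                      // naive LCP of the two adjacent suffixes
        while (i + len < n && j + len < n && s[i + len] == s[j + len]) ++len;
        best = max(best, len);            // '#' stops any match at a's boundary
    }
    return best;
}
```

The maximum cross-string LCP is always realized by some adjacent pair in sorted order, which is why scanning neighbours suffices.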

SLIDE 27

Discussion Problem 1

Find the lexicographically least rotation of a string. In other words, if you rotate the string in all possible ways and sort them alphabetically, what is the first string you get? Example: BANANA ⇒ ABANAN

SLIDE 28

Discussion Problem 1 – Solution

  • Construct the suffix array of two copies of S concatenated
  • Output the first |S| characters of the first suffix (in sorted order) of length at least |S|

SLIDE 29

Discussion Problem 2

Find the longest palindromic substring of a string

SLIDE 30

Discussion Problem 2 – Solution

Idea: a palindrome is made of two pieces, where the reverse of the first piece shares a common prefix with the second piece.

  • Construct the suffix array of S + reverse(S)
  • Try all possible centers i: compute the LCP of reverse(S[0..i]) and S[i..n-1] (or start at i+1 for even palindromes)

Note: Manacher’s algorithm solves this in O(n), but why think when you can just copy-paste Suffix Array!
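The center-trying idea can be sketched with the suffix-array LCP swapped out for a direct character scan (same answers, and easy to check on small inputs; the function names are mine):

```cpp
#include <bits/stdc++.h>
using namespace std;

// Length of the longest palindromic substring of s. For each center i,
// match(l, r) is exactly the LCP of reverse(S[0..l]) and S[r..n-1].
int longest_palindromic_substring(const string& s) {
    int n = s.size(), best = n > 0 ? 1 : 0;
    auto match = [&](int l, int r) {
        int len = 0;
        while (l - len >= 0 && r + len < n && s[l - len] == s[r + len]) ++len;
        return len;
    };
    for (int i = 0; i < n; ++i) {
        best = max(best, 2 * match(i, i) - 1);  // odd length, centered at i
        best = max(best, 2 * match(i, i + 1));  // even length, centered at i|i+1
    }
    return best;
}
```

For "BANANA" the best center is the middle 'A', giving "ANANA" of length 5.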
