Seriation, Spectral Clustering and de novo genome assembly
Antoine Recanati, CNRS & ENS with Alexandre d’Aspremont, Thomas Kerdreux, Thomas Br¨ uls, CNRS - ENS Paris & Genoscope.
- A. Recanati
Symbiose Seminar, Juin 2018, 1/29
Seriation, Spectral Clustering and de novo genome assembly Antoine - - PowerPoint PPT Presentation
Seriation, Spectral Clustering and de novo genome assembly Antoine Recanati , CNRS & ENS with Alexandre dAspremont, Thomas Kerdreux, Thomas Br uls, CNRS - ENS Paris & Genoscope. A. Recanati Symbiose Seminar, Juin 2018, 1/29
Symbiose Seminar, Juin 2018, 1/29
Pairwise similarity information Aij on n variables. Suppose the data has a serial structure, i.e. there is an order π such that
20 40 60 80 100 120 140 160 20 40 60 80 100 120 140 160 20 40 60 80 100 120 140 160 20 40 60 80 100 120 140 160
Symbiose Seminar, Juin 2018, 2/29
Genomes are cloned multiple times and randomly cut into shorter reads
Reorder the reads to recover the genome.
Symbiose Seminar, Juin 2018, 3/29
Compute overlap between all read pairs. Reorder overlap matrix to recover read order. Average the read values to create a consensus sequence.
Symbiose Seminar, Juin 2018, 4/29
There are base calling errors in the reads, typically 2% to 20% depending on
Entire parts of the genome are repeated, which breaks the serial structure.
Next generation : short reads (∼ 400bp), few errors (∼ 2%). Repeats are
Third generation : long reads (∼ 10kbp), more errors (∼ 15%). Can resolve
Symbiose Seminar, Juin 2018, 5/29
With short accurate reads, the reordering problem is solved by
With long noisy reads, reads are corrected before assembly (hybrid correction
Layout and consensus not clearly separated, many heuristics . . . minimap+miniasm : first long raw reads straight assembler (but consensus
Symbiose Seminar, Juin 2018, 6/29
Introduction Spectral relaxation of Seriation (Spectral Ordering) Multi-dimensional Spectral Ordering Results (Application to genome assembly)
Symbiose Seminar, Juin 2018, 7/29
The 2-SUM problem is written
π∈P n
π∈P n
Symbiose Seminar, Juin 2018, 8/29
A : adjacency matrix of a undirected weighted graph (Aij > 0 iff. there is an
Node i has degree di =
j Aij. Degree matrix D = diag(A1) = diag(d).
Laplacian matrix L = D − A. The Laplacian can be viewed as a quadratic form,
n
Symbiose Seminar, Juin 2018, 9/29
For a vector f = (f1, . . . , fn)T ∈ Rn and a matrix M ∈ Rn×n, we have,
i,j=1 Mijfifj
(λ ∈ R, u ∈ Rn) is a eigenvalue-eigenvector couple of L ∈ Rn×n iff Lu = λu
Symbiose Seminar, Juin 2018, 10/29
n
i=1 f 2 i Dii − n i,j=1 Aijfifj
i=1 f 2 i (n j=1 Aij) − n i,j=1 Aijfifj
i,j=1 Aij(f 2 i − fifj)
1 2
i,j=1 Aij(f 2 j + f 2 i − 2fifj)
1 2
i,j=1 Aij(fi − fj)2
Symbiose Seminar, Juin 2018, 11/29
n
L is symmetric and positive semi-definite. L has n non-negative, real-valued eigenvalues, 0 = λ1 ≤ λ2 ≤ . . . ≤ λn. 1 = (1, . . . , 1)T is eigenvector associated to eigenvalue 0. If A has K connected components, the eigenvalue 0 has multiplicity K + 1,
If f ∈ {−1, +1}n, objective of min-cut (clustering).
Symbiose Seminar, Juin 2018, 12/29
The 2-SUM problem is written
i,j=1 Aπ(i)π(j)(i − j)2
i,j=1 Aij(π(i) − π(j))2
For certain matrices A, 2-SUM ⇐
NP-Complete for generic matrices A. Constraints π ∈ P ?
Symbiose Seminar, Juin 2018, 13/29
π∈P πTLAπ
2
Since L1 = 0, (2SUM) is invariant by π ← π − (n+1)
2
Up to a dilatation, we can chose π2
2 = 1.
Relax the integer constraints and let π(i) ∈ R.
Symbiose Seminar, Juin 2018, 14/29
1T x=0, x2=1
Symbiose Seminar, Juin 2018, 15/29
1: Compute Laplacian L = diag(A1) − A 2: Compute second smallest eigenvector of L, x∗ 3: Sort the values of x∗
20 40 60 80 20 40 60 80
20 40 60 80 100 0.15 0.10 0.05 0.00 0.05 0.10 0.15
Symbiose Seminar, Juin 2018, 16/29
Spectral solution easy to compute and scales well (polynomial time) But sensitive and not flexible (hard to include additional structural constraints) Other (convex) relaxations can handle structural constraints and solve more
Overlap : computed from k-mers, yielding a similarity matrix A Layout : A is thresholded to remove noise-induced overlaps, and reordered
Consensus : Genome sliced in windows
Symbiose Seminar, Juin 2018, 17/29
100 200 300 400 500 100 200 300 400 500 100 200 300 400 500 0.075 0.050 0.025 0.000 0.025 0.050 0.075
Gaussian noise over perfect R-matrix.
Symbiose Seminar, Juin 2018, 18/29
2500 5000 7500 10000 12500 15000 17500 0.020 0.015 0.010 0.005 0.000 0.005
Repeats are a more structured noise that makes the method fail.
Symbiose Seminar, Juin 2018, 19/29
Introduction Spectral relaxation of Seriation (Spectral Ordering) Multi-dimensional Spectral Ordering Results (Application to genome assembly)
Symbiose Seminar, Juin 2018, 20/29
Symbiose Seminar, Juin 2018, 21/29
n
2
Let 0 = λ0 < λ1 ≤ . . . ≤ λn−1, Λ diag (λ0, . . . , λn−1),
For any K < n, Φ(K)
i,j=1 Aijyi − yj2 2
1 , . . . , yT n
Symbiose Seminar, Juin 2018, 22/29
Compute the K lowest non-zero eigenvectors of L,
Run the K-means algorithm on this K-dimensional embedding.
Symbiose Seminar, Juin 2018, 23/29
Construct new similarity matrix S For each point u, take k-NN in the embedding, fit by a line, use projection on
Run basic Spectral Ordering on S. If S is not connected, reorder each connected component, and use A to merge
Symbiose Seminar, Juin 2018, 24/29
Simple generalization of Spectral Ordering Acts like a pre-preocessing on the similarity matrix Improves robustness to noise Handles circular orderings (with 2D embedding in a circle)
0.075 0.050 0.025 0.000 0.025 0.050 0.075 0.100 0.04 0.02 0.00 0.02 0.04
initial laplacian embedding of Hic
Symbiose Seminar, Juin 2018, 25/29
Introduction Spectral relaxation of Seriation (Spectral Ordering) Multi-dimensional Spectral Ordering Results (Application to genome assembly)
Symbiose Seminar, Juin 2018, 26/29
Escherichia coli reads sequenced by Loman et al. [2015]. ∼ 4Mbp Oxford Nanopore Technology MinION’s device (third generation long reads). minimap2 used to compute overlap-based similarity between reads.
Symbiose Seminar, Juin 2018, 27/29
MDSO new similarity matrix S is disconnected. Connected components can be merged by looking at the similarity between
Kendall-Tau score with reference ordering : 99.5% Full assembly pipeline yields ∼ 99% avg. identity (using MSA in sliding
2500 5000 7500 10000 12500 15000 17500
1000000 2000000 3000000 4000000
true position
2500 5000 7500 10000 12500 15000 17500
1000000 2000000 3000000 4000000
true position
Symbiose Seminar, Juin 2018, 28/29
Equivalence 2-SUM ⇐
Spectral ordering : simple relaxation of 2-SUM using spectrum of the
Spectral ordering is sensitive to repeats. Multi-dimensional Spectral Ordering can overcome this issue (and solve
Straightforward assembly pipeline with MSA to perform consensus.
Symbiose Seminar, Juin 2018, 29/29
Symbiose Seminar, Juin 2018, 30/29
16 chromosomes Many repeats Higher threshold on similarity matrix ⇒ many connected components
true ordering (from BWA) recovered (spectral ordering)
Symbiose Seminar, Juin 2018, 31/29
Equivalence 2-SUM ⇐
Layout correctly found by spectral relaxation for bacterial genomes (with
Consensus computed by MSA in sliding windows ⇒∼ 99% avg. identity with
Additional information could help assemble more complex genomes (e.g.
Other problems involving Seriation ? Convex relaxations can also handle constraints (e.g. |π(i) − π(j)| ≤ k) for
Symbiose Seminar, Juin 2018, 32/29
Once layout is computed and fined-grained, slicing in windows Multiple Sequence Alignment using Partial Order Graphs (POA) in windows Windows merging
window 1 window 2 window 3
POA in windows
consensus 1 consensus 2 consensus 3 consensus (1+2) consensus ((1+2) +3)
Symbiose Seminar, Juin 2018, 33/29
The 2-SUM problem is written
π∈P n
π∈P πTLAπ
NP-Complete for generic matrices A.
Symbiose Seminar, Juin 2018, 34/29
π∈P n
Gives a spectral (hence polynomial) solution for 2-SUM on some R-matrices. Write a convex relaxation for 2-SUM and seriation.
Symbiose Seminar, Juin 2018, 35/29
Let Dn the set of doubly stochastic matrices, where
Notice that P = D ∩ O, i.e. Π permutation matrix if and only Π is both
Solve
F
1 Πg + 1 ≤ eT nΠg,
n11T and Y ∈ Rn×p is a matrix
Symbiose Seminar, Juin 2018, 36/29
F
2-SUM term Tr(Y TΠTLAΠY ) = p
i=1 yT i ΠTLAΠyi where yi are small
Orthogonalization penalty −µPΠ2
F, where P = I − 1 n11T.
eT
1 Πg + 1 ≤ eT nΠg breaks degeneracies by imposing π(1) ≤ π(n). Without it,
Π1 = 1, ΠT1 = 1 and Π ≥ 0, keep Π doubly stochastic.
Symbiose Seminar, Juin 2018, 37/29
Relaxations for orthogonality constraints, e.g. SDPs in [???]. Simple idea:
O(√log n) approximation bounds for Minimum Linear Arrangement
All these relaxations form extremely large SDPs.
Symbiose Seminar, Juin 2018, 38/29
Semi-Supervised Seriation. We can add structural constraints to the
i Πg − eT j Πg ≤ b.
Sampling permutations. We can generate permutations from a doubly
Symbiose Seminar, Juin 2018, 39/29
Symbiose Seminar, Juin 2018, 40/29
Symbiose Seminar, Juin 2018, 41/29
Longer reads. Average 10k base pairs in early experiments. Compared with
High error rate. About 20% compared with a few percents for existing
Real-time data. Sequencing data flows continuously.
Symbiose Seminar, Juin 2018, 42/29