Overcoming high nanopore basecaller error rates for DNA storage via basecaller-decoder integration and convolutional codes
Shubham Chandak Stanford University ICASSP 2020
Team and funding
Tsachy Weissman Mary Wootters Hanlee Ji
Shubham Chandak Kedar Tatwawadi Joachim Neu Jay Mardia Billy Lau Peter Griffin Matt Kubit Reyna Hulett
SemiSynBio: Highly scalable random access DNA data storage with nanopore-based reading
Beckman Center Innovative Technology Seed Grant
Scalable Long-Term DNA Storage with Error Correction and Random-Access Retrieval
40,000 x 5 TByte HDDs: 40 tons, 10s of years
DNA: 1 gram, 1,000s of years
Easy duplication
Building block: synthesis
Current ability: short ssDNA oligos (~150 nt) at scale
DNA synthesis is not perfect: usually ~1% insertion/deletion errors
Building block: sequencing
https://directorsblog.nih.gov/2018/02/06/sequencing-human-genome-with-pocket-sized-nanopore-device/
Typical DNA Storage System
File → Segmentation → Outer code + indexing → Inner code → Synthesis → Storage → Sequencing + Basecalling → Sequenced reads → Decoding → Reconstructed file
Challenges
reading cost
Previous Works
[2] L. Organick et al., "Random access in large-scale DNA data storage," Nature biotechnology, vol. 36, no. 3, p. 242, 2018.
[3] Randolph Lopez et al., "DNA assembly for nanopore data storage readout," Nature communications, vol. 10, no. 1, pp. 2933, 2019.
Plot of previous works [2], [3], with the target region marked: "We want to be here!"
Nanopore Physics
… ACGTACGTACGT ... Nanopore sequencing channel
Nanopore Sequencing Model
Source: "Models and Information-Theoretic Bounds for Nanopore Sequencing", Wei Mao et al., IEEE Trans. Inf. Theory 2017
VERY HARD TO MODEL AND ANALYZE FAITHFULLY
COMBINE STRENGTHS OF MACHINE LEARNING & CODING THEORY!
Key idea
Using Flappie basecaller (Oxford Nanopore)
Basecalling (code constraints not used): Probabilities → AACGT
Decoding (code constraints used): Probabilities → ACGCGT
Convolutional Codes as the Inner Code
Incoming bit / output
Convolutional code parameters: r = 1/2 (rate), m = 6 (memory)
State diagram snippet
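A minimal sketch of a rate 1/2, memory 6 feedforward convolutional encoder, to make the "incoming bit / output" picture concrete. The generator polynomials (octal 171 and 133) are an assumption: they are a common choice for these parameters, but the slides do not say which generators were used. The function name `conv_encode` and the zero-termination are likewise illustrative.

```python
from typing import List, Sequence

# Assumed generators (octal 171, 133); not taken from the slides.
GENERATORS = (0o171, 0o133)
MEMORY = 6


def conv_encode(bits: Sequence[int],
                generators: Sequence[int] = GENERATORS,
                memory: int = MEMORY) -> List[int]:
    """Encode bits with a rate 1/len(generators) convolutional code.

    The input is padded with `memory` zeros so the encoder ends in the
    all-zero state.
    """
    state = 0  # the previous `memory` input bits, most recent in the top bit
    out = []
    for b in list(bits) + [0] * memory:
        reg = (b << memory) | state  # current bit followed by the state
        for g in generators:
            # each output bit is the parity of the register bits tapped by g
            out.append(bin(reg & g).count("1") % 2)
        state = reg >> 1  # shift: the oldest bit drops out
    return out
```

Each incoming bit produces two output bits, so a 4-bit input yields 2 x (4 + 6) = 20 coded bits once the termination bits are counted.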
Basecaller-decoder integration
Combining NN-modeling + convolutional codes
NN-modeling based transition probabilities
Convolutional code
Perform Viterbi decoding using the modified state diagram
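The following is a simplified sketch of the idea, not the decoder from the talk. It assumes the neural network output has already been reduced to one probability vector over A, C, G, T per coded base, with no insertions or deletions; the actual Flappie output is not aligned this way, which is what the modified state diagram addresses. The small code (memory 2, generators octal 7 and 5), the bit-to-base mapping, and all function names are assumptions chosen to keep the example short.

```python
import math
from typing import List, Sequence

BASES = "ACGT"  # assumed mapping: 00->A, 01->C, 10->G, 11->T


def step(state: int, bit: int, gens: Sequence[int], memory: int):
    """Return (next_state, base_index) for one input bit of a rate 1/2 code."""
    reg = (bit << memory) | state
    o1 = bin(reg & gens[0]).count("1") % 2
    o2 = bin(reg & gens[1]).count("1") % 2
    return reg >> 1, 2 * o1 + o2


def encode(bits: Sequence[int], gens=(0o7, 0o5), memory: int = 2) -> str:
    """Encode bits to DNA: one base per input bit, zero-terminated."""
    state, bases = 0, []
    for b in list(bits) + [0] * memory:
        state, idx = step(state, b, gens, memory)
        bases.append(BASES[idx])
    return "".join(bases)


def viterbi(probs: List[Sequence[float]], gens=(0o7, 0o5), memory: int = 2) -> List[int]:
    """Most likely message given one probability vector over ACGT per base."""
    n_states = 1 << memory
    neg_inf = float("-inf")
    score = [0.0] + [neg_inf] * (n_states - 1)  # start in the all-zero state
    back = []
    for p in probs:
        new = [neg_inf] * n_states
        prev = [(0, 0)] * n_states
        for s in range(n_states):
            if score[s] == neg_inf:
                continue
            for bit in (0, 1):
                ns, idx = step(s, bit, gens, memory)
                # branch metric: log-probability the network gives this base
                cand = score[s] + math.log(max(p[idx], 1e-12))
                if cand > new[ns]:
                    new[ns], prev[ns] = cand, (s, bit)
        score = new
        back.append(prev)
    # trace back from the all-zero state, since the code is zero-terminated
    s, bits = 0, []
    for prev in reversed(back):
        s, bit = prev[s]
        bits.append(bit)
    bits.reverse()
    return bits[: len(bits) - memory]  # drop the termination bits
```

Picking the most probable base at each position independently ignores the code constraints; the Viterbi pass uses them, so a position where the network prefers a wrong base can still be decoded correctly when the neighbouring positions are confident.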
Overall Inner Code design
(a) Inner code encoding: Segment #265 → attach index and CRC (12-bit index, 8-bit CRC, payload) → convolutional encoding → map to DNA (2 bits per base)
(b) Inner code decoding: convolutional list decoding → select topmost list element with correct CRC (if any) → Segment #265
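The framing and selection steps around the convolutional code can be sketched as below. Several details are assumptions not given in the slides: the CRC-8 polynomial (0x07), the field order (index, then payload, then CRC), the bit-to-base mapping, and every function name. The convolutional encoding and list decoding steps themselves are left out; `select_by_crc` only illustrates how a ranked candidate list would be filtered.

```python
from typing import List, Optional

BASES = "ACGT"   # assumed mapping: 00->A, 01->C, 10->G, 11->T
CRC_POLY = 0x07  # assumed CRC-8 polynomial x^8 + x^2 + x + 1


def crc8(bits: List[int], poly: int = CRC_POLY) -> List[int]:
    """Bit-serial CRC-8 (MSB first, zero initial value) over a list of bits."""
    reg = 0
    for b in bits:
        top = ((reg >> 7) & 1) ^ b
        reg = (reg << 1) & 0xFF
        if top:
            reg ^= poly
    return [(reg >> i) & 1 for i in range(7, -1, -1)]


def frame(index: int, payload: List[int]) -> List[int]:
    """Attach a 12-bit index and an 8-bit CRC to the payload bits."""
    idx_bits = [(index >> i) & 1 for i in range(11, -1, -1)]
    body = idx_bits + payload
    return body + crc8(body)


def bits_to_dna(bits: List[int]) -> str:
    """Map pairs of bits to bases, 2 bits per base."""
    assert len(bits) % 2 == 0
    return "".join(BASES[2 * bits[i] + bits[i + 1]] for i in range(0, len(bits), 2))


def select_by_crc(candidates: List[List[int]]) -> Optional[List[int]]:
    """Return the topmost candidate whose CRC checks, or None if none does."""
    for cand in candidates:
        if len(cand) > 8 and crc8(cand[:-8]) == cand[-8:]:
            return cand
    return None
```

For example, `frame(265, payload)` produces the bits that would be fed to the convolutional encoder for segment #265, and on the decoding side `select_by_crc` is applied to the list decoder's output from most to least likely.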
Experiments
Dream” Speech, poem collections, …
length ~165
Results
Plot of writing cost (bases/bit) and reading cost (bases/bit); axis ticks run from 0.50 to 2.10 and from 5 to 35
Series: Convolutional code: m=8, L=8; Convolutional code: m=11, L=8; Convolutional code: m=14, L=4; Previous works [6], [22]
Code rates: r = 5/6, r = 3/4, r = 1/2
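As a rough guide to how the code rate relates to writing cost, the sketch below computes the bases written per input bit of the convolutional code alone, assuming 2 bits per base. It deliberately ignores the index, CRC, termination bits, and the outer code, so the actual writing costs in the plot will be higher; the function name is hypothetical.

```python
from fractions import Fraction


def inner_code_bases_per_bit(rate: Fraction, bits_per_base: int = 2) -> Fraction:
    """Bases written per convolutional-code input bit (overheads excluded)."""
    return 1 / (rate * bits_per_base)


for r in (Fraction(5, 6), Fraction(3, 4), Fraction(1, 2)):
    print(f"r = {r}: {float(inner_code_bases_per_bit(r)):.3f} bases/bit")
```

Under these assumptions the three rates give 0.600, 0.667, and 1.000 bases/bit respectively.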
[6] L. Organick et al., "Random access in large-scale DNA data storage," Nature biotechnology, vol. 36, no. 3, p. 242, 2018.
[22] Randolph Lopez et al., "DNA assembly for nanopore data storage readout," Nature communications, vol. 10, no. 1, pp. 2933, 2019.
3x improvement in reading cost!
Conclusions and future work
DNA storage
signal
Code and data available at https://github.com/shubhamchandak94/nanopore_dna_storage