Overcoming high nanopore basecaller error rates for DNA storage via basecaller-decoder integration and convolutional codes

slide-1
SLIDE 1

Overcoming high nanopore basecaller error rates for DNA storage via basecaller-decoder integration and convolutional codes

Shubham Chandak, Stanford University, ICASSP 2020

slide-2
SLIDE 2

Team and funding

Team: Tsachy Weissman, Mary Wootters, Hanlee Ji, Shubham Chandak, Kedar Tatwawadi, Joachim Neu, Jay Mardia, Billy Lau, Peter Griffin, Matt Kubit, Reyna Hulett

Funding:

  • SemiSynBio: Highly scalable random access DNA data storage with nanopore-based reading
  • Beckman Center Innovative Technology Seed Grant
  • Scalable Long-Term DNA Storage with Error Correction and Random-Access Retrieval

slide-3
SLIDE 3

Motivation

slide-7
SLIDE 7

200 Petabyte

  • HDDs: 40,000 x 5 TByte drives, 40 tons, 10s of years
  • DNA: 1 gram, 1,000s of years, easy duplication

slide-8
SLIDE 8

DNA storage setup

slide-9
SLIDE 9

Building block: synthesis

  • Ability to “write/synthesize” artificial DNA (sequence of {A,C,G,T})

  • Current ability: short ssDNA oligos (~150 nt) at scale
  • DNA synthesis is not perfect: usually ~1% insertion/deletion error

slide-10
SLIDE 10

Building block: sequencing

  • Nanopore sequencing: portable, real time

https://directorsblog.nih.gov/2018/02/06/sequencing-human-genome-with-pocket-sized-nanopore-device/

slide-11
SLIDE 11

Typical DNA Storage System

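Pipeline: File -> Segmentation -> Outer code + indexing -> Inner code -> Synthesis -> Storage -> Sequencing + Basecalling -> Sequenced reads -> Decoding -> Reconstructed file

What can happen to the oligos between synthesis and the sequenced reads:

  • Duplication
  • Permutation
  • Loss
  • Corruption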

slide-13
SLIDE 13

Challenges

  • High basecall error rates for nanopore sequencing
      • 5-10% edit distance
      • Predominantly insertion and deletion errors
  • Lack of good error correction codes for this setting
  • Most previous works rely on consensus over multiple reads, which means a high reading cost
      • Sequence the input many times (~30-40x)
      • Cluster by index, and perform “averaging” to reduce the error
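
The “5-10% edit distance” figure above counts insertions, deletions and substitutions together. As a minimal sketch of how such a rate could be measured, the snippet below computes the Levenshtein distance between a synthesized sequence and a basecalled read. The function names `edit_distance` and `error_rate` are illustrative and not taken from the talk or its code repository.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def error_rate(truth, read):
    """Edit distance normalized by the length of the true sequence."""
    return edit_distance(truth, read) / len(truth)


if __name__ == "__main__":
    # One deleted base out of ten gives a 10% error rate.
    print(error_rate("ACGTACGTAC", "ACGTACGTA"))
```

Because insertions and deletions shift every later base, a position-by-position comparison would overstate the error; the dynamic program above aligns the two sequences before counting.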
slide-14
SLIDE 14

Previous Works

[Figure: plot locating previous works [2] and [3], annotated “We want to be here!”]

[2] L. Organick et al., “Random access in large-scale DNA data storage,” Nature Biotechnology, vol. 36, no. 3, p. 242, 2018.
[3] R. Lopez et al., “DNA assembly for nanopore data storage readout,” Nature Communications, vol. 10, no. 1, p. 2933, 2019.

slide-15
SLIDE 15

Methods

slide-16
SLIDE 16

Nanopore Physics

slide-19
SLIDE 19

Nanopore Sequencing Model

The input … ACGTACGTACGT … passes through the nanopore sequencing channel, which exhibits:

  • Memory (inter-symbol interference)
  • Base skips
  • Fading
  • Random symbol duration
  • Noise

VERY HARD TO MODEL AND ANALYZE FAITHFULLY

COMBINE STRENGTHS OF MACHINE LEARNING & CODING THEORY!

Source: "Models and Information-Theoretic Bounds for Nanopore Sequencing", Wei Mao et al., IEEE Trans. Inf. Theory 2017

slide-20
SLIDE 20

Key idea

slide-23
SLIDE 23

Key idea

Using Flappie basecaller (Oxford Nanopore)

  • Basecalling (code constraints not used): Probabilities -> AACGT
  • Decoding (code constraints used): Probabilities -> ACGCGT

slide-24
SLIDE 24

Convolutional Codes as the Inner Code

Convolutional code parameters:

  • r = 1/2 (rate)
  • m = 6 (memory)

[Figure: state diagram snippet, edges labeled incoming bit / output]
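
To make the parameters concrete, here is a minimal sketch of a rate-1/2, memory-6 convolutional encoder. The slide does not state the generator polynomials, so the widely used pair (171, 133) in octal is an assumption, as are the function name `conv_encode` and the zero-termination of the input.

```python
# Assumed generator polynomials (octal 171 and 133); the talk does not
# specify which generators were used.
GENERATORS = (0o171, 0o133)
MEMORY = 6


def conv_encode(bits, generators=GENERATORS, memory=MEMORY):
    """Encode a list of 0/1 bits with a feedforward convolutional code.

    Each input bit produces len(generators) output bits, so two
    generators give rate 1/2. `memory` zero bits are appended so the
    encoder returns to the all-zero state.
    """
    state = 0  # the last `memory` input bits, newest in the top position
    out = []
    for b in list(bits) + [0] * memory:
        reg = (b << memory) | state  # current bit followed by the state
        for g in generators:
            out.append(bin(reg & g).count("1") % 2)  # parity of tapped bits
        state = reg >> 1  # shift: the oldest bit falls off
    return out


if __name__ == "__main__":
    print(conv_encode([1, 0, 1, 1]))
```

Each edge of the state diagram corresponds to one pass through the loop: the incoming bit selects the next state, and the two parity bits are the edge's output.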

slide-25
SLIDE 25

Basecaller-decoder integration

Combining NN-modeling + convolutional codes

  • NN-modeling based transition probabilities
  • Convolutional code

Perform Viterbi decoding using the modified state diagram
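
For readers unfamiliar with Viterbi decoding, the sketch below runs it on the plain convolutional-code trellis with a hard-decision Hamming metric. This is a deliberately simplified stand-in, not the decoder described in the talk: the talk's version uses the basecaller's probabilities as branch metrics on a modified state diagram, which is what lets it cope with insertions and deletions, whereas this sketch only corrects bit flips. The generators (octal 171, 133) and all function names are assumptions.

```python
def _step(state, bit, generators, memory):
    """One trellis edge: returns (next_state, output_bits)."""
    reg = (bit << memory) | state
    out = tuple(bin(reg & g).count("1") % 2 for g in generators)
    return reg >> 1, out


def encode(bits, generators=(0o171, 0o133), memory=6):
    """Zero-terminated convolutional encoding (assumed generators)."""
    state, out = 0, []
    for b in list(bits) + [0] * memory:
        state, o = _step(state, b, generators, memory)
        out.extend(o)
    return out


def viterbi_decode(received, generators=(0o171, 0o133), memory=6):
    """Hard-decision Viterbi decoding of a zero-terminated codeword."""
    n = len(generators)
    n_steps = len(received) // n
    cost = {0: 0}  # best path metric per reachable state; start in state 0
    history = []   # per step: state -> (previous state, input bit)
    for t in range(n_steps):
        chunk = received[t * n:(t + 1) * n]
        new_cost, back = {}, {}
        for s, c in cost.items():
            for bit in (0, 1):
                ns, o = _step(s, bit, generators, memory)
                d = c + sum(x != y for x, y in zip(o, chunk))  # Hamming metric
                if d < new_cost.get(ns, float("inf")):
                    new_cost[ns] = d
                    back[ns] = (s, bit)
        history.append(back)
        cost = new_cost
    # Trace back from the all-zero state, where a terminated codeword ends.
    state, bits = 0, []
    for back in reversed(history):
        state, bit = back[state]
        bits.append(bit)
    bits.reverse()
    return bits[:n_steps - memory]  # drop the termination bits


if __name__ == "__main__":
    message = [1, 0, 1, 1, 0, 0, 1, 0]
    noisy = encode(message)
    noisy[3] ^= 1   # flip two bits to simulate substitution errors
    noisy[10] ^= 1
    print(viterbi_decode(noisy) == message)
```

Replacing the Hamming metric with log-probabilities from a neural network, and extending the state to track position in the basecaller output, is the direction the slide describes.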

slide-26
SLIDE 26

Overall Inner Code design

(a) Inner code encoding: Segment #265 -> attach index and CRC (12-bit index, 8-bit CRC, payload) -> convolutional encoding -> map to DNA (2 bits per base)

(b) Inner code decoding: convolutional list decoding -> select topmost list element with correct CRC (if any) -> Segment #265
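
The framing steps around the convolutional code can be sketched as follows. Several details are assumptions made only for illustration: the CRC-8 polynomial (0x07), the field order (index, payload, CRC), the base mapping 00/01/10/11 to A/C/G/T, and every function name. The convolutional encoding and list decoding that sit between framing and the DNA mapping are omitted here.

```python
BASES = "ACGT"  # assumed mapping: 00->A, 01->C, 10->G, 11->T


def crc8(bits, poly=0x07):
    """Bitwise CRC-8 over a list of 0/1 bits (assumed polynomial 0x07)."""
    reg = 0
    for b in bits:
        top = (reg >> 7) & 1
        reg = (reg << 1) & 0xFF
        if top ^ b:
            reg ^= poly
    return [(reg >> i) & 1 for i in range(7, -1, -1)]


def build_frame(index, payload_bits):
    """Prepend a 12-bit index and append an 8-bit CRC over index + payload."""
    index_bits = [(index >> i) & 1 for i in range(11, -1, -1)]
    body = index_bits + list(payload_bits)
    return body + crc8(body)


def frame_ok(frame):
    """True if the trailing 8 bits match the CRC of the rest."""
    return len(frame) > 8 and crc8(frame[:-8]) == list(frame[-8:])


def parse_frame(frame):
    """Split a valid frame into (index, payload_bits)."""
    index = int("".join(str(b) for b in frame[:12]), 2)
    return index, list(frame[12:-8])


def bits_to_dna(bits):
    """Map an even-length bit list to bases, 2 bits per base."""
    return "".join(BASES[2 * bits[i] + bits[i + 1]]
                   for i in range(0, len(bits), 2))


def pick_from_list(candidates):
    """Return the first (topmost) candidate with a correct CRC, else None."""
    for cand in candidates:
        if frame_ok(cand):
            return cand
    return None


if __name__ == "__main__":
    frame = build_frame(265, [1, 0, 1, 1, 0, 0, 1, 0])
    wrong = list(frame)
    wrong[15] ^= 1  # a list entry that fails the CRC check
    print(parse_frame(pick_from_list([wrong, frame])))
```

The CRC lets the decoder accept a lower-ranked list entry when the top-ranked one is wrong, and the index tells the outer code which segment was recovered.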

slide-27
SLIDE 27

Experiments and results

slide-28
SLIDE 28

Experiments

  • Data: 11 KB, including the Gettysburg Address, UN Declaration, “I Have a Dream” speech, poem collections, …
  • Final error correction code design:
      • Reed-Solomon outer code: 30% redundancy (default)
  • Pretrained model from the ONT Flappie basecaller
  • Synthesis: data synthesized using CustomArray synthesis, into oligos of length ~165
  • Experiments:
      • Rate of convolutional code: r = 1/2, 3/4, 5/6
      • Memory: m = 8, 11, 14
      • List size: 4, 8
slide-29
SLIDE 29

Results

[Figure: writing cost (bases/bit) versus reading cost (bases/bit). Series: convolutional code with m=8, L=8; m=11, L=8; m=14, L=4, each at r = 5/6, 3/4 and 1/2; previous works [6] and [22].]

[6] L. Organick et al., “Random access in large-scale DNA data storage,” Nature Biotechnology, vol. 36, no. 3, p. 242, 2018.
[22] R. Lopez et al., “DNA assembly for nanopore data storage readout,” Nature Communications, vol. 10, no. 1, p. 2933, 2019.

3x improvement in reading cost!
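
As a rough guide to the two axes, the sketch below assumes the usual definitions: writing cost is bases synthesized per information bit, and reading cost is bases sequenced per information bit. It also computes the floor on writing cost set by the convolutional code rate alone at 2 bits per base, ignoring the index, CRC, termination and outer-code overhead that the real system adds. The function names and the numbers in the usage lines are hypothetical, not results from the talk.

```python
from fractions import Fraction

BITS_PER_BASE = 2  # from the inner-code slide: map to DNA at 2 bits per base


def cost(bases, info_bits):
    """Bases per information bit; the same ratio serves as writing cost
    (bases synthesized) or reading cost (bases sequenced)."""
    return Fraction(bases, info_bits)


def ideal_writing_cost(code_rate):
    """Writing cost if the convolutional code were the only overhead."""
    return 1 / (Fraction(code_rate) * BITS_PER_BASE)


if __name__ == "__main__":
    for r in (Fraction(1, 2), Fraction(3, 4), Fraction(5, 6)):
        print(r, float(ideal_writing_cost(r)))
    # Hypothetical run: 1,000,000 bases sequenced to recover 100,000 bits.
    print(float(cost(1_000_000, 100_000)))
```

A higher code rate lowers the writing cost but leaves less redundancy per read, so more reads may be needed; the plot shows that trade-off for each memory and list size.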

slide-31
SLIDE 31

Conclusions and future work

  • Novel error-correction mechanism for nanopore-sequencing-based DNA storage
      • Use “soft information” from the raw signal to improve decoding
      • Use the neural net in the basecaller to distil information from the “hard-to-model” raw signal
      • Use convolutional codes that align nicely with the sequential nanopore model
  • Requires 3x fewer reads for decoding than previous works
  • Future work:
      • Optimization of convolutional code and CRC parameters
      • Fine-tuning of the neural network model and use of improved basecallers
      • Application to other novel synthesis methodologies
slide-32
SLIDE 32

Thank You!

Code and data available at https://github.com/shubhamchandak94/nanopore_dna_storage