Overcoming high nanopore basecaller error rates for DNA storage via basecaller-decoder integration and convolutional codes


SLIDE 1

Overcoming high nanopore basecaller error rates for DNA storage via basecaller-decoder integration and convolutional codes

Shubham Chandak, Stanford University, ICASSP 2020

SLIDE 2

Team and funding

Tsachy Weissman Mary Wootters Hanlee Ji

Shubham Chandak Kedar Tatwawadi Joachim Neu Jay Mardia Billy Lau Peter Griffin Matt Kubit Reyna Hulett

SemiSynBio: Highly scalable random access DNA data storage with nanopore-based reading
Beckman Center Innovative Technology Seed Grant
Scalable Long-Term DNA Storage with Error Correction and Random-Access Retrieval

SLIDE 3

Motivation

SLIDE 4

200 Petabyte

SLIDE 5

200 Petabyte

40,000 x 5 TByte HDDs: 40 tons, 10s of years

SLIDE 6

200 Petabyte

40,000 x 5 TByte HDDs: 40 tons, 10s of years
DNA: 1 gram, 1,000s of years

SLIDE 7

200 Petabyte

40,000 x 5 TByte HDDs: 40 tons, 10s of years
DNA: 1 gram, 1,000s of years, easy duplication

SLIDE 8

DNA storage setup

SLIDE 9

Building block: synthesis

Ability to “write/synthesize” artificial DNA (sequence of {A,C,G,T})

Current ability: short ssDNA oligos (~150 nt) at scale
DNA synthesis is not perfect: usually ~1% insertion/deletion errors

SLIDE 10

Building block: sequencing

Nanopore sequencing: portable, real time

https://directorsblog.nih.gov/2018/02/06/sequencing-human-genome-with-pocket-sized-nanopore-device/

SLIDE 11

Typical DNA Storage System

File → Segmentation → Outer code + indexing → Inner code → Synthesis → Storage → Sequencing + Basecalling → Sequenced reads → Decoding → Reconstructed file

Channel effects on the stored oligos:
Duplication
Permutation
Loss
Corruption

SLIDE 12

Challenges

High basecall error rates for nanopore sequencing
5-10% edit distance
Predominantly insertion and deletion errors
Lack of good error correction codes for this setting

SLIDE 13

Challenges

High basecall error rates for nanopore sequencing
5-10% edit distance
Predominantly insertion and deletion errors
Lack of good error correction codes for this setting
Most previous works rely on consensus over multiple reads, which makes the reading cost high
Sequence the input many times (~30-40x)
Cluster by index, and perform “averaging” to reduce the error
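
The consensus approach described on this slide can be sketched as follows. This is a minimal illustration, not the method of any cited work: it assumes the index is a fixed-length prefix that is read without error, and it handles substitution errors only by voting over reads of the most common length. Real nanopore reads contain insertions and deletions, so practical pipelines align the reads before voting. The function name `consensus_by_index` is hypothetical.

```python
from collections import Counter, defaultdict

def consensus_by_index(reads, index_len):
    """Cluster reads by their leading index, then majority-vote each position.

    Assumption: errors are substitutions only, so reads in a cluster can be
    compared column by column without alignment.
    """
    clusters = defaultdict(list)
    for read in reads:
        clusters[read[:index_len]].append(read[index_len:])

    result = {}
    for index, members in clusters.items():
        # Keep only reads of the most common length so columns line up.
        length = Counter(len(m) for m in members).most_common(1)[0][0]
        same_len = [m for m in members if len(m) == length]
        # Majority vote per column.
        result[index] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*same_len)
        )
    return result

reads = ["ACGTTA", "ACGTTA", "ACGATA", "TGCCAG"]
print(consensus_by_index(reads, index_len=2))
```

Each extra read per cluster improves the vote but adds directly to the reading cost, which is the trade-off this slide points at.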

SLIDE 14

Previous Works

[Figure: plot locating previous works [2] and [3], with the target region marked “We want to be here!”]

[2] L. Organick et al., “Random access in large-scale DNA data storage,” Nature Biotechnology, vol. 36, no. 3, p. 242, 2018.
[3] R. Lopez et al., “DNA assembly for nanopore data storage readout,” Nature Communications, vol. 10, no. 1, p. 2933, 2019.

SLIDE 15

Methods

SLIDE 16

Nanopore Physics

SLIDE 17

… ACGTACGTACGT ... Nanopore sequencing channel

Memory (inter-symbol interference)
Base skips
Fading
Random symbol duration
Noise

Nanopore Sequencing Model

Source: "Models and Information-Theoretic Bounds for Nanopore Sequencing", Wei Mao et al., IEEE Trans. Inf. Theory 2017

SLIDE 18

… ACGTACGTACGT ... Nanopore sequencing channel

Memory (inter-symbol interference)
Base skips
Fading
Random symbol duration
Noise

VERY HARD TO MODEL AND ANALYZE FAITHFULLY

Nanopore Sequencing Model

Source: "Models and Information-Theoretic Bounds for Nanopore Sequencing", Wei Mao et al., IEEE Trans. Inf. Theory 2017

SLIDE 19

… ACGTACGTACGT ... Nanopore sequencing channel

Memory (inter-symbol interference)
Base skips
Fading
Random symbol duration
Noise

VERY HARD TO MODEL AND ANALYZE FAITHFULLY

COMBINE STRENGTHS OF MACHINE LEARNING & CODING THEORY!

Nanopore Sequencing Model

Source: "Models and Information-Theoretic Bounds for Nanopore Sequencing", Wei Mao et al., IEEE Trans. Inf. Theory 2017

SLIDE 20

Key idea

SLIDE 21

Using Flappie basecaller (Oxford Nanopore)

Probabilities

Key idea

SLIDE 22

Using Flappie basecaller (Oxford Nanopore)

Probabilities

Basecalling (code constraints not used): AACGT

Key idea

SLIDE 23

Using Flappie basecaller (Oxford Nanopore)

Probabilities

Basecalling (code constraints not used): AACGT

Decoding (code constraints used): ACGCGT

Key idea

SLIDE 24

Convolutional Codes as the Inner Code

Incoming bit / output

Convolutional code parameters: r = 1/2 (rate), m = 6 (memory)

State diagram snippet
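
To make the parameters on this slide concrete, here is a small sketch of a rate-1/2, memory-6 feedforward convolutional encoder. The slide gives only r and m; the generator polynomials 171 and 133 (octal) are an assumption, chosen because they are a widely used pair for this rate and memory, and the zero-termination is likewise an assumption. The name `conv_encode` is hypothetical.

```python
# Assumed generator polynomials (octal 171, 133); the slides do not name them.
GENERATORS = (0o171, 0o133)
MEMORY = 6

def conv_encode(bits, generators=GENERATORS, memory=MEMORY):
    """Encode bits with a rate-1/len(generators) convolutional code.

    The state holds the last `memory` input bits (most recent in the high bit).
    `memory` zero bits are appended so the encoder ends in the all-zero state.
    """
    state = 0
    out = []
    for b in list(bits) + [0] * memory:
        register = (b << memory) | state          # current bit + stored bits
        for g in generators:
            out.append(bin(register & g).count("1") % 2)  # parity of taps
        state = register >> 1                     # shift: drop the oldest bit
    return out

# A single 1 bit produces the interleaved impulse response of the two generators.
print(conv_encode([1]))
```

Each incoming bit produces two output bits, which is where the rate of 1/2 comes from; the 64 possible values of `state` are the nodes of the state diagram.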

SLIDE 25

Basecaller-decoder integration

Combining NN-modeling + convolutional codes
NN-modeling based transition probabilities
Convolutional code
Perform Viterbi decoding using the modified state diagram
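
The idea of feeding probabilities into a Viterbi decoder can be illustrated with a much simpler stand-in than the decoder in the talk. The sketch below assumes the soft information is already aligned to the coded bits (one probability per coded bit, no insertions or deletions), which the real nanopore setting does not provide; the modified state diagram mentioned on the slide exists precisely to handle that. It also uses a small memory-2 code with generators 7 and 5 (octal) so the example stays readable. All function names are hypothetical.

```python
import math

def branch(state, bit, generators, memory):
    """Return (output symbols, next state) for one trellis transition."""
    register = (bit << memory) | state
    symbols = [bin(register & g).count("1") % 2 for g in generators]
    return symbols, register >> 1

def encode(bits, generators=(0o7, 0o5), memory=2):
    """Zero-terminated convolutional encoding (assumed toy code)."""
    state, out = 0, []
    for b in list(bits) + [0] * memory:
        symbols, state = branch(state, b, generators, memory)
        out.extend(symbols)
    return out

def viterbi_soft(p_one, generators=(0o7, 0o5), memory=2):
    """Most likely message given p_one[i] = P(coded bit i is 1).

    Assumption: probabilities are aligned one-to-one with coded bits.
    """
    n = len(generators)
    steps = len(p_one) // n
    eps = 1e-12
    score = {0: 0.0}   # best log-probability of reaching each state
    paths = {0: []}    # input bits along that best path
    for t in range(steps):
        probs = p_one[t * n:(t + 1) * n]
        new_score, new_paths = {}, {}
        for state, s in score.items():
            for bit in (0, 1):
                symbols, nxt = branch(state, bit, generators, memory)
                ll = s + sum(
                    math.log(max(p if c else 1.0 - p, eps))
                    for c, p in zip(symbols, probs)
                )
                if nxt not in new_score or ll > new_score[nxt]:
                    new_score[nxt] = ll
                    new_paths[nxt] = paths[state] + [bit]
        score, paths = new_score, new_paths
    # Zero termination: the true path ends in state 0; drop the tail bits.
    return paths[0][:steps - memory]

message = [1, 0, 1, 1, 0]
p = [0.9 if c else 0.1 for c in encode(message)]
p[3], p[8] = 1 - p[3], 1 - p[8]   # two positions where the soft input is wrong
print(viterbi_soft(p) == message)
```

Because the decoder only considers sequences the code allows, the two misleading probabilities are outvoted by the surrounding ones, which is the sense in which the code constraints are used during decoding rather than after basecalling.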

SLIDE 26

Overall Inner Code design

(a) Inner code encoding: Payload → Attach index and CRC (12-bit index, 8-bit CRC) → Convolutional encoding → Map to DNA (2 bits per base) → Segment #265

(b) Inner code decoding: Segment #265 → Convolutional list decoding → Select topmost list element with correct CRC (if any)
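
The framing and selection steps of this design can be sketched as below. Several details are assumptions, since the slide does not give them: the CRC polynomial (0x07 here), the bit order, the base mapping A=00, C=01, G=10, T=11, and all function names. The convolutional encoding and list decoding stages are left out; `pick_from_list` simply takes a ranked list of candidate bit sequences, as a list decoder might return.

```python
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11

def crc8(bits, poly=0x07):
    """Bitwise CRC-8 of a list of bits (polynomial is an assumption)."""
    reg = 0
    for b in bits:
        top = (reg >> 7) & 1
        reg = (reg << 1) & 0xFF
        if top ^ b:
            reg ^= poly
    return [(reg >> i) & 1 for i in range(7, -1, -1)]

def frame(index, payload_bits):
    """Prepend a 12-bit index and append an 8-bit CRC over index + payload."""
    index_bits = [(index >> i) & 1 for i in range(11, -1, -1)]
    body = index_bits + list(payload_bits)
    return body + crc8(body)

def bits_to_dna(bits):
    """Map an even-length bit list to bases, 2 bits per base."""
    return "".join(BASES[2 * bits[i] + bits[i + 1]] for i in range(0, len(bits), 2))

def pick_from_list(candidates):
    """Return (index, payload) of the first candidate with a valid CRC, else None."""
    for cand in candidates:
        if len(cand) <= 20:          # too short to hold index + CRC
            continue
        body, check = cand[:-8], cand[-8:]
        if crc8(body) == check:
            index = int("".join(map(str, body[:12])), 2)
            return index, body[12:]
    return None

payload = [1, 0, 1, 1, 0, 0, 1, 0]
framed = frame(265, payload)
corrupted = framed.copy()
corrupted[15] ^= 1                   # top-ranked candidate has a bit error
print(pick_from_list([corrupted, framed]))
print(bits_to_dna(framed))
```

The CRC lets the decoder skip a top-ranked but wrong candidate and fall back to a lower-ranked one, and returning `None` when no candidate passes turns an undetected error into an erasure that an outer code can handle.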

SLIDE 27

Experiments and results

SLIDE 28

Experiments

Data: 11 KB of data: The Gettysburg Address, UN Declaration, “I Have a Dream” speech, poem collections, …

Final error correction code design:
Reed-Solomon outer code: 30% redundancy (default)
Pretrained model from the ONT Flappie basecaller
Synthesis: data synthesized using CustomArray synthesis, into oligos of length ~165

Experiments:
Rate of convolutional code: r = 1/2, 3/4, 5/6
Memory: m = 8, 11, 14
List size: 4, 8

SLIDE 29

Results

[Figure: writing cost (bases/bit) and reading cost (bases/bit) for convolutional codes with m=8, L=8; m=11, L=8; and m=14, L=4, at rates r = 5/6, 3/4, and 1/2, compared with previous works [6] and [22].]

[6] L. Organick et al., “Random access in large-scale DNA data storage,” Nature Biotechnology, vol. 36, no. 3, p. 242, 2018.
[22] R. Lopez et al., “DNA assembly for nanopore data storage readout,” Nature Communications, vol. 10, no. 1, p. 2933, 2019.

3x improvement in reading cost!

SLIDE 30

Conclusions and future work

Novel error-correction mechanism for nanopore-sequencing-based DNA storage
Use “soft information” from the raw signal to improve decoding
Use the neural net in the basecaller to distil information from the “hard-to-model” raw signal
Use convolutional codes that align nicely with the sequential nanopore model
Requires 3x fewer reads for decoding than previous works

SLIDE 31

Conclusions and future work

Novel error-correction mechanism for nanopore-sequencing-based DNA storage
Use “soft information” from the raw signal to improve decoding
Use the neural net in the basecaller to distil information from the “hard-to-model” raw signal
Use convolutional codes that align nicely with the sequential nanopore model
Requires 3x fewer reads for decoding than previous works
Future work:
Optimization of convolutional code and CRC parameters
Finetuning of neural network model and use of improved basecallers
Application to other novel synthesis methodologies

SLIDE 32

Thank You!

Code and data available at https://github.com/shubhamchandak94/nanopore_dna_storage