Coding for Optimized Writing Rate in DNA Storage (Siddharth Jain et al.)



slide-1
SLIDE 1

Coding for Optimized Writing Rate in DNA Storage

Siddharth Jain, Farzad Farnoud, Moshe Schwartz, Shuki Bruck

IEEE ISIT 2020

slide-2
SLIDE 2

DNA Storage

Information → DNA Synthesis (Writing) → Storage Medium (Multiple Strands of DNA) → DNA Sequencing (Reading) → Reconstruction

In this talk: DNA Synthesis (Writing)

slide-3
SLIDE 3

Current DNA Synthesis Systems

  • Slow
  • Expensive
slide-4
SLIDE 4

Terminator Free DNA Synthesis

(H. H. Lee, R. Kalhor, N. Goela, J. Bolot, and G. M. Church, “Terminator-free template-independent enzymatic DNA synthesis for digital information storage,” Nature Communications, vol. 10, no. 2383, pp. 1–12, 2019. )

  • Faster
  • Cheaper
  • Noisy
slide-7
SLIDE 7

Terminator Free DNA Synthesis Channel

(H. H. Lee, R. Kalhor, N. Goela, J. Bolot, and G. M. Church, “Terminator-free template-independent enzymatic DNA synthesis for digital information storage,” Nature Communications, vol. 10, no. 2383, pp. 1–12, 2019. )

Sequence: C C C C C A

Previous Symbol: A. Current Symbol: C.

Write Time: $t$

Noise (Sticky Insertion): the run length of the current symbol follows the distribution $D_{A \to C}(t)$.

slide-8
SLIDE 8

Terminator Free DNA Synthesis Channel

(H. H. Lee, R. Kalhor, N. Goela, J. Bolot, and G. M. Church, “Terminator-free template-independent enzymatic DNA synthesis for digital information storage,” Nature Communications, vol. 10, no. 2383, pp. 1–12, 2019. )

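Sequence to be synthesized: ACTAG

  • Start: A
  • Round 1: ACCC, run length drawn from $D_{A \to C}(t_1)$
  • Round 2: ACCCTT, run length drawn from $D_{C \to T}(t_2)$
  • Round 3: ACCCTTA, run length drawn from $D_{T \to A}(t_3)$
  • Round 4: ACCCTTAGGGGG, run length drawn from $D_{A \to G}(t_4)$

Length of run in each round is given by a distribution $D$ which depends on

  • previous symbol and current symbol
  • time of synthesis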
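The round-by-round behaviour described on this slide can be imitated with a small simulation. This is only a sketch under stated assumptions: the run length is modelled as Poisson with mean `rate[(prev, cur)] * t`, and the `rate` table, the function names and the chosen write times are all hypothetical, not values from the talk.

```python
import math
import random

def poisson(mean, rng):
    """Draw a Poisson sample with Knuth's method (adequate for small means)."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def synthesize(sequence, times, rate, rng):
    """Write `sequence` round by round. Each round appends a run of the
    current symbol whose length depends on (previous, current, write time)."""
    strand = sequence[0]  # assumption: the first symbol is already present
    for prev, cur, t in zip(sequence, sequence[1:], times):
        run = poisson(rate[(prev, cur)] * t, rng)  # 0 means the run is lost
        strand += cur * run
    return strand

# Hypothetical per-transition growth rates (symbols per unit of write time).
rate = {(a, b): 1.5 for a in "ACGT" for b in "ACGT" if a != b}
rng = random.Random(2020)
print(synthesize("ACTAG", [2, 1, 1, 3], rate, rng))
```

Because the sampled run length can be zero, a whole run may go missing from a strand, which is one reason several copies of the same strand are useful.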
slide-9
SLIDE 9

Approach

(H. H. Lee, R. Kalhor, N. Goela, J. Bolot, and G. M. Church, “Terminator-free template-independent enzymatic DNA synthesis for digital information storage,” Nature Communications, vol. 10, no. 2383, pp. 1–12, 2019. )

ACCCTTAGGGGG → ACTAG

Forget Runs and Encode Information in Transitions

Rate: $\log_2 3$. Can we do better?
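As a minimal sketch of the baseline on this slide, the snippet below drops the run lengths and keeps only the transitions, then evaluates the logarithm of the number of choices per transition. The function names are illustrative, and reading the rate as "bits per round" is an assumption about the slide's units.

```python
import math
from itertools import groupby

def collapse_runs(strand):
    """Keep one symbol per run, so only the transitions remain."""
    return "".join(symbol for symbol, _ in groupby(strand))

def bits_per_transition(alphabet_size=4):
    """Each round may move to any of the other alphabet_size - 1 symbols."""
    return math.log2(alphabet_size - 1)

print(collapse_runs("ACCCTTAGGGGG"))    # ACTAG
print(round(bits_per_transition(), 3))  # 1.585
```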

slide-10
SLIDE 10

Precision Resolution (PR) Framework

(M. Schwartz and J. Bruck, “On the capacity of the precision-resolution system,” IEEE Trans. Inform. Theory, vol. 56, no. 3, pp. 1028–1037, 2010. )

0100010010010101

  • Information encoded in length of runs of 0’s.
  • Clock frequency mismatch at Tx and Rx can result in erroneous measurement of run lengths.
  • PR framework provides an optimal set of run lengths that can be recovered without any error.

slide-11
SLIDE 11

Precision Resolution (PR) Framework

  • Assumptions:
  • 1. Run-length noise is independent of the location of the run.
  • 2. Noisy run lengths have finite support.

PR framework cannot be directly applied to the Terminator Free DNA Synthesis Channel.

Why?

slide-12
SLIDE 12

Memory

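  • Round 1: ACCC, run length drawn from $D_{A \to C}(t_1)$
  • Round 2: ACCCTT, run length drawn from $D_{C \to T}(t_2)$
  • Round 3: ACCCTTA, run length drawn from $D_{T \to A}(t_3)$
  • Round 4: ACCCTTAGGGGG, run length drawn from $D_{A \to G}(t_4)$

Distribution $D$ depends on the previous symbol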

slide-13
SLIDE 13

Quantization Error

  • Distribution $D$ doesn’t have a finite support.
  • The quantizer may have an error in detecting the round duration.
  • We assume this error to be $\le \delta$.
slide-14
SLIDE 14

Multiple Copies

  • Multiple DNA strings can be synthesized for the same user information.
  • They can be used to improve the overall scheme.
slide-15
SLIDE 15

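[Graph $G$: vertices A, C, G, T; each directed edge is labeled with its allowed round times]

$t_{A \to C} = \{1, 2\}$, $t_{A \to G} = \{1, 3\}$, $t_{A \to T} = \{1, 3\}$
$t_{C \to A} = \{1, 2\}$, $t_{C \to G} = \{1, 2\}$, $t_{C \to T} = \{1, 3\}$
$t_{G \to A} = \{1, 2\}$, $t_{G \to C} = \{1, 3\}$, $t_{G \to T} = \{1, 2\}$
$t_{T \to A} = \{1, 4\}$, $t_{T \to C} = \{1, 2\}$, $t_{T \to G} = \{1, 4\}$

$S(G)$

Encode Information in round times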

slide-16
SLIDE 16

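Convert $G$ to a simple graph $G'$

Add auxiliary vertices: the edge A → T of weight 3 becomes the path A → d1 → d2 → T, with three edges of weight 1.

Perron-Frobenius Theory: $\mathrm{cap}(S(G)) = \log_2 \lambda(A_{G'})$

$A_{G'}$: adjacency matrix of $G'$; $\lambda(A_{G'})$: maximum eigenvalue of $A_{G'}$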
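The conversion and the Perron-Frobenius step can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function names are made up, the eigenvalue is estimated by plain power iteration (which assumes the expanded graph is strongly connected and aperiodic), and the table of round times in the usage lines is hypothetical.

```python
import math

def ordinary_graph(times):
    """Replace each edge of duration t by a path of t unit edges,
    inserting t - 1 auxiliary vertices (none when t == 1)."""
    index = {}

    def vertex(name):
        # Assign consecutive integer ids to vertices on first use.
        return index.setdefault(name, len(index))

    edges = []
    for (src, dst), durations in times.items():
        for t in durations:
            prev = vertex(src)
            for step in range(1, t):
                aux = vertex((src, dst, t, step))
                edges.append((prev, aux))
                prev = aux
            edges.append((prev, vertex(dst)))
    return len(index), edges

def spectral_radius(n, edges, iterations=2000):
    """Estimate the largest eigenvalue of the adjacency matrix by power iteration."""
    vec = [1.0] * n
    growth = 1.0
    for _ in range(iterations):
        nxt = [0.0] * n
        for a, b in edges:
            nxt[a] += vec[b]
        growth = max(nxt)
        vec = [x / growth for x in nxt]
    return growth

# Hypothetical table: every transition may last 1 or 2 time units.
symbols = "ACGT"
times = {(a, b): [1, 2] for a in symbols for b in symbols if a != b}
n, edges = ordinary_graph(times)
print(math.log2(spectral_radius(n, edges)))
```

For this symmetric table the eigenvalue satisfies λ² = 3λ + 3, which gives a capacity of about 1.92 bits per unit of synthesis time in the noise-free case.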

slide-17
SLIDE 17

Framework Description

  • Maximal round time decoding error ($\delta > 0$)
  • Maximal round time ($M$)
  • Allowable round times for a given transition $b \to a$:

$1 \le t_{b \to a}^{(1)} < t_{b \to a}^{(2)} < \cdots < t_{b \to a}^{(\ell)} \le M$

slide-18
SLIDE 18

Example

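[Graph $G$: vertices A, C, G, T; each directed edge is labeled with its allowed round times]

$t_{A \to C} = \{1, 2\}$, $t_{A \to G} = \{1, 3\}$, $t_{A \to T} = \{1, 3\}$
$t_{C \to A} = \{1, 2\}$, $t_{C \to G} = \{1, 2\}$, $t_{C \to T} = \{1, 3\}$
$t_{G \to A} = \{1, 2\}$, $t_{G \to C} = \{1, 3\}$, $t_{G \to T} = \{1, 2\}$
$t_{T \to A} = \{1, 4\}$, $t_{T \to C} = \{1, 2\}$, $t_{T \to G} = \{1, 4\}$

$S(G)$

$\ell = 2$, $M = 4$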

slide-19
SLIDE 19

Framework Description

  • Maximal round time decoding error ($\delta > 0$)
  • Maximal round time ($M$)
  • Allowable round times for a given transition $b \to a$:

$1 \le t_{b \to a}^{(1)} < t_{b \to a}^{(2)} < \cdots < t_{b \to a}^{(\ell)} \le M$

  • Number of copies ($N$)
  • Quantizing Function $\mathbb{Q}_{b \to a}: \mathbb{N}^N \to [\ell]$

Receiver: $r_1, r_2, \ldots, r_N$ s.t. $r_j \sim D_{b \to a}(t_{b \to a}^{(i)})$

$\mathbb{Q}(r_1, r_2, \ldots, r_N) = \hat{i}$ s.t. $\Pr(\hat{i} = i \mid i) \ge 1 - \delta$.

slide-20
SLIDE 20

Candidate distributions: $D_{C \to G}(1)$, $D_{C \to G}(2)$

Receiver: CGGG, CGGGGG, CGGGG, CGGGGGG, CGGGGG

Quantizing Function

$t_{C \to G} = \{1, 2\}$
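One possible quantizer for this example is a maximum-likelihood choice over the allowed round times. The sketch below is an assumption-laden illustration rather than the quantizer from the talk: it models run lengths as Poisson with mean `rate * t`, and the `rate` value and function names are hypothetical.

```python
import math

def poisson_log_pmf(r, mean):
    """Log-probability of observing run length r under Poisson(mean)."""
    return r * math.log(mean) - mean - math.lgamma(r + 1)

def quantize(run_lengths, allowed_times, rate=3.0):
    """Return the 1-based index of the allowed round time that best
    explains the run lengths observed across all copies."""
    def score(t):
        return sum(poisson_log_pmf(r, rate * t) for r in run_lengths)

    best = max(range(len(allowed_times)), key=lambda j: score(allowed_times[j]))
    return best + 1

# The five received copies on the slide; the run of G follows a single C.
copies = ["CGGG", "CGGGGG", "CGGGG", "CGGGGGG", "CGGGGG"]
run_lengths = [len(c) - 1 for c in copies]
print(quantize(run_lengths, [1, 2]))  # prints 2 under these assumptions
```

Using more copies sharpens the decision, since the log-likelihoods of the copies add up.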

slide-21
SLIDE 21

Encoder: $m \in \{0,1\}^k$ → State Splitting encoder → $d \in [\ell]^s$ → Error Correction Coding → Terminator Free DNA Synthesis Channel ($N$ copies)

Decoder: Multiple Sequence Alignment + Quantizing Function $\mathbb{Q}$ → Decoding for ECC → $\hat{d} \in [\ell]^s$ → State Splitting decoder → $\hat{m} \in \{0,1\}^k$

  • $d \in [\ell]^s$: $s$ rounds where for each round there are $\ell$ possible round times
  • Error Correction Coding: because of the $\delta$ error introduced by the channel

slide-22
SLIDE 22

Theorem

Let $G'$ be the ordinary version of $G$. Further assume the $k$ user information bits are i.i.d. uniform random bits. Then for all large enough $k$, the user information bits may be encoded into a sequence using at most

$$\frac{k}{\mathrm{cap}(S(G))}\left(1 + \alpha\left(\frac{1}{C_{\delta,\ell}} - 1\right)\log_{\lambda(A_{G'})}(\ell)\right)$$

synthesis time, and be decoded correctly with high probability. Here $\alpha$ is the sum of probabilities of non-auxiliary vertices in the stationary distribution of the max-entropic Markov chain and

$$C_{\delta,\ell} \triangleq 1 + \delta\log_\ell\frac{\delta}{\ell - 1} + (1 - \delta)\log_\ell(1 - \delta).$$
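To get a feel for the bound, the following sketch evaluates it numerically. It assumes the theorem's quantities are read as: quantizer error δ, number of round times ℓ, largest eigenvalue λ of the expanded graph's adjacency matrix, capacity cap, and stationary weight α of the non-auxiliary vertices. The function names and the numbers in the usage line are hypothetical.

```python
import math

def symmetric_capacity(delta, ell):
    """1 + delta*log_ell(delta/(ell-1)) + (1-delta)*log_ell(1-delta)."""
    if delta == 0:
        return 1.0  # a noiseless quantizer needs no redundancy
    return (1
            + delta * math.log(delta / (ell - 1), ell)
            + (1 - delta) * math.log(1 - delta, ell))

def synthesis_time_bound(k, cap, alpha, delta, ell, lam):
    """Evaluate (k / cap) * (1 + alpha * (1/C - 1) * log_lam(ell))."""
    overhead = alpha * (1 / symmetric_capacity(delta, ell) - 1) * math.log(ell, lam)
    return k / cap * (1 + overhead)

# Hypothetical inputs, chosen only to exercise the formula.
print(synthesis_time_bound(k=1000, cap=1.9, alpha=0.7, delta=0.02, ell=2, lam=3.7))
```

With δ = 0 the overhead term vanishes and the bound reduces to k / cap.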

slide-23
SLIDE 23

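Encoder: $m \in \{0,1\}^k$ → State Splitting encoder → $d \in [\ell]^s$ → Error Correction Coding → Terminator Free DNA Synthesis Channel ($N$ copies)

Decoder: Multiple Sequence Alignment + Quantizing Function $\mathbb{Q}$ → Decoding for ECC → $\hat{d} \in [\ell]^s$ → State Splitting decoder → $\hat{m} \in \{0,1\}^k$

Synthesis time:

$$\frac{k}{\mathrm{cap}(S(G))}\left(1 + \alpha\left(\frac{1}{C_{\delta,\ell}} - 1\right)\log_{\lambda(A_{G'})}(\ell)\right)$$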

slide-24
SLIDE 24

Experimental Results – Binomial Run lengths

$N = 5$

$\delta = 0.02$

slide-25
SLIDE 25

Poisson Run Lengths

$\lambda$ for different values of $N$

Achievable Rates for different values of $\delta$

slide-26
SLIDE 26

Conclusion

  • Method for encoding information in DNA sequences based on the PR framework and the terminator-free DNA synthesis method.
  • Rate above $\log_2 3$ can be achieved.
  • Method accounts for quantizer error $\delta$.
  • Provided method for designing quantizers for Binomial and Poisson run lengths.
  • As we have multiple copies, alignment can be used to account for deletion of runs.

slide-27
SLIDE 27

Thank You. Questions

sidjain@caltech.edu