SLIDE 1

Coding for Optimized Writing Rate in DNA Storage

Siddharth Jain, Farzad Farnoud, Moshe Schwartz, Shuki Bruck

IEEE ISIT 2020

SLIDE 2

DNA Storage

Information → DNA Synthesis (Writing) → Storage Medium (Multiple Strands of DNA) → DNA Sequencing (Reading) → Reconstruction

In this talk: DNA Synthesis (Writing)

SLIDE 3

Current DNA Synthesis Systems

Slow
Expensive

SLIDE 4

Terminator Free DNA Synthesis

(H. H. Lee, R. Kalhor, N. Goela, J. Bolot, and G. M. Church, “Terminator-free template-independent enzymatic DNA synthesis for digital information storage,” Nature Communications, vol. 10, no. 2383, pp. 1–12, 2019. )

Faster
Cheaper
Noisy

SLIDE 5

Terminator Free DNA Synthesis Channel

Sequence written: A CCCCC
Previous symbol: A. Current symbol: C.
Noise (sticky insertion): the run length of the current symbol follows a distribution $D$.
Write time: $t$

SLIDE 6

Terminator Free DNA Synthesis Channel

Sequence written: A CCCCC
Previous symbol: A. Current symbol: C.
Noise (sticky insertion): the run length of the current symbol follows a distribution $D_{A \to C}$.
Write time: $t$

SLIDE 7

Terminator Free DNA Synthesis Channel

Sequence written: A CCCCC
Previous symbol: A. Current symbol: C.
Noise (sticky insertion): the run length of the current symbol follows a distribution $D_{A \to C}(t)$.
Write time: $t$

SLIDE 8

Terminator Free DNA Synthesis Channel

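Sequence to be synthesized: ACTAG

Start: A
Round 1: ACCC, run length drawn from $D_{A \to C}(t_1)$
Round 2: ACCCTT, run length drawn from $D_{C \to T}(t_2)$
Round 3: ACCCTTA, run length drawn from $D_{T \to A}(t_3)$
Round 4: ACCCTTAGGGGG, run length drawn from $D_{A \to G}(t_4)$

Length of run in each round is given by a distribution $D$ which depends on
the transition (previous symbol → current symbol)
the time of synthesis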
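The round-by-round behaviour above can be imitated with a small simulation. This is only a sketch: the noise model (run length equal to 1 plus a Poisson count whose mean is `rate` times the round time), the `rate` value, and the function names are assumptions made for illustration, not the measured distributions of the synthesis chemistry. In particular, the sketch ignores the dependence on the previous symbol and never deletes a run.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw one Poisson(mean) sample with Knuth's multiplication method."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def synthesize(target, round_times, rate=2.0, seed=0):
    """Simulate one strand: the first symbol is written once, then each
    round appends a run of the next symbol whose length grows with time."""
    if len(round_times) != len(target) - 1:
        raise ValueError("need exactly one round time per transition")
    rng = random.Random(seed)
    strand = target[0]
    for symbol, t in zip(target[1:], round_times):
        # Assumed noise model: 1 + Poisson(rate * t), so every run survives.
        run_length = 1 + sample_poisson(rate * t, rng)
        strand += symbol * run_length
    return strand


strand = synthesize("ACTAG", round_times=[1, 1, 1, 2])
print(strand)
```

Longer round times tend to produce longer runs, which is what later allows information to be carried by the round times themselves.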

SLIDE 9

Approach

(H. H. Lee, R. Kalhor, N. Goela, J. Bolot, and G. M. Church, “Terminator-free template-independent enzymatic DNA synthesis for digital information storage,” Nature Communications, vol. 10, no. 2383, pp. 1–12, 2019. )

ACCCTTAGGGGG → ACTAG

Forget Runs and Encode Information in Transitions

Rate: $\log_2 3$. Can we do better?
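A minimal sketch of this baseline, assuming the decoder simply collapses every run to a single symbol. Each round must switch to one of the three other nucleotides, so a transition carries at most log2(3) bits. The function names are illustrative.

```python
import math
from itertools import groupby


def collapse_runs(strand):
    """Keep one symbol per run, discarding the (noisy) run lengths."""
    return "".join(symbol for symbol, _ in groupby(strand))


def bits_per_transition(alphabet_size=4):
    """Each round picks any symbol except the previous one."""
    return math.log2(alphabet_size - 1)


print(collapse_runs("ACCCTTAGGGGG"))  # ACTAG
print(bits_per_transition())
```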

SLIDE 10

Precision Resolution (PR) Framework

(M. Schwartz and J. Bruck, “On the capacity of the precision-resolution system,” IEEE Trans. Inform. Theory, vol. 56, no. 3, pp. 1028–1037, 2010. )

0100010010010101

Information encoded in length of runs of 0’s.
Clock frequency mismatch at Tx and Rx can result in erroneous measurement of run lengths.
PR framework provides an optimal set of run lengths that can be recovered without any error.
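To make the run-length view concrete, the sketch below reads the example string as runs of 0's, each terminated by a 1, and converts between the two representations. The framing convention (every run ends with a single 1) and the function names are assumptions for illustration. The selection of an optimal set of run lengths is not shown.

```python
def zero_run_lengths(bits):
    """Lengths of the runs of 0's, each run terminated by a '1'."""
    runs = bits.split("1")
    if runs and runs[-1] == "":
        runs.pop()  # the string ended with a terminating '1'
    return [len(run) for run in runs]


def from_run_lengths(lengths):
    """Inverse mapping: write each run of 0's followed by a '1'."""
    return "".join("0" * n + "1" for n in lengths)


print(zero_run_lengths("0100010010010101"))  # [1, 3, 2, 2, 1, 1]
```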

SLIDE 11

Precision Resolution (PR) Framework

Assumptions:
1. Run Length noise is independent of the location of the run.
2. Noisy Run Lengths have a finite support.

The PR framework cannot be directly applied to the Terminator Free DNA Synthesis Channel.

Why?

SLIDE 12

Memory

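Round 1: ACCC, run length drawn from $D_{A \to C}(t_1)$
Round 2: ACCCTT, run length drawn from $D_{C \to T}(t_2)$
Round 3: ACCCTTA, run length drawn from $D_{T \to A}(t_3)$
Round 4: ACCCTTAGGGGG, run length drawn from $D_{A \to G}(t_4)$

Distribution $D$ depends on the previous symbol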

SLIDE 13

Quantization Error

Distribution $D$ doesn’t have a finite support.
The quantizer may have an error in detecting the round duration.
We assume this error to be $\le \delta$.

SLIDE 14

Multiple Copies

Multiple DNA strings can be synthesized for the same user information.
They can be used to improve the overall scheme.

SLIDE 15

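Graph $G$ on the vertices A, C, G, T; each edge is labeled with its allowable round times:

$t_{A \to C} = \{1, 2\}$, $t_{A \to G} = \{1, 3\}$, $t_{A \to T} = \{1, 3\}$
$t_{C \to A} = \{1, 2\}$, $t_{C \to G} = \{1, 2\}$, $t_{C \to T} = \{1, 3\}$
$t_{G \to A} = \{1, 2\}$, $t_{G \to C} = \{1, 3\}$, $t_{G \to T} = \{1, 2\}$
$t_{T \to A} = \{1, 4\}$, $t_{T \to C} = \{1, 2\}$, $t_{T \to G} = \{1, 4\}$

$S(G)$

Encode Information in round times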

SLIDE 16

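Convert $G$ to a simple graph $G'$

Add auxiliary vertices: an edge A → T with round time 3 becomes the path A → d1 → d2 → T, where every edge takes time 1.

Perron-Frobenius Theory: $\mathrm{cap}(S(G)) = \log_2 \lambda(A_{G'})$

$A_{G'}$: adjacency matrix of $G'$; $\lambda(A_{G'})$: maximum eigenvalue of $A_{G'}$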
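The capacity computation can be sketched numerically: expand every edge of duration t into t unit-time edges through auxiliary vertices, then estimate the largest eigenvalue of the resulting adjacency matrix by power iteration. The round-time table below is a reading of the garbled example graph (the assignment of sets to nucleotide pairs is an assumption), and the helper names are illustrative.

```python
import math

# Allowable round times per transition (assumed reading of the example graph).
ROUND_TIMES = {
    ("A", "C"): (1, 2), ("A", "G"): (1, 3), ("A", "T"): (1, 3),
    ("C", "A"): (1, 2), ("C", "G"): (1, 2), ("C", "T"): (1, 3),
    ("G", "A"): (1, 2), ("G", "C"): (1, 3), ("G", "T"): (1, 2),
    ("T", "A"): (1, 4), ("T", "C"): (1, 2), ("T", "G"): (1, 4),
}


def expand_to_unit_edges(round_times):
    """Replace each edge of duration t by a path of t unit-time edges."""
    edges = []
    aux_count = 0
    for (src, dst), times in round_times.items():
        for t in times:
            prev = src
            for _ in range(t - 1):
                aux = f"aux{aux_count}"  # auxiliary vertex
                aux_count += 1
                edges.append((prev, aux))
                prev = aux
            edges.append((prev, dst))
    return edges


def max_eigenvalue(edges, iterations=2000):
    """Power iteration on the adjacency matrix given as an edge list."""
    nodes = {v for edge in edges for v in edge}
    x = {v: 1.0 / len(nodes) for v in nodes}
    growth = 0.0
    for _ in range(iterations):
        y = dict.fromkeys(nodes, 0.0)
        for u, v in edges:
            y[u] += x[v]  # y = A x
        growth = sum(y.values())  # x sums to 1, so this estimates the eigenvalue
        x = {v: y[v] / growth for v in nodes}
    return growth


capacity = math.log2(max_eigenvalue(expand_to_unit_edges(ROUND_TIMES)))
print(round(capacity, 3), "bits per unit of synthesis time")
```

Because the expanded graph contains every unit-time transition plus extra, slower edges, the resulting capacity should come out above log2(3), the rate of the transitions-only scheme.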

SLIDE 17

Framework Description

Maximal round time decoding error ($\delta > 0$)
Maximal round time ($M$)
Allowable round times for a given transition $b \to a$:

$1 \le t_{b \to a}^{(1)} < t_{b \to a}^{(2)} < \cdots < t_{b \to a}^{(\ell)} \le M$

SLIDE 18

Example

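Graph $G$ on the vertices A, C, G, T; each edge is labeled with its allowable round times:

$t_{A \to C} = \{1, 2\}$, $t_{A \to G} = \{1, 3\}$, $t_{A \to T} = \{1, 3\}$
$t_{C \to A} = \{1, 2\}$, $t_{C \to G} = \{1, 2\}$, $t_{C \to T} = \{1, 3\}$
$t_{G \to A} = \{1, 2\}$, $t_{G \to C} = \{1, 3\}$, $t_{G \to T} = \{1, 2\}$
$t_{T \to A} = \{1, 4\}$, $t_{T \to C} = \{1, 2\}$, $t_{T \to G} = \{1, 4\}$

$S(G)$

$\ell = 2$, $M = 4$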

SLIDE 19

Framework Description

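Maximal round time decoding error ($\delta > 0$)
Maximal round time ($M$)
Allowable round times for a given transition $b \to a$:

$1 \le t_{b \to a}^{(1)} < t_{b \to a}^{(2)} < \cdots < t_{b \to a}^{(\ell)} \le M$

Number of copies ($N$)
Quantizing Function $\mathbb{Q}_{b \to a}: \mathbb{N}^N \to [\ell]$

Receiver: $r_1, r_2, \ldots, r_N$ s.t. $r_j \sim D_{b \to a}(t_{b \to a}^{(i)})$

$\mathbb{Q}(r_1, r_2, \ldots, r_N) = \hat{i}$ s.t. $\Pr(\hat{i} = i \mid i) \ge 1 - \delta$.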

SLIDE 20

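Run-length distributions for the two allowable round times: $D_{C \to G}(1)$ and $D_{C \to G}(2)$

Received copies: CGGG, CGGGGG, CGGGG, CGGGGGG, CGGGGG

Receiver → Quantizing Function

$t_{C \to G} = \{1, 2\}$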
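One way such a quantizing function could look is a maximum-likelihood choice among the allowable round times. The sketch assumes Poisson run lengths with mean `rate * t`; the `rate` value of 2.5 and the function name are hypothetical, chosen only so that the five copies above can be classified. This is not necessarily the quantizer designed in the talk.

```python
import math


def quantize(run_lengths, round_times, rate):
    """Return the 1-based index of the most likely round time,
    assuming each run length is Poisson with mean rate * t."""
    total = sum(run_lengths)
    copies = len(run_lengths)

    def log_likelihood(t):
        mean = rate * t
        # Terms that do not depend on t (the factorials) are dropped.
        return total * math.log(mean) - copies * mean

    best = max(range(len(round_times)),
               key=lambda i: log_likelihood(round_times[i]))
    return best + 1


copies = ["CGGG", "CGGGGG", "CGGGG", "CGGGGGG", "CGGGGG"]
run_lengths = [strand.count("G") for strand in copies]  # [3, 5, 4, 6, 5]
print(quantize(run_lengths, round_times=(1, 2), rate=2.5))  # 2
```

With more copies, the summed run length concentrates around its mean, so the probability of picking the wrong index shrinks.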

SLIDE 21

State Splitting encoder

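Encoding: $m \in \{0,1\}^k$ → State Splitting encoder → $d \in [\ell]^s$ → Error Correction Coding → Terminator Free DNA Synthesis Channel

Decoding: $N$ copies → Multiple Sequence Alignment + Quantizing Function $\mathbb{Q}$ → Decoding for ECC → $\hat{d} \in [\ell]^s$ → State Splitting decoder → $m \in \{0,1\}^k$

$d \in [\ell]^s$: $s$ rounds where for each round there are $\ell$ possible round times.
Error Correction Coding: because of the $\delta$ error introduced by the channel.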

SLIDE 22

Theorem

Let $G'$ be the ordinary version of $G$. Further assume the $k$ user information bits are i.i.d. uniform random bits. Then for all large enough $k$, the user information bits may be encoded into a sequence using at most

$$\frac{k}{\mathrm{cap}(S(G))}\left(1 + \alpha\left(\frac{1}{C_{\delta,\ell}} - 1\right)\log_{(\cdot)}(\ell)\right)$$

synthesis time, and be decoded correctly with high probability. (The base of the last logarithm is illegible in the source.) Here $\alpha$ is the sum of probabilities of non-auxiliary vertices in the stationary distribution of the max-entropic Markov chain and

$$C_{\delta,\ell} \triangleq 1 + \delta \log_\ell \frac{\delta}{\ell - 1} + (1 - \delta)\log_\ell(1 - \delta).$$
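The quantity defined at the end of the theorem can be evaluated numerically. The sketch below reads it as 1 + δ·log_ℓ(δ/(ℓ−1)) + (1−δ)·log_ℓ(1−δ), which is the capacity of an ℓ-ary symmetric channel with error probability δ; that reading of the garbled slide formula is an assumption, and the function name is illustrative. The full synthesis-time bound is not computed, because the base of its final logarithm cannot be recovered from the slide.

```python
import math


def symmetric_channel_capacity(delta, ell):
    """Capacity, in ell-ary symbols per use, of an ell-ary symmetric
    channel whose total error probability is delta."""
    if ell < 2 or not 0 <= delta < 1:
        raise ValueError("need ell >= 2 and 0 <= delta < 1")
    if delta == 0:
        return 1.0  # noiseless channel; avoids log(0)
    return (1
            + delta * math.log(delta / (ell - 1), ell)
            + (1 - delta) * math.log(1 - delta, ell))


c = symmetric_channel_capacity(0.02, 2)
print(c, 1 / c - 1)  # capacity and the redundancy factor (1/C - 1)
```

For ℓ = 2 the expression reduces to one minus the binary entropy of δ, so a small δ costs only a modest amount of redundancy.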

SLIDE 23

State Splitting encoder

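Encoding: $m \in \{0,1\}^k$ → State Splitting encoder → $d \in [\ell]^s$ → Error Correction Coding → Terminator Free DNA Synthesis Channel

Decoding: $N$ copies → Multiple Sequence Alignment + Quantizing Function $\mathbb{Q}$ → Decoding for ECC → $\hat{d} \in [\ell]^s$ → State Splitting decoder → $m \in \{0,1\}^k$

Synthesis time: $\frac{k}{\mathrm{cap}(S(G))}\left(1 + \alpha\left(\frac{1}{C_{\delta,\ell}} - 1\right)\log_{(\cdot)}(\ell)\right)$ (the base of the last logarithm is illegible in the source)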

SLIDE 24

Experimental Results – Binomial Run lengths

$N = 5$

$\delta = 0.02$

SLIDE 25

Poisson Run Lengths

$\lambda$ for different values of $N$

Achievable Rates for different values of $\delta$

SLIDE 26

Conclusion

Method for encoding information in DNA sequences based on PR framework and terminator free DNA synthesis method.
Rate above $\log_2 3$ can be achieved.
Method accounts for quantizer error $\delta$.
Provided method for designing quantizers for Binomial and Poisson run lengths.
As we have multiple copies, alignment can be used to account for deletion of runs.

SLIDE 27

Thank You. Questions

sidjain@caltech.edu