Reducing Energy Usage Through a Novel File Synchronization Algorithm




SLIDE 1

Reducing Energy Usage Through a Novel File Synchronization Algorithm

Frederic Sala
LORIS Lab, UCLA

Joint work with:

  • Nicolas Bitouzé (UCLA),
  • Clayton Schoeny (UCLA),
  • S. M. Sadegh Tabatabaei Yazdi (Qualcomm),
  • Lara Dolecek (UCLA)

Laboratory for Robust Information Systems (LORIS)
Department of Electrical Engineering, UCLA

1 / 23

SLIDE 2

Motivation

Combined data center electricity usage is already at 1.5% of all electricity used in the world.

  • J. Koomey, “Growth in data center electricity use 2005 to 2010”, 2011.

2 / 23

SLIDE 3

Motivation

Combined data center electricity usage is already at 1.5% of all electricity used in the world.

  • J. Koomey, “Growth in data center electricity use 2005 to 2010”, 2011.

A major contributing factor: large data storage requirements. In part, these requirements are due to the unnecessary storage of superfluous data:

2 / 23

SLIDE 4

Motivation

Combined data center electricity usage is already at 1.5% of all electricity used in the world.

  • J. Koomey, “Growth in data center electricity use 2005 to 2010”, 2011.

A major contributing factor: large data storage requirements. In part, these requirements are due to the unnecessary storage of superfluous data:

  • Multiple copies of the same file.
  • Multiple versions of a file.

2 / 23

SLIDE 5

Reducing Data Storage Demand

When files are identical, we can use deduplication tools.
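
A minimal Python sketch of exact-match deduplication, assuming files are available as in-memory byte strings. The function name `find_duplicates`, the sample file names, and the choice of SHA-256 are illustrative assumptions and are not taken from the slides or from any particular deduplication tool.

```python
import hashlib
from collections import defaultdict


def find_duplicates(files):
    """Group file names whose contents are byte-for-byte identical.

    `files` maps a (hypothetical) file name to its contents as bytes.
    Returns a list of groups, each a sorted list of names sharing one digest.
    """
    by_digest = defaultdict(list)
    for name, data in files.items():
        # Identical contents always produce the same digest.
        by_digest[hashlib.sha256(data).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical sample data: two identical files and one near-copy.
files = {
    "a.txt": b"hello world",
    "b.txt": b"hello world",
    "c.txt": b"hello world!",
}
groups = find_duplicates(files)
print(groups)
```

Only `a.txt` and `b.txt` are grouped. The near-copy `c.txt` differs by a single byte, so its digest is different and exact matching does not detect it.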

3 / 23

SLIDE 6

Reducing Data Storage Demand

When files are identical, we can use deduplication tools.

What if files are similar, but not identical?

3 / 23

SLIDE 7

Reducing Data Storage Demand

When files are identical, we can use deduplication tools.

What if files are similar, but not identical?

Synchronization from Insertions and Deletions Under a Non-Binary, Non-Uniform Source

Nicolas Bitouzé and Lara Dolecek
Electrical Engineering Department
University of California, Los Angeles
Los Angeles, USA
Email: bitouze@ucla.edu, dolecek@ee.ucla.edu

Abstract: We study the problem of synchronizing two files $X$ and $Y$ at two distant nodes $A$ and $B$ that are connected through a two-way communication channel. We assume that file $Y$ at node $B$ is obtained from file $X$ at node $A$ by inserting and deleting a small fraction of symbols in $X$. More specifically, we consider the case where $X$ is a non-binary non-uniform string, and deletions and insertions happen uniformly with rates $\beta_d$ and $\beta_i$, respectively. We propose a synchronization protocol between node $A$ and node $B$ that needs to transmit $O\left(C_X (\beta_d + \beta_i)\, n \log \frac{1}{\beta_d + \beta_i}\right)$ bits (where $n$ is the length of $X$ and $C_X$ is a constant that depends on the statistical properties of $X$) and reconstructs $X$ at node $B$ with error probability exponentially low in $n$. This protocol readily generalizes the recent result by Tabatabaei Yazdi and Dolecek that dealt with synchronization from binary uniform source and under only deletion errors.
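
To get a feel for how the bound in the abstract scales, the Python sketch below evaluates the term (βd + βi) · n · log(1/(βd + βi)) for a few edit rates and compares it with the size of the whole file. It is an illustration only. The constant C_X is not given in this excerpt, so `c_x=1.0` is a placeholder. The base-2 logarithm, the file length, the 8-bit symbols and the edit rates are all assumptions. Big-O notation hides constant factors, so the printed values show the trend and not an actual number of transmitted bits.

```python
import math


def sync_cost_scaling(n, beta_d, beta_i, c_x=1.0):
    """Evaluate c_x * (beta_d + beta_i) * n * log2(1 / (beta_d + beta_i)).

    This is only the expression inside the O(.) of the abstract; c_x stands in
    for the unknown constant C_X and defaults to a placeholder value of 1.
    """
    beta = beta_d + beta_i
    if not 0.0 < beta < 1.0:
        raise ValueError("beta_d + beta_i must lie strictly between 0 and 1")
    return c_x * beta * n * math.log2(1.0 / beta)


n = 1_000_000          # assumed file length in symbols
q = 8                  # assumed bits per symbol (Q = 2**q = 256)
full_file_bits = n * q

for beta_d, beta_i in [(0.0005, 0.0005), (0.005, 0.005), (0.05, 0.05)]:
    term = sync_cost_scaling(n, beta_d, beta_i)
    print(f"beta_d+beta_i={beta_d + beta_i:.3f}: "
          f"scaling term ~ {term:,.0f} vs full file {full_file_bits:,} bits")
```

For small edit rates the term grows almost linearly in βd + βi, which is why a protocol with this cost can be far cheaper than resending the file when only a small fraction of symbols changed.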
I. INTRODUCTION

Motivated by the pervasive use of file synchronization in modern data storage technologies, in this work we seek to develop a synchronization protocol that is more efficient than the existing algorithms. In particular, the popular RSYNC method can be in general very inefficient and the number of transmitted bits can be exponentially larger than the optimal number. Our starting point is an information-theoretically oriented scheme recently developed in [1]. In [1], a synchronization protocol that synchronizes an altered copy of the binary file with the original version of the file was proposed. In this scheme, the owner of the altered file requests additional information from the owner of the original file to ensure proper synchronization. It was assumed that the altered copy was obtained from the original copy by i.i.d. deletions at the bit-level and that the original file was generated from an i.i.d. uniform binary source. It was then shown that the rate of the proposed scheme asymptotically matches the optimal rate for this channel, developed earlier in [2]. That is, in the scheme of [1], the number of bits needed to synchronize two files can be kept very small while achieving exponentially low probability of error.

There are many practical scenarios where the files cannot be modeled as binary and uniform. For example, a file is usually not structured by bits, but by bytes or by even longer atomic elements. If the source is a text file, not only are some characters more frequent than others, but there is a large autocorrelation within the file. Additionally, some symbols may be inserted as well as deleted. As a result, our objective is to suitably generalize the scheme in [1], while maintaining low cost of transmission and low error of mis-synchronization. Specifically, our model encapsulates the following generalizations of the model in [1]:

1) We consider errors as being insertions or deletions instead of being restricted to deletions only,
2) We consider non-binary source symbols,
3) We allow the source symbols to have an arbitrary distribution; uniform distribution is then a special case.

The rest of the paper is organized as follows. In Section II we outline the overall synchronization protocol. Necessary notation and background results are presented in Section III. Two key components of our synchronization protocol, the matching module and the edit recovery module, are discussed in detail in Sections IV and V, respectively. Section VI concludes the paper.
II. THE SYNCHRONIZATION PROTOCOL

In [1], the following setup is considered: two distant nodes $A$ and $B$ are connected by a low-bandwidth high-latency network. $A$ contains a file $X$ which is a uniform i.i.d. binary string of length $n$, and $B$ contains a file $Y$ of length $n'$ that is obtained by deleting bits of $X$ independently with probability $\beta \ll 1$.

We consider a generalized setting in which the file $X = X_1, \ldots, X_n$ is i.i.d. on alphabet $\mathcal{X} = \{0, \ldots, Q-1\}$, where for all $1 \le t \le n$, $X_t$'s are distributed according to $\mu(x)$. For simplicity, we consider $Q$ to be a power of two, say $Q = 2^q$. Insertions and deletions occur respectively with probability $\beta_i$ and $\beta_d$. Let us define an edit pattern $E = E_1, \ldots, E_n$ as a string in $\{-1, 0, 1\}^n$ such that $Y$ is obtained from $X$ in the following way: for $t$ from 1 to $n$,

  • if $E_t = 0$, transmit $X_t$,
  • if $E_t = -1$, delete (do not transmit) $X_t$,
  • if $E_t = 1$, transmit $X_t$, then insert (transmit) a new symbol of $\mathcal{X}$ drawn with distribution $\mu(x)$.

For instance, consider $X$ and $Y$ defined over a quaternary alphabet, $X = 0\ 0_D\ 1\ 2\ 2\ 1\ 3\ 3_D\ 1\ 0$ and $Y = 0\ 1\ 2\ 0_I\ 2\ 3_I\ 1\ 0_I\ 3\ 1\ 0$. Here $Y$ is derived from $X$ by 2 deletions and 3 insertions where deleted (inserted) symbols are denoted by $D$ ($I$). The edit pattern is thus $E = (0, -1, 0, 1, 1, 1, 0, -1, 0, 0)$.

Node $B$ aims to synchronize its file $Y$ with the (original) file $X$ by requesting carefully chosen additional information from $A$
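
The edit-pattern rule above can be sketched in a few lines of Python. This is an illustration of the channel model only, not of the synchronization protocol itself. The function name `apply_edit_pattern` and the `draw` callback are hypothetical. In the model each inserted symbol is drawn from the source distribution µ(x); here the callback replays the three inserted symbols (0, 3, 0) read off the quaternary example so that the output can be compared with the Y given in the text.

```python
def apply_edit_pattern(x, e, draw):
    """Produce Y from X under the edit pattern E.

    x    : list of source symbols X_1..X_n
    e    : list of edits E_1..E_n, each in {-1, 0, 1}
    draw : zero-argument callable returning the next inserted symbol
           (in the model, a fresh draw from the source distribution)
    """
    if len(x) != len(e):
        raise ValueError("X and E must have the same length")
    y = []
    for x_t, e_t in zip(x, e):
        if e_t == -1:
            continue            # deletion: X_t is not transmitted
        y.append(x_t)           # E_t in {0, 1}: X_t is transmitted
        if e_t == 1:
            y.append(draw())    # insertion: a new symbol follows X_t
    return y


# Quaternary example from the text.
X = [0, 0, 1, 2, 2, 1, 3, 3, 1, 0]
E = [0, -1, 0, 1, 1, 1, 0, -1, 0, 0]
inserted = iter([0, 3, 0])      # inserted symbols as they appear in the example
Y = apply_edit_pattern(X, E, lambda: next(inserted))
print(Y)
```

The resulting Y has length 11, that is, the 10 symbols of X minus 2 deletions plus 3 insertions.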

3 / 23

SLIDE 8

Reducing Data Storage Demand

When files are identical, we can use deduplication tools.

What if files are similar, but not identical?

[Two versions of the first page of the paper "Synchronization from Insertions and Deletions Under a Non-Binary, Non-Uniform Source" by Nicolas Bitouzé and Lara Dolecek. The first version is the page shown on slide 7. The second differs from it only in wording: it reads "exponentially low error of mis-synchronization" where the first reads "exponentially low probability of error", and "X = X1, ..., Xn is an i.i.d. file on alphabet" where the first reads "the file X = X1, ..., Xn is i.i.d. on alphabet".]

3 / 23

SLIDE 9

Reducing Data Storage Demand

When files are identical, we can use deduplication tools.

What if files are similar, but not identical?

Synchronization from Insertions and Deletions Under a Non-Binary, Non-Uniform Source

Nicolas Bitouz´ e and Lara Dolecek Electrical Engineering Department University of California, Los Angeles Los Angeles, USA Email: bitouze@ucla.edu, dolecek@ee.ucla.edu Abstract—We study the problem of synchronizing two files X and Y at two distant nodes A and B that are connected through a two-way communication channel. We assume that file Y at node B is obtained from file X at node A by inserting and deleting a small fraction of symbols in X. More specifically, we consider the case where X is a non-binary non-uniform string, and deletions and insertions happen uniformly with rates βd and βi,
  • respectively. We propose a synchronization protocol between node
A and node B that needs to transmit O(CX(βd+βi)n log 1 βd+βi ) bits (where n is the length of X and CX is a constant that depends
  • n the statistical properties of X) and reconstructs X at node
B with error probability exponentially low in n. This protocol readily generalizes the recent result by Tabatabaei Yazdi and Dolecek that dealt with synchronization from binary uniform source and under only deletion errors.
  • I. INTRODUCTION
Motivated by the pervasive use of file synchronization in modern data storage technologies, in this work we seek to develop a synchronization protocol that is more efficient than the existing algorithms. In particular, the popular RSYNC method can be in general very inefficient and the number of transmitted bits can be exponentially larger than the optimal number. Our starting point is an information-theoretically oriented scheme recently developed in [1]. In [1], a synchronization protocol that synchronizes an altered copy of the binary file with the original version of the file was proposed. In this scheme, the owner of the altered file requests additional information from the owner of the original file to ensure proper
  • synchronization. It was assumed that the altered copy was
  • btained from the original copy by i.i.d. deletions at the bit-
level and that the original file was generated from an i.i.d. uniform binary source. It was then shown that the rate of the proposed scheme asymptotically matches the optimal rate for this channel, developed earlier in [2]. That is, in the scheme of [1], the number of bits needed to synchronize two files can be kept very small while achieving exponentially low probability
  • f error.
There are many practical scenarios where the files cannot be modeled as binary and uniform. For example, a file is usually not structured by bits, but by bytes or by even longer atomic elements. If the source is a text file, not only are some characters more frequent than others, but there is a large autocorrelation within the file. Additionally, some symbols may be inserted as well as deleted. As a result, our objective is to suitably generalize the scheme in [1], while maintaining low cost of transmission and low error of mis-synchronization. Specifically, our model encapsulates the following general- izations of the model in [1]: 1) We consider errors as being insertions or deletions instead
  • f being restricted to deletions only,
2) We consider non-binary source symbols, 3) We allow the source symbols to have an arbitrary distri- bution; uniform distribution is then a special case. The rest of the paper is organized as follows. In Section II we outline the overall synchronization protocol. Necessary notation and background results are presented in Section III. Two key components of our synchronization protocol, the matching module and the edit recovery module, are discussed in detail in Sections IV and V, respectively. Section VI concludes the paper.
  • II. THE SYNCHRONIZATION PROTOCOL
In [1], the following setup is considered: two distant nodes A and B are connected by a low-bandwidth high-latency
  • network. A contains a file X which is a uniform i.i.d. binary
string of length n, and B contains a file Y of length n0 that is
  • btained by deleting bits of X independently with probability
β ⌧ 1. We consider a generalized setting in which the file X = X1, . . . , Xn is i.i.d. on alphabet X = {0, . . . , Q 1}, where for all 1  t  n, Xt’s are distributed according to µ(x). For simplicity, we consider Q to be a power of two, say Q = 2q. Insertions and deletions occur respectively with probability βi and βd. Let us define an edit pattern E = E1, . . . , En as a string in {1, 0, 1}n such that Y is obtained from X in the following way: for t from 1 to n,
  • if Et = 0, transmit Xt,
  • if Et = 1, delete (do not transmit) Xt,
  • if Et = 1, transmit Xt, then insert (transmit) a new
symbol of X drawn with distribution µ(x). For instance, consider X and Y defined over a quaternary alphabet, X = 00 D122133 D10 and Y = 0120 I23 I10
  • I310. Here
Y is derived from X by 2 deletions and 3 insertions where deleted (inserted) symbols are denoted by D (I). The edit pattern is thus E = (0, 1, 0, 1, 1, 1, 0, 1, 0, 0). Node B aims to synchronize its file Y with the (original) file X by requesting carefully chosen additional information from A

Synchronization from Insertions and Deletions Under a Non-Binary, Non-Uniform Source

Nicolas Bitouz´ e and Lara Dolecek Electrical Engineering Department University of California, Los Angeles Los Angeles, USA Email: bitouze@ucla.edu, dolecek@ee.ucla.edu Abstract—We study the problem of synchronizing two files X and Y at two distant nodes A and B that are connected through a two-way communication channel. We assume that file Y at node B is obtained from file X at node A by inserting and deleting a small fraction of symbols in X. More specifically, we consider the case where X is a non-binary non-uniform string, and deletions and insertions happen uniformly with rates βd and βi,
  • respectively. We propose a synchronization protocol between node
A and node B that needs to transmit O(CX(βd+βi)n log 1 βd+βi ) bits (where n is the length of X and CX is a constant that depends
  • n the statistical properties of X) and reconstructs X at node
B with error probability exponentially low in n. This protocol readily generalizes the recent result by Tabatabaei Yazdi and Dolecek that dealt with synchronization from binary uniform source and under only deletion errors.
  • I. INTRODUCTION
Motivated by the pervasive use of file synchronization in modern data storage technologies, in this work we seek to develop a synchronization protocol that is more efficient than the existing algorithms. In particular, the popular RSYNC method can be in general very inefficient and the number of transmitted bits can be exponentially larger than the optimal number. Our starting point is an information-theoretically oriented scheme recently developed in [1]. In [1], a synchronization protocol that synchronizes an altered copy of the binary file with the original version of the file was proposed. In this scheme, the owner of the altered file requests additional information from the owner of the original file to ensure proper
  • synchronization. It was assumed that the altered copy was
  • btained from the original copy by i.i.d. deletions at the bit-
level and that the original file was generated from an i.i.d. uniform binary source. It was then shown that the rate of the proposed scheme asymptotically matches the optimal rate for this channel, developed earlier in [2]. That is, in the scheme
  • f [1], the number of bits needed to synchronize two files can
be kept very small while achieving exponentially low error of mis-synchronization. There are many practical scenarios where the files cannot be modeled as binary and uniform. For example, a file is usually not structured by bits, but by bytes or by even longer atomic elements. If the source is a text file, not only are some characters more frequent than others, but there is a large autocorrelation within the file. Additionally, some symbols may be inserted as well as deleted. As a result, our objective is to suitably generalize the scheme in [1], while maintaining low cost of transmission and low error of mis-synchronization. Specifically, our model encapsulates the following general- izations of the model in [1]: 1) We consider errors as being insertions or deletions instead
  • f being restricted to deletions only,
2) We consider non-binary source symbols, 3) We allow the source symbols to have an arbitrary distri- bution; uniform distribution is then a special case. The rest of the paper is organized as follows. In Section II we outline the overall synchronization protocol. Necessary notation and background results are presented in Section III. Two key components of our synchronization protocol, the matching module and the edit recovery module, are discussed in detail in Sections IV and V, respectively. Section VI concludes the paper.
  • II. THE SYNCHRONIZATION PROTOCOL
In [1], the following setup is considered: two distant nodes A and B are connected by a low-bandwidth high-latency
  • network. A contains a file X which is a uniform i.i.d. binary
string of length n, and B contains a file Y of length n0 that is
  • btained by deleting bits of X independently with probability
β ⌧ 1. We consider a generalized setting in which X = X1, . . . , Xn is an i.i.d. file on alphabet X = {0, . . . , Q 1}, where for all 1  t  n, Xt’s are distributed according to µ(x). For simplicity, we consider Q to be a power of two, say Q = 2q. Insertions and deletions occur respectively with probability βi and βd. Let us define an edit pattern E = E1, . . . , En as a string in {1, 0, 1}n such that Y is
  • btained from X in the following way: for t from 1 to n,
  • if Et = 0, transmit Xt,
  • if Et = 1, delete (do not transmit) Xt,
  • if Et = 1, transmit Xt, then insert (transmit) a new
symbol of X drawn with distribution µ(x). For instance, consider X and Y defined over a quaternary alphabet, X = 00 D122133 D10 and Y = 0120 I23 I10
  • I310. Here
Y is derived from X by 2 deletions and 3 insertions where deleted (inserted) symbols are denoted by D (I). The edit pattern is thus E = (0, 1, 0, 1, 1, 1, 0, 1, 0, 0). Node B aims to synchronize its file Y with the (original) file X by requesting carefully chosen additional information from A

Synchronization from Insertions and Deletions Under a Non-Binary, Non-Uniform Source

Nicolas Bitouz´ e and Lara Dolecek Electrical Engineering Department University of California, Los Angeles Los Angeles, USA Email: bitouze@ucla.edu, dolecek@ee.ucla.edu Abstract—We study the problem of synchronizing two files X and Y at two distant nodes A and B that are connected through a two-way communication channel. We assume that file Y at node B is obtained from file X at node A by inserting and deleting a small fraction of symbols in X. More specifically, we consider the case where X is a non-binary non-uniform sequence, and deletions and insertions happen uniformly with rates βd and βi,
  • respectively. We propose a synchronization protocol between node
A and node B that needs to transmit O(CX(βd+βi)n log 1 βd+βi ) bits (where n is the length of X and CX is a constant that depends
  • n the statistical properties of X) and reconstructs X at node
B with error probability exponentially low in n. This protocol readily generalizes the recent result by Tabatabaei Yazdi and Dolecek that dealt with synchronization from binary uniform source and under only deletion errors.
  • I. INTRODUCTION
Motivated by the pervasive use of file synchronization in modern data storage technologies, in this work we seek to develop a synchronization protocol that is more efficient than the existing algorithms. In particular, the popular RSYNC method can be in general very inefficient and the number of transmitted bits can be exponentially larger than the optimal number. Our starting point is an information-theoretically oriented scheme recently developed in [1]. In [1], a synchronization protocol that synchronizes an altered copy of the binary file with the original version of the file was proposed. In this scheme, the owner of the altered file requests additional information from the owner of the original file to ensure proper synchronization. It was assumed that the altered copy was obtained from the original copy by i.i.d. deletions at the bit level and that the original file was generated from an i.i.d. uniform binary source. It was then shown that the rate of the proposed scheme asymptotically matches the optimal rate for this channel, developed earlier in [2]. That is, in the scheme of [1], the number of bits needed to synchronize two files can be kept very small while achieving exponentially low probability of error.

There are many practical scenarios where the files cannot be modeled as binary and uniform. For example, a file is usually not structured by bits, but by bytes or by even longer atomic elements. If the source is a text file, not only are some characters more frequent than others, but there is a large autocorrelation within the file. Additionally, some symbols may be inserted as well as deleted. As a result, our objective is to suitably generalize the scheme in [1], while maintaining low cost of transmission and low error of mis-synchronization. Specifically, our model encapsulates the following generalizations of the model in [1]:
1) We consider errors as being insertions or deletions instead of being restricted to deletions only,
2) We consider non-binary source symbols,
3) We allow the source symbols to have an arbitrary distribution; uniform distribution is then a special case.

The rest of the paper is organized as follows. In Section II we outline the overall synchronization protocol. Necessary notation and background results are presented in Section III. Two key components of our synchronization protocol, the matching module and the edit recovery module, are discussed in detail in Sections IV and V, respectively. Section VI concludes the paper.

II. THE SYNCHRONIZATION PROTOCOL

In [1], the following setup is considered: two distant nodes A and B are connected by a low-bandwidth high-latency network. A contains a file X which is a uniform i.i.d. binary string of length n, and B contains a file Y of length $n'$ that is obtained by deleting bits of X independently with probability $\beta \ll 1$. We consider a generalized setting in which the file $X = X_1, \ldots, X_n$ is i.i.d. on alphabet $\mathcal{X} = \{0, \ldots, Q-1\}$, where for all $1 \le t \le n$, the $X_t$'s are distributed according to $\mu(x)$. For simplicity, we consider Q to be a power of two, say $Q = 2^q$. Insertions and deletions occur respectively with probability $\beta_i$ and $\beta_d$. Let us define an edit pattern $E = E_1, \ldots, E_n$ as a string in $\{-1, 0, 1\}^n$ such that Y is obtained from X in the following way: for t from 1 to n,
  • if $E_t = 0$, transmit $X_t$,
  • if $E_t = -1$, delete (do not transmit) $X_t$,
  • if $E_t = 1$, transmit $X_t$, then insert (transmit) a new symbol of $\mathcal{X}$ drawn with distribution $\mu(x)$.

For instance, consider X and Y defined over a quaternary alphabet, $X = 0\,0_D\,1\,2\,2\,1\,3\,3_D\,1\,0$ and $Y = 0\,1\,2\,0_I\,2\,3_I\,1\,0_I\,3\,1\,0$. Here Y is derived from X by 2 deletions and 3 insertions where deleted (inserted) symbols are denoted by D (I). The edit pattern is thus $E = (0, -1, 0, 1, 1, 1, 0, -1, 0, 0)$. Node B aims to synchronize its file Y with the (original) file X by requesting carefully chosen additional information from A
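The edit model described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names `apply_edits` and `sample_edit_pattern` are hypothetical, the inserted symbols are passed in explicitly so that the quaternary example can be reproduced, and the sampler assumes that each position is edited independently, which the excerpt does not spell out.

```python
import random

def apply_edits(x, e, inserted):
    """Build Y from X under an edit pattern e with entries in {-1, 0, 1}.

    e[t] == 0  : keep x[t]
    e[t] == -1 : delete x[t]
    e[t] == 1  : keep x[t], then append the next symbol from `inserted`
    """
    if len(x) != len(e):
        raise ValueError("edit pattern must have one entry per symbol of X")
    pending = iter(inserted)
    y = []
    for symbol, edit in zip(x, e):
        if edit == -1:
            continue  # deletion: the symbol is not transmitted
        y.append(symbol)
        if edit == 1:
            y.append(next(pending))  # insertion right after the kept symbol
    return y

def sample_edit_pattern(n, beta_d, beta_i, rng):
    """Draw an edit pattern position by position (assumed independent)."""
    weights = [beta_d, 1.0 - beta_d - beta_i, beta_i]
    return rng.choices([-1, 0, 1], weights=weights, k=n)

# Quaternary example from the text: 2 deletions, 3 insertions.
x = [0, 0, 1, 2, 2, 1, 3, 3, 1, 0]
e = [0, -1, 0, 1, 1, 1, 0, -1, 0, 0]
y = apply_edits(x, e, inserted=[0, 3, 0])
print(y)  # [0, 1, 2, 0, 2, 3, 1, 0, 3, 1, 0]
```

Note that the length of Y is the length of X minus the number of deletions plus the number of insertions, here 10 - 2 + 3 = 11.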

3 / 23

slide-10
SLIDE 10

Reducing Data Storage Demand

When files are identical, we can use deduplication tools. What if files are similar, but not identical?

Synchronization from Insertions and Deletions Under a Non-Binary, Non-Uniform Source

Nicolas Bitouzé and Lara Dolecek
Electrical Engineering Department, University of California, Los Angeles, Los Angeles, USA
Email: bitouze@ucla.edu, dolecek@ee.ucla.edu

Abstract: We study the problem of synchronizing two files X and Y at two distant nodes A and B that are connected through a two-way communication channel. We assume that file Y at node B is obtained from file X at node A by inserting and deleting a small fraction of symbols in X. More specifically, we consider the case where X is a non-binary non-uniform string, and deletions and insertions happen uniformly with rates $\beta_d$ and $\beta_i$, respectively. We propose a synchronization protocol between node A and node B that needs to transmit $O\left(C_X (\beta_d + \beta_i)\, n \log \frac{1}{\beta_d + \beta_i}\right)$ bits (where n is the length of X and $C_X$ is a constant that depends on the statistical properties of X) and reconstructs X at node B with error probability exponentially low in n. This protocol readily generalizes the recent result by Tabatabaei Yazdi and Dolecek that dealt with synchronization from binary uniform source and under only deletion errors.
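To get a feel for the communication bound in the abstract, the sketch below evaluates only its scaling term and compares it with the cost of resending the whole file. Several things here are assumptions rather than statements from the paper: the constant hidden by the O-notation is ignored, C_X is set to a placeholder value of 1, the logarithm is taken base 2, and the function names are hypothetical. The output is therefore an order-of-magnitude illustration, not a bit count the protocol guarantees.

```python
import math

def sync_cost_scaling(n, beta_d, beta_i, c_x=1.0):
    """Evaluate c_x * (beta_d + beta_i) * n * log2(1 / (beta_d + beta_i)).

    c_x stands in for the source-dependent constant C_X; 1.0 is a placeholder.
    """
    beta = beta_d + beta_i
    if not 0.0 < beta < 1.0:
        raise ValueError("total edit rate must lie strictly between 0 and 1")
    return c_x * beta * n * math.log2(1.0 / beta)

def raw_file_bits(n, q_bits):
    """Bits needed to resend all n symbols when each symbol takes q_bits bits."""
    return n * q_bits

n = 1_000_000  # file length in symbols (illustrative)
scaling = sync_cost_scaling(n, beta_d=0.005, beta_i=0.005)
raw = raw_file_bits(n, q_bits=8)  # byte-valued symbols, Q = 2^8
print(f"scaling term: {scaling:,.0f} bits")
print(f"raw file:     {raw:,} bits")
print(f"ratio:        {scaling / raw:.4f}")
```

The point of the comparison is that the scaling term grows with the edit rate rather than with the symbol size, so for small rates it sits well below the cost of retransmitting the file.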

We need algorithms to synchronize multiple versions of a file. Existing algorithms, such as RSYNC, suffer from high communication costs.

3 / 23

slide-11
SLIDE 11

Reducing Data Storage Demand

When files are identical, we can use deduplication tools. What if files are similar, but not identical?

Synchronization from Insertions and Deletions Under a Non-Binary, Non-Uniform Source

Nicolas Bitouz´ e and Lara Dolecek Electrical Engineering Department University of California, Los Angeles Los Angeles, USA Email: bitouze@ucla.edu, dolecek@ee.ucla.edu Abstract—We study the problem of synchronizing two files X and Y at two distant nodes A and B that are connected through a two-way communication channel. We assume that file Y at node B is obtained from file X at node A by inserting and deleting a small fraction of symbols in X. More specifically, we consider the case where X is a non-binary non-uniform string, and deletions and insertions happen uniformly with rates βd and βi,
  • respectively. We propose a synchronization protocol between node
A and node B that needs to transmit O(CX(βd+βi)n log 1 βd+βi ) bits (where n is the length of X and CX is a constant that depends
  • n the statistical properties of X) and reconstructs X at node
B with error probability exponentially low in n. This protocol readily generalizes the recent result by Tabatabaei Yazdi and Dolecek that dealt with synchronization from binary uniform source and under only deletion errors.
  • I. INTRODUCTION
Motivated by the pervasive use of file synchronization in modern data storage technologies, in this work we seek to develop a synchronization protocol that is more efficient than the existing algorithms. In particular, the popular RSYNC method can be in general very inefficient and the number of transmitted bits can be exponentially larger than the optimal number. Our starting point is an information-theoretically oriented scheme recently developed in [1]. In [1], a synchronization protocol that synchronizes an altered copy of the binary file with the original version of the file was proposed. In this scheme, the owner of the altered file requests additional information from the owner of the original file to ensure proper
  • synchronization. It was assumed that the altered copy was
  • btained from the original copy by i.i.d. deletions at the bit-
level and that the original file was generated from an i.i.d. uniform binary source. It was then shown that the rate of the proposed scheme asymptotically matches the optimal rate for this channel, developed earlier in [2]. That is, in the scheme of [1], the number of bits needed to synchronize two files can be kept very small while achieving exponentially low probability
  • f error.
There are many practical scenarios where the files cannot be modeled as binary and uniform. For example, a file is usually not structured by bits, but by bytes or by even longer atomic elements. If the source is a text file, not only are some characters more frequent than others, but there is a large autocorrelation within the file. Additionally, some symbols may be inserted as well as deleted. As a result, our objective is to suitably generalize the scheme in [1], while maintaining low cost of transmission and low error of mis-synchronization. Specifically, our model encapsulates the following general- izations of the model in [1]: 1) We consider errors as being insertions or deletions instead
  • f being restricted to deletions only,
2) We consider non-binary source symbols, 3) We allow the source symbols to have an arbitrary distri- bution; uniform distribution is then a special case. The rest of the paper is organized as follows. In Section II we outline the overall synchronization protocol. Necessary notation and background results are presented in Section III. Two key components of our synchronization protocol, the matching module and the edit recovery module, are discussed in detail in Sections IV and V, respectively. Section VI concludes the paper.
  • II. THE SYNCHRONIZATION PROTOCOL
In [1], the following setup is considered: two distant nodes A and B are connected by a low-bandwidth high-latency
  • network. A contains a file X which is a uniform i.i.d. binary
string of length n, and B contains a file Y of length n0 that is
  • btained by deleting bits of X independently with probability
β ⌧ 1. We consider a generalized setting in which the file X = X1, . . . , Xn is i.i.d. on alphabet X = {0, . . . , Q 1}, where for all 1  t  n, Xt’s are distributed according to µ(x). For simplicity, we consider Q to be a power of two, say Q = 2q. Insertions and deletions occur respectively with probability βi and βd. Let us define an edit pattern E = E1, . . . , En as a string in {1, 0, 1}n such that Y is obtained from X in the following way: for t from 1 to n,
  • if Et = 0, transmit Xt,
  • if Et = 1, delete (do not transmit) Xt,
  • if Et = 1, transmit Xt, then insert (transmit) a new
symbol of X drawn with distribution µ(x). For instance, consider X and Y defined over a quaternary alphabet, X = 00 D122133 D10 and Y = 0120 I23 I10
  • I310. Here
Y is derived from X by 2 deletions and 3 insertions where deleted (inserted) symbols are denoted by D (I). The edit pattern is thus E = (0, 1, 0, 1, 1, 1, 0, 1, 0, 0). Node B aims to synchronize its file Y with the (original) file X by requesting carefully chosen additional information from A

Synchronization from Insertions and Deletions Under a Non-Binary, Non-Uniform Source

Nicolas Bitouz´ e and Lara Dolecek Electrical Engineering Department University of California, Los Angeles Los Angeles, USA Email: bitouze@ucla.edu, dolecek@ee.ucla.edu Abstract—We study the problem of synchronizing two files X and Y at two distant nodes A and B that are connected through a two-way communication channel. We assume that file Y at node B is obtained from file X at node A by inserting and deleting a small fraction of symbols in X. More specifically, we consider the case where X is a non-binary non-uniform string, and deletions and insertions happen uniformly with rates βd and βi,
  • respectively. We propose a synchronization protocol between node
A and node B that needs to transmit O(CX(βd+βi)n log 1 βd+βi ) bits (where n is the length of X and CX is a constant that depends
  • n the statistical properties of X) and reconstructs X at node
B with error probability exponentially low in n. This protocol readily generalizes the recent result by Tabatabaei Yazdi and Dolecek that dealt with synchronization from binary uniform source and under only deletion errors.
  • I. INTRODUCTION
Motivated by the pervasive use of file synchronization in modern data storage technologies, in this work we seek to develop a synchronization protocol that is more efficient than the existing algorithms. In particular, the popular RSYNC method can be in general very inefficient and the number of transmitted bits can be exponentially larger than the optimal number. Our starting point is an information-theoretically oriented scheme recently developed in [1]. In [1], a synchronization protocol that synchronizes an altered copy of the binary file with the original version of the file was proposed. In this scheme, the owner of the altered file requests additional information from the owner of the original file to ensure proper
  • synchronization. It was assumed that the altered copy was
  • btained from the original copy by i.i.d. deletions at the bit-
level and that the original file was generated from an i.i.d. uniform binary source. It was then shown that the rate of the proposed scheme asymptotically matches the optimal rate for this channel, developed earlier in [2]. That is, in the scheme
  • f [1], the number of bits needed to synchronize two files can
be kept very small while achieving exponentially low error of mis-synchronization. There are many practical scenarios where the files cannot be modeled as binary and uniform. For example, a file is usually not structured by bits, but by bytes or by even longer atomic elements. If the source is a text file, not only are some characters more frequent than others, but there is a large autocorrelation within the file. Additionally, some symbols may be inserted as well as deleted. As a result, our objective is to suitably generalize the scheme in [1], while maintaining low cost of transmission and low error of mis-synchronization. Specifically, our model encapsulates the following general- izations of the model in [1]: 1) We consider errors as being insertions or deletions instead
  • f being restricted to deletions only,
2) We consider non-binary source symbols, 3) We allow the source symbols to have an arbitrary distri- bution; uniform distribution is then a special case. The rest of the paper is organized as follows. In Section II we outline the overall synchronization protocol. Necessary notation and background results are presented in Section III. Two key components of our synchronization protocol, the matching module and the edit recovery module, are discussed in detail in Sections IV and V, respectively. Section VI concludes the paper.
  • II. THE SYNCHRONIZATION PROTOCOL
In [1], the following setup is considered: two distant nodes A and B are connected by a low-bandwidth high-latency
  • network. A contains a file X which is a uniform i.i.d. binary
string of length n, and B contains a file Y of length n0 that is
  • btained by deleting bits of X independently with probability
β ⌧ 1. We consider a generalized setting in which X = X1, . . . , Xn is an i.i.d. file on alphabet X = {0, . . . , Q 1}, where for all 1  t  n, Xt’s are distributed according to µ(x). For simplicity, we consider Q to be a power of two, say Q = 2q. Insertions and deletions occur respectively with probability βi and βd. Let us define an edit pattern E = E1, . . . , En as a string in {1, 0, 1}n such that Y is
  • btained from X in the following way: for t from 1 to n,
  • if Et = 0, transmit Xt,
  • if Et = 1, delete (do not transmit) Xt,
  • if Et = 1, transmit Xt, then insert (transmit) a new
symbol of X drawn with distribution µ(x). For instance, consider X and Y defined over a quaternary alphabet, X = 00 D122133 D10 and Y = 0120 I23 I10
  • I310. Here
Y is derived from X by 2 deletions and 3 insertions where deleted (inserted) symbols are denoted by D (I). The edit pattern is thus E = (0, 1, 0, 1, 1, 1, 0, 1, 0, 0). Node B aims to synchronize its file Y with the (original) file X by requesting carefully chosen additional information from A

Synchronization from Insertions and Deletions Under a Non-Binary, Non-Uniform Source

Nicolas Bitouz´ e and Lara Dolecek Electrical Engineering Department University of California, Los Angeles Los Angeles, USA Email: bitouze@ucla.edu, dolecek@ee.ucla.edu Abstract—We study the problem of synchronizing two files X and Y at two distant nodes A and B that are connected through a two-way communication channel. We assume that file Y at node B is obtained from file X at node A by inserting and deleting a small fraction of symbols in X. More specifically, we consider the case where X is a non-binary non-uniform sequence, and deletions and insertions happen uniformly with rates βd and βi,
  • respectively. We propose a synchronization protocol between node
A and node B that needs to transmit O(CX(βd+βi)n log 1 βd+βi ) bits (where n is the length of X and CX is a constant that depends
  • n the statistical properties of X) and reconstructs X at node
B with error probability exponentially low in n. This protocol readily generalizes the recent result by Tabatabaei Yazdi and Dolecek that dealt with synchronization from binary uniform source and under only deletion errors.
  • I. INTRODUCTION
Motivated by the pervasive use of file synchronization in modern data storage technologies, in this work we seek to develop a synchronization protocol that is more efficient than the existing algorithms. In particular, the popular RSYNC method can be in general very inefficient and the number of transmitted bits can be exponentially larger than the optimal number. Our starting point is an information-theoretically oriented scheme recently developed in [1]. In [1], a synchronization protocol that synchronizes an altered copy of the binary file with the original version of the file was proposed. In this scheme, the owner of the altered file requests additional information from the owner of the original file to ensure proper
  • synchronization. It was assumed that the altered copy was
  • btained from the original copy by i.i.d. deletions at the bit-
level and that the original file was generated from an i.i.d. uniform binary source. It was then shown that the rate of the proposed scheme asymptotically matches the optimal rate for this channel, developed earlier in [2]. That is, in the scheme of [1], the number of bits needed to synchronize two files can be kept very small while achieving exponentially low probability
  • f error.
There are many practical scenarios where the files cannot be modeled as binary and uniform. For example, a file is usually not structured by bits, but by bytes or by even longer atomic elements. If the source is a text file, not only are some characters more frequent than others, but there is a large autocorrelation within the file. Additionally, some symbols may be inserted as well as deleted. As a result, our objective is to suitably generalize the scheme in [1], while maintaining low cost of transmission and low error of mis-synchronization. Specifically, our model encapsulates the following generalizations of the model in [1]:
1) We consider errors as being insertions or deletions instead of being restricted to deletions only,
2) We consider non-binary source symbols,
3) We allow the source symbols to have an arbitrary distribution; uniform distribution is then a special case.

The rest of the paper is organized as follows. In Section II we outline the overall synchronization protocol. Necessary notation and background results are presented in Section III. Two key components of our synchronization protocol, the matching module and the edit recovery module, are discussed in detail in Sections IV and V, respectively. Section VI concludes the paper.

II. THE SYNCHRONIZATION PROTOCOL

In [1], the following setup is considered: two distant nodes A and B are connected by a low-bandwidth high-latency network. A contains a file X which is a uniform i.i.d. binary string of length n, and B contains a file Y of length n′ that is obtained by deleting bits of X independently with probability β ≪ 1. We consider a generalized setting in which the file X = X1, . . . , Xn is i.i.d. on alphabet 𝒳 = {0, . . . , Q − 1}, where for all 1 ≤ t ≤ n, the Xt are distributed according to µ(x). For simplicity, we consider Q to be a power of two, say Q = 2^q. Insertions and deletions occur respectively with probability βi and βd. Let us define an edit pattern E = E1, . . . , En as a string in {−1, 0, 1}^n such that Y is obtained from X in the following way: for t from 1 to n,
  • if Et = 0, transmit Xt,
  • if Et = −1, delete (do not transmit) Xt,
  • if Et = 1, transmit Xt, then insert (transmit) a new symbol of 𝒳 drawn with distribution µ(x).

For instance, consider X and Y defined over a quaternary alphabet, X = 0 0(D) 1 2 2 1 3 3(D) 1 0 and Y = 0 1 2 0(I) 2 3(I) 1 0(I) 3 1 0. Here Y is derived from X by 2 deletions and 3 insertions, where deleted (inserted) symbols are marked with (D) ((I)). The edit pattern is thus E = (0, −1, 0, 1, 1, 1, 0, −1, 0, 0). Node B aims to synchronize its file Y with the (original) file X by requesting carefully chosen additional information from A
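
The edit-pattern definition can be checked mechanically. The following is a minimal sketch, assuming the convention Et = −1 for a deletion and Et = +1 for an insertion; the function name `apply_edit_pattern` and the `inserted` argument (a fixed list standing in for random draws from µ) are illustrative and not part of the paper.

```python
def apply_edit_pattern(x, e, inserted):
    """Derive Y from X under an edit pattern E.

    x        -- list of source symbols X_1..X_n
    e        -- one edit per symbol: 0 keep, -1 delete, +1 keep then insert
    inserted -- symbols to insert, consumed left to right
                (a deterministic stand-in for draws from mu)
    """
    if len(x) != len(e):
        raise ValueError("edit pattern must have one entry per source symbol")
    fresh = iter(inserted)
    y = []
    for symbol, edit in zip(x, e):
        if edit == -1:
            continue                 # deletion: X_t is not transmitted
        y.append(symbol)             # E_t in {0, +1}: X_t is transmitted
        if edit == 1:
            y.append(next(fresh))    # insertion: one extra symbol follows X_t
    return y


# Quaternary example from the text: 2 deletions and 3 insertions.
X = [0, 0, 1, 2, 2, 1, 3, 3, 1, 0]
E = [0, -1, 0, 1, 1, 1, 0, -1, 0, 0]
Y = apply_edit_pattern(X, E, inserted=[0, 3, 0])
print(Y)  # [0, 1, 2, 0, 2, 3, 1, 0, 3, 1, 0]
```

With these inputs the output has length 10 − 2 + 3 = 11, matching the example string Y.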

We need algorithms to synchronize multiple versions of a file. Existing algorithms, such as RSYNC, suffer from high communication costs. Goal: Develop a more efficient synchronization algorithm.

3 / 23

slide-12
SLIDE 12

Synchronization Protocols

Original File

4 / 23

slide-13
SLIDE 13

Synchronization Protocols

Original File Alice’s Version Bob’s Version Edits Edits

4 / 23

slide-14
SLIDE 14

Synchronization Protocols

Original File Alice’s Version Bob’s Version Edits Edits Synchronization Protocol so that Bob’s version matches Alice’s

4 / 23

slide-15
SLIDE 15

Synchronization Protocols

Original File Alice’s Version Alice’s Version Edits Edits Synchronization Protocol so that Bob’s version matches Alice’s

4 / 23

slide-16
SLIDE 16

Problem Setting

File X: h d c e a t g b k u j r v c x f q . . .
File Y: h d e a k t g v b j r v c r f s q . . .
Small rate of edits β. File lengths: |X| = n, |Y| = m ≈ n.

5 / 23

slide-19
SLIDE 19

Problem Setting

File X: h d c e a t g b k u j r v c x f q . . .
File Y: h d e a k t g v b j r v c r f s q . . .
Marked edit: D (deletion)
Small rate of edits β. File lengths: |X| = n, |Y| = m ≈ n.

5 / 23

slide-22
SLIDE 22

Problem Setting

File X: h d c e a t g b k u j r v c x f q . . .
File Y: h d e a k t g v b j r v c r f s q . . .
Marked edits: D (deletion), I (insertion)
Small rate of edits β. File lengths: |X| = n, |Y| = m ≈ n.

5 / 23

slide-25
SLIDE 25

Problem Setting

File X: h d c e a t g b k u j r v c x f q . . .
File Y: h d e a k t g v b j r v c r f s q . . .
Marked edits: D D D D (deletions), I I I I (insertions)
Small rate of edits β. File lengths: |X| = n, |Y| = m ≈ n.

5 / 23

slide-26
SLIDE 26

Problem Setting

File X (Node A): h d c e a t g b k u j r v c x f q . . .
File Y (Node B): h d e a k t g v b j r v c r f s q . . .
Marked edits: D D D D (deletions), I I I I (insertions)
Node A and Node B are connected by a two-way (noiseless) channel.
Goal: an interactive communication scheme that allows Node B to recover X from Y:
  • with low probability of error,
  • with low communication cost.

5 / 23

slide-27
SLIDE 27

Related Work

Scheme that corrects a single edit (binary & non-binary):

  • V. I. Levenshtein, “Binary codes with correction of deletions, insertions and reversals”, 1965.
  • G. M. Tenengolts, “Nonbinary codes, correcting single deletion or insertion”, 1984.

6 / 23

slide-28
SLIDE 28

Related Work

Scheme that corrects a single edit (binary & non-binary):

  • V. I. Levenshtein, “Binary codes with correction of deletions, insertions and reversals”, 1965.
  • G. M. Tenengolts, “Nonbinary codes, correcting single deletion or insertion”, 1984.

Scheme that corrects a fixed number of edits (binary):

  • R. Venkataramanan, H. Zhang, and K. Ramchandran, “Interactive low-complexity codes for synchronization from deletions and insertions”, 2010.

6 / 23

slide-29
SLIDE 29

Related Work

Scheme that corrects a single edit (binary & non-binary):

  • V. I. Levenshtein, “Binary codes with correction of deletions, insertions and reversals”, 1965.
  • G. M. Tenengolts, “Nonbinary codes, correcting single deletion or insertion”, 1984.

Scheme that corrects a fixed number of edits (binary):

  • R. Venkataramanan, H. Zhang, and K. Ramchandran, “Interactive low-complexity codes for synchronization from deletions and insertions”, 2010.

Theoretical bound for the fixed rate of edits case:

  • N. Ma, K. Ramchandran and D. Tse, “Efficient file synchronization: a distributed source coding approach”, 2011.

6 / 23

slide-30
SLIDE 30

Our Contributions

Scheme for fixed rate of edits:

  • N. Bitouzé and L. Dolecek, “Synchronization from insertions and deletions under a non-binary, non-uniform source”, IEEE ISIT, Jul. 2013.
  • S. M. S. Tabatabaei and L. Dolecek, “A deterministic, polynomial-time protocol for synchronizing from deletions”, IEEE Trans. I.T., 2013.
  • N. Bitouzé, F. Sala, S. M. S. Tabatabaei, and L. Dolecek, “A practical framework for efficient file synchronization”, Allerton, Oct. 2013.

7 / 23

slide-31
SLIDE 31

Synchronization Scheme: Overview

File X (Node A): hdcatgbkujinntuqjrvxfqxrzleydwajgndtwanyohdcyhzrnrm
File Y (Node B): hdcatbkujintkuqjrvxqxrledwrajgnidtwayohdcyhzrnrm

8 / 23

slide-32
SLIDE 32

Synchronization Scheme: Overview

File X (Node A): hdcatgbkujinntuqjrvxfqxrzleydwajgndtwanyohdcyhzrnrm
File Y (Node B): hdcatbkujintkuqjrvxqxrledwrajgnidtwayohdcyhzrnrm
Pivot strings: short. Segment strings: long.

8 / 23

slide-33
SLIDE 33

Synchronization Scheme: Overview

File X (Node A): hdcatgbkujinntuqjrvxfqxrzleydwajgndtwanyohdcyhzrnrm
File Y (Node B): hdcatbkujintkuqjrvxqxrledwrajgnidtwayohdcyhzrnrm
Pivot strings: short. Segment strings: long.
Send the pivots: hdc jrv gnd nrm

8 / 23

slide-34
SLIDE 34

Synchronization Scheme: Overview

File X (Node A): hdcatgbkujinntuqjrvxfqxrzleydwajgndtwanyohdcyhzrnrm
File Y (Node B): hdcatbkujintkuqjrvxqxrledwrajgnidtwayohdcyhzrnrm
Pivot strings: short. Segment strings: long.
Send the pivots: hdc jrv gnd nrm
Shown in File Y: matched pivots.

1 Matching Module: Matches the pivot strings.

8 / 23

slide-35
SLIDE 35

Synchronization Scheme: Overview

File X (Node A): hdcatgbkujinntuqjrvxfqxrzleydwajgndtwanyohdcyhzrnrm
File Y (Node B): hdcatbkujintkuqjrvxqxrledwrajgnidtwayohdcyhzrnrm
Pivot strings: short. Segment strings: long.
Shown in File Y: matched pivots.

1 Matching Module: Matches the pivot strings.

8 / 23

slide-36
SLIDE 36

Synchronization Scheme: Overview

File X (Node A): hdcatgbkujinntuqjrvxfqxrzleydwajgndtwanyohdcyhzrnrm
File Y (Node B): hdcatbkujintkuqjrvxqxrledwrajgnidtwayohdcyhzrnrm
Pivot strings: short. Segment strings: long.
Shown in File Y: matched pivots, non-synced segments.

1 Matching Module: Matches the pivot strings.

8 / 23

slide-37
SLIDE 37

Synchronization Scheme: Overview

File X (Node A): hdcatgbkujinntuqjrvxfqxrzleydwajgndtwanyohdcyhzrnrm
File Y (Node B): hdcatbkujintkuqjrvxqxrledwrajgnidtwayohdcyhzrnrm
Pivot strings: short. Segment strings: long.
Shown in File Y: matched pivots, non-synced segments.

1 Matching Module: Matches the pivot strings.
2 Edit Recovery Module: Synchronizes the segment strings.

8 / 23

slide-38
SLIDE 38

Synchronization Scheme: Overview

File X (Node A): hdcatgbkujinntuqjrvxfqxrzleydwajgndtwanyohdcyhzrnrm
File Y (Node B): hdcatgbkujinntuqjrvxfqxrzleydwajgndtwanyohdcyhzrnrm
Pivot strings: short. Segment strings: long.
Shown in File Y: matched pivots, synchronized segments.

1 Matching Module: Matches the pivot strings.
2 Edit Recovery Module: Synchronizes the segment strings.

8 / 23

slide-39
SLIDE 39

Synchronization Scheme: Overview

File X (Node A): hdcatgbkujinntuqjrvxfqxrzleydwajgndtwanyohdcyhzrnrm
File Y (Node B): hdcatgbkujinntuqjrvxfqxrzleydwajgndtwanyohdcyhzrnrm
Pivot strings: short. Segment strings: long.
Shown in File Y: matched pivots, synchronized segments.

1 Matching Module: Matches the pivot strings.
2 Edit Recovery Module: Synchronizes the segment strings.
3 Channel Coding Module: Recovers from residual errors if any.

8 / 23

slide-40
SLIDE 40

1 Matching Module

slide-41
SLIDE 41

Matching Module

We match hdc jrv gnd nrm in Y = hdcatbkujintkuqjrvxqxrledwrajgnidtwayohdcyhzrnrm.
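
To make the matching step concrete, here is a rough sketch that looks up each pivot in Y and keeps an order-preserving choice of positions. It is a simplified greedy stand-in, not the graph-based matching used in the talk. The helper names and the `expected` positions (the pivot offsets 0, 16, 32, 48 read off File X in the overview) are assumptions made for the illustration.

```python
def find_occurrences(y, pivot):
    """Return every start index at which pivot occurs in y (overlaps allowed)."""
    positions = []
    start = y.find(pivot)
    while start != -1:
        positions.append(start)
        start = y.find(pivot, start + 1)
    return positions


def match_pivots(y, pivots, expected):
    """Greedy, order-preserving matching of pivots to positions in y.

    expected[k] is the position of pivot k in X (hypothetical input; a real
    protocol would derive it from the agreed pivot spacing).
    Returns one position per pivot, or None when no usable match exists.
    """
    matches = []
    last_end = 0
    for pivot, guess in zip(pivots, expected):
        # Only occurrences to the right of the previous match keep the order.
        candidates = [p for p in find_occurrences(y, pivot) if p >= last_end]
        if not candidates:
            matches.append(None)
            continue
        best = min(candidates, key=lambda p: abs(p - guess))
        matches.append(best)
        last_end = best + len(pivot)
    return matches


Y = "hdcatbkujintkuqjrvxqxrledwrajgnidtwayohdcyhzrnrm"
PIVOTS = ["hdc", "jrv", "gnd", "nrm"]
EXPECTED = [0, 16, 32, 48]

occurrences = {p: find_occurrences(Y, p) for p in PIVOTS}
print(occurrences)  # {'hdc': [0, 38], 'jrv': [15], 'gnd': [], 'nrm': [45]}
print(match_pivots(Y, PIVOTS, EXPECTED))  # [0, 15, None, 45]
```

In this example `hdc` occurs twice in Y (a pivot collision) and `gnd` does not occur at all because an edit falls inside it, which are the two failure modes the pivot length has to balance.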

9 / 23

slide-42
SLIDE 42

Matching Module

We match hdc jrv gnd nrm in Y = hdcatbkujintkuqjrvxqxrledwrajgnidtwayohdcyhzrnrm.

hdc: hdcatbkujintkuqjrvxqxrledwrajgnidtwaohdcdyhzrnrm

9 / 23

slide-43
SLIDE 43

Matching Module

We match hdc jrv gnd nrm in Y = hdcatbkujintkuqjrvxqxrledwrajgnidtwayohdcyhzrnrm.

hdc: hdcatbkujintkuqjrvxqxrledwrajgnidtwaohdcdyhzrnrm
jrv: hdcatbkujintkuqjrvxqxrledwrajgnidtwaohdcdyhzrnrm

9 / 23

slide-44
SLIDE 44

Matching Module

We match hdc jrv gnd nrm in Y = hdcatbkujintkuqjrvxqxrledwrajgnidtwayohdcyhzrnrm.

hdc: hdcatbkujintkuqjrvxqxrledwrajgnidtwaohdcdyhzrnrm
jrv: hdcatbkujintkuqjrvxqxrledwrajgnidtwaohdcdyhzrnrm
gnd: hdcatbkujintkuqjrvxqxrledwrajgnidtwaohdcdyhzrnrm

9 / 23

slide-45
SLIDE 45

Matching Module

We match hdc jrv gnd nrm in Y = hdcatbkujintkuqjrvxqxrledwrajgnidtwayohdcyhzrnrm.

hdc: hdcatbkujintkuqjrvxqxrledwrajgnidtwaohdcdyhzrnrm
jrv: hdcatbkujintkuqjrvxqxrledwrajgnidtwaohdcdyhzrnrm
gnd: hdcatbkujintkuqjrvxqxrledwrajgnidtwaohdcdyhzrnrm
nrm: hdcatbkujintkuqjrvxqxrledwrajgnidtwaohdcdyhzrnrm

9 / 23

slide-47
SLIDE 47

Matching Module: Larger Example

[Figure: pivots 1 to 6 and their candidate matches along y1 y2 . . . ym.]

10 / 23

slide-48
SLIDE 48

Matching Module: Even Larger Example

Edit Rates βd = βi = 0.02, n = 5000, Pivot Length 5

[Plot: pivots (Pivot 1, Pivot 20, Pivot 40, . . . , Pivot k) against file positions 500 to 5000.]

11 / 23

slide-49
SLIDE 49

Matching Module: Even Larger Example

Edit Rates βd = βi = 0.02, n = 5000, Pivot Length 6

[Plot: pivots (Pivot 1, Pivot 20, Pivot 40, . . . , Pivot k) against file positions 500 to 5000.]

11 / 23

slide-50
SLIDE 50

Matching Module: Even Larger Example

Edit Rates βd = βi = 0.02, n = 5000, Pivot Length 7

[Plot: pivots (Pivot 1, Pivot 20, Pivot 40, . . . , Pivot k) against file positions 500 to 5000.]

11 / 23

slide-51
SLIDE 51

Matching Module: Cost and Error Probability

Errors
  • Pivots are short enough to avoid edits,
  • Pivots are long enough so that pivot collisions are unlikely,
  • “Wrong” thick edges exist, but wrong thick paths do not.

12 / 23

slide-52
SLIDE 52

Matching Module: Cost and Error Probability

Errors
  • Pivots are short enough to avoid edits,
  • Pivots are long enough so that pivot collisions are unlikely,
  • “Wrong” thick edges exist, but wrong thick paths do not.

Communication
  • With pivot length O(log(1/β)),
  • With segment length O(1/β),
  • O(nβ) pivots are transmitted, totalling O(nβ log(1/β)) bits.
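
As a back-of-the-envelope check of this count, the sketch below plugs in the parameters quoted in the comparison slides (pivot length 5, segment length 1/β, alphabet size 52). It assumes each pivot symbol is sent uncompressed at log2 of the alphabet size; the function name and that encoding assumption are illustrative and not taken from the talk.

```python
import math


def pivot_cost_bits(n, beta, pivot_len, alphabet_size):
    """Rough count of pivots and of the bits needed to send them.

    Assumes segments of length 1/beta alternate with pivots of length
    pivot_len, and that each pivot symbol costs log2(alphabet_size) bits.
    """
    segment_len = round(1 / beta)
    period = segment_len + pivot_len          # one segment plus one pivot
    num_pivots = n // period                  # grows like n * beta
    bits_per_pivot = pivot_len * math.log2(alphabet_size)
    return num_pivots, num_pivots * bits_per_pivot


count, bits = pivot_cost_bits(n=50_000, beta=0.01, pivot_len=5, alphabet_size=52)
print(count)        # 476
print(round(bits))  # roughly 13.6 thousand bits
```

Halving β roughly halves the number of pivots, which is the O(nβ) factor; the log(1/β) factor only appears once the pivot length is allowed to grow with 1/β.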

12 / 23

slide-53
SLIDE 53

Matching Module: Cost and Error Probability

Summary
The module transmits O(nβ log(1/β)) bits (in a single channel use).
With probability 1 − O(2^(−n)):
  • We match a fraction 1 − O(β log(1/β)) of the pivots.
  • These matches are incorrect with probability < β + o(β).

12 / 23

slide-54
SLIDE 54

2 Edit Recovery Module

slide-55
SLIDE 55

Edit Recovery Module: Correcting Single Edits

When two files differ by a single edit, synchronization:
  • can be done in a single round of communication,
  • in a perfect manner (no error),
  • communicating ∼ log L + log q bits (from A to B),
where L and L ± 1 are the lengths of the files.

  • G. M. Tenengolts, “Nonbinary codes, correcting single deletion or insertion”, 1984.
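
The single-edit case can be illustrated with a small sketch. It assumes the usual statement of the Tenengolts construction: A sends the symbol sum modulo q and a position-weighted sum of the binary "non-decreasing" signature modulo L. Decoding here is a brute-force search over all strings one edit away from Y, chosen for brevity; it is not the efficient decoder from the cited paper, and the function names are hypothetical.

```python
def syndromes(x, q):
    """Two check values of a q-ary sequence x (list of ints in 0..q-1)."""
    n = len(x)
    # Binary signature: 1 where the sequence does not decrease.
    signature = [1 if x[i] >= x[i - 1] else 0 for i in range(1, n)]
    position_sum = sum((i + 1) * bit for i, bit in enumerate(signature)) % n
    symbol_sum = sum(x) % q
    return position_sum, symbol_sum      # about log2(n) + log2(q) bits


def recover(y, n, q, target):
    """Find the length-n sequence one edit away from y with syndromes target."""
    if len(y) == n + 1:                  # y has one extra symbol: try deletions
        candidates = {tuple(y[:i] + y[i + 1:]) for i in range(len(y))}
    elif len(y) == n - 1:                # y lacks one symbol: try insertions
        candidates = {tuple(y[:i] + [s] + y[i:])
                      for i in range(len(y) + 1) for s in range(q)}
    else:
        raise ValueError("lengths must differ by exactly one")
    hits = [list(c) for c in candidates if syndromes(list(c), q) == target]
    if len(hits) != 1:
        raise ValueError("not a single-edit instance")
    return hits[0]


X = [0, 0, 1, 2, 2, 1, 3, 3, 1, 0]       # A's segment, quaternary alphabet
check = syndromes(X, q=4)                # the only data A transmits
Y = X[:4] + X[5:]                        # B's segment: one symbol deleted
print(recover(Y, n=len(X), q=4, target=check) == X)
```

The two check values take about log L + log q bits, which is where the cost on the slide comes from.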

13 / 23

slide-56
SLIDE 56

Edit Recovery Module

  • R. Venkataramanan, H. Zhang, and K. Ramchandran, “Interactive low-complexity codes for synchronization from deletions and insertions”, 2010.

Goal: X = j r v x q x r l e d w r a j g n i d t w a y o h d c y h z r n r m Given: Y = j r v x f q x r z l e y d w a j g n d t w a n y o h d c y h z r n r m

14 / 23

slide-57
SLIDE 57

Edit Recovery Module

  • R. Venkataramanan, H. Zhang, and K. Ramchandran, “Interactive low-complexity codes for synchronization from deletions and insertions”, 2010.

Goal: X = j r v x q x r l e d w r a j g n i d t w a y o h d c y h z r n r m Given: Y = j r v x f q x r z l e y d w a j g n d t w a n y o h d c y h z r n r m

j r v x q x r l e d w a j g n i d t w a y o h d c y h z r n r m
j r v x f q x r z l e y d w a j g n d t w a n y o h d c y h z r n r m
26 29

Lengths differ by more than 1: Send a central delimiter.

14 / 23

slide-58
SLIDE 58

Edit Recovery Module

  • R. Venkataramanan, H. Zhang, and K. Ramchandran, “Interactive low-complexity codes for synchronization from deletions and insertions”, 2010.

Goal: X = j r v x q x r l e d w r a j g n i d t w a y o h d c y h z r n r m Given: Y = j r v x f q x r z l e y d w a j g n d t w a n y o h d c y h z r n r m

j r v x q x r l e d w a j g n i d t w a y o h d c y h z r n r m
j r v x f q x r z l e y d w a j g n d t w a n y o h d c y h z r n r m
12 12

Delimiter i d has no match: Choose a different delimiter.

14 / 23

slide-59
SLIDE 59

Edit Recovery Module

  • R. Venkataramanan, H. Zhang, and K. Ramchandran, “Interactive low-complexity codes for synchronization from deletions and insertions”, 2010.

Goal: X = j r v x q x r l e d w r a j g n i d t w a y o h d c y h z r n r m Given: Y = j r v x f q x r z l e y d w a j g n d t w a n y o h d c y h z r n r m

j r v x q x r l e d w a j g n i d t w a y o h d c y h z r n r m
j r v x f q x r z l e y d w a j g n d t w a n y o h d c y h z r n r m

New “central” delimiter g n is matched: Repeat on both sides.

14 / 23

slide-60
SLIDE 60

Edit Recovery Module

Goal: X = j r v x q x r l e d w r a j g n i d t w a y o h d c y h z r n r m Given: Y = j r v x f q x r z l e y d w a j g n d t w a n y o h d c y h z r n r m

j r v x q x r l e d w a j g n i d t w a y o h d c y h z r n r m
j r v x f q x r z l e y d w a j g n d t w a n y o h d c y h z r n r m
10 14 13 14

Left side: Lengths differ by more than 1: Send a delimiter.
Right side: Lengths are equal: Test if strings are equal (hash). No: Send a delimiter.

14 / 23

slide-61
SLIDE 61

Edit Recovery Module

Goal: X = j r v x q x r l e d w r a j g n i d t w a y o h d c y h z r n r m Given: Y = j r v x f q x r z l e y d w a j g n d t w a n y o h d c y h z r n r m

j r v x q x r l e d w a j g n i d t w a y o h d c y h z r n r m
j r v x f q x r z l e y d w a j g n d t w a n y o h d c y h z r n r m

And so on...

14 / 23

slide-62
SLIDE 62

Edit Recovery Module

Goal: X = j r v x q x r l e d w r a j g n i d t w a y o h d c y h z r n r m Given: Y = j r v x f q x r z l e y d w a j g n d t w a n y o h d c y h z r n r m

j r v x q x r l e d w a j g n i d t w a y o h d c y h z r n r m
j r v x f q x r z l e y d w a j g n d t w a n y o h d c y h z r n r m
4 6 4 5 6 6 6 6

And so on...

14 / 23

slide-63
SLIDE 63

Edit Recovery Module

Goal: X = j r v x q x r l e d w r a j g n i d t w a y o h d c y h z r n r m Given: Y = j r v x f q x r z l e y d w a j g n d t w a n y o h d c y h z r n r m

j r v x q x r l e d w a j g n i d t w a y o h d c y h z r n r m
j r v x f q x r z l e d w a j g n d t w a n y o h d c y h z r n r m

And so on...

14 / 23

slide-64
SLIDE 64

Edit Recovery Module

Goal: X = j r v x q x r l e d w r a j g n i d t w a y o h d c y h z r n r m Given: Y = j r v x f q x r z l e y d w a j g n d t w a n y o h d c y h z r n r m

j r v x q x r l e d w a j g n i d t w a y o h d c y h z r n r m
j r v x f q x r z l e d w a j g n d t w a n y o h d c y h z r n r m
1 1 2 2 2 2 1 3

And so on...

14 / 23

slide-65
SLIDE 65

Edit Recovery Module

Goal: X = j r v x q x r l e d w r a j g n i d t w a y o h d c y h z r n r m Given: Y = j r v x f q x r z l e y d w a j g n d t w a n y o h d c y h z r n r m

j r v x q x r l e d w a j g n i d t w a y o h d c y h z r n r m
j r v x q x r l e d w a j g n i d t w a y o h d c y h z r n r m

And so on... until every subproblem is solved.
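
The divide-and-conquer idea in these slides can be sketched as a single-process simulation. The version below is deliberately simplified and is not the protocol of Venkataramanan et al. or of the talk: short segments are sent verbatim instead of using single-edit syndromes, A is allowed to retry delimiters without modelling B's control messages, and the 4-byte hash length, `delim_len` and `min_len` are arbitrary illustrative choices.

```python
import hashlib


def digest(s, nbytes=4):
    """Short hash that A sends so B can test a segment for equality."""
    return hashlib.sha256(s.encode()).digest()[:nbytes]


def closest_hit(y, delim, centre):
    """Position of the occurrence of delim in y nearest to centre, or None."""
    hits, start = [], y.find(delim)
    while start != -1:
        hits.append(start)
        start = y.find(delim, start + 1)
    return min(hits, key=lambda p: abs(p - centre)) if hits else None


def sync(x, y, log, delim_len=2, min_len=6):
    """Return B's reconstruction of x from y; log records what A transmits."""
    if len(x) == len(y):
        log.append(("hash", len(x)))
        if digest(x) == digest(y):
            return y                          # B keeps its segment as is
    if len(x) <= min_len:
        log.append(("raw", x))                # short segment: send it verbatim
        return x
    mid = (len(x) - delim_len) // 2
    # Try delimiters at increasing distance from the centre of x.
    shifts = sorted(range(-mid, len(x) - delim_len - mid + 1), key=abs)
    for shift in shifts:
        px = mid + shift
        delim = x[px:px + delim_len]
        log.append(("delimiter", delim))
        py = closest_hit(y, delim, len(y) // 2)
        if py is not None:                    # matched: recurse on both sides
            left = sync(x[:px], y[:py], log, delim_len, min_len)
            right = sync(x[px + delim_len:], y[py + delim_len:],
                         log, delim_len, min_len)
            return left + delim + right
    log.append(("raw", x))                    # no delimiter of x occurs in y
    return x


X = "jrvxqxrledwrajgnidtwayohdcyhzrnrm"
Y = "jrvxfqxrzleydwajgndtwanyohdcyhzrnrm"
messages = []
print(sync(X, Y, messages) == X)
print(len(messages), "messages from A")
```

On this pair the first central delimiter of X has no match in Y and a neighbouring one is tried, mirroring the "choose a different delimiter" step. A wrongly matched delimiter or a hash collision is what would leave residual errors in a real run.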

14 / 23

slide-66
SLIDE 66

Edit Recovery Module: Cost and Error Probability

Errors
  • Come from hash collisions,
  • Increase hash length: more communication, fewer collisions.

15 / 23

slide-67
SLIDE 67

Edit Recovery Module: Cost and Error Probability

Errors
  • Come from hash collisions,
  • Increase hash length: more communication, fewer collisions.

Communication
  • Hashes (from A to B),
  • Delimiters (from A to B),
  • Syndromes (from A to B),
  • Control (from B to A).

15 / 23

slide-68
SLIDE 68

Edit Recovery Module: Cost and Error Probability

Summary
The module transmits O(nβ log(1/β)) bits (in a few rounds of communication).
With probability 1 − O(2^(−n)), only a fraction o(β) of the segments is mis-synchronized.

15 / 23

slide-69
SLIDE 69

3 Channel Coding Module

slide-70
SLIDE 70

Channel Coding Module: Motivation

Possible Errors
  • Pivots mismatched by the Matching Module.
  • Delimiters mismatched by the Edit Recovery Module.
  • Hash collisions.
After these two modules, the symbol-error probability is < 2β + o(β).

16 / 23

slide-71
SLIDE 71

Channel Coding Module: Motivation

Possible Errors
  • Pivots mismatched by the Matching Module.
  • Delimiters mismatched by the Edit Recovery Module.
  • Hash collisions.
After these two modules, the symbol-error probability is < 2β + o(β).
⇒ Correct these errors with the Channel Coding Module.

16 / 23

slide-72
SLIDE 72

Channel Coding Module

Node A Node B X Output ≈X from Edit Recov. Mod. (Same length as X)

17 / 23

slide-73
SLIDE 73

Channel Coding Module

Node A Node B X Output ≈X from Edit Recov. Mod. (Same length as X) C Systematic part q-ary Checks Codeword from Systematic LDPC Code

17 / 23

slide-74
SLIDE 74

Channel Coding Module

Node A Node B X Output ≈X from Edit Recov. Mod. (Same length as X) C Systematic part q-ary Checks Codeword from Systematic LDPC Code C Send

17 / 23

slide-75
SLIDE 75

Channel Coding Module

Node A Node B X Output ≈X from Edit Recov. Mod. C Systematic part q-ary Checks Codeword from Systematic LDPC Code C Send

17 / 23

slide-76
SLIDE 76

Channel Coding Module

Node A Node B X Output ≈X from Edit Recov. Mod. C Systematic part q-ary Checks Codeword from Systematic LDPC Code C X C Channel Coding Module: LDPC Decoder Send

17 / 23

slide-77
SLIDE 77

Channel Coding Module: Cost and Error Probability

Communication
Residual error probability from previous modules: 2β + o(β).
Additional data required to correct these errors: Const · n · H(2β + o(β)) = O(nβ log(1/β)) bits.

18 / 23

slide-78
SLIDE 78

Channel Coding Module: Cost and Error Probability

Communication
Residual error probability from previous modules: 2β + o(β).
Additional data required to correct these errors: Const · n · H(2β + o(β)) = O(nβ log(1/β)) bits.

Summary
The module transmits O(nβ log(1/β)) bits (in a single channel use).
With probability 1 − O(2^(−n)), Y is synchronized with no remaining error.
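
For a feel of the size of n · H(2β), the sketch below evaluates the entropy term numerically. The slides do not specify the constant or the exact form of H. The code therefore assumes a q-ary symmetric error model (binary entropy plus p · log2(q − 1)), drops the o(β) term and sets the constant to 1, so the numbers are only indicative.

```python
import math


def symbol_entropy(p, q=2):
    """Entropy in bits per symbol of a q-ary symmetric error of probability p.

    For q = 2 this is the binary entropy function; for q > 2 the extra term
    accounts for which of the q - 1 wrong symbols occurred (an assumption,
    the slides only write H).
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    binary = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return binary + p * math.log2(q - 1)


def redundancy_bits(n, beta, q=2):
    """n * H(2 * beta): check data needed, ignoring the constant and o(beta)."""
    return n * symbol_entropy(2 * beta, q)


print(round(symbol_entropy(0.02), 4))                       # 0.1414
print(round(redundancy_bits(n=50_000, beta=0.01)))          # about 7 thousand bits
print(round(redundancy_bits(n=50_000, beta=0.01, q=52)))    # about 12.7 thousand bits
```

Since H(2β) behaves like 2β log(1/(2β)) for small β, the result scales as nβ log(1/β), in line with the other two modules.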

18 / 23

slide-79
SLIDE 79

Comparison with RSYNC when n varies

β = 0.01 and β = 0.002, i.i.d. file, i.i.d. edits, q = 52, Our pivot length: 5, segment length: 1/β.

[Plot: bits transmitted (up to 400,000) versus file length n (20,000 to 100,000), with curves for our scheme and rsync at β = 0.002 and β = 0.010. Annotated gaps: factor 5.5 and factor 12.5.]

19 / 23

slide-80
SLIDE 80

Comparison with RSYNC when β varies

n = 50000, i.i.d. file, i.i.d. edits, q = 52, Our pivot length: 5, segment length: 1/β.

[Plot: bits transmitted (up to 300,000) versus edit rate β (0.002 to 0.01), with curves for our scheme and rsync. Annotated gap: factor 10.]

20 / 23

slide-81
SLIDE 81

Comparison with Venkataramanan et al. scheme

β = 0.01, i.i.d. file, i.i.d. edits, q = 52, Our pivot length: 5, segment length: 1/β.

Bandwidth (in bits)

n                            20k    40k    60k      100k
Median      Our scheme       18k    35k    54k      87k
            Venkataramanan   19k    41k    63k      87k
Worst-case  Our scheme       52k    96k    95k      95k
            Venkataramanan   390k   845k   1,216k   346k

Rounds of communication required: our scheme completes in about half as many rounds.
Errors prior to Channel Coding: our error rate per symbol is also about half as large.

21 / 23

slide-82
SLIDE 82

Applications

In addition to reducing data storage demand, our algorithm can be used for:

22 / 23

slide-83
SLIDE 83

Applications

In addition to reducing data storage demand, our algorithm can be used for: Synchronization in general data storage (Dropbox),

22 / 23

slide-84
SLIDE 84

Applications

In addition to reducing data storage demand, our algorithm can be used for: Synchronization in general data storage (Dropbox), Synchronization in particular data repositories: source control (GitHub, SVN, etc.), video (YouTube, Vimeo),

22 / 23

slide-85
SLIDE 85

Applications

In addition to reducing data storage demand, our algorithm can be used for: Synchronization in general data storage (Dropbox), Synchronization in particular data repositories: source control (GitHub, SVN, etc.), video (YouTube, Vimeo), Database searches: determining whether two records are similar.

22 / 23

slide-86
SLIDE 86

Ongoing Work

Allow for more complex edit patterns (e.g., in practical scenarios, edits are often “bursty”),

23 / 23

slide-87
SLIDE 87

Ongoing Work

Allow for more complex edit patterns (e.g., in practical scenarios, edits are often “bursty”), Specialize our scheme to application-dependent types of files (e.g., if the files are source code, exploit that structure),

23 / 23

slide-88
SLIDE 88

Ongoing Work

Allow for more complex edit patterns (e.g., in practical scenarios, edits are often “bursty”), Specialize our scheme to application-dependent types of files (e.g., if the files are source code, exploit that structure), Optimize the implementation (e.g., in terms of both computation and network usage).

23 / 23