# Deletion Streams

Imagine Alice is trying to send a message \(M\) to Bob. Assume that the message consists of \(m\) bits, i.e., \(M \in {0, 1}^m\). However, the channel through which Alice transmits the message is unreliable: each bit in the message is deleted (independently) with probability \(p\). The message that Bob receives will be shorter than \(M\), and Bob does not know which bits have been deleted. Let \(T\) (for *trace*) denote the message that Bob receives. We assume that Bob knows in advance the length \(m\) of \(M\). A natural question to ask about this communication model (called the *deletion stream model*) is, “How many times must Alice attempt to send Bob the message in order to guarantee that Bob can reconstruct any \(M\) (with high probability)?”

This communication model is similar to that considered by Claude Shannon in his seminal work *A Mathematical Theory of Communication*. Shannon’s work instigated the field of information theory and motivated the study of error-correcting codes. In Shannon’s model, the message \(M’\) received by Bob has the same length as the original message \(M\), but some bits are changed (instead of deleted entirely). In Shannon’s model, error-correcting codes are ways of encoding the original message so that if a modest number of bits are flipped, Bob can detect and correct the errors and decode the original message faithfully.

Much less is known about deletion streams than Shannon’s streaming model. One reason for this is that more information about the original message is destroyed in the deletion process than in Shannon’s model. For example if Bob sees \(k\) traces \(T_1, T_2, \ldots, T_k\) of the message for any fixed \(k\), there is a positive probability that he will not see some fraction of bits in the message and he will not be able to determine the positions or values of these missing bits. Apparently, *exact* reconstruction of the original message (with high probability) requires at least a linear number of traces (that is \(k = \Omega(m)\)).

To me, it would be interesting to to consider the problem of *approximate* reconstruction of \(M\) from traces. That is, given traces \(T_1, T_2, \ldots, T_k\), for what values of \(k\) can we find a message \(M’\) that is “close” to \(M\)? A natural notion of closeness in this case is Hamming distance, which is the number of bits where \(M\) and \(M’\) disagree. To my knowledge, it is still not known how large \(k\) must be in order to approximately reconstruct \(M\).

Given an algorithm that gives an approximate reconstruction of \(M\), we could solve the problem of communication over a deletion stream in the following way: First Alice encodes the message \(M\) using a suitable error-correcting code, and sends Bob \(k\) copies of the encoded message for a suitable value of \(k\). Bob can then approximately reconstruct the encoded message, and using the error-correcting code decode the original message.