Will Rosenbaum | Binary Representation

\[\def\compare{ {\mathbf{compare}} } \def\true{ {\mathbf{true}} } \def\false{ {\mathbf{false}} } \def\and{ {\mathbf{and}} } \def\lshift{ {\,\ll\,} } \def\xor{ {\oplus} } \def\Add{ {\mathrm{Add}} } \def\carry{ {\mathrm{carry}} }\]

In all but the lowest-level programming languages, the programmer is typically shielded from the precise representation of numerical values. In Java, for example, the programmer must specify whether an integer should be represented by an int or a long (depending on how large the values are that a program is expected to encounter), but once this determination is made, the mechanics of arithmetic operations are handled by the computer system. Thus, we tend to treat arithmetic operations such as + and *, and comparison operations like <, as being elementary and indivisible. While this view is a handy abstraction, it is also limiting: the precise representation and manipulation of numerical values makes a significant difference in the efficiency of these fundamental operations.

In this note, we will describe a standard encoding of integer values in binary. Binary representation is analogous to the decimal (base 10) notation we are taught in grade school, except that it uses only two symbols ($0$ and $1$) instead of the digits $0$ through $9$ used in the decimal system.

Representing Numbers as Strings of Symbols

What number do we mean when we write $537$ (five hundred thirty seven)? This number has three digits, $5, 3$ and $7$. The value represented by each digit is determined both by the symbol ($0$ through $9$) as well as the digit’s place. From right to left, the place value of each digit is $1, 10, 100, 1000, \ldots$. The number $537$ thus represents $7$ ones, $3$ tens, and $5$ hundreds. That is, we can write

\[537 = 7 \cdot 1 + 3 \cdot 10 + 5 \cdot 100.\]

More generally, every natural number has a unique expression as above. We are taught this particular representation of numbers from an early age, as well as algorithms for manipulating such numbers, such as those for performing arithmetic: addition, subtraction, multiplication, and division.

Decimal representation such as above uses ten symbols, the digits $0$ through $9$. In many contexts (for example, in most modern computer hardware), it is more convenient to represent numbers using a smaller symbol set. The smallest imaginable set would represent all numbers with a single symbol, say, $\circ$. We can simply represent a number by repeating this symbol that many times. This system is called the unary encoding of numbers. For example, the number seven becomes $\circ\circ\circ\circ\circ\circ\circ$. Clearly, unary encoding makes expressing even moderately large numbers incredibly tedious.

Things improve dramatically when we allow ourselves two symbols. We will denote these symbols $0$ and $1$. We refer to these symbols as bits (short for binary digits). Written alone, they have their usual interpretation as the values zero and one, respectively. With only two symbols, we already have to use multiple symbols to express the value $2$, which we write as $10$. Just as with decimals, the value of a bit is determined both by its symbol ($0$ or $1$) and its place. In binary, the place values are powers of $2$: $1, 2, 4, 8, 16, \ldots$. Thus, the rightmost bit is the ones bit, the next bit is the twos bit, the next is the fours bit, etc. For example, we can write

\[\begin{align*} 10111_2 &= 1 \cdot 1 + 1 \cdot 2 + 1 \cdot 4 + 0 \cdot 8 + 1 \cdot 16\\ &= 1 + 2 + 4 + 16\\ &= 23 \end{align*}\]

Here we use the subscript $\cdot_2$ to indicate that the first expression is a binary expression, while the remaining numbers are written in decimal. (Typically, we will omit subscripts when the interpretation as binary or decimal is unambiguous.)
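The place-value rule is the same in decimal and binary; only the base changes. The following Python sketch illustrates this with a helper named `digits_value` (a name introduced here for illustration, not part of the notes) that evaluates a string of digits in a given base, working from right to left.

```python
def digits_value(digits: str, base: int) -> int:
    """Evaluate a digit string using place values 1, base, base^2, ..."""
    total = 0
    place = 1  # place value of the rightmost digit
    for symbol in reversed(digits):
        total += int(symbol) * place
        place *= base  # the next digit to the left is worth `base` times more
    return total


print(digits_value("537", 10))   # 537 = 7*1 + 3*10 + 5*100
print(digits_value("10111", 2))  # 23  = 1 + 2 + 4 + 16
```

For binary strings, Python's built-in `int("10111", 2)` computes the same value.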

Exercise. Familiarize yourself with binary by performing the following tasks:

Write the numbers $0$ through $31$ in binary.
What is the (decimal) value of $10110110$?
Write $87$ in binary.
Write your birth year in binary.

To refer to a generic number in binary, it will often be helpful to write the bits of the number individually. In this case, to represent a number $b$ we will use $b_i$ to refer to the $i$th bit of the binary representation of $b$, where $b_0$ is the ones bit. Thus, if $b$ is represented with $n + 1$ bits, we would write

\[b = b_n b_{n-1} \cdots b_1 b_0.\]

In this expression, the value of $b$ can be computed as the sum

\[b = 2^0 \cdot b_0 + 2^1 \cdot b_1 + \cdots + 2^{n-1} \cdot b_{n-1} + 2^n \cdot b_n.\]

Finally, we can write this sum much more succinctly using summation notation as follows:

\[\begin{equation} b = \sum_{i = 0}^n b_i 2^i. \end{equation}\]

The notation $\sum$ (the Greek capital letter Sigma) is essentially just shorthand for a “for” loop: $\sum_{i=0}^n s(i)$ means “add up the values of $s(i)$ for $i = 0, 1,\ldots,n$.” Just as with decimal representations of natural numbers, every natural number has a unique representation as a binary expression as above.
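To make the "for loop" reading of $\sum$ concrete, here is a minimal Python sketch. It assumes the bits are stored in a list `b` with `b[i]` holding $b_i$, so `b[0]` is the ones bit; the function name `value_of` is hypothetical.

```python
def value_of(b: list[int]) -> int:
    """Return the sum of b[i] * 2**i over all indices i of the list b."""
    total = 0
    for i in range(len(b)):  # i = 0, 1, ..., n
        total += b[i] * 2**i
    return total


# 10111 stored with the ones bit first: b_0 = 1, b_1 = 1, b_2 = 1, b_3 = 0, b_4 = 1
print(value_of([1, 1, 1, 0, 1]))  # 23
```

Note that the list is written in the opposite order from the usual left-to-right notation, because index $0$ corresponds to the rightmost bit.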

Comparing Binary Values

Given two natural numbers expressed in binary, we can easily compare them to determine which is larger. The process is analogous to comparing decimal numbers. To simplify things, we assume that both numbers are expressed with the same number of bits, $n$. (If one of the numbers uses fewer bits, we can pad the expression by adding $0$s on the left until both numbers have the same number of bits.) If we are asked to compare $a$ and $b$, we can align the numbers and find the left-most bit where their values differ. If this is bit $k$, then there are only two possibilities: $a_k = 1$ and $b_k = 0$, or $a_k = 0$ and $b_k = 1$. In the former case, we have $a > b$, while in the latter case $b > a$.

Example. To determine which of $a = 10110111000111001$ and $b =10110110111001111$ is larger, it is helpful to write them with their bits aligned:

  a = 10110111000111001
  b = 10110110111001111
             *

In this case, the first bit $k$ in which the two numbers differ is indicated with the *. Since $a_k = 1$ and $b_k = 0$, we can conclude that $a > b$.
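Locating the bit marked with * can also be done mechanically. The sketch below works on strings of `'0'` and `'1'` characters rather than on arrays, and the helper name `leftmost_difference` is an assumption made for this illustration. It pads the shorter string with $0$s on the left, scans from the left, and reports the index $k$ of the first differing bit, counted from the right starting at $0$.

```python
from typing import Optional


def leftmost_difference(a: str, b: str) -> Optional[int]:
    """Return the index k of the most significant differing bit, or None if equal."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)  # pad with 0s on the left
    for position in range(width):          # scan left to right
        if a[position] != b[position]:
            return width - 1 - position    # convert to a bit index from the right
    return None


a = "10110111000111001"
b = "10110110111001111"
k = leftmost_difference(a, b)
print(k)                                     # 9
print(a[len(a) - 1 - k], b[len(b) - 1 - k])  # 1 0
```

Here $a_9 = 1$ and $b_9 = 0$, which agrees with the conclusion $a > b$ in the example.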

Exercise. Sort the following list of numbers (written in binary) without converting them to decimal.

  110110100110
  100110001001
  001000101000
  101100111001
  000010001010
  111100011100
  101111001100
  010100010010
  011011100010

At the end of this note, we include a proof of the correctness of the comparison procedure described above.

Exercise. Suppose $a$ and $b$ are represented as binary arrays of length $n$. That is, $a$ and $b$ are arrays of values $0$ and $1$, and the $i$th bit of $a$ (say) can be accessed as $a[i]$. Specifically, $a[0]$ is the 1s bit, $a[1]$ is the 2s bit, and so on. Write pseudocode for the method $\compare(a, b)$ that returns $\true$ if $a > b$ and $\false$ otherwise.

Arithmetic

Arithmetic with the binary representation of numbers can be performed analogously to the decimal arithmetic we are taught in grade school. In fact, the grade-school procedures are simplified in binary representation because there are fewer symbols used. As with grade-school arithmetic, defining the addition and multiplication operations first requires one to define these operations on individual digits (bits). Then we can combine these single-bit operations to add or multiply larger numbers.

Addition

We first define addition for one-bit numbers. In binary, we get the following table:

\[\begin{align*} 0 + 0 &= 00\\ 1 + 0 &= 01\\ 0 + 1 &= 01\\ 1 + 1 &= 10 \end{align*}\]

Thinking ahead, we represent the sum of two bits as a two-bit number: the carry bit followed by the sum bit. It will be instructive to separate out the sum above into two separate operations that give a single bit each, one for the sum and the other for the carry. Notice the pattern of the sum (1s) bit in the expressions above: its value is $1$ if exactly one of the two bits on the left is $1$, and its value is $0$ otherwise. The binary function with this behavior is called the exclusive or (or xor) function. It is often denoted with the mathematical symbol $\xor$. The xor operator is defined by the following table of values:

\[\begin{align*} 0 \xor 0 &= 0\\ 1 \xor 0 &= 1\\ 0 \xor 1 &= 1\\ 1 \xor 1 &= 0 \end{align*}\]

Again, this table encodes the 1s bits from the previous table. The name “exclusive or” comes from interpreting the values $0$ and $1$ as representing Boolean values $\false$ and $\true$, respectively. In this case “exclusive or” means that “one, but not both of the values is true.”

We can similarly define a separate operator that gives us just the carry bit upon adding two values. Again referring to the previous table, the carry bit of $a + b$ is $1$ if $a$ and $b$ are both $1$, and $0$ otherwise. This operation is the logical and operation, often denoted $\wedge$. The and operator is defined by the following table:

\[\begin{align*} 0 \wedge 0 &= 0\\ 1 \wedge 0 &= 0\\ 0 \wedge 1 &= 0\\ 1 \wedge 1 &= 1 \end{align*}\]

Thinking of $0$ and $1$ as corresponding to $\false$ and $\true$, respectively, $a \wedge b$ is $\true$ precisely when both $a$ and $b$ are true.
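In Python, the operators `^` and `&` compute xor and and on integers, so the one-bit addition table can be reproduced directly. This is a small sketch; the function name `half_add` is introduced here for illustration.

```python
def half_add(a: int, b: int) -> tuple[int, int]:
    """Return (carry, sum) for one-bit inputs a and b."""
    return a & b, a ^ b  # carry is the and, sum bit is the xor


for a in (0, 1):
    for b in (0, 1):
        carry, s = half_add(a, b)
        print(f"{a} + {b} = {carry}{s}")
# 0 + 0 = 00
# 0 + 1 = 01
# 1 + 0 = 01
# 1 + 1 = 10
```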

Now that we have formally defined how to add individual bits via the $\xor$ and $\wedge$ operators, we can describe the procedure for adding binary numbers. When performed by hand, the procedure is analogous to grade-school addition:

Write the numbers down so that bits are aligned according to their values.
Moving from right to left, add the individual bits, along with carry bits using the tables above.

Example. Consider adding the numbers $10110 (= 22)$ and $1101 (= 13)$:

    111       (carry)
     10110    (a)
  +  01101    (b)
  --------
    100011    (sum)

To get the first (rightmost) bit of the sum, we compute the xor of the first bits of $a$ and $b$ ($0$ and $1$ respectively), to get $1$. The carry bit is $0 \wedge 1 = 0$. Similarly, the second bit of the sum is $1 \xor 0 = 1$, while the carry bit is $1 \wedge 0 = 0$. For the third bit of the sum, we compute the sum $1 \xor 1 = 0$ and the carry bit $1 \wedge 1 = 1$. We indicate the carry bits above $a$, so the 4th column from the right contains the carry bit $1$ from the addition of the third column. This process continues until all bits have been added.

We can express the procedure for addition in pseudocode as follows. Again, we treat the values $a$ and $b$ as Boolean (binary) arrays where, for example $a[0]$ is the ones bit of $a$, $a[1]$ is the twos bit, and so on.

  Add(a, b): # a and b binary arrays of length n
    c <- new array of size n+1 # store sum
    sum <- 0
    carry <- 0
    for i = 0 up to n-1 do:
      sum <- a[i] xor b[i]   # add bits of a and b
      c[i] <- sum xor carry  # add previous carry bit
      carry <- (a[i] and b[i]) or (sum and carry)  # carry into bit i+1
    endfor
    c[n] <- carry  # final carry is the highest-order bit
    return c
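For readers who want to run the procedure, here is one possible Python rendering. It assumes zero-based lists of bits with the ones bit at index $0$, as described in the text, and it requires both inputs to have the same length. The lowercase name `add` and the local name `s` (used in place of `sum`, which is a Python built-in) are choices made for this sketch.

```python
def add(a: list[int], b: list[int]) -> list[int]:
    """Add two bit lists of equal length n (ones bit first); return n + 1 bits."""
    n = len(a)
    assert len(b) == n, "pad the shorter input with 0s first"
    c = [0] * (n + 1)  # stores the sum
    carry = 0
    for i in range(n):
        s = a[i] ^ b[i]                      # add bits of a and b
        c[i] = s ^ carry                     # add previous carry bit
        carry = (a[i] & b[i]) | (s & carry)  # carry into bit i + 1
    c[n] = carry
    return c


a = [0, 1, 1, 0, 1]  # 10110 = 22, ones bit first
b = [1, 0, 1, 1, 0]  # 01101 = 13, ones bit first
print(add(a, b))     # [1, 1, 0, 0, 0, 1], i.e. 100011 = 35
```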

Exercise. Convince yourself that $\carry$ is computed properly. Note that we should have $\carry = 1$ after iteration $i$ if and only if at least two of $a[i]$, $b[i]$, and the previous value of $\carry$ are $1$.

Exercise. What is the running time of $\Add(a, b)$ if $a$ and $b$ are both represented with $n$ bits? If $a$ and $b$ both satisfy $a, b \leq N$, what is the running time of $\Add(a, b)$ as a function of $N$? (Hint: how many bits are sufficient to represent $N$?)

Exercise. Pick a few of your favorite numbers, express them in binary, and add them together.

Multiplication

Before describing multiplication in full generality, we first describe a special case that will make our procedure much simpler to understand.

Observation. If $a$ has binary representation $a_n a_{n-1} \cdots a_1 a_0$, then $2 a$ has binary representation $a_n a_{n-1} \cdots a_1 a_0 0$. That is, the binary representation of $2a$ is obtained from that of $a$ by shifting $a$’s bits one place to the left, and appending $0$ as the ones bit. (Note that this is analogous to the procedure of multiplying a decimal representation by $10$.)

In light of this observation, we introduce a new operation, the left shift operator, denoted $\lshift$. Applying the left shift operator $a \lshift k$ has the effect of shifting the bits of $a$ to the left by $k$ and appending $k$ 0s as the lowest order bits:

\[a \lshift k = a_n a_{n-1} \cdots a_1 a_0 \underset{k}{\underbrace{00\cdots0}}\]

Observe that the value stored by $a \lshift k$ is $2^k a$—i.e., applying $\lshift k$ is equivalent to multiplying by $2^k$.
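Python's built-in `<<` operator on integers behaves the same way, which gives a quick way to check the claim. The list-based helper `shift_left` below is a hypothetical name; it assumes bits are stored with the ones bit at index $0$, so appending low-order $0$s means adding them at the front of the list.

```python
def shift_left(bits: list[int], k: int) -> list[int]:
    """Shift a bit list (ones bit first) left by k places."""
    return [0] * k + bits  # k new low-order 0s; existing bits move up by k


a = 0b1011                     # 11
print(bin(a << 2))             # 0b101100
print((a << 2) == a * 2**2)    # True

print(shift_left([1, 1, 0, 1], 2))  # [0, 0, 1, 1, 0, 1], i.e. 101100
```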

In order to multiply two binary numbers, we first give a multiplication table for individual bits. In this case we get:

\[\begin{align*} 0 * 0 &= 0\\ 1 * 0 &= 0\\ 0 * 1 &= 0\\ 1 * 1 &= 1 \end{align*}\]

Note that for individual bits, $*$ is precisely the same as the $\wedge$ (logical $\and$) operator from before. Because of the simplified multiplication table above, the procedure for multiplying two numbers in binary is simpler than decimal multiplication. When performing the multiplication by hand, the basic procedure is the same:

1. Write $a$ above $b$ with the bits aligned.
2. For each bit $b_i$ of $b$, shift $a$ by $i$ bits to the left, multiply by $b_i$, and add the result to the running total.

In the case of binary expressions, step 2 is simpler than decimal multiplication. This is because $b_i$ is either $0$ or $1$, so the product is either $0$ or simply the shifted expression of $a$.

Example. Consider multiplying $1011_2 = 11$ by $101_2 = 5$.

      1011
   *   101
  --------
      1011
  +  00000
  + 101100
  --------
    110111

We can formalize the multiplication procedure as the following method.

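  Multiply(a, b): # a and b binary arrays; b[0] is the ones bit
    product <- 0
    shifted <- a    # copy of a we will shift
    for i = 0 to (size of b) - 1 do
      if b[i] = 1 do
        product <- Add(product, shifted)
      endif
      shifted <- shifted << 1   # shifted now stores 2^(i+1) * a
    endfor
    return product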
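The shift-and-add idea can be tried out in Python as follows. This sketch works on Python integers rather than bit arrays, so it makes a few substitutions that are assumptions of the sketch and not part of the procedure above: the built-in `+` stands in for $\Add$, and bit $b_i$ is read with `(b >> i) & 1` (a right shift followed by an and with $1$).

```python
def multiply(a: int, b: int) -> int:
    """Multiply non-negative integers a and b by shifting and adding."""
    product = 0
    shifted = a  # copy of a we will shift
    for i in range(b.bit_length()):
        if (b >> i) & 1 == 1:            # is bit i of b equal to 1?
            product = product + shifted  # stands in for Add(product, shifted)
        shifted = shifted << 1           # shifted now stores 2^(i+1) * a
    return product


print(bin(multiply(0b1011, 0b101)))  # 0b110111
print(multiply(11, 5))               # 55
```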

Exercise. Pick a few of your favorite numbers, express them in binary, then use the procedure above to multiply them by hand.

Exercise. Suppose $a$ and $b$ are both represented with $n$ bits. What is the running time of $\mathrm{Multiply}(a, b)$ as a function of $n$? (Be sure to account for the running time of the call to $\Add$.) If $a$ and $b$ satisfy $a, b \leq N$, what is the running time of $\mathrm{Multiply}(a, b)$ as a function of $N$?

Proof of Comparison Procedure Correctness

The correctness of the comparison technique above can be established as follows. Suppose $a$ and $b$ are both represented with bits indexed $0$ through $n$, and $k$ is the index of the most significant bit on which their binary representations differ. That is,

\[\begin{align*} a &= a_n a_{n-1} \cdots a_0 = \sum_{i = 0}^n a_i 2^i\\ b &= b_n b_{n-1} \cdots b_0 = \sum_{i = 0}^n b_i 2^i, \end{align*}\]

and $a_k \neq b_k$, while $a_i = b_i$ for $i > k$. Suppose $a_k = 1$ and $b_k = 0$. We can write the difference $a - b$ as

\[\begin{align*} a - b &= \sum_{i = 0}^n a_i 2^i - \sum_{i = 0}^n b_i 2^i\\ &= \sum_{i = 0}^n (a_i - b_i) 2^i\\ &= 2^k - \sum_{i = 0}^{k-1} (b_i - a_i) 2^i. \end{align*}\]

The final expression follows because $a_i - b_i = 0$ for $i > k$ and from our assumption that $a_k = 1$ and $b_k = 0$. Note that $a > b$ is equivalent to $a - b > 0$. We will argue that

\[2^k - \sum_{i = 0}^{k-1} (b_i - a_i) 2^i > 0,\]

or equivalently, that

\[\sum_{i = 0}^{k-1} (b_i - a_i) 2^i < 2^k.\]

To this end, note that all $a_i$ and $b_i$ are $0$ or $1$. Therefore, we have $b_i - a_i \leq 1$ for all $i$. Thus we have

\[\sum_{i = 0}^{k-1} (b_i - a_i) 2^i \leq \sum_{i = 0}^{k-1} (1) 2^i = 1 + 2 + \cdots + 2^{k-1}.\]

As you showed on Homework 1, this final expression is $2^{k} - 1$, which is $< 2^k$. Therefore, $a - b > 0$, which is what we wanted to show.
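The bound used in the last step can be spot-checked numerically. The sketch below is a sanity check for small $k$, not a substitute for the proof; the name `low_bits_sum` is hypothetical. It also looks at the tightest case, where $a$ is a $1$ followed by $k$ zeros and $b$ is $k$ ones, so that $b_i - a_i = 1$ for every $i < k$.

```python
def low_bits_sum(k: int) -> int:
    """Return 1 + 2 + ... + 2^(k-1), the largest value of the k low-order bits."""
    return sum(2**i for i in range(k))


for k in range(0, 11):
    assert low_bits_sum(k) == 2**k - 1
    a = 2**k             # bit k is 1, all lower bits are 0
    b = low_bits_sum(k)  # bit k is 0, all lower bits are 1
    assert a - b == 1    # a is still larger, by exactly 1

print("checked k = 0, 1, ..., 10")
```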