Lecture 21: Hashing Wrap-up and Huffman Codes

Overview

  1. Hashing Rehash
  2. Remarks on Assignment 08
  3. Data Compression

Last Time

Hash tables with chaining!

  • Goal: implement the unordered set ADT (find, add, remove)

  • Assumption: given a hash function $h$

    • $h$ takes an object instance as its argument
    • $h$ returns a hash code for that instance, drawn from a specified range

Hash Table with Chaining

Have array arr of lists, element x is stored in list arr[h(x)]

Example

  • red -> 2
  • orange -> 0
  • yellow -> 4
  • green -> 1
  • blue -> 0
  • indigo -> 2
  • violet -> 0
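
A minimal sketch of the structure above. It assumes the hash values from the example are simply looked up in a table (a real h would compute them); the names ChainingDemo, H, build, and find are illustrative and not part of the course code.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

class ChainingDemo {
    // Hypothetical stand-in for h: the hash values from the example above.
    static final Map<String, Integer> H = Map.of(
        "red", 2, "orange", 0, "yellow", 4, "green", 1,
        "blue", 0, "indigo", 2, "violet", 0);

    // Build an array of lists and store each element x in list arr[h(x)].
    static List<List<String>> build(int capacity, List<String> items) {
        List<List<String>> arr = new ArrayList<>();
        for (int i = 0; i < capacity; i++) {
            arr.add(new LinkedList<>());
        }
        for (String x : items) {
            arr.get(H.get(x)).add(x);
        }
        return arr;
    }

    // find(x) only searches the one list that x could be stored in.
    static boolean find(List<List<String>> arr, String x) {
        Integer i = H.get(x);
        return i != null && arr.get(i).contains(x);
    }

    public static void main(String[] args) {
        List<List<String>> arr = build(5, List.of(
            "red", "orange", "yellow", "green", "blue", "indigo", "violet"));
        for (int i = 0; i < arr.size(); i++) {
            System.out.println(i + ": " + arr.get(i));
        }
    }
}
```

Index 0 ends up holding orange, blue, and violet, which is exactly the kind of chain that find must walk through.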

Hashing Strings in Java

String s stores an array of chars

  • each char has an associated ASCII value
  • interpret a char as a value from 0 to 255 (a Java char is actually 16 bits, but ASCII text only uses small values)

Let n = s.length() and interpret s as a char array

Java computes

s.hashCode() = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-2]*31 + s[n-1]

where result is computed as an int

Example

String s = "bake"

As a character array:

s = ['b', 'a', 'k', 'e']

  = [98,   97, 107, 101]

Computing hash code:

int h = 98 * 31^3 + 97 * 31^2 + 107 * 31 + 101

      = 3016153

You can confirm this with s.hashCode() in Java!
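
The polynomial above can be evaluated with Horner's rule, one character at a time. The sketch below (class and method names are illustrative) recomputes the hash of "bake" and compares it with Java's built-in value.

```java
class StringHashDemo {
    // Evaluate s[0]*31^(n-1) + ... + s[n-2]*31 + s[n-1] with Horner's rule.
    // int arithmetic wraps around on overflow, as it does in String.hashCode().
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(polyHash("bake"));   // 3016153
        System.out.println("bake".hashCode());  // 3016153
    }
}
```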

From hashCode to Index

The hashCode() method returns an arbitrary int value

  • assume x.hashCode() == y.hashCode() is “unlikely” unless x.equals(y)

For hash tables, we need an index in a specified range 0, 1,...,n-1!

  • cannot do anything about collisions x.hashCode() == y.hashCode()
  • want to ensure that if x.hashCode() != y.hashCode() then index of x is “unlikely” to equal index of y

Implementation: define method int getIndex(E x)

  • method uses x.hashCode()
  • returns a value in range 0, 1,..., arr.length - 1

A Problematic Solution

Question. If n = arr.length, what was the issue with using i = x.hashCode() % n?

s.hashCode() = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-2]*31 + s[n-1]

Better Hashing

Goal. Given an arbitrary int h = x.hashCode(), compute an index i from h such that:

  1. i is in range 0, 1,...,n-1
  2. if h != h' are two different hash codes, then their indices i and i' are unlikely to be the same
    • how unlikely?

Get Index Method

  1. Choose an odd random number r
    • fixed at beginning of execution (when we initialize hash table)
  2. Maintain that array capacity is always a power of 2
    • n = 2^k for some value of k
    • indices encoded with k bits
  3. To generate index i from hash code h, take i to be the k most significant bits of h * r, where the product is computed as a 32-bit int (overflow is discarded)
    00000000000000000000100010110000   (h)

  * 10111000110110000111000111010101   (r)
  
  = ???????????????????????????10000

Implementation in Java

    protected int getIndex(E x) {
        // r is a fixed odd number that was randomly
        // chosen when the table was constructed
        // k is the log (base 2) of the capacity (i.e., capacity = 2^k)

        // get the k most significant bits of the 32-bit product r * x.hashCode()
        return ((r * x.hashCode()) >>> (32 - k));
    }

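The operator >>> is the unsigned right shift operator: it fills the vacated high-order bits with 0s, so the result is never negative when shifting by at least 1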
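
A self-contained sketch of the same idea, so it can be run outside the assignment code. The class name MultiplyShiftDemo and its constructor are assumptions made for illustration; the sketch also assumes 1 <= k <= 31.

```java
import java.util.Random;

class MultiplyShiftDemo {
    final int r;  // random odd multiplier, fixed at construction
    final int k;  // log2 of the capacity; this sketch assumes 1 <= k <= 31

    MultiplyShiftDemo(int k, Random rng) {
        this.k = k;
        this.r = rng.nextInt() | 1;  // force the lowest bit to 1 so r is odd
    }

    // Top k bits of the 32-bit product r * hashCode (overflow is discarded).
    int getIndex(Object x) {
        return (r * x.hashCode()) >>> (32 - k);
    }

    public static void main(String[] args) {
        MultiplyShiftDemo table = new MultiplyShiftDemo(4, new Random());
        for (String s : new String[] {"red", "orange", "yellow", "green"}) {
            System.out.println(s + " -> " + table.getIndex(s));
        }
    }
}
```

Every printed index lies in the range 0 to 15 because only 4 bits survive the shift; which index each string gets depends on the random r.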

Guarantee

    protected int getIndex(E x) {
        // r is a fixed odd number that was randomly
        // chosen when the table was constructed
        // k is the log (base 2) of the capacity (i.e., capacity = 2^k)

        // get the k most significant bits of the 32-bit product r * x.hashCode()
        return ((r * x.hashCode()) >>> (32 - k));
    }

Assume

  • r is a randomly chosen odd number
  • x and y are objects with different hash codes
  • n = 2^k is the size of arr

Then the probability (over choice of r) that getIndex(x) == getIndex(y) is at most 2/n

Informally: collisions are nearly as unlikely as possible (within a factor of 2 of the ideal 1/n)!
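
One way to get a feel for this guarantee is to estimate the collision probability empirically. The sketch below fixes two different hash codes, samples many random odd multipliers r, and reports how often the two indices coincide. The class name, the seed, and the choice of strings are arbitrary assumptions for illustration.

```java
import java.util.Random;

class CollisionEstimate {
    // Same computation as getIndex, with r, h, and k passed in explicitly.
    static int index(int r, int h, int k) {
        return (r * h) >>> (32 - k);
    }

    // Fraction of sampled odd multipliers r for which h1 and h2 get the same index.
    static double collisionRate(int h1, int h2, int k, int trials, long seed) {
        Random rng = new Random(seed);
        int collisions = 0;
        for (int t = 0; t < trials; t++) {
            int r = rng.nextInt() | 1;  // random odd 32-bit multiplier
            if (index(r, h1, k) == index(r, h2, k)) {
                collisions++;
            }
        }
        return (double) collisions / trials;
    }

    public static void main(String[] args) {
        int k = 4;  // capacity n = 2^4 = 16
        double rate = collisionRate("bake".hashCode(), "cake".hashCode(), k, 100_000, 42L);
        System.out.println("observed collision rate: " + rate);
        System.out.println("1/n = " + 1.0 / (1 << k));
    }
}
```

The observed rate depends on the particular pair of hash codes, so it is worth trying several pairs and table sizes.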

Occupancy?

Suppose we have a hash table with arr of size n storing at most n elements.

Question. If we try to add(x), where x is not stored in the hash table, what is the “expected” (average) occupancy of index getIndex(x)?

  • assume x.hashCode() is distinct from other hash codes

Running Time?

If arr has size n and stores at most n elements, then the expected (average) occupancy of getIndex(x) is $O(1)$: each of the at most n stored elements lands at that index with probability $O(1/n)$!

Question. What does this mean in terms of the running time of add(x), find(x), and remove(x)?

Randomized Guarantee

If we:

  1. choose value r randomly
  2. use getIndex method as defined previously
  3. maintain that size <= arr.length

Then:

On average (over random choice of r), unordered set operations are supported in $O(1)$ time! (assuming distinct hashCode values)

Homework 08

  1. HashUSet implements SimpleUSet using a hash table
    • getIndex method provided
    • Node<E> class provided
  2. HashSimpleMap implements SimpleMap using HashUSet
    • SimpleMap stores key-value pairs
    • each key has unique associated value
    • examples:
      • dictionary: keys = words, values = definitions
      • phone book: key = person (name), value = phone number
    • use a HashUSet that stores Pair<K, V>
    • subtlety: must ensure that no two Pairs with same key are stored in the hash table

Changing Gears

Data Compression

Main Question. How much space (i.e., memory) is required to store a given set of data?

For your consideration: shakespeare.txt

  • contains the complete works of William Shakespeare
  • basic stats:
    • file size: 5.8 MB
    • 170,592 lines
    • 961,443 words
    • 5,756,698 characters

Question

How is shakespeare.txt stored/represented in computer memory?

  • basic stats:
    • file size: 5.8 MB (5,756,698 Bytes)
    • 170,592 lines
    • 961,443 words
    • 5,756,698 characters

ASCII Encoding

Each char represented with 8 bits = 1 Byte:

'A' = 65 = 01000001
'B' = 66 = 01000010
'C' = 67 = 01000011
'D' = 68 = 01000100
...

256 total possible char values

  • shakespeare.txt contains 5.8 M characters, and the file size is 5.8 MB.

Question

Do we need 5.8 MB to store Shakespeare?

  • examine histogram of characters!
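
A histogram of characters is just a count per character. The sketch below counts characters in a short sample string; for shakespeare.txt one would first read the file contents into a String. The class and method names are illustrative.

```java
import java.util.Map;
import java.util.TreeMap;

class CharHistogram {
    // Count how many times each character occurs in the text.
    static Map<Character, Integer> count(String text) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (char c : text.toCharArray()) {
            counts.merge(c, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Character, Integer> counts = count("to be or not to be");
        for (Map.Entry<Character, Integer> e : counts.entrySet()) {
            System.out.println("'" + e.getKey() + "' " + e.getValue());
        }
        System.out.println("distinct characters: " + counts.size());
    }
}
```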

Data Compression Attempt 1

  • ASCII encoding allows for 256 possible characters
    • 8 bits per character
  • Shakespeare only uses 114 characters

Can we use this fact to compress shakespeare.txt into a smaller file?

Idea

Re-encode characters actually used by Shakespeare

  • table of (ASCII) characters used and new encoding
  • store actual characters with new encoding

Question 1

How much space (number of bits) does the new encoding require?

Question 2

How would we decode the newly encoded file?
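
One possible way to think about both questions is sketched below: number the characters that actually occur, give each the same fixed number of bits, and decode by reading that many bits at a time. This is a sketch under stated assumptions, not the course's implementation; bits are kept in a String of '0'/'1' characters for readability, and the name FixedWidthCode is made up. The encode method assumes every character of its input appears in the table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class FixedWidthCode {
    final List<Character> table;  // position in this list = new code for the char
    final int width;              // bits per encoded character

    FixedWidthCode(String text) {
        TreeSet<Character> used = new TreeSet<>();
        for (char c : text.toCharArray()) {
            used.add(c);
        }
        table = new ArrayList<>(used);
        // ceil(log2(number of distinct chars)), with a minimum of 1 bit
        width = Math.max(1, 32 - Integer.numberOfLeadingZeros(table.size() - 1));
    }

    // Write each character's table position as a width-bit binary number.
    String encode(String text) {
        StringBuilder bits = new StringBuilder();
        for (char c : text.toCharArray()) {
            String b = Integer.toBinaryString(table.indexOf(c));
            bits.append("0".repeat(width - b.length())).append(b);
        }
        return bits.toString();
    }

    // Read width bits at a time and look each code up in the table.
    String decode(String bits) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < bits.length(); i += width) {
            int code = Integer.parseInt(bits.substring(i, i + width), 2);
            text.append(table.get(code));
        }
        return text.toString();
    }
}
```

With 114 distinct characters the width works out to 7 bits, since 2^6 = 64 < 114 <= 128 = 2^7. The table itself must also be stored with the file so that the decoder can rebuild it.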

Digging Deeper

Consider character frequencies of Shakespeare

Observation

8 characters account for more than half of the characters of shakespeare.txt!

char  ASCII    count
' '      32  1055175
'e'     101   445988
't'     116   315647
'o'     111   305115
'a'      97   265561
'h'     104   238932
's'     115   236293
'n'     110   235774
--------------------
total        3098485 (= 54% of chars)

Question

How could we exploit frequency counts to further compress shakespeare.txt?

Data Compression Attempt 2

Have two tables: one for frequent characters, another for infrequent characters

  • only 8 frequent chars, so need only 3 bits to encode each
  • still < 128 infrequent chars, so use 7 bits to encode each

Question

If we have 2 char tables, how can we decode a string?

Frequent: 
' ' -> 000
'e' -> 001
...

Infrequent:
'r'  -> 0000000
'i'  -> 0000001
'\n' -> 0000010
...

How to decode 000000101001100...?

An Issue!

When decoding, how do we tell from the encoded bits whether the next character is frequent or infrequent?

A Solution?

Use an extra bit to indicate if the following (encoded) character is frequent (3 bits) or infrequent (7 bits).

Frequent: 
' ' -> 0000
'e' -> 0001
...

Infrequent:
'r'  -> 10000000
'i'  -> 10000001
'\n' -> 10000010
...

Now frequent characters use 4 bits (always starting with 0), infrequent use 8 bits (always starting with 1)

Example

Decode the string 010010000000100000010100

' ' -> 0000
'e' -> 0001
't' -> 0010
'o' -> 0011
'a' -> 0100

'r' -> 10000000
'i' -> 10000001

Structure of Encoded Document

Start scanning bits from the first bit:

  • if first bit is 0, first four bits encode a frequent character
  • if first bit is 1, first eight bits encode an infrequent character
  • once a character is decoded, the next bit tells us if following character is frequent or infrequent
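
The scanning procedure above can be sketched as a short loop. The tables below contain only the codewords listed in the example (the full tables are elided in these notes), and the class name TwoTableDecoder is illustrative.

```java
import java.util.Map;

class TwoTableDecoder {
    // Only the codewords shown in the example; the remaining entries are not given.
    static final Map<String, Character> FREQUENT = Map.of(
        "0000", ' ', "0001", 'e', "0010", 't', "0011", 'o', "0100", 'a');
    static final Map<String, Character> INFREQUENT = Map.of(
        "10000000", 'r', "10000001", 'i');

    static String decode(String bits) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < bits.length()) {
            // The first bit of each codeword says which table it belongs to.
            boolean frequent = bits.charAt(pos) == '0';
            int len = frequent ? 4 : 8;
            String codeword = bits.substring(pos, pos + len);
            Character c = (frequent ? FREQUENT : INFREQUENT).get(codeword);
            if (c == null) {
                throw new IllegalArgumentException("unknown codeword " + codeword);
            }
            out.append(c);
            pos += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("010010000000100000010100"));  // prints "aria"
    }
}
```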

Question

If we use frequent/infrequent character encoding, what is the resulting size of shakespeare.txt?

  • 5.8 M characters
  • > 1/2 are frequent -> 4 bits each
  • < 1/2 are infrequent -> 8 bits each

Strategies So Far

  1. ASCII: every possible character gets 8 bits
    • 5.8 M characters => 5.8 MB
  2. Re-encode only used characters
    • all (used) characters encoded w/ 7 bits
    • resulting encoding uses 5.8 M * 7 / 8 = 5.1 MB
    • compression ratio $7 / 8 = 87.5\%$
  3. Re-encode frequent and infrequent characters separately
    • 8 frequent chars account for more than 1/2 of chars
    • encode frequent chars with 4 bits
    • encode remaining chars with 8 bits
    • Size: ~ 5.8 M * (0.5 * 4 + 0.5 * 8) / 8 = 4.35 MB
    • compression ratio $6 / 8 = 75\%$
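
The estimates above round the frequent share to 1/2. As a cross-check, the sketch below redoes the arithmetic with the exact counts quoted earlier in these notes (5,756,698 characters, of which 3,098,485 are among the 8 most frequent). It ignores the space needed to store the encoding tables, and the class name is illustrative.

```java
class CompressionSizes {
    static final long TOTAL = 5_756_698L;     // characters in shakespeare.txt (from the notes)
    static final long FREQUENT = 3_098_485L;  // occurrences of the 8 most frequent chars

    // Round a bit count up to whole bytes.
    static long bytes(long bits) {
        return (bits + 7) / 8;
    }

    static long asciiBytes() {
        return bytes(8 * TOTAL);
    }

    static long sevenBitBytes() {
        return bytes(7 * TOTAL);
    }

    static long twoTableBytes() {
        return bytes(4 * FREQUENT + 8 * (TOTAL - FREQUENT));
    }

    public static void main(String[] args) {
        System.out.println("ASCII:     " + asciiBytes());     // 5756698
        System.out.println("7-bit:     " + sevenBitBytes());  // 5037111
        System.out.println("4/8-bit:   " + twoTableBytes());  // 4207456
        System.out.println("4/8 ratio: " + (double) twoTableBytes() / asciiBytes());
    }
}
```

With the exact counts, the 4/8-bit scheme comes out slightly better than the 75% estimate (roughly 73%), because the frequent characters make up 54% rather than 50% of the text.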

Can We Do Better?

Why limit ourselves to just two types of characters (frequent/infrequent)?

General Situation:

  • every character gets assigned a codeword
    • codeword is a binary string (some number of 0s and 1s)
  • lengths of codewords may be different

Example:

' ' -> 00
'e' -> 010
't' -> 011
'o' -> 101
...
'X' -> 10011011

Two Questions

  1. What properties of codewords are required to enable us to decode an encoded text?

  2. What properties of codewords are desired to enable us to compress the original text?

Question 1

What properties of codewords are required to enable us to decode an encoded text?

Prefix Codes

Unique decodability:

  • When reading individual bits, we must know when we have reached the end of a codeword

  • Cannot have: one codeword is 1001 and another codeword starts 1001...

  • We say 1001 is a prefix of 1001011

Definition. A set of codewords is a prefix code if no codeword is a prefix of any other.

Examples.

  1. ASCII
  2. $7$-bit Shakespeare encoding
  3. $4,8$-bit Shakespeare encoding
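
The definition can be checked directly by comparing every pair of codewords. The sketch below is a straightforward quadratic check (fine for small codes); the class and method names are illustrative.

```java
import java.util.List;

class PrefixCodeCheck {
    // Returns true if no codeword is a prefix of a different codeword in the list.
    // Duplicate codewords are rejected, since each is a prefix of the other.
    static boolean isPrefixCode(List<String> codewords) {
        for (int i = 0; i < codewords.size(); i++) {
            for (int j = 0; j < codewords.size(); j++) {
                if (i != j && codewords.get(j).startsWith(codewords.get(i))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPrefixCode(List.of("00", "010", "011", "101", "10011011")));  // true
        System.out.println(isPrefixCode(List.of("1001", "1001011")));                      // false
    }
}
```

Any fixed-length code with distinct codewords passes this check, which is why all three examples above are prefix codes.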

Prefix Codes and Trees

Any prefix code can be represented as a binary tree!

Start at root:

  1. label children of each node 0 and 1
  2. label leaves with characters
  3. codeword associated with a leaf is sequence of 0s and 1s along the path from root to the leaf

Example

Construct the binary tree for

'a' -> 00
'b' -> 01
'c' -> 101
'd' -> 111
'e' -> 1101
'f' -> 1100

Example

Use previous tree to decode 1100001011101111
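
A sketch of how the tree can drive decoding: build the tree from the codeword table, then walk from the root one bit at a time and emit a character whenever a leaf is reached. The Node class and method names are assumptions made for illustration, and bits are again kept as a String of '0'/'1' characters.

```java
import java.util.Map;

class PrefixTreeDecoder {
    static class Node {
        Node zero, one;    // children reached by reading a 0 or a 1
        Character symbol;  // non-null only at leaves
    }

    // Insert each codeword as a root-to-leaf path.
    static Node buildTree(Map<Character, String> code) {
        Node root = new Node();
        for (Map.Entry<Character, String> e : code.entrySet()) {
            Node cur = root;
            for (char bit : e.getValue().toCharArray()) {
                if (bit == '0') {
                    if (cur.zero == null) cur.zero = new Node();
                    cur = cur.zero;
                } else {
                    if (cur.one == null) cur.one = new Node();
                    cur = cur.one;
                }
            }
            cur.symbol = e.getKey();
        }
        return root;
    }

    // Follow bits down the tree; at each leaf, output its symbol and restart at the root.
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char bit : bits.toCharArray()) {
            cur = (bit == '0') ? cur.zero : cur.one;
            if (cur == null) {
                throw new IllegalArgumentException("not a valid encoding");
            }
            if (cur.symbol != null) {
                out.append(cur.symbol);
                cur = root;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Character, String> code = Map.of(
            'a', "00", 'b', "01", 'c', "101", 'd', "111", 'e', "1101", 'f', "1100");
        System.out.println(decode(buildTree(code), "1100001011101111"));  // prints "faced"
    }
}
```

Because no codeword is a prefix of another, a leaf is never passed through on the way to a different leaf, so the decoder never has to guess where a codeword ends.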

Next Time

How could we generate the best possible prefix code to encode shakespeare.txt?

  • Want the smallest possible compressed file.