Hash tables with chaining!
Goal: implement the unordered set ADT (find, add, remove)
Assumption: given a hash function $h$
Have an array arr of lists; element x is stored in the list arr[h(x)]
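The layout described above can be sketched as follows. This is only an illustration: the class name ChainingSketch is hypothetical, and the method h uses Math.floorMod on hashCode() as a stand-in for whatever hash function is given.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Minimal sketch of a hash table with chaining (hypothetical class name).
class ChainingSketch<E> {
    // arr.get(i) is the list ("chain") of elements whose index is i.
    private final List<List<E>> arr = new ArrayList<>();

    ChainingSketch(int capacity) {
        for (int i = 0; i < capacity; i++) {
            arr.add(new LinkedList<>());
        }
    }

    // Stand-in for the given hash function h (an assumption of this sketch):
    // maps x to an index in 0, 1, ..., arr.size() - 1.
    private int h(E x) {
        return Math.floorMod(x.hashCode(), arr.size());
    }

    boolean find(E x) {
        return arr.get(h(x)).contains(x);
    }

    boolean add(E x) {
        if (find(x)) {
            return false; // a set does not store duplicates
        }
        arr.get(h(x)).add(x);
        return true;
    }

    boolean remove(E x) {
        return arr.get(h(x)).remove(x);
    }
}
```

Each operation only searches the single list at index h(x), so its cost is proportional to the length of that list.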
Example
red -> 2
orange -> 0
yellow -> 4
green -> 1
blue -> 0
indigo -> 2
violet -> 0
String s stores an array of chars
Each char has an associated ASCII value
Treat each char as a value from 0 to 255
Let n = s.length, interpret s as char array
Java computes
s.hashCode() = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-2]*31 + s[n-1]
where result is computed as an int
String s = "bake"
As a character array:
s = ['b', 'a', 'k', 'e']
= [98, 97, 107, 101]
Computing hash code:
int h = 98 * 31^3 + 97 * 31^2 + 107 * 31 + 101
= 3016153
You can confirm this with s.hashCode() in Java!
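The same polynomial can be evaluated with a simple loop (Horner's rule). The sketch below uses a hypothetical class name, StringHashSketch. Note that `^` in the formula above means exponentiation; in Java source code `^` is bitwise XOR, so the powers of 31 are built up by repeated multiplication instead.

```java
class StringHashSketch {
    // Evaluates s[0]*31^(n-1) + ... + s[n-2]*31 + s[n-1] using Horner's rule.
    // int arithmetic wraps around on overflow, as it does in String.hashCode().
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(hash("bake"));      // prints 3016153
        System.out.println("bake".hashCode()); // prints 3016153
    }
}
```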
hashCode to Index
The hashCode() method returns an arbitrary int value
x.hashCode() == y.hashCode() is “unlikely” unless x.equals(y)
For hash tables, we need an index in a specified range 0, 1,...,n-1!
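If x.hashCode() == y.hashCode(), then x and y get the same index
If x.hashCode() != y.hashCode(), then the index of x is “unlikely” to equal the index of y
Implementation: define method int getIndex(E x)
computes the index from x.hashCode()
returns a value in the range 0, 1,..., arr.length - 1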
Question. If n = arr.length, what was the issue with using i = x.hashCode() % n?
s.hashCode() = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-2]*31 + s[n-1]
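Two possible issues with the remainder approach can be seen in a small experiment. The class and method names below (ModIndexSketch, naiveIndex) are hypothetical, and the choice n = 31 is made only to make the second effect easy to see.

```java
class ModIndexSketch {
    // The naive index computation from the question above.
    static int naiveIndex(Object x, int n) {
        return x.hashCode() % n;
    }

    public static void main(String[] args) {
        // Issue 1: % keeps the sign of the hash code, so the result can be
        // negative and is then not a valid array index.
        // (The hash code of the Integer -7 is -7.)
        System.out.println(naiveIndex(-7, 4)); // prints -3

        // Issue 2: if n = 31, every term of the String formula except s[n-1]
        // is a multiple of 31, so for short strings (no int overflow) the
        // index depends only on the last character.
        for (String s : new String[] {"bake", "cake", "lake", "make"}) {
            System.out.println(s + " -> " + naiveIndex(s, 31)); // each prints 8
        }
    }
}
```

All four words end in 'e' (value 101), and 101 % 31 = 8, so they all land in the same list.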
Goal. Given an arbitrary int h = x.hashCode(), compute an index i from h such that:
i is in range 0, 1,...,n-1
if h != h' for another hash code h', then the associated indices i and i' are unlikely to be the same
Choose a random odd number r
Assume n = 2^k for some value of k, so an index is k bits
To compute index i from hash code h, take i to be the k most significant bits of h * r
00000000000000000000100010110000 (h)
* 10111000110110000111000111010101 (r)
= ???????????????????????????10000
protected int getIndex(E x) {
    // r is a fixed odd number that was randomly
    // chosen when the table was constructed
    // k is the log of the capacity (i.e., capacity = 2^k)
    // get the first k bits of r * x.hashCode()
    return ((r * x.hashCode()) >>> (32 - k));
}
The operator >>> is the unsigned bit shift operator
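To experiment with the method, it can be wrapped in a small self-contained class. The class name MultHashSketch and the way r is drawn (java.util.Random with the lowest bit forced to 1) are assumptions of this sketch, not part of the provided code. The sketch also assumes 1 <= k <= 31, because Java reduces int shift distances modulo 32, so a shift by 32 behaves like a shift by 0.

```java
import java.util.Random;

class MultHashSketch {
    final int r; // random odd multiplier, fixed at construction
    final int k; // capacity = 2^k; this sketch assumes 1 <= k <= 31

    MultHashSketch(int k, Random rng) {
        this.k = k;
        this.r = rng.nextInt() | 1; // setting the lowest bit makes r odd
    }

    int getIndex(Object x) {
        // The product wraps to 32 bits; >>> shifts in zeros from the left,
        // leaving the k most significant bits as a value in 0, ..., 2^k - 1.
        return (r * x.hashCode()) >>> (32 - k);
    }

    public static void main(String[] args) {
        MultHashSketch table = new MultHashSketch(4, new Random());
        for (String s : new String[] {"red", "orange", "yellow", "green"}) {
            // Each index is in 0..15; the values vary from run to run with r.
            System.out.println(s + " -> " + table.getIndex(s));
        }
    }
}
```

Using the signed shift >> instead would copy the sign bit into the vacated positions and could produce negative indices.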
Assume:
r is a randomly chosen odd number
x and y are objects with different hash codes
n = 2^k is the size of arr
Then the probability (over choice of r) that getIndex(x) == getIndex(y) is at most 1/n
Informally: collisions are as unlikely as possible!
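The claim is about the random choice of r, so it can be explored empirically by drawing many multipliers and counting how often two fixed hash codes collide. The sketch below is only an experiment, not a proof; the names CollisionRateSketch and collisionRate are hypothetical, and it assumes 1 <= k <= 31.

```java
import java.util.Random;

class CollisionRateSketch {
    // Fraction of random odd multipliers r under which hash codes hx and hy
    // are mapped to the same index of a table of size 2^k.
    static double collisionRate(int hx, int hy, int k, int trials, Random rng) {
        int collisions = 0;
        for (int t = 0; t < trials; t++) {
            int r = rng.nextInt() | 1; // random odd number
            if ((r * hx) >>> (32 - k) == (r * hy) >>> (32 - k)) {
                collisions++;
            }
        }
        return (double) collisions / trials;
    }

    public static void main(String[] args) {
        int k = 4; // n = 16, so 1/n = 0.0625
        double rate = collisionRate("bake".hashCode(), "cake".hashCode(),
                                    k, 100_000, new Random());
        // The observed value varies from run to run; compare it with 1/n.
        System.out.println("observed collision rate: " + rate);
    }
}
```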
Suppose we have a hash table with arr of size n storing at most n elements.
Question. If we try to add(x), where x is not stored in the hash table, what is the “expected” (average) occupancy of index getIndex(x)?
Assume x.hashCode() is distinct from the other hash codes
If arr has size n and stores at most n elements, then the expected (average) occupancy of index getIndex(x) is at most $1$!
Question. What does this mean in terms of the running time of add(x), find(x), and remove(x)?
If we:
choose r randomly
use the getIndex method as defined previously
maintain size <= arr.length
Then:
On average (over random choice of r), unordered set operations are supported in $O(1)$ time! (assuming distinct hashCode values)
HashUSet implements SimpleUSet using a hash table
getIndex method provided
Node<E> class provided
HashSimpleMap implements SimpleMap using HashUSet
SimpleMap stores key-value pairs
Uses a HashUSet that stores Pair<K, V>
Pairs with same key are stored in the hash table
Main Question. How much space (i.e., memory) is required to store a given set of data?
For your consideration: shakespeare.txt
How is shakespeare.txt stored/represented in computer memory?
Each char represented with 8 bits = 1 Byte:
'A' = 65 = 01000001
'B' = 66 = 01000010
'C' = 67 = 01000011
'D' = 68 = 01000100
...
256 total possible char values
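The 8-bit patterns in the table can be reproduced in Java. This is a small sketch with hypothetical names (CharBitsSketch, toBits); it assumes the char value is at most 255 so that 8 bits suffice.

```java
class CharBitsSketch {
    // Returns the 8-bit binary string for a char with value 0..255.
    static String toBits(char c) {
        String bits = Integer.toBinaryString(c);
        // Left-pad with zeros to a width of 8.
        return String.format("%8s", bits).replace(' ', '0');
    }

    public static void main(String[] args) {
        for (char c : "ABCD".toCharArray()) {
            System.out.println("'" + c + "' = " + (int) c + " = " + toBits(c));
        }
        // The last line printed is: 'D' = 68 = 01000100
    }
}
```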
shakespeare.txt contains 5.8 M characters, and the file size is 5.8 MB.
Do we need 5.8 MB to store Shakespeare?
Can we use this fact to compress shakespeare.txt into a smaller file?
Re-encode characters actually used by Shakespeare
How much space (number of bits) does the new encoding require?
How would we decode the newly encoded file?
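One way to approach the space question is to count the distinct characters in a text and ask how many bits a fixed-width code for just those characters would need. The sketch below uses hypothetical names (FixedWidthSketch, bitsPerChar) and a short sample string in place of shakespeare.txt. Decoding would additionally require storing the table that maps each codeword back to its character.

```java
import java.util.TreeSet;

class FixedWidthSketch {
    // Smallest number of bits b (at least 1) such that 2^b codewords are
    // enough to give each distinct character of text its own codeword.
    static int bitsPerChar(String text) {
        TreeSet<Character> used = new TreeSet<>();
        for (char c : text.toCharArray()) {
            used.add(c);
        }
        int bits = 1;
        while ((1 << bits) < used.size()) {
            bits++;
        }
        return bits;
    }

    public static void main(String[] args) {
        String text = "to be or not to be";
        int b = bitsPerChar(text); // 7 distinct characters, so 3 bits each
        System.out.println(b + " bits per char, "
                + (b * text.length()) + " bits total instead of "
                + (8 * text.length()));
    }
}
```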
Consider character frequencies of Shakespeare
8 characters account for more than half of the characters of shakespeare.txt!
char   ASCII     count
' '       32   1055175
'e'      101    445988
't'      116    315647
'o'      111    305115
'a'       97    265561
'h'      104    238932
's'      115    236293
'n'      110    235774
----------------------
total          3098485 (= 54% of chars)
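Counts like those in the table can be gathered with a map from characters to counters. This is a minimal sketch with hypothetical names (CharFrequencySketch, countChars), run on a short sample string rather than on shakespeare.txt.

```java
import java.util.HashMap;
import java.util.Map;

class CharFrequencySketch {
    // Returns a map from each character in text to its number of occurrences.
    static Map<Character, Integer> countChars(String text) {
        Map<Character, Integer> counts = new HashMap<>();
        for (char c : text.toCharArray()) {
            counts.merge(c, 1, Integer::sum); // insert 1 or add 1 to the count
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Character, Integer> counts = countChars("to be or not to be");
        System.out.println("' ' -> " + counts.get(' ')); // prints ' ' -> 5
        System.out.println("'o' -> " + counts.get('o')); // prints 'o' -> 4
    }
}
```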
How could we exploit frequency counts to further compress shakespeare.txt?
Have two tables: one for frequent characters, another for infrequent characters
If we have 2 char tables, how can we decode a string?
Frequent:
' ' -> 000
'e' -> 001
...
Infrequent:
'r' -> 0000000
'i' -> 0000001
'\n' -> 0000010
...
How to decode 000000101001100...?
When decoding, how do we distinguish between frequent and infrequent character from the encoded character?
Use an extra bit to indicate if the following (encoded) character is frequent (3 bits) or infrequent (7 bits).
Frequent:
' ' -> 0000
'e' -> 0001
...
Infrequent:
'r' -> 10000000
'i' -> 10000001
'\n' -> 10000010
...
Now frequent characters use 4 bits (always starting with 0), infrequent use 8 bits (always starting with 1)
Decode the string 010010000000100000010100
' ' -> 0000
'e' -> 0001
't' -> 0010
'o' -> 0011
'a' -> 0100
'r' -> 10000000
'i' -> 10000001
Start scanning bits from the first bit:
If the first bit is 0, the first four bits encode a frequent character
If the first bit is 1, the first eight bits encode an infrequent character
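The scanning procedure can be sketched in code. The class name TwoTableDecoderSketch is hypothetical, and the table contains only the seven codewords listed above, so any other codeword is reported as an error.

```java
import java.util.HashMap;
import java.util.Map;

class TwoTableDecoderSketch {
    static final Map<String, Character> CODES = new HashMap<>();
    static {
        CODES.put("0000", ' ');
        CODES.put("0001", 'e');
        CODES.put("0010", 't');
        CODES.put("0011", 'o');
        CODES.put("0100", 'a');
        CODES.put("10000000", 'r');
        CODES.put("10000001", 'i');
    }

    static String decode(String bits) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < bits.length()) {
            // Leading 0: frequent character (4 bits in total).
            // Leading 1: infrequent character (8 bits in total).
            int len = (bits.charAt(pos) == '0') ? 4 : 8;
            String codeword = bits.substring(pos, pos + len);
            Character c = CODES.get(codeword);
            if (c == null) {
                throw new IllegalArgumentException("unknown codeword " + codeword);
            }
            out.append(c.charValue());
            pos += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("010010000000100000010100")); // prints aria
    }
}
```

The string splits as 0100 | 10000000 | 10000001 | 0100, which is 'a', 'r', 'i', 'a'.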
If we use frequent/infrequent character encoding, what is the resulting size of shakespeare.txt?
5.8 MB
5.8 M * 7 / 8 = 5.1 MB
~ 5.8 M * (0.5 * 4 + 0.5 * 8) / 8 = 4.35 MB
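The estimate can be redone with the exact count of frequent characters from the frequency table instead of the rough 50/50 split. The names below (EncodedSizeSketch, encodedBytes) are hypothetical, and the total of 5,800,000 characters is the rounded figure quoted earlier, not an exact count.

```java
class EncodedSizeSketch {
    // Size in bytes when `frequent` characters use 4 bits each and the
    // remaining characters use 8 bits each.
    static double encodedBytes(long totalChars, long frequent) {
        long bits = 4 * frequent + 8 * (totalChars - frequent);
        return bits / 8.0;
    }

    public static void main(String[] args) {
        double bytes = encodedBytes(5_800_000L, 3_098_485L);
        System.out.println(bytes / 1e6 + " MB"); // roughly 4.25 MB
    }
}
```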
Why limit ourselves to just two types of characters (frequent/infrequent)?
General Situation:
Example:
' ' -> 00
'e' -> 010
't' -> 011
'o' -> 101
...
'X' -> 10011011
What properties of codewords are required to enable us to decode an encoded text?
What properties of codewords are desired to enable us to compress the original text?
What properties of codewords are required to enable us to decode an encoded text?
Unique decodability:
When reading individual bits, must know when I’ve reached the end of a character
Cannot have: one codeword is 1001 and another codeword starts 1001...
We say 1001 is a prefix of 1001011
Definition. A set of codewords is a prefix code if no codeword is a prefix of any other.
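The definition translates directly into a check over all pairs of codewords. This is a simple quadratic-time sketch with hypothetical names (PrefixCodeSketch, isPrefixCode); codewords are represented as strings of '0' and '1' characters.

```java
import java.util.Arrays;
import java.util.List;

class PrefixCodeSketch {
    // Returns true if no codeword is a prefix of a different codeword.
    static boolean isPrefixCode(List<String> codewords) {
        for (int i = 0; i < codewords.size(); i++) {
            for (int j = 0; j < codewords.size(); j++) {
                if (i != j && codewords.get(j).startsWith(codewords.get(i))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 1001 is a prefix of 1001011, so this is not a prefix code.
        System.out.println(isPrefixCode(Arrays.asList("1001", "1001011"))); // prints false
        System.out.println(isPrefixCode(Arrays.asList("00", "010", "011", "101"))); // prints true
    }
}
```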
Examples.
Any prefix code can be represented as a binary tree!
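Start at root:
the edges from each node to its children are labeled 0 and 1
each leaf stores a character, whose codeword is the sequence of 0s and 1s along the path from root to the leaf
Construct the binary tree for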
'a' -> 00
'b' -> 01
'c' -> 101
'd' -> 111
'e' -> 1101
'f' -> 1100
Use previous tree to decode 1100001011101111
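Decoding with the tree can be sketched as follows: follow one edge per bit, and whenever a leaf is reached, output its character and return to the root. The class and method names (PrefixTreeSketch, insert, decode, example) are hypothetical, and the tree is built from the six codewords listed above.

```java
class PrefixTreeSketch {
    // A node is a leaf (symbol != null) or an internal node with children.
    static final class Node {
        Node zero, one;
        Character symbol;
    }

    final Node root = new Node();

    // Adds the path for codeword, creating nodes as needed, and labels the leaf.
    void insert(String codeword, char symbol) {
        Node cur = root;
        for (char bit : codeword.toCharArray()) {
            if (bit == '0') {
                if (cur.zero == null) cur.zero = new Node();
                cur = cur.zero;
            } else {
                if (cur.one == null) cur.one = new Node();
                cur = cur.one;
            }
        }
        cur.symbol = symbol;
    }

    String decode(String bits) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char bit : bits.toCharArray()) {
            cur = (bit == '0') ? cur.zero : cur.one;
            if (cur == null) {
                throw new IllegalArgumentException("bits do not follow a codeword path");
            }
            if (cur.symbol != null) { // reached a leaf: emit and restart at root
                out.append(cur.symbol.charValue());
                cur = root;
            }
        }
        return out.toString();
    }

    // Builds the tree for the example code above.
    static PrefixTreeSketch example() {
        PrefixTreeSketch t = new PrefixTreeSketch();
        t.insert("00", 'a');
        t.insert("01", 'b');
        t.insert("101", 'c');
        t.insert("111", 'd');
        t.insert("1101", 'e');
        t.insert("1100", 'f');
        return t;
    }

    public static void main(String[] args) {
        System.out.println(example().decode("1100001011101111")); // prints faced
    }
}
```

The bit string splits as 1100 | 00 | 101 | 1101 | 111, which is 'f', 'a', 'c', 'e', 'd'.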
How could we generate the best possible prefix code to encode shakespeare.txt?