Hash tables with chaining!
Goal: implement unordered set ADT: find, add, remove
Assumption: given a hash function $h$
Have array arr of lists; element x is stored in list arr[h(x)]
Example
red -> 2
orange -> 0
yellow -> 4
green -> 1
blue -> 0
indigo -> 2
violet -> 0
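The layout in this example can be sketched in a few lines. This is only an illustration: the hash function h is treated as a given lookup table holding the values above, the capacity 5 is an assumption (the largest index shown is 4), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ChainingLayoutDemo {
    // Hash values copied from the example; h is a given lookup table here.
    static Map<String, Integer> exampleHash() {
        Map<String, Integer> h = new LinkedHashMap<>();
        h.put("red", 2);
        h.put("orange", 0);
        h.put("yellow", 4);
        h.put("green", 1);
        h.put("blue", 0);
        h.put("indigo", 2);
        h.put("violet", 0);
        return h;
    }

    // Builds arr, where arr.get(i) is the list of all elements x with h(x) == i.
    static List<List<String>> buildTable(Map<String, Integer> h, int capacity) {
        List<List<String>> arr = new ArrayList<>();
        for (int i = 0; i < capacity; i++) {
            arr.add(new ArrayList<>());
        }
        for (Map.Entry<String, Integer> e : h.entrySet()) {
            arr.get(e.getValue()).add(e.getKey());
        }
        return arr;
    }

    public static void main(String[] args) {
        List<List<String>> arr = buildTable(exampleHash(), 5); // capacity 5 is assumed
        for (int i = 0; i < arr.size(); i++) {
            System.out.println(i + ": " + arr.get(i));
        }
    }
}
```

Elements with the same hash value (orange, blue, violet) end up chained in the same list, which is where the name "chaining" comes from.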
String s stores an array of chars
Each char has an associated ASCII value: a value from 0 to 255
Let n = s.length, and interpret s as a char array
Java computes
s.hashCode() = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-2]*31 + s[n-1]
where result is computed as an int
String s = "bake"
As a character array:
s = ['b', 'a', 'k', 'e']
= [98, 97, 107, 101]
Computing hash code:
int h = 98 * 31^3 + 97 * 31^2 + 107 * 31 + 101
= 3016153
You can confirm this with s.hashCode() in Java!
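A minimal sketch of the same computation, assuming the polynomial is evaluated with Horner's rule. Note that ^ in the formula above means exponentiation, whereas in Java source code ^ is bitwise XOR, so the powers of 31 are built up by repeated multiplication instead. The names StringHashDemo and polyHash are hypothetical.

```java
class StringHashDemo {
    // Evaluates s[0]*31^(n-1) + ... + s[n-1] by Horner's rule.
    // int arithmetic wraps around on overflow, just as in hashCode().
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "bake";
        System.out.println(polyHash(s));
        System.out.println(s.hashCode());
    }
}
```

Both lines should print the same value for any string.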
hashCode to Index
The hashCode() method returns an arbitrary int value
x.hashCode() == y.hashCode() is “unlikely” unless x.equals(y)
For hash tables, we need an index in a specified range 0, 1, ..., n-1!
If x.hashCode() == y.hashCode(), then x and y get the same index (the index is computed from the hash code)
If x.hashCode() != y.hashCode(), then index of x is “unlikely” to equal index of y
Implementation: define method int getIndex(E x) that maps x.hashCode() to an index in 0, 1, ..., arr.length - 1
Question. If n = arr.length, what was the issue with using i = x.hashCode() % n?
s.hashCode() = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-2]*31 + s[n-1]
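Two issues with the remainder approach can be checked directly; they may or may not be the ones intended by the question. First, Java's % returns a negative result for a negative left operand, and hashCode() can be negative. Second, for short strings (no int overflow), every term of the formula except s[n-1] is a multiple of 31, so with n = 31 the index depends only on the last character. The class ModIndexDemo is a hypothetical name for this sketch.

```java
class ModIndexDemo {
    // The naive index computation from the question.
    static int modIndex(int hashCode, int n) {
        return hashCode % n;
    }

    public static void main(String[] args) {
        // A negative hash code yields a negative "index".
        System.out.println(modIndex(-7, 4));

        // With n = 31, short strings ending in the same char collide.
        for (String s : new String[] {"bake", "cake", "take"}) {
            System.out.println(s + " -> " + modIndex(s.hashCode(), 31));
        }
    }
}
```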
Goal. Given an arbitrary int h = x.hashCode(), compute an index i from h such that:
i is in range 0, 1, ..., n-1
if another hash code k satisfies h != k, then the associated indices i and j are unlikely to be the same
choose r to be a random odd number
assume n = 2^k for some value of k
an index then consists of k bits
to compute i from hash code h, take i to be the k most significant bits of h * r
00000000000000000000100010110000 (h)
* 10111000110110000111000111010101 (r)
= ???????????????????????????10000
protected int getIndex(E x) {
    // r is a fixed odd number that was randomly
    // chosen when the table was constructed
    // k is the log (base 2) of the capacity (i.e., capacity = 2^k)

    // get the first (most significant) k bits of r * x.hashCode()
    return ((r * x.hashCode()) >>> (32 - k));
}
The operator >>> is the unsigned right shift operator: the vacated high-order bits are filled with 0s, so the result is never negative when k >= 1
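A self-contained sketch of this index computation, using the values of h and r from the binary example above. The wrapper class MultiplicativeIndexDemo and its constructor checks are assumptions made for illustration; in particular, k is restricted to 1..30 because a shift by 32 - 0 = 32 would be treated by Java as a shift by 0.

```java
class MultiplicativeIndexDemo {
    final int r; // fixed odd multiplier
    final int k; // log2 of the capacity, i.e., capacity = 2^k

    MultiplicativeIndexDemo(int r, int k) {
        if ((r & 1) == 0) {
            throw new IllegalArgumentException("r must be odd");
        }
        if (k < 1 || k > 30) {
            throw new IllegalArgumentException("k must be in 1..30");
        }
        this.r = r;
        this.k = k;
    }

    // The k most significant bits of the 32-bit product r * x.hashCode().
    int getIndex(Object x) {
        return (r * x.hashCode()) >>> (32 - k);
    }

    public static void main(String[] args) {
        // r = 10111000110110000111000111010101 in binary, as in the example.
        MultiplicativeIndexDemo table = new MultiplicativeIndexDemo(0xB8D871D5, 4);
        // Integer.hashCode() is the int value itself, so this object has
        // hash code h = 2224 = 100010110000 in binary.
        Integer x = 2224;
        System.out.println(Integer.toBinaryString(table.r * x.hashCode()));
        System.out.println(table.getIndex(x));
    }
}
```

The first printed line is the low 32 bits of the product; the index is read off its leading k bits.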
Assume:
r is a randomly chosen odd number
x and y are objects with different hash codes
n = 2^k is the size of arr
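Then the probability (over choice of r) that getIndex(x) == getIndex(y) is at most 2/n (the standard bound for this multiply-and-shift scheme, within a factor of 2 of the ideal 1/n)
Informally: up to a factor of 2, collisions are as unlikely as possible!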
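The claim can be probed empirically. The sketch below fixes two different hash codes, samples many random odd values of r, and reports the fraction of samples in which both hash codes map to the same index; that fraction should be on the order of 1/n. The class name, the sample size, and the seed are all arbitrary choices for illustration, and a simulation like this is a sanity check, not a proof.

```java
import java.util.Random;

class CollisionRateDemo {
    // Same index computation as getIndex, with r and k passed explicitly.
    static int index(int r, int hashCode, int k) {
        return (r * hashCode) >>> (32 - k);
    }

    // Fraction of sampled odd multipliers r for which h1 and h2 collide.
    static double collisionRate(int h1, int h2, int k, int trials, long seed) {
        Random rng = new Random(seed);
        int collisions = 0;
        for (int t = 0; t < trials; t++) {
            int r = rng.nextInt() | 1; // setting the lowest bit makes r odd
            if (index(r, h1, k) == index(r, h2, k)) {
                collisions++;
            }
        }
        return (double) collisions / trials;
    }

    public static void main(String[] args) {
        int k = 4; // n = 2^4 = 16
        double rate = collisionRate("bake".hashCode(), "cake".hashCode(), k, 200_000, 42L);
        System.out.println("observed collision rate: " + rate);
        System.out.println("1/n = " + 1.0 / (1 << k));
    }
}
```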
Suppose we have a hash table with arr of size n storing at most n elements.
Question. If we try to add(x), where x is not stored in the hash table, what is the “expected” (average) occupancy of index getIndex(x)?
Assume x.hashCode() is distinct from other hash codes
If arr has size n and stores at most n elements, then the expected (average) occupancy of getIndex(x) is $O(1)$: each of the at most n stored elements lands at index getIndex(x) with probability $O(1/n)$
Question. What does this mean in terms of the running time of add(x), find(x), and remove(x)?
If we:
choose r randomly
use the getIndex method as defined previously
maintain size <= arr.length
Then:
On average (over random choice of r), unordered set operations are supported in $O(1)$ time! (assuming distinct hashCode values)
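Putting the pieces together, here is a minimal sketch of a set with chaining. It is not the course's HashUSet: the class name ChainedSetSketch, the method signatures, and the use of java.util.LinkedList for the chains are all assumptions. The capacity is fixed at 2^k, so the caller is responsible for keeping size <= arr.length (no resizing is done).

```java
import java.util.LinkedList;
import java.util.Random;

class ChainedSetSketch<E> {
    private final int k;               // capacity = 2^k, assumed 1 <= k <= 30
    private final int r;               // random odd multiplier
    private final LinkedList<E>[] arr; // arr[i] chains elements with index i
    private int size = 0;

    @SuppressWarnings("unchecked")
    ChainedSetSketch(int k) {
        this.k = k;
        this.r = new Random().nextInt() | 1; // force r to be odd
        this.arr = new LinkedList[1 << k];
        for (int i = 0; i < arr.length; i++) {
            arr[i] = new LinkedList<>();
        }
    }

    private int getIndex(Object x) {
        return (r * x.hashCode()) >>> (32 - k);
    }

    // Each operation only scans the single chain at getIndex(x).
    boolean find(E x) {
        return arr[getIndex(x)].contains(x);
    }

    boolean add(E x) {
        LinkedList<E> chain = arr[getIndex(x)];
        if (chain.contains(x)) {
            return false; // already present, sets hold no duplicates
        }
        chain.add(x);
        size++;
        return true;
    }

    boolean remove(E x) {
        if (arr[getIndex(x)].remove(x)) {
            size--;
            return true;
        }
        return false;
    }

    int size() {
        return size;
    }
}
```

Since every operation touches one chain, its running time is proportional to that chain's length, which is why the expected occupancy bound translates into expected $O(1)$ time.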
HashUSet implements SimpleUSet using a hash table
getIndex method provided
Node<E> class provided
HashSimpleMap implements SimpleMap using HashUSet
SimpleMap stores key-value pairs
HashUSet that stores Pair<K, V>
Pairs with same key are stored in the hash table
Main Question. How much space (i.e., memory) is required to store a given set of data?
For your consideration: shakespeare.txt
How is shakespeare.txt stored/represented in computer memory?
Each char is represented with 8 bits = 1 Byte:
'A' = 65 = 01000001
'B' = 66 = 01000010
'C' = 67 = 01000011
'D' = 68 = 01000100
...
256 total possible char values
shakespeare.txt contains 5.8 M characters, and the file size is 5.8 MB.
Do we need 5.8 MB to store Shakespeare?
Not all of the 256 possible char values are actually used in shakespeare.txt.
Can we use this fact to compress shakespeare.txt into a smaller file?
Re-encode characters actually used by Shakespeare
How much space (number of bits) does the new encoding require?
How would we decode the newly encoded file?
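One possible answer, sketched under assumptions: collect the distinct characters that actually occur, give each one a fixed-width code of b bits, where b is the smallest value with 2^b at least the number of distinct characters, and keep the character table alongside the encoded bits so the file can be decoded. Bits are represented here as a String of '0' and '1' for readability, not as packed bytes, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class FixedWidthCodeDemo {
    // Distinct characters of the text in sorted order; position = code number.
    static List<Character> alphabet(String text) {
        TreeSet<Character> set = new TreeSet<>();
        for (char c : text.toCharArray()) {
            set.add(c);
        }
        return new ArrayList<>(set);
    }

    // Smallest b >= 1 such that 2^b >= m.
    static int bitsPerChar(int m) {
        int b = 1;
        while ((1 << b) < m) {
            b++;
        }
        return b;
    }

    static String encode(String text, List<Character> alphabet) {
        int b = bitsPerChar(alphabet.size());
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            String bits = Integer.toBinaryString(alphabet.indexOf(c));
            for (int i = bits.length(); i < b; i++) {
                out.append('0'); // left-pad to exactly b bits
            }
            out.append(bits);
        }
        return out.toString();
    }

    // Decoding needs the same alphabet table that was used for encoding.
    static String decode(String bits, List<Character> alphabet) {
        int b = bitsPerChar(alphabet.size());
        StringBuilder out = new StringBuilder();
        for (int i = 0; i + b <= bits.length(); i += b) {
            int code = Integer.parseInt(bits.substring(i, i + b), 2);
            out.append(alphabet.get(code).charValue());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "to be or not to be";
        List<Character> alpha = alphabet(text);
        String bits = encode(text, alpha);
        System.out.println(alpha.size() + " distinct chars, " + bitsPerChar(alpha.size()) + " bits each");
        System.out.println(bits.length() + " bits instead of " + 8 * text.length());
        System.out.println(decode(bits, alpha));
    }
}
```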
Consider character frequencies of Shakespeare
8 characters account for more than half of the characters of shakespeare.txt!
char  ASCII    count
' '      32  1055175
'e'     101   445988
't'     116   315647
'o'     111   305115
'a'      97   265561
'h'     104   238932
's'     115   236293
'n'     110   235774
--------------------
total        3098485 (= 54% of chars)
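Counts like these can be gathered with a single pass over the text. The sketch below assumes the text is already in memory as a String and that every character of interest has a value below 256; it does not read shakespeare.txt itself, and the names are hypothetical.

```java
import java.util.Arrays;

class CharFrequencyDemo {
    // counts[c] = number of occurrences of the char with value c (0..255).
    static int[] countChars(String text) {
        int[] counts = new int[256];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 256) {
                counts[c]++;
            }
        }
        return counts;
    }

    // Fraction of all counted chars covered by the `top` most frequent ones.
    static double topFraction(int[] counts, int top) {
        int[] sorted = counts.clone();
        Arrays.sort(sorted); // ascending, so the largest counts are at the end
        long total = 0;
        for (int c : sorted) {
            total += c;
        }
        long covered = 0;
        for (int i = 0; i < top && i < sorted.length; i++) {
            covered += sorted[sorted.length - 1 - i];
        }
        return total == 0 ? 0.0 : (double) covered / total;
    }

    public static void main(String[] args) {
        int[] counts = countChars("to be or not to be");
        System.out.println("'o' occurs " + counts['o'] + " times");
        System.out.println("top 2 chars cover " + topFraction(counts, 2));
    }
}
```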
How could we exploit frequency counts to further compress shakespeare.txt?
Have two tables: one for frequent characters, another for infrequent characters
If we have 2 char tables, how can we decode a string?
Frequent:
' ' -> 000
'e' -> 001
...
Infrequent:
'r' -> 0000000
'i' -> 0000001
'\n' -> 0000010
...
How to decode 000000101001100...?
When decoding, how do we distinguish between frequent and infrequent character from the encoded character?
Use an extra bit to indicate if the following (encoded) character is frequent (3 bits) or infrequent (7 bits).
Frequent:
' ' -> 0000
'e' -> 0001
...
Infrequent:
'r' -> 10000000
'i' -> 10000001
'\n' -> 10000010
...
Now frequent characters use 4 bits (always starting with 0), infrequent use 8 bits (always starting with 1)
Decode the string 010010000000100000010100
' ' -> 0000
'e' -> 0001
't' -> 0010
'o' -> 0011
'a' -> 0100
'r' -> 10000000
'i' -> 10000001
Start scanning bits from the first bit:
if it is 0, the first four bits encode a frequent character
if it is 1, the first eight bits encode an infrequent character
Then repeat from the next unread bit.
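The scanning procedure can be sketched as follows, assuming the codeword table listed above (a leading 0 marks a 4-bit frequent codeword, a leading 1 an 8-bit infrequent one). Bits are again represented as a String of '0' and '1', and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class FlagBitDecoderDemo {
    // Codeword table copied from the example.
    static Map<String, Character> exampleTable() {
        Map<String, Character> t = new HashMap<>();
        t.put("0000", ' ');
        t.put("0001", 'e');
        t.put("0010", 't');
        t.put("0011", 'o');
        t.put("0100", 'a');
        t.put("10000000", 'r');
        t.put("10000001", 'i');
        return t;
    }

    static String decode(String bits, Map<String, Character> table) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < bits.length()) {
            // The flag bit determines how long the current codeword is.
            int len = (bits.charAt(i) == '0') ? 4 : 8;
            if (i + len > bits.length()) {
                throw new IllegalArgumentException("truncated input");
            }
            String codeword = bits.substring(i, i + len);
            Character c = table.get(codeword);
            if (c == null) {
                throw new IllegalArgumentException("unknown codeword " + codeword);
            }
            out.append(c.charValue());
            i += len; // continue with the next unread bit
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("010010000000100000010100", exampleTable()));
    }
}
```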
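If we use frequent/infrequent character encoding, what is the resulting size of shakespeare.txt?
Original (8 bits per char): 5.8 MB
Re-encoded with 7 bits per char: 5.8 M * 7 / 8 = 5.1 MB
Frequent/infrequent encoding (about half the chars use 4 bits, the rest 8): ~ 5.8 M * (0.5 * 4 + 0.5 * 8) / 8 = 4.3 MB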
Why limit ourselves to just two types of characters (frequent/infrequent)?
General Situation:
Example:
' ' -> 00
'e' -> 010
't' -> 011
'o' -> 101
...
'X' -> 10011011
What properties of codewords are required to enable us to decode an encoded text?
What properties of codewords are desired to enable us to compress the original text?
What properties of codewords are required to enable us to decode an encoded text?
Unique decodability:
When reading individual bits, must know when I’ve reached the end of a character
Cannot have: one codeword is 1001 and another codeword starts 1001...
We say 1001 is a prefix of 1001011
Definition. A set of codewords is a prefix code if no codeword is a prefix of any other.
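The definition translates directly into a check. This is a straightforward quadratic sketch over codewords given as Strings of '0' and '1'; the names are hypothetical, and two identical codewords are also rejected since each is a prefix of the other.

```java
import java.util.Arrays;
import java.util.List;

class PrefixCodeCheckDemo {
    // True if no codeword is a prefix of any other codeword in the list.
    static boolean isPrefixCode(List<String> codewords) {
        for (int i = 0; i < codewords.size(); i++) {
            for (int j = 0; j < codewords.size(); j++) {
                if (i != j && codewords.get(j).startsWith(codewords.get(i))) {
                    return false; // codeword i is a prefix of codeword j
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPrefixCode(Arrays.asList("1001", "1001011")));
        System.out.println(isPrefixCode(Arrays.asList("00", "010", "011", "101")));
    }
}
```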
Examples.
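Any prefix code can be represented as a binary tree!
Start at root:
the edges from each node to its children are labeled 0 and 1
each codeword corresponds to a leaf
the codeword is the sequence of 0s and 1s along the path from root to the leaf
Construct the binary tree for: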
'a' -> 00
'b' -> 01
'c' -> 101
'd' -> 111
'e' -> 1101
'f' -> 1100
Use previous tree to decode 1100001011101111
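For checking a hand-built tree, here is a minimal sketch that builds the tree from the codeword table above and decodes by walking from the root, following one edge per bit and restarting at the root whenever a leaf is reached. The Node class and all method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PrefixTreeDemo {
    static class Node {
        Node zero;        // child reached by reading a 0
        Node one;         // child reached by reading a 1
        Character symbol; // non-null only at leaves
    }

    static Map<Character, String> exampleCode() {
        Map<Character, String> code = new LinkedHashMap<>();
        code.put('a', "00");
        code.put('b', "01");
        code.put('c', "101");
        code.put('d', "111");
        code.put('e', "1101");
        code.put('f', "1100");
        return code;
    }

    // Inserts each codeword as a root-to-leaf path.
    static Node buildTree(Map<Character, String> code) {
        Node root = new Node();
        for (Map.Entry<Character, String> e : code.entrySet()) {
            Node cur = root;
            for (char bit : e.getValue().toCharArray()) {
                if (bit == '0') {
                    if (cur.zero == null) cur.zero = new Node();
                    cur = cur.zero;
                } else {
                    if (cur.one == null) cur.one = new Node();
                    cur = cur.one;
                }
            }
            cur.symbol = e.getKey();
        }
        return root;
    }

    static String decode(String bits, Node root) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char bit : bits.toCharArray()) {
            cur = (bit == '0') ? cur.zero : cur.one;
            if (cur == null) {
                throw new IllegalArgumentException("not a valid encoding");
            }
            if (cur.symbol != null) { // reached a leaf: emit and restart
                out.append(cur.symbol.charValue());
                cur = root;
            }
        }
        if (cur != root) {
            throw new IllegalArgumentException("input ends mid-codeword");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Node root = buildTree(exampleCode());
        System.out.println(decode("1100001011101111", root));
    }
}
```

Because no codeword is a prefix of another, the walk never passes through a leaf on its way to a different leaf, so the end of each character is detected without any separator bits.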
How could we generate the best possible prefix code to encode shakespeare.txt?