Lecture 19: Hash Tables

Overview

1. Unordered Sets
2. Hash Tables
3. The Art of Hashing

From Last Time

All efficient ($O(\log n)$ time) data structures we have seen so far have relied on the data type being Comparable

• location of element in data structure determined by relative order

Question. Can we achieve similar efficiency without comparability?

One Idea

Associate a numerical value to every possible element

• numbers are comparable, so just do comparison by number

Two Issues

1. How do we compute the numerical value consistently?
2. What do we do about collisions?

Hashing

Idea. Given an object instance obj, compute a numerical value from the data stored in obj

• value is called a hash value or hash code

Application.

• use hash value of obj to determine where in data structure obj should be stored

Goals.

1. Different objects should be unlikely to have same hash value
2. Should be able to specify range of possible values
3. Semantically equivalent objects should have same hash value
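Goal 3 can be illustrated with a small Java sketch. The class Point2D below is a hypothetical example, not something from the lecture: two instances with the same coordinates count as equal, so hashCode is computed only from the fields that equals compares.

```java
import java.util.Objects;

// Hypothetical example class: two Point2D objects with the same coordinates
// are semantically equivalent, so they must produce the same hash code.
final class Point2D {
    private final int x;
    private final int y;

    Point2D(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Point2D)) return false;
        Point2D p = (Point2D) other;
        return x == p.x && y == p.y;
    }

    // Computed only from the fields that equals compares (goal 3).
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}
```

If hashCode used a field that equals ignores, two equal points could land at different places in the data structure.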

Application: Hash Sets I

• add, find, remove methods

Assume. We have access to a hash function $h$

• for any object $x$, $h(x)$ is the hash code of $x$
• the range of values for $h$ can be specified

Application: Hash Sets II

Idea. Store elements in an array

• choose range of hash values to be $0, 1, \ldots, n-1$
• $n$ is array size
• to add, find, remove $x$, look at index $h(x)$

Example: Hashing Colors

$n = 6$

• green
• orange
• yellow
• blue

Uh Oh!

What do we do about collisions???

Chaining

Idea. Each entry of the array refers to the head of a linked list

• linked list at arr[i] stores all elements $x$ with hash values $h(x) = i$

Hash Set with Chaining

• store an array arr of heads of linked lists (this array is the hash table)
• assume hash function h has range 0, 1, ..., n-1 with n = arr.length

How To add(x)?

• red -> 2
• orange -> 0
• yellow -> 4
• green -> 1
• blue -> 0
• indigo -> 2
• violet -> 0

How To find(x)?

• red -> 2
• orange -> 0
• yellow -> 4
• green -> 1
• blue -> 0
• indigo -> 2
• violet -> 0

How To remove(x)?

• red -> 2
• orange -> 0
• yellow -> 4
• green -> 1
• blue -> 0
• indigo -> 2
• violet -> 0
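The three operations above can be sketched in Java as follows. This is a minimal sketch under assumptions: the class name ChainedHashSet, the boolean return values, and the choice of h (Java's hashCode reduced to the range 0..n-1) are illustrative, so the indices it produces will not match the color example on the slides.

```java
import java.util.LinkedList;

// Minimal sketch of a hash set with chaining (names are illustrative).
class ChainedHashSet<T> {
    private final LinkedList<T>[] arr;  // arr[i] holds all x with h(x) == i

    @SuppressWarnings("unchecked")
    ChainedHashSet(int n) {
        arr = (LinkedList<T>[]) new LinkedList[n];
        for (int i = 0; i < n; i++) {
            arr[i] = new LinkedList<>();
        }
    }

    // Assumed hash function: map hashCode() into the range 0..n-1.
    private int h(T x) {
        return Math.floorMod(x.hashCode(), arr.length);
    }

    // Returns true if x was added, false if it was already present.
    boolean add(T x) {
        LinkedList<T> chain = arr[h(x)];
        if (chain.contains(x)) return false;
        chain.addFirst(x);
        return true;
    }

    // Returns true if x is stored in the set.
    boolean find(T x) {
        return arr[h(x)].contains(x);
    }

    // Returns true if x was present and has been removed.
    boolean remove(T x) {
        return arr[h(x)].remove(x);
    }
}
```

Every operation touches only the one list at arr[h(x)], so its cost is the time to compute h(x) plus the length of that list.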

Running Time of operations?

Assume: computing $h(x)$ is $O(1)$*

* - this may not be justified!

What can Go Wrong?

Extreme example: h(x) = 0 always!

Too Many Elements!

Array size is fixed, but keep adding elements

• What is the running time?

Resize Challenge

If we resize to a larger array, say of size $2n$

• must update hash function $h$ to have range $0, 1, \ldots, 2n -1$

• this could change hash values of elements already in hash table
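One way to handle this is to re-insert every stored element using the hash function for the new range. The sketch below is an illustration, not necessarily how the course's implementation works: the class RehashSketch, its method names, and the hash function h(x, n) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class RehashSketch {
    // Assumed hash function with range 0..n-1.
    static int h(String x, int n) {
        return Math.floorMod(x.hashCode(), n);
    }

    // Build a table of the given size and place every element at its index
    // under the hash function for that size.
    static List<List<String>> buildTable(List<String> elements, int size) {
        List<List<String>> table = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            table.add(new ArrayList<>());
        }
        for (String x : elements) {
            table.get(h(x, size)).add(x);
        }
        return table;
    }

    // Resizing: collect every stored element and re-insert it using the
    // hash function for the new range 0..newSize-1.
    static List<List<String>> resize(List<List<String>> old, int newSize) {
        List<String> all = new ArrayList<>();
        for (List<String> chain : old) {
            all.addAll(chain);
        }
        return buildTable(all, newSize);
    }
}
```

Resizing this way touches every element, so a single resize costs time proportional to the number of elements stored.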

Picture So Far

1. A hash function $h$ takes an object instance $x$ and returns a value $h(x)$ in a specified range

2. Hash tables with chaining can support add/find/remove methods required by unordered set ADT (SimpleUSet)

3. Running time of, say, op(x) depends on occupancy of arr[h(x)]
   • can be as large as the number of elements stored in the table!
4. Efficiency depends on:
   1. data entered into table
   2. hash function used

Connections

Recall the (unbalanced) binary search tree:

• Running times of ops are $O(\mathrm{height})$
• Height can be as large as $n - 1$

Question 1. What was typical height for, say, Shakespeare?

Question 2. When did we expect typical height to be small?

Idea for Hashing

Use randomness (somehow)!

• red
• orange
• yellow
• green
• blue
• indigo
• violet

Is It So Simple?

Choose hash values randomly:

• just take $h(x)$ to be a random value in a given range

Will this work?

Need Consistency!

If we compute $h(x)$ repeatedly, must get the same value every time!

The Art of Hashing

Choose a hash function $h$ whose output “looks random”

• distinct values $x, y$ are “unlikely” to have $h(x) = h(y)$

The hash value is determined by the data stored in the instance, so it is not really random

Hashing Strings

String s stores an array of chars

• each char has associated ASCII value
• interpret a char as a value from 0 to 255

Let n = s.length, interpret s as char array

Java computes

s.hashCode() = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-2]*31 + s[n-1]


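where ^ denotes exponentiation (not Java's XOR operator) and the result is computed as an int, so large values wrap around on overflow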

Example

String s = "bake"

As a character array:

s = ['b', 'a', 'k', 'e']

= [98,   97, 107, 101]


Computing hash code:

int h = 98 * 31^3 + 97 * 31^2 + 107 * 31 + 101

= 3016153


You can confirm this with s.hashCode() in Java!
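The formula and the worked example above can be checked with a short Java sketch. The class StringHashSketch and the method polyHash are illustrative names. The loop uses Horner's rule, which evaluates the same polynomial without computing powers of 31 explicitly.

```java
class StringHashSketch {
    // Evaluates s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-2]*31 + s[n-1]
    // with Horner's rule. Arithmetic is done in int, so large values wrap
    // around on overflow.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(polyHash("bake"));   // prints 3016153
        System.out.println("bake".hashCode());  // prints 3016153
    }
}
```

Horner's rule uses one multiplication and one addition per character, so computing the hash takes time proportional to the length of the string.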

Question

s.hashCode() returns an int

• value could be anything in the int range!

How could we get a value from a prescribed range?

• e.g., want a value from 0 to n-1

Suggestion

If we want a value from 0 to n-1, use

• s.hashCode() % n

Is this good?
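One thing worth checking before answering: hashCode() returns a signed int, and in Java the % operator keeps the sign of its left operand, so s.hashCode() % n can be negative. A hedged sketch of the difference, with illustrative class and method names, uses Math.floorMod, which stays in the range 0 to n-1 for positive n.

```java
class RangeSketch {
    // Plain remainder: negative whenever hashCode() is negative and not a
    // multiple of n.
    static int remainderIndex(String s, int n) {
        return s.hashCode() % n;
    }

    // Math.floorMod always returns a value in 0..n-1 for positive n.
    static int floorModIndex(String s, int n) {
        return Math.floorMod(s.hashCode(), n);
    }

    public static void main(String[] args) {
        System.out.println(-7 % 6);                // prints -1
        System.out.println(Math.floorMod(-7, 6));  // prints 5
    }
}
```

A negative index would throw an exception when used to access an array, so a hash table would need the floorMod version or an equivalent adjustment.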

Challenge

Using

• s.hashCode() = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-2]*31 + s[n-1]
• i = s.hashCode() % n

(in the formula, n is the length of s; in the second line, n is the size of the range)

find a value of n and many strings s that hash to the same value of i
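One possible answer, offered as a sketch rather than the intended solution: take the range size to be 31. Every term of the formula except the last is a multiple of 31, so for short strings (where no int overflow occurs) the index depends only on the last character. The class name, method name, and word list below are illustrative.

```java
class CollisionSketch {
    // Index of s in a table of the given size.
    static int index(String s, int tableSize) {
        return Math.floorMod(s.hashCode(), tableSize);
    }

    public static void main(String[] args) {
        String[] words = {"bake", "cake", "lake", "make", "take"};
        for (String w : words) {
            // Each word ends in 'e' (value 101), and 101 % 31 == 8,
            // so every line prints index 8.
            System.out.println(w + " -> " + index(w, 31));
        }
    }
}
```

For longer strings the int computation overflows and the pattern is no longer exactly "last character mod 31", but the example shows how a range size related to the multiplier 31 can expose a hidden pattern.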

Conclusion

Need to be careful with constructing hash functions!

Balance requirements:

• random looking
• minimize likelihood of “hidden” patterns
• consistency
• efficiency

More on hashing!