Lecture 19: Hash Tables

Overview

Unordered Sets
Hash Tables
The Art of Hashing

From Last Time

All efficient ($O(\log n)$ time) data structures so far have relied on the element type being Comparable

location of element in data structure determined by relative order

Question. Can we achieve similar efficiency without comparability?

One Idea

Associate a numerical value to every possible element

numbers are comparable, so just do comparison by number

Two Issues

How do we compute the numerical value consistently?
What do we do about collisions?

Hashing

Idea. Given an object instance obj, compute a numerical value from the data stored in obj

value is called a hash value or hash code

Application.

use hash value of obj to determine where in data structure obj should be stored

Goals.

Different objects should be unlikely to have same hash value
Should be able to specify range of possible values
Semantically equivalent objects should have same hash value
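
The last goal can be illustrated in Java, where it corresponds to keeping equals and hashCode in agreement. The following is a minimal sketch, assuming a hypothetical Point class (not from the lecture) in which two points with the same coordinates count as semantically equivalent:

```java
import java.util.Objects;

// Hypothetical example class: two Points with equal coordinates are "semantically equivalent".
final class Point {
    private final int x;
    private final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Point)) return false;
        Point p = (Point) other;
        return x == p.x && y == p.y;
    }

    // Computed only from the fields that equals() compares,
    // so equal points always get equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}
```

With this definition, new Point(3, 4) and a second new Point(3, 4) are different instances but report the same hash value, so a hash-based structure would look for both in the same place.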

Application: Hash Sets I

Goal. Implement unordered set ADT

add, find, remove methods

Assume. have access to a hash function $h$

for any object $x$, $h(x)$ is the hash code of $x$
the range of values for $h$ can be specified

Application: Hash Sets II

Idea. Store elements in an array

choose range of hash values to be $0, 1, \ldots, n-1$
- $n$ is array size
to add, find, remove $x$, look at index $h(x)$

Example: Hashing Colors

$n = 6$

green
orange
yellow
blue

Uh Oh!

What do we do about collisions???

Chaining

Idea. Each entry of the array refers to the head of a linked list

linked list at arr[i] stores all elements $x$ with hash values $h(x) = i$

Hash Set with Chaining

store an array arr of heads of linked lists; this array is the hash table
assume hash function h has range 0, 1, ..., n-1 with n = arr.length

How To `add(x)`?

red -> 2
orange -> 0
yellow -> 4
green -> 1
blue -> 0
indigo -> 2
violet -> 0

How To `find(x)`?

red -> 2
orange -> 0
yellow -> 4
green -> 1
blue -> 0
indigo -> 2
violet -> 0

How To `remove(x)`?

red -> 2
orange -> 0
yellow -> 4
green -> 1
blue -> 0
indigo -> 2
violet -> 0
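
The three operations traced above all follow the same pattern: compute $h(x)$, then work only on the list stored at arr[h(x)]. Below is a minimal sketch in Java, not a definitive implementation. The class name ChainedHashSet, the boolean return values, and the choice of h (hashCode reduced to the array size) are assumptions for illustration, and elements are assumed to be non-null.

```java
import java.util.LinkedList;

// A minimal sketch of a hash set with chaining; names are illustrative, not from the lecture.
class ChainedHashSet<E> {
    private final LinkedList<E>[] arr;   // arr[i] holds every element x with h(x) == i

    @SuppressWarnings("unchecked")
    ChainedHashSet(int n) {
        arr = new LinkedList[n];
        for (int i = 0; i < n; i++) {
            arr[i] = new LinkedList<>();
        }
    }

    // Assumed hash function h with range 0, 1, ..., n-1
    // (floorMod keeps the index non-negative even for negative hash codes).
    private int h(E x) {
        return Math.floorMod(x.hashCode(), arr.length);
    }

    // Returns true if x was added, false if it was already present.
    boolean add(E x) {
        LinkedList<E> chain = arr[h(x)];
        if (chain.contains(x)) return false;
        chain.addFirst(x);
        return true;
    }

    // Returns true if x is stored in the set.
    boolean find(E x) {
        return arr[h(x)].contains(x);
    }

    // Returns true if x was present and has been removed.
    boolean remove(E x) {
        return arr[h(x)].remove(x);
    }
}
```

Each operation scans a single chain, so its cost is proportional to the number of elements that share the hash value $h(x)$.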

Running Time of operations?

Assume: computing $h(x)$ is $O(1)^*$

$^*$ this assumption may not be justified!

What can Go Wrong?

Bad Hash Functions

Extreme example: h(x) = 0 always!

Too Many Elements!

Array size is fixed, but keep adding elements

What is the running time?

How to Resize?

Resize Challenge

If we resize to a larger array, say of size $2n$:

must update hash function $h$ to have range $0, 1, \ldots, 2n -1$
this could change hash values of elements already in hash table

Resize Method
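
One common way to handle the challenge is to allocate the larger array and re-insert every stored element using the new range, since an element's index may change. The following is a sketch under stated assumptions: the class name ResizingHashSet, the doubling policy (resize once there would be more elements than array entries), and the use of hashCode reduced to the array size are illustrative choices, not taken from the lecture.

```java
import java.util.LinkedList;

// Sketch only: a chained hash set that doubles its array when it gets crowded.
class ResizingHashSet<E> {
    private LinkedList<E>[] arr;
    private int size = 0;   // number of stored elements

    ResizingHashSet(int n) {
        arr = newTable(n);
    }

    @SuppressWarnings("unchecked")
    private static <T> LinkedList<T>[] newTable(int n) {
        LinkedList<T>[] table = new LinkedList[n];
        for (int i = 0; i < n; i++) table[i] = new LinkedList<>();
        return table;
    }

    // Index of x in a table with n entries: the hash value restricted to 0, ..., n-1.
    private static int index(Object x, int n) {
        return Math.floorMod(x.hashCode(), n);
    }

    boolean find(E x) {
        return arr[index(x, arr.length)].contains(x);
    }

    boolean add(E x) {
        if (find(x)) return false;
        // Assumed policy: never store more elements than there are array entries.
        if (size + 1 > arr.length) resize();
        arr[index(x, arr.length)].addFirst(x);
        size++;
        return true;
    }

    // Allocate an array of size 2n and re-insert every element,
    // because its index under the new range may differ from the old one.
    private void resize() {
        LinkedList<E>[] bigger = newTable(2 * arr.length);
        for (LinkedList<E> chain : arr) {
            for (E x : chain) {
                bigger[index(x, bigger.length)].addFirst(x);
            }
        }
        arr = bigger;
    }

    int capacity() {
        return arr.length;
    }
}
```

A single resize touches every element, so it costs time proportional to the number of stored elements; doubling keeps such resizes infrequent.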

Picture So Far

A hash function $h$ takes an object instance $x$ and returns a value $h(x)$ in a specified range
Hash tables with chaining can support add/find/remove methods required by unordered set ADT (SimpleUSet)
Running time of, say, op(x) depends on occupancy of arr[h(x)]
- can be as large as $n$!
Efficiency depends on:
1. data entered into table
2. hash function used

Connections

Recall the (unbalanced) binary search tree:

Running times of ops are $O(\mathrm{height})$
Height can be as large as $n - 1$

Question 1. What was typical height for, say, Shakespeare?

Question 2. When did we expect typical height to be small?

Idea for Hashing

Use randomness (somehow)!

red
orange
yellow
green
blue
indigo
violet

Is It So Simple?

Choose hash values randomly:

just take $h(x)$ to be a random value in a given range

Will this work?

Need Consistency!

If we compute $h(x)$ repeatedly, must get the same value every time!

The Art of Hashing

Choose a hash function $h$ whose output “looks random”

distinct values $x, y$ are “unlikely” to have $h(x) = h(y)$

The hash value is determined by the data stored in the instance, so it is not really random

Hashing Strings

String s stores an array of chars

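each char has an associated numeric value (its ASCII value, for ASCII characters)
interpret a char as a number (a Java char is a 16-bit value from 0 to 65535; ASCII characters fall in 0 to 127)

Let n = s.length(), interpret s as char array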

Java computes

s.hashCode() = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-2]*31 + s[n-1]

where result is computed as an int

Example

String s = "bake"

As a character array:

s = ['b', 'a', 'k', 'e']

  = [98,   97, 107, 101]

Computing hash code:

int h = 98 * 31^3 + 97 * 31^2 + 107 * 31 + 101

      = 3016153

You can confirm this with s.hashCode() in Java!
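
As a check on the arithmetic, the polynomial can be evaluated by hand with Horner's rule and compared against the built-in method. This is a small sketch; the class name StringHashDemo and the helper polyHash are illustrative names, not part of the Java library.

```java
// Sketch: recompute the polynomial string hash by hand and compare with String.hashCode().
class StringHashDemo {
    // Horner's rule: ((s[0]*31 + s[1])*31 + s[2])*31 + ...
    // The arithmetic is done in int, so it silently wraps around on overflow.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "bake";
        System.out.println(polyHash(s));     // 3016153
        System.out.println(s.hashCode());    // 3016153
    }
}
```

Horner's rule needs only one multiplication and one addition per character, so computing the hash takes time proportional to the length of the string.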

Question

s.hashCode() returns an int

value could be anything in the int range!

How could we get a value from a prescribed range?

e.g., want a value from 0 to n-1

Suggestion

If we want a value from 0 to n-1, use

s.hashCode() % n

Is this good?
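
One practical wrinkle, separate from how well the values spread out: hashCode() can return a negative int, and in Java the % operator then gives a negative result, which is not a valid array index. A minimal sketch of one way around this uses Math.floorMod; the class name IndexDemo and the helper indexFor are illustrative.

```java
// Sketch: turning an arbitrary int hash code into an index in 0, 1, ..., n-1.
class IndexDemo {
    // Math.floorMod returns a result with the sign of n, so for n > 0 it is never negative.
    static int indexFor(int hashCode, int n) {
        return Math.floorMod(hashCode, n);
    }

    public static void main(String[] args) {
        System.out.println(-7 % 5);             // -2: not a valid array index
        System.out.println(indexFor(-7, 5));    // 3
    }
}
```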

Challenge

Using

s.hashCode() = s[0]*31^(k-1) + s[1]*31^(k-2) + ... + s[k-2]*31 + s[k-1]   (k = s.length())
i = s.hashCode() % n

find a value of n and many strings s that hash to same value of i

Bad Behavior
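
One possible answer to the challenge, offered as a sketch: take the array size to be 31. Every term of the hash code except the last is a multiple of 31, so as long as the int computation does not overflow (true for short strings), the index depends only on the last character. The class name BadModulusDemo and the helper stringsEndingIn are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: with array size 31, short strings that share a last character all collide.
class BadModulusDemo {
    // All two-letter lowercase strings ending in the given character.
    static List<String> stringsEndingIn(char last) {
        List<String> result = new ArrayList<>();
        for (char c = 'a'; c <= 'z'; c++) {
            result.add("" + c + last);
        }
        return result;
    }

    public static void main(String[] args) {
        int n = 31;
        // Each hash code is c*31 + 'e', so the remainder mod 31 is 'e' % 31 = 101 % 31 = 8.
        for (String s : stringsEndingIn('e')) {
            System.out.println(s + " -> " + s.hashCode() % n);   // every line ends in 8
        }
    }
}
```

All 26 strings land in the same chain, even though the hash codes themselves are all different.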

Conclusion

Need to be careful with constructing hash functions!

Balance requirements:

random looking
- minimize likelihood of “hidden” patterns
consistency
efficiency