Lecture 20: More Hashing

Overview

Review of Last Time
Subtleties and the Art of Hashing
Implementations

Last Time

Hash tables with chaining!

Goal: implement the unordered set ADT (find, add, remove)
Assumption: given a hash function $h$
- $h$ takes an object instance as its argument
- $h$ returns a hash code for the instance, drawn from a specified range

Hash Table with Chaining

Keep an array arr of lists; element x is stored in the list arr[h(x)]

Example

red -> 2
orange -> 0
yellow -> 4
green -> 1
blue -> 0
indigo -> 2
violet -> 0
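
As a rough illustration of the example above, the following sketch builds the array of lists for the seven colors. The class name ChainingSketch and the lookup table H are hypothetical: H simply hard-codes the hash values listed in the example, where a real table would compute them with a hash function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ChainingSketch {
    // Hash values copied from the example (an assumption standing in for h).
    static final Map<String, Integer> H = Map.of(
        "red", 2, "orange", 0, "yellow", 4, "green", 1,
        "blue", 0, "indigo", 2, "violet", 0);

    // Build arr: one list per index, each element x appended to arr[h(x)].
    static List<List<String>> build(String[] colors, int capacity) {
        List<List<String>> arr = new ArrayList<>();
        for (int i = 0; i < capacity; i++) {
            arr.add(new ArrayList<>());
        }
        for (String c : colors) {
            arr.get(H.get(c)).add(c);
        }
        return arr;
    }

    // find only has to search the single list arr[h(x)].
    static boolean find(List<List<String>> arr, String x) {
        return H.containsKey(x) && arr.get(H.get(x)).contains(x);
    }

    public static void main(String[] args) {
        String[] colors = {"red", "orange", "yellow", "green", "blue", "indigo", "violet"};
        List<List<String>> arr = build(colors, 5);
        for (int i = 0; i < arr.size(); i++) {
            System.out.println(i + ": " + arr.get(i));
        }
    }
}
```

Index 0 ends up holding three elements while index 3 holds none, which is the kind of uneven occupancy that slows down find, add, and remove.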

Also Last Time

Performance of operations depends on occupancy (size) of lists

occupancy determined by hash values of elements
a good hash function should “spread out” elements evenly among the possible hash values

The Art of Hashing

Choose a hash function $h$ whose output “looks random”

distinct values $x, y$ are “unlikely” to have $h(x) = h(y)$

The hash value is determined by the data stored in the instance, so it is not really random

Hashing Strings in Java

String s stores an array of chars

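each char has an associated numeric value (its ASCII value, for plain English text)
interpret a char as an integer (a Java char is 16 bits, so values range from 0 to 65535; ASCII characters fall in 0 to 127)

Let n = s.length() and view s as a char array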

Java computes

s.hashCode() = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-2]*31 + s[n-1]

where the result is computed as an int (arithmetic wraps around on overflow); here ^ denotes exponentiation, not Java's XOR operator

Example

String s = "bake"

As a character array:

s = ['b', 'a', 'k', 'e']

  = [98,   97, 107, 101]

Computing hash code:

int h = 98 * 31^3 + 97 * 31^2 + 107 * 31 + 101

      = 3016153

You can confirm this with s.hashCode() in Java!
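
A minimal sketch of that check, assuming a hypothetical helper polyHash that evaluates the polynomial with Horner's rule (multiply the running value by 31, then add the next char):

```java
class StringHashSketch {
    // Evaluates s[0]*31^(n-1) + ... + s[n-1] in int arithmetic.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(polyHash("bake"));    // prints 3016153
        System.out.println("bake".hashCode());   // prints 3016153
    }
}
```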

Question

s.hashCode() returns an int

value could be anything in the int range!

How could we get a value from a prescribed range?

e.g., want a value from 0 to n-1

Suggestion

If we want a value from 0 to n-1, use

s.hashCode() % n

Is this good?

Challenge

Using

s.hashCode() = s[0]*31^(len-1) + s[1]*31^(len-2) + ... + s[len-2]*31 + s[len-1], where len = s.length()
i = s.hashCode() % n

find a value of n and many strings s that hash to the same value of i

Bad Behavior
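
One possible answer to the challenge, sketched under the assumption that the strings are short enough that the int computation does not overflow: take 31 as the number of indices. Every term of the hash code except the last char is a multiple of 31, so the index depends only on the final character. The class and method names here are hypothetical.

```java
class BadModulusSketch {
    // Naive index: reduce the hash code modulo the table size.
    static int index(String s, int tableSize) {
        return s.hashCode() % tableSize;
    }

    public static void main(String[] args) {
        // All of these end in 'e' (value 101), and 101 % 31 == 8.
        String[] words = {"bake", "cake", "lake", "make", "take", "e"};
        for (String w : words) {
            System.out.println(w + " -> " + index(w, 31));   // every word lands at index 8
        }
    }
}
```

For longer strings the int arithmetic wraps around, which disturbs this exact pattern, but the example already shows how a poor choice of table size can expose structure in the hash codes.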

Conclusion

Need to be careful with constructing hash functions!

Balance requirements:

random looking
- minimize likelihood of “hidden” patterns
consistency
efficiency

Hashing in Java

Object class has a method int hashCode()!

hashCode() requirements from Java API:

if we have Object x and invoke x.hashCode() multiple times in the same execution, all calls must return the same value
- different values may result from different executions
if x.equals(y), then we must have x.hashCode() == y.hashCode()
it is not required that x.hashCode() != y.hashCode() whenever x.equals(y) is false

As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects.

Your Responsibility

When defining a new class, define a hashCode() method with the properties specified in the Java API
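
A minimal sketch of what this can look like, using a hypothetical class Point2D that is not part of the lecture. It overrides equals and hashCode together so that equal points always report equal hash codes; Objects.hash is one convenient way to combine fields, not the only one.

```java
import java.util.Objects;

final class Point2D {
    private final int x;
    private final int y;

    Point2D(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point2D)) return false;
        Point2D p = (Point2D) o;
        return x == p.x && y == p.y;
    }

    // Computed from exactly the fields that equals compares,
    // so x.equals(y) implies x.hashCode() == y.hashCode().
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}
```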

Remaining Challenge

The hashCode() method returns an arbitrary int value

assume x.hashCode() == y.hashCode() is “unlikely” unless x.equals(y)

For hash tables, we need a hash value in a specified range 0, 1,...,n-1!

cannot do anything about collisions x.hashCode() == y.hashCode()
want to ensure that if x.hashCode() != y.hashCode() then index of x is “unlikely” to equal index of y

Implementation: define method int getIndex(E x)

method uses x.hashCode()
returns a value in range 0, 1,..., arr.length - 1

What Might Be Problematic?

int capacity; // size of array

int getIndex(E x) {
    return x.hashCode() % capacity;
}
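
One concrete problem, sketched with hypothetical helper names: hashCode() may be negative, and in Java the % operator then yields a negative result, which is not a valid array index. Math.floorMod repairs the range, but it does nothing about patterns in the hash codes, so it is shown only for contrast.

```java
class ModIndexSketch {
    // The approach from the snippet above: may return a negative value.
    static int naiveIndex(int hashCode, int capacity) {
        return hashCode % capacity;
    }

    // Always in 0..capacity-1, but still sensitive to patterns in hashCode.
    static int floorModIndex(int hashCode, int capacity) {
        return Math.floorMod(hashCode, capacity);
    }

    public static void main(String[] args) {
        int h = Integer.valueOf(-7).hashCode();       // an Integer's hash code is its value
        System.out.println(naiveIndex(h, 4));         // prints -3
        System.out.println(floorModIndex(h, 4));      // prints 1
    }
}
```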

Better Hashing

Goal. Given an arbitrary int h = x.hashCode(), compute an index i from h such that:

i is in range 0, 1,...,n-1
if two hash codes h and h' differ, then their indices i and i' are unlikely to be the same
- how unlikely?

Binary

In Java, ints are 32-bit values

0 = 00000000000000000000000000000000
1 = 00000000000000000000000000000001
2 = 00000000000000000000000000000010
3 = 00000000000000000000000000000011
4 = 00000000000000000000000000000100
5 = 00000000000000000000000000000101
…

Last (right-most) bit is the least significant bit

First (left-most) bit is the “sign” bit

Operations

Operations in binary are like grade-school arithmetic, but with only 1s and 0s:

    00000000000000000000000000000011   (3)

 +  00000000000000000000000000000110   (6)

 =  00000000000000000000000000001001   (9)

    00000000000000000000000000000011   (3)

 *  00000000000000000000000000000110   (6)

 =  00000000000000000000000000010010   (18)
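
To experiment with these representations, a small sketch like the following prints ints as 32-bit binary strings. The helper name bits is hypothetical; Integer.toBinaryString omits leading zeros, so the helper pads the result.

```java
class BinarySketch {
    // Render an int as a 32-character binary string with leading zeros.
    static String bits(int v) {
        return String.format("%32s", Integer.toBinaryString(v)).replace(' ', '0');
    }

    public static void main(String[] args) {
        System.out.println(bits(3));
        System.out.println(bits(6));
        System.out.println(bits(3 + 6));   // 9  = ...1001
        System.out.println(bits(3 * 6));   // 18 = ...10010
        System.out.println(bits(-1));      // all 32 bits set, including the sign bit
    }
}
```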

Observation (Vague)

If we start with a non-zero number h and multiply it by a random odd number r, the higher-order bits of h * r tend to be random-ish

    00000000000000000000100010110000   (h)

  * 10111000110110000111000111010101   (r)
  
  = ???????????????????????????10000

Get Index Method

Choose an odd random number r
- fixed at beginning of execution (when we initialize hash table)
Maintain that array capacity is always a power of 2
- n = 2^k for some value of k
- indices encoded with k bits
To generate index i from hash code h, take i to be the k most significant bits of the 32-bit product h * r

    00000000000000000000100010110000   (h)

  * 10111000110110000111000111010101   (r)
  
  = ???????????????????????????10000

Implementation in Java

    protected int getIndex(E x) {
	// r is a fixed odd number that was randomly
	// chosen when the table was constructed
	// k is the base-2 log of the capacity (i.e., capacity = 2^k, with k >= 1)

	// get the k most significant bits of r * x.hashCode()
	return ((r * x.hashCode()) >>> (32 - k));
    }

The operator >>> is the unsigned right shift operator: it fills the vacated high-order bits with 0s, so the resulting index is never negative
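
The following self-contained sketch wraps the same computation in a hypothetical class MultShiftSketch so it can be run on its own. The seed parameter and the way r is drawn (a random int with its lowest bit forced to 1) are assumptions made for illustration; the field names r and k follow the code above.

```java
import java.util.Random;

class MultShiftSketch {
    final int r;   // random odd multiplier, fixed at construction
    final int k;   // capacity is 2^k

    MultShiftSketch(int k, long seed) {
        if (k < 1 || k > 30) {
            // k = 0 would shift by 32, which Java treats as a shift by 0
            throw new IllegalArgumentException("k must be between 1 and 30");
        }
        this.k = k;
        this.r = new Random(seed).nextInt() | 1;   // force the lowest bit to 1 (odd)
    }

    // Index = k most significant bits of the 32-bit product r * h.
    int getIndex(int h) {
        return (r * h) >>> (32 - k);
    }

    public static void main(String[] args) {
        MultShiftSketch t = new MultShiftSketch(4, 42L);
        for (String s : new String[] {"bake", "cake", "lake"}) {
            System.out.println(s + " -> " + t.getIndex(s.hashCode()));
        }
        System.out.println(-16 >>> 28);   // prints 15: zeros shifted in from the left
        System.out.println(-16 >> 28);    // prints -1: sign bit copied in from the left
    }
}
```

The last two lines contrast >>> with the signed shift >>, which would produce negative indices whenever the product is negative.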

Guarantee

    protected int getIndex(E x) {
	// r is a fixed odd number that was randomly
	// chosen when the table was constructed
	// k is the base-2 log of the capacity (i.e., capacity = 2^k, with k >= 1)

	// get the k most significant bits of r * x.hashCode()
	return ((r * x.hashCode()) >>> (32 - k));
    }

Assume

r is a randomly chosen odd number
x and y are objects with different hash codes
n = 2^k is the size of arr

Then the probability (over choice of r) that getIndex(x) == getIndex(y) is at most 2/n

Informally: collisions are nearly as unlikely as possible! (a truly random assignment of indices would collide with probability 1/n)
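
The guarantee can be probed empirically. The sketch below fixes two distinct hash codes, draws many random odd values of r, and reports how often the two indices coincide, so the observed rate can be compared with the stated bound. The class name, the particular hash codes, the seed, and the trial count are all arbitrary choices made for illustration; this is a simulation, not a proof.

```java
import java.util.Random;

class CollisionRateSketch {
    // Fraction of random odd multipliers r for which h1 and h2 get the same index.
    static double collisionRate(int h1, int h2, int k, int trials, long seed) {
        Random rng = new Random(seed);
        int collisions = 0;
        for (int t = 0; t < trials; t++) {
            int r = rng.nextInt() | 1;            // random odd 32-bit multiplier
            int i = (r * h1) >>> (32 - k);
            int j = (r * h2) >>> (32 - k);
            if (i == j) {
                collisions++;
            }
        }
        return (double) collisions / trials;
    }

    public static void main(String[] args) {
        int k = 4;                                // n = 2^4 = 16 indices
        double rate = collisionRate(12345, 67890, k, 200_000, 1L);
        System.out.println("observed collision rate: " + rate);
        System.out.println("1/n = " + 1.0 / (1 << k));
    }
}
```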

Wild Idea

To get from x.hashCode() to an index in 0, 1,..., n-1 with n = 2^k

pick a random function from 32-bit ints to k-bit values
- randomness from choice of r
use the same function (value of r) for every call to getIndex

A Miracle:

This method makes collisions unlikely, no matter what (distinct) hashCode() values are given!

Next Assignments

Assignment 08.

Complete an implementation of a hash table for SimpleUSet
Use implementation to implement a “hash map”

Assignment 09. (Optional)

Explore statistical properties of random assignments
“Balls in bins”