Lecture 20: More Hashing

Overview

  1. Review of Last Time
  2. Subtleties and the Art of Hashing
  3. Implementations

Last Time

Hash tables with chaining!

  • Goal: implement unordered set ADT: find, add, remove

  • Assumption: given a hash function $h$

    • $h$ takes an object instance as its argument
    • $h$ returns a hash code for the instance, drawn from a specified range

Hash Table with Chaining

Have array arr of lists, element x is stored in list arr[h(x)]

Example

  • red -> 2
  • orange -> 0
  • yellow -> 4
  • green -> 1
  • blue -> 0
  • indigo -> 2
  • violet -> 0
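The example above can be reproduced with a minimal sketch. The class name ChainingExample, the hard-coded toy function h, and the capacity of 5 are assumptions made for illustration; a real hash function would compute its value from the data in x.

```java
import java.util.ArrayList;
import java.util.List;

class ChainingExample {
    // Toy hash function that hard-codes the values from the example.
    // (Assumption for illustration only.)
    static int h(String x) {
        switch (x) {
            case "orange": case "blue": case "violet": return 0;
            case "green": return 1;
            case "red": case "indigo": return 2;
            case "yellow": return 4;
            default: return 3;
        }
    }

    // Build an array (here a List) of lists; element x goes in list arr[h(x)].
    static List<List<String>> build(String[] elements, int capacity) {
        List<List<String>> arr = new ArrayList<>();
        for (int i = 0; i < capacity; i++) {
            arr.add(new ArrayList<>());
        }
        for (String x : elements) {
            arr.get(h(x)).add(x);
        }
        return arr;
    }

    public static void main(String[] args) {
        String[] colors = {"red", "orange", "yellow", "green", "blue", "indigo", "violet"};
        List<List<String>> arr = build(colors, 5);
        for (int i = 0; i < arr.size(); i++) {
            System.out.println(i + ": " + arr.get(i));
        }
    }
}
```

List 0 ends up holding three elements while list 3 is empty, which is the kind of uneven occupancy that slows down find, add, and remove.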

Also Last Time

Performance of operations depends on occupancy (size) of lists

  • occupancy determined by hash values of elements
  • a good hash function should “spread out” elements evenly among the possible hash values

The Art of Hashing

Choose a hash function $h$ whose output “looks random”

  • distinct values $x, y$ are “unlikely” to have $h(x) = h(y)$

The hash value is determined by the data stored in the instance, so it is not really random

Hashing Strings in Java

String s stores an array of chars

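  • each char has an associated numeric value (its UTF-16 code unit, which matches ASCII for plain English text)
  • interpret a char as a non-negative integer (0 to 127 for ASCII characters; 0 to 65535 in general)

Let n = s.length(), interpret s as char array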

Java computes

s.hashCode() = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-2]*31 + s[n-1]

where ^ denotes exponentiation and the result is computed as an int (so it wraps around on overflow)

Example

String s = "bake"

As a character array:

s = ['b', 'a', 'k', 'e']

  = [98,   97, 107, 101]

Computing hash code:

int h = 98 * 31^3 + 97 * 31^2 + 107 * 31 + 101

      = 3016153

You can confirm this with s.hashCode() in Java!
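A minimal sketch of that check: the helper polyHash (a hypothetical name, not part of the Java API) evaluates the same polynomial with Horner's rule, so no explicit powers of 31 are needed.

```java
class StringHashDemo {
    // Evaluate s[0]*31^(n-1) + ... + s[n-1] using Horner's rule.
    // int arithmetic wraps around on overflow, just as in String.hashCode().
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "bake";
        System.out.println(polyHash(s));   // 3016153
        System.out.println(s.hashCode());  // 3016153
    }
}
```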

Question

s.hashCode() returns an int

  • value could be anything in the int range!

How could we get a value from a prescribed range?

  • e.g., want a value from 0 to n-1

Suggestion

If we want a value from 0 to n-1, use

  • s.hashCode() % n

Is this good?

Challenge

Using

  • s.hashCode() = s[0]*31^(len-1) + s[1]*31^(len-2) + ... + s[len-2]*31 + s[len-1], where len = s.length()
  • i = s.hashCode() % n

find a value of n and many strings s that hash to same value of i

Bad Behavior
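One possible answer to the challenge, as a sketch: take n = 31. Every term of the hash code except the last is a multiple of 31, so for short strings (no int overflow) the index depends only on the final character. The class name BadModulusDemo and the word list are assumptions chosen for illustration.

```java
class BadModulusDemo {
    // The suggested index computation: hash code modulo table size n.
    static int index(String s, int n) {
        return s.hashCode() % n;
    }

    public static void main(String[] args) {
        String[] words = {"bake", "cake", "fake", "lake", "make", "take", "wake"};
        for (String w : words) {
            // Each word ends in 'e' (value 101), and 101 % 31 == 8.
            System.out.println(w + " -> " + index(w, 31));
        }
    }
}
```

All seven words land at index 8, so a table of size 31 would put every one of them in the same list.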

Conclusion

Need to be careful with constructing hash functions!

Balance requirements:

  • random looking
    • minimize likelihood of “hidden” patterns
  • consistency
  • efficiency

Hashing in Java

Object class has a method int hashCode()!

hashCode() requirements from Java API:

  • if we have Object x and invoke x.hashCode() multiple times in the same execution, all calls must return the same value
    • different values may result from different executions
  • if x.equals(y), then we must have x.hashCode() == y.hashCode()
  • not required that x.equals(y) == false implies x.hashCode() != y.hashCode()

As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects.

Your Responsibility

When defining a new class, define a hashCode() method with properties specified in Java API
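As a minimal sketch of what this can look like, the hypothetical class GridPoint below overrides equals and hashCode together, computing both from the same fields so that equal objects always have equal hash codes. Using java.util.Objects.hash is one convenient choice, not the only one.

```java
import java.util.Objects;

// Hypothetical example class; the name and fields are for illustration only.
final class GridPoint {
    private final int x;
    private final int y;

    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof GridPoint)) return false;
        GridPoint p = (GridPoint) o;
        return x == p.x && y == p.y;
    }

    // Built from the same fields that equals compares, so
    // a.equals(b) implies a.hashCode() == b.hashCode().
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }

    public static void main(String[] args) {
        GridPoint a = new GridPoint(1, 2);
        GridPoint b = new GridPoint(1, 2);
        System.out.println(a.equals(b));                  // true
        System.out.println(a.hashCode() == b.hashCode()); // true
    }
}
```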

Remaining Challenge

The hashCode() method returns an arbitrary int value

  • assume x.hashCode() == y.hashCode() is “unlikely” unless x.equals(y)

For hash tables, we need a hash value in a specified range 0, 1,...,n-1!

  • cannot do anything about collisions x.hashCode() == y.hashCode()
  • want to ensure that if x.hashCode() != y.hashCode() then index of x is “unlikely” to equal index of y

Implementation: define method int getIndex(E x)

  • method uses x.hashCode()
  • returns a value in range 0, 1,..., arr.length - 1

What Might Be Problematic?

int capacity; // size of array

int getIndex(E x) {
    return x.hashCode() % capacity;
}
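One problem is easy to demonstrate with a short sketch: hashCode() may be negative, and in Java the % operator then gives a negative result, which is not a valid array index. The class and method names below are assumptions for illustration. Math.floorMod repairs the range, but it does nothing about hidden patterns in the hash codes.

```java
class NegativeIndexDemo {
    // Same computation as getIndex above, with capacity passed explicitly.
    static int naiveIndex(Object x, int capacity) {
        return x.hashCode() % capacity;
    }

    // Always returns a value in 0, 1, ..., capacity - 1.
    static int floorModIndex(Object x, int capacity) {
        return Math.floorMod(x.hashCode(), capacity);
    }

    public static void main(String[] args) {
        Integer x = -7; // Integer.hashCode() returns the int value itself
        System.out.println(naiveIndex(x, 5));    // -2
        System.out.println(floorModIndex(x, 5)); // 3
    }
}
```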

Better Hashing

Goal. Given an arbitrary int h = x.hashCode(), compute an index i from h such that:

  1. i is in range 0, 1,...,n-1
  2. if h and h' are distinct hash codes, then their indices i and i' are unlikely to be the same
    • how unlikely?

Binary

In Java, ints are 32-bit values

  • 0 = 00000000000000000000000000000000
  • 1 = 00000000000000000000000000000001
  • 2 = 00000000000000000000000000000010
  • 3 = 00000000000000000000000000000011
  • 4 = 00000000000000000000000000000100
  • 5 = 00000000000000000000000000000101

Last (right-most) bit is the least significant bit

First (left-most) bit is the “sign” bit

Operations

Operations in binary are like grade-school arithmetic, but with only 1s and 0s:

    00000000000000000000000000000011   (3)
  + 00000000000000000000000000000110   (6)
  = 00000000000000000000000000001001   (9)

    00000000000000000000000000000011   (3)
  * 00000000000000000000000000000110   (6)
  = 00000000000000000000000000010010   (18)
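The sum and product above can be checked in Java. In this sketch, the helper bits (a hypothetical name) pads the output of Integer.toBinaryString to 32 characters.

```java
class BinaryDemo {
    // 32-character binary representation of v, padded with leading zeros.
    static String bits(int v) {
        String b = Integer.toBinaryString(v);
        StringBuilder sb = new StringBuilder();
        for (int i = b.length(); i < 32; i++) {
            sb.append('0');
        }
        return sb.append(b).toString();
    }

    public static void main(String[] args) {
        System.out.println(bits(3 + 6)); // low-order bits 1001  (9)
        System.out.println(bits(3 * 6)); // low-order bits 10010 (18)
        System.out.println(bits(-1));    // sign bit (and every other bit) is 1
    }
}
```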

Observation (Vague)

If we start with a non-zero number h and multiply it by a random odd number r, the higher-order bits of h * r tend to be random-ish

    00000000000000000000100010110000   (h)

  * 10111000110110000111000111010101   (r)
  
  = ???????????????????????????10000

Get Index Method

  1. Choose an odd random number r
    • fixed at beginning of execution (when we initialize hash table)
  2. Maintain that array capacity is always a power of 2
    • n = 2^k for some value of k
    • indices encoded with k bits
  3. To generate index i from hash code h, take i to be the k most significant bits of h * r
    00000000000000000000100010110000   (h)

  * 10111000110110000111000111010101   (r)
  
  = ???????????????????????????10000

Implementation in Java

    protected int getIndex(E x) {
        // r is a fixed odd number that was randomly
        // chosen when the table was constructed
        // k is the log (base 2) of the capacity (i.e., capacity = 2^k)

        // get the k most significant bits of r * x.hashCode()
        return ((r * x.hashCode()) >>> (32 - k));
    }

The operator >>> is the unsigned right shift operator: vacated high-order bits are filled with 0s, so the result is never negative when k >= 1
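A self-contained sketch of the method, for experimentation. The class name MultiplicativeIndex, the constructor, and the use of java.util.Random to pick r are assumptions; the lecture only specifies that r is a random odd number fixed when the table is created. The check on k avoids the edge case k = 0, where the shift distance would be 32, which Java treats as a shift by 0.

```java
import java.util.Random;

class MultiplicativeIndex {
    final int k; // log base 2 of the capacity (capacity = 2^k)
    final int r; // random odd multiplier, fixed at construction

    MultiplicativeIndex(int k, Random rng) {
        if (k < 1 || k > 30) {
            throw new IllegalArgumentException("k must be between 1 and 30");
        }
        this.k = k;
        this.r = rng.nextInt() | 1; // setting the lowest bit makes r odd
    }

    // Index = the k most significant bits of r * x.hashCode().
    int getIndex(Object x) {
        return (r * x.hashCode()) >>> (32 - k);
    }

    public static void main(String[] args) {
        MultiplicativeIndex table = new MultiplicativeIndex(4, new Random());
        for (String s : new String[] {"bake", "cake", "fake", "lake"}) {
            // Indices vary from run to run because r is random.
            System.out.println(s + " -> " + table.getIndex(s));
        }
    }
}
```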

Guarantee

    protected int getIndex(E x) {
        // r is a fixed odd number that was randomly
        // chosen when the table was constructed
        // k is the log (base 2) of the capacity (i.e., capacity = 2^k)

        // get the k most significant bits of r * x.hashCode()
        return ((r * x.hashCode()) >>> (32 - k));
    }

Assume

  • r is a randomly chosen odd number
  • x and y are objects with different hash codes
  • n = 2^k is the size of arr

Then the probability (over choice of r) that getIndex(x) == getIndex(y) is at most 2/n

Informally: collisions are nearly as unlikely as possible! (Even a truly random assignment of indices would collide with probability 1/n.)
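The guarantee can be explored empirically. The sketch below fixes two distinct hash codes, draws many random odd values of r, and reports how often the two indices coincide, printing 1/n and 2/n for reference. The class name, method signature, trial count, and seed are all assumptions made for illustration; this is an experiment, not a proof.

```java
import java.util.Random;

class CollisionEstimate {
    // Fraction of random odd multipliers r for which hash codes hx and hy
    // receive the same k-bit index.
    static double collisionRate(int hx, int hy, int k, int trials, long seed) {
        Random rng = new Random(seed);
        int collisions = 0;
        for (int t = 0; t < trials; t++) {
            int r = rng.nextInt() | 1; // random odd multiplier
            int ix = (r * hx) >>> (32 - k);
            int iy = (r * hy) >>> (32 - k);
            if (ix == iy) {
                collisions++;
            }
        }
        return (double) collisions / trials;
    }

    public static void main(String[] args) {
        int k = 4;      // n = 2^4 = 16
        int n = 1 << k;
        double rate = collisionRate("red".hashCode(), "blue".hashCode(), k, 100_000, 42L);
        System.out.println("observed rate: " + rate);
        System.out.println("1/n = " + 1.0 / n + ", 2/n = " + 2.0 / n);
    }
}
```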

Wild Idea

To get from x.hashCode() to index from 0,1,...,n-1 with n = 2^k

  • pick a random function from 32-bit ints to k-bit values
    • randomness from choice of r
  • use the same function (value of r) for every call to getIndex

A Miracle:

  • This method makes collisions unlikely, no matter what (distinct) hashCode() values are given!

Next Assignments

Assignment 08.

  • Complete an implementation of a hash table for SimpleUSet
  • Use implementation to implement a “hash map”

Assignment 09. (Optional)

  • Explore statistical properties of random assignments
  • “Balls in bins”