# Lecture 20: More Hashing

## Overview

1. Review of Last Time
2. Subtleties and the Art of Hashing
3. Implementations

## Last Time

Hash tables with chaining!

• Goal: implement the unordered set ADT: find, add, remove

• Assumption: given a hash function $h$

• $h$ takes an object instance as its argument
• $h$ returns a hash code for that instance, drawn from a specified range

## Hash Table with Chaining

Keep an array arr of lists; element x is stored in the list arr[h(x)]

Example

• red -> 2
• orange -> 0
• yellow -> 4
• green -> 1
• blue -> 0
• indigo -> 2
• violet -> 0
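The example above can be reproduced with a minimal sketch of a chained table. The class name `ChainedSet`, the method `bucket`, and the hard-coded hash function `h` (which just returns the values listed above) are assumptions made for illustration only; a real table would compute `h` from the element.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class, not part of any library.
class ChainedSet {
    // arr.get(i) is the list (chain) of elements whose hash value is i
    private final List<List<String>> arr = new ArrayList<>();

    ChainedSet(int capacity) {
        for (int i = 0; i < capacity; i++) {
            arr.add(new ArrayList<>());
        }
    }

    // Stand-in for h: returns the hash values listed in the example.
    static int h(String x) {
        switch (x) {
            case "red": return 2;
            case "orange": return 0;
            case "yellow": return 4;
            case "green": return 1;
            case "blue": return 0;
            case "indigo": return 2;
            case "violet": return 0;
            default: throw new IllegalArgumentException("no hash value for " + x);
        }
    }

    // find only searches the one list that x could be stored in
    boolean find(String x) {
        return arr.get(h(x)).contains(x);
    }

    // add returns false if x was already present (set semantics)
    boolean add(String x) {
        if (find(x)) {
            return false;
        }
        arr.get(h(x)).add(x);
        return true;
    }

    boolean remove(String x) {
        return arr.get(h(x)).remove(x);
    }

    List<String> bucket(int i) {
        return arr.get(i);
    }

    public static void main(String[] args) {
        ChainedSet set = new ChainedSet(5);
        String[] colors = {"red", "orange", "yellow", "green", "blue", "indigo", "violet"};
        for (String c : colors) {
            set.add(c);
        }
        for (int i = 0; i < 5; i++) {
            System.out.println(i + ": " + set.bucket(i));
        }
    }
}
```

Running `main` shows three elements sharing list 0 and nothing in list 3, which is the uneven occupancy that a good hash function tries to avoid.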

## Also Last Time

Performance of operations depends on occupancy (size) of lists

• occupancy determined by hash values of elements
• a good hash function should “spread out” elements evenly among the possible hash values

## The Art of Hashing

Choose a hash function $h$ whose output “looks random”

• distinct values $x, y$ are “unlikely” to have $h(x) = h(y)$

The hash value is determined by the data stored in the instance, so it is not really random

## Hashing Strings in Java

String s stores an array of chars

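• each char has an associated numeric value (its ASCII value, for ASCII characters)
• interpret a char as an integer: Java chars are 16-bit, so values range from 0 to 65535 (ASCII characters fall in 0 to 127)

Let n = s.length(), and interpret s as a char array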

Java computes

s.hashCode() = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-2]*31 + s[n-1]


where ^ denotes exponentiation and the result is computed as an int (so overflow wraps around)

## Example

String s = "bake"

As a character array:

s = ['b', 'a', 'k', 'e']

= [98,   97, 107, 101]


Computing hash code:

int h = 98 * 31^3 + 97 * 31^2 + 107 * 31 + 101

= 3016153


You can confirm this with s.hashCode() in Java!
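A minimal sketch of that check: the hypothetical helper `polyHash` evaluates the polynomial with Horner's rule in int arithmetic, so it can be compared directly with the built-in `hashCode()`. The class and method names are made up for this example.

```java
class StringHashDemo {
    // Evaluates s[0]*31^(n-1) + ... + s[n-2]*31 + s[n-1] using Horner's rule.
    // All arithmetic is in int, so long strings overflow and wrap around.
    static int polyHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        String s = "bake";
        System.out.println(polyHash(s));   // prints 3016153
        System.out.println(s.hashCode());  // prints 3016153
    }
}
```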

## Question

s.hashCode() returns an int

• value could be anything in the int range!

How could we get a value from a prescribed range?

• e.g., want a value from 0 to n-1

## Suggestion

If we want a value from 0 to n-1, use

• s.hashCode() % n

Is this good?

## Challenge

Using

• s.hashCode() = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-2]*31 + s[n-1]
• i = s.hashCode() % n

find a value of n and many strings s that hash to the same value of i
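One possible answer, sketched under the assumption that the strings are short enough that the hash code does not overflow: take n = 31. Every term except the last is a multiple of 31, so the index depends only on the final character. The class name and the word list are chosen just for this illustration.

```java
class ModCollisionDemo {
    // Index from the suggestion: hash code reduced modulo n.
    static int index(String s, int n) {
        return s.hashCode() % n;
    }

    public static void main(String[] args) {
        int n = 31;
        // All of these end in 'e' (value 101), and 101 % 31 == 8.
        String[] words = {"bake", "cake", "lake", "make", "rake"};
        for (String s : words) {
            System.out.println(s + " -> " + index(s, n));  // each prints 8
        }
    }
}
```

Any set of short strings ending in the same character lands in the same list, however different the strings are otherwise.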

## Conclusion

Need to be careful with constructing hash functions!

Balance requirements:

• random looking
• minimize likelihood of “hidden” patterns
• consistency
• efficiency

## Hashing in Java

Object class has a method int hashCode()!

hashCode() requirements from Java API:

• if we have Object x and invoke x.hashCode() multiple times in the same execution, all calls must return the same value
• different values may result from different executions
• if x.equals(y), then we must have x.hashCode() == y.hashCode()
• it is not required that x.equals(y) == false implies x.hashCode() != y.hashCode()

As much as is reasonably practical, the hashCode method defined by class Object does return distinct integers for distinct objects.

When defining a new class, define a hashCode() method with the properties specified in the Java API
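As a sketch of what this might look like, here is a small hypothetical class `GridPoint` (not from the lecture) that overrides `equals` and `hashCode` together. It uses `java.util.Objects.hash` to combine the fields, which is one common choice rather than the only one.

```java
import java.util.Objects;

// Hypothetical example class: two points are equal when their coordinates match.
final class GridPoint {
    private final int x;
    private final int y;

    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof GridPoint)) {
            return false;
        }
        GridPoint p = (GridPoint) other;
        return x == p.x && y == p.y;
    }

    // Computed from exactly the fields that equals compares, so
    // equal points are guaranteed to have equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }

    public static void main(String[] args) {
        GridPoint a = new GridPoint(1, 2);
        GridPoint b = new GridPoint(1, 2);
        System.out.println(a.equals(b));                  // prints true
        System.out.println(a.hashCode() == b.hashCode()); // prints true
    }
}
```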

## Remaining Challenge

The hashCode() method returns an arbitrary int value

• assume x.hashCode() == y.hashCode() is “unlikely” unless x.equals(y)

For hash tables, we need a hash value in a specified range 0, 1,...,n-1!

• cannot do anything about collisions x.hashCode() == y.hashCode()
• want to ensure that if x.hashCode() != y.hashCode() then index of x is “unlikely” to equal index of y

Implementation: define method int getIndex(E x)

• method uses x.hashCode()
• returns a value in range 0, 1,..., arr.length - 1

## What Might Be Problematic?

    int capacity; // size of array

    int getIndex(E x) {
        return x.hashCode() % capacity;
    }
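One issue worth checking, sketched here with Integer keys (whose hash code is the int value itself): in Java, `%` keeps the sign of the left operand, so a negative hash code gives a negative index. The class name and the `getIndexFloorMod` variant are assumptions for illustration. `Math.floorMod` repairs the range but does nothing about hidden patterns in the hash codes.

```java
class NegativeIndexDemo {
    static int capacity = 16; // size of array

    // The method under discussion, specialised to Integer keys for this sketch.
    static int getIndex(Integer x) {
        return x.hashCode() % capacity;
    }

    // Math.floorMod returns a value in 0..capacity-1 when capacity is positive.
    static int getIndexFloorMod(Integer x) {
        return Math.floorMod(x.hashCode(), capacity);
    }

    public static void main(String[] args) {
        System.out.println(getIndex(-7));         // prints -7, not a valid array index
        System.out.println(getIndexFloorMod(-7)); // prints 9
    }
}
```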


## Better Hashing

Goal. Given an arbitrary int h = x.hashCode(), compute an index i from h such that:

1. i is in range 0, 1,...,n-1
2. if two hash codes h and h' are different, then their indices i and i' are unlikely to be the same
    • how unlikely?

## Binary

In Java, ints are 32-bit values

• 0 = 00000000000000000000000000000000
• 1 = 00000000000000000000000000000001
• 2 = 00000000000000000000000000000010
• 3 = 00000000000000000000000000000011
• 4 = 00000000000000000000000000000100
• 5 = 00000000000000000000000000000101

Last (right-most) bit is the least significant bit

First (left-most) bit is the “sign” bit
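To see these bit patterns directly, a small sketch can pad the output of `Integer.toBinaryString` to 32 characters. The helper name `bits` is an assumption for this example.

```java
class BitsDemo {
    // Returns the 32-bit binary representation of x, padded with leading zeros.
    static String bits(int x) {
        return String.format("%32s", Integer.toBinaryString(x)).replace(' ', '0');
    }

    public static void main(String[] args) {
        for (int x = 0; x <= 5; x++) {
            System.out.println(x + " = " + bits(x));
        }
        // Negative numbers have the first (sign) bit set to 1.
        System.out.println(-1 + " = " + bits(-1));
    }
}
```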

## Operations

Operations in binary are like grade-school arithmetic, but with only 1s and 0s:

      00000000000000000000000000000011
    + 00000000000000000000000000000110
    =

      00000000000000000000000000000011
    * 00000000000000000000000000000110
    =
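The two computations above can be checked with a short sketch (the helper `bits` is an assumed name, as before). It also shows that int arithmetic keeps only the low 32 bits of a result, which matters once we start multiplying hash codes.

```java
class BinaryArithmeticDemo {
    // Returns the 32-bit binary representation of x, padded with leading zeros.
    static String bits(int x) {
        return String.format("%32s", Integer.toBinaryString(x)).replace(' ', '0');
    }

    public static void main(String[] args) {
        int a = 0b11;   // 3
        int b = 0b110;  // 6
        System.out.println(bits(a + b)); // 9, ends in 1001
        System.out.println(bits(a * b)); // 18, ends in 10010

        // Overflow wraps around: only the low 32 bits of the true result are kept.
        System.out.println(Integer.MAX_VALUE + 1 == Integer.MIN_VALUE); // prints true
    }
}
```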


## Observation (Vague)

If we start with a non-zero number h and multiply it by a random odd number r, the higher-order bits of h * r tend to be random-ish

      00000000000000000000100010110000   (h)
    * 10111000110110000111000111010101   (r)
    = ???????????????????????????10000
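A small sketch makes the picture concrete, using the same h and r as above (copied as binary literals). It checks only the predictable part: because r is odd, the product has the same trailing zeros as h, so the low-order bits carry no randomness. The class and helper names are assumptions for this example.

```java
class LowBitsDemo {
    // Returns the 32-bit binary representation of x, padded with leading zeros.
    static String bits(int x) {
        return String.format("%32s", Integer.toBinaryString(x)).replace(' ', '0');
    }

    public static void main(String[] args) {
        int h = 0b00000000000000000000100010110000;
        int r = 0b10111000110110000111000111010101; // odd: last bit is 1
        int product = h * r; // int multiplication keeps the low 32 bits

        System.out.println(bits(h) + "   (h)");
        System.out.println(bits(r) + "   (r)");
        System.out.println(bits(product) + "   (h * r)");

        // The lowest set bit of h survives multiplication by an odd number.
        System.out.println(Integer.numberOfTrailingZeros(h));       // prints 4
        System.out.println(Integer.numberOfTrailingZeros(product)); // prints 4
    }
}
```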


## Get Index Method

1. Choose an odd random number r
    • fixed at beginning of execution (when we initialize hash table)
2. Maintain that array capacity is always a power of 2
    • n = 2^k for some value of k
    • indices encoded with k bits
3. To generate index i from hash code h, take i to be the k most significant bits of h * r

      00000000000000000000100010110000   (h)
    * 10111000110110000111000111010101   (r)
    = ???????????????????????????10000


## Implementation in Java

    protected int getIndex(E x) {
        // r is a fixed odd number that was randomly
        // chosen when the table was constructed
        // k is the log (base 2) of the capacity (i.e., capacity = 2^k)

        // get the first (most significant) k bits of r * x.hashCode()
        return (r * x.hashCode()) >>> (32 - k);
    }



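The operator >>> is the unsigned right shift operator: it fills the vacated high-order bits with zeros, so for 1 <= k <= 31 the result lies in the range 0, 1,..., 2^k - 1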
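For experimenting outside a full hash table, the method can be wrapped in a small self-contained class. This is only a sketch: the class name `MultShiftIndexer`, the constructor arguments, and the restriction 1 <= k <= 30 are assumptions, not part of the assignment code.

```java
import java.util.Random;

// Hypothetical stand-alone wrapper around the getIndex computation.
class MultShiftIndexer {
    private final int r; // random odd multiplier, fixed for the table's lifetime
    private final int k; // log (base 2) of the capacity

    MultShiftIndexer(int k, Random rng) {
        if (k < 1 || k > 30) {
            throw new IllegalArgumentException("k must be between 1 and 30");
        }
        this.k = k;
        this.r = rng.nextInt() | 1; // setting the lowest bit makes r odd
    }

    int capacity() {
        return 1 << k; // 2^k
    }

    // Index of x: the k most significant bits of r * x.hashCode().
    int getIndex(Object x) {
        return (r * x.hashCode()) >>> (32 - k);
    }

    public static void main(String[] args) {
        MultShiftIndexer indexer = new MultShiftIndexer(4, new Random());
        String[] words = {"bake", "cake", "lake", "make", "rake"};
        for (String w : words) {
            // Indices vary from run to run because r is chosen at random.
            System.out.println(w + " -> " + indexer.getIndex(w));
        }
    }
}
```

Negative hash codes need no special handling here, since the unsigned shift discards the sign.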

## Guarantee

    protected int getIndex(E x) {
        // r is a fixed odd number that was randomly
        // chosen when the table was constructed
        // k is the log (base 2) of the capacity (i.e., capacity = 2^k)

        // get the first (most significant) k bits of r * x.hashCode()
        return (r * x.hashCode()) >>> (32 - k);
    }



Assume

• r is a randomly chosen odd number
• x and y are objects with different hash codes
• n = 2^k is the size of arr

Then the probability (over choice of r) that getIndex(x) == getIndex(y) is at most 2/n

Informally: collisions are nearly as unlikely as possible! (A truly random assignment of indices would collide with probability 1/n.)
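The guarantee is a statement about the random choice of r, so it can be explored empirically. The following sketch fixes two hash codes, tries many random odd values of r, and reports how often the two indices coincide. The class name, the method `collisionRate`, and the trial count are assumptions for illustration; the printed estimate is a sample, not a proof.

```java
import java.util.Random;

class CollisionRateDemo {
    // Same computation as getIndex, with r and k passed in explicitly.
    static int getIndex(int r, int hashCode, int k) {
        return (r * hashCode) >>> (32 - k);
    }

    // Fraction of random odd multipliers r for which h1 and h2 get the same index.
    static double collisionRate(int h1, int h2, int k, int trials, long seed) {
        Random rng = new Random(seed);
        int collisions = 0;
        for (int t = 0; t < trials; t++) {
            int r = rng.nextInt() | 1; // uniformly random odd 32-bit number
            if (getIndex(r, h1, k) == getIndex(r, h2, k)) {
                collisions++;
            }
        }
        return (double) collisions / trials;
    }

    public static void main(String[] args) {
        int k = 4;      // capacity n = 2^k = 16
        int n = 1 << k;
        double rate = collisionRate("bake".hashCode(), "cake".hashCode(), k, 200_000, 42L);
        System.out.println("estimated collision probability: " + rate);
        System.out.println("1/n for comparison: " + 1.0 / n);
    }
}
```

Trying other pairs of hash codes and other values of k is a quick way to get a feel for how the collision probability behaves.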

## Wild Idea

To get from x.hashCode() to index from 0,1,...,n-1 with n = 2^k

• pick a random function from 32-bit ints to k-bit values
• randomness from choice of r
• use the same function (value of r) for every call to getIndex

A Miracle:

• This method makes collisions unlikely, no matter what (distinct) hashCode() values are given!

## Next Assignments

Assignment 08.

• Complete an implementation of a hash table for SimpleUSet
• Use implementation to implement a “hash map”

Assignment 09. (Optional)

• Explore statistical properties of random assignments
• “Balls in bins”