Imagine you have a collection of \(n\) bins (or buckets), labelled \(0\) through \(n - 1\), and consider the following process: pick a bin uniformly at random (i.e., choosing each bin \(b_i\) with probability \(1 / n\)), place a ball into the bin, and repeat. After the first iteration, there is one ball in one of the bins, while the others are empty. After the second iteration, there are two balls in the bins—possibly in two separate bins, or perhaps in the same bin the same bin may be repeated chosen in multiple iterations.
After repeating the procedure above \(m\) times, i.e. after throwing \(m\) balls indpendently at random into the bins, we might want to know about the distribution of balls in bins. Remarkably, the analysis of many randomized procedures boils down to understanding various statistics of the scenario above. In this lab, you will simulate the balls-into-bins process and investigate patterns that emerge.
Here, we describe the balls-int-bins random process a bit more formally. Each trial consists of throwing a single ball into a bin, and the outcome is the index of the chosen bin—i.e., a number from \(0\) to \(n - 1\). In each trial, the outcome should be chosen uniformly at random meaning that all \(n\) possible outcomes are equally likely, each occuring with probability \(1 / n\). The trials are independent in the sense that the outcome of each trial does not depend on outcomes of previous trials. Trials such as these are often referred to as iid trials, for “independent, identically distributed.”
We say that a trial results in a collision if the chosen bin (index) has already appeared in a previous trial. That is, a collision occurs when we place a ball in a bin that already contains one or more balls. Given a sequence of trials, the first collision is the first trial in which a collision occurs. Observe that a collision is guaranteed to happen after \(n + 1\) trials: if we have \(n + 1\) balls contained in \(n\) bins, then some bin is guaranteed to contain at least \(2\) balls (this fact is known as the pigeonhole principle). As you will see, however, it is typically the case that the first collision occurs much before the \((n + 1)\)st trial. In fact, the first collision is expected to happen so much sooner, this phenomenon has a name: the birthday paradox.
After some sequence of trials, the occupancy of bin \(i\) is the number of balls contained in \(i\). That is, the occupancy of \(i\) is the number of trials that resulted in an outcome of \(i\). The maximum occupancy of the sequence of trials is the maximum number of balls contained in any bin.
In Java, we can simulate iid trials as above using the Random class. Specifically, if we create
1 Random rand = new Random();
a sequence of calls to
rand.nextInt(n) will return a sequence of “random” numbers between
n - 1. (Technically, the sequence is not truly random, but is pseudorandom. That is, the sequence appears random, despite being produced by a deterministic process. Nonetheless, pseudorandomness will be good enough for our purposes.)
The Power of Two Choices
We can consider a variation of the balls into bins problem as follows. When choosing which bin in which to place a ball, instead of choosing a single bin, we choose two bins. If the two bins have the same occupancy, then the ball is placed in either bin; otherwise, the ball is placed in the smaller bin.
In this lab, you will write a program that simulates the balls into bins process and records various statistics of the procedure. For the basic balls into bins process with \(n\) buckets, your program should report the following.
The time of the first collision.
The maximum occupancy after the \(n\)th trial.
The number of trials needed until all buckets have occupancy at least 1.
Additionally, you should implement the “power of two choices” process described above and report the maximum occupancy after \(n\) trials (as in item 2 above).
To test your procedures, you should write a program that plots the statistics above for a range of values of \(n\). Specifically, you should do multiple runs of each simulation for each choice of \(n\), and report the maximum, minimum, and average values for items 1–3 above encountered for each choice of \(n\). Thus, your program should show not just how the statistics above grow with \(n\), but how the values themselves are distributed over multiple runs of the same simulation. Further, your program should be efficient. Specifically, it should be able to handle values of \(n\) on the order of 1 million.
One way to efficiently store the state of a simulation is to store an array of
ints, where the the
ith index stores the occupancy of bin
In order to perform item 3 above, your program must be able to check whether or not all bins have occupancy at least 1. You should use a more efficient procedure than checking the entire array of bin occupancies to see if they are all positive.
Plot the first collision trials (i.e., trial in which the first collision occurs) as a function of \(n\) for a wide range of values of \(n\). Be sure to do multiple trials for each \(n\) and report the maximum, minimum, and average values reported. How does the first collision time appear to scale with \(n\)? Linearly? Constant? Something in between?
For the case \(n = 365\), we can interpret the balls into bins process as follows: each \(i\) corresponds to a day of the year, and each ball represents a person’s birthday (assuming birthdays are evenly distributed throughout the year). Thus, a collision occurs when two people have the same birthday. According to your simulations, approximately how many people need to be in a class before we’d expect two people in the class to share a birthday? Is your result surprising?
For a wide range of values of \(n\), plot the maximum occupancy of \(n\) bins after \(n\) trials as a function of \(n\). Again, perform several simulations for for each \(n\) and plot the maximum, minimum, and average values you see. How quickly does this occupancy appear to grow?
Now do the same for the “power of two choices” version of the random process. How quickly does the occupancy appear to grow? How different is the maximum occupancy for this simulation from the previous (without power of two choices)?
For a wide range of values of \(n\), plot the number of trials needed until all bins have occupancy at least 1. How quickly does this function appear to grow? (It might be helpful to graph the number of trials divided by \(n\) to see if the growth is linear or super-linear).
What procedure did you use to check if all of the bins were occupied? What is the running time of your procedure per trial?
What to Submit
BallsIntoBins.javacontaining a main method that reports the statistics listed above
- be sure to include all files needed to run
- be sure to include all files needed to run
response.pdfcontaining your responses to the questions above
Your submission will receive a score out of 10 points, broken down as follows:
- 4 points for
- Code compiles and runs.
- Code is sensibly organized with helpful comments.
- Code performs the simulations as specified above and reports specified statistics.
- 6 points for responses to the questions above