Will Rosenbaum | Assignment 02

Overview

In this assignment you will provide two implementations of the SimpleUSet interface using a linked list as the underlying data structure. The first implementation is a straightforward implementation in which the relative positions of elements are unchanged upon search, insertion, and removal. The second implementation uses the “move-to-front” heuristic, in which whenever an element in the set is accessed, it is moved to the front of the list. You will then compare the running times of the two implementations for the task of measuring the size of vocabulary used in different texts.

Background

Previously, we discussed the List ADT. We analyzed several implementations, as well as more restricted ADTs (Deque, Stack, Queue) and implementations thereof. All of these ADTs represent a collection of elements stored in some particular order; in all cases the underlying state is a sequence of elements. In contrast, a set represents a collection of elements without any underlying order. Each element of a set occurs only once in the set, in contrast to Lists which can store many duplicates of the same element. Access to elements of a set is provided by three main methods:

\(\mathrm{find}(y)\): determine if the element \(y\) is contained in the set.
\(\mathrm{add}(y)\): add the element \(y\) to the set, if it was not already in the set.
\(\mathrm{remove}(y)\): remove the element \(y\) from the set, if present.

You can find a more formal description of set ADTs and variants in Simple Set ADTs. In this assignment, you will implement the the SimpleUSet ADT/interface.

It is possible to use a List to implement the operations of a SimpleUSet. Yet Lists are rigid in the sense that they maintain the relative order of elements they store, and changing this order might be costly (depending on implementation). Sets, in contrast, impose no restriction on relative orderings of items. Thus, it may be advantageous to store elements in a set in such a way that more frequently accessed items can be found more efficiently than less frequently accessed items.

Just as with the List interface, we can represent a SimpleUSet using a linked list. The idea is that each node of the linked list stores a single element in the SimpleUSet. To determine if an element is contained in the SimpleUSet, the nodes of the linked list are traversed in order until either the element is found, or the end of the list is reached (indicating the element is not in the set). To add an element, we first must search the linked list to see if the element is already present; if not the element can be inserted into the linked list. Removal of an element can be performed by first finding the node storing the element (if present), then removing the node as in our implementation of remove for LinkedSimpleList.

Simple Linked List Implementation

Probably the most straightforward way of using a linked list to implement a SimpleUSet is to do the following:

For find(x), traverse the linked list until a node storing y satisfying x.equals(y) is found or the end of the list is reached. In the former case, return y; in the latter case return null.
For add(x), traverse the linked list until a node storing y with x.equals(y) is found, or the last node is reached without finding such a y. In the former case, stop and return false. In the latter case, append a new node storing x and append it to the end of the linked list.
For remove(x), traverse the linked list until a node storing y with x.equals(y) is found, or the end of the list is reached. In the former case, remove the node storing y (as we did with LinkedSimpleList), and return y. In the latter case return null.

Observe that in all three of the methods above, the first step is to search for an element in the linked list. Since the linked list is traversed sequentially, how long this takes depends on the position of the element in the list: if the element is towards the end of the list—or not contained in the list at all—then this search requires scanning (almost) the entire list. On the other hand, elements towards the front of the list will be found much faster.

The Move-to-front Heuristic

In the “simple” implementation described above, the relative positions of elements in the linked list is fixed after each element is added (unless an element is removed and subsequently added again). The move-to-front heuristic is a strategy to keep more frequently accessed elements near the front of the list. Thus, the most frequently accessed elements will typically be found more quickly, leading—hopefully—to a faster running time.

More formally, the move-to-front heuristic modifies the searching portion of the find and add methods as follows:

When searching for a node containing x (or more precisely, a node containing y satisfying x.equals(y)) if such a node is found, move that node to the head of the linked list.

In a linked list, the process of moving a given node to the head of the list can be performed efficiently (in \(O(1)\) time). Thus, there is not much overhead in applying the move-to-front heuristic. Under some conditions, however, there could be significant improvement in performance for certain sequences of operations. For example, given two consecutive calls to find(x) with the same argument, the second call will succeed after examining the first node in the list.

Your Task

For this assignment, you will write two implementations of the SimpleUSet interface as described above:

LinkedSimpleUSet using the “simple” implementation using a linked list, and
MTFSimpleUSet which applies the move-to-front heuristic to the linked list.

You will then apply your SimpleUSet implementations to write a program that reads all of the words in a given (presumably large) text file, and reports the number of distinct words in the file. To this end, your program should maintain a SimpleUSet of the words it has read so far. Whenever a word is read from the file, it should be added to the SimpleUSet. Since the add method for SimpleUSet only adds the word if it is not already contained in the set, after adding every word from the text, the set will only contain the distinct words from the text—i.e., the size of the vocabulary in the text. You should apply this procedure for a few texts and compare the running times of your two SimpleUSet implementations.

Code

Please download hw02.zip containing the following files:

SimpleUSet.java defines the SimpleUSet interface.
WordReader.java is a tool for reading individual words from a text file.
CSVWriter.java is a tool for writing csv (comma separated value) files.
RunTimer.java is a tool for measuring running times of methods.
USetTester.java (forthcoming) tests your implementation for correctness.
A directory sample-texts containing texts of various sizes for your performance tests.

Questions

How does the performance of LinkedSimpleUSet and MTFSimpleUSet compare for the task of computing the number of distinct words in the texts you tested? Does the move-to-front heuristic seem to lead to a significant improvement? (Please include a table and/or graph of performance to support your answer.)
When using a SimpleUSet to count the number of distinct words in a text, under what conditions would you expect the move-to-front heuristic to give a significant performance increase? Would you expect (most) English language texts to satisfy some of these conditions?
Under what conditions would you expect the move-to-front heuristic not to give any significant performance improvement? Can you describe a text file for which you’d expect MTFSimpleUSet to be significantly less efficient than LinkedSimpleUSet?

What to Submit

LinkedSimpleUSet.java containing the simple SimpleUSet implementation described above.
MTFSimpleUSet.java containg the SimpleUSet implementation that applies the move-to-front heuristic.
responses.md containing your responses to the questions above.

Evaluation

Your submission will receive a score out of 10 points, broken down as follows:

3 points each for LinkedSimpleUSet.java and MTFSimpleUSet.java.
- code compiles, runs, and passes all tests in USetTester.java
- code is well organized and documented with comments
4 points for responses to the questions above