Assignment 02
Set implementation and heuristics
Overview
In this assignment you will provide two implementations of the SimpleUSet
interface using a linked list as the underlying data structure. The first implementation is a straightforward implementation in which the relative positions of elements are unchanged upon search, insertion, and removal. The second implementation uses the “move-to-front” heuristic, in which whenever an element in the set is accessed, it is moved to the front of the list. You will then compare the running times of the two implementations for the task of measuring the size of vocabulary used in different texts.
Background
Previously, we discussed the List
ADT. We analyzed several implementations, as well as more restricted ADTs (Deque
, Stack
, Queue
) and implementations thereof. All of these ADTs represent a collection of elements stored in some particular order; in all cases the underlying state is a sequence of elements. In contrast, a set represents a collection of elements without any underlying order. Each element of a set occurs only once in the set, in contrast to List
s which can store many duplicates of the same element. Access to elements of a set is provided by three main methods:
- \(\mathrm{find}(y)\): determine if the element \(y\) is contained in the set.
- \(\mathrm{add}(y)\): add the element \(y\) to the set, if it was not already in the set.
- \(\mathrm{remove}(y)\): remove the element \(y\) from the set, if present.
You can find a more formal description of set ADTs and variants in Simple Set ADTs. In this assignment, you will implement the the SimpleUSet
ADT/interface.
It is possible to use a List
to implement the operations of a SimpleUSet
. Yet List
s are rigid in the sense that they maintain the relative order of elements they store, and changing this order might be costly (depending on implementation). Sets, in contrast, impose no restriction on relative orderings of items. Thus, it may be advantageous to store elements in a set in such a way that more frequently accessed items can be found more efficiently than less frequently accessed items.
Just as with the List
interface, we can represent a SimpleUSet
using a linked list. The idea is that each node of the linked list stores a single element in the SimpleUSet
. To determine if an element is contained in the SimpleUSet
, the nodes of the linked list are traversed in order until either the element is found, or the end of the list is reached (indicating the element is not in the set). To add an element, we first must search the linked list to see if the element is already present; if not the element can be inserted into the linked list. Removal of an element can be performed by first finding the node storing the element (if present), then removing the node as in our implementation of remove
for LinkedSimpleList
.
Simple Linked List Implementation
Probably the most straightforward way of using a linked list to implement a SimpleUSet
is to do the following:
-
For
find(x)
, traverse the linked list until a node storingy
satisfyingx.equals(y)
is found or the end of the list is reached. In the former case, returny
; in the latter case returnnull
. -
For
add(x)
, traverse the linked list until a node storingy
withx.equals(y)
is found, or the last node is reached without finding such ay
. In the former case, stop and returnfalse
. In the latter case, append a new node storingx
and append it to the end of the linked list. -
For
remove(x)
, traverse the linked list until a node storingy
withx.equals(y)
is found, or the end of the list is reached. In the former case, remove the node storingy
(as we did withLinkedSimpleList
), and returny
. In the latter case returnnull
.
Observe that in all three of the methods above, the first step is to search for an element in the linked list. Since the linked list is traversed sequentially, how long this takes depends on the position of the element in the list: if the element is towards the end of the list—or not contained in the list at all—then this search requires scanning (almost) the entire list. On the other hand, elements towards the front of the list will be found much faster.
The Move-to-front Heuristic
In the “simple” implementation described above, the relative positions of elements in the linked list is fixed after each element is added (unless an element is removed and subsequently added again). The move-to-front heuristic is a strategy to keep more frequently accessed elements near the front of the list. Thus, the most frequently accessed elements will typically be found more quickly, leading—hopefully—to a faster running time.
More formally, the move-to-front heuristic modifies the searching portion of the find
and add
methods as follows:
- When searching for a node containing
x
(or more precisely, a node containingy
satisfyingx.equals(y)
) if such a node is found, move that node to the head of the linked list.
In a linked list, the process of moving a given node to the head of the list can be performed efficiently (in \(O(1)\) time). Thus, there is not much overhead in applying the move-to-front heuristic. Under some conditions, however, there could be significant improvement in performance for certain sequences of operations. For example, given two consecutive calls to find(x)
with the same argument, the second call will succeed after examining the first node in the list.
Your Task
For this assignment, you will write two implementations of the SimpleUSet
interface as described above:
LinkedSimpleUSet
using the “simple” implementation using a linked list, andMTFSimpleUSet
which applies the move-to-front heuristic to the linked list.
You will then apply your SimpleUSet
implementations to write a program that reads all of the words in a given (presumably large) text file, and reports the number of distinct words in the file. To this end, your program should maintain a SimpleUSet
of the words it has read so far. Whenever a word is read from the file, it should be added to the SimpleUSet
. Since the add
method for SimpleUSet
only adds the word if it is not already contained in the set, after add
ing every word from the text, the set will only contain the distinct words from the text—i.e., the size of the vocabulary in the text. You should apply this procedure for a few texts and compare the running times of your two SimpleUSet
implementations.
Code
Please download hw02.zip
containing the following files:
SimpleUSet.java
defines theSimpleUSet
interface.WordReader.java
is a tool for reading individual words from a text file.CSVWriter.java
is a tool for writing csv (comma separated value) files.RunTimer.java
is a tool for measuring running times of methods.USetTester.java
(forthcoming) tests your implementation for correctness.- A directory
sample-texts
containing texts of various sizes for your performance tests.
Questions
-
How does the performance of
LinkedSimpleUSet
andMTFSimpleUSet
compare for the task of computing the number of distinct words in the texts you tested? Does the move-to-front heuristic seem to lead to a significant improvement? (Please include a table and/or graph of performance to support your answer.) -
When using a
SimpleUSet
to count the number of distinct words in a text, under what conditions would you expect the move-to-front heuristic to give a significant performance increase? Would you expect (most) English language texts to satisfy some of these conditions? -
Under what conditions would you expect the move-to-front heuristic not to give any significant performance improvement? Can you describe a text file for which you’d expect
MTFSimpleUSet
to be significantly less efficient thanLinkedSimpleUSet
?
What to Submit
LinkedSimpleUSet.java
containing the simpleSimpleUSet
implementation described above.MTFSimpleUSet.java
containg theSimpleUSet
implementation that applies the move-to-front heuristic.responses.md
containing your responses to the questions above.
Evaluation
Your submission will receive a score out of 10 points, broken down as follows:
- 3 points each for
LinkedSimpleUSet.java
andMTFSimpleUSet.java
.- code compiles, runs, and passes all tests in
USetTester.java
- code is well organized and documented with comments
- code compiles, runs, and passes all tests in
- 4 points for responses to the questions above