Assignment 04: Linked List Sets

implementing the SimpleSet interface with linked lists and heuristics

Overview

In this assignment you will provide two implementations of the SimpleUSet interface using a linked list as the underlying data structure. The first implementation is a straightforward implementation in which the relative positions of elements are unchanged upon search, insertion, and removal. The second implementation uses the “move-to-front” heuristic, in which whenever an element in the set is accessed, it is moved to the front of the list. You will then compare the running times of the two implementations for the task of measuring the size of vocabulary used in different texts.

Background

Previously, we discussed the List ADT. We analyzed several implementations, as well as more restricted ADTs (Deque, Stack, Queue) and implementations thereof. All of these ADTs represent a collection of elements stored in some particular order; in all cases the underlying state is a sequence of elements. In contrast, a set represents a collection of elements without any underlying order. Each element of a set occurs only once in the set, in contrast to Lists which can store many duplicates of the same element. Access to elements of a set is provided by three main methods:

  • \(\mathrm{find}(y)\): determine if the element \(y\) is contained in the set.
  • \(\mathrm{add}(y)\): add the element \(y\) to the set, if it was not already in the set.
  • \(\mathrm{remove}(y)\): remove the element \(y\) from the set, if present.

You can find a more formal description of set ADTs and variants in Simple Set ADTs. In this assignment, you will implement the the SimpleUSet ADT/interface.

It is possible to use a List to implement the operations of a SimpleUSet. Yet Lists are rigid in the sense that they maintain the relative order of elements they store, and changing this order might be costly (depending on implementation). Sets, in contrast, impose no restriction on relative orderings of items. Thus, it may be advantageous to store elements in a set in such a way that more frequently accessed items can be found more efficiently than less frequently accessed items.

Just as with the List interface, we can represent a SimpleUSet using a linked list. The idea is that each node of the linked list stores a single element in the SimpleUSet. To determine if an element is contained in the SimpleUSet, the nodes of the linked list are traversed in order until either the element is found, or the end of the list is reached (indicating the element is not in the set). To add an element, we first must search the linked list to see if the element is already present; if not the element can be inserted into the linked list. Removal of an element can be performed by first finding the node storing the element (if present), then removing the node as in our implementation of remove for LinkedSimpleList.

Simple Linked List Implementation

Probably the most straightforward way of using a linked list to implement a SimpleUSet is to do the following:

  • For find(x), traverse the linked list until a node storing y satisfying x.equals(y) is found or the end of the list is reached. In the former case, return y; in the latter case return null.

  • For add(x), traverse the linked list until a node storing y with x.equals(y) is found, or the last node is reached without finding such a y. In the former case, stop and return false. In the latter case, append a new node storing x and append it to the end of the linked list.

  • For remove(x), traverse the linked list until a node storing y with x.equals(y) is found, or the end of the list is reached. In the former case, remove the node storing y (as we did with LinkedSimpleList), and return y. In the latter case return null.

Observe that in all three of the methods above, the first step is to search for an element in the linked list. Since the linked list is traversed sequentially, how long this takes depends on the position of the element in the list: if the element is towards the end of the list—or not contained in the list at all—then this search requires scanning (almost) the entire list. On the other hand, elements towards the front of the list will be found much faster.

The Move-to-front Heuristic

In the “simple” implementation described above, the relative positions of elements in the linked list is fixed after each element is added (unless an element is removed and subsequently added again). The move-to-front heuristic is a strategy to keep more frequently accessed elements near the front of the list. Thus, the most frequently accessed elements will typically be found more quickly, leading—hopefully—to a faster running time.

More formally, the move-to-front heuristic modifies the searching portion of the find and add methods as follows:

  • When searching for a node containing x (or more precisely, a node containing y satisfying x.equals(y)) if such a node is found, move that node to the head of the linked list.

In a linked list, the process of moving a given node to the head of the list can be performed efficiently. Thus, there is not much overhead in applying the move-to-front heuristic. Under some conditions, however, there could be significant improvement in performance for certain sequences of operations. For example, given two consecutive calls to find(x) with the same argument, the second call will succeed after examining the first node in the list.

Your Task

For this assignment, you will write two implementations of the SimpleUSet interface as described above:

  • LinkedSimpleUSet using the “simple” implementation using a linked list, and
  • MTFSimpleUSet which applies the move-to-front heuristic to the linked list.

You will then apply your SimpleUSet implementations to write a program that reads all of the words in a given (presumably large) text file, and reports the number of distinct uncommon words in the file. Specifically, you will be given a text file containing 10,000 common English words. Given another text file, say sample.txt, your program should read all of the words in sample.txt and report the number of distinct words that do not appear on the list of common words. To this end, your program should maintain two SimpleUSets: one that contains the common English words, and the other contains words from sample.txt that are not found in the set of common words. Whenever a word is read from sample.txt, if the word is not contained in the set of common words, it should be added to set of uncommon words. Since the add method for SimpleUSet only adds the word if it is not already contained in the set, after adding every uncommon word from sample.txt, the set will only contain the distinct uncommon words from the text. You should apply this procedure for a few texts and compare the running times of your two SimpleUSet implementations.

Code

Please download hw04.zip containing the following files:

  • SimpleUSet.java defines the SimpleUSet interface.
  • WordReader.java is a tool for reading individual words from a text file.
  • VocabularyTester.java which you will modify to run the tests described above
    • the provided version demonstrates the usage of WordReader.java
  • CSVWriter.java is a tool for writing csv (comma separated value) files.
  • RunTimer.java is a tool for measuring running times of methods.
  • USetTester.java (forthcoming) tests your implementation for correctness.
  • A directory sample-texts containing texts of various sizes for your performance tests, as well as common-words.txt, which is a list of 10,000 common English words.

Questions

  1. How does the performance of LinkedSimpleUSet and MTFSimpleUSet compare for the task of computing the number of distinct uncommon words in the texts you tested? Does the move-to-front heuristic seem to lead to a significant improvement? (Please include a table and/or graph of performance to support your answer.)

  2. When using a SimpleUSet to count the number of distinct uncommon words in a text, under what conditions would you expect the move-to-front heuristic to give a significant performance increase? Would you expect (most) English language texts to satisfy some of these conditions?

  3. Under what conditions would you expect the move-to-front heuristic not to give any significant performance improvement? Can you describe a text file for which you’d expect MTFSimpleUSet to be significantly less efficient than LinkedSimpleUSet?

What to Submit

  • LinkedSimpleUSet.java containing the simple SimpleUSet implementation described above.
  • MTFSimpleUSet.java containg the SimpleUSet implementation that applies the move-to-front heuristic.
  • A completed VocabularyTester.java that determines the number of distinct uncommon words in any desired text file sample.txt by running:
    • java VocabularyTester sample.txt
  • responses.pdf containing your responses to the questions above.

Evaluation

Your submission will receive a score out of 10 points, broken down as follows:

  • 5 points for LinkedSimpleUSet.java and MTFSimpleUSet.java.
    • code compiles, runs, and passes all tests in USetTester.java
    • code is well organized and documented with comments
  • 2 points for VocabularyTester.java
    • code compiles, runs, and accurately determines the number of distinct uncommon words in a text
    • code is well organized and documented with comments
  • 3 points for responses to the questions above