Lecture 23: Huffman Codes and Graphs

Announcement

  • HW 10 will be last required assignment
  • Deadline pushed back to May 13th (last day of classes)
  • Optional HW 11 will be posted soon

Overview

  1. Huffman Implementation Details
  2. Graphs

Last Time

Goal. Re-encode a (text) file to minimize file size.

Prefix Codes:

  • Each (used) char gets assigned a binary codeword
  • No codeword is a prefix of any other codeword
a -> 00
b -> 010
c -> 011
d -> 101
e -> 110
f -> 1110
g -> 1111
h -> 100
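
The prefix property can be checked mechanically. The following is a minimal sketch, not part of the original slides; the class name `PrefixCheck` and method name `isPrefixFree` are made up for illustration.

```java
import java.util.List;

class PrefixCheck {
    /** Returns true if no codeword is a prefix of a codeword at a different position. */
    static boolean isPrefixFree(List<String> codewords) {
        for (int i = 0; i < codewords.size(); i++) {
            for (int j = 0; j < codewords.size(); j++) {
                // Only skip comparing an entry with itself; a repeated
                // codeword counts as a violation because it is ambiguous.
                if (i != j && codewords.get(j).startsWith(codewords.get(i))) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

Calling `isPrefixFree` on the eight codewords listed above returns true, while a code such as `0`, `01` fails the check because `0` is a prefix of `01`.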

Prefix Codes as Trees

a -> 00
b -> 010
c -> 011
d -> 101
e -> 110
f -> 1110
g -> 1111
h -> 100

Example

Use previous tree to decode 111000011110
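
To check a hand decoding, one option is a small table-driven decoder. This is only a sketch under assumptions: it uses a codeword-to-character map rather than the tree itself, and the names `PrefixDecoder`, `decode`, and `lectureCode` are hypothetical. Because no codeword is a prefix of another, the first codeword that matches the bits read so far is the only possible match.

```java
import java.util.HashMap;
import java.util.Map;

class PrefixDecoder {
    /** Decodes a string of '0'/'1' characters using a codeword -> character table. */
    static String decode(String bits, Map<String, Character> table) {
        StringBuilder out = new StringBuilder();
        StringBuilder current = new StringBuilder(); // bits read since the last codeword
        for (char bit : bits.toCharArray()) {
            current.append(bit);
            Character c = table.get(current.toString());
            if (c != null) {          // a complete codeword has been read
                out.append(c);
                current.setLength(0); // start reading the next codeword
            }
        }
        if (current.length() > 0) {
            throw new IllegalArgumentException("input ends in the middle of a codeword");
        }
        return out.toString();
    }

    /** The example code from the slides above. */
    static Map<String, Character> lectureCode() {
        Map<String, Character> t = new HashMap<>();
        t.put("00", 'a');   t.put("010", 'b');  t.put("011", 'c');  t.put("101", 'd');
        t.put("110", 'e');  t.put("1110", 'f'); t.put("1111", 'g'); t.put("100", 'h');
        return t;
    }
}
```

Running `PrefixDecoder.decode("111000011110", PrefixDecoder.lectureCode())` gives a result to compare against the tree-based decoding done by hand.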

Huffman Coding

Idea. Start with all characters together with their frequency counts

  • each distinct character corresponds to a leaf in encoding tree
  • each node gets a weight
    • weight of a leaf = frequency count
    • weight of internal node = sum of frequency counts of its descendant leaves

Then: form a tree by “merging” nodes by adding a parent

  • pick the two lightest nodes without parents: $u$, $w$
  • create a parent $v$ for $u$ and $w$
  • continue until all nodes are connected

Huffman, Illustrated

Build Huffman tree for text ABAAABBAACCBAAADEA

Huffman More Formally

Node stores:

  • a char c (0 if internal Node)
  • an int weight
  • left (0) and right (1) child (both null if leaf)

Huffman Procedure

  1. Create a Node for each distinct character in text, weight is character frequency
  2. Add Nodes to a collection c
  3. While c.size() > 1:
    • remove 2 nodes from c with smallest weights: u, w
    • create new node v
      • v’s children are u and w
      • v.weight = u.weight + w.weight
    • add v to c
  4. Set tree root to unique Node in c
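
The procedure above can be sketched in Java as follows. This is one possible realization, not the required homework design: it assumes the collection is a `java.util.PriorityQueue` ordered by weight, and the names `HuffmanSketch`, `buildTree`, and `isLeaf` are made up for illustration. Ties between equal weights are broken arbitrarily, so the shape of the tree may differ from one drawn by hand.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

class HuffmanSketch {
    static class Node {
        char c;           // the character for a leaf; 0 for an internal node
        int weight;
        Node left, right; // both null for a leaf

        Node(char c, int weight, Node left, Node right) {
            this.c = c; this.weight = weight; this.left = left; this.right = right;
        }

        boolean isLeaf() { return left == null && right == null; }
    }

    /** Builds a Huffman tree for text; returns null if text is empty. */
    static Node buildTree(String text) {
        // Step 1: count how often each distinct character occurs.
        Map<Character, Integer> freq = new HashMap<>();
        for (char ch : text.toCharArray()) {
            freq.merge(ch, 1, Integer::sum);
        }
        // Step 2: put one leaf per character into a collection ordered by weight.
        PriorityQueue<Node> nodes =
            new PriorityQueue<>((a, b) -> Integer.compare(a.weight, b.weight));
        for (Map.Entry<Character, Integer> e : freq.entrySet()) {
            nodes.add(new Node(e.getKey(), e.getValue(), null, null));
        }
        // Step 3: repeatedly merge the two lightest nodes under a new parent.
        while (nodes.size() > 1) {
            Node u = nodes.poll();
            Node w = nodes.poll();
            nodes.add(new Node((char) 0, u.weight + w.weight, u, w));
        }
        // Step 4: the single remaining node is the root.
        return nodes.poll();
    }
}
```

For the text `ABAAABBAACCBAAADEA` from the illustration, the root returned by `buildTree` has weight 18, the length of the text.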

Question

Given a Huffman tree, how do we compute the resulting encoded size?
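
One way to answer this: a character at depth $d$ in the tree has a codeword of $d$ bits, so the encoded size in bits is $\sum_{c} \text{freq}(c) \cdot \text{depth}(c)$, summed over the leaves. The sketch below computes this recursively. The names `EncodedSize`, `leaf`, `merge`, and `encodedBits` are hypothetical, and the tree is built by hand to keep the example self-contained.

```java
class EncodedSize {
    static class Node {
        int weight;       // frequency count for a leaf, sum of children otherwise
        Node left, right; // both null for a leaf
    }

    static Node leaf(int frequency) {
        Node n = new Node();
        n.weight = frequency;
        return n;
    }

    static Node merge(Node u, Node w) {
        Node v = new Node();
        v.left = u;
        v.right = w;
        v.weight = u.weight + w.weight;
        return v;
    }

    /** Total number of bits used to encode the text with the code given by this tree. */
    static long encodedBits(Node root) {
        return encodedBits(root, 0);
    }

    private static long encodedBits(Node node, int depth) {
        if (node == null) return 0;
        if (node.left == null && node.right == null) {
            // Each occurrence of this character costs depth bits.
            // A tree consisting of a single leaf has depth 0 and needs special handling.
            return (long) node.weight * depth;
        }
        return encodedBits(node.left, depth + 1) + encodedBits(node.right, depth + 1);
    }
}
```

For `ABAAABBAACCBAAADEA` the counts are A = 10, B = 4, C = 2, D = 1, E = 1. A Huffman tree places them at depths 1, 2, 3, 4, 4, so the encoded size is $10 \cdot 1 + 4 \cdot 2 + 2 \cdot 3 + 1 \cdot 4 + 1 \cdot 4 = 32$ bits.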

Remarkable Fact

Theorem. Among all possible prefix codes for a given text, Huffman codes give the smallest possible encoded text.

Homework 10

Implement Huffman coding

  • Think about ADTs and data structures you’ll need
    • don’t need to implement new containers from scratch
  • Measure the size & compression ratio of encoding for different texts

Suggestion: it may be helpful to have 2 representations of Huffman code:

  1. Binary tree (for decoding)
  2. Something else (for encoding)

Thoughts on Data Compression

Huffman codes are optimal prefix codes

  • each character gets assigned a codeword

How large would Huffman encoding of text AAAAAAA...A ($n$ As) be?

Could We Do Better?

How else might we encode AAAAAAA...A ($n$ As)?

A Very Compressed File

void printString() {
    // n is the (fixed) number of As to print
    for (int i = 0; i < n; i++) {
        System.out.print("A");
    }
}

More Generally

The Kolmogorov Complexity of a string s is the length of the shortest program that can reproduce s

  • must use fixed programming language, etc.

Neat Idea. The document is a program

Remarkable Fact. There is no algorithm that can determine the Kolmogorov complexity of strings!!!

  • Mathematical impossibility result

Graphs

Motivation

So Far. “Highly structured” data structures

  • programmer specifies ADT/interface
    • user will specify sequence of operations
  • we choose how to store data to implement interface
  • we pick a representation of the data that is convenient

We controlled the state of the data structure

Real World

We do not get to decide where and how the data is stored

  • we must navigate the data as presented!

Example 1: Computer Networks

Example 2: Social Networks

Example 3: State Space of a Game

Commonalities of Examples

What do these examples have in common?

Mathematical Formalism: Graphs

A graph $G = (V, E)$ consists of

  • a set $V$ of vertices (a.k.a. nodes) $V = \{v_1, v_2, \ldots, v_n\}$
  • a set $E$ of edges, where each edge is a pair of vertices

Example. $V = \{1, 2, 3, 4, 5\}$, $E = \{(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (1, 3)\}$

Graph Terminology

  1. If $E$ contains $(u, v)$ then $u$ and $v$ are neighbors or adjacent
  2. Two variants:
    • undirected graph: if $u$ is $v$’s neighbor, then $v$ is $u$’s neighbor
    • directed graph: $(u, v)$ being an edge doesn’t imply $(v, u)$ is an edge

Previous Examples as Graphs

Computer Network

  • $V = $

  • $E = $

Social Network

  • $V = $

  • $E = $

Tic-Tac-Toe

  • $V = $

  • $E = $

Graphs We’ve Seen Already?

Graph ADT and Operations

How to build and use graphs?

  • boolean adjacent(u, v) return true if u and v are adjacent (i.e., (u, v) is an edge)
  • neighbors(u) return a List of vertices adjacent to u
  • addVertex(u) add a vertex to set of vertices, if not already present
  • removeVertex(u) remove u and edges containing u from graph
  • addEdge(u, v) add an edge from u to v if not already present
  • removeEdge(u, v) remove edge from u to v if present

(Can have others too…)

Example

addVertex(1)

addVertex(2)

addVertex(3)

addVertex(4)

addEdge(1, 2)

addEdge(2, 3)

addEdge(2, 4)

addEdge(3, 4)

neighbors(2)

Graph Representation

Question. How could we represent a graph?

  • What ADTs should/could we use?

Adjacency List Representation

Maintain a Map<E, List<E>>:

  • Keys are vertices (use datatype E)
  • Values are List<E>s of vertices (the neighbors of the key)
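
As a concrete picture of this layout, the sketch below builds the map for the earlier example graph with vertices 1 through 5. It assumes the graph is undirected, so each edge is recorded in both endpoints' lists; the names `AdjacencyListDemo` and `fromEdges` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AdjacencyListDemo {
    /** Builds adjacency lists from an array of {u, v} pairs (undirected assumption). */
    static Map<Integer, List<Integer>> fromEdges(int[][] edges) {
        Map<Integer, List<Integer>> adj = new HashMap<>();
        for (int[] e : edges) {
            // Create the list for a vertex the first time it is seen,
            // then record the other endpoint as a neighbor.
            adj.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
        }
        return adj;
    }
}
```

With the edges $(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (1, 3)$ added in that order, the key 1 maps to the list [2, 5, 3] and the key 4 maps to [3, 5].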

Implementation with Java Built-ins

HashMap<K, V>:

  • containsKey(K x)
  • get(K x)
  • put(K x, V y)
  • remove(K x)

ArrayList<E>:

  • add(E x)
  • contains(E x)
  • remove(E x)
  • get(int i)
  • size()

How to…

boolean adjacent(u, v) return true if u and v are adjacent (i.e., (u, v) is an edge)

How to…

neighbors(u) return a List of vertices adjacent to u

How to…

addVertex(u) add a vertex to set of vertices, if not already present

How to…

removeVertex(u) remove u and edges containing u from graph

How to…

addEdge(u, v) add an edge from u to v if not already present

How to…

removeEdge(u, v) remove edge from u to v if present
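
Putting the six operations together, here is one possible sketch using `HashMap` and `ArrayList`. It rests on assumptions the slides leave open: the graph is treated as undirected (each edge is stored in both endpoints' lists), `addEdge` adds any missing endpoint, and `neighbors` returns a copy of the stored list. The class name `AdjListGraph` is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AdjListGraph<E> {
    private final Map<E, List<E>> adj = new HashMap<>();

    boolean adjacent(E u, E v) {
        return adj.containsKey(u) && adj.get(u).contains(v);
    }

    List<E> neighbors(E u) {
        // Return a copy so callers cannot modify the graph; empty if u is absent.
        return adj.containsKey(u) ? new ArrayList<>(adj.get(u)) : new ArrayList<>();
    }

    void addVertex(E u) {
        if (!adj.containsKey(u)) {
            adj.put(u, new ArrayList<>());
        }
    }

    void removeVertex(E u) {
        if (adj.remove(u) == null) return; // u was not in the graph
        for (List<E> list : adj.values()) {
            list.remove(u); // remove(Object): deletes u from this neighbor list if present
        }
    }

    void addEdge(E u, E v) {
        addVertex(u); // assumption: missing endpoints are added automatically
        addVertex(v);
        if (!adj.get(u).contains(v)) adj.get(u).add(v);
        if (!adj.get(v).contains(u)) adj.get(v).add(u); // undirected assumption
    }

    void removeEdge(E u, E v) {
        if (adj.containsKey(u)) adj.get(u).remove(v);
        if (adj.containsKey(v)) adj.get(v).remove(u);
    }
}
```

Replaying the earlier example (vertices 1 through 4, then edges (1, 2), (2, 3), (2, 4), (3, 4)) on an `AdjListGraph<Integer>`, the call `neighbors(2)` returns [1, 3, 4] under the undirected assumption. Note that `adjacent`, `addEdge`, and `removeVertex` scan neighbor lists, so their cost grows with the number of neighbors (or, for `removeVertex`, with the size of the whole graph).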

Question for Next Time

Suppose we are given an arbitrary vertex $v$ in a graph. How could we determine if another vertex $u$ is reachable from $v$?