Consider shakespeare.txt: a sequence of char values.
Question. Could we represent shakespeare.txt using a smaller file?
Consider the number of distinct characters used by Shakespeare!
Can we use this fact to compress shakespeare.txt into a smaller file?
Re-encode characters actually used by Shakespeare
How much space (number of bits) does the new encoding require?
How would we decode the newly encoded file?
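As a sketch of the counting step (the sample text and helper names below are illustrative, not from these notes): with k distinct characters, a fixed-width re-encoding needs ⌈log₂ k⌉ bits per character.

```java
import java.util.HashSet;
import java.util.Set;

public class FixedWidthCode {
    // Number of bits needed to give each of k symbols a distinct codeword:
    // the smallest b with 2^b >= k (i.e., ceil(log2(k))).
    public static int bitsNeeded(int k) {
        int bits = 0;
        while ((1 << bits) < k) bits++;
        return bits;
    }

    // Count the distinct characters actually used by a text.
    public static int distinctChars(String text) {
        Set<Character> seen = new HashSet<>();
        for (char ch : text.toCharArray()) seen.add(ch);
        return seen.size();
    }

    public static void main(String[] args) {
        String sample = "to be or not to be";   // stand-in for shakespeare.txt
        int k = distinctChars(sample);
        System.out.println(k + " distinct chars, " + bitsNeeded(k) + " bits each");
    }
}
```

A file using at most 128 distinct characters, for instance, needs only 7 bits per character instead of 8.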
Consider the character frequencies of Shakespeare:
8 distinct characters account for more than half of the characters of shakespeare.txt!
char ASCII count
' ' 32 1055175
'e' 101 445988
't' 116 315647
'o' 111 305115
'a' 97 265561
'h' 104 238932
's' 115 236293
'n' 110 235774
-----------------
total 3098485 (= 54% of chars)
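Counts like the table above can be gathered with a simple frequency map. A sketch (class and method names are mine), run here on the small example text used later in these notes:

```java
import java.util.HashMap;
import java.util.Map;

public class FreqCount {
    // Map each character to the number of times it occurs in the text.
    public static Map<Character, Integer> counts(String text) {
        Map<Character, Integer> freq = new HashMap<>();
        for (char ch : text.toCharArray()) freq.merge(ch, 1, Integer::sum);
        return freq;
    }

    public static void main(String[] args) {
        Map<Character, Integer> f = counts("ABAAABBAACCBAAADEA");
        // Counts: A=10, B=4, C=2, D=1, E=1 (map print order may vary).
        System.out.println(f);
    }
}
```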
How could we exploit frequency counts to further compress shakespeare.txt?
Have two tables: one for frequent characters, another for infrequent characters
If we have 2 char tables, how can we decode a string?
Frequent:
' ' -> 000
'e' -> 001
...
Infrequent:
'r' -> 0000000
'i' -> 0000001
'\n' -> 0000010
...
How to decode 000000101001100...?
When decoding, how do we tell a frequent character from an infrequent one in the encoded stream?
Use an extra bit to indicate if the following (encoded) character is frequent (3 bits) or infrequent (7 bits).
Frequent:
' ' -> 0000
'e' -> 0001
...
Infrequent:
'r' -> 10000000
'i' -> 10000001
'\n' -> 10000010
...
Now frequent characters use 4 bits (always starting with 0), and infrequent characters use 8 bits (always starting with 1).
Decode the string 010010000000100000010100
' ' -> 0000
'e' -> 0001
't' -> 0010
'o' -> 0011
'a' -> 0100
'r' -> 10000000
'i' -> 10000001
Start scanning from the first bit:
If it is 0, the next four bits encode a frequent character.
If it is 1, the next eight bits encode an infrequent character.
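A decoding sketch for this flag-bit scheme (class name is mine), using only the table entries shown above:

```java
import java.util.HashMap;
import java.util.Map;

public class FlagDecoder {
    // Codewords from the slide: frequent = flag bit 0 + 3 code bits,
    // infrequent = flag bit 1 + 7 code bits (only a few entries shown).
    static final Map<String, Character> TABLE = new HashMap<>();
    static {
        TABLE.put("0000", ' ');
        TABLE.put("0001", 'e');
        TABLE.put("0010", 't');
        TABLE.put("0011", 'o');
        TABLE.put("0100", 'a');
        TABLE.put("10000000", 'r');
        TABLE.put("10000001", 'i');
    }

    public static String decode(String bits) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < bits.length()) {
            // Flag bit 0: read 4 bits total; flag bit 1: read 8 bits total.
            int len = (bits.charAt(i) == '0') ? 4 : 8;
            out.append(TABLE.get(bits.substring(i, i + len)));
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("010010000000100000010100"));  // prints "aria"
    }
}
```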
If we use frequent/infrequent character encoding, what is the resulting size of shakespeare.txt
?
Original (8 bits per character): 5.8 MB
Re-encoded at 7 bits per character: 5.8 M × 7 / 8 = 5.1 MB
Frequent/infrequent encoding: ≈ 5.8 M × (0.5 × 4 + 0.5 × 8) / 8 = 4.3 MB
Why limit ourselves to just two types of characters (frequent/infrequent)?
General situation: codewords may have different lengths.
Example:
' ' -> 00
'e' -> 010
't' -> 011
'o' -> 101
...
'X' -> 10011011
What properties of codewords are required to enable us to decode an encoded text?
What properties of codewords are desired to enable us to compress the original text?
What properties of codewords are required to enable us to decode an encoded text?
Unique decodability: when reading individual bits, we must know when we have reached the end of a character.
Cannot have: one codeword is 1001 and another codeword starts 1001...
We say 1001 is a prefix of 1001011.
Definition. A set of codewords is a prefix code if no codeword is a prefix of any other.
Examples.
Any prefix code can be represented as a binary tree!
Start at the root: 0 means follow the left child, 1 the right child.
Each character sits at a leaf; its codeword is the sequence of 0s and 1s along the path from the root to that leaf.
Construct the binary tree for:
'a' -> 00
'b' -> 01
'c' -> 101
'd' -> 111
'e' -> 1101
'f' -> 1100
Use previous tree to decode 1100001011101111
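The tree construction and the decoding walk can be sketched as follows (class and method names are mine); the example uses the codes a=00, b=01, c=101, d=111, e=1101, f=1100 from above:

```java
public class PrefixTree {
    // Binary-tree node: internal nodes have children; leaves carry a character.
    public static class Node {
        public char c;
        public Node left, right;
    }

    // Place a character at the position given by its codeword (0 = left, 1 = right).
    public static void insert(Node root, String code, char ch) {
        Node cur = root;
        for (char bit : code.toCharArray()) {
            if (bit == '0') {
                if (cur.left == null) cur.left = new Node();
                cur = cur.left;
            } else {
                if (cur.right == null) cur.right = new Node();
                cur = cur.right;
            }
        }
        cur.c = ch;
    }

    // Walk the tree bit by bit; emit a character at each leaf, then restart at the root.
    public static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char bit : bits.toCharArray()) {
            cur = (bit == '0') ? cur.left : cur.right;
            if (cur.left == null && cur.right == null) {
                out.append(cur.c);
                cur = root;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Node root = new Node();
        insert(root, "00", 'a');  insert(root, "01", 'b');   insert(root, "101", 'c');
        insert(root, "111", 'd'); insert(root, "1101", 'e'); insert(root, "1100", 'f');
        System.out.println(decode(root, "1100001011101111"));  // prints "faced"
    }
}
```

Because the code is a prefix code, the walk always reaches a leaf before the next codeword begins, so no separators between codewords are needed.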
What properties of codewords are desired to enable us to compress the original text?
Idea. Start with all characters together with their frequency counts.
Then form a tree by repeatedly “merging” two nodes: give them a new common parent.
Build Huffman tree for text ABAAABBAACCBAAADEA
A Node stores:
- char c (0 if internal Node)
- int weight
- left (0) and right (1) child (both null if leaf)

Create a Node for each distinct character in the text; its weight is the character's frequency.
Add these Nodes to a collection c.
While c.size() > 1:
- Remove the two nodes with smallest weights from c: u and w
- Create a new node v
- v's children are u and w
- v.weight = u.weight + w.weight
- Add v to c
Return the one Node remaining in c: it is the root of the Huffman tree.
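The loop above maps naturally onto a priority queue ordered by weight. A sketch in Java (class names are mine, not from the notes):

```java
import java.util.Map;
import java.util.PriorityQueue;

public class Huffman {
    public static class Node {
        public char c;            // 0 for internal nodes
        public int weight;
        public Node left, right;  // both null for leaves
        public Node(char c, int weight, Node left, Node right) {
            this.c = c; this.weight = weight; this.left = left; this.right = right;
        }
    }

    // Build the Huffman tree: repeatedly merge the two lightest nodes.
    public static Node build(Map<Character, Integer> freq) {
        PriorityQueue<Node> pq = new PriorityQueue<>((a, b) -> a.weight - b.weight);
        for (Map.Entry<Character, Integer> e : freq.entrySet())
            pq.add(new Node(e.getKey(), e.getValue(), null, null));
        while (pq.size() > 1) {
            Node u = pq.poll();   // smallest weight
            Node w = pq.poll();   // next smallest
            pq.add(new Node((char) 0, u.weight + w.weight, u, w));  // new parent v
        }
        return pq.poll();         // the single remaining node is the root
    }

    public static void main(String[] args) {
        // Frequencies of the example text ABAAABBAACCBAAADEA.
        Node root = build(Map.of('A', 10, 'B', 4, 'C', 2, 'D', 1, 'E', 1));
        System.out.println(root.weight);  // root weight = total character count
    }
}
```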
Given a Huffman tree, how do we compute the resulting file size?
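A character at depth d in the tree gets a d-bit codeword, so the encoded text needs the sum over all leaves of (leaf weight × leaf depth) bits. A sketch (class name is mine), using one valid Huffman tree for the frequencies of ABAAABBAACCBAAADEA:

```java
public class HuffmanCost {
    static class Node {
        char c; int weight; Node left, right;
        Node(char c, int weight, Node left, Node right) {
            this.c = c; this.weight = weight; this.left = left; this.right = right;
        }
    }

    // Encoded length in bits = sum over leaves of (leaf weight * leaf depth).
    public static int encodedBits(Node n, int depth) {
        if (n.left == null && n.right == null) return n.weight * depth;
        return encodedBits(n.left, depth + 1) + encodedBits(n.right, depth + 1);
    }

    // Hand-built Huffman tree for A=10, B=4, C=2, D=1, E=1:
    // merge D+E (2), then C+DE (4), then B+CDE (8), then A+BCDE (18).
    public static Node exampleTree() {
        Node de   = new Node((char) 0, 2, new Node('D', 1, null, null), new Node('E', 1, null, null));
        Node cde  = new Node((char) 0, 4, new Node('C', 2, null, null), de);
        Node bcde = new Node((char) 0, 8, new Node('B', 4, null, null), cde);
        return new Node((char) 0, 18, new Node('A', 10, null, null), bcde);
    }

    public static void main(String[] args) {
        // 10*1 + 4*2 + 2*3 + 1*4 + 1*4 = 32 bits for the 18-character text.
        System.out.println(encodedBits(exampleTree(), 0));
    }
}
```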
Theorem. Among all possible prefix codes for a given text, Huffman codes give the smallest possible encoded text.
Implement Huffman coding