MLunzip

This is an implementation of gunzip (the decompression side of RFCs 1951 and 1952) in a purely functional subset of Standard ML.

Download (and history)

(.tar.gz files, less than 20 kilobytes)

The development code is stored in a Sourceforge browsable CVS repository.

Implementing the specification

This program was written as a way to learn functional programming on a real problem, rather than a textbook functional exercise. Despite the "bit twiddling" nature of the task, it proved quite doable in a purely functional way, without any significant obstacle.

The source is relatively short and readable, about the same number of lines as the RFC, which is itself a concise yet reasonably self-contained specification, and about as good as informal specifications get. Nevertheless, the toughest bug to solve was due to a misunderstanding of the spec... The source is probably not self-explanatory enough to constitute an alternative specification, but it would make a good readable reference implementation, and anyone wishing to implement DEFLATE in another language could check this implementation to answer questions left open by the RFC's text.

Functional performance

After ease of implementation, a second aim is to see if performance can get close to that of the reference C implementation, g(un)zip. Of course, a functional implementation is not expected to be as fast as the C one; the question is rather whether the cost of using a higher-level language is acceptable. This particular problem is probably tougher than most real-world problems: the reference implementation is likely to be much better optimised than a typical piece of C (or, more generally, imperative) code, and the algorithm is designed for a straightforward C implementation, leaving little scope for algorithmic improvements, at least for decompression.

On more general problems, it could be that, given the same resources, a functional programmer can implement efficient algorithms that would be too hard to implement in the same time in an imperative language. If this problem really is representative of the hardest cases for the relative performance of functional programs, this implementation's performance gives an idea of the upper bound on the performance cost of using functional languages in ordinary situations.

Optimisation history

The first version that worked was several hundred times slower than the C version with a time complexity that looked worse than the O(n) expected from the algorithm.

First, let's introduce what DEFLATE is about. The core LZ algorithm can be modelled like this:

 type byte = Word8.word
 datatype item = LITERAL of byte | BACKCOPY of int * int (* distance, length *)
 type compressed = item list

That is, a compressed file is a list of either literal bytes or instructions to go back distance bytes and copy length bytes from the output. Therefore we need an efficient way to go back some distance in the output. The first version used List.drop and List.take to do the copying on the output list of bytes. As this costs time proportional to distance + length for each compressed item, it was understandably slow.
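To make the cost concrete, here is a minimal sketch (not the project's actual code) of such a naive decoder, using the datatype above. The output is kept as a list of bytes in reverse order, so "going back distance bytes" is a linear walk down that list for every byte copied:

```sml
(* rev_out is the output produced so far, most recent byte first. *)
fun decode1 (LITERAL b, rev_out) = b :: rev_out
  | decode1 (BACKCOPY (dist, len), rev_out) =
      (* The copied region may overlap the bytes being produced,
         so copy one byte at a time; each List.nth is O(dist). *)
      let fun copy (out, 0) = out
            | copy (out, n) = copy (List.nth (out, dist - 1) :: out, n - 1)
      in copy (rev_out, len) end

fun decode (items : compressed) : byte list =
    rev (List.foldl decode1 [] items)
```

Each BACKCOPY thus costs on the order of distance + length list steps, which is what made the first version so slow.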

The first optimisation was to build a circular buffer on top of a binary tree of fixed depth (the maximum distance is bounded by 2^15 in DEFLATE), reusing the Huffman tree structure that was already available -- the other algorithm used in DEFLATE is Huffman coding of the items.
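The idea, sketched below under assumed names (this is a simplification, not the project's code), is to use a complete binary tree of fixed depth as a persistent array of 2^depth slots: both lookup and update walk a single root-to-leaf path, and update copies only the nodes on that path, sharing the rest of the tree with the previous version.

```sml
datatype 'a tree = LEAF of 'a | NODE of 'a tree * 'a tree

(* half is 2^(depth-1) at the root, halved at each level. *)
fun sub (LEAF x, _, _) = x
  | sub (NODE (l, r), i, half) =
      if i < half then sub (l, i, half div 2)
      else sub (r, i - half, half div 2)

(* Returns a new tree; the old one remains valid (persistence). *)
fun update (LEAF _, _, _, x) = LEAF x
  | update (NODE (l, r), i, half, x) =
      if i < half then NODE (update (l, i, half div 2, x), r)
      else NODE (l, update (r, i - half, half div 2, x))
```

With 2^15 slots, every update allocates 15 NODEs and a LEAF, which is exactly the allocation pressure described below.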

That brought back the expected O(n) behaviour (in the size of the compressed input), but the program remained quite slow, with a lot of time spent on garbage collection. That was not surprising given that each byte of output requires updating the tree, creating 15 new nodes; moreover, the tree traversal turned the index into a list of booleans for use as a tree path. Overall, processing the backward references into the output dominated the run time (as measured with MLton's profiler -- MLton is an optimising Standard ML compiler).

The next step was to design a data structure efficient for this problem: looking like a list, but with an efficient implementation of drop for a bounded number of items. A hierarchical skip list was chosen, which is a bit like a tree in reverse, to minimise object creation when adding an element (this may be described in more detail in a future edition of this document). The amortised space behaviour of this structure is similar to that of an ordinary functional list, while the performance of drop is similar to that of random access in a tree. This improved performance several times over, and eliminated the dominance of the output processing observed before. If the aim were not to remain purely functional, and the language provided an escape hatch to mutable arrays, a mutable array would be the obvious data structure for output lookups.

On the input side, the original implementation's reliance on the incremental transformation of the input byte stream into a list of bits (booleans) was replaced with direct operation on bytes using bit shifts and binary masks, as would be done in C. After a relatively modest refactoring, it was possible first to limit the dependency on the bit list representation, and then painlessly replace it with direct byte processing. To the core algorithm, the input still looks like a bitstream (DeflateIO's take_bit, drop_bit, take_fixed), so the optimisation has not damaged the clarity of the algorithm's implementation, and all bit twiddling is isolated in the input modules. This optimisation provided a significant improvement, although not as spectacular as that on the output side.
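The interface names above (take_bit, take_fixed) come from the source; the sketch below is an assumed illustration of how such LSB-first bit reading over raw bytes can look, with the stream state being the remaining bytes plus a bit offset into the current byte:

```sml
type bits = Word8.word list * int  (* remaining bytes, bit offset 0..7 *)

fun take_bit ((b :: rest, off) : bits) : bool * bits =
      let val bit = Word8.andb (Word8.>> (b, Word.fromInt off), 0w1)
          val state = if off = 7 then (rest, 0) else (b :: rest, off + 1)
      in (bit = 0w1, state) end
  | take_bit ([], _) = raise Fail "unexpected end of input"

(* Read n bits as an integer, least significant bit first,
   as DEFLATE packs its fixed-width fields. *)
fun take_fixed (0, s) = (0, s)
  | take_fixed (n, s) =
      let val (bit, s') = take_bit s
          val (rest, s'') = take_fixed (n - 1, s')
      in ((if bit then 1 else 0) + 2 * rest, s'') end
```

The shift-and-mask in take_bit is the "as in C" byte processing; everything above it still sees only a stream of bits.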

At this stage, these optimisations have greatly improved performance without adding much code (a few dozen lines at most) or damaging the readability of the implementation.

(Work in progress: the next step will be to turn the implementation into an incremental process to limit the cost of keeping a list of bytes for the whole output and input.)


Author. See also: Functional page.