Saturday 27 December 2014

Lossless Data Compression


David Briddock examines the history of data compression algorithms

Any efficient communication system relies on some form of data compression. Even Morse code, invented for telegraphy in 1838, used shorter codewords for common letters, such as 'e' and 't' in English. Most communication relies on a lossless solution, where the compressed data can be restored to exactly the same state as before.


Information Theory


By the late 1940s information theory had begun to gain recognition as a science. By 1949, Claude Shannon and Robert Fano had created a systematic method for assigning codewords based on the probability that a certain text fragment would occur in any given piece of data. The most common fragments could then be encoded using just a few bits.
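
To make the idea concrete, here's a minimal sketch of a Shannon-Fano style coder in Python (the symbol frequencies are invented purely for illustration): sort the symbols by frequency, then repeatedly split the list into two groups of roughly equal weight, handing a 0 bit to one group and a 1 bit to the other.

    # A minimal Shannon-Fano sketch: sort symbols by frequency, then
    # recursively split the list into two groups of roughly equal total
    # weight, appending '0' to codes in one group and '1' to the other.
    def shannon_fano(freqs):
        symbols = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
        codes = {sym: '' for sym, _ in symbols}

        def split(group):
            if len(group) < 2:
                return
            total, running = sum(f for _, f in group), 0
            for i, (_, f) in enumerate(group, 1):
                running += f
                if running * 2 >= total:   # first group has reached half the weight
                    break
            for sym, _ in group[:i]:
                codes[sym] += '0'
            for sym, _ in group[i:]:
                codes[sym] += '1'
            split(group[:i])
            split(group[i:])

        split(symbols)
        return codes

    # Invented example frequencies: common symbols get shorter codes.
    print(shannon_fano({'e': 16, 't': 8, 'a': 4, 'q': 2}))

With these made-up frequencies the coder assigns 'e' a single bit ('0') and the rare 'q' three bits ('111'), which is exactly the behaviour Shannon and Fano were after.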

While studying information theory under Fano at MIT, student David Huffman wrote an end-of-term paper that showed how to build a so-called probability tree from the bottom up rather than the top down, a new approach that improved compression efficiency (a sketch of it follows below). Early computers typically relied on hardware and hardcoded text dictionaries to implement Shannon-Fano and Huffman-style encoding, but it soon became clear that dynamically building dictionaries in software offered far greater flexibility, because the codewords could then be tailored to the contents of the data being compressed.
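
Here's a matching Python sketch of Huffman's bottom-up construction, using the same invented frequencies: the two least likely nodes are repeatedly merged until only one tree remains, and the codes are then read off by walking that tree.

    import heapq
    from itertools import count

    # Huffman's bottom-up construction: repeatedly merge the two least likely
    # nodes. The counter is just a tie-breaker so the heap never has to
    # compare two tree nodes directly.
    def huffman(freqs):
        tick = count()
        heap = [(f, next(tick), sym) for sym, f in freqs.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))

        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):        # internal node: recurse
                walk(node[0], prefix + '0')
                walk(node[1], prefix + '1')
            else:                              # leaf: a symbol
                codes[node] = prefix or '0'
        walk(heap[0][2], '')
        return codes

    # Same invented frequencies as before: 'e' gets 1 bit, 'q' gets 3 bits.
    print(huffman({'e': 16, 't': 8, 'a': 4, 'q': 2}))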

The rapid growth of data storage and Internet usage acted as a catalyst for new compression solutions. In 1977 Abraham Lempel and Jacob Ziv (Lempel-Ziv) published their groundbreaking LZ77 algorithm, the first to incorporate a dynamically generated dictionary. The duo improved on it a year later with LZ78, which combines the parsing of the data with the generation of the dictionary.
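
The following toy Python encoder gives a flavour of the LZ77 approach (it is purely illustrative; real implementations bound the search window tightly and use far cleverer matching): each step finds the longest match in the text already seen and emits an (offset, length, next-character) triple.

    # Toy LZ77 encoder: each output triple says "copy 'length' characters from
    # 'offset' places back, then append 'next_char'".
    def lz77_encode(data, window=255):
        out, i = [], 0
        while i < len(data):
            best_len, best_off = 0, 0
            for j in range(max(0, i - window), i):
                length = 0
                while (i + length < len(data) - 1 and
                       data[j + length] == data[i + length]):
                    length += 1
                if length > best_len:
                    best_len, best_off = length, i - j
            nxt = data[i + best_len]            # the literal after the match
            out.append((best_off, best_len, nxt))
            i += best_len + 1
        return out

    print(lz77_encode("abracadabra"))

Encoding "abracadabra" ends with the triple (7, 3, 'a'), in other words "copy three characters from seven places back, then append an 'a'". It's this back-referencing into the previously seen text that acts as a dynamically built dictionary.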

Patent Wars


What followed was a rapid growth in LZ compression variants. Most of these soon died out, but a few are still around today, including LZMA, LZX, Deflate and the RAR/WinRAR codec developed by Russian software engineer Eugene Roshal. These survivors tend to be based on the LZ77 algorithm rather than LZ78.

The reason is an LZ78 derivative called LZW, which was patented by the Sperry Corporation in 1984. Sperry, and later its successor Unisys, went on the offensive, pursuing vendors, server admins and others who'd used the closely related GIF format. The LZW patent didn't expire until 2003.

Later on, focus shifted to the LZS algorithm developed by Stac Electronics. Microsoft adopted LZS for data compression back in the days of MS-DOS 6.0, under the name DoubleSpace, but it took a court ruling, one that found Microsoft guilty of patent infringement and awarded Stac $100 million in damages, before the connection was acknowledged.

Deflate


Phil Katz needed to invent a replacement for the troublesome LZW for version 2 of his PKZIP file compression tool. By 1989 he'd created the Deflate algorithm by combining the LZ77 algorithm with Huffman coding techniques. Despite its age, PKZIP technology still forms the basis of the now ubiquitous ZIP format, and Deflate is also found in web technologies such as HTTP and SSL, in the PNG graphics standard, and in some modems and routers.
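
Deflate is easy to experiment with, because Python's standard zlib module wraps the zlib library, which implements the algorithm. A quick, lossless round trip (with an arbitrary piece of repetitive sample text) looks like this:

    import zlib

    # zlib implements Deflate: LZ77-style matching followed by Huffman coding.
    original = b"the quick brown fox jumps over the lazy dog " * 100
    compressed = zlib.compress(original, 9)    # 9 = maximum compression effort
    restored = zlib.decompress(compressed)

    assert restored == original                # lossless: an exact round trip
    print(len(original), "->", len(compressed), "bytes")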

The UNIX compress utility was also based on the patent-encumbered LZW algorithm. So the community switched to an open source Deflate-based solution called gzip, which is still around today alongside the newer Burrows-Wheeler Transform-based bzip2 format.

Sadly, Katz didn't live long enough to see Deflate take over the world. After struggling with alcoholism for many years, he died from excessive alcohol consumption in 2000, aged just 37.