...making Linux just a little more fun!

New Options in the World of File Compression

By Brian Lindholm

In the beginning, there was compress. People used it on their data files, and it was good. Their files became smaller, and more data could be crammed onto disc platters than ever before. The joys of LZW compression were enjoyed by many.

In 1992, something better came along: gzip. It used a different compression algorithm (LZ77 plus Huffman coding) that provided even smaller files. As a bonus, it was free of any pesky patent-encumbered algorithms. The masses were even happier, as they could cram even more data into their systems, and no royalty payments for patents were required.

In 1996, Julian Seward released bzip2, which used a combination of the Burrows-Wheeler transform and other compression techniques to achieve compression performance even better than gzip's. It required more CPU power and more memory, but, with the ever-escalating capabilities of computers, this became less and less of an issue over time.

For many years, gzip and bzip2 were the de facto compression standards in the world of free software, with gzip being used on time-sensitive compression tasks, and bzip2 being used when maximum file compression was desired.

However, in the year 2000, something new came along. Igor Pavlov released a program called 7-zip, which featured a new algorithm called LZMA. This algorithm provided very high compression ratios, although it did require major RAM and CPU time.

Unfortunately, there were two problems that made 7-zip less than ideal for Linux/BSD/Unix users. The first is that it was written for Microsoft Windows. Eeek! This was thankfully addressed in 2004, with the release of a cross-platform port called p7zip. A second problem is that 7-zip (and p7zip) used a file format called .7z. This is a multi-file archive format similar in functionality to .zip files. Unfortunately, with its Windows-based roots, the .7z file format made no provision for Unix-style permissions, user/group information, access control lists, or other such information. These limitations are show-stoppers for people doing backups on multi-user systems.

Then in 2004, Igor Pavlov released the LZMA SDK (Software Development Kit). Though intended for application writers, this development kit also contained a little gem of a command-line utility called lzma_alone. This program could be used much like gzip and bzip2, to create .lzma files. When combined with tar, this provided excellent file compression with proper Unix compability.

Less than a year after the release of the LZMA SDK, Lasse Collin released the LZMA Utils. This was initially a set of wrapper scripts around lzma_alone that provided lzma (with command-line options very similar to those of gzip and bzip2) instead of the less common p7zip-style options used by lzma_alone. Later lzma releases were entirely in C. Then, in 2009, Lasse Collin released the XZ Utils, xz being the main utility. This new utility continues to use LZMA compression, but, instead of producing raw LZMA data streams, it wraps the resulting data stream in a well-defined file format containing various magic bytes, stream flags, and cyclic redundancy checks. Thus was born the .xz file format.

In 2008, Antonio Diaz released a similar utility called lzip. Like xz, it uses LZMA compression, but, instead of creating .xz files, it creates .lz files. This format is different in detail, but has many of the same features as .xz files, such as magic bytes, cyclic redundancy checks, etc. Additionally, lzip can create multi-member files, and can split output into multiple volumes.

As of this writing, there are now four command-line utilities (and three file formats) that use LZMA, providing excellent file compression results: lzma_alone by Igor Pavlov, lzma and xz by Lasse Collin, and lzip by Antonio Diaz. Does this mean we're in for a VHS/Betamax-style format war? It's hard to say. (Fortunately, you're not limited to using just one. These are utilities, not VCRs. There's plenty of room for all of them on your hard drive. I have all four on mine.)

I myself prefer lzma_alone, as it's maintained by the person who actually invented the LZMA algorithm and understands it best. However, the file format is minimal, and xz and lzip offer significant advantages with their magic bytes and data integrity checks. It's also difficult to build lzma_alone, and it has no manpage. The XZ Utils are easiest to build (as it features a modern autotools-based configure script), but it currently lacks manpages for the main xz and lzma utilities. Lzip falls in between. It requires some manual hacking to get compiler flags like you want, but it does contain a nice manpage.

At some point, one of these may become the predominant way of using LZMA compression, but, as of today, you can expect to see all three file formats out there "in the wild". I know I have.

How do these utilities compare, as to compression performance? It turns out, there's little difference. Here's a table of results, showing how lzma_alone, xz, lzip, bzip2, gzip, and compress perform on the source tarball from ghostscript-8.64. (I skipped Lasse Collin's lzma, since it's just a symlink to xz, now.) Exact versions were lzma_alone-4.65, xz-4.999.8beta, lzip-1.5, bzip2-1.0.5, gzip-1.3.12, and ncompress-4.2.4.2.

gs.tar       65003520 bytes  (original file)
gs.tar.lzma  12751330 bytes  (159.05s to compress, 1.48s to decompress)
gs.tar.xz    12768488 bytes  (155.17s to compress, 1.54s to decompress)
gs.tar.lz    12804165 bytes  (161.12s to compress, 1.97s to decompress)
gs.tar.bz2   16921504 bytes  ( 14.72s to compress, 3.45s to decompress)
gs.tar.gz    19336239 bytes  (  7.31s to compress, 0.63s to decompress)
gs.tar.Z     29467629 bytes  (  2.39s to compress, 0.78s to decompress)

Compression results on all three LZMA-based utilities were quite similar, with lzma_alone doing the best by a whisker. All three did much better than bzip2, gzip, and compress, though taking much longer. Lzip decompression was about 30% slower than the other two LZMA-based utilities, but it's still markedly faster than Bzip2.

How can you take advantages of these new utilities? Well, if you're lucky, your distribution will have one or more of these available as pre-compiled packages. I run Debian (Lenny) 5.0, which has lzma_alone and an earlier version of Lasse Collin's LZMA Utils (which contains lzma, but not xz) available. For those not provided by your distribution, you'll have to download the source code and compile it yourself. Here are links to the three major programs:

http://www.7-zip.org/sdk.html  (for lzma_alone)
http://tukaani.org/xz/  (for xz and lzma)
http://www.nongnu.org/lzip/lzip.html  (for lzip)

For those who wish to build lzma_alone, I offer this tarball: lzma_alone_patches.tar.bz2, which contains some minimal patches, build instructions, and a manpage. To use it, you'll still need to download the original LZMA SDK from the Web site mentioned above. As for the XZ Utils and Lzip, they are quite straightforward to build and install.

How can you convert existing tarballs from one file compression scheme to another? With Unix pipes, of course. The following examples show how:

gzip -c -d source.tar.gz | lzma_alone e -si source.tar.lzma
bzip2 -c -d source.tar.bz2 | lzma -c > source.tar.lzma
gzip -c -d source.tar.gz | xz -c > source.tar.xz
bzip2 -c -d source.tar.bz2 | lzip -c > source.tar.lz

And how can you decompress these incredibly compact new tarballs?

lzma_alone d source.tar.lzma -so | tar -xvf -
tar --use-compress-program=lzma -xvf source.tar.lzma
tar --use-compress-program=xz -xvf source.tar.xz
tar --use-compress-program=lzip -xvf source.tar.lz

For those who have many tarballs to convert, you might consider downloading and installing the littleutils package. This package contains three scripts (to-lzma, to-xz, and to-lzip) that will convert multiple gzip- and bzip2-compressed files into .lzma, .xz, or .lz format, respectively. The -k option is particularly useful, as it will delete the original file only if the new one is smaller. Otherwise the original file will be preserved. To convert an entire directory of tarballs to .lzma format, simply type the following:

to-lzma -k *.tar.gz *.tar.bz2

After that shameless plug for my own software, I'll conclude this article by urging people to start using at least one of these LZMA-based compression utilities. Particularly if you distribute compressed tarballs. LZMA-compressed files take less time to download, and take less time to decompress (at least compared to bzip2). Even in a world of broadband Internet connections, multi-gigahertz processors, and cavernous hard drives, these utilities will save time and space.


Talkback: Discuss this article with The Answer Gang


Bio picture

Brian Lindholm is a Virginia Tech graduate and middle-aged mechanical engineer who started programming in BASIC on a TRS-80 Model I (way back in 1980). In the late eighties, he moved to Pascal and C on an IBM PC-compatible.

Over the years, Brian became increasingly disgruntled with the instability and expense of the various Microsoft operating systems. In particular, he hated not being in full control of his system. MOST fortunately for him, however, he had a college roommate who ran Linux (way back in the Linux 0.9 and Slackware 1.0 days). That introduction was all he needed.

Over the years, he's slowly learned more and more, and now manages to keep his Debian system running happy and stable (even through four major upgrades: 2.2 to 3.0 to 3.1 to 4.0 to 5.0). [A point of note: His Debian system has NEVER crashed on its own. EVER. Only power failures, attempts to boot off the wrong partition, errant hits of the reset button, a cracked DVD, and a particularly flaky IDE Zip drive ever managed to take it down.] He loves VIM and has found Perl amazingly useful at work.

In his non-Linux life, Brian helps design power generation equipment (big power plant stuff) for a living, occasionally records live music for people, reads too much science fiction, and gets out on the Appalachian Trail as often as he can.


Copyright © 2009, Brian Lindholm. Released under the Open Publication License unless otherwise noted in the body of the article. Linux Gazette is not produced, sponsored, or endorsed by its prior host, SSC, Inc.

Published in Issue 162 of Linux Gazette, May 2009

Tux