Saturday, January 5, 2013

Archiving and Compressing Data

The tar Command

The tar command can be used to create (-c), extract (-x) and list (-t) the contents of an archive. Here's a listing of the files we'll be working with:

$ ll
-rw------- 1 ejan ejan  331 Jan  5 18:57 FibonacciException.java
drwx------ 2 ejan ejan 4.0K Jan  5 18:57 handlers
drwx------ 2 ejan ejan 4.0K Jan  5 18:57 impl
-rw------- 1 ejan ejan  389 Jan  5 18:57 RabbitCounter.java
-rw------- 1 ejan ejan  168 Jan  5 18:57 ServiceConstants.java

Note: The ll command shown above is an alias in my ~/.bashrc file defined as alias ll='ls -lh'

Creating a Tarball

To create a tarball of the contents of the current directory, use the following command (the shell expands * to every non-hidden file and directory):

$ tar -cvf code.tar *
FibonacciException.java
handlers/
handlers/CorrelationIDHandler.java
handlers/handler-chain.xml
impl/
impl/RabbitCounter.java
RabbitCounter.java
ServiceConstants.java

The -v option produces verbose output (the name of each file as it is processed), while the -f option specifies the name of the archive file.

Extracting Contents of a Tarball

To extract the contents of this tarball, we would issue:

$ tar -xvf code.tar
FibonacciException.java
handlers/
handlers/CorrelationIDHandler.java
handlers/handler-chain.xml
impl/
impl/RabbitCounter.java
RabbitCounter.java
ServiceConstants.java

Listing Contents of a Tarball

Finally, to list the contents of a tarball, we issue:

$ tar -tvf code.tar
-rw------- ejan/ejan       331 2013-01-05 19:28 FibonacciException.java
drwx------ ejan/ejan         0 2013-01-05 19:28 handlers/
-rw------- ejan/ejan      2312 2013-01-05 19:28 handlers/CorrelationIDHandler.java
-rw------- ejan/ejan       303 2013-01-05 19:28 handlers/handler-chain.xml
drwx------ ejan/ejan         0 2013-01-05 19:28 impl/
-rw------- ejan/ejan      2397 2013-01-05 19:28 impl/RabbitCounter.java
-rw------- ejan/ejan       389 2013-01-05 19:28 RabbitCounter.java
-rw------- ejan/ejan       168 2013-01-05 19:28 ServiceConstants.java
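
To tie the three operations together, here is a minimal round-trip sketch. The scratch directories and the sample file contents are made up for illustration, and the -C option (which tells tar to change into a directory before archiving or extracting) is not used elsewhere in this post:

```shell
#!/bin/sh
set -eu

# Scratch directories so the example does not touch any real files.
src=$(mktemp -d)
dest=$(mktemp -d)

# Hypothetical sample files standing in for the listing above.
mkdir "$src/impl"
echo 'class RabbitCounter {}' > "$src/RabbitCounter.java"
echo 'class RabbitCounterImpl {}' > "$src/impl/RabbitCounter.java"

# Create: -C switches to $src first, so member names are stored relative to it.
tar -cf "$dest/code.tar" -C "$src" .

# List: print the member names without extracting anything.
tar -tf "$dest/code.tar"

# Extract: here -C chooses the directory that receives the files.
mkdir "$dest/out"
tar -xf "$dest/code.tar" -C "$dest/out"

# diff -r prints nothing and succeeds when both trees are identical.
diff -r "$src" "$dest/out" && echo "round trip OK"
```

Archiving "." instead of "*" also picks up hidden files, which may or may not be what you want.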


Now that we can create, extract and list plain tarballs, let's discuss how to create compressed tarballs.

Compressing and Uncompressing Data

The tar command can work with many programs to produce compressed tarballs. The most commonly used one is gzip. To use it, we issue the tar command as usual with one additional option: -z, which tells tar to filter the archive through the gzip program.

The following commands create a compressed tarball from the contents of the current directory, extract it and finally list its contents, just like the earlier commands did for plain tarballs. The output is much the same, so there's no need to reproduce it here:

$ tar -cvzf code.tar.gz *
$ tar -xvzf code.tar.gz
$ tar -tvzf code.tar.gz
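
As a quick sanity check that -z really produces a gzip stream, the sketch below creates a small compressed tarball and tests it with gzip itself. The src directory and its single file are made up for illustration. The last command relies on the tar implementation detecting the compression on its own when reading from a file, which I believe holds for reasonably recent GNU tar; treat that as an assumption about your system:

```shell
#!/bin/sh
set -eu

# Work in a scratch directory with a hypothetical sample file.
work=$(mktemp -d)
cd "$work"
mkdir src
echo 'interface ServiceConstants {}' > src/ServiceConstants.java

# Create a gzip-compressed tarball, exactly as with -cvzf but without -v.
tar -czf code.tar.gz src

# gzip -t checks the integrity of a gzip file without unpacking it.
gzip -t code.tar.gz && echo "valid gzip stream"

# Listing without -z: tar is expected to recognise the compression itself.
tar -tf code.tar.gz
```

If the last command fails on your system, simply put the -z back.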

Similarly, tar can work with bzip2 archives as well. The only difference in the command would be that we'd pass a -j instead of -z:

$ tar -cvjf code.tar.bz2 *
$ tar -xvjf code.tar.bz2
$ tar -tvjf code.tar.bz2

Finally, tar can work with other compression programs as well, such as lzop, by means of the --use-compress-program option:

$ tar --use-compress-program=lzop -cvf code.tar.lzo *
$ tar --use-compress-program=lzop -tvf code.tar.lzo
$ tar --use-compress-program=lzop -xvf code.tar.lzo

After all this practice, here are the archives we end up with:

$ ll
-rw-rw-r-- 1 ejan ejan  20K Jan  5 19:54 code.tar
-rw-rw-r-- 1 ejan ejan 2.1K Jan  5 19:55 code.tar.bz2
-rw-rw-r-- 1 ejan ejan 2.1K Jan  5 19:54 code.tar.gz
-rw-rw-r-- 1 ejan ejan 3.0K Jan  5 19:58 code.tar.lzo

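Generally speaking, the bzip2 format will give you the best compression, followed by gzip, followed by lzop, which trades compression ratio for speed. On small inputs like ours the difference between bzip2 and gzip is negligible (both archives come out at about 2.1K). Of course, a plain tarball won't save you any space as it's not compressed at all.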

It is worth noting that you can use these programs directly to compress or uncompress data without going through tar. I prefer the tar command since it's more of a 'One command to rule them all', though, admittedly, calling bzip2, gzip and lzop yourself gives you the most flexibility. Feel free to use them directly when you need their power (like the ability to explicitly specify the compression level).
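
Here is a minimal sketch of what using gzip directly with an explicit compression level might look like. The file names and the repetitive sample data are made up for illustration; -1 asks gzip for the fastest compression and -9 for the smallest output:

```shell
#!/bin/sh
set -eu

# Work in a scratch directory.
work=$(mktemp -d)
cd "$work"
mkdir src

# Hypothetical, highly repetitive sample file so the compression is visible.
i=0
while [ "$i" -lt 2000 ]; do
    echo 'public class RabbitCounter {}'
    i=$((i + 1))
done > src/RabbitCounter.java

# Start from a plain tarball.
tar -cf code.tar src

# -c writes to standard output, so code.tar itself is left in place.
gzip -1 -c code.tar > code-fast.tar.gz
gzip -9 -c code.tar > code-best.tar.gz

# The same in one step: tar writes the archive to stdout (-f -) and
# gzip compresses the stream on the fly.
tar -cf - src | gzip -9 > code-piped.tar.gz

# Compare the sizes.
ls -l code.tar code-fast.tar.gz code-best.tar.gz code-piped.tar.gz
```

How much the level matters depends heavily on the data, so it's worth comparing the sizes on your own files before settling on one.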

Manipulating Archives

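Finally, there are a number of operations we can perform on an existing tarball. Note that these only work on plain (uncompressed) tarballs; a compressed archive has to be uncompressed first. The operations include:

joining an archive to another one (the contents of source-archive.tar are appended to target-archive.tar):

$ tar -Af target-archive.tar source-archive.tar

removing files from an existing archive:

$ tar --delete -f code.tar RabbitCounter.java

and adding files to an existing archive:

$ tar -rf code.tar RabbitCounter.java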
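
GNU tar will not append to or delete from a compressed archive, so to change a compressed tarball you have to uncompress it, modify it and compress it again. Below is a minimal sketch of that workflow, assuming GNU tar (the --delete option is not available in every tar implementation) and using made-up file contents:

```shell
#!/bin/sh
set -eu

# Work in a scratch directory with hypothetical sample files.
work=$(mktemp -d)
cd "$work"
echo 'class FibonacciException {}' > FibonacciException.java
echo 'class RabbitCounter {}' > RabbitCounter.java
echo 'interface ServiceConstants {}' > ServiceConstants.java

# Start with a gzip-compressed tarball holding two of the files.
tar -czf code.tar.gz FibonacciException.java RabbitCounter.java

# Uncompress: gunzip replaces code.tar.gz with code.tar.
gunzip code.tar.gz

# Append one file and delete another (--delete assumes GNU tar).
tar -rf code.tar ServiceConstants.java
tar --delete -f code.tar RabbitCounter.java

# Compress again: gzip replaces code.tar with code.tar.gz.
gzip code.tar

# The listing should now show FibonacciException.java and
# ServiceConstants.java, but no RabbitCounter.java.
tar -tzf code.tar.gz
```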

Happy tarring :)