Saturday, January 5, 2013

Archiving and Compressing Data

The tar Command

The tar command can create (-c), extract (-x) and list (-t) the contents of an archive. Here's a listing of the files we'll be working with:

$ ll
-rw------- 1 ejan ejan  331 Jan  5 18:57
drwx------ 2 ejan ejan 4.0K Jan  5 18:57 handlers
drwx------ 2 ejan ejan 4.0K Jan  5 18:57 impl
-rw------- 1 ejan ejan  389 Jan  5 18:57
-rw------- 1 ejan ejan  168 Jan  5 18:57

Note: The ll command shown above is an alias in my ~/.bashrc file defined as alias ll='ls -lh'

Creating a Tarball

In order to create a tarball of the contents of the current directory, use the following command:

$ tar -cvf code.tar *

The -v option produces verbose output (each file is printed as it is added), while the -f option specifies the name of the archive file to write.

Extracting Contents of a Tarball

To extract the contents of this tarball into the current directory, we would issue:

$ tar -xvf code.tar

Listing Contents of a Tarball

Finally, in order to list the contents of a tarball, we issue:

$ tar -tvf code.tar
-rw------- ejan/ejan       331 2013-01-05 19:28
drwx------ ejan/ejan         0 2013-01-05 19:28 handlers/
-rw------- ejan/ejan      2312 2013-01-05 19:28 handlers/
-rw------- ejan/ejan       303 2013-01-05 19:28 handlers/handler-chain.xml
drwx------ ejan/ejan         0 2013-01-05 19:28 impl/
-rw------- ejan/ejan      2397 2013-01-05 19:28 impl/
-rw------- ejan/ejan       389 2013-01-05 19:28
-rw------- ejan/ejan       168 2013-01-05 19:28
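The three operations above can be tried safely in a scratch directory. The following is only a sketch: the file notes.txt and the src/out directory names are made up for illustration, and the -C option (change to a directory before operating) is not used elsewhere in this post.

```shell
#!/bin/sh
set -eu

# Scratch area with a couple of throwaway files (hypothetical names).
work=$(mktemp -d)
mkdir -p "$work/src/handlers" "$work/out"
echo "hello" > "$work/src/notes.txt"
echo "<chain/>" > "$work/src/handlers/handler-chain.xml"

# Create: -C switches into src first, so member names stay relative.
tar -cf "$work/code.tar" -C "$work/src" notes.txt handlers

# List: without -v, tar prints member names only.
listing=$(tar -tf "$work/code.tar")
echo "$listing"

# Extract into a separate directory and compare against the originals.
tar -xf "$work/code.tar" -C "$work/out"
diff -r "$work/src" "$work/out" && echo "round trip OK"
```

Extracting into a separate directory with -C keeps the originals untouched, which makes it easy to check that nothing was lost along the way.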

Now that we can create, extract and list plain tarballs, let's discuss how to create compressed tarballs.

Compressing and Uncompressing Data

The tar command can work with many programs to produce compressed tarballs. The most commonly used format is gzip. To use it, we issue the tar command as usual but with one additional option: -z. The -z option tells tar to filter the archive through the gzip program, compressing it on creation and decompressing it on extraction or listing.

The following commands will create a compressed tarball with contents of the current directory, extract the compressed archive and finally list its contents, just like the previous commands used to create plain tarballs. The output is mostly similar so there's no need to reproduce it here:

$ tar -cvzf code.tar.gz *
$ tar -xvzf code.tar.gz
$ tar -tvzf code.tar.gz

Similarly, tar can work with bzip2 archives as well. The only difference in the command would be that we'd pass a -j instead of -z:

$ tar -cvjf code.tar.bz2 *
$ tar -xvjf code.tar.bz2
$ tar -tvjf code.tar.bz2

Finally, tar can work with other compression programs as well, such as lzop, by means of the --use-compress-program option:

$ tar --use-compress-program=lzop -cvf code.tar.lzo *
$ tar --use-compress-program=lzop -tvf code.tar.lzo
$ tar --use-compress-program=lzop -xvf code.tar.lzo

After all this dojo, here's what the directory looks like:

$ ll
-rw-rw-r-- 1 ejan ejan  20K Jan  5 19:54 code.tar
-rw-rw-r-- 1 ejan ejan 2.1K Jan  5 19:55 code.tar.bz2
-rw-rw-r-- 1 ejan ejan 2.1K Jan  5 19:54 code.tar.gz
-rw-rw-r-- 1 ejan ejan 3.0K Jan  5 19:58 code.tar.lzo

Generally speaking, the bzip2 format will give you the best compression, followed by gzip, followed by lzop. Of course, a plain tarball won't save you any space as it's not compressed at all. 
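If you want to check the size difference on your own data, a rough comparison could look like the sketch below. The sample file is generated and highly repetitive (an assumption made purely for illustration), so the ratios will be far better than for real source code. bzip2 is only used when it is installed.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
cd "$work"

# Generate repetitive sample data (hypothetical content).
i=0
while [ "$i" -lt 2000 ]; do
  echo "line $i: the quick brown fox jumps over the lazy dog"
  i=$((i + 1))
done > sample.txt

tar -cf plain.tar sample.txt        # no compression
tar -czf gzip.tar.gz sample.txt     # gzip
if command -v bzip2 >/dev/null 2>&1; then
  tar -cjf bzip2.tar.bz2 sample.txt # bzip2, if available
fi

# Print the size in bytes of every archive that was created.
for f in plain.tar gzip.tar.gz bzip2.tar.bz2; do
  if [ -f "$f" ]; then
    size=$(wc -c < "$f" | tr -d ' ')
    echo "$f: $size bytes"
  fi
done
```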

It is worth noting that you can use these programs directly to compress or uncompress data without using the tar command. However, I prefer using the tar command since it's more like a 'One command to rule them all' command, though, admittedly, using the bzip2, gzip and lzop programs will give you the most flexibility. Feel free to use those programs directly when you need their power (like the capability to explicitly specify the desired compression level).
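As an example of using the compression program directly, here is a sketch that picks the gzip compression level explicitly. The file name records.txt is made up; -1 asks for the fastest compression and -9 for the best, and -c writes to standard output so the original tarball is left in place.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
cd "$work"

# Some sample data to archive (hypothetical content).
awk 'BEGIN { for (i = 0; i < 5000; i++) print "record", i }' > records.txt
tar -cf code.tar records.txt

# Compress an existing tarball at two different levels.
gzip -1 -c code.tar > fast.tar.gz
gzip -9 -c code.tar > best.tar.gz

# Or skip the intermediate file: tar writes to stdout (-f -), gzip compresses.
tar -cf - records.txt | gzip -9 > piped.tar.gz

# Check the result: integrity test, byte-for-byte comparison, and a listing.
gzip -t best.tar.gz
gzip -dc best.tar.gz | cmp - code.tar && echo "identical after decompression"
tar -tzf piped.tar.gz
```

The archives produced this way are ordinary .tar.gz files, so tar -xzf and tar -tzf work on them as usual.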

Manipulating Archives

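Finally, there are a number of operations we can perform on a tarball. These operations modify the archive in place, so they only work on plain (uncompressed) tarballs such as code.tar, not on code.tar.gz or code.tar.bz2. The file handlers/handler-chain.xml from the listing above is used here as an example member. The operations include:

joining one archive to another (the members of source-archive are appended to target-archive):

$ tar -Af target-archive source-archive

removing a file from an existing archive:

$ tar --delete -f code.tar handlers/handler-chain.xml

and adding a file to an existing archive:

$ tar -rf code.tar handlers/handler-chain.xml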
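To see these operations working together, here is a small sketch on throwaway files. It assumes GNU tar (the -A and --delete operations are GNU extensions and other tar implementations may not have them) and uses plain, uncompressed archives, since compressed archives cannot be updated in place. All file names are made up.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
cd "$work"
echo a > a.txt
echo b > b.txt
echo c > c.txt

tar -cf first.tar a.txt
tar -cf second.tar b.txt

# Join: append the members of second.tar to first.tar.
tar -Af first.tar second.tar

# Add: append a file to the existing archive.
tar -rf first.tar c.txt

# Remove: delete a member from the archive.
tar --delete -f first.tar a.txt

# List what is left in the archive.
members=$(tar -tf first.tar)
echo "$members"
```

After these steps the listing should contain b.txt and c.txt but no longer a.txt.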

Happy tarring :)