Join Scott Simpson for an in-depth discussion in this video File system basics: Archives, part of Linux Tips Weekly.
- [Instructor] When we need to reduce the size of files, or to package them up together in order to store them or send them to someone else, we'll often use an archive. Archives come in two different types that are often thought of together. One kind of archive, called a tape archive and usually represented by a .tar file, is a way of sticking a bunch of files together into one big unit. Compressed archives take that one file and use some clever math to make it smaller. Let's take a look at both. TAR files are common in the Linux world, and they have a long history. As I mentioned, TAR is short for tape archive.
Historically, and even to a large extent today, important information was backed up on tape media, which is physically compact and can store a large amount of data. In order to store files on a tape, a system needs to know where on a tape particular files are, and those files need to get written onto a tape in a linear fashion. In order to do this, the tape archive format takes a bunch of files and puts them together into one continuous file. We don't really save any space putting files together like this. But we do get one file that represents all of the data from many separate files.
This one file can be checksummed or sent to someone else more easily than a bundle of files and folders. Once we have an archive like a TAR file, we can then apply compression to it. Compression uses algorithms that look at the data in a file and decide where space can be saved. Algorithms for compressing data differ, but commonly, a compression tool will look at pieces of data and see if it can find patterns it can replicate mathematically, or it'll look for long stretches of zeros or ones that it can remove and then restore later on during decompression.
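As a rough illustration of that idea, here is a minimal sketch that compresses a file of zero bytes and a file of random bytes with gzip and prints the resulting sizes. The file names and the 100,000-byte size are assumptions made up for this example; they are not from the video.

```shell
# Work in a throwaway directory so nothing in the current folder is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Two hypothetical 100,000-byte test files: one all zero bytes, one random.
head -c 100000 /dev/zero > zeros.bin
head -c 100000 /dev/urandom > random.bin

# -c writes the compressed data to standard output and leaves the original in place.
gzip -c zeros.bin > zeros.bin.gz
gzip -c random.bin > random.bin.gz

# Byte counts of the compressed results.
zeros_gz=$(wc -c < zeros.bin.gz)
random_gz=$(wc -c < random.bin.gz)
echo "zeros:  100000 -> $zeros_gz bytes"
echo "random: 100000 -> $random_gz bytes"
```

The long run of zeros should shrink to a tiny fraction of its original size, while the random file should stay at roughly its original size, since there are no patterns for the algorithm to exploit.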
This video isn't about the details of compression algorithms but if you're curious, the algorithms are pretty interesting to read about. I encourage you to explore them if you're so inclined. Let's take a look at how to create an archive and how to compress one. I have a folder here with a bunch of random files in it. In order to create an archive with them, I'll use the tar command, with the c and f options. c tells tar to create an archive, and f tells tar to put the resulting archive into a file.
If you leave that off, it'll send the data for the archive to the standard output here on the screen unless you redirect it somewhere else like a tape device. Then I'll give my file a name. I'll call it archive.tar. You don't have to add .tar on the end but it's customary, so someone can tell just by looking at it what kind of file they're dealing with. Tools like the file command will still be able to tell what it is without the extension though. And then I'll set what I want to have inside that file. You can put a number of file names or folders here after the name of the archive, but I'll just put this one directory.
And now I have an archive file. This file takes up just about the same amount of space as all of the files I put in it. There's a little extra space taken up for the information that describes how the data should be redivided up to create individual files when the archive is expanded. I can see what's inside of a TAR file with the -t option. I'll write tar tf and the name of the archive. Let's extract this archive. To do that, I'll write tar xf archive.tar, and this command, just as it is, will expand the contents of the archive into the current working directory.
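The create and list steps above can be sketched as follows. The my_files directory name matches the one used later in the video, but the files inside it and their 1 MiB size are placeholders invented for this example.

```shell
# Work in a throwaway directory with a small folder of sample files.
workdir=$(mktemp -d)
cd "$workdir"
mkdir my_files
head -c 1048576 /dev/urandom > my_files/one.bin
head -c 1048576 /dev/urandom > my_files/two.bin

# c = create an archive, f = write it to the named file.
tar cf archive.tar my_files

# t = list the archive's contents without extracting anything.
listing=$(tar tf archive.tar)
echo "$listing"

# Compare sizes: the archive holds the file data plus headers and padding.
data_bytes=$(cat my_files/*.bin | wc -c)
archive_bytes=$(wc -c < archive.tar)
echo "file data: $data_bytes bytes, archive: $archive_bytes bytes"
```

With very small files the fixed overhead becomes more noticeable, so the "about the same size" observation holds best when the files are reasonably large.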
That might not be what I want to do though. I could add file names here at the end to pull out individual files, or I could use dash capital C to change the directory where the files will be put. I'll put them in my Downloads folder instead of the working directory here. That way, I won't overwrite the original folder here in my home directory. We can also apply some compression to the archive with the z or j options. The z option uses gzip and the j option uses bzip2, both pretty standard compression tools that work in different ways.
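To make the extraction options concrete, here is a minimal sketch of extracting into another directory with -C and of pulling out a single file. The restored and single directories and the text files are hypothetical stand-ins (the video uses the Downloads folder instead).

```shell
# Build a small archive to experiment with.
workdir=$(mktemp -d)
cd "$workdir"
mkdir my_files restored single
echo "first" > my_files/one.txt
echo "second" > my_files/two.txt
tar cf archive.tar my_files

# x = extract; -C switches to the target directory before unpacking.
tar xf archive.tar -C restored

# Naming a member after the archive extracts only that entry.
tar xf archive.tar -C single my_files/one.txt

ls -R restored single
```

The member name has to match the path as it appears in the tar tf listing, including the leading directory.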
The results in most cases are pretty similar, but if you have a large file, you might want to try both to see which works faster or results in a smaller file. I'll archive this folder of files again, this time with compression. I'll write tar czf, and call my file archive.tar.gz. Here I'm using z for gzip compression, and I've called the file archive.tar.gz. This is customary, and sometimes gzipped TAR files will have the extension .tgz.
If I were using bzip2, with the j option, I'd give the file the extension .tar.bz2. Once that's created, we can take a look at it and compare the size to the uncompressed archive. In my case, I didn't save any space because these files are actually just full of random data, and that doesn't compress too well. To expand a compressed archive, we use the same command as for an uncompressed one. TAR figures out what it needs to do and then decompresses the archive. There's another kind of archive that you may come across as well called ZIP.
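Before moving on to ZIP, the compressed TAR steps above can be sketched like this. The sample file is an assumption for this example: it is filled with one repeated character so that the size difference is visible, unlike the random files in the video. The unpacked directory name is also made up.

```shell
# Work in a throwaway directory with one highly repetitive sample file.
workdir=$(mktemp -d)
cd "$workdir"
mkdir my_files unpacked
head -c 500000 /dev/zero | tr '\0' 'a' > my_files/repetitive.txt

tar cf archive.tar my_files        # plain, uncompressed archive
tar czf archive.tar.gz my_files    # z = compress the archive with gzip
# The bzip2 version would be: tar cjf archive.tar.bz2 my_files (needs bzip2 installed)

# Compare the sizes of the two archives.
plain_bytes=$(wc -c < archive.tar)
gzip_bytes=$(wc -c < archive.tar.gz)
echo "archive.tar: $plain_bytes bytes, archive.tar.gz: $gzip_bytes bytes"

# Extraction uses the same xf options; tar detects the gzip compression itself.
tar xf archive.tar.gz -C unpacked
```

Swapping in files of random data, as in the video, should show the two archives coming out at nearly the same size.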
ZIP is a little bit more cross-platform friendly than TAR, so many people use it either for compatibility or because they're creating the archive on a system that uses ZIP. The commands we use to work with ZIP archives are called zip and unzip. If you run the commands by themselves, you'll see a helpful guide to more advanced options that you can use. I'll run zip here and I can see that the syntax to create a ZIP archive from a few files is zip and then the name of the zip file and the list of files. I'll do that with some files here.
I'll write zip and the name of an archive. I'll call it archive.zip, and then I'll set the path to the files that I want to compress, all of the files in the my_files directory. And as it's running, I can see how much the zip is deflating or reducing the size of each file. Again, in this case, these files are full of random data, so they're not actually compressing. There's my zip archive. Like a TAR file, or compressed TAR file, this would be easier than a bunch of files to back up or send to someone else.
Let's extract the archive with unzip now. If I just unzip this into a folder, I'll overwrite my original files, and I don't want to do that. So I'll create a folder to put the extracted files in. I'll write mkdir newfiles, and then I'll write unzip archive.zip -d and the folder that I want to send the files to. Archives, and especially compressed archives, are extremely common ways of distributing files, storing logs, and serializing data to make it easier to send and receive across the network.
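The zip and unzip steps above might look like the following sketch. The sample files are placeholders invented for this example, and the check at the top is an added precaution, since zip and unzip are not installed by default on every system.

```shell
# Work in a throwaway directory with a small folder of sample files.
workdir=$(mktemp -d)
cd "$workdir"
mkdir my_files
echo "hello" > my_files/a.txt
echo "world" > my_files/b.txt

# Only run the ZIP steps if both tools are available.
if command -v zip >/dev/null && command -v unzip >/dev/null; then
  # Archive name first, then the files to add; zip prints a line per file as it goes.
  zip archive.zip my_files/*

  # Extract into a separate folder (-d) so the originals are not overwritten.
  mkdir newfiles
  unzip archive.zip -d newfiles
fi
```

Files this small are usually just stored rather than deflated, so the percentages zip reports will be more interesting with larger, more repetitive input.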