
How To Archive Your Files

Introduction

No matter how much disk space we have, it is still a finite resource. Most of us do not clear out old files until a disk partition runs out of space.

Please take a look through your files to see whether there is anything that is no longer worth holding onto. If your overall directory is smaller than a few hundred gigabytes, you might have very little that you can clean out. However, nearly everyone has some files that are no longer needed.

Please do not delete anything you are really using or want to hold onto for reference. The point is to clear out garbage you have been meaning to remove, not to make your life difficult. :-) For the files you do keep, the gzip compression utility can make a real difference.

Thanks for your help with this! A complete description of various tools (primarily df, du, gzip, tar, and gtar) follows.



A Command Line approach for handling disk space

How to determine your disk usage:

The df command shows "Disk space Free". Running df by itself shows all currently mounted filesystems. You can also give it a directory argument, which shows available space for all users on that partition, not just your account.

df -h ~ (where '~' represents your entire home directory)
df -h /
df -Ph / (recommended for Macs, to unclutter and not show inodes)

By default, df reports sizes in fixed-size blocks (1-kilobyte blocks on most Linux systems, 512-byte blocks on macOS), which can be hard to read for large disks. Using the -h flag for a "human-readable" form (megabytes, gigabytes, etc.) is very helpful.
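
If you want to check a filesystem from a script rather than by eye, the portable (-P) output of df is easy to parse. The following is a minimal sketch, not a site-provided tool: the function name usage_percent and the 90% threshold are illustrative assumptions.

```shell
# usage_percent PATH: print how full the filesystem holding PATH is,
# as a bare number (the "Capacity" column of df -P, without the % sign).
# "usage_percent" is a hypothetical helper name, not a standard command.
usage_percent() {
    df -Pk "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when the filesystem holding your home directory is over 90% full.
# The 90% threshold is an arbitrary choice for illustration.
pct=$(usage_percent "$HOME")
if [ "$pct" -gt 90 ]; then
    echo "Home filesystem is ${pct}% full: time to clean up"
fi
```

The -P flag keeps each filesystem on a single line, which is what makes the second line of output safe to pick apart with awk.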

The du command shows "Disk Usage". It is most useful in its "summary" ('-s') mode, because otherwise it shows you every nested subdirectory.

By default, du reports sizes in blocks (512 bytes on macOS, 1 kilobyte on most Linux systems), which is even less intuitive to read. Thus, use du -hs for a "human-readable" form just like df, or use du -ks for kilobytes, which are easier to sort numerically. To find the overall size of your home directory, type:

du -hs ~
A useful mode is:
du -ks * | sort -nr
where the asterisk is a wildcard that matches every file and directory in the current directory. The catch is that it won't match hidden files (those that start with a period), and those can be large, too. So the special case for your home directory would be:
cd ~
du -ks * .[a-zA-Z0-9]* | sort -nr

You can handle the output by piping it to head or a pager (less, more). Or you can redirect it to a file so that you don't have to rerun du every time you want to look at the results.

du -ks * | sort -nr | head -25
du -ks * | sort -nr | less
du -ks * | sort -nr > /tmp/du.myfiles
(where that filename can be anything you want)
Then you can use less on that file, e.g.,
less /tmp/du.myfiles

An effective way to use this command is to cd to your home directory, determine the largest sub-directories (using this command), then cd to those directories and do it again.
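
The drill-down described above can be wrapped in a small shell function so that hidden files are always included. This is a sketch under stated assumptions: the name biggest and its default of 25 entries are hypothetical choices, and it assumes a find that supports -mindepth and -maxdepth (both GNU find on Linux and the find shipped with macOS do).

```shell
# biggest [DIR] [COUNT]: list the largest entries directly under DIR,
# hidden ones included, with sizes in kilobytes, largest first.
# "biggest" is a hypothetical helper name, not a standard command.
biggest() {
    dir=${1:-.}      # directory to inspect (default: current directory)
    count=${2:-25}   # how many entries to show (default: 25)
    # find lists every entry directly under $dir, dotfiles included;
    # du -ks then reports each entry's total size in kilobytes.
    find "$dir" -mindepth 1 -maxdepth 1 -exec du -ks {} + 2>/dev/null |
        sort -nr | head -n "$count"
}
```

For example, "biggest ~ 10" shows the ten largest items in your home directory; then run "biggest" again on whichever subdirectory tops the list.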

On the Linux cluster, you can check your home directory on the file servers with quota -s. (This won't work for local data disks.)

How to compress files

First of all, you should use gzip, not compress. gzip is generally faster than compress, and more importantly, it nearly always squeezes files smaller than compress. (Gzip is also backwards compatible, and can undo .Z files created by compress.) The bzip2 utility usually compresses files to an even smaller size than gzip, although it runs more slowly, and is very much recommended for larger files or directories.

Type:

gzip filename
which will create filename.gz
gzip filename* or
gzip fileA fileB
will create a series of gzipped files (fileA.gz, fileB.gz, etc.)
gzip -r dirname
recursively compresses the files in dirname and its subdirectories
gzip -d filename.gz
to uncompress ('d' for decompress).
gzip --help
for more information.
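
To see these commands in action without touching real data, here is a small round trip in a scratch directory. It is only a sketch: the file name sample.txt and its repeated contents are made up for the demonstration. It also uses two further gzip flags, -t (test integrity) and -l (list sizes and compression ratio).

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Make a highly compressible sample file (sample.txt is a made-up name).
i=0
while [ "$i" -lt 2000 ]; do
    echo "the same line of text, repeated many times"
    i=$((i + 1))
done > sample.txt

before=$(wc -c < sample.txt)
gzip sample.txt                  # replaces sample.txt with sample.txt.gz
after=$(wc -c < sample.txt.gz)
echo "before: $before bytes, after: $after bytes"

gzip -t sample.txt.gz && echo "compressed file is intact"
gzip -l sample.txt.gz            # compressed and uncompressed sizes, ratio
gzip -d sample.txt.gz            # restores sample.txt, removes the .gz
```

Note that gzip replaces the original file with the .gz version, so you never hold both copies at once.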

How to create archives

The tar command (Tape ARchive) allows you to create a single file which represents a concatenated group of files, retaining the original directory structure and file ownership/permissions.

Creating a tar file involves no compression unless you request it, and the tar file will be roughly the same size as the sum of the sizes of the files it contains (slightly larger, in fact, because of per-file headers and padding).

Thus, you should plan to compress your tar files as you create them to save space. You can do this by adding a flag of "z" (for gzip) or "j" (for bzip2).

You also want your tar file names to be self-documenting, so you and others will know what they are in the future. Use suffixes such as .tar.gz or even .tgz. (Similarly: .tar.bz2 or .tbz2 for bzip2.)

The best format to use is:

tar zcvf tar_file.tar.gz relative_path_to_files_to_archive
tar jcvf tar_file.tar.bz2 relative_path_to_files_to_archive
where "zcvf" indicates: 'z' compress with gzip ('j' selects bzip2 instead), 'c' create, 'v' verbose, 'f' filename to follow. The name of the tar file itself can be either an absolute or relative path. Be careful not to create a recursive problem by writing the tar file into the very directory you are archiving!

Always use relative pathnames when referring to the files to be archived, because otherwise you have no choice as to where to restore them (they must go back in the same place). [Definition: an absolute path has a leading "/".]

Example:

tar zcvf june_data.tar.gz JUNE_data
Bad example:
tar zcvf june_data.tar.gz ~/JUNE_data

This latter case would expand to "$HOME/JUNE_data", which would mean that that directory would be the only place to which the files could be extracted from the archive. (For example, you might want to put the files back into JUNE_data.OLD or into a directory in /data, and not clobber your existing JUNE_data directory. You lose this flexibility with absolute pathnames.)

The basic rule here is not to specify the files to be tar'ed with a leading "/" (including "~", which has an implied leading slash). (You are allowed to use an absolute path for the name of the tar file you are creating.)
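
The payoff of relative pathnames is easiest to see by restoring an archive somewhere other than where it came from. The sketch below does this in a scratch directory; the directory name restore_test and the sample file are made up, and -C (change to a directory before extracting) is a tar flag not covered above.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir JUNE_data
echo "sample" > JUNE_data/file1

# Archive with a relative path: members are stored as JUNE_data/...
tar zcf june_data.tar.gz JUNE_data

# Restore into a different directory (restore_test is a made-up name),
# leaving the existing JUNE_data untouched.
mkdir restore_test
tar zxf june_data.tar.gz -C restore_test
ls restore_test/JUNE_data
```

Because the archive only knows the files as JUNE_data/..., you are free to unpack them under any directory you choose.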

Follow-through, after tar file creation

Once you have created the tar file, please do one or both of the following items:
  • Delete the original file or directory (if the purpose was archiving for clearing out disk space, as opposed to sharing the files with someone else.) Otherwise you are consuming up to twice the disk space: the original file(s) or directory(ies) and the tar file.
  • Move the tar file itself to another location (a local disk on your workstation, copy to another computer, etc) and delete the tar file from the system.
The whole goal of this exercise was to free up disk space on the server, not consume more of it!
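
Before deleting the originals, it is worth confirming that the tar file can actually be read back. One way to combine the steps is sketched below. The function name archive_and_remove is hypothetical, and the check it performs (listing the whole archive) is a reasonable sanity test rather than a guarantee; run it from the parent directory with a relative name, as recommended above.

```shell
# archive_and_remove DIR: tar+gzip DIR into DIR.tar.gz, read the archive
# back, and delete DIR only if both steps succeed.
# "archive_and_remove" is a hypothetical helper name, not a standard command.
archive_and_remove() {
    dir=${1%/}                   # strip a trailing slash, if any
    archive="$dir.tar.gz"
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    if [ -e "$archive" ]; then
        echo "$archive already exists; not overwriting" >&2
        return 1
    fi
    # Listing the archive forces tar to decompress and read all of it.
    if tar zcf "$archive" "$dir" && tar ztf "$archive" > /dev/null; then
        rm -rf "$dir"
        echo "created $archive and removed $dir"
    else
        echo "archiving failed; $dir left in place" >&2
        return 1
    fi
}
```

For example, "archive_and_remove JUNE_data" leaves you with june-style archive JUNE_data.tar.gz and no JUNE_data directory, so the disk space is actually reclaimed.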

How to extract from tar files

Don't plan to uncompress first: you can simply work with the tar file in its compressed form! (Some web browsers unhelpfully decompress .tar.gz files.)

You simply replace the "c" with an "x" (eXtract):

gtar zxvf filename.tar.gz
To extract a particular file, specify it on the command line:
gtar zxvf filename.tar.gz file3
If you don't know where that file is, or the exact way the tar file refers to it, you can grep for it:
gtar ztvf filename.tar.gz | grep file3
which might show something like "./dir_AB/file3". Now reissue the extract command with that exact path:
gtar zxvf filename.tar.gz ./dir_AB/file3
The dir_AB directory will be created and file3 will be found there.
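
The whole look-up-then-extract sequence can be tried safely in a scratch directory. This sketch builds a tiny archive first so that it is self-contained. It uses plain tar rather than gtar on the assumption that gtar is simply GNU tar under another name; substitute whichever your system provides.

```shell
# Build a small demonstration archive in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir dir_AB
echo one > dir_AB/file1
echo three > dir_AB/file3
tar zcf filename.tar.gz ./dir_AB
rm -r dir_AB

# Find the member name exactly as the archive stores it...
member=$(tar ztf filename.tar.gz | grep 'file3$')
echo "stored as: $member"

# ...then extract only that member, using the stored name verbatim.
tar zxf filename.tar.gz "$member"
ls dir_AB
```

Only the requested file comes back; the other members stay in the archive.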

Important comment & rant:

Always examine the table of contents of a tar file (especially one you get from someone else) before extracting anything. This is done by using the table of contents "t" flag (instead of "c" for 'create'):
gtar ztvf filename.tar.gz
The reason this is vital is that not everyone follows the "good form" of tarring from a directory (instead of random files). You want to know whether to create a directory into which to put the files or whether the tar file will create one for you.

example of 'good' tar file:
	gtar ztvf filename.tar.gz
	dir_AB/file1
	dir_AB/file2
	dir_AB/file3
	dir_AB/sub_dir_xx/file4
	dir_AB/sub_dir_xx/file5
	dir_AB/sub_dir_xx/file6

example of 'unfriendly' tar file:
	gtar ztvf filename.tar.gz
	file1
	file2
	file3
	sub_dir_xx/file4
	sub_dir_xx/file5
	sub_dir_xx/file6
The latter will 'litter' your current directory with file1, file2, and file3 amidst everything else you already had there. It is much better for it to create dir_AB and for you to know that everything you just extracted is there. If you check the table of contents, and the form is 'unfriendly' you have the opportunity of creating a directory first:
	gtar ztvf filename.tar.gz (determine that it is unfriendly)
	mkdir dir_foo
	cd dir_foo
	gtar zxvf ../filename.tar.gz
Think about this issue as well when you are creating tar files. You might want to go up one directory ("cd ..") and do the tar from there.
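
The check-before-extracting habit can also be automated. The sketch below counts the distinct top-level names in a .tar.gz file and, if there is more than one, extracts into a freshly made directory named after the archive. Both function names (top_level_entries, safe_extract) are hypothetical, the naming rule for the new directory is an arbitrary choice, and plain tar is used on the assumption that it behaves like gtar here.

```shell
# top_level_entries ARCHIVE: print each distinct top-level name in a .tar.gz.
# A 'good' tar file prints exactly one name (its directory).
top_level_entries() {
    tar ztf "$1" | sed -e 's|^\./||' -e 's|/.*||' | sort -u | grep -v '^$'
}

# safe_extract ARCHIVE: extract in place if the archive has a single
# top-level entry; otherwise extract into a new directory named after it.
safe_extract() {
    archive=$1
    n=$(top_level_entries "$archive" | wc -l | tr -d ' ')
    if [ "$n" -eq 1 ]; then
        tar zxf "$archive"
    else
        # e.g. messy.tar.gz is unpacked into ./messy (an arbitrary convention)
        target=$(basename "$archive" .tar.gz)
        # mkdir fails if the directory already exists, so nothing is clobbered
        mkdir "$target" && tar zxf "$archive" -C "$target"
    fi
}
```

Running "top_level_entries filename.tar.gz" on its own is a quick way to see at a glance whether an archive is 'good' or 'unfriendly'.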

END IMPORTANT COMMENT & RANT. (with apologies to the Mac Bible :-) )

 


David Friedlander
October 1995, revised April 2020.