Archiving Data to HPSS
The NERSC HPSS archive is a large tape library that can store simulation files for long periods of time. It is recommended to move your data to HPSS frequently, since the scratch filesystems fill up and NERSC purges data from them periodically.
The script ``nersc.xfer.slurm`` in ``job_scripts/perlmutter/`` can be used to archive data to HPSS automatically. It is submitted to the xfer queue and runs the script ``process.xrb`` in ``job_scripts/hpss/``, which continually looks for output and stores it to HPSS.

By default, the destination directory on HPSS will have the same name as the directory containing your plotfiles. This can be changed by editing the ``$HPSS_DIR`` variable at the top of ``process.xrb``.
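The default destination can be sketched as follows (a minimal illustration of the behavior described above, not the exact code in ``process.xrb``):

```shell
# If HPSS_DIR is not set explicitly, fall back to the name of the
# current directory, so the archive on HPSS mirrors the run directory.
HPSS_DIR=${HPSS_DIR:-$(basename "$(pwd)")}
echo "HPSS destination: $HPSS_DIR"
```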
The following describes how to use the scripts:
1. Copy the ``process.xrb`` script and the slurm script ``nersc.xfer.slurm`` into the directory with the plotfiles.

2. Submit the archive job:

   sbatch nersc.xfer.slurm

3. The script ``process.xrb`` is called from the xfer job. It runs in the background and continually waits for checkpoint or plotfiles to be created.

   Note: ``process.xrb`` always leaves the most recent plotfile and checkpoint file alone, since data may still be being written to them.

4. The script uses ``htar`` to archive the plotfiles and checkpoints to HPSS.

5. If the ``htar`` command was successful, the plotfiles are copied into a ``plotfile/`` subdirectory. This is important, since you don't want to archive the data a second time and overwrite the stored copy, especially if a purge took place. The same is done with checkpoint files.
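The archiving loop described above can be sketched like this (a simplified stand-in for ``process.xrb``, not the script itself; it assumes plotfiles are directories named ``plt*``, and the ``HTAR`` variable lets you substitute another command, e.g. ``HTAR=echo`` for a dry run):

```shell
# Simplified sketch of the process.xrb archiving logic.
archive_plotfiles() {
    local HTAR=${HTAR:-htar}
    mkdir -p plotfile
    # Oldest-first list of plotfile directories; skip the newest one,
    # since the simulation may still be writing to it.
    local plts=( $(ls -d plt*/ 2>/dev/null | sort) )
    local n=${#plts[@]}
    local i p
    for (( i = 0; i < n - 1; i++ )); do
        p=${plts[$i]%/}
        # Only move the plotfile aside if htar succeeded, so a failed
        # transfer is retried later rather than silently lost.
        if $HTAR -cvf "${p}.tar" "$p" >/dev/null; then
            mv "$p" plotfile/
        fi
    done
}
```

Real error handling, checkpoint files, and the wait-and-poll loop are omitted here for brevity.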
Some additional notes:
- If the ``ftime`` executable is in your path (``ftime.cpp`` lives in ``amrex/Tools/Plotfile/``), then the script will create a file called ``ftime.out`` that lists the name of each plotfile and the corresponding simulation time.

- Right when the job starts, the script will tar up all of the diagnostic files, ``ftime.out``, the submission script, and the inputs, and archive them on HPSS. The ``.tar`` file is given a name that contains a date string, to allow multiple archives to coexist.

- While ``process.xrb`` is running, it maintains a lockfile (called ``process.jobid``) that ensures only one instance of the script runs at any one time.

  Warning: if the job is not terminated normally, the ``process.jobid`` file may be left behind. Later jobs should be able to detect this and clean up the stale lockfile, but if this doesn't work, you can delete the file yourself once you are sure the script is not running.
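The lockfile check can be sketched as follows (an illustration of the single-instance behavior described above, not the actual ``process.xrb`` code; the assumption that the Slurm job ID is written into the file is ours):

```shell
# Sketch of a single-instance lockfile, using the file name from above.
LOCKFILE=process.jobid

acquire_lock() {
    if [ -e "$LOCKFILE" ]; then
        # Another instance may be running, or a previous job died
        # without cleaning up (a stale lock).
        echo "lockfile $LOCKFILE exists; refusing to start" >&2
        return 1
    fi
    # Record who holds the lock (Slurm job ID if available, else PID).
    echo "${SLURM_JOB_ID:-$$}" > "$LOCKFILE"
    # Remove the lockfile when the script exits normally.
    trap 'rm -f "$LOCKFILE"' EXIT
}
```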
Jobs in the xfer queue start up quickly. The best approach is to start one as you start your main job (or make it dependent on the main job). The sample ``process.xrb`` script will wait for output and archive it as it is produced.
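Making the xfer job dependent on the main job can be done with Slurm's ``--dependency`` option. A sketch (the job ID and script name ``main.slurm`` are placeholders; the command is assembled and printed rather than submitted, so on Perlmutter you would run the ``sbatch`` line itself):

```shell
# Hypothetical job ID; on a real system you would capture it with:
#   MAIN_JOBID=$(sbatch --parsable main.slurm)
MAIN_JOBID=123456

# "after" starts the xfer job once the main job has begun, so archiving
# proceeds while the simulation is still producing output.
XFER_CMD="sbatch --dependency=after:${MAIN_JOBID} nersc.xfer.slurm"
echo "$XFER_CMD"
```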