Archiving Data to HPSS
The NERSC HPSS Archive is a large tape library that can store simulation files for long periods of time. It is a good idea to move your data to HPSS frequently, since the scratch filesystems fill up and NERSC purges them periodically.
The script ``nersc.xfer.slurm`` in ``job_scripts/perlmutter/`` can be used to archive data to HPSS automatically. It is submitted to the xfer queue and runs the script ``process.xrb`` in ``job_scripts/hpss/``, which continually looks for output and stores it to HPSS.
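For reference, a minimal xfer job script might look like the following. This is only a sketch: the queue name, time limit, and license line are assumptions, and the actual ``nersc.xfer.slurm`` in the repository should be preferred.

```shell
#!/bin/bash
#SBATCH --qos=xfer            # NERSC transfer queue
#SBATCH --time=12:00:00       # assumed time limit; xfer jobs can run for a long time
#SBATCH --job-name=archive
#SBATCH --licenses=scratch    # assumed; requests access to the scratch filesystem

# run the archiving script from the directory the job was submitted in
cd "$SLURM_SUBMIT_DIR"
./process.xrb
```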
By default, the destination directory on HPSS will have the same name as the directory your plotfiles are located in.  This can be changed by editing the ``$HPSS_DIR`` variable at the top of ``process.xrb``.
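For example, the override at the top of ``process.xrb`` would look something like this (the value shown is a made-up example; by default the script uses the name of the local directory):

```shell
# destination directory on HPSS (assumed example value)
HPSS_DIR=my_simulation_run
```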
The following describes how to use the scripts:

1. Copy the ``process.xrb`` script and the slurm script ``nersc.xfer.slurm`` into the directory with the plotfiles.

2. Submit the archive job: ``sbatch nersc.xfer.slurm``

3. The script ``process.xrb`` is called from the xfer job; it runs in the background and continually waits for checkpoint files or plotfiles to be created.

   Note: ``process.xrb`` always leaves the most recent plotfile and checkpoint file alone, since data may still be being written to them.

4. The script uses ``htar`` to archive the plotfiles and checkpoints to HPSS.

5. If the ``htar`` command was successful, the plotfiles are copied into a ``plotfile/`` subdirectory. This is important, since you don't want to archive the data a second time and overwrite the stored copy, especially if a purge took place. The same is done with checkpoint files.
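The core of the steps above can be sketched as a small shell loop. This is a simplified illustration, not the actual ``process.xrb``: the plotfile names are made up, and the real ``htar`` call is shown as a comment so the sketch runs anywhere.

```shell
#!/bin/sh
# Sketch of the archive loop: gather all plotfiles except the newest
# (which may still be being written to), archive each one, then move
# it aside so it is not archived a second time.

# made-up plotfiles for illustration
mkdir -p plt00000 plt00100 plt00200 plotfile

# sort the plotfile directories and drop the most recent one
plist=$(ls -d plt* | sort | head -n -1)

for p in $plist; do
    echo "archiving ${p}"
    # on a real system you would archive first and only move on success:
    # htar -cvf ${p}.tar ${p} && mv ${p} plotfile/
    mv "${p}" plotfile/
done
```

Note the ``&&`` in the commented ``htar`` line: the plotfile is only moved aside if the archive command succeeded.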
Some additional notes:
- If the ``ftime`` executable is in your path (``ftime.cpp`` lives in ``amrex/Tools/Plotfile/``), the script will create a file called ``ftime.out`` that lists the name of each plotfile and the corresponding simulation time.

- Right when the job is run, the script will tar up all of the diagnostic files, ``ftime.out``, the submission script, and the inputs file, and archive them on HPSS. The ``.tar`` file is given a name containing a date string to allow multiple archives to coexist.

- When ``process.xrb`` is running, it creates a lockfile (called ``process.jobid``) that ensures that only one instance of the script is running at any one time.

  Warning: if the job is not terminated normally, the ``process.jobid`` file may be left behind. Later jobs should be able to detect this and clean up the stale lockfile, but if this doesn't work, you can delete the file yourself once you know the script is not running.
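The lockfile mechanism can be sketched as follows. This is a simplified stand-in for what ``process.xrb`` does (the real script's stale-lock handling is more involved):

```shell
#!/bin/sh
# Sketch of a simple lockfile guard: refuse to start if another
# instance already holds the lock, otherwise claim it, do the work,
# and release it on the way out.

lockfile=process.jobid

if [ -f "$lockfile" ]; then
    echo "lockfile $lockfile exists -- another instance may be running"
    exit 1
fi

# claim the lock with our process id
echo $$ > "$lockfile"

echo "archiving would happen here"

# release the lock when done
rm -f "$lockfile"
```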
Jobs in the xfer queue start up quickly. The best approach is to submit one as you start your main job (or make it dependent on the main job). The sample ``process.xrb`` script will wait for output and then archive it as it is produced.
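Making the xfer job depend on the main job can be done with Slurm's job-dependency support, for example (``main.slurm`` here is a placeholder for your main job script):

```shell
# submit the main job and capture its id (--parsable prints just the id)
jobid=$(sbatch --parsable main.slurm)

# start the archive job only once the main job has begun running
sbatch --dependency=after:${jobid} nersc.xfer.slurm
```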