Archiving Data to HPSS

The NERSC HPSS Archive is a large tape library that can store simulation files for long periods of time. It is recommended that you move your data to HPSS frequently, since the scratch filesystems fill up and NERSC periodically purges data.

The script nersc.xfer.slurm in job_scripts/perlmutter/ can be used to archive data to HPSS automatically. It is submitted to the xfer queue and runs the script process.xrb in job_scripts/hpss/, which continually looks for output and stores it to HPSS. By default, the destination directory on HPSS will have the same name as the directory your plotfiles are located in. This can be changed by editing the $HPSS_DIR variable at the top of process.xrb.
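
For example, assuming process.xrb is a shell script (as the $HPSS_DIR syntax suggests), redirecting the archives to a differently named HPSS directory might look like the following; the directory name shown is purely illustrative:

    # Near the top of process.xrb: override the HPSS destination directory.
    # By default it matches the name of the local run directory.
    HPSS_DIR="xrb_production_run_01"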

The following describes how to use the scripts:

  1. Copy the process.xrb script and the slurm script nersc.xfer.slurm into the directory with the plotfiles.

  2. Submit the archive job:

    sbatch nersc.xfer.slurm
    

    The script process.xrb is called from the xfer job and will run in the background, continually waiting for checkpoint files or plotfiles to be created.

    Note

    process.xrb always leaves the most recent plotfile and checkpoint file alone, since the simulation may still be writing to them.

    The script will use htar to archive the plotfiles and checkpoints to HPSS (a sketch of the htar pattern appears after this list).

    If the htar command was successful, the plotfiles are moved into a plotfile/ subdirectory. This is important: it prevents the same data from being archived a second time and overwriting the stored copy, especially if a purge has since removed some of the local data. The same is done with checkpoint files.
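
As a rough sketch of the pattern the script follows (not a verbatim excerpt from process.xrb), each output directory is bundled into a tar file directly on HPSS with htar; the plotfile and HPSS directory names below are hypothetical:

    # bundle one plotfile directory into a tar archive stored on HPSS
    htar -cvf my_run/plt00100.tar plt00100

    # verify that the archive (and its .idx index file) landed on HPSS
    hsi ls my_run/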

Some additional notes:

  • If the ftime executable is in your path (ftime.cpp lives in amrex/Tools/Plotfile/), then the script will create a file called ftime.out that lists the name of each plotfile and its corresponding simulation time.

  • When the job starts, the script will tar up all of the diagnostic files, ftime.out, the submission script, and the inputs file, and archive them on HPSS. The .tar file is given a name containing a date string so that multiple archives can coexist.

  • When process.xrb is running, it creates a lockfile (called process.jobid) that ensures that only one instance of the script is running at any one time.

    Warning

    Sometimes, if the job is not terminated normally, the process.jobid file will be left behind. Later jobs should be able to detect this and clean up the stale lockfile, but if this does not happen, you can delete the file yourself once you are sure the script is no longer running (see the sketch below).
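
For instance, a minimal cleanup by hand could look like the following, assuming you have first confirmed that none of your jobs (in particular the xfer job) are still running:

    # list your queued/running jobs to confirm the archive job is gone
    squeue -u $USER

    # remove the stale lockfile left by process.xrb
    rm process.jobid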

Jobs in the xfer queue start up quickly. The best approach is to submit the archive job when you submit your main job (or make it dependent on the main job). The sample process.xrb script will wait for output and archive it as it is produced.
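
For example, a minimal sketch of submitting both jobs together, where main_job.slurm is a hypothetical name for your main submission script:

    # submit the main simulation job and capture its job id
    jobid=$(sbatch --parsable main_job.slurm)

    # start the archive job once the main job begins running; use
    # afterany instead of after to wait until the main job finishes
    sbatch --dependency=after:$jobid nersc.xfer.slurm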