Managing Jobs at OLCF

Summit

Summit Architecture:

Let us start by reviewing the node architecture of Summit. Our goal is to provide the insight needed to make better decisions when constructing our AMReX-Astro job scripts, and to explain how our code interacts with Summit. The information in this section is a condensed version of the Summit user guide and should not replace it.

On Summit, a node is composed of two sockets, each with 21 physical CPU cores (plus 1 reserved for the system), 3 GPUs, and 1 RAM memory bank. The sockets are connected by a bus that allows communication between them. Each physical CPU core supports up to 4 hardware threads. The structure of the node is depicted below:

Figure 1: Summit node architecture. Figure extracted from https://docs.olcf.ornl.gov/systems/summit_user_guide.html#job-launcher-jsrun.
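Adding this up, each node provides:

2 sockets x 21 cores  = 42 physical CPU cores available to a job
2 sockets x 3 GPUs    = 6 GPUs
42 cores x 4 threads  = 168 hardware threads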

A resource set is a minimal collection of physical CPU cores and GPUs on which a certain number of MPI processes and OpenMP threads operate during the execution of the code. Therefore, for each resource set, we need to specify:

  • A number of CPU physical cores.

  • A number of physical GPUs.

  • A number of MPI processes.

  • The number of OpenMP threads.

where each core supports up to 4 hardware threads; however, we do not use this option in our AMReX runs, so we will not discuss it further here. For now, we fix a single thread throughout the execution of our code. The next step is to determine the maximum number of resource sets that fit into one node.

In Castro we construct each resource set with 1 physical CPU core, 1 GPU, and 1 MPI process. The next step is to see how many resource sets fit into one node. According to the node architecture depicted in Figure 1, we can fit up to 6 resource sets per node (one per GPU), as in Figure 2.

Figure 2: Six resource sets per Summit node. Figure modified and extracted from https://docs.olcf.ornl.gov/systems/summit_user_guide.html#job-launcher-jsrun.

Requesting Allocation:

To allocate the resource sets, we use the bsub command together with the following flags:

-nnodes
    Allocates the number of nodes we need to run our code. It is important to
    perform the calculation described in the previous section to select the
    correct number of nodes for our setup.

-W
    Sets the walltime of the allocation. The format used on Summit is
    [hours:]minutes; seconds cannot be specified. The maximum walltime we can
    request is 03:00 (three hours).

-alloc_flags
    Sets the maximum number of hardware threads available per CPU core. The
    default is smt4, which allows 4 threads per core. Since we use only one
    thread throughout the execution, we set smt1 instead. -alloc_flags accepts
    other options as well, but this is the only one we are interested in here.

-J
    Defines the name of the allocation. The string %J used in the output file
    names corresponds to the allocation ID number.

-o
    Defines the name of the file that captures the standard output stream of
    all the jobs run inside the requested allocation.

-e
    Defines the name of the file that captures the standard error stream,
    analogous to the -o flag. If -e is not supplied, the standard error is
    written to the -o file.

-q
    Defines the queue on which our application will run. There are several
    options, but we alternate between two: the standard production queue batch
    and the debugging queue debug. The debug queue is designed to allocate a
    small number of nodes for short runs that verify the code runs smoothly and
    without bugs.

-Is
    Requests an interactive job, followed by the shell to use (for bash,
    /bin/bash). This flag is very useful for debugging, because the standard
    output can be inspected while the code is running. Note that an interactive
    job can only be requested from the command line, not from a batch script.

For example, to allocate one node for an interactive job in the debug queue for 30 minutes, we can use:

bsub -nnodes 1 -q debug -W 0:30 -P ast106 -alloc_flags smt1 -J example -o stdout_to_show.%J -e stderr_to_show.%J -Is /bin/bash

Note

An interactive job can only be allocated by the use of the command line. No script can be defined for interactive jobs.

Submitting a Job:

Once our allocation is granted, it is important to load the same modules used to compile the executable and to export the variable OMP_NUM_THREADS to set the number of threads per MPI process.

In Castro, we have used the following modules:

module load gcc/10.2.0
module load cuda/11.5.2
module load python

and fix one thread per MPI process with:

export OMP_NUM_THREADS=1

The next step is to submit our job. The jsrun command, given the total number of resource sets, the number of physical CPU cores per resource set, the number of GPUs per resource set, the number of MPI processes per resource set, and the maximum number of resource sets per node, works as follows:

jsrun -n[number of resource sets] -c[number of CPU physical cores] -g[number of GPUs] -a[number of MPI processes] -r[number of max resources per node] ./[executable] [executable inputs]

In Castro we will use:

jsrun -n [number of resource sets] -a1 -c1 -g1 -r6 ./$CASTRO $INPUTS

where the CASTRO and INPUTS environment variables hold the executable and inputs file names, respectively.
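For example, a single-node run using all 6 resource sets, with hypothetical executable and inputs file names, would look like:

CASTRO=Castro2d.gnu.MPI.CUDA.ex
INPUTS=inputs_2d
jsrun -n 6 -a1 -c1 -g1 -r6 ./$CASTRO $INPUTS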

Now, in order to use all the resources we have allocated, the number of resource sets should match the number of AMReX boxes (grids) on the level with the largest number of boxes. Let us consider an excerpt from the standard output of a Castro problem:

INITIAL GRIDS
Level 0   2 grids  32768 cells  100 % of domain
          smallest grid: 128 x 128  biggest grid: 128 x 128
Level 1   8 grids  131072 cells  100 % of domain
          smallest grid: 128 x 128  biggest grid: 128 x 128
Level 2   8 grids  524288 cells  100 % of domain
          smallest grid: 256 x 256  biggest grid: 256 x 256
Level 3   32 grids  2097152 cells  100 % of domain
          smallest grid: 256 x 256  biggest grid: 256 x 256
Level 4   128 grids  7864320 cells  93.75 % of domain
          smallest grid: 256 x 128  biggest grid: 256 x 256
Level 5   480 grids  30408704 cells  90.625 % of domain
          smallest grid: 256 x 128  biggest grid: 256 x 256

In this example, Level 5 contains the largest number of AMReX boxes: 480. From this, a good allocation for this problem is 480 resource sets, equivalent to 80 nodes with 6 resource sets per node. However, note that Level 0 uses only 2 AMReX boxes; this implies that, of the 480 resource sets available, 478 will remain idle while the two working MPI processes sweep the entire Level 0.

Note

Therefore, it is important, if possible, to keep the number of boxes on each level balanced in order to maximize the use of the allocated resources.
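As a quick check of the arithmetic in the example above, the number of nodes to request can be computed with a bit of shell arithmetic (a sketch; the box count is read off the grid summary in the standard output):

n_boxes=480          # boxes on the most populated level (Level 5 above)
res_per_node=6       # resource sets (one per GPU) per Summit node
# ceiling division gives the node count to pass to -nnodes
n_nodes=$(( (n_boxes + res_per_node - 1) / res_per_node ))
echo $n_nodes        # -> 80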

Writing a Job Script:

To make our lives easier, instead of typing an allocation command, loading the modules, setting the threads per MPI process, and typing another command to submit our jobs, we can pack all of these commands into a single executable .sh script that is submitted via bsub just once.

We start our job script with the shebang line #!/bin/bash. Then we add the bsub allocation flags, each prefixed with #BSUB, as follows:

#!/bin/bash
#BSUB -P ast106
#BSUB -W 2:00
#BSUB -nnodes 80
#BSUB -alloc_flags smt1
#BSUB -J luna_script
#BSUB -o luna_output.%J
#BSUB -e luna_sniffing_output.%J

In addition, we add the module load statements and fix one thread per MPI process:

module load gcc/10.2.0
module load cuda/11.5.2
module load python

export OMP_NUM_THREADS=1

and define the environment variables:

CASTRO=./Castro2d.gnu.MPI.CUDA.ex
INPUTS=inputs_luna

n_res=480                # The max allocated number of resource sets is
n_cpu_cores_per_res=1    # nnodes * n_max_res_per_node. In this case we will
n_mpi_per_res=1          # use all the allocated resource sets to run the job
n_gpu_per_res=1          # below.
n_max_res_per_node=6

Once the allocation time ends, the job is killed, leaving us where we started. As pointed out above, the maximum allocation time on Summit is 03:00 (three hours), but we may need weeks or months of walltime to complete our runs. This is where the automatic restarting section of the script comes to the rescue.

Here we can add an optional (but strongly recommended) piece to our script. As the code executes, it creates checkpoint files of the form chkxxxxxxx, chkxxxxxx, or chkxxxxx every certain number of timesteps. These checkpoint files can be read by our executable to restart the run from the simulation time at which the checkpoint was written. This is implemented as follows:

function find_chk_file {
   # find_chk_file takes a single argument -- the wildcard pattern
   # for checkpoint files to look through
   chk=$1

   # find the latest 2 restart files.  This way if the latest didn't
   # complete we fall back to the previous one.
   temp_files=$(find . -maxdepth 1 -name "${chk}" -print | sort | tail -2)
   restartFile=""
   for f in ${temp_files}
   do
      # the Header is the last thing written -- if it's there, update the restart file
      if [ -f ${f}/Header ]; then
         restartFile="${f}"
      fi
   done
}

# look for 7-digit chk files
find_chk_file "*chk???????"

if [ "${restartFile}" = "" ]; then
   # look for 6-digit chk files
   find_chk_file "*chk??????"
fi

if [ "${restartFile}" = "" ]; then
   # look for 5-digit chk files
   find_chk_file "*chk?????"
fi

# restartString will be empty if no chk files are found -- i.e. new run
if [ "${restartFile}" = "" ]; then
   restartString=""
else
   restartString="amr.restart=${restartFile}"
fi

The function find_chk_file searches the submission directory for checkpoint files. Because AMReX adds digits as the number of steps increases (with a minimum of 5 digits), we search for files with 7 digits, then 6 digits, and finally 5 digits, to ensure we pick up the latest file.
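For example, if the most recent complete checkpoint in the run directory is chk0012345 (a hypothetical name), the logic above ends up with:

restartFile=./chk0012345
restartString="amr.restart=./chk0012345"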

We can also ask the job manager to send a warning signal some amount of time before the allocation expires by passing -wa 'signal' and -wt '[hour:]minute' to bsub. We can then have bash create a dump_and_stop file when it receives the signal, which will tell Castro to output a checkpoint file and exit cleanly after it finishes the current timestep. An important detail that I couldn’t find documented anywhere is that the job manager sends the signal to all the processes in the job, not just the submission script, and we have to use a signal that is ignored by default so Castro doesn’t immediately crash upon receiving it. SIGCHLD, SIGURG, and SIGWINCH are the only signals that fit this requirement and of these, SIGURG is the least likely to be triggered by other events.

#BSUB -wa URG
#BSUB -wt 2

...

function sig_handler {
   touch dump_and_stop
   # disable this signal handler
   trap - URG
   echo "BATCH: allocation ending soon; telling Castro to dump a checkpoint and stop"
}
trap sig_handler URG

We use the jsrun command to launch Castro on the compute nodes. In order for bash to handle the warning signal before Castro exits, we must put jsrun in the background and use the shell builtin wait:

jsrun -n$n_res -c$n_cpu_cores_per_res -a$n_mpi_per_res -g$n_gpu_per_res -r$n_max_res_per_node $CASTRO $INPUTS ${restartString} &
wait
# use jswait to wait for Castro (job step 1/1) to finish and get the exit code
jswait 1

Finally, once the script is complete and saved as luna_script.sh, we can submit it with:

bsub luna_script.sh

Monitoring a Job:

You can monitor the status of your jobs using bjobs. A slightly nicer view of your jobs is available with jobstat:

jobstat -u username
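Similarly, running bjobs with no arguments lists your own jobs, and passing a specific job id (hypothetical here) gives a more detailed report:

bjobs            # list your own jobs
bjobs -l 123456  # detailed report for a single (hypothetical) job id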

Script Template:

Packing together all of the information above leads us to the following script template:

#!/bin/bash
#BSUB -P ast106
#BSUB -W 2:00
#BSUB -nnodes 80
#BSUB -alloc_flags smt1
#BSUB -J luna_script
#BSUB -o luna_output.%J
#BSUB -e luna_sniffing_output.%J
#BSUB -wa URG
#BSUB -wt 2

module load gcc/10.2.0
module load cuda/11.5.2
module load python

export OMP_NUM_THREADS=1

CASTRO=./Castro2d.gnu.MPI.CUDA.ex
INPUTS=inputs_luna

n_res=480                # The max allocated number of resource sets is
n_cpu_cores_per_res=1    # nnodes * n_max_res_per_node. In this case we use all
n_mpi_per_res=1          # the allocated resource sets to run the single job below;
n_gpu_per_res=1          # we could instead define more environment variables and
n_max_res_per_node=6     # split them among two simultaneous jobs, with n_res = n_res_1 + n_res_2.

function find_chk_file {
    # find_chk_file takes a single argument -- the wildcard pattern
    # for checkpoint files to look through
    chk=$1

    # find the latest 2 restart files.  This way if the latest didn't
    # complete we fall back to the previous one.
    temp_files=$(find . -maxdepth 1 -name "${chk}" -print | sort | tail -2)
    restartFile=""
    for f in ${temp_files}
    do
        # the Header is the last thing written -- if it's there, update the restart file
        if [ -f ${f}/Header ]; then
            # The scratch FS sometimes gives I/O errors when trying to read
            # from recently-created files, which crashes Castro. Avoid this by
            # making sure we can read from all the data files.
            if head --quiet -c1 "${f}/Header" "${f}"/Level_*/* >/dev/null; then
                restartFile="${f}"
            fi
        fi
    done
}

# look for 7-digit chk files
find_chk_file "*chk???????"

if [ "${restartFile}" = "" ]; then
    # look for 6-digit chk files
    find_chk_file "*chk??????"
fi

if [ "${restartFile}" = "" ]; then
    # look for 5-digit chk files
    find_chk_file "*chk?????"
fi

# restartString will be empty if no chk files are found -- i.e. new run
if [ "${restartFile}" = "" ]; then
    restartString=""
else
    restartString="amr.restart=${restartFile}"
fi

# clean up any run management files left over from previous runs
rm -f dump_and_stop

warning_time=$(bjobs -noheader -o action_warning_time "$LSB_JOBID")
# The `-wa URG -wt <n>` options tell bsub to send SIGURG to all processes n
# minutes before the runtime limit, so we can exit gracefully.
# SIGURG is ignored by default, so it won't make Castro crash.
function sig_handler {
    touch dump_and_stop
    # disable this signal handler
    trap - URG
    echo "BATCH: $warning_time left in allocation; telling Castro to dump a checkpoint and stop"
}
trap sig_handler URG

# execute jsrun in the background then use the builtin wait so the shell can
# handle the signal
jsrun -n$n_res -c$n_cpu_cores_per_res -a$n_mpi_per_res -g$n_gpu_per_res -r$n_max_res_per_node $CASTRO $INPUTS ${restartString} &
wait
# use jswait to wait for Castro (job step 1/1) to finish and get the exit code
jswait 1

Chaining jobs

The script job_scripts/summit/chain_submit.sh can be used to set up job dependencies, i.e., a job chain.

First, you submit a job as usual using bsub and make note of the job id it prints upon submission (the same id you would see with bjobs or jobstat). Then you set up N jobs to depend on the one you just submitted:

chain_submit.sh job-id N submit_script.sh

where you replace job-id with the id returned from your first submission, N with the number of additional jobs, and submit_script.sh with the name of the script you use to submit the job. This will queue up N additional jobs, each depending on the previous one. Your submission script should use the automatic restarting features discussed above.
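For example, if the first submission returned job id 123456 (a hypothetical id) and we want 5 more jobs reusing the same script:

bsub luna_script.sh                      # prints the job id, e.g. 123456
chain_submit.sh 123456 5 luna_script.sh  # queue 5 more jobs, each depending on the previous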

Archiving to HPSS

You can access HPSS from Summit using the data transfer nodes by submitting a job via SLURM:

sbatch -N 1 -t 15:00 -A ast106 --cluster dtn test_hpss.sh

where test_hpss.sh is a SLURM script containing the htar commands needed to archive your data (the data transfer nodes use SLURM as their job manager).
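A minimal sketch of what test_hpss.sh could contain (hypothetical run and archive names; the full-featured approach is the process.xrb script described below):

#!/bin/bash
# archive one plotfile directory into a tar file on HPSS
cd $SLURM_SUBMIT_DIR
htar -cvf my_run/plt00000.tar plt00000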

An example is provided by the process.xrb archiving script in job_scripts/hpss/ and the associated summit_hpss.submit submission script in job_scripts/summit/. Together these will detect new plotfiles as they are generated, tar them up (using htar), and archive them onto HPSS. They will also store the inputs, probin, and other runtime-generated files. If ftime is found in your path, the script will also create a file called ftime.out that lists the simulation time corresponding to each plotfile.

Once the plotfiles are archived they are moved to a subdirectory under your run directory called plotfiles/.

By default, the files will be archived to a directory in HPSS with the same name as the directory your plotfiles are located in. This can be changed by editing the $HPSS_DIR variable at the top of process.xrb.
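For instance, to archive into a specific HPSS directory (a hypothetical name), edit the variable at the top of process.xrb:

HPSS_DIR=my_project/luna_plotfiles    # hypothetical destination directory on HPSS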

To use this, we do the following:

  1. Copy the process.xrb and summit_hpss.submit scripts into the directory with the plotfiles.

  2. Launch the script via:

    sbatch summit_hpss.submit
    

    It will run for the full time you requested, searching for plotfiles as they are created and archiving them to HPSS as they appear (it will always leave the most recent plotfile alone, since it cannot tell whether it is still being written).

Files may be unarchived in bulk from HPSS on OLCF systems using the hpss_xfer.py script, which is available in the job_scripts directory. It requires Python 3 to be loaded to run. The command:

./hpss_xfer.py plt00000 -s hpss_dir -o plotfile_dir

will fetch hpss_dir/plt00000.tar from the HPSS filesystem and unpack it in plotfile_dir. If run with no arguments in the problem launch directory, the script will attempt to recover all plotfiles archived by process.xrb. Try running ./hpss_xfer.py --help for a description of usage and arguments.
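Since the script fetches one plotfile per invocation, several archived plotfiles can be retrieved with a simple loop (hypothetical names):

for p in plt00000 plt01000 plt02000
do
    ./hpss_xfer.py $p -s my_run -o plotfiles
done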

Frontier

Machine details

Queue policies are here: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#scheduling-policy

The filesystem is called Orion and is Lustre: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#data-and-storage

Submitting jobs

Frontier uses SLURM.

Here’s a script that runs with 2 nodes using all 8 GPUs per node:

#!/bin/bash
#SBATCH -A AST106
#SBATCH -J testing
#SBATCH -o %x-%j.out
#SBATCH -t 00:05:00
#SBATCH -p batch
# here N is the number of compute nodes
#SBATCH -N 2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest

EXEC=Castro3d.hip.x86-trento.MPI.HIP.ex
INPUTS=inputs.3d.sph

module load PrgEnv-gnu
module load craype-accel-amd-gfx90a
module load cray-mpich/8.1.27
module load amd-mixed/6.0.0

export OMP_NUM_THREADS=1
export NMPI_PER_NODE=8
export TOTAL_NMPI=$(( ${SLURM_JOB_NUM_NODES} * ${NMPI_PER_NODE} ))

srun -n${TOTAL_NMPI} -N${SLURM_JOB_NUM_NODES} --ntasks-per-node=8 --gpus-per-task=1 ./$EXEC $INPUTS

Note

As of June 2023, it is necessary to explicitly use -n and -N on the srun line.

The job is submitted as:

sbatch frontier.slurm

where frontier.slurm is the name of the submission script.

A sample job script that includes the automatic restart functions can be found here: https://github.com/AMReX-Astro/workflow/blob/main/job_scripts/frontier/frontier.slurm

Also see the WarpX docs: https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html

Job Status

You can check on the status of your jobs via:

squeue --me

and get an estimated start time via:

squeue --me --start

Job Chaining

The script chainslurm.sh can be used to start a job chain, with each job depending on the previous. For example, to start up 10 jobs:

chainslurm -1 10 frontier.slurm

If you want to add the chain to an existing queued job, change the -1 to the job-id of the existing job.
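For example, to queue 10 more jobs after an already-queued job with id 123456 (a hypothetical id):

chainslurm 123456 10 frontier.slurm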

Debugging

Debugging is done with rocgdb. Here’s a workflow that works:

Setup the environment:

module load PrgEnv-gnu
module load cray-mpich/8.1.27
module load craype-accel-amd-gfx90a
module load amd-mixed/5.6.0

Build the executable. Usually it’s best to disable MPI if possible and maybe turn on TEST=TRUE:

make USE_HIP=TRUE TEST=TRUE USE_MPI=FALSE -j 4

Start up an interactive session:

salloc -A ast106 -J mz -t 0:30:00 -p batch -N 1

This will automatically log you onto the compute node.

Note

It’s a good idea to do:

module restore

and then reload the same modules used for compiling in the interactive shell.

Now set the following environment variables:

export HIP_ENABLE_DEFERRED_LOADING=0
export AMD_SERIALIZE_KERNEL=3
export AMD_SERIALIZE_COPY=3

Note

You can also set

export AMD_LOG_LEVEL=3

to get a lot of information about the GPU calls.

Run the debugger:

rocgdb ./Castro2d.hip.x86-trento.HIP.ex

Set the following inside of the debugger:

set pagination off
b abort

Then run:

run inputs

If it doesn't crash and give you a backtrace, then try:

interrupt
bt

Troubleshooting

A workaround to prevent hangs in collectives is to set:

export FI_MR_CACHE_MONITOR=memhooks

There have been reports of AMReX hanging if the initial Arena size is too big; in that case, we should set

amrex.the_arena_init_size=0

The arena size will then grow as needed over time. There is a suggestion that if the size is larger than