Managing Jobs at OLCF
Frontier
Machine details
Queue policies are here: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#scheduling-policy
The filesystem is called Orion, and is Lustre:
https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#data-and-storage
Warning
The Orion / Lustre filesystem has been broken since Jan 2025, making I/O performance very unstable. To work around this problem, we currently advise having each MPI process write its own file. This is enabled automatically in the submission script below. Restarting is also an issue, with 50% of restarts hanging due to filesystem issues. The script below will kill the job after 5 minutes if it detects that the restart has failed.
Note
We also explicitly set the filesystem striping using the Lustre lfs tools to help I/O performance.
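You can check the striping currently set on a directory with lfs getstripe (the path below is just a placeholder for your run directory):
lfs getstripe -d /path/to/your/run_directory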
Submitting jobs
Frontier uses SLURM.
Here’s a script that uses our best practices on Frontier. It uses 64 nodes (512 GPUs) and does the following:
- Sets the filesystem striping (see https://docs.olcf.ornl.gov/data/index.html#lfs-setstripe-wrapper).
- Includes logic for automatically restarting from the last checkpoint file (useful for job-chaining). This is done via the find_chk_file function.
- Installs a signal handler to create a dump_and_stop file shortly before the queue window ends. This ensures that we get a checkpoint at the very end of the queue window.
- Does a special check on restart to ensure that we don't hang on reading the initial checkpoint file, via the line:
  (sleep 300; check_restart ) &
  This uses the check_restart function and will kill the job if it doesn't detect a successful restart within 5 minutes. (Comment out this line to disable the check.)
- Adds special I/O parameters to the job to work around filesystem issues (these are defined in FILE_IO_PARAMS).
#!/bin/bash
#SBATCH -A AST106
#SBATCH -J subch
#SBATCH -o %x-%j.out
#SBATCH -t 02:00:00
#SBATCH -p batch
# here N is the number of compute nodes
#SBATCH -N 64
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest
#SBATCH --signal=B:URG@300
EXEC=./Castro3d.hip.x86-trento.MPI.HIP.SMPLSDC.ex
INPUTS=inputs_3d.N14.coarse
module load cpe
module load PrgEnv-gnu
module load cray-mpich
module load craype-accel-amd-gfx90a
module load rocm/6.3.1
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
# set the file system striping
echo $SLURM_SUBMIT_DIR
module load lfs-wrapper
lfs setstripe -c 32 -S 10M $SLURM_SUBMIT_DIR
function find_chk_file {
    # find_chk_file takes a single argument -- the wildcard pattern
    # for checkpoint files to look through
    chk=$1
    # find the latest 2 restart files. This way if the latest didn't
    # complete we fall back to the previous one.
    temp_files=$(find . -maxdepth 1 -name "${chk}" -print | sort | tail -2)
    restartFile=""
    for f in ${temp_files}
    do
        # the Header is the last thing written -- only accept this chk
        # file if the Header is present; otherwise we keep the previous
        # (second-to-last) check file written
        if [ -f ${f}/Header ]; then
            restartFile="${f}"
        fi
    done
}
# look for 7-digit chk files
find_chk_file "*chk???????"
if [ "${restartFile}" = "" ]; then
# look for 6-digit chk files
find_chk_file "*chk??????"
fi
if [ "${restartFile}" = "" ]; then
# look for 5-digit chk files
find_chk_file "*chk?????"
fi
# restartString will be empty if no chk files are found -- i.e. new run
if [ "${restartFile}" = "" ]; then
restartString=""
else
restartString="amr.restart=${restartFile}"
fi
# clean up any run management files left over from previous runs
rm -f dump_and_stop
# The `--signal=B:URG@<n>` option tells slurm to send SIGURG to this batch
# script n seconds before the runtime limit, so we can exit gracefully.
function sig_handler {
    touch dump_and_stop
    # disable this signal handler
    trap - URG
    echo "BATCH: allocation ending soon; telling Castro to dump a checkpoint and stop"
}
trap sig_handler URG
export OMP_NUM_THREADS=1
export NMPI_PER_NODE=8
export TOTAL_NMPI=$(( ${SLURM_JOB_NUM_NODES} * ${NMPI_PER_NODE} ))
function check_restart {
    echo "RESTART CHECK!!!"
    outfile="${SLURM_JOB_NAME}-${SLURM_JOB_ID}.out"
    echo "RESTART CHECK: checking ${outfile}"
    restart_success=$(grep "Restart time" ${outfile})
    if [ $? == "1" ]; then
        echo "RESTART CHECK: canceling job"
        date
        scancel $SLURM_JOB_ID
    else
        echo "RESTART CHECK: restart appears to be successful"
    fi
}
# frontier's file system is troublesome, so modify the way
# we have AMReX do I/O
FILE_IO_PARAMS="
amr.plot_nfiles = -1
amr.checkpoint_nfiles = -1
"
echo appending parameters: ${FILE_IO_PARAMS}
(sleep 300; check_restart ) &
# execute srun in the background then use the builtin wait so the shell can
# handle the signal
srun -n${TOTAL_NMPI} -N${SLURM_JOB_NUM_NODES} --ntasks-per-node=8 --gpus-per-task=1 ./$EXEC $INPUTS ${restartString} ${FILE_IO_PARAMS} &
pid=$!
wait $pid
ret=$?
if (( ret == 128 + 23 )); then
    # received SIGURG, keep waiting
    wait $pid
    ret=$?
fi
exit $ret
The job is submitted as:
sbatch frontier.slurm
where frontier.slurm is the name of the submission script.
Note
If the job times out before writing out a checkpoint (leaving a dump_and_stop file behind), you can give it more time between the warning signal and the end of the allocation by adjusting the #SBATCH --signal=B:URG@<n> line at the top of the script.
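For example, to get a 10 minute warning instead of 5 minutes, change that line to:
#SBATCH --signal=B:URG@600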
Also, by default, AMReX will output a plotfile at the same time as a checkpoint file, which means you'll get one from the dump_and_stop, and it may not line up with the time intervals set by amr.plot_per. To suppress this, set:
amr.write_plotfile_with_checkpoint = 0
Also see the WarpX docs: https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html
GPU-aware MPI
Some codes run better with GPU-aware MPI. To enable this, add the following to your submission script:
export MPICH_GPU_SUPPORT_ENABLED=1
export FI_MR_CACHE_MONITOR=memhooks
and set the runtime parameter:
amrex.use_gpu_aware_mpi=1
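AMReX runtime parameters can also be appended to the command line, just as the script above does with FILE_IO_PARAMS. For example, reusing the srun line from the Frontier script:
srun -n${TOTAL_NMPI} -N${SLURM_JOB_NUM_NODES} --ntasks-per-node=8 --gpus-per-task=1 ./$EXEC $INPUTS ${restartString} ${FILE_IO_PARAMS} amrex.use_gpu_aware_mpi=1 &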
Job Status
You can check on the status of your jobs via:
squeue --me
and get an estimated start time via:
squeue --me --start
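If you want a more compact view, squeue also accepts a --format option; the format string below is just one possible choice:
squeue --me --format="%.10i %.12j %.8T %.10M %.20S"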
Job Chaining
The script chainslurm.sh can be used to start a job chain, with each job depending on the previous. For example, to start up 10 jobs:
chainslurm -1 10 frontier.slurm
If you want to add the chain to an existing queued job, change the -1 to the job-id of the existing job.
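If you don't have chainslurm.sh handy, the underlying mechanism is just SLURM job dependencies. Here is a minimal sketch (not the actual chainslurm.sh, which may differ in its details) that submits 10 copies of frontier.slurm, each waiting for the previous job to end:
#!/bin/bash
# submit the first job and capture its job id
jobid=$(sbatch --parsable frontier.slurm)
# each later job starts only after the previous one ends (any exit status)
for _ in $(seq 2 10)
do
    jobid=$(sbatch --parsable --dependency=afterany:${jobid} frontier.slurm)
done
echo "last job in the chain: ${jobid}"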
Debugging
Debugging is done with rocgdb. Here's a workflow that works:
Set up the environment:
module load PrgEnv-gnu
module load cray-mpich/8.1.27
module load craype-accel-amd-gfx90a
module load amd-mixed/5.6.0
Build the executable. Usually it's best to disable MPI if possible and maybe turn on TEST=TRUE:
make USE_HIP=TRUE TEST=TRUE USE_MPI=FALSE -j 4
Start up an interactive session:
salloc -A ast106 -J mz -t 0:30:00 -p batch -N 1
This will automatically log you onto the compute node.
Note
It’s a good idea to do:
module restore
and then reload the same modules used for compiling in the interactive shell.
Now set the following environment variables:
export HIP_ENABLE_DEFERRED_LOADING=0
export AMD_SERIALIZE_KERNEL=3
export AMD_SERIALIZE_COPY=3
Note
You can also set
export AMD_LOG_LEVEL=3
to get a lot of information about the GPU calls.
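Since AMD_LOG_LEVEL=3 is very verbose, it can help to redirect the output to a file when running outside the debugger (the log filename here is arbitrary):
AMD_LOG_LEVEL=3 ./Castro2d.hip.x86-trento.HIP.ex inputs > gpu_log.txt 2>&1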
Run the debugger:
rocgdb ./Castro2d.hip.x86-trento.HIP.ex
Set the following inside of the debugger:
set pagination off
b abort
Then run:
run inputs
If it doesn't crash with a useful trace (for instance, if it hangs), then try:
interrupt
bt
It might say that the memory location is not precise. To enable precise memory, do the following in the debugger:
set amdgpu precise-memory on
show amdgpu precise-memory
and rerun.
Troubleshooting
Workaround to prevent hangs for collectives:
export FI_MR_CACHE_MONITOR=memhooks
There are reports of AMReX hanging if the initial arena size is too big. In that case, set:
amrex.the_arena_init_size=0
The arena size will then grow as needed over time.
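Like the other AMReX runtime parameters above, this can be set either on the command line after the inputs file or in the inputs file itself, where it is simply:
amrex.the_arena_init_size = 0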