Managing Jobs at OLCF
Frontier
Machine details
Queue policies are here: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#scheduling-policy
The filesystem is called Orion and is Lustre: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#data-and-storage
Warning
The Orion / Lustre filesystem has been broken since January 2025, making I/O performance very unstable. To work around this problem, we currently advise having each MPI process write its own file; this is enabled automatically in the submission script below. Restarting is also an issue, with 50% of restarts hanging due to filesystem issues. The script below will kill the job after 5 minutes if it detects that the restart has failed.
Note
We also explicitly set the filesystem striping using the LFS tools to help I/O performance.
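If you want to confirm the striping that the job script applies, you can query it with lfs getstripe (a quick sanity check; assumes the standard Lustre client tools are available on the node you are on):
# from the run directory: report the default stripe count and stripe size
lfs getstripe -d .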
Submitting jobs
Frontier uses SLURM.
Here’s a script that runs on GPUs and has the I/O fixes described above.
#!/bin/bash
#SBATCH -A AST106
#SBATCH -J subch
#SBATCH -o %x-%j.out
#SBATCH -t 02:00:00
#SBATCH -p batch
# here N is the number of compute nodes
#SBATCH -N 64
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest
EXEC=./Castro3d.hip.x86-trento.MPI.HIP.SMPLSDC.ex
INPUTS=inputs_3d.N14.coarse
module load cpe
module load PrgEnv-gnu
module load cray-mpich
module load craype-accel-amd-gfx90a
module load rocm/6.3.1
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
# libfabric workaround
export FI_MR_CACHE_MONITOR=memhooks
# set the file system striping
echo $SLURM_SUBMIT_DIR
module load lfs-wrapper
lfs setstripe -c 32 -S 10M $SLURM_SUBMIT_DIR
module list
function find_chk_file {
    # find_chk_file takes a single argument -- the wildcard pattern
    # for checkpoint files to look through
    chk=$1
    # find the latest 2 restart files. This way if the latest didn't
    # complete we fall back to the previous one.
    temp_files=$(find . -maxdepth 1 -name "${chk}" -print | sort | tail -2)
    restartFile=""
    for f in ${temp_files}
    do
        # the Header is the last thing written -- check if it's there, otherwise,
        # fall back to the second-to-last check file written
        if [ ! -f ${f}/Header ]; then
            restartFile=""
        else
            restartFile="${f}"
        fi
    done
}
# look for 7-digit chk files
find_chk_file "*chk???????"
if [ "${restartFile}" = "" ]; then
# look for 6-digit chk files
find_chk_file "*chk??????"
fi
if [ "${restartFile}" = "" ]; then
# look for 5-digit chk files
find_chk_file "*chk?????"
fi
# restartString will be empty if no chk files are found -- i.e. new run
if [ "${restartFile}" = "" ]; then
restartString=""
else
restartString="amr.restart=${restartFile}"
fi
export OMP_NUM_THREADS=1
export NMPI_PER_NODE=8
export TOTAL_NMPI=$(( ${SLURM_JOB_NUM_NODES} * ${NMPI_PER_NODE} ))
function check_restart {
    echo "RESTART CHECK!!!"
    outfile="${SLURM_JOB_NAME}-${SLURM_JOB_ID}.out"
    echo "RESTART CHECK: checking ${outfile}"
    # grep exits non-zero if "Restart time" never appeared in the output
    restart_success=$(grep "Restart time" ${outfile})
    if [ $? -ne 0 ]; then
        echo "RESTART CHECK: canceling job"
        date
        scancel $SLURM_JOB_ID
    else
        echo "RESTART CHECK: restart appears to be successful"
    fi
}
# Frontier's file system is troublesome, so modify the way
# we have AMReX do I/O
FILE_IO_PARAMS="
amr.plot_nfiles = -1
amr.checkpoint_nfiles = -1
"
echo appending parameters: ${FILE_IO_PARAMS}
(sleep 300; check_restart ) &
srun -n${TOTAL_NMPI} -N${SLURM_JOB_NUM_NODES} --ntasks-per-node=8 --gpus-per-task=1 ${EXEC} $INPUTS ${restartString} ${FILE_IO_PARAMS}
The job is submitted as:
sbatch frontier.slurm
where frontier.slurm is the name of the submission script.
Also see the WarpX docs: https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html
GPU-aware MPI
Some codes run better with GPU-aware MPI. To enable this add the following to your submission script:
export MPICH_GPU_SUPPORT_ENABLED=1
export FI_MR_CACHE_MONITOR=memhooks
and set the runtime parameter:
amrex.use_gpu_aware_mpi=1
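One way to pass that parameter (just an illustration, mirroring how ${restartString} and ${FILE_IO_PARAMS} are appended in the script above) is to add it to the end of the srun command line:
# sketch: append the GPU-aware MPI runtime parameter to the existing launch line
srun -n${TOTAL_NMPI} -N${SLURM_JOB_NUM_NODES} --ntasks-per-node=8 --gpus-per-task=1 \
    ${EXEC} $INPUTS ${restartString} ${FILE_IO_PARAMS} amrex.use_gpu_aware_mpi=1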
Job Status
You can check on the status of your jobs via:
squeue --me
and get an estimated start time via:
squeue --me --start
Job Chaining
The script chainslurm.sh can be used to start a job chain, with each job depending on the previous. For example, to start up 10 jobs:
chainslurm -1 10 frontier.slurm
If you want to add the chain to an existing queued job, change the -1 to the job-id of the existing job.
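For example, assuming the existing queued job has the (hypothetical) ID 1234567:
# 1234567 is a placeholder job-id; the 10 new jobs chain off of that job
chainslurm 1234567 10 frontier.slurm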
Debugging
Debugging is done with rocgdb. Here's a workflow that works:
Set up the environment:
module load PrgEnv-gnu
module load cray-mpich/8.1.27
module load craype-accel-amd-gfx90a
module load amd-mixed/5.6.0
Build the executable. Usually it's best to disable MPI if possible and maybe turn on TEST=TRUE:
make USE_HIP=TRUE TEST=TRUE USE_MPI=FALSE -j 4
Start up an interactive session:
salloc -A ast106 -J mz -t 0:30:00 -p batch -N 1
This will automatically log you onto the compute node.
Note
It’s a good idea to do:
module restore
and then reload the same modules used for compiling in the interactive shell.
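A minimal sketch of that sequence, assuming the same modules listed in the setup step above:
# inside the interactive shell: reset modules and reload the build environment
module restore
module load PrgEnv-gnu
module load cray-mpich/8.1.27
module load craype-accel-amd-gfx90a
module load amd-mixed/5.6.0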
Now set the following environment variables:
export HIP_ENABLE_DEFERRED_LOADING=0
export AMD_SERIALIZE_KERNEL=3
export AMD_SERIALIZE_COPY=3
Note
You can also set
export AMD_LOG_LEVEL=3
to get a lot of information about the GPU calls.
Run the debugger:
rocgdb ./Castro2d.hip.x86-trento.HIP.ex
Set the following inside of the debugger:
set pagination off
b abort
Then run:
run inputs
If it doesn’t crash with the trace, then try:
interrupt
bt
It might say that the memory location is not precise. To enable precise memory, in the debugger, do:
set amdgpu precise-memory on
show amdgpu precise-memory
and rerun.
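If you find yourself repeating this often, the same settings can be given up front on the command line (a sketch, assuming rocgdb accepts the standard gdb -ex option, which it inherits from gdb):
# apply the debugger settings at startup, then type "run inputs" at the prompt
rocgdb -ex "set pagination off" \
       -ex "set amdgpu precise-memory on" \
       -ex "b abort" \
       ./Castro2d.hip.x86-trento.HIP.ex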
Troubleshooting
Workaround to prevent hangs for collectives:
export FI_MR_CACHE_MONITOR=memhooks
There are reports that AMReX hangs if the initial Arena size is too big; in that case, set:
amrex.the_arena_init_size=0
The Arena will then grow as needed over time.
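This can go in the inputs file (or be appended to the srun line like the other runtime parameters above); for example:
# in the inputs file: start with an empty Arena and let it grow on demand
amrex.the_arena_init_size = 0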