Managing Jobs at OLCF

Frontier

Machine details

Queue policies are here: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#scheduling-policy

The filesystem is called Orion and is Lustre: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#data-and-storage

Warning

The Orion / Lustre filesystem has been broken since Jan 2025, making I/O performance very unstable. To work around this, we currently advise having each MPI process write its own file; this is enabled automatically in the submission script below. Restarting is also an issue, with 50% of restarts hanging due to filesystem issues. The script below will kill the job after 5 minutes if it detects that the restart has failed.

Note

We also explicitly set the filesystem striping using the Lustre lfs tool to help I/O performance.
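For example, the default striping on a run directory can be checked with the same tool (here using the current directory; substitute your run directory as needed):

lfs getstripe -d .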

Submitting jobs

Frontier uses SLURM.

Here’s a script that runs on GPUs and has the I/O fixes described above.

#!/bin/bash
#SBATCH -A AST106
#SBATCH -J subch
#SBATCH -o %x-%j.out
#SBATCH -t 02:00:00
#SBATCH -p batch
# here N is the number of compute nodes
#SBATCH -N 64
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest

EXEC=./Castro3d.hip.x86-trento.MPI.HIP.SMPLSDC.ex
INPUTS=inputs_3d.N14.coarse

module load cpe
module load PrgEnv-gnu
module load cray-mpich
module load craype-accel-amd-gfx90a
module load rocm/6.3.1

export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH

# libfabric workaround
export FI_MR_CACHE_MONITOR=memhooks

# set the file system striping

echo $SLURM_SUBMIT_DIR

module load lfs-wrapper
lfs setstripe -c 32 -S 10M $SLURM_SUBMIT_DIR

module list

function find_chk_file {
    # find_chk_file takes a single argument -- the wildcard pattern
    # for checkpoint files to look through
    chk=$1

    # find the latest 2 restart files.  This way if the latest didn't
    # complete we fall back to the previous one.
    temp_files=$(find . -maxdepth 1 -name "${chk}" -print | sort | tail -2)
    restartFile=""
    for f in ${temp_files}
    do
        # the Header is the last thing written -- check if it's there, otherwise,
        # fall back to the second-to-last check file written
        if [ -f "${f}/Header" ]; then
            restartFile="${f}"
        fi
    done

}

# look for 7-digit chk files
find_chk_file "*chk???????"

if [ "${restartFile}" = "" ]; then
    # look for 6-digit chk files
    find_chk_file "*chk??????"
fi

if [ "${restartFile}" = "" ]; then
    # look for 5-digit chk files
    find_chk_file "*chk?????"
fi

# restartString will be empty if no chk files are found -- i.e. new run
if [ "${restartFile}" = "" ]; then
    restartString=""
else
    restartString="amr.restart=${restartFile}"
fi

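# 1 OpenMP thread per MPI task; 8 MPI tasks per node, one per GPU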
export OMP_NUM_THREADS=1
export NMPI_PER_NODE=8
export TOTAL_NMPI=$(( ${SLURM_JOB_NUM_NODES} * ${NMPI_PER_NODE} ))

function check_restart {
    echo "RESTART CHECK!!!"
    outfile="${SLURM_JOB_NAME}-${SLURM_JOB_ID}.out"
    echo "RESTART CHECK: checking ${outfile}"
    if ! grep -q "Restart time" "${outfile}"; then
       echo "RESTART CHECK: canceling job"
       date
       scancel $SLURM_JOB_ID
    else
       echo "RESTART CHECK: restart appears to be successful"
    fi
}


# Frontier's filesystem is troublesome, so modify the way
# we have AMReX do I/O -- each MPI process writes its own file

FILE_IO_PARAMS="
amr.plot_nfiles = -1
amr.checkpoint_nfiles = -1
"

echo appending parameters: ${FILE_IO_PARAMS}

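# if we are restarting, check (in the background) after 5 minutes whether
# the restart actually succeeded; if not, cancel the job (restarts sometimes
# hang because of filesystem issues)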
if [ ! "${restartString}" = "" ]; then
    (sleep 300; check_restart) &
fi

srun -n${TOTAL_NMPI} -N${SLURM_JOB_NUM_NODES} --ntasks-per-node=8 --gpus-per-task=1 ${EXEC} ${INPUTS} ${restartString} ${FILE_IO_PARAMS}

The job is submitted as:

sbatch frontier.slurm

where frontier.slurm is the name of the submission script.

Also see the WarpX docs: https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html

GPU-aware MPI

Some codes run better with GPU-aware MPI. To enable this, add the following to your submission script:

export MPICH_GPU_SUPPORT_ENABLED=1
export FI_MR_CACHE_MONITOR=memhooks

and set the runtime parameter:

amrex.use_gpu_aware_mpi=1
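For example, reusing the variables from the submission script above, the run line would become (a sketch; the runtime parameter can equally go in the inputs file):

export MPICH_GPU_SUPPORT_ENABLED=1
export FI_MR_CACHE_MONITOR=memhooks

srun -n${TOTAL_NMPI} -N${SLURM_JOB_NUM_NODES} --ntasks-per-node=8 --gpus-per-task=1 \
    ${EXEC} ${INPUTS} ${restartString} ${FILE_IO_PARAMS} amrex.use_gpu_aware_mpi=1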

Job Status

You can check on the status of your jobs via:

squeue --me

and get an estimated start time via:

squeue --me --start

Job Chaining

The script chainslurm.sh can be used to start a job chain, with each job depending on the previous. For example, to start up 10 jobs:

chainslurm.sh -1 10 frontier.slurm

If you want to add the chain to an existing queued job, change the -1 to the job-id of the existing job.
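Under the hood this is just SLURM job dependencies. A minimal sketch of what chainslurm.sh automates (not the actual script) looks like:

#!/bin/bash
# submit 10 copies of the job script, each one starting only after the
# previous job in the chain has finished (for any reason)
njobs=10
jobid=$(sbatch --parsable frontier.slurm)
for ((i = 1; i < njobs; i++)); do
    jobid=$(sbatch --parsable --dependency=afterany:${jobid} frontier.slurm)
done
echo "last job in the chain: ${jobid}"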

Debugging

Debugging is done with rocgdb. Here’s a workflow that works:

Set up the environment:

module load PrgEnv-gnu
module load cray-mpich/8.1.27
module load craype-accel-amd-gfx90a
module load amd-mixed/5.6.0

Build the executable. Usually it’s best to disable MPI if possible and maybe turn on TEST=TRUE:

make USE_HIP=TRUE TEST=TRUE USE_MPI=FALSE -j 4

Start up an interactive session:

salloc -A ast106 -J mz -t 0:30:00 -p batch -N 1

This will automatically log you onto the compute node.

Note

It’s a good idea to do:

module restore

and then reload the same modules used for compiling in the interactive shell.
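For example, to match the debugging environment set up above:

module restore
module load PrgEnv-gnu
module load cray-mpich/8.1.27
module load craype-accel-amd-gfx90a
module load amd-mixed/5.6.0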

Now set the following environment variables:

export HIP_ENABLE_DEFERRED_LOADING=0
export AMD_SERIALIZE_KERNEL=3
export AMD_SERIALIZE_COPY=3

Note

You can also set

export AMD_LOG_LEVEL=3

to get a lot of information about the GPU calls.

Run the debugger:

rocgdb ./Castro2d.hip.x86-trento.HIP.ex

Set the following inside of the debugger:

set pagination off
b abort

Then run:

run inputs

If it doesn't crash and give a backtrace (e.g., it hangs), then try:

interrupt
bt

The debugger might report that the memory location is not precise. To enable precise memory reporting, do the following inside the debugger:

set amdgpu precise-memory on
show amdgpu precise-memory

and rerun.

Troubleshooting

A workaround to prevent hangs in collectives is to set:

export FI_MR_CACHE_MONITOR=memhooks

There have been reports of AMReX hanging if the initial Arena size is too big; in that case, set

amrex.the_arena_init_size=0

The Arena will then grow as needed as the run progresses.
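Like any other AMReX runtime parameter, this can be set in the inputs file or appended to the run line, e.g.:

# in the inputs file: start with an empty Arena and grow it on demand
amrex.the_arena_init_size = 0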