Managing Jobs at OLCF

Frontier

Machine details

Queue policies are here: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#scheduling-policy

Filesystem is called orion, and is Lustre: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#data-and-storage

Submitting jobs

Frontier uses SLURM.

Here’s a script that runs on 2 nodes using all 8 GPUs per node:

#!/bin/bash
#SBATCH -A AST106
#SBATCH -J testing
#SBATCH -o %x-%j.out
#SBATCH -t 00:05:00
#SBATCH -p batch
# here N is the number of compute nodes
#SBATCH -N 2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest

EXEC=Castro3d.hip.x86-trento.MPI.HIP.ex
INPUTS=inputs.3d.sph

module load PrgEnv-gnu
module load craype-accel-amd-gfx90a
module load cray-mpich/8.1.27
module load amd-mixed/6.0.0

export OMP_NUM_THREADS=1
export NMPI_PER_NODE=8
export TOTAL_NMPI=$(( ${SLURM_JOB_NUM_NODES} * ${NMPI_PER_NODE} ))

srun -n${TOTAL_NMPI} -N${SLURM_JOB_NUM_NODES} --ntasks-per-node=8 --gpus-per-task=1 ./$EXEC $INPUTS

Note

As of June 2023, it is necessary to explicitly use -n and -N on the srun line.

The job is submitted as:

sbatch frontier.slurm

where frontier.slurm is the name of the submission script.

A sample job script that includes the automatic restart functions can be found here: https://github.com/AMReX-Astro/workflow/blob/main/job_scripts/frontier/frontier.slurm

Also see the WarpX docs: https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html

GPU-aware MPI

Some codes run better with GPU-aware MPI. To enable this, add the following to your submission script:

export MPICH_GPU_SUPPORT_ENABLED=1
export FI_MR_CACHE_MONITOR=memhooks

and set the runtime parameter:

amrex.use_gpu_aware_mpi=1
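Putting these together, the relevant part of the submission script might look like the following sketch (AMReX runtime parameters can be set in the inputs file or appended to the command line; the srun line is the one from the script above):

```shell
# GPU-aware MPI additions to the Frontier submission script above (sketch)
export MPICH_GPU_SUPPORT_ENABLED=1
export FI_MR_CACHE_MONITOR=memhooks

# append the AMReX runtime parameter after the inputs file
srun -n${TOTAL_NMPI} -N${SLURM_JOB_NUM_NODES} --ntasks-per-node=8 \
     --gpus-per-task=1 ./$EXEC $INPUTS amrex.use_gpu_aware_mpi=1
```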

Job Status

You can check on the status of your jobs via:

squeue --me

and get an estimated start time via:

squeue --me --start
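Once a job finishes, it no longer appears in squeue. For completed jobs, sacct can report on the outcome (a sketch, assuming standard SLURM sacct; the job id is a placeholder):

```shell
# show state, elapsed time, and exit code for a completed job
sacct -j 1234567 --format=JobID,JobName,State,Elapsed,ExitCode
```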

Job Chaining

The script chainslurm.sh can be used to start a job chain, with each job depending on the previous. For example, to start up 10 jobs:

chainslurm.sh -1 10 frontier.slurm

If you want to add the chain to an existing queued job, change the -1 to the job-id of the existing job.
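The mechanism behind such a chain is SLURM job dependencies. Here is a minimal sketch of the idea (not the actual chainslurm.sh, which adds more bookkeeping; --parsable and --dependency=afterany are standard sbatch options):

```shell
#!/bin/bash
# minimal sketch of job chaining via SLURM dependencies
NJOBS=10
SCRIPT=frontier.slurm
DEP=""   # empty: the first job has no dependency

for i in $(seq 1 ${NJOBS}); do
    if [ -z "${DEP}" ]; then
        # first job in the chain
        JOBID=$(sbatch --parsable ${SCRIPT})
    else
        # subsequent jobs start only after the previous one ends
        JOBID=$(sbatch --parsable --dependency=afterany:${DEP} ${SCRIPT})
    fi
    echo "submitted ${JOBID}"
    DEP=${JOBID}
done
```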

Debugging

Debugging is done with rocgdb. Here’s a workflow that works:

Setup the environment:

module load PrgEnv-gnu
module load cray-mpich/8.1.27
module load craype-accel-amd-gfx90a
module load amd-mixed/5.6.0

Build the executable. Usually it’s best to disable MPI if possible, and optionally build with TEST=TRUE:

make USE_HIP=TRUE TEST=TRUE USE_MPI=FALSE -j 4

Start up an interactive session:

salloc -A ast106 -J mz -t 0:30:00 -p batch -N 1

This will automatically log you onto the compute node.

Note

It’s a good idea to do:

module restore

and then reload the same modules used for compiling in the interactive shell.
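For example, to match the build environment used above:

```shell
module restore
module load PrgEnv-gnu
module load cray-mpich/8.1.27
module load craype-accel-amd-gfx90a
module load amd-mixed/5.6.0
```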

Now set the following environment variables:

export HIP_ENABLE_DEFERRED_LOADING=0
export AMD_SERIALIZE_KERNEL=3
export AMD_SERIALIZE_COPY=3

Note

You can also set

export AMD_LOG_LEVEL=3

to get a lot of information about the GPU calls.

Run the debugger:

rocgdb ./Castro2d.hip.x86-trento.HIP.ex

Set the following inside of the debugger:

set pagination off
b abort

Then run:

run inputs

If it doesn’t crash and produce a backtrace, then try:

interrupt
bt

It might report that the memory location is not precise. To enable precise memory reporting, in the debugger do:

set amdgpu precise-memory on
show amdgpu precise-memory

and rerun.

Troubleshooting

A workaround to prevent hangs in MPI collectives:

export FI_MR_CACHE_MONITOR=memhooks

There are reports of AMReX codes hanging if the initial Arena size is too big; in that case, set

amrex.the_arena_init_size=0

The arena will then grow as needed during the run.