Managing Jobs at OLCF
Frontier
Machine details
Queue policies are here: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#scheduling-policy
The filesystem is called orion and is Lustre:
https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#data-and-storage
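Since orion is Lustre, the usual lfs tools work on it; for example (the /lustre/orion mount point here is an assumption, check the storage docs above for the actual paths):
lfs df -h /lustre/orion
lfs quota -u $USER /lustre/orion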
Submitting jobs
Frontier uses SLURM.
Here’s a script that runs with 2 nodes using all 8 GPUs per node:
#!/bin/bash
#SBATCH -A AST106
#SBATCH -J testing
#SBATCH -o %x-%j.out
#SBATCH -t 00:05:00
#SBATCH -p batch
# here N is the number of compute nodes
#SBATCH -N 2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest
EXEC=Castro3d.hip.x86-trento.MPI.HIP.ex
INPUTS=inputs.3d.sph
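# set up the programming environment, GPU target, MPI, and ROCm modules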
module load PrgEnv-gnu
module load craype-accel-amd-gfx90a
module load cray-mpich/8.1.27
module load amd-mixed/6.0.0
export OMP_NUM_THREADS=1
export NMPI_PER_NODE=8
export TOTAL_NMPI=$(( ${SLURM_JOB_NUM_NODES} * ${NMPI_PER_NODE} ))
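# launch one MPI task per GPU (8 GPUs per Frontier node)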
srun -n${TOTAL_NMPI} -N${SLURM_JOB_NUM_NODES} --ntasks-per-node=8 --gpus-per-task=1 ./$EXEC $INPUTS
Note
As of June 2023, it is necessary to explicitly use -n and -N on the srun line.
The job is submitted as:
sbatch frontier.slurm
where frontier.slurm is the name of the submission script.
A sample job script that includes the automatic restart functions can be found here: https://github.com/AMReX-Astro/workflow/blob/main/job_scripts/frontier/frontier.slurm
Also see the WarpX docs: https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html
GPU-aware MPI
Some codes run better with GPU-aware MPI. To enable this, add the following to your submission script:
export MPICH_GPU_SUPPORT_ENABLED=1
export FI_MR_CACHE_MONITOR=memhooks
and set the runtime parameter:
amrex.use_gpu_aware_mpi=1
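AMReX runtime parameters can either go in the inputs file or be appended to the command line. For example, a sketch of the srun line from the job script above with the parameter appended (reusing the EXEC, INPUTS, and TOTAL_NMPI variables defined there):
srun -n${TOTAL_NMPI} -N${SLURM_JOB_NUM_NODES} --ntasks-per-node=8 --gpus-per-task=1 ./$EXEC $INPUTS amrex.use_gpu_aware_mpi=1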
Job Status
You can check on the status of your jobs via:
squeue --me
and get an estimated start time via:
squeue --me --start
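squeue also accepts a custom output format if you want a more compact view; for example (one possible field selection, showing job id, partition, name, state, elapsed time, node count, and start time):
squeue --me --format="%.10i %.10P %.12j %.3T %.11M %.5D %.20S"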
Job Chaining
The script chainslurm.sh can be used to start a job chain, with each job depending on the previous. For example, to start up 10 jobs:
chainslurm.sh -1 10 frontier.slurm
If you want to add the chain to an existing queued job, change the -1 to the job-id of the existing job.
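For example, if a job with job-id 1234567 (a placeholder) is already in the queue, the following would add 5 more jobs to the chain, each depending on the previous:
chainslurm.sh 1234567 5 frontier.slurm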
Debugging
Debugging is done with rocgdb. Here's a workflow that works:
Set up the environment:
module load PrgEnv-gnu
module load cray-mpich/8.1.27
module load craype-accel-amd-gfx90a
module load amd-mixed/5.6.0
Build the executable. Usually it's best to disable MPI if possible and maybe turn on TEST=TRUE:
make USE_HIP=TRUE TEST=TRUE USE_MPI=FALSE -j 4
Start up an interactive session:
salloc -A ast106 -J mz -t 0:30:00 -p batch -N 1
This will automatically log you onto a compute node.
Note
It’s a good idea to do:
module restore
and then reload the same modules used for compiling in the interactive shell.
Now set the following environment variables:
export HIP_ENABLE_DEFERRED_LOADING=0
export AMD_SERIALIZE_KERNEL=3
export AMD_SERIALIZE_COPY=3
Note
You can also set
export AMD_LOG_LEVEL=3
to get a lot of information about the GPU calls.
Run the debugger:
rocgdb ./Castro2d.hip.x86-trento.HIP.ex
Set the following inside of the debugger:
set pagination off
b abort
Then run:
run inputs
If it doesn't crash and produce a backtrace, then try:
interrupt
bt
The backtrace might say that the memory location is not precise. To enable precise memory reporting, do the following in the debugger:
set amdgpu precise-memory on
show amdgpu precise-memory
and rerun.
Troubleshooting
A workaround to prevent hangs in MPI collectives:
export FI_MR_CACHE_MONITOR=memhooks
There have been reports with AMReX of hangs when the initial Arena size is too big; in that case, set:
amrex.the_arena_init_size=0
The Arena size will then grow as needed as the run progresses.
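As with amrex.use_gpu_aware_mpi above, this parameter can be added to the inputs file or appended to the srun command line, e.g. (a sketch reusing the variables from the job script above):
srun -n${TOTAL_NMPI} -N${SLURM_JOB_NUM_NODES} --ntasks-per-node=8 --gpus-per-task=1 ./$EXEC $INPUTS amrex.the_arena_init_size=0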