Managing Jobs at NERSC
Perlmutter
Perlmutter has 1536 GPU nodes, each with 4 NVIDIA A100 GPUs, so it is best to use 4 MPI tasks per node.
Note
You need to load the same modules used to compile the executable in your submission script; otherwise, the job will fail at runtime because it cannot find the CUDA libraries.
Below is an example that runs on 16 nodes with 4 GPUs per node, and also includes the restart logic to allow for job chaining.
#!/bin/bash
#SBATCH -A m3018_g
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -J Castro
#SBATCH -o subch_%j.out
#SBATCH -t 2:00:00
#SBATCH -N 16
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=map_gpu:0,1,2,3
#SBATCH --signal=B:URG@120
export CASTRO_EXEC=./Castro2d.gnu.MPI.CUDA.SMPLSDC.ex
export INPUTS=inputs_2d.N14
module load PrgEnv-gnu
module load cudatoolkit
function find_chk_file {
    # find_chk_file takes a single argument -- the wildcard pattern
    # for checkpoint files to look through
    chk=$1

    # find the latest 2 restart files. This way if the latest didn't
    # complete we fall back to the previous one.
    temp_files=$(find . -maxdepth 1 -name "${chk}" -print | sort | tail -2)
    restartFile=""
    for f in ${temp_files}
    do
        # the Header is the last thing written -- if it's there, update the restart file
        if [ -f ${f}/Header ]; then
            restartFile="${f}"
        fi
    done
}
# look for 7-digit chk files
find_chk_file "*chk???????"
if [ "${restartFile}" = "" ]; then
# look for 6-digit chk files
find_chk_file "*chk??????"
fi
if [ "${restartFile}" = "" ]; then
# look for 5-digit chk files
find_chk_file "*chk?????"
fi
# restartString will be empty if no chk files are found -- i.e. new run
if [ "${restartFile}" = "" ]; then
restartString=""
else
restartString="amr.restart=${restartFile}"
fi
# clean up any run management files left over from previous runs
rm -f dump_and_stop
# The `--signal=B:URG@<n>` option tells slurm to send SIGURG to this batch
# script n seconds before the runtime limit, so we can exit gracefully.
function sig_handler {
    touch dump_and_stop
    # disable this signal handler
    trap - URG
    echo "BATCH: allocation ending soon; telling Castro to dump a checkpoint and stop"
}
trap sig_handler URG
workdir=`basename ${SLURM_SUBMIT_DIR}`
slack_job_start.py "starting NERSC job: ${workdir} ${restartFile}" @michael
# execute srun in the background then use the builtin wait so the shell can
# handle the signal
srun -n $((SLURM_NTASKS_PER_NODE * SLURM_NNODES)) ${CASTRO_EXEC} ${INPUTS} ${restartString} &
pid=$!
wait $pid
ret=$?
if (( ret == 128 + 23 )); then
    # received SIGURG, keep waiting
    wait $pid
    ret=$?
fi
exit $ret
Note
With large reaction networks, you may get GPU out-of-memory errors during the first burner call. If this happens, you can add amrex.the_arena_init_size=0 after ${restartString} in the srun call so AMReX doesn't reserve 3/4 of the GPU memory for the device arena.
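For example, the srun line from the script above would become (a sketch; only the added amrex.the_arena_init_size=0 parameter differs):
srun -n $((SLURM_NTASKS_PER_NODE * SLURM_NNODES)) ${CASTRO_EXEC} ${INPUTS} ${restartString} amrex.the_arena_init_size=0 &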
Note
If the job times out before writing out a checkpoint (leaving a dump_and_stop file behind), you can give it more time between the warning signal and the end of the allocation by adjusting the #SBATCH --signal=B:URG@<n> line at the top of the script.
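For example, to be warned 10 minutes (600 seconds) before the allocation ends instead of 2 minutes, you could change the line to:
#SBATCH --signal=B:URG@600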
Below is an example that runs on CPU-only nodes. Here ntasks-per-node refers to the number of MPI processes per node (used for distributed parallelism), and cpus-per-task refers to the number of hardware threads (hyperthreads) used per task (used for shared-memory parallelism). Since a Perlmutter CPU node has 2 sockets * 64 cores/socket * 2 threads/core = 256 hardware threads, set cpus-per-task to 256/(ntasks-per-node). However, it is best to assign one OpenMP thread per physical core, so set OMP_NUM_THREADS to cpus-per-task/2. See the more detailed instructions within the script.
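As a worked example, using the same settings as the script below:
--ntasks-per-node=16  ->  --cpus-per-task = 256 / 16 = 16
OMP_NUM_THREADS = 16 / 2 = 8  (one OpenMP thread per physical core)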
#!/bin/bash
#SBATCH --job-name=perlmutter_script
#SBATCH --account=m3018
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=16
#SBATCH --qos=regular
#SBATCH --time=02:00:00
#SBATCH --constraint=cpu
#***********************INSTRUCTIONS*******************************************************
# To couple each MPI process with OpenMP threads, we use the following strategy:
#
# 1. First, fix one node in the --qos debug queue and choose --ntasks-per-node as a
#    power of two, 2**n, starting with n=0.
#
# 2. Next, compute --cpus-per-task = 256 / ntasks-per-node. This is the number of
#    virtual cores (hardware threads) available to each MPI process on the node.
#    Each physical core consists of two virtual cores, and the node has two sockets
#    with 64 physical cores each (64+64).
#
# 3. From the number of available virtual cores, compute the number of physical cores
#    and bind one OpenMP thread to each available physical core using
#    export OMP_NUM_THREADS. A lower number may also be selected (in case of memory
#    shortage); however, in principle we want to use all the available resources first.
#
# 4. Run the script and check the wall-clock time per timestep. On Perlmutter this can
#    be found with: grep Coarse <slurm_output>
#
# 5. Repeat steps 1-4 until the best MPI/OpenMP balance is reached for the choice of n.
#
# 6. Compare different amr.max_grid_size values until the optimal one is found; it is
#    usually near half of the Level 0 domain size. Also test several amr.blocking_factor
#    sizes.
#
# 7. Finally, increase the number of nodes through 1, 2, 4, 8, ... and compare the change
#    in wall-clock time. If the problem scales well, the wall-clock time will drop by a
#    factor of ~2 each time the node count doubles. This scaling will eventually break
#    down beyond some node count; that is the number of nodes to select.
#
# 8. Run a chain of jobs using this script and ./chainslurm.sh
export OMP_NUM_THREADS=8
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
#export MPICH_MAX_THREAD_SAFETY=multiple
export CASTRO_EXEC=./Castro2d.gnu.x86-milan.MPI.OMP.ex
export INPUTS_FILE=./inputs_nova_t7
function find_chk_file {
    # find_chk_file takes a single argument -- the wildcard pattern
    # for checkpoint files to look through
    chk=$1

    # find the latest 2 restart files. This way if the latest didn't
    # complete we fall back to the previous one.
    temp_files=$(find . -maxdepth 1 -name "${chk}" -print | sort | tail -2)
    restartFile=""
    for f in ${temp_files}
    do
        # the Header is the last thing written -- if it's there, update the
        # restart file; otherwise keep the previous (second-to-last) one
        if [ -f ${f}/Header ]; then
            restartFile="${f}"
        fi
    done
}
# look for 7-digit chk files
find_chk_file "*chk???????"
if [ "${restartFile}" = "" ]; then
# look for 6-digit chk files
find_chk_file "*chk??????"
fi
if [ "${restartFile}" = "" ]; then
# look for 5-digit chk files
find_chk_file "*chk?????"
fi
# restartString will be empty if no chk files are found -- i.e. new run
if [ "${restartFile}" != "" ]; then
restartString="amr.restart=${restartFile}"
else
restartString=""
fi
srun -n $((SLURM_NTASKS_PER_NODE * SLURM_NNODES)) -c ${SLURM_CPUS_PER_TASK} --cpu-bind=cores ${CASTRO_EXEC} ${INPUTS_FILE} ${restartString}
Note
Jobs should be run in your $SCRATCH directory. By default, SLURM will change directory into the submission directory. Alternatively, you can run in the Community File System, $CFS/m3018, which everyone in the project has access to.
Jobs are submitted as:
sbatch script.slurm
You can check the status of your jobs via:
squeue --me
and an estimate of the start time can be found via:
squeue --me --start
To cancel a job, you would use scancel.
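For example, to cancel a specific job (the job ID here is hypothetical):
scancel 1234567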
Filesystems
We can run on the Community File System, CFS, to share files with everyone in the project. For instance, for project m3018, we would do:
cd $CFS/m3018
There is a 20 TB quota here, which can be checked via:
showquota m3018
Chaining
To chain jobs, so that one queues up after the previous job finishes, use the chainslurm.sh script in that same directory:
chainslurm.sh jobid number script
where jobid is the existing job you want to start your chain from, number is the number of new jobs to chain from this starting job, and script is the job submission script to use (most likely the same one you used originally).
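For example, to queue up 5 more jobs after an already-submitted job (the job ID here is hypothetical; script.slurm is the example submission script name used above):
chainslurm.sh 1234567 5 script.slurm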
Note
The script can also create the initial job to start the chain. If jobid is set to -1, then the script will first submit a job with no dependencies and then chain the remaining number-1 jobs, each depending on the previous one.
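For example, to start a fresh chain of 10 jobs with no pre-existing job (again using the example script name from above):
chainslurm.sh -1 10 script.slurm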
You can view the job dependency using:
squeue -l -j job-id
where job-id is the number of the job. A job can be canceled using scancel, and the status can be checked using squeue -u username or squeue --me.