Managing Jobs at NERSC

Perlmutter

Perlmutter has 1536 GPU nodes, each with 4 NVIDIA A100 GPUs, so it is best to use 4 MPI tasks per node.

Note

You need to load the same modules in your submission script that were used to compile the executable; otherwise, the run will fail at runtime because it cannot find the CUDA libraries.

Below is an example that runs on 16 nodes with 4 GPUs per node, and also includes the restart logic to allow for job chaining.

#!/bin/bash

#SBATCH -A m3018_g
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -J Castro
#SBATCH -o subch_%j.out
#SBATCH -t 2:00:00
#SBATCH -N 16
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=map_gpu:0,1,2,3
#SBATCH --signal=B:URG@120

export CASTRO_EXEC=./Castro2d.gnu.MPI.CUDA.SMPLSDC.ex
export INPUTS=inputs_2d.N14

module load PrgEnv-gnu
module load cudatoolkit

function find_chk_file {
    # find_chk_file takes a single argument -- the wildcard pattern
    # for checkpoint files to look through
    chk=$1

    # find the latest 2 restart files.  This way if the latest didn't
    # complete we fall back to the previous one.
    temp_files=$(find . -maxdepth 1 -name "${chk}" -print | sort | tail -2)
    restartFile=""
    for f in ${temp_files}
    do
        # the Header is the last thing written -- if it's there, update the restart file
        if [ -f ${f}/Header ]; then
            restartFile="${f}"
        fi
    done

}

# look for 7-digit chk files
find_chk_file "*chk???????"

if [ "${restartFile}" = "" ]; then
    # look for 6-digit chk files
    find_chk_file "*chk??????"
fi

if [ "${restartFile}" = "" ]; then
    # look for 5-digit chk files
    find_chk_file "*chk?????"
fi

# restartString will be empty if no chk files are found -- i.e. new run
if [ "${restartFile}" = "" ]; then
    restartString=""
else
    restartString="amr.restart=${restartFile}"
fi

# clean up any run management files left over from previous runs
rm -f dump_and_stop

# The `--signal=B:URG@<n>` option tells slurm to send SIGURG to this batch
# script n seconds before the runtime limit, so we can exit gracefully.
function sig_handler {
    touch dump_and_stop
    # disable this signal handler
    trap - URG
    echo "BATCH: allocation ending soon; telling Castro to dump a checkpoint and stop"
}
trap sig_handler URG

workdir=$(basename "${SLURM_SUBMIT_DIR}")
slack_job_start.py "starting NERSC job: ${workdir} ${restartFile}" @michael


# execute srun in the background then use the builtin wait so the shell can
# handle the signal
srun -n $((SLURM_NTASKS_PER_NODE * SLURM_NNODES)) ${CASTRO_EXEC} ${INPUTS} ${restartString} &
pid=$!
wait $pid
ret=$?

if (( ret == 128 + 23 )); then
    # received SIGURG, keep waiting
    wait $pid
    ret=$?
fi

exit $ret

Note

With large reaction networks, you may get GPU out-of-memory errors during the first burner call. If this happens, you can add amrex.the_arena_init_size=0 after ${restartString} in the srun call so AMReX doesn’t reserve 3/4 of the GPU memory for the device arena.
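
With the srun line from the script above, that change would look like:

srun -n $((SLURM_NTASKS_PER_NODE * SLURM_NNODES)) ${CASTRO_EXEC} ${INPUTS} ${restartString} amrex.the_arena_init_size=0 &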

Note

If the job times out before writing out a checkpoint (leaving a dump_and_stop file behind), you can give it more time between the warning signal and the end of the allocation by adjusting the #SBATCH --signal=B:URG@<n> line at the top of the script.
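
For example, to get the warning 10 minutes before the end of the allocation instead of the 2 minutes used above, change that line to:

#SBATCH --signal=B:URG@600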

Below is an example that runs on CPU-only nodes. Here ntasks-per-node is the number of MPI processes per node (used for distributed parallelism), and cpus-per-task is the number of hardware threads (hyperthreads) assigned to each task (used for shared-memory parallelism). Since a Perlmutter CPU node has 2 sockets * 64 cores/socket * 2 threads/core = 256 threads, set cpus-per-task to 256 / ntasks-per-node. However, it is best to assign one OpenMP thread per physical core, so set OMP_NUM_THREADS to cpus-per-task / 2. See the more detailed instructions within the script.
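
As a quick check of that arithmetic for the 16 tasks per node used in the script below (a sketch; the shell variable names are only for illustration):

ntasks_per_node=16
cpus_per_task=$(( 256 / ntasks_per_node ))   # 16 hardware threads per MPI task
omp_num_threads=$(( cpus_per_task / 2 ))     # 8 OpenMP threads, one per physical core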

#!/bin/bash
#SBATCH --job-name=perlmutter_script
#SBATCH --account=m3018
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=16
#SBATCH --qos=regular
#SBATCH --time=02:00:00
#SBATCH --constraint=cpu

#***********************INSTRUCTIONS*******************************************************
#In order to couple each MPI process with OpenMP threads we have designed the following
#strategy:
#
# 1. First, run on a single node in the --qos debug queue and pick a power of two
#    (2**n) for --ntasks-per-node, starting with n=0.
#
# 2. Next, compute --cpus-per-task = 256 / ntasks-per-node. This is the number of
#    virtual cores available to each MPI process on each node. Each physical core
#    consists of two virtual cores, and each of the two NUMA domains contains
#    64 physical cores (64+64).
#
# 3. From the number of available virtual cores, compute the number of physical
#    cores and bind one OpenMP thread to each available physical core via
#    export OMP_NUM_THREADS. A lower number may also be selected (in case of a
#    memory shortage); however, in principle we want to use all of the available
#    resources first.
#
# 4. Run the script and check the wall-clock time per timestep. On Perlmutter this
#    can be done with: grep Coarse <slurm_output>
#
# 5. Repeat steps 1-4 until the best MPI/OpenMP balance is reached for the
#    choice of n.
#
# 6. Compare different amr.max_grid_size values until the optimal one is found; it
#    is usually near half the level 0 grid size. Also test several
#    amr.blocking_factor values.
#
# 7. Finally, increase the number of nodes from 1 to 2, 4, 8, ... and compare the
#    change in wall-clock time. If the problem scales well, the wall-clock time
#    will go down by a factor of ~2 each time the node count doubles. The scaling
#    will break down beyond some node count; that is the number of nodes to select.
#
# 8. Run a chain of jobs using this script and ./chainslurm.sh (see the Chaining
#    section below).

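# one OpenMP thread per physical core: cpus-per-task / 2 = 16 / 2 = 8 threads,
# spread across the core places and pinned there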
export OMP_NUM_THREADS=8
export OMP_PLACES=cores
export OMP_PROC_BIND=spread

#export MPICH_MAX_THREAD_SAFETY=multiple
export CASTRO_EXEC=./Castro2d.gnu.x86-milan.MPI.OMP.ex
export INPUTS_FILE=./inputs_nova_t7

function find_chk_file {
    # find_chk_file takes a single argument -- the wildcard pattern
    # for checkpoint files to look through
    chk=$1

    # find the latest 2 restart files.  This way if the latest didn't
    # complete we fall back to the previous one.
    temp_files=$(find . -maxdepth 1 -name "${chk}" -print | sort | tail -2)
    restartFile=""
    for f in ${temp_files}
    do
        # the Header is the last thing written -- if it's there, update the
        # restart file; otherwise, keep the earlier checkpoint as the fallback
        if [ -f ${f}/Header ]; then
            restartFile="${f}"
        fi
    done

}

# look for 7-digit chk files
find_chk_file "*chk???????"

if [ "${restartFile}" = "" ]; then
    # look for 6-digit chk files
    find_chk_file "*chk??????"
fi

if [ "${restartFile}" = "" ]; then
    # look for 5-digit chk files
    find_chk_file "*chk?????"
fi

# restartString will be empty if no chk files are found -- i.e. new run
if [ "${restartFile}" != "" ]; then
    restartString="amr.restart=${restartFile}"
else
    restartString=""
fi

srun -n $((SLURM_NTASKS_PER_NODE * SLURM_NNODES)) -c ${SLURM_CPUS_PER_TASK} --cpu-bind=cores  ${CASTRO_EXEC} ${INPUTS_FILE} ${restartString}

Note

Jobs should be run in your $SCRATCH directory. By default, SLURM will change into the submission directory when the job starts.

Alternatively, you can run in the Community File System, $CFS/m3018, which everyone in the project has access to.
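
For example, to set up a run directory in scratch (the directory name here is just a placeholder):

mkdir -p $SCRATCH/subch_run
cd $SCRATCH/subch_run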

Jobs are submitted as:

sbatch script.slurm

You can check the status of your jobs via:

squeue --me

and an estimate of the start time can be found via:

squeue --me --start

To cancel a job, use scancel.

Filesystems

We can run on the Community File System, CFS, to share files with everyone in the project. For instance, for project m3018, we would do:

cd $CFS/m3018

There is a 20 TB quota here, which can be checked via:

showquota m3018

Chaining

To chain jobs, so that each one queues up after the previous job finishes, use the chainslurm.sh script in that same directory:

chainslurm.sh jobid number script

where jobid is the existing job you want to start your chain from, number is the number of new jobs to chain onto this starting job, and script is the job submission script to use (most likely the same one you used originally).
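
For example, to add 5 more jobs to the chain after an existing job (the job id here is just a placeholder):

chainslurm.sh 12345678 5 script.slurm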

Note

The script can also create the initial job to start the chain. If jobid is set to -1, then the script will first submit a job with no dependencies and then chain the remaining number-1 jobs, each depending on the previous one.
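
For example, to start a fresh chain of 10 jobs with no pre-existing job:

chainslurm.sh -1 10 script.slurm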

You can view the job dependency using:

squeue -l -j job-id

where job-id is the number of the job. A job can be canceled using scancel, and the status can be checked using squeue -u username or squeue --me.