Managing Jobs at OLCF

Summit

Summit Architecture:

Let us start by reviewing the node architecture of Summit. Our goal is to provide the insight needed to make better decisions when constructing our AMReX-Astro job scripts, and to explain how our code interacts with Summit. The information in this section is a condensed version of the Summit user guide and should not replace it.

On Summit, a node is composed of two sockets, each with 21 physical CPU cores (plus 1 reserved for the system), 3 GPUs, and 1 RAM memory bank. The sockets are connected by a bus that allows communication between them. Each physical CPU core supports up to 4 hardware threads. The structure of the node is depicted below:

Figure 1: Summit node architecture. Figure extracted from https://docs.olcf.ornl.gov/systems/summit_user_guide.html#job-launcher-jsrun.
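Adding this up, each node provides:

2 sockets x 21 cores  = 42 physical CPU cores available to a job
2 sockets x 3 GPUs    = 6 GPUs
42 cores x 4 threads  = 168 hardware threads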

A resource set is a minimal collection of physical CPU cores and GPUs on which a certain number of MPI processes and OpenMP threads operate during the execution of the code. Therefore, for each resource set, we need to specify:

  • A number of CPU physical cores.

  • A number of physical GPUs.

  • A number of MPI processes.

  • The number of OpenMP threads.

where each core supports up to 4 hardware threads; however, we do not use this option in our AMReX runs, so we will not discuss it further here. For now, we fix a single thread throughout the execution of our code. The next step is to determine the maximum number of resource sets that fit into one node.

In Castro we construct each resource set with 1 physical CPU core, 1 GPU, and 1 MPI process. The next step is to see how many resource sets fit into one node. According to the node architecture depicted in Figure 1, we can fit up to 6 resource sets per node (one per GPU), as in Figure 2.

Figure 2: Six resource sets per Summit node. Figure modified and extracted from https://docs.olcf.ornl.gov/systems/summit_user_guide.html#job-launcher-jsrun.

Requesting Allocation:

To allocate the resource sets, we use the bsub command together with the following flags:

-nnodes
    Allocates the number of nodes we need to run our code. It is important to
    perform the calculation described in the previous section to select the
    correct number of nodes for our setup.

-W
    Sets the walltime of the allocation. The format used on Summit is
    [hours:]minutes; seconds cannot be specified. The maximum walltime we can
    request is 03:00 (three hours).

-alloc_flags
    Sets the maximum number of hardware threads available per CPU core. The
    default is smt4, which allows 4 threads per core. Since we use only one
    thread throughout the execution, we set smt1 instead. -alloc_flags accepts
    other options as well, but this is the only one we are interested in here.

-J
    Defines the name of the allocation. The string %J used in the output file
    names corresponds to the allocation ID number.

-o
    Defines the name of the file that captures the standard output stream of
    all the jobs run inside the requested allocation.

-e
    Defines the name of the file that captures the standard error stream,
    analogous to the -o flag. If -e is not supplied, the standard error is
    written to the -o file.

-q
    Defines the queue on which our application will run. There are several
    options, but we alternate between two: the standard production queue batch
    and the debugging queue debug. The debug queue is designed to allocate a
    small number of nodes for short runs that verify the code runs smoothly and
    without bugs.

-Is
    Requests an interactive job, followed by the shell to use (for bash,
    /bin/bash). This flag is very useful for debugging, because the standard
    output can be inspected while the code is running. Note that an interactive
    job can only be requested from the command line, not from a batch script.

For example, to allocate one node for an interactive job in the debug queue for 30 minutes, we can use:

bsub -nnodes 1 -q debug -W 0:30 -P ast106 -alloc_flags smt1 -J example -o stdout_to_show.%J -e stderr_to_show.%J -Is /bin/bash

Note

An interactive job can only be allocated by the use of the command line. No script can be defined for interactive jobs.

Submitting a Job:

Once our allocation is granted, it is important to load the same modules used to compile the executable and to export the variable OMP_NUM_THREADS to set the number of threads per MPI process.

In Castro, we have used the following modules:

module load gcc/10.2.0
module load cuda/11.5.2
module load python

and fix one thread per MPI process with:

export OMP_NUM_THREADS=1

The next step is to submit our job. The jsrun command, given the total number of resource sets, the number of physical CPU cores per resource set, the number of GPUs per resource set, the number of MPI processes per resource set, and the maximum number of resource sets per node, works as follows:

jsrun -n[number of resource sets] -c[number of CPU physical cores] -g[number of GPUs] -a[number of MPI processes] -r[number of max resources per node] ./[executable] [executable inputs]

In Castro we will use:

jsrun -n [number of resource sets] -a1 -c1 -g1 -r6 ./$CASTRO $INPUTS

where the CASTRO and INPUTS environment variables hold the executable and inputs file names, respectively.
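For example, a single-node run using all 6 resource sets, with hypothetical executable and inputs file names, would look like:

CASTRO=Castro2d.gnu.MPI.CUDA.ex
INPUTS=inputs_2d
jsrun -n 6 -a1 -c1 -g1 -r6 ./$CASTRO $INPUTS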

Now, in order to use all the resources we have allocated, the number of resource sets should match the number of AMReX boxes (grids) on the level with the largest number of boxes. Let us consider an excerpt from the standard output of a Castro problem:

INITIAL GRIDS
Level 0   2 grids  32768 cells  100 % of domain
          smallest grid: 128 x 128  biggest grid: 128 x 128
Level 1   8 grids  131072 cells  100 % of domain
          smallest grid: 128 x 128  biggest grid: 128 x 128
Level 2   8 grids  524288 cells  100 % of domain
          smallest grid: 256 x 256  biggest grid: 256 x 256
Level 3   32 grids  2097152 cells  100 % of domain
          smallest grid: 256 x 256  biggest grid: 256 x 256
Level 4   128 grids  7864320 cells  93.75 % of domain
          smallest grid: 256 x 128  biggest grid: 256 x 256
Level 5   480 grids  30408704 cells  90.625 % of domain
          smallest grid: 256 x 128  biggest grid: 256 x 256

In this example, Level 5 contains the largest number of AMReX boxes: 480. From this, a good allocation for this problem is 480 resource sets, equivalent to 80 nodes with 6 resource sets per node. However, note that Level 0 uses only 2 AMReX boxes; this implies that, of the 480 resource sets available, 478 will remain idle while the two working MPI processes sweep the entire Level 0.

Note

Therefore, it is important, if possible, to keep the number of boxes on each level balanced in order to maximize the use of the allocated resources.
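As a quick check of the arithmetic in the example above, the number of nodes to request can be computed with a bit of shell arithmetic (a sketch; the box count is read off the grid summary in the standard output):

n_boxes=480          # boxes on the most populated level (Level 5 above)
res_per_node=6       # resource sets (one per GPU) per Summit node
# ceiling division gives the node count to pass to -nnodes
n_nodes=$(( (n_boxes + res_per_node - 1) / res_per_node ))
echo $n_nodes        # -> 80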

Writing a Job Script:

To make our lives easier, instead of typing an allocation command, loading the modules, setting the threads per MPI process, and typing another command to submit our jobs, we can pack all of these commands into a single executable .sh script that is submitted via bsub just once.

We start our job script with the shebang line #!/bin/bash. Then we add the bsub allocation flags, each prefixed with #BSUB, as follows:

#!/bin/bash
#BSUB -P ast106
#BSUB -W 2:00
#BSUB -nnodes 80
#BSUB -alloc_flags smt1
#BSUB -J luna_script
#BSUB -o luna_output.%J
#BSUB -e luna_sniffing_output.%J

In addition, we add the module load statements and fix one thread per MPI process:

module load gcc/10.2.0
module load cuda/11.5.2
module load python

export OMP_NUM_THREADS=1

and define the environment variables:

CASTRO=./Castro2d.gnu.MPI.CUDA.ex
INPUTS=inputs_luna

n_res=480                # The max allocated number of resource sets is
n_cpu_cores_per_res=1    # nnodes * n_max_res_per_node. In this case we will
n_mpi_per_res=1          # use all the allocated resource sets to run the job
n_gpu_per_res=1          # below.
n_max_res_per_node=6

Once the allocation time ends, the job is killed, leaving us where we started. As pointed out above, the maximum allocation time on Summit is 03:00 (three hours), but we may need weeks or months of walltime to complete our runs. This is where the automatic restarting section of the script comes to the rescue.

Here we can add an optional (but strongly recommended) piece to our script. As the code executes, it creates checkpoint files of the form chkxxxxxxx, chkxxxxxx, or chkxxxxx every certain number of timesteps. These checkpoint files can be read by our executable to restart the run from the simulation time at which the checkpoint was written. This is implemented as follows:

function find_chk_file {
   # find_chk_file takes a single argument -- the wildcard pattern
   # for checkpoint files to look through
   chk=$1

   # find the latest 2 restart files.  This way if the latest didn't
   # complete we fall back to the previous one.
   temp_files=$(find . -maxdepth 1 -name "${chk}" -print | sort | tail -2)
   restartFile=""
   for f in ${temp_files}
   do
      # the Header is the last thing written -- if it's there, update the restart file
      if [ -f ${f}/Header ]; then
         restartFile="${f}"
      fi
   done
}

# look for 7-digit chk files
find_chk_file "*chk???????"

if [ "${restartFile}" = "" ]; then
   # look for 6-digit chk files
   find_chk_file "*chk??????"
fi

if [ "${restartFile}" = "" ]; then
   # look for 5-digit chk files
   find_chk_file "*chk?????"
fi

# restartString will be empty if no chk files are found -- i.e. new run
if [ "${restartFile}" = "" ]; then
   restartString=""
else
   restartString="amr.restart=${restartFile}"
fi

The function find_chk_file searches the submission directory for checkpoint files. Because AMReX adds digits as the number of steps increases (with a minimum of 5 digits), we search for files with 7 digits, then 6 digits, and finally 5 digits, to ensure we pick up the latest file.
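For example, if the most recent complete checkpoint in the run directory is chk0012345 (a hypothetical name), the logic above ends up with:

restartFile=./chk0012345
restartString="amr.restart=./chk0012345"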

We can also ask the job manager to send a warning signal some amount of time before the allocation expires by passing -wa 'signal' and -wt '[hour:]minute' to bsub. We can then have bash create a dump_and_stop file when it receives the signal, which will tell Castro to output a checkpoint file and exit cleanly after it finishes the current timestep. An important detail that I couldn’t find documented anywhere is that the job manager sends the signal to all the processes in the job, not just the submission script, and we have to use a signal that is ignored by default so Castro doesn’t immediately crash upon receiving it. SIGCHLD, SIGURG, and SIGWINCH are the only signals that fit this requirement and of these, SIGURG is the least likely to be triggered by other events.

#BSUB -wa URG
#BSUB -wt 2

...

function sig_handler {
   touch dump_and_stop
   # disable this signal handler
   trap - URG
   echo "BATCH: allocation ending soon; telling Castro to dump a checkpoint and stop"
}
trap sig_handler URG

We use the jsrun command to launch Castro on the compute nodes. In order for bash to handle the warning signal before Castro exits, we must put jsrun in the background and use the shell builtin wait:

jsrun -n$n_res -c$n_cpu_cores_per_res -a$n_mpi_per_res -g$n_gpu_per_res -r$n_max_res_per_node $CASTRO $INPUTS ${restartString} &
wait
# use jswait to wait for Castro (job step 1/1) to finish and get the exit code
jswait 1

Finally, once the script is complete and saved as luna_script.sh, we can submit it with:

bsub luna_script.sh

Monitoring a Job:

You can monitor the status of your jobs using bjobs. A slightly nicer view of your jobs is available with jobstat:

jobstat -u username
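Similarly, running bjobs with no arguments lists your own jobs, and passing a specific job id (hypothetical here) gives a more detailed report:

bjobs            # list your own jobs
bjobs -l 123456  # detailed report for a single (hypothetical) job id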

Script Template:

Packing together all of the information above leads us to the following script template:

#!/bin/bash
#BSUB -P ast106
#BSUB -W 2:00
#BSUB -nnodes 80
#BSUB -alloc_flags smt1
#BSUB -J luna_script
#BSUB -o luna_output.%J
#BSUB -e luna_sniffing_output.%J
#BSUB -wa URG
#BSUB -wt 2

module load gcc/10.2.0
module load cuda/11.5.2
module load python

export OMP_NUM_THREADS=1

CASTRO=./Castro2d.gnu.MPI.CUDA.ex
INPUTS=inputs_luna

n_res=480                # The max allocated number of resource sets is
n_cpu_cores_per_res=1    # nnodes * n_max_res_per_node. In this case we use all
n_mpi_per_res=1          # the allocated resource sets to run the single job below;
n_gpu_per_res=1          # we could instead define more environment variables and
n_max_res_per_node=6     # split them among two simultaneous jobs, with n_res = n_res_1 + n_res_2.

function find_chk_file {
    # find_chk_file takes a single argument -- the wildcard pattern
    # for checkpoint files to look through
    chk=$1

    # find the latest 2 restart files.  This way if the latest didn't
    # complete we fall back to the previous one.
    temp_files=$(find . -maxdepth 1 -name "${chk}" -print | sort | tail -2)
    restartFile=""
    for f in ${temp_files}
    do
        # the Header is the last thing written -- if it's there, update the restart file
        if [ -f ${f}/Header ]; then
            # The scratch FS sometimes gives I/O errors when trying to read
            # from recently-created files, which crashes Castro. Avoid this by
            # making sure we can read from all the data files.
            if head --quiet -c1 "${f}/Header" "${f}"/Level_*/* >/dev/null; then
                restartFile="${f}"
            fi
        fi
    done
}

# look for 7-digit chk files
find_chk_file "*chk???????"

if [ "${restartFile}" = "" ]; then
    # look for 6-digit chk files
    find_chk_file "*chk??????"
fi

if [ "${restartFile}" = "" ]; then
    # look for 5-digit chk files
    find_chk_file "*chk?????"
fi

# restartString will be empty if no chk files are found -- i.e. new run
if [ "${restartFile}" = "" ]; then
    restartString=""
else
    restartString="amr.restart=${restartFile}"
fi

# clean up any run management files left over from previous runs
rm -f dump_and_stop

warning_time=$(bjobs -noheader -o action_warning_time "$LSB_JOBID")
# The `-wa URG -wt <n>` options tell bsub to send SIGURG to all processes n
# minutes before the runtime limit, so we can exit gracefully.
# SIGURG is ignored by default, so it won't make Castro crash.
function sig_handler {
    touch dump_and_stop
    # disable this signal handler
    trap - URG
    echo "BATCH: $warning_time left in allocation; telling Castro to dump a checkpoint and stop"
}
trap sig_handler URG

# execute jsrun in the background then use the builtin wait so the shell can
# handle the signal
jsrun -n$n_res -c$n_cpu_cores_per_res -a$n_mpi_per_res -g$n_gpu_per_res -r$n_max_res_per_node $CASTRO $INPUTS ${restartString} &
wait
# use jswait to wait for Castro (job step 1/1) to finish and get the exit code
jswait 1

Chaining jobs

The script job_scripts/summit/chain_submit.sh can be used to set up job dependencies, i.e., a job chain.

First, you submit a job as usual using bsub and make note of the job id it prints upon submission (the same id you would see with bjobs or jobstat). Then you set up N jobs to depend on the one you just submitted:

chain_submit.sh job-id N submit_script.sh

where you replace job-id with the id returned from your first submission, N with the number of additional jobs, and submit_script.sh with the name of the script you use to submit the job. This will queue up N additional jobs, each depending on the previous one. Your submission script should use the automatic restarting features discussed above.
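For example, if the first submission returned job id 123456 (a hypothetical id) and we want 5 more jobs reusing the same script:

bsub luna_script.sh                      # prints the job id, e.g. 123456
chain_submit.sh 123456 5 luna_script.sh  # queue 5 more jobs, each depending on the previous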

Archiving to HPSS

You can access HPSS from Summit using the data transfer nodes by submitting a job via SLURM:

sbatch -N 1 -t 15:00 -A ast106 --cluster dtn test_hpss.sh

where test_hpss.sh is a SLURM script containing the htar commands needed to archive your data (the data transfer nodes use SLURM as their job manager).
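A minimal sketch of what test_hpss.sh could contain (hypothetical run and archive names; the full-featured approach is the process.xrb script described below):

#!/bin/bash
# archive one plotfile directory into a tar file on HPSS
cd $SLURM_SUBMIT_DIR
htar -cvf my_run/plt00000.tar plt00000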

An example is provided by the process.xrb archiving script in job_scripts/hpss/ and the associated summit_hpss.submit submission script in job_scripts/summit/. Together these will detect new plotfiles as they are generated, tar them up (using htar), and archive them onto HPSS. They will also store the inputs, probin, and other runtime-generated files. If ftime is found in your path, the script will also create a file called ftime.out that lists the simulation time corresponding to each plotfile.

Once the plotfiles are archived they are moved to a subdirectory under your run directory called plotfiles/.

By default, the files will be archived to a directory in HPSS with the same name as the directory your plotfiles are located in. This can be changed by editing the $HPSS_DIR variable at the top of process.xrb.
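For instance, to archive into a specific HPSS directory (a hypothetical name), edit the variable at the top of process.xrb:

HPSS_DIR=my_project/luna_plotfiles    # hypothetical destination directory on HPSS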

To use this, we do the following:

  1. Copy the process.xrb and summit_hpss.submit scripts into the directory with the plotfiles.

  2. Launch the script via:

    sbatch summit_hpss.submit
    

    It will run for the full time you requested, searching for plotfiles as they are created and archiving them to HPSS as they appear (it will always leave the most recent plotfile alone, since it cannot tell whether it is still being written).

Files may be unarchived in bulk from HPSS on OLCF systems using the hpss_xfer.py script, which is available in the job_scripts directory. It requires Python 3 to be loaded to run. The command:

./hpss_xfer.py plt00000 -s hpss_dir -o plotfile_dir

will fetch hpss_dir/plt00000.tar from the HPSS filesystem and unpack it in plotfile_dir. If run with no arguments in the problem launch directory, the script will attempt to recover all plotfiles archived by process.xrb. Try running ./hpss_xfer.py --help for a description of usage and arguments.
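Since the script fetches one plotfile per invocation, several archived plotfiles can be retrieved with a simple loop (hypothetical names):

for p in plt00000 plt01000 plt02000
do
    ./hpss_xfer.py $p -s my_run -o plotfiles
done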

Frontier

Machine details

Queue policies are here: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#scheduling-policy

The filesystem is called Orion and is Lustre: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#data-and-storage

Submitting jobs

Frontier uses SLURM.

Here’s a script that runs with 2 nodes using all 8 GPUs per node:

#!/bin/bash
#SBATCH -A AST106
#SBATCH -J testing
#SBATCH -o %x-%j.out
#SBATCH -t 00:05:00
#SBATCH -p batch
# here N is the number of compute nodes
#SBATCH -N 2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest

EXEC=Castro3d.hip.x86-trento.MPI.HIP.ex
INPUTS=inputs.3d.sph

module load PrgEnv-gnu
module load craype-accel-amd-gfx90a
module load cray-mpich/8.1.27
module load amd-mixed/6.0.0

export OMP_NUM_THREADS=1
export NMPI_PER_NODE=8
export TOTAL_NMPI=$(( ${SLURM_JOB_NUM_NODES} * ${NMPI_PER_NODE} ))

srun -n${TOTAL_NMPI} -N${SLURM_JOB_NUM_NODES} --ntasks-per-node=8 --gpus-per-task=1 ./$EXEC $INPUTS

Note

As of June 2023, it is necessary to explicitly use -n and -N on the srun line.

The job is submitted as:

sbatch frontier.slurm

where frontier.slurm is the name of the submission script.

A sample job script that includes the automatic restart functions can be found here: https://github.com/AMReX-Astro/workflow/blob/main/job_scripts/frontier/frontier.slurm

Also see the WarpX docs: https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html

Job Status

You can check on the status of your jobs via:

squeue --me

and get an estimated start time via:

squeue --me --start

Job Chaining

The script chainslurm.sh can be used to start a job chain, with each job depending on the previous. For example, to start up 10 jobs:

chainslurm -1 10 frontier.slurm

If you want to add the chain to an existing queued job, change the -1 to the job-id of the existing job.
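For example, to queue 10 more jobs after an already-queued job with id 123456 (a hypothetical id):

chainslurm 123456 10 frontier.slurm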

Debugging

Debugging is done with rocgdb. Here’s a workflow that works:

Setup the environment:

module load PrgEnv-gnu
module load cray-mpich/8.1.27
module load craype-accel-amd-gfx90a
module load amd-mixed/5.6.0

Build the executable. Usually it’s best to disable MPI if possible and maybe turn on TEST=TRUE:

make USE_HIP=TRUE TEST=TRUE USE_MPI=FALSE -j 4

Start up an interactive session:

salloc -A ast106 -J mz -t 0:30:00 -p batch -N 1

This will automatically log you onto the compute node.

Note

It’s a good idea to do:

module restore

and then reload the same modules used for compiling in the interactive shell.

Now set the following environment variables:

export HIP_ENABLE_DEFERRED_LOADING=0
export AMD_SERIALIZE_KERNEL=3
export AMD_SERIALIZE_COPY=3

Note

You can also set

export AMD_LOG_LEVEL=3

to get a lot of information about the GPU calls.

Run the debugger:

rocgdb ./Castro2d.hip.x86-trento.HIP.ex

Set the following inside of the debugger:

set pagination off
b abort

Then run:

run inputs

If it doesn't crash and give you a backtrace, then try:

interrupt
bt

Troubleshooting

A workaround to prevent hangs in collectives is to set:

export FI_MR_CACHE_MONITOR=memhooks

There have been reports of AMReX hanging if the initial Arena size is too big; in that case, we should set

amrex.the_arena_init_size=0

The arena size will then grow as needed over time. There is a suggestion that if the size is larger than