Managing Jobs at OLCF¶
Summit¶
Summit Architecture:¶
Let us start by reviewing the node architecture of Summit. Our goal is to provide the insight needed to make better decisions when constructing our AMReX-Astro job scripts, and to explain how our code interacts with Summit. All the information in this section is a condensed version of the Summit documentation guide and should not replace it.
On Summit, a node is composed of two sockets, each with 21 physical CPU cores (plus 1 reserved for the system), 3 GPUs, and 1 RAM memory bank. The sockets are connected by a bus that allows communication between them. Each physical CPU core can run up to 4 hardware threads. The structure of the node is depicted in Figure 1.
A resource set is a minimal collection of physical CPU cores and GPUs on which a certain number of MPI processes and OpenMP threads operate during the execution of the code. Therefore, for each resource set, we need to allocate:
A number of CPU physical cores.
A number of physical GPUs.
A number of MPI processes.
The number of OpenMP threads.
where each core supports up to 4 threads; however, this option is not supported in AMReX, so we will not discuss it further here and fix only one thread throughout the execution of our code.
In Castro we construct each resource set with 1 physical CPU core, 1 GPU, and 1 MPI process. The next step is to determine how many resource sets fit into one node. According to the node architecture depicted in Figure 1, we can fit up to 6 resource sets per node, as shown in Figure 2.
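As a quick check before requesting an allocation, the number of nodes to ask for follows directly from this layout: divide the desired number of resource sets by 6 and round up. Below is a minimal sketch of this calculation in bash (the variable names are only for illustration):

#!/bin/bash
# each Summit node holds 6 resource sets (1 CPU core + 1 GPU + 1 MPI process each)
n_res=480          # total resource sets we want, e.g. one per AMReX box
res_per_node=6     # resource sets that fit on one node
# ceiling division: round up so every resource set has a node to live on
n_nodes=$(( (n_res + res_per_node - 1) / res_per_node ))
echo "request ${n_nodes} nodes"   # -> 80 in this example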
Requesting Allocation:¶
To allocate the resource sets we invoke the command bsub
with a number of flags:

Flag | Description
---|---
-nnodes | allocates the number of nodes we need to run our code. It is important to perform the calculation described in the previous section to select the correct number of nodes for our setup.
-W | allocates the walltime of the selected nodes. The format used on Summit is [hours:]minutes; seconds cannot be specified. The maximum walltime that we can request is 03:00 (three hours).
-alloc_flags | allocates the maximum number of hardware threads available per CPU core. By default the option is smt4; we use smt1 to run one thread per core.
-P | defines the name of the allocation (project) the job is charged to. The value ast106 corresponds to our project.
-o | defines the name of the file that receives the standard output stream of the jobs run inside the requested allocation.
-e | defines the name of the file that receives the standard error stream, similar to the -o flag.
-q | defines the queue on which our application will run. There are several options; however, we alternate between two of them: the standard production queue batch and the debug queue debug.
-Is | flags an interactive job, followed by the shell name. The Unix bash shell option is /bin/bash.
For example, if we want to allocate one node to run an interactive job in the debug queue for 30 minutes, we may use:
bsub -nnodes 1 -q debug -W 0:30 -P ast106 -alloc_flags smt1 -J example -o stdout_to_show.%J -e stderr_to_show.%J -Is /bin/bash
Note
An interactive job can only be requested from the command line; interactive jobs cannot be submitted through a script.
Submitting a Job:¶
Once our allocation is granted, it is important to load the same modules used to compile the executable and
to export the variable OMP_NUM_THREADS
to set the number of threads per MPI process.
In Castro, we have used the following modules:
module load gcc/10.2.0
module load cuda/11.5.2
module load python
and fixed only one thread per MPI process by:
export OMP_NUM_THREADS=1
The next step is to submit our job. The command jsrun, given the total number of resource sets, the number of physical CPU cores per resource set, the number of GPUs per resource set, the number of MPI processes per resource set, and the maximum number of resource sets per node, works as follows:
jsrun -n[number of resource sets] -c[number of CPU physical cores] -g[number of GPUs] -a[number of MPI processes] -r[number of max resources per node] ./[executable] [executable inputs]
In Castro we will use:
jsrun -n [number of resource sets] -a1 -c1 -g1 -r6 ./$CASTRO $INPUTS
where the CASTRO
and INPUTS
environment variables are placeholders for the executable and inputs file names, respectively.
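For example, with 480 resource sets spread over 80 nodes and the placeholders filled in with the names used later in this chapter, the full command looks like:

CASTRO=./Castro2d.gnu.MPI.CUDA.ex
INPUTS=inputs_luna
# 480 resource sets, each with 1 core, 1 MPI process, and 1 GPU; 6 sets per node
jsrun -n 480 -a1 -c1 -g1 -r6 ./$CASTRO $INPUTS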
Now, in order to use all the resources we have allocated, the number of resource sets should match the number of AMReX boxes (grids) on the level with the largest number of boxes. Let us consider an excerpt from the standard output of a Castro problem:
INITIAL GRIDS
Level 0 2 grids 32768 cells 100 % of domain
smallest grid: 128 x 128 biggest grid: 128 x 128
Level 1 8 grids 131072 cells 100 % of domain
smallest grid: 128 x 128 biggest grid: 128 x 128
Level 2 8 grids 524288 cells 100 % of domain
smallest grid: 256 x 256 biggest grid: 256 x 256
Level 3 32 grids 2097152 cells 100 % of domain
smallest grid: 256 x 256 biggest grid: 256 x 256
Level 4 128 grids 7864320 cells 93.75 % of domain
smallest grid: 256 x 128 biggest grid: 256 x 256
Level 5 480 grids 30408704 cells 90.625 % of domain
smallest grid: 256 x 128 biggest grid: 256 x 256
In this example, Level 5 contains the largest number of AMReX boxes: 480. From here, we may conclude that a good allocation for this problem is 480 resource sets, equivalent to 80 nodes with 6 resource sets per node. However, note that Level 0 uses only 2 AMReX boxes, which implies that, of the 480 allocated resource sets, 478 will remain idle while the two working processes sweep the entire Level 0.
Note
Therefore, it is important, if possible, to keep the number of boxes on each level balanced to maximize the use of the allocated resources.
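On the AMReX side, the sizes (and hence the number) of boxes on each level are influenced by the amr.max_grid_size and amr.blocking_factor inputs parameters. As a rough illustration only (the values shown are not a recommendation and must be tuned per problem), these can be appended to the jsrun line as command-line overrides, just as the amr.restart parameter is appended later in this chapter:

# illustrative only: override the box-size controls from the command line
jsrun -n [number of resource sets] -a1 -c1 -g1 -r6 ./$CASTRO $INPUTS amr.max_grid_size=128 amr.blocking_factor=32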
Writing a Job Script:¶
In order to make our life easier, instead of typing an allocation
command line, loading the modules, setting the threads per MPI process,
and writing another command line to submit our jobs, we can write a
script that packs all of these commands into one executable .sh
file
that can be submitted via bsub
just once.
We start our job script with the shebang line
#!/bin/bash
. Then we add the bsub
allocation flags, each starting
with #BSUB
, as follows:
#!/bin/bash
#BSUB -P ast106
#BSUB -W 2:00
#BSUB -nnodes 80
#BSUB -alloc_flags smt1
#BSUB -J luna_script
#BSUB -o luna_output.%J
#BSUB -e luna_sniffing_output.%J
Next we add the module load statements and fix only one thread per MPI process:
module load gcc/10.2.0
module load cuda/11.5.2
module load python
export OMP_NUM_THREADS=1
and define the environment variables:
CASTRO=./Castro2d.gnu.MPI.CUDA.ex
INPUTS=inputs_luna
# The maximum number of resource sets is nnodes * n_max_res_per_node;
# here we use all of the allocated resource sets for the job below.
n_res=480
n_cpu_cores_per_res=1
n_mpi_per_res=1
n_gpu_per_res=1
n_max_res_per_node=6
Once the allocation ends, the job is killed and the run stops. As we pointed out, the maximum allocation time on Summit is 03:00 (three hours), but we may need weeks, months, or even years to complete our runs. This is where the automatic restarting section of the script comes to our rescue.
From here we can add an optional (though in practice essential) setting to our script. As the code executes,
after a certain number of timesteps it creates checkpoint files of the form chkxxxxxxx,
chkxxxxxx, or chkxxxxx. These checkpoint files can be read by our executable to restart the run
from the simulation time at which the checkpoint was created. This is implemented as follows:
function find_chk_file {
    # find_chk_file takes a single argument -- the wildcard pattern
    # for checkpoint files to look through
    chk=$1

    # find the latest 2 restart files. This way if the latest didn't
    # complete we fall back to the previous one.
    temp_files=$(find . -maxdepth 1 -name "${chk}" -print | sort | tail -2)
    restartFile=""
    for f in ${temp_files}
    do
        # the Header is the last thing written -- if it's there, update the restart file
        if [ -f ${f}/Header ]; then
            restartFile="${f}"
        fi
    done
}

# look for 7-digit chk files
find_chk_file "*chk???????"

if [ "${restartFile}" = "" ]; then
    # look for 6-digit chk files
    find_chk_file "*chk??????"
fi

if [ "${restartFile}" = "" ]; then
    # look for 5-digit chk files
    find_chk_file "*chk?????"
fi

# restartString will be empty if no chk files are found -- i.e. new run
if [ "${restartFile}" = "" ]; then
    restartString=""
else
    restartString="amr.restart=${restartFile}"
fi
The function find_chk_file
searches the submission directory for
checkpoint files. Because AMReX appends digits as the number of steps
increases (with a minimum of 5 digits), we search for files with
7-digit, then 6-digit, and finally 5-digit names, to ensure we pick up
the latest file.
We can also ask the job manager to send a warning signal some amount
of time before the allocation expires by passing -wa 'signal'
and
-wt '[hour:]minute'
to bsub
. We can then have bash create a
dump_and_stop
file when it receives the signal, which will tell
Castro to output a checkpoint file and exit cleanly after it finishes
the current timestep. An important detail that I couldn’t find
documented anywhere is that the job manager sends the signal to all
the processes in the job, not just the submission script, and we have
to use a signal that is ignored by default so Castro doesn’t
immediately crash upon receiving it. SIGCHLD, SIGURG, and SIGWINCH
are the only signals that fit this requirement and of these, SIGURG is
the least likely to be triggered by other events.
#BSUB -wa URG
#BSUB -wt 2
...
function sig_handler {
    touch dump_and_stop
    # disable this signal handler
    trap - URG
    echo "BATCH: allocation ending soon; telling Castro to dump a checkpoint and stop"
}
trap sig_handler URG
We use the jsrun
command to launch Castro on the compute nodes. In
order for bash to handle the warning signal before Castro exits, we
must put jsrun
in the background and use the shell builtin
wait
:
jsrun -n$n_res -c$n_cpu_cores_per_res -a$n_mpi_per_res -g$n_gpu_per_res -r$n_max_res_per_node $CASTRO $INPUTS ${restartString} &
wait
# use jswait to wait for Castro (job step 1/1) to finish and get the exit code
jswait 1
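If you want the script to react to how Castro finished, for example to log a failure before the allocation ends, you can inspect the exit code that jswait returns. A minimal sketch continuing the fragment above:

ret=$?   # exit code of job step 1, i.e. Castro's exit code
if [ ${ret} -ne 0 ]; then
    echo "BATCH: Castro exited with code ${ret}; check the error log"
fi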
Finally, once the script is completed and saved as luna_script.sh
, we can submit it by:
bsub luna_script.sh
Monitoring a Job:¶
You can monitor the status of your jobs using bjobs
. Also, a slightly nicer view of your jobs is provided by jobstat
as:
jobstat -u username
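For more detail on a single job (start time, requested resources, warning-signal settings, and so on), you can use the standard LSF long listing; the job id below is just a placeholder:

bjobs -l 123456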
Script Template:¶
Packing all of the information above together leads us to the following script template:
#!/bin/bash
#BSUB -P ast106
#BSUB -W 2:00
#BSUB -nnodes 80
#BSUB -alloc_flags smt1
#BSUB -J luna_script
#BSUB -o luna_output.%J
#BSUB -e luna_sniffing_output.%J
#BSUB -wa URG
#BSUB -wt 2

module load gcc/10.2.0
module load cuda/11.5.2
module load python

export OMP_NUM_THREADS=1

CASTRO=./Castro2d.gnu.MPI.CUDA.ex
INPUTS=inputs_luna

# The maximum number of resource sets is nnodes * n_max_res_per_node.
# Here we use all of the allocated resource sets for the single job below,
# but we could also define more environment variables and split them
# (n_res = n_res_1 + n_res_2) to run two jobs simultaneously.
n_res=480
n_cpu_cores_per_res=1
n_mpi_per_res=1
n_gpu_per_res=1
n_max_res_per_node=6

function find_chk_file {
    # find_chk_file takes a single argument -- the wildcard pattern
    # for checkpoint files to look through
    chk=$1

    # find the latest 2 restart files. This way if the latest didn't
    # complete we fall back to the previous one.
    temp_files=$(find . -maxdepth 1 -name "${chk}" -print | sort | tail -2)
    restartFile=""
    for f in ${temp_files}
    do
        # the Header is the last thing written -- if it's there, update the restart file
        if [ -f ${f}/Header ]; then
            # The scratch FS sometimes gives I/O errors when trying to read
            # from recently-created files, which crashes Castro. Avoid this by
            # making sure we can read from all the data files.
            if head --quiet -c1 "${f}/Header" "${f}"/Level_*/* >/dev/null; then
                restartFile="${f}"
            fi
        fi
    done
}

# look for 7-digit chk files
find_chk_file "*chk???????"

if [ "${restartFile}" = "" ]; then
    # look for 6-digit chk files
    find_chk_file "*chk??????"
fi

if [ "${restartFile}" = "" ]; then
    # look for 5-digit chk files
    find_chk_file "*chk?????"
fi

# restartString will be empty if no chk files are found -- i.e. new run
if [ "${restartFile}" = "" ]; then
    restartString=""
else
    restartString="amr.restart=${restartFile}"
fi

# clean up any run management files left over from previous runs
rm -f dump_and_stop

warning_time=$(bjobs -noheader -o action_warning_time "$LSB_JOBID")
# The `-wa URG -wt <n>` options tell bsub to send SIGURG to all processes n
# minutes before the runtime limit, so we can exit gracefully.
# SIGURG is ignored by default, so it won't make Castro crash.
function sig_handler {
    touch dump_and_stop
    # disable this signal handler
    trap - URG
    echo "BATCH: $warning_time left in allocation; telling Castro to dump a checkpoint and stop"
}
trap sig_handler URG

# execute jsrun in the background then use the builtin wait so the shell can
# handle the signal
jsrun -n$n_res -c$n_cpu_cores_per_res -a$n_mpi_per_res -g$n_gpu_per_res -r$n_max_res_per_node $CASTRO $INPUTS ${restartString} &
wait
# use jswait to wait for Castro (job step 1/1) to finish and get the exit code
jswait 1
Chaining jobs¶
The script job_scripts/summit/chain_submit.sh
can be used to setup job dependencies,
i.e., a job chain.
First you submit a job as usual using bsub
, and make note of the
job-id that it prints upon submission (the same id you would see with
bjobs
or jobstat
). Then you set up N jobs to depend on the one you just submitted as:
chain_submit.sh job-id N submit_script.sh
where you replace job-id
with the id returned from your first
submission, replace N
with the number of additional jobs, and
replace submit_script
with the name of the script you use to
submit the job. This will queue up N additional jobs, each depending
on the previous. Your submission script should use the automatic
restarting features discussed above.
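For example, to queue five more jobs behind an already submitted job whose reported id was 2277103 (an illustrative id), using the Summit script from the previous sections:

chain_submit.sh 2277103 5 luna_script.sh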
Archiving to HPSS¶
You can access HPSS from Summit using the data transfer nodes by submitting a job via SLURM:
sbatch -N 1 -t 15:00 -A ast106 --cluster dtn test_hpss.sh
where test_hpss.sh
is a SLURM script that contains the htar
commands needed to archive your data (the data transfer nodes use SLURM as their job manager).
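A minimal sketch of what test_hpss.sh might contain is shown below; the tar-file and plotfile names are hypothetical, and the process.xrb script described next is the more robust way to do this in practice:

#!/bin/bash
# the node count, walltime, project, and dtn cluster are already given
# on the sbatch command line shown above

# bundle one plotfile directory into a tar file on HPSS (names are illustrative)
htar -cvf plt00000.tar plt00000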
An example is provided by the process.xrb
archiving script in
job_scripts/hpss/
and the associated summit_hpss.submit
submission script
in job_scripts/summit/
. Together these will detect new plotfiles as they
are generated, tar them up (using htar
) and archive them onto HPSS. They
will also store the inputs, probin, and other runtime generated files. If
ftime
is found in your path, it will also create a file called
ftime.out
that lists the simulation time corresponding to each plotfile.
Once the plotfiles are archived they are moved to a subdirectory under
your run directory called plotfiles/
.
By default, the files will be archived to a directory in HPSS with the same
name as the directory your plotfiles are located in. This can be changed
by editing the $HPSS_DIR
variable at the top of process.xrb
.
To use this, we do the following:

1. Copy the process.xrb and summit_hpss.submit scripts into the directory with the plotfiles.

2. Launch the script via:

sbatch summit_hpss.submit
It will run for the full time you requested, searching for plotfiles as they are created and moving them to HPSS as they are produced (it will always leave the very last plotfile alone, since it can't tell whether it is still being written).
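You can verify what has been archived by listing the HPSS directory with hsi from a data transfer node; the directory name below is hypothetical:

hsi ls my_run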
Files may be unarchived in bulk from HPSS on OLCF systems using the
hpss_xfer.py
script, which is available in the job_scripts
directory. It requires Python 3 to be loaded to run. The command:
./hpss_xfer.py plt00000 -s hpss_dir -o plotfile_dir
will fetch hpss_dir/plt00000.tar
from the HPSS filesystem and
unpack it in plotfile_dir
. If run with no arguments in the problem
launch directory, the script will attempt to recover all plotfiles
archived by process.titan
. Try running ./hpss_xfer.py --help
for a description of usage and arguments.
Frontier¶
Machine details¶
Queue policies are here: https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#scheduling-policy
The filesystem is called orion
and is Lustre:
https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#data-and-storage
Submitting jobs¶
Frontier uses SLURM.
Here’s a script that runs with 2 nodes using all 8 GPUs per node:
#!/bin/bash
#SBATCH -A AST106
#SBATCH -J testing
#SBATCH -o %x-%j.out
#SBATCH -t 00:05:00
#SBATCH -p batch
# here N is the number of compute nodes
#SBATCH -N 2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest
EXEC=Castro3d.hip.x86-trento.MPI.HIP.ex
INPUTS=inputs.3d.sph
module load PrgEnv-gnu
module load craype-accel-amd-gfx90a
module load cray-mpich/8.1.27
module load amd-mixed/6.0.0
export OMP_NUM_THREADS=1
export NMPI_PER_NODE=8
export TOTAL_NMPI=$(( ${SLURM_JOB_NUM_NODES} * ${NMPI_PER_NODE} ))
srun -n${TOTAL_NMPI} -N${SLURM_JOB_NUM_NODES} --ntasks-per-node=8 --gpus-per-task=1 ./$EXEC $INPUTS
Note
As of June 2023, it is necessary to explicitly use -n
and -N
on the srun
line.
The job is submitted as:
sbatch frontier.slurm
where frontier.slurm
is the name of the submission script.
A sample job script that includes the automatic restart functions can be found here: https://github.com/AMReX-Astro/workflow/blob/main/job_scripts/frontier/frontier.slurm
Also see the WarpX docs: https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html
Job Status¶
You can check on the status of your jobs via:
squeue --me
and get an estimated start time via:
squeue --me --start
Job Chaining¶
The script chainslurm.sh can be used to start a job chain, with each job depending on the previous. For example, to start up 10 jobs:
chainslurm -1 10 frontier.slurm
If you want to add the chain to an existing queued job, change the -1
to the job-id
of the existing job.
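For example, to hang 10 more jobs off an already queued job whose id is 1234567 (an illustrative id):

chainslurm 1234567 10 frontier.slurm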
Debugging¶
Debugging is done with rocgdb
. Here’s a workflow that works:
Setup the environment:
module load PrgEnv-gnu
module load cray-mpich/8.1.27
module load craype-accel-amd-gfx90a
module load amd-mixed/5.6.0
Build the executable. Usually it’s best to disable MPI if possible
and maybe turn on TEST=TRUE
:
make USE_HIP=TRUE TEST=TRUE USE_MPI=FALSE -j 4
Start up an interactive session:
salloc -A ast106 -J mz -t 0:30:00 -p batch -N 1
This will automatically log you onto the compute node.
Note
It’s a good idea to do:
module restore
and then reload the same modules used for compiling in the interactive shell.
Now set the following environment variables:
export HIP_ENABLE_DEFERRED_LOADING=0
export AMD_SERIALIZE_KERNEL=3
export AMD_SERIALIZE_COPY=3
Note
You can also set
export AMD_LOG_LEVEL=3
to get a lot of information about the GPU calls.
Run the debugger:
rocgdb ./Castro2d.hip.x86-trento.HIP.ex
Set the following inside of the debugger:
set pagination off
b abort
Then run:
run inputs
If it doesn’t crash with a backtrace, then try:
interrupt
bt
Troubleshooting¶
A workaround to prevent hangs in collectives:
export FI_MR_CACHE_MONITOR=memhooks
There are reports that AMReX hangs if the initial Arena size is too big; in that case we should set
amrex.the_arena_init_size=0
The arena size would then grow as needed with time. There is a suggestion that if the size is larger than