.. highlight:: bash

Managing Jobs at OLCF
=====================

Summit
------

Summit Architecture:
^^^^^^^^^^^^^^^^^^^^

Let us start by reviewing the node architecture of Summit. Our goal is to
provide the insight needed to make better decisions when constructing our
AMReX-Astro job scripts, and to explain how our code interacts with Summit.
The information in this section is a condensed version of the
`Summit documentation guide <https://docs.olcf.ornl.gov/systems/summit_user_guide.html>`_,
and should not replace it.

On Summit, a node is composed of two sockets, each with 21 physical CPU cores
(plus 1 reserved for the system), 3 GPUs, and 1 RAM memory bank. The sockets
are connected by a bus that allows communication between them. Each physical
CPU core supports up to 4 hardware threads. The whole structure of the node
can be depicted as follows:

.. figure:: ./figs/summit-node-description-1.png
   :width: 100%
   :align: center

   Figure extracted from ``https://docs.olcf.ornl.gov/systems/summit_user_guide.html#job-launcher-jsrun``.

A resource set is a minimal collection of physical CPU cores and GPUs on which
a certain number of MPI processes and OpenMP threads operate through the
execution of the code. Therefore, for each resource set, we need to allocate:

- A number of physical CPU cores.
- A number of physical GPUs.
- A number of MPI processes.
- The number of OpenMP threads.

Each core supports up to 4 threads; however, this option is not supported in
AMReX, so we will not discuss it further here. For now, we fix only one thread
through the whole execution of our code.

The next step is to determine the maximum number of resource sets that fit
into one node. In Castro we construct each resource set with 1 physical CPU
core, 1 GPU, and only 1 MPI process. According to the node architecture
depicted in Figure 1, we can fit up to 6 resource sets per node, as shown in
Figure 2.

.. figure:: ./figs/image56.png
   :width: 100%
   :align: center

   Figure modified and extracted from ``https://docs.olcf.ornl.gov/systems/summit_user_guide.html#job-launcher-jsrun``.

Requesting Allocation:
^^^^^^^^^^^^^^^^^^^^^^

To allocate the resource sets we invoke the command ``bsub`` with a number of
flags:

.. list-table::
   :widths: 25 75
   :header-rows: 1

   * - Flag
     - Description
   * - ``-nnodes``
     - allocates the number of nodes we need to run our code. It is important
       to perform the calculation described in the previous section to select
       the correct number of nodes for our setup.
   * - ``-W``
     - sets the walltime of the allocation. The format used on Summit is
       [hours:]minutes; seconds cannot be specified. The maximum walltime we
       can request is 03:00 (three hours).
   * - ``-alloc_flags``
     - sets the maximum number of hardware threads available per CPU core.
       The default option is ``smt4``, which allows 4 threads per core.
       However, since we use only one thread through the whole execution, we
       set the option ``smt1``. ``-alloc_flags`` accepts other options as
       well, but this is the only one we are interested in here.
   * - ``-J``
     - defines the name of the job allocation. The value ``%J`` used below
       corresponds to the job ID number.
   * - ``-o``
     - defines the **name of the output file that contains the standard
       output stream** from running all the jobs inside the requested
       allocation.
   * - ``-e``
     - defines the **name of the output file containing the standard error
       stream**, similar to the ``-o`` flag. If ``-e`` is not supplied, the
       ``-o`` option is assumed by default.
   * - ``-q``
     - defines the queue on which our application will run. There are several
       options; however, we alternate between two: the standard production
       queue ``batch`` and the debugging queue ``debug``. The ``debug`` queue
       is designed to allocate a small number of nodes in order to check that
       our code runs smoothly and without bugs.
   * - ``-Is``
     - flags an interactive job, followed by the shell name. The Unix bash
       shell option is ``/bin/bash``. This flag is very useful for debugging,
       because the standard output can be checked as the code is running.
       It is important to mention that an interactive job can only be
       requested from the command line, not by running a bash script.

For example, if we want to allocate one node to run an interactive job in the
debug queue for 30 minutes, we may use:

.. prompt:: bash

   bsub -nnodes 1 -q debug -W 0:30 -P ast106 -alloc_flags smt1 -J example -o stdout_to_show.%J -e stderr_to_show.%J -Is /bin/bash

.. note::

   An interactive job can only be allocated from the command line. No script
   can be defined for interactive jobs.

Submitting a Job:
^^^^^^^^^^^^^^^^^

Once our allocation is granted, it is important to load the same modules used
to compile the executable and to export the variable ``OMP_NUM_THREADS`` to
set the number of threads per MPI process. In Castro, we have used the
following modules:

.. code-block::

   module load gcc/10.2.0
   module load cuda/11.5.2
   module load python

and fixed only one thread per MPI process by:

.. code-block::

   export OMP_NUM_THREADS=1

The next step is to submit our job. The command ``jsrun``, provided with the
*total number of resource sets*, the *number of physical CPU cores per
resource set*, the *number of GPUs per resource set*, the *number of MPI
processes per resource set*, and the *maximum number of resource sets per
node*, works as follows:

.. prompt:: bash

   jsrun -n[number of resource sets] -c[number of CPU physical cores] -g[number of GPUs] -a[number of MPI processes] -r[number of max resources per node] ./[executable] [executable inputs]

In Castro we will use:

.. prompt:: bash

   jsrun -n [number of resource sets] -a1 -c1 -g1 -r6 ./$CASTRO $INPUTS

where the ``CASTRO`` and ``INPUTS`` environment variables are placeholders for
the executable and input file names, respectively.

Now, in order to use all the resources we have allocated, the number of
resource sets should match the number of AMReX boxes (grids) on the level
that contains the largest number of them. Let us consider an excerpt from the
standard output of a Castro problem:

.. code-block::

   INITIAL GRIDS
   Level 0   2 grids  32768 cells  100 % of domain
             smallest grid: 128 x 128  biggest grid: 128 x 128
   Level 1   8 grids  131072 cells  100 % of domain
             smallest grid: 128 x 128  biggest grid: 128 x 128
   Level 2   8 grids  524288 cells  100 % of domain
             smallest grid: 256 x 256  biggest grid: 256 x 256
   Level 3   32 grids  2097152 cells  100 % of domain
             smallest grid: 256 x 256  biggest grid: 256 x 256
   Level 4   128 grids  7864320 cells  93.75 % of domain
             smallest grid: 256 x 128  biggest grid: 256 x 256
   Level 5   480 grids  30408704 cells  90.625 % of domain
             smallest grid: 256 x 128  biggest grid: 256 x 256

In this example, Level 5 contains the largest number of AMReX boxes: 480.
From here, a good allocation for this problem is 480 resource sets, which is
equivalent to 80 nodes with 6 resource sets per node.
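The node count is simply the number of resource sets divided by the number of
resource sets per node, rounded up. A minimal sketch of this bookkeeping in
bash (the variable names here are our own and are not required by ``bsub``):

.. code-block:: bash

   # Ceiling division: number of nodes needed for a given number of resource
   # sets, assuming 6 resource sets per node on Summit.
   n_res=480
   n_res_per_node=6
   n_nodes=$(( (n_res + n_res_per_node - 1) / n_res_per_node ))

   echo "request ${n_nodes} nodes, e.g.: bsub -nnodes ${n_nodes} ..."   # -> 80 nodes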
However, note that Level 0 uses only 2 AMReX boxes. This implies that, of the
480 resource sets available, 478 will remain idle until the two working
processes sweep the entire Level 0.

.. note::

   Therefore, it is important, if possible, to keep the number of boxes on
   each level balanced in order to maximize the use of the allocated
   resources.

Writing a Job Script:
^^^^^^^^^^^^^^^^^^^^^

In order to make our life easier, instead of typing an allocation command
line, loading the modules, setting the threads per MPI process, and typing
another command line to submit our jobs, we can write a script that packs all
these commands into one executable ``.sh`` file that can be submitted via
``bsub`` just once.

We start our job script by invoking the shell with the line ``#!/bin/bash``.
Then we add the ``bsub`` allocation flags, each starting with ``#BSUB``, as
follows:

.. code-block:: bash

   #!/bin/bash
   #BSUB -P ast106
   #BSUB -W 2:00
   #BSUB -nnodes 80
   #BSUB -alloc_flags smt1
   #BSUB -J luna_script
   #BSUB -o luna_output.%J
   #BSUB -e luna_sniffing_output.%J

In addition, we add the module statements and fix only one thread per MPI
process:

.. code-block::

   module load gcc/10.2.0
   module load cuda/11.5.2
   module load python

   export OMP_NUM_THREADS=1

and define the environment variables:

.. code-block::

   CASTRO=./Castro2d.gnu.MPI.CUDA.ex
   INPUTS=inputs_luna

   n_res=480               # The max allocated number of resource sets is
   n_cpu_cores_per_res=1   # nnodes * n_max_res_per_node. In this case we will
   n_mpi_per_res=1         # use all the allocated resource sets to run the job
   n_gpu_per_res=1         # below.
   n_max_res_per_node=6

Once the allocation time ends, the job is killed, leaving us back where we
started. As we pointed out, the maximum allocation time on Summit is 03:00
(three hours), but we may sometimes need weeks, months, or even longer to
complete our runs. This is where the automatic restarting section of the
script comes to the rescue.

From here we can add an optional (or mandatory) setting to our script. As the
code executes, after a certain number of timesteps, it creates checkpoint
files of the form ``chkxxxxxxx``, ``chkxxxxxx``, or ``chkxxxxx``. These
checkpoint files can be read by our executable to resume the run from the
simulation time at which the checkpoint was created. This is implemented as
follows:

.. code-block::

   function find_chk_file {
       # find_chk_file takes a single argument -- the wildcard pattern
       # for checkpoint files to look through
       chk=$1

       # find the latest 2 restart files.  This way if the latest didn't
       # complete we fall back to the previous one.
       temp_files=$(find . -maxdepth 1 -name "${chk}" -print | sort | tail -2)
       restartFile=""
       for f in ${temp_files}
       do
           # the Header is the last thing written -- if it's there, update the restart file
           if [ -f ${f}/Header ]; then
               restartFile="${f}"
           fi
       done
   }

   # look for 7-digit chk files
   find_chk_file "*chk???????"

   if [ "${restartFile}" = "" ]; then
       # look for 6-digit chk files
       find_chk_file "*chk??????"
   fi

   if [ "${restartFile}" = "" ]; then
       # look for 5-digit chk files
       find_chk_file "*chk?????"
   fi

   # restartString will be empty if no chk files are found -- i.e. new run
   if [ "${restartFile}" = "" ]; then
       restartString=""
   else
       restartString="amr.restart=${restartFile}"
   fi

The function ``find_chk_file`` searches the submission directory for
checkpoint files. Because AMReX appends digits as the number of steps
increases (with a minimum of 5 digits), we search for files with 7 digits,
then 6 digits, and finally 5 digits, to ensure we pick up the latest file.
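As a quick illustration of how the patterns pick up files (the checkpoint
names here are made up for the example), you can exercise the function
directly in a run directory after sourcing it:

.. code-block:: bash

   # Suppose the run directory contains chk0009950 and chk0010000 (7 digits
   # each), both with a complete Header file.  The 7-digit pattern is tried
   # first, and the sort / tail -2 logic selects the latest complete checkpoint.
   find_chk_file "*chk???????"
   echo "would restart from: ${restartFile:-<nothing -- fresh start>}"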
We can also ask the job manager to send a warning signal some amount of time
before the allocation expires by passing ``-wa 'signal'`` and
``-wt '[hour:]minute'`` to ``bsub``. We can then have bash create a
``dump_and_stop`` file when it receives the signal, which will tell Castro to
output a checkpoint file and exit cleanly after it finishes the current
timestep. An important detail that we could not find documented anywhere is
that the job manager sends the signal to all the processes in the job, not
just the submission script, so we have to use a signal that is ignored by
default, otherwise Castro would immediately crash upon receiving it. SIGCHLD,
SIGURG, and SIGWINCH are the only signals that fit this requirement, and of
these, SIGURG is the least likely to be triggered by other events.

.. code-block:: bash

   #BSUB -wa URG
   #BSUB -wt 2

   ...

   function sig_handler {
       touch dump_and_stop
       # disable this signal handler
       trap - URG
       echo "BATCH: allocation ending soon; telling Castro to dump a checkpoint and stop"
   }
   trap sig_handler URG

We use the ``jsrun`` command to launch Castro on the compute nodes. In order
for bash to handle the warning signal before Castro exits, we must put
``jsrun`` in the background and use the shell builtin ``wait``:

.. code-block:: bash

   jsrun -n$n_res -c$n_cpu_cores_per_res -a$n_mpi_per_res -g$n_gpu_per_res -r$n_max_res_per_node $CASTRO $INPUTS ${restartString} &
   wait

   # use jswait to wait for Castro (job step 1/1) to finish and get the exit code
   jswait 1

Finally, once the script is completed and saved as ``luna_script.sh``, we can
submit it with:

.. prompt:: bash

   bsub luna_script.sh

Monitoring a Job:
^^^^^^^^^^^^^^^^^

You can monitor the status of your jobs using ``bjobs``. A slightly nicer
view of your jobs can be obtained using ``jobstat``:

.. prompt:: bash

   jobstat -u username

Script Template:
^^^^^^^^^^^^^^^^

Packing all of the above together leads us to the following script template:

.. literalinclude:: ../../job_scripts/summit/summit_template.sh
   :language: sh
   :linenos:

Chaining jobs
^^^^^^^^^^^^^

The script ``job_scripts/summit/chain_submit.sh`` can be used to set up job
dependencies, i.e., a job chain. First you submit a job as usual using
``bsub`` and make note of the job-id that it prints upon submission (the same
id you would see with ``bjobs`` or ``jobstat``). Then you set up N jobs that
depend on the one you just submitted as:

.. prompt:: bash

   chain_submit.sh job-id N submit_script.sh

where you replace ``job-id`` with the id returned from your first submission,
``N`` with the number of additional jobs, and ``submit_script.sh`` with the
name of the script you use to submit the job. This will queue up N additional
jobs, each depending on the previous one. Your submission script should use
the automatic restarting features discussed above.

Archiving to HPSS
-----------------

You can access HPSS from Summit using the data transfer nodes by submitting a
job via SLURM:

.. prompt:: bash

   sbatch -N 1 -t 15:00 -A ast106 --cluster dtn test_hpss.sh

where ``test_hpss.sh`` is a SLURM script that contains the ``htar`` commands
needed to archive your data. The data transfer nodes use ``slurm`` as the job
manager.

An example is provided by the ``process.xrb`` archiving script in
``job_scripts/hpss/`` and the associated ``summit_hpss.submit`` submission
script in ``job_scripts/summit/``. Together these will detect new plotfiles
as they are generated, tar them up (using ``htar``), and archive them onto
HPSS. They will also store the inputs, probin, and other runtime generated
files.
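For orientation only, a minimal ``test_hpss.sh`` along the lines described
above might look like the sketch below; the plotfile name and the HPSS
destination directory are placeholders, and the real ``process.xrb`` script is
considerably more capable:

.. code-block:: bash

   #!/bin/bash
   #SBATCH -N 1
   #SBATCH -t 15:00
   #SBATCH -A ast106
   #SBATCH --cluster dtn

   # Hypothetical example: bundle one plotfile directory into a tar archive
   # on HPSS.  Replace HPSS_DIR and plt00000 with your own destination and
   # plotfile name.
   HPSS_DIR=my_run_directory
   htar -cvf ${HPSS_DIR}/plt00000.tar plt00000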
If ``ftime`` is found in your path, ``process.xrb`` will also create a file
called ``ftime.out`` that lists the simulation time corresponding to each
plotfile. Once the plotfiles are archived, they are moved to a subdirectory
under your run directory called ``plotfiles/``.

By default, the files will be archived to a directory in HPSS with the same
name as the directory your plotfiles are located in. This can be changed by
editing the ``$HPSS_DIR`` variable at the top of ``process.xrb``.

To use this, we do the following:

#. Copy the ``process.xrb`` and ``summit_hpss.submit`` scripts into the
   directory with the plotfiles.

#. Launch the script via:

   .. prompt:: bash

      sbatch summit_hpss.submit

It will run for the full time you requested, searching for plotfiles as they
are created and moving them to HPSS as they are produced (it will always
leave the very last plotfile alone, since it can't tell whether it is still
being written).

Files may be unarchived in bulk from HPSS on OLCF systems using the
``hpss_xfer.py`` script, which is available in the ``job_scripts`` directory.
It requires Python 3 to be loaded to run. The command:

.. prompt:: bash

   ./hpss_xfer.py plt00000 -s hpss_dir -o plotfile_dir

will fetch ``hpss_dir/plt00000.tar`` from the HPSS filesystem and unpack it in
``plotfile_dir``. If run with no arguments in the problem launch directory,
the script will attempt to recover all plotfiles archived by
``process.titan``. Try running :code:`./hpss_xfer.py --help` for a
description of usage and arguments.

Frontier
--------

Machine details
^^^^^^^^^^^^^^^

Queue policies are described here:
https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#scheduling-policy

The filesystem is called ``orion``, and it is Lustre:
https://docs.olcf.ornl.gov/systems/frontier_user_guide.html#data-and-storage

Submitting jobs
^^^^^^^^^^^^^^^

Frontier uses SLURM. Here's a script that runs on 2 nodes using all 8 GPUs
per node:

.. code:: bash

   #!/bin/bash
   #SBATCH -A AST106
   #SBATCH -J testing
   #SBATCH -o %x-%j.out
   #SBATCH -t 00:05:00
   #SBATCH -p batch
   # here N is the number of compute nodes
   #SBATCH -N 2
   #SBATCH --ntasks-per-node=8
   #SBATCH --cpus-per-task=7
   #SBATCH --gpus-per-task=1
   #SBATCH --gpu-bind=closest

   EXEC=Castro3d.hip.x86-trento.MPI.HIP.ex
   INPUTS=inputs.3d.sph

   module load PrgEnv-gnu
   module load craype-accel-amd-gfx90a
   module load cray-mpich/8.1.27
   module load amd-mixed/6.0.0

   export OMP_NUM_THREADS=1
   export NMPI_PER_NODE=8
   export TOTAL_NMPI=$(( ${SLURM_JOB_NUM_NODES} * ${NMPI_PER_NODE} ))

   srun -n${TOTAL_NMPI} -N${SLURM_JOB_NUM_NODES} --ntasks-per-node=8 --gpus-per-task=1 ./$EXEC $INPUTS

.. note::

   As of June 2023, it is necessary to explicitly use ``-n`` and ``-N`` on
   the ``srun`` line.

The job is submitted as:

.. prompt:: bash

   sbatch frontier.slurm

where ``frontier.slurm`` is the name of the submission script.

A sample job script that includes the automatic restart functions can be
found here:
https://github.com/AMReX-Astro/workflow/blob/main/job_scripts/frontier/frontier.slurm

Also see the WarpX docs:
https://warpx.readthedocs.io/en/latest/install/hpc/frontier.html

Job Status
^^^^^^^^^^

You can check on the status of your jobs via:

.. prompt:: bash

   squeue --me

and get an estimated start time via:

.. prompt:: bash

   squeue --me --start

Job Chaining
^^^^^^^^^^^^

The ``chainslurm.sh`` script can be used to start a job chain, with each job
depending on the previous one. For example, to start up 10 jobs:

.. prompt:: bash

   chainslurm -1 10 frontier.slurm

If you want to add the chain to an existing queued job, change the ``-1`` to
the job-id of the existing job.
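If you prefer not to use the helper script, the same effect can be achieved
manually with SLURM job dependencies. The sketch below is only illustrative
and is not necessarily how ``chainslurm.sh`` is implemented:

.. code-block:: bash

   # Submit the first job; --parsable makes sbatch print just the job id.
   jobid=$(sbatch --parsable frontier.slurm)

   # Queue two more jobs, each allowed to start only after the previous one
   # ends (afterany fires whether the previous job succeeded or failed).
   jobid=$(sbatch --parsable --dependency=afterany:${jobid} frontier.slurm)
   jobid=$(sbatch --parsable --dependency=afterany:${jobid} frontier.slurm)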
Debugging
^^^^^^^^^

Debugging is done with ``rocgdb``. Here's a workflow that works:

Set up the environment:

.. prompt:: bash

   module load PrgEnv-gnu
   module load cray-mpich/8.1.27
   module load craype-accel-amd-gfx90a
   module load amd-mixed/5.6.0

Build the executable. Usually it's best to disable MPI if possible and maybe
turn on ``TEST=TRUE``:

.. prompt:: bash

   make USE_HIP=TRUE TEST=TRUE USE_MPI=FALSE -j 4

Start an interactive session:

.. prompt:: bash

   salloc -A ast106 -J mz -t 0:30:00 -p batch -N 1

This will automatically log you onto the compute node.

.. note::

   It's a good idea to do:

   .. prompt:: bash

      module restore

   and then reload *the same* modules used for compiling in the interactive
   shell.

Now set the following environment variables:

.. prompt:: bash

   export HIP_ENABLE_DEFERRED_LOADING=0
   export AMD_SERIALIZE_KERNEL=3
   export AMD_SERIALIZE_COPY=3

.. note::

   You can also set

   .. prompt:: bash

      export AMD_LOG_LEVEL=3

   to get *a lot* of information about the GPU calls.

Run the debugger:

.. prompt:: bash

   rocgdb ./Castro2d.hip.x86-trento.HIP.ex

Set the following inside of the debugger:

.. prompt::
   :prompts: (gdb)

   set pagination off
   b abort

Then run:

.. prompt::
   :prompts: (gdb)

   run inputs

If it doesn't crash with a backtrace (e.g., it hangs), then try:

.. prompt::
   :prompts: (gdb)

   interrupt
   bt

Troubleshooting
^^^^^^^^^^^^^^^

Workaround to prevent hangs for collectives:

::

   export FI_MR_CACHE_MONITOR=memhooks

There are some reports from AMReX that it hangs if the initial Arena size is
too big, in which case we should set:

::

   amrex.the_arena_init_size=0

The arena size would then grow as needed with time. There is a suggestion
that if the size is larger than