.. _ch:mpiplusx: ****************************** Running Options: CPUs and GPUs ****************************** Castro uses MPI for coarse parallelization, distributing boxes across compute nodes. For fine-grained parallelism, OpenMP is used for CPU-based computing and CUDA is used for GPUs. Running on CPUs =============== The preferred was of running on CPUs is to use MPI+OpenMP, compiling as:: USE_MPI=TRUE USE_OMP=TRUE Castro uses tiling to divide boxes into smaller tiles and distributes these tiles to the OpenMP threads. This is all managed at the MFIter level -- no OpenMP directives need to be present in the compute kernels themselves. See `MFIter with Tiling `_ for more information. The optimal number of OpenMP threads depends on the computer architecture, and some experimentation is needed. Tiling works best with larger boxes, so increasing ``amr.max_grid_size`` can benefit performance. Running on GPUs =============== Castro's compute kernels can run on GPUs and this is the preferred way to run on supercomputers with GPUs. The exact same compute kernels are used on GPUs as on CPUs. .. note:: Almost all of Castro runs on GPUs, with the main exception being the true SDC solver (``USE_TRUE_SDC = TRUE``). When using GPUs, almost all of the computing is done on the GPUs. In the ``MFIter`` loops over boxes, the loops put a single zone on each GPU thread, to take advantage of the massive parallelism. The Microphysics routines (EOS, nuclear reaction networks, etc.) also take advantage of GPUs, so entire simulations can be run on the GPU. Best performance is obtained with bigger boxes, so setting ``amr.max_grid_size = 128`` and ``amr.blocking_factor = 32`` can give good performance. Castro / AMReX have an option to use managed memory for the GPU -- this means that the data will automatically be migrated from host to device (and vice versa) as needed, whenever a page fault is encountered. This can be enabled via: ``amrex.the_arena_is_managed=1``. By default, Castro will abort if it runs out of GPU memory. You can disable this via ``amrex.abort_on_out_of_gpu_memory=0`` -- together with running with managed memory, this can allow the memory to be swapped off of the GPU to make more room available. This is not recommended -- oversubscribing the GPU memory will severely impact performance. .. index:: castro.hydro_memory_footprint_ratio The CTU hydrodynamics scheme creates a lot of temporary ``FAB`` 's when doing the update. This can lead to the code oversubscribing the GPU memory during the hydro advance. To alleviate this, Castro can break a box into tiles and work on one tile at a time (this is the approach we use with OpenMP). In the hydro solver, this is controlled by ``hydro_tile_size``. By setting ``castro.hydro_memory_footprint_ratio`` to a number > 0, Castro will dynamically estimate a good tile size to use for the hydro during the first timestep and then use this subsequently. This larger the number, the more local memory we will allow the hydro solver to use (generally this means a larger ``hydro_tile_size``). This can allow you to run a problem on a smaller number of GPUs if the hydro temporary memory was the cause of oversubscription. Current recommendations are to try ``castro.hydro_memory_footprint_ratio`` between ``2.0`` and ``4.0``. NVIDIA GPUs ----------- With NVIDIA GPUs, we use MPI+CUDA, compiled with GCC and the NVIDIA compilers. To enable this, compile with:: USE_MPI = TRUE USE_OMP = FALSE USE_CUDA = TRUE .. note:: For recent GPUs, like the NVIDIA RTX 4090, you may need to change the default CUDA architecture. This can be done by adding: .. code:: CUDA_ARCH=89 to the ``make`` line or ``GNUmakefile``. AMD GPUs -------- For AMD GPUs, we use MPI+HIP, compiled with the ROCm compilers. To enable this, compile with:: USE_MPI = TRUE USE_OMP = FALSE USE_HIP = TRUE Working at Supercomputing Centers ================================= Our best practices for running any of the AMReX Astrophysics codes at different supercomputing centers is produced in our workflow documentation: https://amrex-astro.github.io/workflow/