Running Options: CPUs and GPUs
Castro uses MPI for coarse parallelization, distributing boxes across compute nodes. For fine-grained parallelism, OpenMP is used for CPU-based computing and CUDA (NVIDIA) or HIP (AMD) is used for GPUs.
Running on CPUs
The preferred way of running on CPUs is to use MPI+OpenMP, compiling as:
USE_MPI=TRUE
USE_OMP=TRUE
Castro uses tiling to divide boxes into smaller tiles and distributes these tiles to the OpenMP threads. This is all managed at the MFIter level – no OpenMP directives need to be present in the compute kernels themselves. See MFIter with Tiling for more information.
The optimal number of OpenMP threads depends on the computer architecture, and some experimentation is needed. Tiling works best with larger boxes, so increasing amr.max_grid_size can benefit performance.
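For illustration, a pure-CPU run on a single node might then be launched as follows (the MPI launcher, the rank and thread counts, and the executable name are placeholders to adjust for your machine and problem):

export OMP_NUM_THREADS=8      # OpenMP threads per MPI rank -- experiment with this
mpiexec -n 4 ./Castro2d.gnu.MPI.OMP.ex inputs    # 4 MPI ranks on the node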
Running on GPUs
Castro’s compute kernels can run on GPUs, and this is the preferred way to run on GPU-based supercomputers. The exact same compute kernels are used on GPUs as on CPUs.
Note
Almost all of Castro runs on GPUs, with the main exception being the true SDC solver (USE_TRUE_SDC = TRUE).
When using GPUs, almost all of the computing is done on the GPUs. In the MFIter loops over boxes, each zone is assigned to its own GPU thread to take advantage of the massive parallelism. The Microphysics routines (EOS, nuclear reaction networks, etc.) also take advantage of GPUs, so entire simulations can be run on the GPU.
Best performance is obtained with bigger boxes, so setting amr.max_grid_size = 128 and amr.blocking_factor = 32 is a good starting point.
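In the inputs file, those recommendations would simply read (illustrative values from above; tune them against the available GPU memory):

amr.max_grid_size = 128    # allow large boxes so each kernel launch has plenty of zones
amr.blocking_factor = 32   # grids are constrained to be multiples of this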
Castro / AMReX have an option to use managed memory for the GPU – this means that the data will automatically be migrated from host to device (and vice versa) as needed, whenever a page fault is encountered. This can be enabled via:
amrex.the_arena_is_managed=1
By default, Castro will abort if it runs out of GPU memory. You can disable this via amrex.abort_on_out_of_gpu_memory=0; together with managed memory, this can allow data to be swapped off of the GPU to make more room available. This is not recommended, since oversubscribing the GPU memory will severely impact performance.
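As an example, both options could be appended to the run command for a debugging run (the executable name is a placeholder, and a large slowdown is expected if the GPU memory is actually oversubscribed):

./Castro3d.gnu.MPI.CUDA.ex inputs amrex.the_arena_is_managed=1 amrex.abort_on_out_of_gpu_memory=0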
The CTU hydrodynamics scheme creates a lot of temporary FABs when doing the update. This can lead to the code oversubscribing the GPU memory during the hydro advance. To alleviate this, Castro can break a box into tiles and work on one tile at a time (this is the approach we use with OpenMP). In the hydro solver, this is controlled by hydro_tile_size. By setting castro.hydro_memory_footprint_ratio to a number > 0, Castro will dynamically estimate a good tile size to use for the hydro during the first timestep and then use it subsequently. The larger the number, the more local memory the hydro solver is allowed to use (generally this means a larger hydro_tile_size). This can allow you to run a problem on a smaller number of GPUs if the hydro temporary memory was the cause of oversubscription. The current recommendation is to try castro.hydro_memory_footprint_ratio between 2.0 and 4.0.
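A minimal sketch of how this might appear in the inputs file (the value is just a starting point within the recommended range):

castro.hydro_memory_footprint_ratio = 2.0   # Castro estimates a hydro tile size during the first step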
NVIDIA GPUs
With NVIDIA GPUs, we use MPI+CUDA, compiled with GCC and the NVIDIA compilers. To enable this, compile with:
USE_MPI = TRUE
USE_OMP = FALSE
USE_CUDA = TRUE
Note
For recent GPUs, like the NVIDIA RTX 4090, you may need to change the default CUDA architecture. This can be done by adding:
CUDA_ARCH=89
to the make line or GNUmakefile.
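For example, a build targeting an RTX 4090 might look like (the parallel-make count is arbitrary):

make -j 8 USE_MPI=TRUE USE_CUDA=TRUE CUDA_ARCH=89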
Note
CUDA 11.2 and later can do link time optimization. This can increase performance by 10-30% (depending on the application), but may greatly increase the compilation time. This is disabled by default. To enable link time optimization, add:
CUDA_LTO=TRUE
to the make line or GNUmakefile.
AMD GPUs
For AMD GPUs, we use MPI+HIP, compiled with the ROCm compilers. To enable this, compile with:
USE_MPI = TRUE
USE_OMP = FALSE
USE_HIP = TRUE
Printing Warnings from GPU Kernels
Castro will output warnings if certain assumptions are violated (often triggering a retry in the process). On GPUs, printing from a kernel (using printf()) can increase the number of registers a kernel needs, causing performance problems. As a result, warnings are disabled by wrapping them in #ifndef AMREX_USE_GPU.
However, for debugging GPU runs, sometimes we want to see these warnings. The build option USE_GPU_PRINTF=TRUE will enable these (by setting the preprocessor flag ALLOW_GPU_PRINTF).
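For example, assuming a CUDA build, this could be added alongside the usual build options:

make USE_MPI=TRUE USE_CUDA=TRUE USE_GPU_PRINTF=TRUE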
Note
Not every warning has been enabled for GPUs.
Tip
On AMD architectures, it seems necessary to use unbuffered I/O. This can be accomplished in the job submission script (for SLURM) by doing
srun -u ./Castro...
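A minimal SLURM sketch of this (the node count, task/GPU layout, and executable name are placeholders that depend on the machine):

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

# -u requests unbuffered I/O
srun -u ./Castro3d.hip.MPI.HIP.ex inputs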
Working at Supercomputing Centers
Our best practices for running any of the AMReX Astrophysics codes at different supercomputing centers are collected in our workflow documentation: https://amrex-astro.github.io/workflow/