This chapter describes the GPU support in MAESTROeX, including the necessary build parameters, how to offload a routine to the GPU, and some basic profiling and debugging options.

MAESTROeX has so far only been tested on NVIDIA GPUs with CUDA. In theory, AMD GPUs with HIP should also work.

Building GPU Support

To build MAESTROeX with GPU support, add the following line to the GNUmakefile:

USE_CUDA = TRUE


We also need to set USE_OMP = FALSE, because OpenMP is currently not compatible with CUDA builds: setting both USE_CUDA = TRUE and USE_OMP = TRUE will fail to compile. MPI, however, can be combined with CUDA for additional parallelization.

Depending on which system you are running on, it may be necessary to specify the CUDA compute capability using the CUDA_ARCH flag. The capability depends on the specific GPU hardware you are running on. On a Linux system, the capability of your device can typically be found by compiling and running the deviceQuery sample program found in the CUDA samples directory: /usr/local/cuda/samples/1_Utilities/deviceQuery (its exact location may vary depending on where CUDA is installed on your system). The default value of this flag is 70, corresponding to a capability of 7.x. For a device with capability 6.x, the flag should be set to:

CUDA_ARCH = 60


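The relationship between a device's compute capability and the CUDA_ARCH value described above can be sketched in a few lines of C++. This is only an illustration of the numbering convention stated in this section (7.x gives 70, 6.x gives 60). The function names baseline_cuda_arch and baseline_cuda_arch_from_string are hypothetical and are not part of MAESTROeX or AMReX.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of MAESTROeX or AMReX): maps the major
// version of a compute capability to the CUDA_ARCH value described in
// the text, e.g. any 7.x device -> 70 and any 6.x device -> 60.
int baseline_cuda_arch(int major)
{
    assert(major > 0);   // capability versions start at 1.x
    return major * 10;
}

// Accepts a capability written as "major.minor" (for example "6.1")
// and returns the corresponding baseline CUDA_ARCH value.
int baseline_cuda_arch_from_string(const std::string& capability)
{
    // std::stoi stops at the first non-digit, so "6.1" parses as 6.
    return baseline_cuda_arch(std::stoi(capability));
}
```

For example, baseline_cuda_arch_from_string("6.1") evaluates to 60, matching the setting given above for a 6.x device.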
Profiling with GPUs

NVIDIA’s profiler, nvprof, is recommended when profiling for GPUs. It reports how long each kernel launch lasted on the GPU, the number of threads and registers used, and the occupancy of the GPU, and it provides recommendations for improving the code. For more information on how to use nvprof, see NVIDIA’s User’s Guide.

If a quicker profiling method is preferred, AMReX’s timers can be used to report some generic timings that may be useful in categorizing an application. To yield a consistent timing of a routine, a timer will need to be wrapped around an MFIter loop that encompasses the entire set of GPU launches contained within. For example:

BL_PROFILE_VAR("A_NAME", blp);     // Profiling start
for (MFIter mfi(mf); mfi.isValid(); ++mfi)
{
    // code that runs on the GPU
}
BL_PROFILE_STOP(blp);              // Profiling stop

For now, this is the best way to profile GPU code when building with the flag TINY_PROFILE = TRUE. If you require further profiling detail, use nvprof.
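The start/stop pattern of bracketing an entire loop with one named timer can be sketched in plain C++. This is a stand-in under stated assumptions: RegionTimer and timed_loop are hypothetical names, std::chrono takes the place of AMReX's profiler, and the loop body is ordinary CPU arithmetic rather than GPU launches.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a named profiling region; this is NOT how
// AMReX implements BL_PROFILE_VAR / BL_PROFILE_STOP.
class RegionTimer {
public:
    using Clock = std::chrono::steady_clock;

    explicit RegionTimer(std::string name)
        : m_name(std::move(name)), m_start(Clock::now()) {}

    // Stop the timer and add the elapsed seconds to the named total.
    void stop()
    {
        if (m_running) {
            std::chrono::duration<double> dt = Clock::now() - m_start;
            totals()[m_name] += dt.count();
            m_running = false;
        }
    }

    // Accumulated seconds per region name.
    static std::map<std::string, double>& totals()
    {
        static std::map<std::string, double> t;
        return t;
    }

private:
    std::string m_name;
    Clock::time_point m_start;
    bool m_running = true;
};

// One timer brackets the entire loop, so "A_NAME" receives a single
// timing that covers every iteration.
double timed_loop(int n)
{
    assert(n >= 0);
    RegionTimer timer("A_NAME");   // profiling start
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += 0.5 * i;            // placeholder for the per-iteration work
    }
    timer.stop();                  // profiling stop
    return sum;
}
```

After a call such as timed_loop(4), RegionTimer::totals() holds one entry under "A_NAME" with the elapsed time for the whole loop.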

Basic GPU Debugging

  • Turn off GPU offloading for some part of the code with

Gpu::setLaunchRegion(0); ... ; Gpu::setLaunchRegion(1);

  • To test if your kernels have launched, run

nvprof ./

  • Run under nvprof -o profile%p.nvvp ./ for a small problem and examine page faults using NVIDIA’s visual profiler, nvvp

  • Run under cuda-memcheck

  • Run under cuda-gdb

  • Run with CUDA_LAUNCH_BLOCKING=1, so that only one kernel runs at a time. This can help identify race conditions.
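
When comparing debugging runs, it can help to record whether blocking launches were requested for the current process. A minimal sketch, assuming only the C++ standard library: launch_blocking_requested is a hypothetical helper, not an AMReX or CUDA function, and it only inspects the environment variable.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: reports whether the process was started with
// CUDA_LAUNCH_BLOCKING=1 in its environment. It does not query the
// CUDA runtime; it only reads the environment variable.
bool launch_blocking_requested()
{
    const char* value = std::getenv("CUDA_LAUNCH_BLOCKING");
    return value != nullptr && std::string(value) == "1";
}
```

A driver could call this once at startup and print the result alongside its other run parameters, so that timings from blocking and non-blocking runs are not confused.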