.. _sec:gpu:

***
GPU
***

In this chapter, we present the GPU support in MAESTROeX,
including the necessary build parameters, how to offload a routine
to the GPU, and some basic profiling and debugging options.
Note that currently MAESTROeX only supports NVIDIA GPUs.

Requirements
============

MAESTROeX has only been tested with NVIDIA GPUs using CUDA.  In theory,
AMD GPUs using HIP should also work, but this has not been tested.

.. _sec:gpubuild:

Building GPU Support
====================

To build MAESTROeX with GPU support, set the following variable
in the GNUmakefile:

::

      USE_CUDA := TRUE

We also need to set ``USE_OMP = FALSE``, because OpenMP is currently
not compatible with building with CUDA; setting both ``USE_CUDA = TRUE``
and ``USE_OMP = TRUE`` will fail to compile.
However, you may use MPI with CUDA for additional parallelization,
as in the sketch below.
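
For reference, a GPU-enabled build might combine settings like the
following in the GNUmakefile; ``USE_MPI`` is the standard AMReX build
variable for MPI and is only needed if you want the MPI+CUDA combination
mentioned above:

::

      USE_CUDA := TRUE
      USE_OMP  := FALSE
      USE_MPI  := TRUE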


Depending on which system you are running on, it may be necessary to specify
the CUDA Capability using the ``CUDA_ARCH`` flag. The CUDA Capability will
depend on the specific GPU hardware you are running on. On a Linux system, the
capability of your device can typically be found by compiling and running the
``deviceQuery`` sample program in the CUDA samples directory:
``/usr/local/cuda/samples/1_Utilities/deviceQuery`` (its exact location may
vary depending on where CUDA is installed on your system). The default value of
this flag is 70, corresponding to a capability of 7.x. For a device with
capability 6.x, the flag should be set to:

::

    CUDA_ARCH := 60
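
For reference, querying the capability with ``deviceQuery`` might look like
the following (this assumes the default ``/usr/local/cuda`` install location;
you may need to copy the samples to a writable directory before building them):

.. code-block:: sh

    cd /usr/local/cuda/samples/1_Utilities/deviceQuery
    make           # build the sample
    ./deviceQuery  # look for the "CUDA Capability Major/Minor version number" line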

.. _sec:gpuporting:



.. _sec:gpuprofile:

Profiling with GPUs
===================

NVIDIA's profiler, ``nvprof``, is recommended when profiling for GPUs.
It reports how long each kernel launch lasted on the GPU, the number of
threads and registers used, and the occupancy of the GPU, and it provides
recommendations for improving the code.  For more information on how to
use ``nvprof``, see NVIDIA's Profiler User's Guide.
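
A minimal invocation might look like the following, where ``Maestro2d.xxx``
and ``inputs`` are placeholder executable and inputs-file names and
``--print-gpu-summary`` asks ``nvprof`` for a per-kernel summary:

.. code-block:: sh

    nvprof --print-gpu-summary ./Maestro2d.xxx inputs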

If a quicker profiling method is preferred, AMReX's timers can be used
to report some generic timings that may be useful in categorizing an application.
To yield a consistent timing of a routine, a timer will need to be wrapped
around an ``MFIter`` loop that encompasses the entire set of GPU launches
contained within. For example:

.. code-block:: c++

    BL_PROFILE_VAR("A_NAME", blp);     // Profiling start
    for (MFIter mfi(mf); mfi.isValid(); ++mfi)
    {
        // code that runs on the GPU
    }
    BL_PROFILE_VAR_STOP(blp);          // Profiling stop

For now, this is the best way to profile GPU codes with AMReX's built-in timers,
which are enabled by the ``TINY_PROFILE`` build flag shown below.
If you require further profiling detail, use ``nvprof``.
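
For reference, enabling these timers is a one-line addition to the GNUmakefile:

::

      TINY_PROFILE := TRUE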

.. _sec:gpudebug:

Basic GPU Debugging
===================

- Turn off GPU offloading for some part of the code with

.. code-block:: c++

    Gpu::setLaunchRegion(0);  // disable GPU kernel launches
    ... ;                     // this region of code now runs on the CPU
    Gpu::setLaunchRegion(1);  // re-enable GPU kernel launches

- To test if your kernels have launched, run

.. code-block:: sh

   nvprof ./Maestro2d.xxx

- Run under ``nvprof -o profile%p.nvvp ./Maestro2d.xxx`` for
  a small problem and examine page faults using NVIDIA's visual profiler, ``nvvp``.

- Run under ``cuda-memcheck``

- Run under ``cuda-gdb``

- Run with ``CUDA_LAUNCH_BLOCKING=1``.  This forces kernel launches to be
  synchronous, so only one kernel runs at a time, which can help identify
  race conditions (see the sketch after this list).
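
As a hypothetical example of combining some of the options above,
``Maestro2d.xxx`` and ``inputs`` below are placeholder executable and
inputs-file names:

.. code-block:: sh

    # serialize kernel launches and check for memory errors in one run
    CUDA_LAUNCH_BLOCKING=1 cuda-memcheck ./Maestro2d.xxx inputs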