Performance Tips
Getting good performance is a mix of single-processor performance and parallel efficiency.
Parallel Efficiency
Load Balancing
Good parallel performance requires that each MPI task have the same number of boxes, so the work is evenly distributed. If too many MPI tasks are used, there will not be enough work to go around.
Use a large amr.max_grid_size and aim for one box per MPI task on
each level.
Usually it’s a good idea to do a small scaling study before running a large simulation to find the optimal number of nodes for your problem.
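As a sketch (the grid sizes and task count here are purely illustrative), a 1024^3 coarse grid run on 512 MPI tasks with a maximum grid size of 128 gives exactly 8 x 8 x 8 = 512 coarse boxes, i.e., one box per task:
# illustrative only: 1024^3 zones split into 512 boxes of 128^3,
# giving one box per MPI task when running on 512 tasks
amr.n_cell        = 1024 1024 1024
amr.max_grid_size = 128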
Gravity / Multigrid
Multigrid performance is best when the grid can be coarsened down to
the smallest possible size. This means that the number of zones in
each dimension of a box should be a power of two. This can be controlled
by amr.blocking_factor. See also the AMReX docs on linear solvers.
It can also be faster to run without subcycling in some instances,
which is controlled by amr.subcycling_mode, as described
in Subcycling.
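For example (the values below are a sketch, not a recommendation), a power-of-two blocking factor ensures every box can be coarsened several times by the multigrid solver, and subcycling can be switched off at the same time via amr.subcycling_mode (see the Subcycling section for the accepted modes):
# illustrative settings: power-of-two boxes that coarsen well in multigrid
amr.blocking_factor = 32
amr.max_grid_size   = 128
# advance all levels with the same timestep (no subcycling)
amr.subcycling_mode = None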
For isolated systems, we use a multipole expansion for constructing
the Dirichlet boundary conditions for the Poisson solve. The
construction of the multipole expansion can be expensive. If the
system is roughly spherical and the boundaries are far from the mass,
then you probably can get away with fewer multipole moments. For
reference, for the white dwarf mergers in Katz et al. [51],
\(l_\mathrm{max} = 6\) was used. The maximum multipole moment is
set by gravity.max_multipole_order.
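For example, to match the value used in the white dwarf merger simulations cited above, this is a single line in the inputs file:
# keep multipole moments only up to l = 6 when constructing the
# Dirichlet boundary conditions for the Poisson solve
gravity.max_multipole_order = 6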
Global Diagnostics
Global Diagnostics are output at regular intervals. These can
require reduction operations across all processors, which can be
expensive for big calculations. Often we don’t need the diagnostics
every step, so setting castro.sum_interval to a larger value (like
10) could improve performance.
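Concretely, this amounts to one line in the inputs file:
# compute the global diagnostics only every 10 timesteps
castro.sum_interval = 10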
Metrics
To understand where the simulation is spending the most time, we suggest running with the AMReX Tiny Profiler.
Simply build as:
make TINY_PROFILE=TRUE
and then run your simulation (any number of processors, CPU or GPU). Typically running for 10–100 steps is enough to get a sense of where the time is spent. Upon completion, a detailed report will be output to stdout showing which functions / kernels are using the most time.
This can help you determine where to focus your optimization efforts.
Alternatively, you can use the GNU Profiler (gprof) on CPUs to get line-by-line performance details. This can be enabled by building with
make USE_GPROF=TRUE
running your simulation as usual, and then post-processing with the gprof command line tool.
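A minimal sketch of the workflow (the executable and inputs file names below are placeholders for your own build and problem setup): run the instrumented executable, which writes a gmon.out file in the run directory, then post-process it with gprof.
# run as usual; the gprof-instrumented executable writes gmon.out on exit
./Castro.ex inputs
# produce the profiling report (add -l for line-by-line output)
gprof ./Castro.ex gmon.out > gprof_report.txt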
GPUs
For GPU performance, see the discussion in Running on GPUs.
The main parameter to explore when running on GPUs is
castro.hydro_memory_footprint_ratio, which can use tiling in the
hydrodynamics solver to prevent oversubscription of GPU memory.
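A sketch of what this might look like in the inputs file; the value below is purely illustrative, and the right choice depends on your problem size and the memory available on the GPU (see Running on GPUs for guidance):
# limit the memory footprint of the hydro solver by tiling its work
# (illustrative value only)
castro.hydro_memory_footprint_ratio = 3.0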
Reactions
Reactions are often the most time-consuming part of a simulation. The following are some things to try to improve the performance:
Try using the Runge-Kutta-Chebyshev (RKC) integrator (see the Microphysics ODE integrators docs).
This is an explicit integrator that can work with moderately-stiff networks. Experience shows that it can work well with flames (sometimes being twice as fast as the VODE integrator), but it is probably not as efficient for detonations.
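The integrator is selected at build time in Microphysics. Assuming the usual INTEGRATOR_DIR build variable (check the Microphysics docs for the exact directory name in your version), switching away from the default VODE integrator is a one-line change:
# select the Runge-Kutta-Chebyshev integrator instead of the default VODE
make INTEGRATOR_DIR=RKC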
Use the analytic Jacobian (selected via
integrator.jacobian = 1). The analytic Jacobian is faster to evaluate than the difference approximation. Note that if the integration fails in a zone, by default Castro will record this failure and reject the step, triggering the Castro Retry Mechanism. This can be expensive, so something to try instead is to catch the failure in the burner and retry the burn with the alternate Jacobian (e.g., numerical differencing), since that sometimes helps the integrator get through. This can be enabled via:
integrator.use_burn_retry = 1
integrator.retry_swap_jacobian = 1
Sometimes it can be advantageous to force a burn failure if the first pass is taking too many integration steps. This can be done by setting
integrator.ode_max_steps to a small value (like 5000 instead of the default 150000).
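Putting these pieces together, the retry-friendly configuration described above might look like this in the inputs file:
# use the (faster) analytic Jacobian on the first integration attempt
integrator.jacobian = 1
# on failure, retry the burn in place before triggering a Castro retry,
# swapping to the numerical Jacobian for the second attempt
integrator.use_burn_retry = 1
integrator.retry_swap_jacobian = 1
# cap the number of integration steps so a struggling zone fails
# (and retries) sooner
integrator.ode_max_steps = 5000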
Try a single-precision Jacobian.
On GPUs, the Jacobian can consume a lot of memory, and the linear algebra solve is often the most expensive part of the ODE integration. Since the role of the Jacobian is simply to point the nonlinear solver in the right direction for computing the correction, we can sometimes benefit from using a single-precision Jacobian. This needs to be set at compile time by building as:
make USE_SINGLE_PRECISION_JACOBIAN=TRUE
Disable burning where it is not needed.
The parameters
castro.react_rho_min, castro.react_rho_max, castro.react_T_min, and castro.react_T_max can be used to control the density and temperature ranges over which we burn. Disabling reactions in low-density regions (where you don’t expect appreciable energy generation) can help with performance.
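For example (the cutoff values below are purely illustrative and should be chosen based on your network and problem):
# only burn in zones inside these density and temperature ranges
# (illustrative values only)
castro.react_rho_min = 1.e2
castro.react_rho_max = 1.e9
castro.react_T_min   = 1.e7
castro.react_T_max   = 1.e10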
Experiment with SDC
For explosive flows, the simplified-SDC solver can be more efficient, since it eliminates some stiffness from the system. See Flowchart for information on how to enable the SDC solver.
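As a sketch (the build flag and parameter value below reflect the typical simplified-SDC setup and should be verified against the Flowchart section):
# build with simplified-SDC support
make USE_SIMPLIFIED_SDC=TRUE
# in the inputs file, select the simplified-SDC time integration method
castro.time_integration_method = 3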
Use self-consistent NSE
For explosive flows, when the temperature reaches 5 GK or more, the nuclei can enter nuclear statistical equilibrium. The cancellation of the forward and reverse rates can be challenging for the reaction integrator to handle. Instead, we can detect that we are entering NSE and use the NSE solution instead of integration in these cases. This is described in the Microphysics self-consistent NSE docs.