Performance Tips

Getting good performance requires attention to both single-processor performance and parallel efficiency.

Parallel Efficiency

Load Balancing

Good parallel performance requires that the work be evenly distributed, ideally with the same number of boxes on each MPI task. If too many MPI tasks are used, there will not be enough boxes to go around and some tasks will sit idle.

Use a large amr.max_grid_size and aim for one box per MPI task on each level.
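
As an illustrative example (the numbers here are not a recommendation), a \(512^3\) coarse grid with

amr.max_grid_size = 256

decomposes into 8 boxes of \(256^3\) zones, so 8 MPI tasks would give one box per task on that level.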

It is usually a good idea to do a small scaling study before running a large simulation to find the optimal number of nodes for your problem.

Gravity / Multigrid

Multigrid performance is best when the grid can be coarsened down to the smallest possible size. This means that the number of zones in each dimension of a box should be a power of two. This can be controlled by amr.blocking_factor. See also the AMReX docs on linear solvers.

It can also be faster to run without subcycling in some instances, which is controlled by amr.subcycling_mode, as described in Subcycling.

For isolated systems, we use a multipole expansion for constructing the Dirichlet boundary conditions for the Poisson solve. The construction of the multipole expansion can be expensive. If the system is roughly spherical and the boundaries are far from the mass, then you probably can get away with fewer multipole moments. For reference, for the white dwarf mergers in Katz et al. [51], \(l_\mathrm{max} = 6\) was used. The maximum multipole moment is set by gravity.max_multipole_order.
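
For reference, a sketch of the inputs settings discussed in this subsection (the values here are illustrative, not recommendations):

amr.blocking_factor = 32
amr.subcycling_mode = None
gravity.max_multipole_order = 6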

Global Diagnostics

Global Diagnostics are output at regular intervals. These can require reduction operations across all processors, which can be expensive for big calculations. Often we don’t need the diagnostics every step, so setting castro.sum_interval to a larger value (like 10) could improve performance.
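
For example, to compute the global diagnostics only every 10 steps:

castro.sum_interval = 10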

Metrics

To understand where the simulation is spending the most time, we suggest running with the AMReX Tiny Profiler enabled.

Simply build as:

make TINY_PROFILE=TRUE

and then run your simulation (any number of processors, CPU or GPU). Typically running for 10–100 steps is enough to get a sense of where the time is spent. Upon completion, a detailed report will be output to stdout showing which functions / kernels are using the most time.

This can help you determine where to focus your optimization efforts.

Alternatively, you can use the GNU Profiler on CPUs to get line-by-line performance details. This can be enabled by building with

make USE_GPROF=TRUE

running as usual, and then analyzing the resulting profile with the gprof command-line tool.
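
For example, a gprof-instrumented run writes its profile data to gmon.out by default, which can then be summarized with (the executable name here is only a placeholder):

gprof ./Castro.ex gmon.out > profile.txt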

GPUs

For GPU performance, see the discussion in Running on GPUs.

The main parameter to explore when running on GPUs is castro.hydro_memory_footprint_ratio, which controls the use of tiling in the hydrodynamics solver to prevent oversubscribing GPU memory.
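
This is set in the inputs file; the value below is purely illustrative, and the Running on GPUs section discusses how to choose it:

castro.hydro_memory_footprint_ratio = 3.0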

Reactions

Reactions are often the most time-consuming part of a simulation. The following are some things to try to improve the performance:

  • Try using the Runge-Kutta-Chebyshev (RKC) integrator (see the Microphysics ODE integrators docs).

    This is an explicit integrator that can work with moderately stiff networks. Experience shows that it can work well for flames (sometimes being twice as fast as the VODE integrator), but it is likely less efficient for detonations.
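
    A sketch of how this might be selected at build time, assuming the Microphysics INTEGRATOR_DIR build variable is what chooses the integrator:

    make INTEGRATOR_DIR=rkc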

  • Use the analytic Jacobian (selected via integrator.jacobian=1).

    The analytic Jacobian is faster to evaluate than the finite-difference approximation. Note that if the integration fails in a zone, by default Castro will record the failure and reject the step, triggering the Castro Retry Mechanism. This can be expensive, so something to try instead is to catch the failure in the burner and retry the burn with the alternate Jacobian (e.g., numerical differencing), since sometimes that helps the integrator get through. This can be enabled via:

    integrator.use_burn_retry = 1
    integrator.retry_swap_jacobian = 1
    

    Sometimes it can be advantageous to force a burn failure if the first pass is taking too many integration steps. This can be done by setting integrator.ode_max_steps to a small value (like 5000 instead of the default 150000).
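
    Putting these suggestions together, an illustrative set of inputs for this strategy (all of these parameters are named above) is:

    integrator.jacobian = 1
    integrator.use_burn_retry = 1
    integrator.retry_swap_jacobian = 1
    integrator.ode_max_steps = 5000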

  • Try a single-precision Jacobian.

    On GPUs, the Jacobian can consume a lot of memory, and the linear algebra solve is often the most expensive part of the ODE integration. Since the role of the Jacobian is simply to point the nonlinear solver in the right direction for computing the correction, we can sometimes benefit from using a single-precision Jacobian. This needs to be set at compile time by building as:

    make USE_SINGLE_PRECISION_JACOBIAN=TRUE
    
  • Disable burning where it is not needed.

    The parameters castro.react_rho_min, castro.react_rho_max, castro.react_T_min, and castro.react_T_max control the range of density and temperature over which we burn. Disabling reactions in low-density regions (where you don’t expect appreciable energy generation) can help with performance.
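
    For example (the cutoff values here are purely illustrative and problem-dependent):

    castro.react_rho_min = 1.e2
    castro.react_T_min = 1.e8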

  • Experiment with SDC

    For explosive flows, the simplified-SDC solver can be more efficient, since it eliminates some stiffness from the system. See Flowchart for information on how to enable the SDC solver.
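
    A sketch of how the simplified-SDC solver is typically selected (consult the Flowchart documentation for the authoritative settings), assuming a USE_SIMPLIFIED_SDC build flag and the castro.time_integration_method runtime parameter:

    make USE_SIMPLIFIED_SDC=TRUE

    and then in the inputs file:

    castro.time_integration_method = 3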

  • Use self-consistent NSE

    For explosive flows, when the temperature reaches 5 GK or more, the nuclei can enter nuclear statistical equilibrium (NSE). The cancellation of the forward and reverse rates can be challenging for the reaction integrator to handle. Instead, we can detect that we are entering NSE and use the NSE solution in place of integration in these cases. This is described in the Microphysics self-consistent NSE docs.
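
    A sketch of how this might be enabled at build time, assuming the Microphysics USE_NSE_NET flag controls the self-consistent NSE capability:

    make USE_NSE_NET=TRUE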