Working at ALCF

Polaris has 560 nodes each with 4 NVIDIA A100 GPUs.

The PBS scheduler is used.

Logging In

ssh into:

polaris.alcf.anl.gov
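
For example, replacing username with your ALCF username:

ssh username@polaris.alcf.anl.gov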

To have a custom .bashrc, create a ~/.bash.expert file and add your customizations there. It is sourced at the end of /etc/bash.bashrc.
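
A minimal ~/.bash.expert might just set up your module search path; for example (this particular customization is only a suggestion, based on the compile instructions below):

# ~/.bash.expert -- sourced at the end of /etc/bash.bashrc
# make the extra ALCF modulefiles available in every shell
module use /soft/modulefiles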

Compiling

Load the modules:

module use /soft/modulefiles
module load PrgEnv-gnu
module load nvhpc-mixed

Then you can compile via:

make COMP=gnu USE_CUDA=TRUE
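
For example, a complete (sketch) build session from a problem directory (the path is a placeholder; the -j parallel build flag is optional):

cd /path/to/problem_dir
module use /soft/modulefiles
module load PrgEnv-gnu
module load nvhpc-mixed
make COMP=gnu USE_CUDA=TRUE -j 8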

Disks

Project workspace is at: /lus/grand/projects/AstroExplosions/
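
Run directories are typically created under this project space; for example (the layout under the project directory is only a suggestion):

mkdir -p /lus/grand/projects/AstroExplosions/${USER}/my_run
cd /lus/grand/projects/AstroExplosions/${USER}/my_run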

Queues

For production jobs, you submit to the prod queue.

For debug jobs, there are two queues: debug and debug-scaling. More information can be found at: https://docs.alcf.anl.gov/polaris/running-jobs/#queues

Note

The smallest node count that seems to be allowed in the prod queue is 10.
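
For a production run, the PBS header of the submission script shown below would change to something like this (a sketch; adjust the node count and walltime to your problem, keeping at least 10 nodes):

#PBS -q prod
#PBS -l select=10:system=polaris
#PBS -l place=scatter
#PBS -l walltime=6:00:00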

Submitting

Clone the GettingStarted repo:

git clone git@github.com:argonne-lcf/GettingStarted.git

You'll want to use the examples in GettingStarted/Examples/Polaris/affinity_gpu.

In particular, you will need the script set_affinity_gpu_polaris.sh copied into your run directory.
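
For example (the run directory path is a placeholder):

cp GettingStarted/Examples/Polaris/affinity_gpu/set_affinity_gpu_polaris.sh /path/to/run_dir/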

Here’s a submission script that will run on 2 nodes with 4 GPUs / node:

polaris.submit
#!/bin/sh
#PBS -l select=2:system=polaris
#PBS -l place=scatter
#PBS -l walltime=0:30:00
#PBS -q debug
#PBS -A AstroExplosions

EXEC=./Castro2d.gnu.MPI.CUDA.SMPLSDC.ex
INPUTS=inputs_2d.N14.coarse

# Enable GPU-MPI (if supported by application)
##export MPICH_GPU_SUPPORT_ENABLED=1

# Change to working directory
cd ${PBS_O_WORKDIR}

# MPI and OpenMP settings
NNODES=`wc -l < $PBS_NODEFILE`
NRANKS_PER_NODE=4
NDEPTH=8
NTHREADS=1

NTOTRANKS=$(( NNODES * NRANKS_PER_NODE ))

# For applications that need mpiexec to bind MPI ranks to GPUs
mpiexec -n ${NTOTRANKS} --ppn ${NRANKS_PER_NODE} --depth=${NDEPTH} --cpu-bind depth --env OMP_NUM_THREADS=${NTHREADS} --env OMP_PLACES=threads ./set_affinity_gpu_polaris.sh ${EXEC} ${INPUTS}

To submit the job, do:

qsub polaris.submit

To check the status:

qstat -u username
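
To remove a job from the queue (using the job id reported by qsub or qstat):

qdel jobid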

Automatic Restarting

A version of the submission script that automatically restarts from the last checkpoint is:

polaris.submit
#!/bin/sh
#PBS -l select=2:system=polaris
#PBS -l place=scatter
#PBS -l walltime=6:00:00
#PBS -q prod
#PBS -A AstroExplosions
#PBS -j eo

EXEC=./Castro2d.gnu.MPI.CUDA.SMPLSDC.ex
INPUTS=inputs_2d.N14.coarse

module swap PrgEnv-nvhpc PrgEnv-gnu
module load nvhpc-mixed

# Enable GPU-MPI (if supported by application)
##export MPICH_GPU_SUPPORT_ENABLED=1

# Change to working directory
cd ${PBS_O_WORKDIR}

# MPI and OpenMP settings
NNODES=`wc -l < $PBS_NODEFILE`
NRANKS_PER_NODE=4
NDEPTH=8
NTHREADS=1

NTOTRANKS=$(( NNODES * NRANKS_PER_NODE ))


function find_chk_file {
    # find_chk_file takes a single argument -- the wildcard pattern
    # for checkpoint files to look through
    chk=$1

    # find the latest 2 restart files.  This way if the latest didn't
    # complete we fall back to the previous one.
    temp_files=$(find . -maxdepth 1 -name "${chk}" -print | sort | tail -2)
    restartFile=""
    for f in ${temp_files}
    do
        # the Header is the last thing written -- if it's there, update the restart file
        if [ -f ${f}/Header ]; then
            restartFile="${f}"
        fi
    done

}

# look for 7-digit chk files
find_chk_file "*chk???????"

if [ "${restartFile}" = "" ]; then
    # look for 6-digit chk files
    find_chk_file "*chk??????"
fi

if [ "${restartFile}" = "" ]; then
    # look for 5-digit chk files
    find_chk_file "*chk?????"
fi

# restartString will be empty if no chk files are found -- i.e. new run
if [ "${restartFile}" = "" ]; then
    restartString=""
else
    restartString="amr.restart=${restartFile}"
fi


# For applications that need mpiexec to bind MPI ranks to GPUs
mpiexec -n ${NTOTRANKS} --ppn ${NRANKS_PER_NODE} --depth=${NDEPTH} --cpu-bind depth --env OMP_NUM_THREADS=${NTHREADS} --env OMP_PLACES=threads ./set_affinity_gpu_polaris.sh ${EXEC} ${INPUTS} ${restartString}
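
The restart is passed to Castro as a command-line override after the inputs file, so if you ever need to force a restart from a specific checkpoint, you can bypass the search and set the override directly in the script (the checkpoint name here is just an example):

restartString="amr.restart=chk0001000"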

Job Chaining

A script that can be used to chain jobs with PBS is:

chainqsub.sh
#!/bin/sh -f

if [ ! "$1" ]; then
  echo "usage: chainqsub.sh jobid number script"
  echo "       set jobid -1 for no initial dependency"
  exit -1
fi

if [ ! "$2" ]; then
  echo "usage: chainqsub.sh jobid number script"
  echo "       set jobid -1 for no initial dependency"
  exit -1
fi

if [ ! "$3" ]; then
  echo "usage: chainqsub.sh jobid number script"
  echo "       set jobid -1 for no initial dependency"
  exit -1
fi


oldjob=$1
numjobs=$2
script=$3

if [ $numjobs -gt "20" ]; then
    echo "too many jobs requested"
    exit -1
fi

firstcount=1

if [ $oldjob -eq "-1" ]
then
    echo chaining $numjobs jobs

    echo starting job 1 with no dependency
    aout=$(qsub ${script})
    echo "   " jobid: $aout
    echo " "
    oldjob=$aout
    firstcount=2
    sleep 3
else
    echo chaining $numjobs jobs starting with $oldjob
fi

for count in $(seq $firstcount 1 $numjobs)
do
  echo starting job $count to depend on $oldjob
  aout=$(qsub -W depend=afterany:${oldjob} ${script})
  echo "   " jobid: $aout
  echo " "
  oldjob=$aout
  sleep 2
done
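
For example, to chain 5 copies of the restart-aware submission script above with no initial dependency:

chmod +x chainqsub.sh
./chainqsub.sh -1 5 polaris.submit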

Installing Python

The recommended way to install Python on Polaris is to create a virtual environment on top of the conda-based environment provided by the conda module, and then install any extra packages into that virtual environment with pip. Although it is tempting to clone the whole base environment and fully customize the installed conda packages, some packages like mpi4py may require access to MPICH libraries that are tailored specifically to the conda-based environment provided by the conda module. These instructions follow the guidelines published here: https://docs.alcf.anl.gov/polaris/data-science-workflows/python/

To create the virtual environment:

module use /soft/modulefiles
module load conda
conda activate
VENV_DIR="venvs/polaris"
mkdir -p "${VENV_DIR}"
python -m venv "${VENV_DIR}" --system-site-packages
source "${VENV_DIR}/bin/activate"

To activate it in a new terminal (assuming /soft/modulefiles is already in your module path):

module load conda
conda activate
VENV_DIR="venvs/polaris"
source "${VENV_DIR}/bin/activate"

Once the virtual environment is active, any extra packages can be installed with pip:

python -m pip install <package>
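
For example, to add a package that is not in the base environment (the package name here is only an illustration):

python -m pip install yt
python -c "import yt; print(yt.__version__)"

When you are done, deactivate exits the virtual environment and conda deactivate exits the underlying conda environment.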