# Working at ALCF
Polaris has 560 nodes, each with 4 NVIDIA A100 GPUs, and uses the PBS job scheduler.
## Logging In
`ssh` into:

```
polaris.alcf.anl.gov
```

To have a custom `.bashrc`, create a `~/.bash.expert` file and add anything there. This file is read at the end of `/etc/bash.bashrc`.
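For example, a minimal `~/.bash.expert` might look like the following (the specific settings are just an illustration, not ALCF recommendations):

```shell
# ~/.bash.expert -- sourced at the end of /etc/bash.bashrc
export PATH=${HOME}/bin:${PATH}

# convenient alias for checking your queued jobs
alias myq='qstat -u ${USER}'
```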
## Compiling
Load the modules:

```
module swap PrgEnv-nvhpc PrgEnv-gnu

# load gcc/11.2.0, since CUDA doesn't support gcc 12 yet
module load gcc/11.2.0
module load nvhpc-mixed
```
Then you can compile via:

```
make COMP=gnu USE_CUDA=TRUE
```
## Disks
Project workspace is at `/lus/grand/projects/AstroExplosions/`.
## Queues
For production jobs, you submit to the `prod` queue.

> **Note:** the smallest node count that seems to be allowed in production is 10 nodes.
## Submitting
Clone the GettingStarted repo:

```
git clone git@github.com:argonne-lcf/GettingStarted.git
```

You'll want to use the examples in `GettingStarted/Examples/Polaris/affinity_gpu`. In particular, you will need the script `set_affinity_gpu_polaris.sh` copied into your run directory.
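That script maps each MPI rank on a node to its own GPU before launching the application. Conceptually, it does something like the following (a hedged sketch, not the actual ALCF script, which may order the rank-to-GPU mapping differently; `PMI_LOCAL_RANK` is set per-rank by `mpiexec`):

```shell
#!/bin/bash
# Illustrative sketch of a per-rank GPU binding wrapper.
# The real set_affinity_gpu_polaris.sh in the GettingStarted repo
# may differ in details.

num_gpus=4                            # A100s per Polaris node
gpu=$((PMI_LOCAL_RANK % num_gpus))    # one GPU per local rank
export CUDA_VISIBLE_DEVICES=${gpu}
echo "local rank ${PMI_LOCAL_RANK} -> GPU ${gpu}"
exec "$@"                             # run the real application
```

Because `mpiexec` runs this wrapper once per rank, each rank sees only its assigned GPU through `CUDA_VISIBLE_DEVICES`.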
Here’s a submission script that will run on 2 nodes with 4 GPUs / node:
```bash
#!/bin/sh
#PBS -l select=2:system=polaris
#PBS -l place=scatter
#PBS -l walltime=0:30:00
#PBS -q debug
#PBS -A AstroExplosions

EXEC=./Castro2d.gnu.MPI.CUDA.SMPLSDC.ex
INPUTS=inputs_2d.N14.coarse

# Enable GPU-MPI (if supported by application)
##export MPICH_GPU_SUPPORT_ENABLED=1

# Change to working directory
cd ${PBS_O_WORKDIR}

# MPI and OpenMP settings
NNODES=`wc -l < $PBS_NODEFILE`
NRANKS_PER_NODE=4
NDEPTH=8
NTHREADS=1
NTOTRANKS=$(( NNODES * NRANKS_PER_NODE ))

# For applications that need mpiexec to bind MPI ranks to GPUs
mpiexec -n ${NTOTRANKS} --ppn ${NRANKS_PER_NODE} --depth=${NDEPTH} --cpu-bind depth --env OMP_NUM_THREADS=${NTHREADS} -env OMP_PLACES=threads ./set_affinity_gpu_polaris.sh ${EXEC} ${INPUTS}
```
To submit the job, do:

```
qsub polaris.submit
```
To check the status:

```
qstat -u username
```
## Automatic Restarting
A version of the submission script that automatically restarts from the last checkpoint is:
```bash
#!/bin/sh
#PBS -l select=2:system=polaris
#PBS -l place=scatter
#PBS -l walltime=6:00:00
#PBS -q prod
#PBS -A AstroExplosions
#PBS -j eo

EXEC=./Castro2d.gnu.MPI.CUDA.SMPLSDC.ex
INPUTS=inputs_2d.N14.coarse

module swap PrgEnv-nvhpc PrgEnv-gnu
module load nvhpc-mixed

# Enable GPU-MPI (if supported by application)
##export MPICH_GPU_SUPPORT_ENABLED=1

# Change to working directory
cd ${PBS_O_WORKDIR}

# MPI and OpenMP settings
NNODES=`wc -l < $PBS_NODEFILE`
NRANKS_PER_NODE=4
NDEPTH=8
NTHREADS=1
NTOTRANKS=$(( NNODES * NRANKS_PER_NODE ))

function find_chk_file {
    # find_chk_file takes a single argument -- the wildcard pattern
    # for checkpoint files to look through
    chk=$1

    # find the latest 2 restart files.  This way if the latest didn't
    # complete we fall back to the previous one.
    temp_files=$(find . -maxdepth 1 -name "${chk}" -print | sort | tail -2)
    restartFile=""
    for f in ${temp_files}
    do
        # the Header is the last thing written -- if it's there, update the restart file
        if [ -f ${f}/Header ]; then
            restartFile="${f}"
        fi
    done
}

# look for 7-digit chk files
find_chk_file "*chk???????"

if [ "${restartFile}" = "" ]; then
    # look for 6-digit chk files
    find_chk_file "*chk??????"
fi

if [ "${restartFile}" = "" ]; then
    # look for 5-digit chk files
    find_chk_file "*chk?????"
fi

# restartString will be empty if no chk files are found -- i.e. new run
if [ "${restartFile}" = "" ]; then
    restartString=""
else
    restartString="amr.restart=${restartFile}"
fi

# For applications that need mpiexec to bind MPI ranks to GPUs
mpiexec -n ${NTOTRANKS} --ppn ${NRANKS_PER_NODE} --depth=${NDEPTH} --cpu-bind depth --env OMP_NUM_THREADS=${NTHREADS} -env OMP_PLACES=threads ./set_affinity_gpu_polaris.sh ${EXEC} ${INPUTS} ${restartString}
```
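The checkpoint-selection logic can be exercised on its own, outside of a job. Here is a self-contained sketch using fabricated checkpoint directory names (the `chk*/Header` layout mirrors what the script looks for):

```shell
#!/bin/bash
# Standalone demo of the find_chk_file logic with fake checkpoints.
workdir=$(mktemp -d)
cd "${workdir}"

# two complete checkpoints and one that is missing its Header
mkdir -p chk00100 chk00200 chk00300
touch chk00100/Header chk00200/Header

find_chk_file () {
    chk=$1
    # keep the two most recent matches; prefer the latest one with a Header
    temp_files=$(find . -maxdepth 1 -name "${chk}" -print | sort | tail -2)
    restartFile=""
    for f in ${temp_files}; do
        if [ -f ${f}/Header ]; then
            restartFile="${f}"
        fi
    done
}

find_chk_file "*chk?????"
echo "restart from: ${restartFile}"   # -> restart from: ./chk00200
```

Here `chk00300` is skipped because it has no `Header` file (an incomplete write), so the script falls back to `chk00200`, the latest complete checkpoint.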
## Job Chaining
A script that can be used to chain jobs with PBS is shown below. It takes three arguments: the job id to depend on (use `-1` for no initial dependency), the number of jobs to chain, and the submission script to run:
```bash
#!/bin/sh -f

if [ ! "$1" ] || [ ! "$2" ] || [ ! "$3" ]; then
    echo "usage: chainqsub.sh jobid number script"
    echo "       set jobid to -1 for no initial dependency"
    exit 1
fi

oldjob=$1
numjobs=$2
script=$3

if [ $numjobs -gt "20" ]; then
    echo "too many jobs requested"
    exit 1
fi

firstcount=1

if [ $oldjob -eq "-1" ]
then
    echo chaining $numjobs jobs

    echo starting job 1 with no dependency
    aout=$(qsub ${script})
    echo "  " jobid: $aout
    echo " "
    oldjob=$aout
    firstcount=2
    sleep 3
else
    echo chaining $numjobs jobs starting with $oldjob
fi

for count in $(seq $firstcount 1 $numjobs)
do
    echo starting job $count to depend on $oldjob
    aout=$(qsub -W depend=afterany:${oldjob} ${script})
    echo "  " jobid: $aout
    echo " "
    oldjob=$aout
    sleep 2
done
```