Working at ALCF
Polaris has 560 nodes, each with 4 NVIDIA A100 GPUs.
The PBS scheduler is used.
Logging In
ssh into:
polaris.alcf.anl.gov
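For example, assuming your ALCF username is <username>:

ssh <username>@polaris.alcf.anl.gov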
To have a custom .bashrc, create a ~/.bash.expert file and add anything there. This is read at the end of /etc/bash.bashrc.
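For example, a minimal ~/.bash.expert that makes the extra software modules available in every shell (mirroring the module use line from the compiling instructions below) could be:

# ~/.bash.expert -- read at the end of /etc/bash.bashrc
module use /soft/modulefiles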
Compiling
Load the modules:
module use /soft/modulefiles
module load PrgEnv-gnu
module load nvhpc-mixed
Then you can compile via:
make COMP=cray USE_CUDA=TRUE
Disks
Project workspace is at: /lus/grand/projects/AstroExplosions/
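One convenient convention (just an illustration, not a required layout) is to keep per-user run directories under the project workspace:

mkdir -p /lus/grand/projects/AstroExplosions/${USER}
cd /lus/grand/projects/AstroExplosions/${USER}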
Queues
For production jobs, you submit to the prod queue.
For debugging jobs, there are two options: the debug and debug-scaling queues. More information can be found at:
https://docs.alcf.anl.gov/polaris/running-jobs/#queues
Note
The smallest node count that seems to be allowed in production is 10 nodes.
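The queue and node count are requested through PBS directives in the submission script; for example, a production job at the 10-node minimum noted above would start with:

#PBS -q prod
#PBS -l select=10:system=polaris
#PBS -l place=scatter
#PBS -l walltime=6:00:00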
Submitting
Clone the GettingStarted repo:
git clone git@github.com:argonne-lcf/GettingStarted.git
You'll want to use the examples in GettingStarted/Examples/Polaris/affinity_gpu. In particular, you will need the script set_affinity_gpu_polaris.sh copied into your run directory.
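The script's job is to give each MPI rank on a node its own GPU by setting CUDA_VISIBLE_DEVICES from the rank's node-local index before launching the application. A rough sketch of the idea (the copy in the GettingStarted repo is the authoritative version and may differ in details such as the GPU ordering):

#!/bin/bash -l
# pick one of the node's 4 GPUs based on the node-local MPI rank
num_gpus=4
gpu=$((PMI_LOCAL_RANK % num_gpus))
export CUDA_VISIBLE_DEVICES=${gpu}
# launch the real command (executable + arguments) passed by mpiexec
exec "$@"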
Here’s a submission script that will run on 2 nodes with 4 GPUs / node:
#!/bin/sh
#PBS -l select=2:system=polaris
#PBS -l place=scatter
#PBS -l walltime=0:30:00
#PBS -q debug
#PBS -A AstroExplosions
EXEC=./Castro2d.gnu.MPI.CUDA.SMPLSDC.ex
INPUTS=inputs_2d.N14.coarse
# Enable GPU-MPI (if supported by application)
##export MPICH_GPU_SUPPORT_ENABLED=1
# Change to working directory
cd ${PBS_O_WORKDIR}
# MPI and OpenMP settings
NNODES=`wc -l < $PBS_NODEFILE`
NRANKS_PER_NODE=4
NDEPTH=8
NTHREADS=1
NTOTRANKS=$(( NNODES * NRANKS_PER_NODE ))
# For applications that need mpiexec to bind MPI ranks to GPUs
mpiexec -n ${NTOTRANKS} --ppn ${NRANKS_PER_NODE} --depth=${NDEPTH} --cpu-bind depth --env OMP_NUM_THREADS=${NTHREADS} -env OMP_PLACES=threads ./set_affinity_gpu_polaris.sh ${EXEC} ${INPUTS}
To submit the job, do:
qsub polaris.submit
To check the status:
qstat -u username
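If a job needs to be removed from the queue (or killed while running), use the standard PBS command:

qdel <jobid>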
Automatic Restarting
A version of the submission script that automatically restarts from the last checkpoint is:
#!/bin/sh
#PBS -l select=2:system=polaris
#PBS -l place=scatter
#PBS -l walltime=6:00:00
#PBS -q prod
#PBS -A AstroExplosions
#PBS -j eo
EXEC=./Castro2d.gnu.MPI.CUDA.SMPLSDC.ex
INPUTS=inputs_2d.N14.coarse
module swap PrgEnv-nvhpc PrgEnv-gnu
module load nvhpc-mixed
# Enable GPU-MPI (if supported by application)
##export MPICH_GPU_SUPPORT_ENABLED=1
# Change to working directory
cd ${PBS_O_WORKDIR}
# MPI and OpenMP settings
NNODES=`wc -l < $PBS_NODEFILE`
NRANKS_PER_NODE=4
NDEPTH=8
NTHREADS=1
NTOTRANKS=$(( NNODES * NRANKS_PER_NODE ))
function find_chk_file {
    # find_chk_file takes a single argument -- the wildcard pattern
    # for checkpoint files to look through
    chk=$1

    # find the latest 2 restart files. This way if the latest didn't
    # complete we fall back to the previous one.
    temp_files=$(find . -maxdepth 1 -name "${chk}" -print | sort | tail -2)

    restartFile=""
    for f in ${temp_files}
    do
        # the Header is the last thing written -- if it's there, update the restart file
        if [ -f ${f}/Header ]; then
            restartFile="${f}"
        fi
    done
}
# look for 7-digit chk files
find_chk_file "*chk???????"
if [ "${restartFile}" = "" ]; then
# look for 6-digit chk files
find_chk_file "*chk??????"
fi
if [ "${restartFile}" = "" ]; then
# look for 5-digit chk files
find_chk_file "*chk?????"
fi
# restartString will be empty if no chk files are found -- i.e. new run
if [ "${restartFile}" = "" ]; then
restartString=""
else
restartString="amr.restart=${restartFile}"
fi
# For applications that need mpiexec to bind MPI ranks to GPUs
mpiexec -n ${NTOTRANKS} --ppn ${NRANKS_PER_NODE} --depth=${NDEPTH} --cpu-bind depth --env OMP_NUM_THREADS=${NTHREADS} -env OMP_PLACES=threads ./set_affinity_gpu_polaris.sh ${EXEC} ${INPUTS} ${restartString}
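To check by hand which checkpoint the script would pick, you can run the same find pipeline interactively in the run directory, e.g. for 6-digit checkpoint files:

find . -maxdepth 1 -name "*chk??????" -print | sort | tail -2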
Job Chaining
A script that can be used to chain jobs with PBS is:
#!/bin/sh -f
if [ ! "$1" ]; then
echo "usage: chainqsub.sh jobid number script"
echo " set jobid -1 for no initial dependency"
exit -1
fi
if [ ! "$2" ]; then
echo "usage: chainqsub.sh jobid number script"
echo " set jobid -1 for no initial dependency"
exit -1
fi
if [ ! "$3" ]; then
echo "usage: chainqsub.sh jobid number script"
echo " set jobid -1 for no initial dependency"
exit -1
fi
oldjob=$1
numjobs=$2
script=$3
if [ $numjobs -gt "20" ]; then
echo "too many jobs requested"
exit -1
fi
firstcount=1
if [ $oldjob -eq "-1" ]
then
echo chaining $numjobs jobs
echo starting job 1 with no dependency
aout=$(qsub ${script})
echo " " jobid: $aout
echo " "
oldjob=$aout
firstcount=2
sleep 3
else
echo chaining $numjobs jobs starting with $oldjob
fi
for count in $(seq $firstcount 1 $numjobs)
do
echo starting job $count to depend on $oldjob
aout=$(qsub -W depend=afterany:${oldjob} ${script})
echo " " jobid: $aout
echo " "
oldjob=$aout
sleep 2
done
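Assuming the script above is saved as chainqsub.sh and made executable, chaining 5 copies of the earlier submission script with no initial dependency would look like:

chmod +x chainqsub.sh
./chainqsub.sh -1 5 polaris.submit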
Installing Python
The recommended way to install Python on Polaris is to create a virtual environment on top of the conda-based environment provided by the conda module, and to install any extra required packages into that virtual environment with pip. Although it is tempting to clone the whole base environment and fully customize the installed conda packages, some modules like mpi4py may require access to MPICH libraries that are tailored to the conda-based environment provided by the conda module. These instructions follow the guidelines published here: https://docs.alcf.anl.gov/polaris/data-science-workflows/python/
To create the virtual environment:
module use /soft/modulefiles
module load conda
conda activate
VENV_DIR="venvs/polaris"
mkdir -p "${VENV_DIR}"
python -m venv "${VENV_DIR}" --system-site-packages
source "${VENV_DIR}/bin/activate"
To activate it in a new terminal (once the module path /soft/modulefiles has been added with module use):
module load conda
conda activate
VENV_DIR="venvs/polaris"
source "${VENV_DIR}/bin/activate"
Once the virtual environment is active, any extra package can be installed with pip:
python -m pip install <package>
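For example, to install an analysis package and confirm that it is picked up from the virtual environment (the package name here is just an illustration):

python -m pip install yt
python -c "import yt; print(yt.__version__)"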