...
For threaded applications (pure OpenMP, no MPI), you obtain a full node by requesting --ntasks-per-node=1 and --cpus-per-task=128. You can choose whether or not to exploit the SMT feature (it depends on the value you assign to the OMP_NUM_THREADS variable), but in any case switch the binding of the OMP threads on (this is the default for the XL compilers, while it is off by default for the GNU and NVIDIA hpc-sdk (ex-PGI) compilers).
Different compilers abide by different default settings for the binding and placing of the threads:
- XL compilers
- OMP_PROC_BIND = false (default)/true, close/spread
- OMP_PLACES = threads (default),cores
- Comments for SMT configurations: with both the default setting for OMP_PLACES (threads) and with OMP_PLACES=cores, the threads are always placed on the first HW thread of the physical cores. Setting OMP_PROC_BIND=close or spread makes no difference.
- NVIDIA hpc-sdk compilers (ex PGI):
...
- OMP_PROC_BIND = false (default)/true, close/spread
- OMP_PLACES = threads (default),cores
- GCC compilers
- OMP_PROC_BIND = false (default)/true, close/spread
- OMP_PLACES = threads (default),cores
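To verify which of these defaults your runtime actually applies, you can ask any OpenMP 4.0+ runtime (the XL, GNU, and hpc-sdk runtimes all support this) to print its effective internal control variables at startup via the standard OMP_DISPLAY_ENV variable; `./your_omp_program` below is a placeholder for your own OpenMP binary:

```shell
# Ask the OpenMP runtime to print its effective settings (OMP_PROC_BIND,
# OMP_PLACES, OMP_NUM_THREADS, ...) to stderr before the program starts.
export OMP_DISPLAY_ENV=true
export OMP_NUM_THREADS=4
./your_omp_program
```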
For instance, if you want each OMP thread bound to a physical core, ask for the full node (--cpus-per-task=128) and set:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128 # full node
#SBATCH ........
export OMP_PROC_BIND=true
export OMP_PLACES=threads/cores # XL default: cores, pinned to the 1st HW thread of each physical core; threads: pinned to the threads
                                # hpc-sdk default: threads, pinned to the 1st HW thread of each physical core; cores: placed on all 4 HW threads of the physical cores
                                # gnu default: threads and OMP_PROC_BIND=close, placed on subsequent threads (4 per physical core); set OMP_PROC_BIND=spread to place one thread per physical core; cores: placed on all 4 HW threads of the physical cores
export OMP_NUM_THREADS=32 # the OMP threads will be bound to the physical cores
<your exe>
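With the GNU runtime, for instance, the comments above imply that the default close binding packs 4 threads onto each physical core, so one-thread-per-physical-core requires OMP_PROC_BIND=spread; a sketch of the corresponding script (same placeholders as above):

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128    # full node

export OMP_PROC_BIND=spread    # GNU runtime: one thread per physical core
export OMP_PLACES=threads
export OMP_NUM_THREADS=32      # 32 physical cores, SMT not exploited
<your exe>
```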
2) MPI parallelization
For pure MPI applications (i.e., no OpenMP threads), set --ntasks-per-node to the number of MPI processes you want to run per node, and --cpus-per-task=4.
For instance, if you want to run an MPI application with 2 processes (no threads), each of them using 1 GPU:
...
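A minimal sketch of such a batch script follows; the --gres syntax, the GPU count per node, and the launcher name (mpirun vs. srun) are assumptions that depend on your cluster, so check your site's documentation:

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2   # 2 MPI processes per node
#SBATCH --cpus-per-task=4     # 4 HW threads (1 physical core) per process
#SBATCH --gres=gpu:2          # 2 GPUs in total, i.e. 1 per MPI process (syntax may vary by site)

mpirun -np 2 <your exe>       # or: srun <your exe>, depending on the site
```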