...
For threaded applications (pure OpenMP, no MPI), you obtain a full node by requesting --ntasks-per-node=1 and --cpus-per-task=128. You can choose whether or not to exploit the SMT feature (it depends on the value you assign to the OMP_NUM_THREADS variable), but in any case switch the binding of the OMP threads on (this is the default for the XL compilers, while it is off by default for the GNU and NVIDIA hpc-sdk (ex-PGI) compilers).
Different compilers abide by different default settings for the binding and placing of the threads:
- XL compilers
- OMP_PROC_BIND = false (default)/true, close/spread
- OMP_PLACES = threads (default),cores
- Comments for SMT configurations: with both the default setting for OMP_PLACES (threads) and with OMP_PLACES=cores, the threads are always placed on the first HW thread of the physical cores. Setting OMP_PROC_BIND=close or spread makes no difference.
- NVIDIA hpc-sdk compilers (ex PGI):
...
- OMP_PROC_BIND = false (default)/true, close/spread
- OMP_PLACES = threads (default),cores
- GCC compilers
- OMP_PROC_BIND = false (default)/true, close/spread
- OMP_PLACES = threads (default),cores
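To verify which of these defaults your runtime actually applies, you can ask any OpenMP 4.0+ runtime (the XL, GNU, and hpc-sdk runtimes all support this) to print its effective internal control variables at startup via the standard OMP_DISPLAY_ENV variable; `./your_omp_program` below is a placeholder for your own OpenMP binary:

```shell
# Ask the OpenMP runtime to print its effective settings (OMP_PROC_BIND,
# OMP_PLACES, OMP_NUM_THREADS, ...) to stderr before the program starts.
export OMP_DISPLAY_ENV=true
export OMP_NUM_THREADS=4
./your_omp_program
```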
For instance, if you want each OMP thread bound to a physical core, ask for the full node (--cpus-per-task=128) and set:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128 # full node
#SBATCH ........
export OMP_PROC_BIND=true
export OMP_PLACES=threads/cores # XL default: cores, pinned to the 1st HW thread of each physical core; threads: pinned to the threads
                                # hpc-sdk default: threads, pinned to the 1st HW thread of each physical core; cores: placed on all 4 HW threads of the physical cores
                                # gnu default: threads and OMP_PROC_BIND=close, placed on subsequent threads (4 per physical core); set OMP_PROC_BIND=spread to place one thread per physical core; cores: placed on all 4 HW threads of the physical cores
export OMP_NUM_THREADS=32 # the OMP threads will be bound to the physical cores
<your exe>
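With the GNU runtime, for instance, the comments above imply that the default close binding packs 4 threads onto each physical core, so one-thread-per-physical-core requires OMP_PROC_BIND=spread; a sketch of the corresponding script (same placeholders as above):

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128    # full node

export OMP_PROC_BIND=spread    # GNU runtime: one thread per physical core
export OMP_PLACES=threads
export OMP_NUM_THREADS=32      # 32 physical cores, SMT not exploited
<your exe>
```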
2) MPI parallelization
For pure MPI applications (i.e., no OpenMP threads), set --ntasks-per-node to the number of MPI processes you want to run per node, and --cpus-per-task=4.
For instance, if you want to run an MPI application with 2 processes (no threads), each of them using 1 GPU:
...
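A minimal sketch of such a batch script follows; the --gres syntax, the GPU count per node, and the launcher name (mpirun vs. srun) are assumptions that depend on your cluster, so check your site's documentation:

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2   # 2 MPI processes per node
#SBATCH --cpus-per-task=4     # 4 HW threads (1 physical core) per process
#SBATCH --gres=gpu:2          # 2 GPUs in total, i.e. 1 per MPI process (syntax may vary by site)

mpirun -np 2 <your exe>       # or: srun <your exe>, depending on the site
```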