...

HTs are enabled on all nodes, which therefore present themselves to SLURM as having 128 logical cpus. Since the nodes are not assigned exclusively and can be shared among users, the scheduler is configured to always assign the requested cores in an exclusive way: cpus are granted in multiples of 4 (i.e. whole physical cores), up to 128. A job can request at most 128 cpus per node (hence, the upper limit for (ntasks-per-node) * (cpus-per-task) is 128). Even if you request --cpus-per-task < 4, all 4 HTs of the physical core will be assigned to your job.
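For example (an illustrative sketch; account, partition and walltime directives are omitted), a request for 4 MPI tasks with one physical core each amounts to 16 logical cpus, a multiple of 4:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4          # 4 tasks on the node
#SBATCH --cpus-per-task=4            # one physical core (4 HW threads) per task, 16 logical cpus in total

A request such as --cpus-per-task=2 would still hand the task the whole physical core, i.e. all 4 of its HTs.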

You need to pay attention to the mapping and binding of MPI processes and OpenMP threads.

CAVEAT: the following discussion refers to XL 16.1.1, hpc-sdk 2021, gnu 8.4.0.

1) OpenMP parallelization

...

  • XL compilers
    • OMP_PROC_BIND = false (default)/true, or close/spread
    • OMP_PLACES = threads (default), cores
    • Comment for SMT configurations: with both the default setting OMP_PLACES=threads and with OMP_PLACES=cores, the threads are ALWAYS placed on the first HW thread of each physical core. Setting OMP_PROC_BIND=close/spread makes no difference (a sketch for verifying the actual placement follows this list).
  • NVIDIA hpc-sdk compilers (ex PGI):
    • OMP_PROC_BIND = false (default)/true, close/spread
    • OMP_PLACES = threads (default), cores
  • GCC compilers
    • OMP_PROC_BIND = false (default)/true, close/spread
    • OMP_PLACES = threads (default), cores
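Whichever compiler you use, you can check the placement actually applied inside the job. The sketch below is illustrative: your_omp_exe is a placeholder, OMP_DISPLAY_ENV is part of OpenMP 4.0 and should be honoured by all three runtimes, and the /proc query simply shows the logical cpus granted by SLURM to the batch step:

grep Cpus_allowed_list /proc/self/status        # logical cpus assigned to the job by SLURM
export OMP_DISPLAY_ENV=verbose                  # the OpenMP runtime prints its ICVs (OMP_PROC_BIND, OMP_PLACES, ...) at startup
./your_omp_exe                                  # placeholder for your OpenMP executable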

For instance, with the GCC compilers (gnu modules), if you want each OMP thread to be bound to a physical core, ask for the full node (--cpus-per-task=128) and:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1     
#SBATCH --cpus-per-task=128
#SBATCH ........

export OMP_PROC_BIND=true
export OMP_PLACES=threads/cores                             # choose one of the two values (see the compiler-specific comments below)

...

# hpc-sdk default: threads, pinned to the 1st HW thread of each physical core; cores: placed on all 4 HW threads of the physical cores

...

   
export OMP_NUM_THREADS=32                                   # one thread per physical core: 128 logical cpus / 4 HW threads per core

...

 
<your exe>

2) MPI parallelization

For pure MPI applications (i.e., no OpenMP threads), set --ntasks-per-node to the number of MPI processes you want to run per node, and set --cpus-per-task=4.
For instance, if you want to run an MPI application with 2 processes (no threads), each of them using 1 GPU:

...
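A minimal sketch of such a request (the GPU request syntax and the launcher are assumptions and depend on the site's SLURM and MPI configuration) could look like:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2          # 2 MPI processes
#SBATCH --cpus-per-task=4            # one physical core (4 HTs) per process
#SBATCH --gres=gpu:2                 # assumption: 2 GPUs requested on the node, one per MPI process

srun ./your_mpi_exe                  # placeholder executable; mpirun can be used instead, depending on the MPI stack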