...

  • XL compilers
    • OMP_PROC_BIND = false (default)/true, close (default)/spread
    • OMP_PLACES = threads, cores (default)
    • Comments for SMT configurations:
      • in both the default setting for OMP_PLACES (=cores) and for OMP_PLACES=threads, the threads are bound to HW threads of the physical cores.
      • If OMP_NUM_THREADS < (cpus per task): for the default OMP_PLACES (=cores), the OMP threads are spread on the HW threads of the physical cores (cpu id: 0,2,4.....); setting OMP_PROC_BIND=close/spread makes no difference.
      • If OMP_NUM_THREADS < (cpus per task): for OMP_PLACES=threads the OMP threads are by default closely bound to the first OMP_NUM_THREADS HW threads (cpu id: 0,1,2,3.....,OMP_NUM_THREADS-1); for OMP_PROC_BIND=spread the OMP threads are spread on the HW threads of the physical cores (cpu id: 0,2,4,6....2*(OMP_NUM_THREADS-1)).
  • NVIDIA hpc-sdk compilers (formerly PGI):
    • OMP_PROC_BIND = false (default)/true, close/spread (default)
    • OMP_PLACES = threads (default), cores
    • Comments for SMT configurations:
      • with the default OMP_PLACES (= threads) the threads are bound to HW threads of the physical cores.
      • If OMP_NUM_THREADS < (cpus per task): for the default OMP_PLACES (=threads) the OMP threads are always spread on the HW threads of the physical cores (cpu id: 0,2,4...); setting OMP_PROC_BIND=close/spread makes no difference.
      • If OMP_NUM_THREADS < (cpus per task): for OMP_PLACES=cores, the OMP threads are bound to the 4 HW threads of the cores (cpu id: 0-3,4-7,8-11....); setting OMP_PROC_BIND=close/spread makes no difference.
  • GCC compilers
    • OMP_PROC_BIND = false (default)/true, close (default)/spread
    • OMP_PLACES = threads (default), cores
    • Comments for SMT configurations:
      • with the default OMP_PLACES (= threads) the threads are bound to HW threads of the physical cores.
      • If OMP_NUM_THREADS  < (cpus per task): for the default OMP_PLACES (=threads)  the OMP threads are by default closely bound to the first OMP_NUM_THREADS HW threads (cpu id: 0,1,2,3.....,OMP_NUM_THREADS-1); for OMP_PROC_BIND=spread the OMP threads are spread on the HW threads of the physical cores (cpu id: 0,2,4,6....2*(OMP_NUM_THREADS-1)).
      • If OMP_NUM_THREADS  < (cpus per task):  for OMP_PLACES=cores, the OMP threads are bound to the 4 HW threads of the cores (cpu id: 0-3,4-7,8-11....); setting OMP_PROC_BIND=close/spread makes no difference

For instance, with the GCC compilers (gnu modules), if you want each of 32 OMP threads to be bound to one HW thread per physical core, ask for the full node (--cpus-per-task=128) and set:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1     
#SBATCH --cpus-per-task=128  
#SBATCH ........

export OMP_PROC_BIND=spread
export OMP_PLACES=threads   # not necessary, it's the default
export OMP_NUM_THREADS=32
<your exe>
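
To check the binding you actually obtain, you can ask the OpenMP runtime to report its settings and the per-thread affinity. This is a minimal sketch: OMP_DISPLAY_ENV is part of OpenMP 4.0 and OMP_DISPLAY_AFFINITY of OpenMP 5.0, so their availability depends on the compiler version in use.

export OMP_DISPLAY_ENV=true        # print the OMP_* settings seen by the runtime at startup
export OMP_DISPLAY_AFFINITY=true   # print the binding of each OMP thread (if supported)
<your exe>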


Even though the binding defaults differ, with the XL and HPC-SDK compilers you simply need to set OMP_PROC_BIND=true, and you will have the OMP threads bound to the first HW thread of each physical core. (Explanation: for XL the default OMP_PLACES is cores, and setting OMP_PROC_BIND to close or spread makes no difference. For HPC-SDK the default OMP_PLACES is threads, and the threads are in any case spread on the HW threads of the physical cores, again with no difference between close and spread.)
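
For example, the same 32-thread layout with the XL or HPC-SDK compilers could be obtained with the settings below (a minimal sketch, using the same SBATCH request as above):

export OMP_PROC_BIND=true
export OMP_NUM_THREADS=32
<your exe>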

2) MPI parallelization

For pure MPI applications (i.e., no OpenMP threads as parallel elements), set --ntasks-per-node to the number of MPI processes you want to run per node, and set --cpus-per-task=4.
For instance, if you want to run an MPI application with 4 processes (no threads), each of them using 1 GPU:

...
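
A minimal sketch of such a submission could look like the following (the GPU request syntax and the omitted SBATCH directives are assumptions to adapt to your cluster):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:4            # assumed GPU request syntax; adapt to the cluster configuration
#SBATCH ........

mpirun ./myscript.sh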

where "myscript.sh" is a serial script that is equally executed by all the tasks alocated allocated by the job, with no communication involved. You have requested 4 tasks, two for each socket, and you want each task to work within one of the 4 gpus available on the node, that are also two for each socket. So, at a first try your myscript.sh may result in something like this:

...
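
The script, assuming Open MPI (so that the OMPI_COMM_WORLD_LOCAL_RANK variable is available) and the intuitive rank-to-GPU mapping, could be sketched as:

#!/bin/bash
# naive mapping: expose to each task the GPU with the same index as its local MPI rank
export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK
./serial_exe        # placeholder for the actual serial executable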

In this way, task 0 will see GPU 0, task 1 will see GPU 1, and so on. This configuration, which may seem intuitively correct, can actually result in bad performance for some of the tasks running the executable. That is because the IDs of the tasks and of the GPUs are actually mismatched: while it is true that socket 1 hosts GPUs 0 and 1, and socket 2 hosts GPUs 2 and 3, the same can't be said for the CPU tasks, which are actually scattered: socket 1 hosts tasks 0 and 2, and socket 2 hosts tasks 1 and 3. To summarize:

...

PLEASE NOTE: the problem is related to the instance of serial executions working on different GPUs, launched via mpirun. When you submit an actual parallel execution, with MPI communications involved, the process pinning is automatically configured in a more intuitive way that keeps the task IDs compact and does not result in a mismatch with the IDs of the GPUs. Your environment is safer in that case.
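
If you do launch independent serial instances via mpirun as above, one possible workaround is to remap the ranks to the GPUs according to the socket layout described earlier (tasks 0 and 2 on socket 1 with GPUs 0 and 1; tasks 1 and 3 on socket 2 with GPUs 2 and 3). A sketch, again assuming Open MPI's OMPI_COMM_WORLD_LOCAL_RANK:

#!/bin/bash
# remap each task to a GPU attached to its own socket
# socket 1: tasks 0,2 and GPUs 0,1 - socket 2: tasks 1,3 and GPUs 2,3
case $OMPI_COMM_WORLD_LOCAL_RANK in
  0) export CUDA_VISIBLE_DEVICES=0 ;;
  2) export CUDA_VISIBLE_DEVICES=1 ;;
  1) export CUDA_VISIBLE_DEVICES=2 ;;
  3) export CUDA_VISIBLE_DEVICES=3 ;;
esac
./serial_exe        # placeholder for the actual serial executable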