
Each node has 2 sockets, 16 cores per socket with SMT 4 (4 hardware threads, HTs, per core).

The HTs are enabled on all nodes, which therefore present themselves to SLURM as having 128 logical cpus. Since the nodes are not assigned in an exclusive way and can be shared by users, the scheduler is configured to always assign the requested cores in an exclusive way, that is, in multiples of 4 logical cpus (from 4 up to 128). A job can request resources up to a maximum of 128 cpus per node (hence, the upper limit for (ntasks-per-node) * (cpus-per-task) is 128).
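
For instance, a request that exactly saturates a node without exceeding the limit could look like this (a minimal sketch; the task/thread breakdown is just illustrative of the arithmetic):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16      # 16 tasks per node
#SBATCH --cpus-per-task=8         # 8 logical cpus per task: 16 * 8 = 128, the per-node maximum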

You need to pay attention to the mapping and binding of MPI processes and of the OpenMP threads.

1) OpenMP parallelization

For threaded applications (pure OpenMP, no MPI), you obtain a full node by requesting --ntasks-per-node=1 and --cpus-per-task=128. You can choose whether or not to exploit the SMT feature, but in any case switch the binding of the OMP threads on (this is the default for the XL compilers, while for the GNU and PGI compilers it is off by default):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1     
#SBATCH --cpus-per-task=64        # 64 OMP threads
#SBATCH --gres=gpu:1

export OMP_PROC_BIND=true        # not needed for the XL compilers (true by default); for the GNU and PGI compilers it is false by default
srun --cpus-per-task=64 <your_exe>
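
If you also want to exploit the SMT feature on a full node, a possible variant is the following (a sketch; set OMP_NUM_THREADS to the number of threads your application actually needs):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128       # all the logical cpus of the node
#SBATCH --gres=gpu:1

export OMP_PROC_BIND=true
export OMP_NUM_THREADS=128        # one OMP thread per HT
srun --cpus-per-task=128 <your_exe>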

2) MPI parallelization

For pure MPI applications (hence, no OpenMP threads as parallel elements), set the value of --ntasks-per-node to the number of MPI processes you want to run per node, and --cpus-per-task=4.
For instance, if you want to run an MPI application with 2 processes (no threads), each of them using 1 GPU:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2      # 2 MPI processes per node
#SBATCH --cpus-per-task=4        # 4 HTs per task
#SBATCH --gres=gpu:2

mpirun <your exe>
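
As a further example, to fill an entire node with one MPI process per physical core (no SMT), a possible request is (a sketch; the node hosts 4 GPUs):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32     # 32 MPI processes per node, one per physical core
#SBATCH --cpus-per-task=4        # 4 HTs per task
#SBATCH --gres=gpu:4

mpirun <your_exe>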

If you want the entire socket for your 2 MPI tasks, you need to request 32 cpus per task and use the proper mapping configuration for the MPI tasks with the mpirun option --map-by <obj>:PE=N. For instance, you can map the processes on the object "socket" and indicate how many Processing Elements (PE) each process gets, expressed in terms of the next level of granularity, which for the object "socket" is the "physical" core (with its 4 HTs):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2      # 2 MPI processes per node
#SBATCH --cpus-per-task=32        # 32 HTs (8 physical cores) per task
#SBATCH --gres=gpu:2

mpirun  --map-by socket:PE=8 <your_exe>

In this case, each of the two tasks per node is mapped on 8 physical cores of the socket (the physical cores being 16 per socket), hence the mapping is specified as --map-by socket:PE=8. You can choose other mapping objects; please verify that the result is correct in terms of MPI task binding (you can use the --report-bindings option of mpirun).
The number of mappable PEs for the socket object is given by 16 ( = n. of physical cores per socket) / ntasks-per-node ( = n. of MPI processes).
You can change the order in which the processors are assigned to the MPI ranks with the --rank-by option. If you want consecutive processes assigned to consecutive cores, use --rank-by core.
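
For instance, to rank consecutive processes on consecutive cores and print the resulting binding of each task, the options mentioned above can be combined as follows (a sketch):

mpirun --map-by socket:PE=8 --rank-by core --report-bindings <your_exe>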

If you want to exploit the SMT feature, request the desired number of tasks and --cpus-per-task=1, and bind (or map) the MPI processes to the "hwthread" element:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128      # 128 MPI processes per node
#SBATCH --cpus-per-task=1       # 1 HT per task

mpirun --map-by hwthread <your_exe>    (or mpirun --bind-to hwthread <your_exe>)

3) Hybrid (MPI+OpenMP) parallelization

Non-SMT: ask for a number of cpus-per-task equal to 4 times the number of OMP_NUM_THREADS you mean to use (one physical core, i.e. 4 HTs, per thread).
Switch the binding of the OMP threads on, and correctly map the MPI processes with the --map-by option.

Example: 4 processes per node, 8 threads per process:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=32

export OMP_PROC_BIND=true
export OMP_NUM_THREADS=8

mpirun --map-by socket:PE=8 <your_exe>

SMT: set the value of --ntasks-per-node to the number of MPI processes you want to run per node, and --cpus-per-task = OMP_NUM_THREADS (if you want to exploit the SMT in terms of number of OMP threads) or 128 / (ntasks-per-node) (if you want to exploit the SMT in terms of number of MPI processes).
Always switch the binding of the OMP threads on, and correctly map the MPI processes with the --map-by option, as in the sketch below.
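
For instance, to exploit the SMT in terms of OMP threads with 4 MPI processes per node and 32 threads per process (a sketch derived from the non-SMT example above):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=32

export OMP_PROC_BIND=true
export OMP_NUM_THREADS=32         # one OMP thread per HT (SMT exploited by the threads)

mpirun --map-by socket:PE=8 <your_exe>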


Appendix) Pinning GPUs when launching multiple serial executions within a job

Let's consider a situation like this:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=2
#SBATCH --gres=gpu:4

mpirun ./myscript.sh

where "myscript.sh" is a serial script that is equally executed by all the tasks alocated by the job, with no communication involved. You have requested 4 tasks, two for each socket, and you want each task to work within one of the 4 gpus available on the node, that are also two for each socket. So, at a first try your myscript.sh may result in something like this:

export CUDA_VISIBLE_DEVICES=${SLURM_LOCALID}
./my_actual_gpu_executable

In this way, task 0 will see GPU 0, task 1 will see GPU 1, and so on. This configuration, which may seem intuitively correct, can actually result in bad performance for some of the tasks running the executable. That is because the IDs of the tasks and of the GPUs are actually mismatched. While it is true that socket 1 hosts GPUs 0 and 1, and socket 2 hosts GPUs 2 and 3, the same can't be said for the CPU tasks, which are actually scattered: socket 1 hosts tasks 0 and 2, and socket 2 hosts tasks 1 and 3. To summarize:

FIRST SOCKET:
tasks 0 and 2 ---> GPUs 0 and 1
SECOND SOCKET:
tasks 1 and 3 ---> GPUs 2 and 3

Therefore, when working with GPU 1 and task 1, you are connecting a CPU and a GPU coming from different sockets, and this results in a slowdown in communication (the same happens with GPU 2 and task 2).
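
A quick way to see the mismatch is to let each task print its local rank, its CPU affinity, and the GPU it has been assigned (a diagnostic sketch for myscript.sh; it assumes the taskset utility is available on the node):

export CUDA_VISIBLE_DEVICES=${SLURM_LOCALID}
# print the task's local rank, the cpus it is bound to, and the GPU it will use
echo "task ${SLURM_LOCALID}: $(taskset -cp $$) -> GPU ${CUDA_VISIBLE_DEVICES}"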


There are many ways to work around this unwanted behaviour. We present some of them:

  1. working with mpirun options: specifically, the flag --rank-by core (for Spectrum/OpenMPI) will compact the task assignment and rank the tasks so that tasks 0 and 1 belong to the first socket, and tasks 2 and 3 to the second socket:

    mpirun --rank-by core ./myscript.sh

  2. pinning with Slurm directives. There are a couple of possible options:

    #SBATCH --accel-bind=g

    From Slurm manual: "Bind each task to GPUs which are closest to the allocated CPUs."
    Another possible option is:

    #SBATCH --gpu-bind=closest

    "Bind each task to the GPU(s) which are closest" (useful when more than one process is attached to a single GPU).

  3.  A simple and intuitive way may be to adapt your script as follows:

      export CUDA_VISIBLE_DEVICES=${SLURM_LOCALID}

      if [ $SLURM_LOCALID -eq 1 ]
      then
         export CUDA_VISIBLE_DEVICES=2
      fi

      if [ $SLURM_LOCALID -eq 2 ]
      then
         export CUDA_VISIBLE_DEVICES=1
      fi

      ./my_actual_gpu_executable

      This way, the CPU tasks and the GPU IDs are matched so that they all belong to the same socket, and the communication is optimal.
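
      Equivalently, the same remapping can be written more compactly (a sketch, functionally identical to the script above):

      case ${SLURM_LOCALID} in
         0) export CUDA_VISIBLE_DEVICES=0 ;;
         1) export CUDA_VISIBLE_DEVICES=2 ;;
         2) export CUDA_VISIBLE_DEVICES=1 ;;
         3) export CUDA_VISIBLE_DEVICES=3 ;;
      esac
      ./my_actual_gpu_executable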

PLEASE NOTE: the problem is related to the instance of serial executions working on different GPUs, launched via mpirun. When you submit an actual parallel execution, with MPI communications involved, the process pinning is automatically configured in a more intuitive way that keeps the task IDs compact and does not produce a mismatch with the IDs of the GPUs. Your environment is safer in that case.
