
Each node has 2 sockets, 16 cores per socket with SMT 4 (4 hardware threads, HTs, per core).

The HTs are enabled on all nodes, which therefore expose themselves to SLURM as having 128 logical cpus. Since the nodes are not assigned exclusively and can be shared among users, the scheduler is configured to always assign the requested cores in an exclusive way, that is, from 1 to 128 in multiples of 4. A job can request up to a maximum of 128 cpus per node (hence, the upper limit for (ntasks-per-node) * (cpus-per-task) is 128).
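
As a quick check of this limit, a minimal sketch of a valid request (the values are only illustrative):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=8        # 16 * 8 = 128 logical cpus: within the limit, and a multiple of 4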

You need to pay attention to the mapping and binding of MPI processes and of the OpenMP threads.

1) OpenMP parallelization

For threaded applications (pure OpenMP, no MPI), you obtain a full node by requesting --ntasks-per-node=1 and --cpus-per-task=128. You can choose whether or not to exploit the SMT
feature, but in either case switch on the binding of the OMP threads (this is the default for the XL compilers, while for the GNU and PGI compilers it is off by default):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1     
#SBATCH --cpus-per-task=64        # 64 OMP threads
#SBATCH --gres=gpu:1

export OMP_PROC_BIND=true        # not needed for the XL compilers, whose default is already true; for the GNU and PGI compilers OMP_PROC_BIND is false by default
srun --cpus-per-task=64 <your_exe>
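
If instead you want the full node with all hardware threads exploited as OMP threads (the full-node request described above), a minimal sketch could be:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128      # 128 OMP threads, one per hardware thread

export OMP_PROC_BIND=true
export OMP_NUM_THREADS=128
srun --cpus-per-task=128 <your_exe>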

2) MPI parallelization

For pure MPI applications (i.e., no OpenMP threads), set --ntasks-per-node to the number of MPI processes you want to run per node, and --cpus-per-task=4.
For instance, if you want to run an MPI application with 2 processes (no threads), each of them using 1 GPU:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2      # 2 MPI processes per node
#SBATCH --cpus-per-task=4        # 4 HTs per task
#SBATCH --gres=gpu:2

mpirun <your exe>
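
If instead you want to fill a whole node with pure MPI processes, one per physical core, a minimal sketch along the same lines could be (the gres value is only illustrative; adjust it to the GPUs you actually need):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32     # one MPI process per physical core
#SBATCH --cpus-per-task=4        # the 4 HTs of each core stay with their task
#SBATCH --gres=gpu:2             # illustrative value

mpirun <your_exe>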

If you want to spread your 2 MPI tasks over the sockets, giving each task more cores, you need to request 32 cpus per task and use a proper mapping configuration for the MPI tasks
with the mpirun option --map-by <obj>:PE=N. For instance, you can map the processes on the object "socket" and indicate how many processing elements (PE) each process gets,
expressed in terms of the next level of granularity, which for the object "socket" is the "physical" core (with its 4 HTs):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2      # 2 MPI processes per node
#SBATCH --cpus-per-task=32       # 32 HTs (8 physical cores) per task
#SBATCH --gres=gpu:2

mpirun  --map-by socket:PE=8 <your_exe>

In this case, each of the two tasks per node is mapped on 8 physical cores of its socket (there being 16 physical cores per socket), hence the mapping is specified as
--map-by socket:PE=8. You can choose other mapping objects; please verify that the result is correct in terms of MPI task binding (you can use the --report-bindings option of mpirun).
The number of mappable PEs for the socket object is given by 16 (= n. of physical cores per socket) / ntasks-per-node (= n. of MPI processes).
You can change the order in which the processors are assigned to the MPI ranks with the --rank-by option. If you want consecutive processes assigned to consecutive cores, use --rank-by core.
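
For instance, to print the resulting binding and assign consecutive ranks to consecutive cores, you could combine these options as follows (a minimal sketch reusing the mapping of the example above):

mpirun --report-bindings --rank-by core --map-by socket:PE=8 <your_exe>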

If you want to exploit the SMT feature, request the desired number of tasks with 1 cpu per task, and bind (or map) the MPI processes to the "hwthread" element:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=128      # 128 MPI processes per node
#SBATCH --cpus-per-task=1        # 1 HT per task

mpirun --map-by hwthread <your_exe>    (or mpirun --bind-to hwthread <your_exe>)

3) Hybrid (MPI+OpenMP) parallelization

Non-SMT: request a number of cpus-per-task equal to the number of OMP_NUM_THREADS you intend to use, multiplied by 4.
Switch on the binding of the OMP threads, and map the MPI processes correctly with the --map-by option.

Example: 4 processes per node, 8 threads per process:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4      # 4 MPI processes per node
#SBATCH --cpus-per-task=32       # 8 OMP threads * 4 HTs per task

export OMP_PROC_BIND=true
export OMP_NUM_THREADS=8

mpirun --map-by socket:PE=8 <your_exe>

SMT: set the value of --ntasks-per-node to the number of MPI processes you want to run per node, and --cpus-per-task = OMP_NUM_THREADS (if you want to exploit the SMT in terms of number of OMP threads) or to 128 / (ntasks-per-node) (if you want to exploit the SMT in terms of number of MPI processes).
Always switch on the binding of the OMP threads, and map the MPI processes correctly with the --map-by option.
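
For instance, for the first variant (SMT exploited in terms of OMP threads), a minimal sketch with 4 processes per node and 32 threads per process, reusing the mapping of the non-SMT example above (verify the result with --report-bindings):

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=32       # 32 OMP threads per task = 8 physical cores * 4 HTs

export OMP_PROC_BIND=true
export OMP_NUM_THREADS=32

mpirun --map-by socket:PE=8 <your_exe>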
