...
which establishes a connection to one of the available login nodes. To connect to Marconi100 you can also explicitly indicate one of the login nodes:
> login01-ext.m100.cineca.it
> login02-ext.m100.cineca.it
> login03-ext.m100.cineca.it
For information about data transfer from other computers, please follow the instructions and caveats in the dedicated section Data storage or in the document Data Management.
...
From the output of the command it is possible to see that GPU0 and GPU1 are connected with NVLink (NV3), as is the couple GPU2 & GPU3. The first couple is connected to (virtual) cpus 0-63 (on the first socket), the second to (virtual) cpus 64-127 (on the second socket). The cpus are numbered from 0 to 127 because of hyperthreading (four hardware threads per physical core). The two Power9 sockets are connected by a 64 GBps X bus. Each of them is connected with 2 GPUs via NVLink 2.0.
Knowledge of the node topology is important for correctly distributing the parallel threads of your application in order to get the best performance.
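As a quick aid, the SMT4 numbering described above can be decoded with plain shell arithmetic. This is a sketch based on the layout stated here (4 hardware threads per core, 64 logical cpus per socket), not a query of the actual machine:

```shell
# Decode a logical CPU id (0-127) into its physical core and socket,
# assuming the SMT4 layout described above: threads 4k..4k+3 share
# core k, and cpus 0-63 / 64-127 sit on socket 0 / socket 1.
logical_cpu=70
core=$(( logical_cpu / 4 ))     # 4 hardware threads per core
socket=$(( logical_cpu / 64 ))  # 64 logical cpus per socket
echo "cpu ${logical_cpu} -> core ${core}, socket ${socket}"
# prints: cpu 70 -> core 17, socket 1
```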
...
The following reports an interactive job on 2 cores and 2 GPUs. Within the job a parallel (MPI) program using 2 MPI tasks and 2 GPUs is executed. Since the request of tasks per node (--ntasks-per-node) refers to the 128 (virtual) cpus, if you want 2 physical cores you also have to specify that each task is made of 4 (virtual) cpus (--cpus-per-task=4).
> salloc -N1 --ntasks-per-node=2 --cpus-per-task=4 --gres=gpu:2 -A <account_name> -p <partition_name> --time=01:00:00
salloc: Granted job allocation 1175
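Once the allocation is granted, the parallel program can be launched from within the session. A minimal sketch (myprogram is a placeholder for your MPI executable):

```shell
# Inside the salloc session: 2 MPI tasks, each on one physical core
# (4 hardware threads), with one GPU available per task.
# With Spectrum MPI launch via mpirun; with OpenMPI srun also works.
export OMP_NUM_THREADS=1   # no OpenMP threading in this sketch
mpirun ./myprogram
```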
...
#!/bin/bash
#SBATCH -A <account_name>
#SBATCH -p m100_usr_prod
#SBATCH --time 00:10:00 # format: HH:MM:SS
#SBATCH -N 1 # 1 node
#SBATCH --ntasks-per-node=8 # 8 tasks out of 128
#SBATCH --gres=gpu:1 # 1 gpus per node out of 4
#SBATCH --mem=7100 # memory per node out of 246000MB
#SBATCH --job-name=my_batch_job
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<user_email>
srun ./myexecutable
Please note that by requesting --ntasks-per-node=8 your job will be assigned 8 logical cpus (hence, the first 2 physical cores with their 4 hardware threads each). You can write your script file (for example script.x) using any editor, then you submit it using the command:
...
SLURM partition | Job QOS | # cores / # GPUs per job | max walltime | max running jobs per user / max n. of cpus/nodes/GPUs per user | priority | notes
--- | --- | --- | --- | --- | --- | ---
m100_all_serial (def. partition) | normal | max = 1 core, max mem = 7600 MB | 04:00:00 | 4 cpus / 1 GPU | 40 |
m100_usr_prod | m100_qos_dbg | max = 2 nodes | 02:00:00 | 2 nodes / 64 cpus / 8 GPUs | 45 | runs on 12 nodes
m100_usr_prod | normal | max = 16 nodes | 24:00:00 | 10 jobs | 40 | runs on 880 nodes
m100_usr_prod | m100_qos_bprod | min = 17 nodes, max = 256 nodes | 24:00:00 | 256 nodes | 85 | runs on 256 nodes
m100_fua_prod | m100_qos_fuadbg | max = 2 nodes | 02:00:00 | | 45 | runs on 12 nodes
m100_fua_prod | normal | max = 16 nodes | 24:00:00 | | 40 | runs on 68 nodes
| qos_special | > 16 nodes | > 24:00:00 | | 40 | request to superc@cineca.it
| qos_lowprio | max = 16 nodes | 24:00:00 | | 0 | active projects with exhausted budget
M100 specific information
...
(production queue for academic users)
Examples
> salloc -N1 --ntasks-per-node=2 --cpus-per-task=4 --gres=gpu:2 … --partition=....
export OMP_NUM_THREADS=4
mpirun ./myprogram
Two full cores on one node are requested, as well as 2 GPUs. A hybrid code is executed with 2 MPI tasks and 4 OMP threads, exploiting the HT capability of M100. Since 2 GPUs are used, 16 cores will be accounted to this job.
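The 16-core figure follows from GPU-proportional accounting: with 2 of the node's 4 GPUs, half of the 32 physical cores are billed. A sketch of that arithmetic, inferred from this example and not an official CINECA formula:

```shell
# Cores billed scale with the GPU fraction of the node
# (32 physical cores and 4 GPUs per M100 node) -- an assumption
# drawn from the example above.
gpus_requested=2
billed_cores=$(( 32 * gpus_requested / 4 ))
echo "${billed_cores} cores accounted"   # prints: 16 cores accounted
```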
With Spectrum MPI you need to launch your parallel program with "mpirun". For the OpenMPI environment, both srun and mpirun can be used.
> salloc -N1 --ntasks-per-node=4 --cpus-per-task=16 --gres=gpu:2 --partition=...
With Spectrum MPI:
export OMP_NUM_THREADS=16
mpirun -n 4 --map-by socket:PE=4 ./myprogram
With OpenMPI:
export OMP_NUM_THREADS=16
export OMP_PLACES=threads
srun -n 4 --ntasks-per-node=4 --cpus-per-task=16 --cpu-bind=core -m block:block ./myprogram
16 full cores are requested, as well as 2 GPUs. The 16x4 (virtual) cpus are used for 4 MPI tasks with 16 OMP threads per task. The -m flag in the srun command specifies the desired process distribution between nodes/sockets/cores (the default is block:cyclic). Please refer to the srun manual for more details on process distribution and binding.
...