...
which establishes a connection to one of the available login nodes. To connect to Marconi100 you can also explicitly indicate one of the login nodes:
> login01-ext.m100.cineca.it
> login02-ext.m100.cineca.it
> login03-ext.m100.cineca.it
For information about data transfer from other computers, please follow the instructions and caveats in the dedicated section Data storage or in the document Data Management.
...
From the output of the command it is possible to see that GPU0 and GPU1 are connected with NVLink (NV3), as is the couple GPU2 & GPU3. The first couple is connected to (virtual) cpus 0-63 (on the first socket), the second to (virtual) cpus 64-127 (on the second socket). The cpus are numbered from 0 to 127 because of hyperthreading (four hardware threads per physical core). The two Power9 sockets are connected by a 64 GBps X bus. Each of them is connected with 2 GPUs via NVLink 2.0.
Knowledge of the node topology is important for correctly distributing the parallel threads of your application in order to get the best performance.
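As a quick aid, the SMT4 numbering described above can be decoded with plain shell arithmetic. This is a sketch based on the layout stated here (4 hardware threads per core, 64 logical cpus per socket), not a query of the actual machine:

```shell
# Decode a logical CPU id (0-127) into its physical core and socket,
# assuming the SMT4 layout described above: threads 4k..4k+3 share
# core k, and cpus 0-63 / 64-127 sit on socket 0 / socket 1.
logical_cpu=70
core=$(( logical_cpu / 4 ))     # 4 hardware threads per core
socket=$(( logical_cpu / 64 ))  # 64 logical cpus per socket
echo "cpu ${logical_cpu} -> core ${core}, socket ${socket}"
# prints: cpu 70 -> core 17, socket 1
```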
...
The following reports an interactive job on 2 cores and 2 GPUs. Within the job a parallel (MPI) program using 2 MPI tasks and 2 GPUs is executed. Since the request of tasks per node (--ntasks-per-node) refers to the 128 (virtual) cpus, if you want 2 physical cores you also have to specify that each task is made of 4 (virtual) cpus (--cpus-per-task=4).
> salloc -N1 --ntasks-per-node=2 --cpus-per-task=4 --gres=gpu:2 -A <account_name> -p <partition_name> --time=01:00:00
salloc: Granted job allocation 1175
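Once the allocation is granted, the parallel program can be launched from within the session. A minimal sketch (myprogram is a placeholder for your MPI executable):

```shell
# Inside the salloc session: 2 MPI tasks, each on one physical core
# (4 hardware threads), with one GPU available per task.
# With Spectrum MPI launch via mpirun; with OpenMPI srun also works.
export OMP_NUM_THREADS=1   # no OpenMP threading in this sketch
mpirun ./myprogram
```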
...
#!/bin/bash
#SBATCH -A <account_name>
#SBATCH -p m100_usr_prod
#SBATCH --time 00:10:00 # format: HH:MM:SS
#SBATCH -N 1 # 1 node
#SBATCH --ntasks-per-node=8 # 8 tasks out of 128
#SBATCH --gres=gpu:1 # 1 gpus per node out of 4
#SBATCH --mem=7100 # memory per node out of 246000MB
#SBATCH --job-name=my_batch_job
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<user_email>
srun ./myexecutable
Please note that by requesting --ntasks-per-node=8 your job will be assigned 8 logical cpus (hence, the first 2 physical cores with their 4 hardware threads each). You can write your script file (for example script.x) using any editor, then you submit it using the command:
...
SLURM partition | Job QOS | # cores / # GPUs per job | max walltime | max running jobs per user / max n. of cpus/nodes/GPUs per user | priority | notes
--- | --- | --- | --- | --- | --- | ---
m100_all_serial (def. partition) | normal | max = 1 core, max mem = 7600 MB | 04:00:00 | 4 cpus / 1 GPU | 40 |
m100_usr_prod | m100_qos_dbg | max = 2 nodes | 02:00:00 | 2 nodes / 64 cpus / 8 GPUs | 45 | runs on 12 nodes
m100_usr_prod | normal | max = 16 nodes | 24:00:00 | 10 jobs | 40 | runs on 880 nodes
m100_usr_prod | m100_qos_bprod | min = 17 nodes, max = 256 nodes | 24:00:00 | 256 nodes | 85 | runs on 256 nodes
m100_fua_prod | m100_qos_fuadbg | max = 2 nodes | 02:00:00 | | 45 | runs on 12 nodes
m100_fua_prod | normal | max = 16 nodes | 24:00:00 | | 40 | runs on 68 nodes
| qos_special | > 16 nodes | > 24:00:00 | | 40 | request to superc@cineca.it
| qos_lowprio | max = 16 nodes | 24:00:00 | | 0 | active projects with exhausted budget
M100 specific information
...
(production queue for academic users)
Examples
> salloc -N1 --ntasks-per-node=2 --cpus-per-task=4 --gres=gpu:2 … --partition=....
export OMP_NUM_THREADS=4
mpirun ./myprogram
Two full cores on one node are requested, as well as 2 GPUs. A hybrid code is executed with 2 MPI tasks and 4 OMP threads, exploiting the HT capability of M100. Since 2 GPUs are used, 16 cores will be accounted to this job.
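The 16-core figure follows from GPU-proportional accounting: with 2 of the node's 4 GPUs, half of the 32 physical cores are billed. A sketch of that arithmetic, inferred from this example and not an official CINECA formula:

```shell
# Cores billed scale with the GPU fraction of the node
# (32 physical cores and 4 GPUs per M100 node) -- an assumption
# drawn from the example above.
gpus_requested=2
billed_cores=$(( 32 * gpus_requested / 4 ))
echo "${billed_cores} cores accounted"   # prints: 16 cores accounted
```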
With Spectrum MPI you need to launch your parallel program with "mpirun". For the OpenMPI environment, both srun and mpirun can be used.
> salloc -N1 --ntasks-per-node=4 --cpus-per-task=16 --gres=gpu:2 --partition=...
With Spectrum MPI:
export OMP_NUM_THREADS=16
mpirun -n 4 --map-by socket:PE=4 ./myprogram
With OpenMPI:
export OMP_NUM_THREADS=16
export OMP_PLACES=threads
srun -n 4 --ntasks-per-node=4 --cpus-per-task=16 --cpu-bind=core -m block:block ./myprogram
16 full cores are requested, as well as 2 GPUs. The 16x4 (virtual) cpus are used for 4 MPI tasks with 16 OMP threads per task. The -m flag in the srun command specifies the desired process distribution between nodes/sockets/cores (the default is block:cyclic). Please refer to the srun manual for more details on process distribution and binding.
...