...

which establishes a connection to one of the available login nodes. To connect to Marconi100 you can also explicitly indicate one of the login nodes:

> login01-ext.m100.cineca.it
> login02-ext.m100.cineca.it 
> login03-ext.m100.cineca.it
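For example, to reach a specific login node directly (a minimal sketch; replace <username> with your CINECA username):

> ssh <username>@login01-ext.m100.cineca.it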

For information about data transfer from other computers, please follow the instructions and caveats in the dedicated section Data storage or in the document Data Management.
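As an illustration only, a small transfer from your local machine could look like the following sketch (the destination path is a placeholder; check the Data storage section for the correct filesystem to use):

> rsync -avz ./mydata/ <username>@login01-ext.m100.cineca.it:<destination_path>/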

...

From the output of the command it is possible to see that GPU0 and GPU1 are connected via NVLink (NV3), as is the pair GPU2 & GPU3. The first pair is connected to the (virtual) cpus 0-63 (on the first socket), the second to the (virtual) cpus 64-127 (on the second socket). The cpus are numbered from 0 to 127 because of hyperthreading (4 hardware threads per physical core). The two Power9 sockets are connected by a 64 GB/s X Bus, and each of them is connected to 2 GPUs via NVLink 2.0.


(Figure: node topology diagram. Courtesy of IBM)

Knowledge of the node topology is important in order to distribute the parallel tasks and threads of your application correctly and obtain the best performance.
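For instance, inside a job allocation you can inspect the topology and the cpus assigned to you with standard commands (a sketch, not an M100-specific recipe):

> nvidia-smi topo -m                          # GPU/GPU and GPU/cpu connectivity matrix
> grep Cpus_allowed_list /proc/self/status    # virtual cpus available to the current process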

...

The following reports an interactive job on 2 cores and two GPUs. Within the job a parallel (MPI) program using 2 MPI tasks and two GPUs is executed. Since the request of tasks per node (--ntasks-per-node) refers to the 128 (virtual) cpus, if you want 2 physical cores you also have to specify that each task is made of 4 (virtual) cpus (--cpus-per-task=4).

> salloc -N1 --ntasks-per-node=2 --cpus-per-task=4 --gres=gpu:2 -A <account_name> -p <partition_name> --time=01:00:00
salloc: Granted job allocation 1175
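Once the allocation has been granted, the parallel program can be launched from inside it; a minimal sketch (with Spectrum MPI the launcher is mpirun, as discussed later in this section):

> mpirun ./myprogram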

...

#!/bin/bash
#SBATCH -A <account_name>
#SBATCH -p m100_usr_prod
#SBATCH --time 00:10:00     # format: HH:MM:SS
#SBATCH -N 1                # 1 node
#SBATCH --ntasks-per-node=8 # 8 tasks out of 128
#SBATCH --gres=gpu:1        # 1 GPU per node out of 4
#SBATCH --mem=7100          # memory per node out of 246000MB
#SBATCH --job-name=my_batch_job
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<user_email>
srun ./myexecutable
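As a quick sanity check, lines like the following could be added to the script before the srun line to log the allocated resources (a sketch relying on standard SLURM environment variables):

echo "job id:     $SLURM_JOB_ID"            # unique id of this job
echo "node list:  $SLURM_JOB_NODELIST"      # node(s) assigned to the job
echo "tasks/node: $SLURM_NTASKS_PER_NODE"   # as requested with --ntasks-per-node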

Please note that by requesting --ntasks-per-node=8 your job will be assigned 8 logical cpus (hence, the first 2 physical cores with their 4 hardware threads each). You can write your script file (for example script.x) using any editor and then submit it with the command:

...

SLURM partition | Job QOS | # cores / # GPUs per job | max walltime | max running jobs per user / max n. of cpus/nodes/GPUs per user | priority | notes
----------------|---------|--------------------------|--------------|----------------------------------------------------------------|----------|------
m100_all_serial (def. partition) | normal | max = 1 core, 1 GPU, max mem = 7600 MB | 04:00:00 | 4 cpus / 1 GPU | 40 |
m100_usr_prod | m100_qos_dbg | max = 2 nodes | 02:00:00 | 2 nodes / 64 cpus / 8 GPUs | 45 | runs on 12 nodes
m100_usr_prod | normal | max = 16 nodes | 24:00:00 | 10 jobs | 40 | runs on 880 nodes
m100_usr_prod | m100_qos_bprod | min = 17 nodes, max = 256 nodes | 24:00:00 | 256 nodes | 85 | runs on 256 nodes
m100_fua_prod | m100_qos_fuadbg | max = 2 nodes | 02:00:00 | | 45 | runs on 12 nodes
m100_fua_prod | normal | max = 16 nodes | 24:00:00 | | 40 | runs on 68 nodes
 | qos_special | > 16 nodes | > 24:00:00 | | 40 | request to superc@cineca.it
 | qos_lowprio | max = 16 nodes | 24:00:00 | | 0 | active projects with exhausted budget
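For example, to submit a debug job to the production partition, the partition and QOS from the table above can be selected in the job script as in this sketch (account, nodes and walltime are placeholders only):

#SBATCH -A <account_name>
#SBATCH -p m100_usr_prod
#SBATCH --qos=m100_qos_dbg
#SBATCH -N 2
#SBATCH --time=00:30:00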


M100 specific information

...

(production queue for academic users)


Examples

> salloc -N1 --ntasks-per-node=2 --cpus-per-task=4 --gres=gpu:2 --partition=....
export OMP_NUM_THREADS=4
mpirun ./myprogram

Two full cores on one node are requested, as well as 2 GPUs. A hybrid code is executed with 2 MPI tasks and 4 OMP threads per task, exploiting the HT capability of M100. Since 2 GPUs are used (half of the 4 GPUs of a node, which has 32 physical cores), 16 cores will be accounted to this job.

With Spectrum MPI you need to launch your parallel program with "mpirun". For the OpenMPI environment, both srun and mpirun can be used.


> salloc -N1 --ntasks-per-node=4 --cpus-per-task=16 --gres=gpu:2 --partition=...
With Spectrum MPI:

export OMP_NUM_THREADS=16
mpirun -n 4 --map-by socket:PE=4 ./myprogram


With OpenMPI:

export OMP_NUM_THREADS=16
export OMP_PLACES=threads

srun ./myprogram


16 full cores are requested, as well as 2 GPUs. The 16x4 (virtual) cpus are used for 4 MPI tasks with 16 OMP threads per task. The --map-by option of mpirun and the -m flag of srun control the desired process distribution between nodes/sockets/cores (the srun default is block:cyclic). Please refer to the srun and mpirun manuals for more details on process distribution and binding.
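For instance, with OpenMPI the distribution and binding could be made explicit as in the following sketch (standard srun options, shown here only as an illustration of the flags mentioned above):

export OMP_NUM_THREADS=16
export OMP_PLACES=threads
srun -n 4 --cpus-per-task=16 -m block:block --cpu-bind=cores ./myprogram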

...