...

Since all the filesystems are based on the IBM Spectrum Scale™ file system (formerly GPFS), the usual Unix command "quota" does not work. Use the local command cindata to query disk usage and quota ("cindata -h" for help):

> cindata

Modules environment

As usual, the software modules are collected in different profiles and organized by functional category (compilers, libraries, tools, applications, ...).

The profiles are of two types: “domain” profiles (chem, phys, lifesc, ...) for production activity, and “programming” profiles (base and advanced) for compilation, debugging and profiling activities. They can be loaded together.

The "Base" profile is the default one. It is automatically loaded after login and contains the basic modules for programming activities (xl, pgi and gnu compilers, math libraries, profiling and debugging tools, ...).

If you want to use a module placed under another profile, for example an application module, you will have to load the corresponding profile first:

>module load profile/<profile name>
>module load autoload <module name>
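For instance, to reach an application module published under the chem domain profile (the profile name is taken from the list above; the module name below is just a placeholder):

>module load profile/chem
>module load autoload <application_module_name>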

To list all the profiles (and modules) you have loaded, use the following command:

>module list

To see all the profiles, categories and modules available on M100, the command "modmap" is available:

>modmap
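For example, modmap can also be queried for a single module to see under which profile it is published (the -m option and the module name below are given as an illustrative sketch):

>modmap -m <module_name>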

Spack

...

GPU and intra/inter connection environment

Marconi100 login and compute nodes host four Tesla Volta (V100) GPUs per node (CUDA compute capability 7.0). The most recent versions of the NVIDIA CUDA toolkit and of the Community Edition PGI compilers (supporting CUDA Fortran) are available in the module environment, together with a set of GPU-enabled libraries, applications and tools.
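As a sketch, the corresponding modules can be loaded in the usual way (module names are illustrative; use modmap to check what is actually installed):

> module load cuda
> module load pgi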

The topology of the node devices is as follows:

$ nvidia-smi topo -m


....... (output to be inserted) .......


The internode communication is based on a Mellanox InfiniBand EDR network, and the OpenMPI and IBM Spectrum MPI libraries are configured so as to exploit the Mellanox Fabric Collective Accelerators (also on CUDA memories) and Messaging Accelerators.

NVIDIA GPUDirect technology is fully supported (shared memory, peer-to-peer, RDMA, async), enabling the use of CUDA-aware MPI.


Production environment

Since M100 is a general purpose system used by several users at the same time, long production jobs must be submitted through a queuing system. This guarantees that access to the resources is as fair as possible.
Roughly speaking, there are two different modes of using an HPC system: interactive and batch. For a general discussion see the section Production Environment and Tools.

Each node of Marconi100 consists of two Power9 sockets, each with 16 cores and 2 Volta GPUs (32 cores and 4 GPUs per node). Multi-threading is active with 4 threads per physical core (128 logical cpus in total).

Due to how the hardware is detected on a Power9 architecture, the numbering of (logical) cores follows the order of threading:

$ ppc64_cpu --info

Core   0:    0*    1*    2*    3*
Core   1:    4*    5*    6*    7*
Core   2:    8*    9*   10*   11* 
Core   3:   12*   13*   14*   15*

.............. (Cores from 4 to 27)........................

Core  28:  112*  113*  114*  115*
Core  29:  116*  117*  118*  119*  
Core  30:  120*  121*  122*  123*
Core  31:  124*  125*  126*  127*


Since the nodes can be shared by users, Slurm has been configured to allocate one task per physical core by default. Without this configuration, one task would by default be allocated per thread on nodes with more than one ThreadsPerCore (as is the case on Marconi100).

As a result of this configuration, each requested task is allocated a physical core with all of its 4 threads. The use of --cpus-per-task as an sbatch directive is hence discouraged, as it may lead to an incorrect allocation. You can then exploit the multi-threading capability by running up to 4 processes or threads on each physical core, or by combining MPI processes and OpenMP threads as appropriate for your application.

Since a physical core (4 HTs) is assigned to one task, a maximum of 32 tasks per node can be requested (--ntasks-per-node), each task receiving, as mentioned, 4 logical cpus.
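As a minimal sketch, a batch request filling one node with one task per physical core (and therefore all 128 logical cpus) would contain:

#SBATCH -N 1
#SBATCH --ntasks-per-node=32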

Interactive

A serial program can be executed in the standard UNIX way:

> ./program

This is allowed only for very short runs on the login nodes, since the interactive environment has a 10-minute time limit.

A serial (or multithreaded) program using GPUs and needing more than 10 minutes can be executed interactively within an "Interactive" SLURM batch job, using the "srun" command: the job is queued and scheduled as any other job but, when executed, the remote standard input, output, and error streams are connected to the terminal session from which srun was launched.

For example, to start an interactive session on one node and one GPU launch the command:

> srun -N1 --ntasks-per-node=1 --gres=gpu:1 -A <account_name> --time=01:00:00 --pty /bin/bash

SLURM will then schedule your job to start, and your shell will be unresponsive until free resources are allocated for you. When the shell comes back with the prompt (the hostname at the prompt will be that of the assigned node), launch the program in the standard way:

> ./program

As mentioned above, the accounting of the consumed core hours takes into account also the memory and the number of requested GPUs (see the dedicated section). For instance, a job using one core and one GPU for one hour (with the default memory per core) will consume 8 core-hours (each node being equipped with 32 physical cores and 4 V100 GPUs). 
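In other words, since a node offers 32 physical cores and 4 GPUs, one GPU is accounted as 32/4 = 8 cores; one GPU for one hour therefore costs 8 x 1 = 8 core-hours, regardless of the single core actually used.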

A parallel (MPI) program using GPUs and needing more than 10 minutes can likewise be executed within an interactive SLURM batch job, using the "salloc" command in place of "srun --pty bash". For instance:

> salloc -N1 --ntasks-per-node=16 --gres=gpu:2 -A <account_name> --time=01:00:00 

Again, the job is queued and scheduled as any other job and, when executed, a new session starts on the login node from which salloc was launched (the hostname at the prompt will be that of the login node). You can now run your parallel program on the assigned compute node(s) as in any Slurm parallel job:

> srun ./myprogram

or

> mpirun ./myprogram

srun/mpirun will dispatch the tasks of the program myprogram to the assigned compute node, i.e., the tasks do not run on the login node hosting the salloc session.

Please note that the recommended way to launch parallel tasks in Slurm jobs is with srun. By using srun rather than mpirun you get full support for process tracking, accounting, task affinity, suspend/resume and other features.

A hybrid parallel (MPI/OpenMP) program using GPUs and needing more than 10 minutes can also be executed in an interactive SLURM batch job with the "salloc" command. For instance:

> salloc -N1 --ntasks-per-node=4 --cpus-per-task=4 --gres=gpu:2 -A <account_name> --time=01:00:00
> export OMP_NUM_THREADS=4
> srun ./myprogram

The above request reflects the configuration of assigning a physical core with all of its four threads. But you can choose the tasks/threads ratio that better suits your application, and ask for a number of tasks such that the number of logical cores equals the number of MPI processes times the number of OpenMP threads per task. For instance, for 4 MPI processes and 16 OpenMP threads per task you need 64 logical cores, hence 16 physical cores:

> salloc -N1 --ntasks-per-node=16 --gres=gpu:2 -A <account_name> --time=01:00:00   # this will assign 16 physical cores with 4 HTs each
> export OMP_NUM_THREADS=16
> srun --ntasks-per-node=4 (--cpu-bind=core) --cpus-per-task=16 -m block:block ./myprogram

The -m flag allows you to specify the desired process distribution among nodes/sockets/cores (the default is block:cyclic). Please refer to the srun manual for more details on process distribution and binding. Note that the --cpu-bind=core flag is required to obtain the correct process binding in case the -m flag is not used.

You can then set the OpenMP affinity to threads by exporting the OMP_PLACES variable.
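For instance, to pin each OpenMP thread to a hardware thread (a standard OpenMP setting, shown here only as an example):

> export OMP_PLACES=threads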

For all the mentioned cases, SLURM automatically exports the environment variables defined in your current shell, so if you need to run your program "myprogram" in a controlled environment (i.e. specific library paths or options), you can prepare that environment in the originating shell and be sure to find it in the interactive shell (whether started with srun or with salloc).
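As an illustrative sketch (the path below is hypothetical), the environment prepared before the allocation is found unchanged inside it:

> export LD_LIBRARY_PATH=$HOME/mylibs/lib:$LD_LIBRARY_PATH
> salloc -N1 --ntasks-per-node=4 -A <account_name> --time=00:30:00
> srun ./myprogram    # runs with the LD_LIBRARY_PATH set above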

...

The m100_all_serial partition is available with a maximum walltime of 4 hours, 1 task and 7600 MB per job. It runs on two dedicated nodes (equipped with 4 Volta GPUs), and it is designed for pre/post-processing serial analysis (with or without GPUs) and for moving data (via rsync, scp, etc.) in case more than 10 minutes are required to complete the transfer. This is the default partition, assumed by SLURM if you do not explicitly request a partition with the flag "--partition" or "-p". You can however explicitly request it in your batch script with the directive:
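By analogy with the other partition directives shown below, this is presumably:

#SBATCH -p m100_all_serial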

...

Submitting Batch jobs for production

Not all of the partitions available on M100 are open to the academic community; some are reserved to dedicated classes of users (for example, the *_fua_* partitions are reserved to EUROfusion users):

  • m100_fua_prod and m100_fua_dbg are reserved to EUROfusion users, for production and debugging respectively
  • m100_usr_prod and m100_usr_dbg are open to academic production.

Each node exposes itself to SLURM as having 32 cores, 4 GPUs and XXXX memory. SLURM assigns nodes in a shared way, allocating to each job only the resources it requires and allowing multiple jobs to run on the same node(s). If you want node(s) in exclusive mode, ask for all of the node's resources (either --ntasks-per-node=32 or --mem=XXXX).

The maximum memory which can be requested is 230000 MB (an average of ~7 GB per physical core); this value guarantees that no memory swapping will occur.
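For instance, to obtain a node in exclusive mode you can request either all of its cores or all of its memory (a sketch based on the limits quoted above):

#SBATCH --ntasks-per-node=32    # all 32 physical cores
or
#SBATCH --mem=230000            # all the requestable node memory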

For example, to request one core and one GPU in a production queue the following SLURM job script can be used:

#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1 # this refers to the number of requested gpus per node, and can vary between 1 and 4
#SBATCH -A <account_name>
#SBATCH --mem=7100 # this refers to the requested memory per node with a maximum of XXXXXX
#SBATCH -p m100_usr_prod
#SBATCH --time 00:10:00 # format: HH:MM:SS
#SBATCH --job-name=my_batch_job
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<user_email>

...
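Once the script is complete, it can be submitted with sbatch (the file name is illustrative):

> sbatch my_batch_job.sh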

In the following table you can find the main features and limits imposed on the queues/partitions of M100.



SLURM partition | Job QOS | # cores / # GPUs per job | max walltime | max running jobs per user / max n. of cpus/nodes/GPUs per user | max memory per node (MB) | priority | notes
m100_all_serial (default partition) | normal | max = 1 core, 1 GPU (max mem = 7600 MB) | 04:00:00 | 4 cpus / 1 GPU | - | 40 |
m100_usr_dbg | normal | max = 2 nodes | 02:00:00 | 2 nodes / 64 cpus / 8 GPUs | XXXXX | 40 | runs on 12 nodes
m100_usr_prod | normal | max = 16 nodes | 24:00:00 |  | XXXXX | 40 | runs on 880 nodes
m100_usr_prod | m100_qos_bprod | min = 17 nodes, max = 256 nodes | 24:00:00 | 256 nodes | XXXXX | 85 | runs on 256 nodes; #SBATCH -p m100_usr_prod, #SBATCH --qos=m100_qos_bprod
m100_fua_dbg | normal | max = 2 nodes | 02:00:00 |  | XXXXX | 40 | runs on 12 nodes
m100_fua_prod | normal | max = 16 nodes | 1-00:00:00 |  | XXXXX | 40 | runs on 68 nodes
m100_usr_preempt |  | max = 16 nodes | 08:00:00 |  |  | 10 |
 | qos_special | > 256 nodes (max = 64 nodes per user) | > 24:00:00 |  | XXXXX | 40 | #SBATCH --qos=qos_special; request to superc@cineca.it
 | qos_lowprio | max = 16 nodes | 24:00:00 | 64 nodes | XXXXX | 0 | #SBATCH --qos=qos_lowprio




Graphic session


If a graphic session is desired, we recommend using the tool RCM (Remote Connection Manager). For additional information visit the Remote Visualization section of our User Guide.

...