
...

hostname:              login.m100.cineca.it

early availability:  April 2020

start of production: to be defined (2020)

...

Login nodes: 8 IBM Power9 LC922 login nodes (similar to the compute nodes)


Model: IBM Power AC922 (Witherspoon)

Racks: 55 total (49 compute)
Nodes: 980
Processors: 2x16 cores IBM POWER9 AC922 at 3.1 GHz
Accelerators: 4 x NVIDIA Volta V100 GPUs, Nvlink 2.0, 16GB
Cores: 32 cores/node
RAM: 256 GB/node
Peak Performance: about 32 Pflop/s
Internal Network: Mellanox Infiniband EDR DragonFly+
Disk Space: 8 PB GPFS storage



Access

All the login nodes have an identical environment and can be reached with SSH (Secure Shell) protocol using the "collective" hostname:
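For example, using the collective hostname reported above and a placeholder username:

ssh <username>@login.m100.cineca.it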

...

By default, the srun command uses PMI2 as the MPI type.
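For instance, a minimal parallel launch inside a batch script (task count and executable name are placeholders) could be:

srun -n 32 ./myexecutable              # PMI2 is used as the default MPI type
srun --mpi=pmi2 -n 32 ./myexecutable   # equivalent, selecting the MPI type explicitly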

Please note that:

1) The recommended way to launch parallel tasks in SLURM jobs is with srun. By using srun instead of mpirun you will get full support for process tracking, accounting, task affinity, suspend/resume and other features.

...

The information reported here refers to the general M100 user partitions. The production environment of MARCONI_Fusion for EUROfusion users is discussed in a separate document.

...

For more information and examples of job scripts, see section Batch Scheduler SLURM.


Submitting serial Batch jobs


The m100_all_serial partition is available with a maximum walltime of 4 hours, 6 tasks and 18000 MB of memory per job. It runs on two dedicated nodes and is designed for pre/post-processing serial analysis and for moving your data (via rsync, scp, etc.) in case more than 10 minutes are required to complete the data transfer. In order to use this partition you have to specify the SLURM flag "-p":


#SBATCH -p m100_all_serial
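A complete serial job along these lines might look like the following sketch, where the account name, memory, walltime and the data transfer command are placeholders to adapt:

#!/bin/bash
#SBATCH -p m100_all_serial        # serial partition: max 4 hours, 6 tasks, 18000 MB per job
#SBATCH -A <account_name>         # replace with your project account
#SBATCH -n 1                      # a single serial task
#SBATCH --mem=10000               # MB, within the 18000 MB per-job limit
#SBATCH --time=02:00:00           # within the 4-hour walltime limit

rsync -av ./results/ <user>@<remote_host>:/path/to/archive/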


Submitting Batch jobs for production



sinfo -d lists all the partitions available on M100. Some of them are reserved to dedicated classes of users (for example the *_fua_* partitions are for EUROfusion users):


  • m100_fua_prod and m100_fua_dbg are reserved to EUROfusion users, for production and debugging respectively;
  • m100_usr_prod and m100_usr_dbg are open to academic production.
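The characteristics and limits of a specific partition can also be inspected with standard SLURM commands, for example (the partition name here is only an illustration):

sinfo -p m100_usr_prod                   # summary of the partition state and limits
scontrol show partition m100_usr_prod    # full partition configuration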


Each node exposes itself to SLURM as having 32 cores, 4 GPUs and xx GB of memory. SLURM assigns nodes in a shared way, granting each job only the resources it requests and allowing multiple jobs to run on the same node(s). If you want the node(s) in exclusive mode, ask for all the resources of the node (hence, ncpus=32 or ngpus=4 or all the memory).


The maximum memory which can be requested is 182000 MB; this value guarantees that no memory swapping will occur.
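As a sketch, assuming GPUs are exposed as SLURM generic resources (gres), a request for a full node in exclusive mode could therefore contain:

#SBATCH -N 1
#SBATCH --ntasks-per-node=32      # all 32 cores of the node
#SBATCH --gres=gpu:4              # all 4 GPUs of the node
#SBATCH --mem=182000              # maximum requestable memory (MB)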


For example, to request a single node in a production queue the following SLURM job script can be used:


#!/bin/bash
#SBATCH -N 1
#SBATCH -A <account_name>
#SBATCH --mem=180000 <-- replace with the memory corresponding to 1 core
#SBATCH -p m100_usr_prod
#SBATCH --time 00:05:00
#SBATCH --job-name=my_batch_job
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<user_email>


srun ./myexecutable
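Once saved (e.g. as job.sh, a placeholder name), the script can be submitted and monitored with the standard SLURM commands:

sbatch job.sh            # submit the job script
squeue -u $USER          # check the status of your jobs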


Users with exhausted but still active projects are allowed to keep using the cluster resources, although at a very low priority, by adding the "qos_lowprio" QOS to their job, for example:
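#SBATCH --qos=qos_lowprio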


Summary


In the following table you can find the main features and limits imposed on the queues/partitions of M100.



SLURM partition | QOS | # cores / nodes per job | max walltime | max running jobs per user / max n. of cpus/nodes per user | max memory per node (MB) | priority | notes
m100_all_serial (default partition) | noQOS | max = 6 (max mem = 18000 MB) | 04:00:00 | 6 cpus | 18000 | 40 |
 | qos_rcm | min = 1, max = 48 | 03:00:00 | 1 job / 48 cpus | 182000 | to be defined |
m100_usr_dbg | no QOS | min = 1 node, max = 4 nodes | 00:30:00 | 4 jobs / 4 nodes | 182000 | 40 | runs on 24 dedicated nodes
m100_usr_prod | no QOS | min = 1 node, max = 64 nodes | 24:00:00 | 64 nodes | 182000 | 40 |
 | m100_qos_bprod | min = 65 nodes, max = 256 nodes | 24:00:00 | 1 job / 256 nodes (1 job per account) | 182000 | 85 | #SBATCH -p m100_usr_prod / #SBATCH --qos=m100_qos_bprod
 | qos_special | > 256 nodes | > 24:00:00 | max = 64 nodes per user | 182000 | 40 | #SBATCH --qos=qos_special (request to superc@cineca.it)
 | qos_lowprio | max = 64 nodes | 24:00:00 | 64 nodes | 182000 | 0 | #SBATCH --qos=qos_lowprio




Graphic session


If a graphic session is desired we recommend to use the tool RCM (Remote Connection Manager). For additional information visit the Remote Visualization section of our User Guide.