
...

hostname:              login.m100.cineca.it

early availability:  April 2020

start of production: to be defined (2020)

...

Login nodes: 8 IBM Power9 LC922 login nodes (similar to the compute nodes)


Model: IBM Power AC922 (Witherspoon)

Racks: 55 total (49 compute)
Nodes: 980
Processors: 2x16 cores IBM POWER9 AC922 at 3.1 GHz
Accelerators: 4 x NVIDIA Volta V100 GPUs, Nvlink 2.0, 16GB
Cores: 32 cores/node
RAM: 256 GB/node
Peak Performance: about 32 Pflop/s
Internal Network: Mellanox Infiniband EDR DragonFly+
Disk Space: 8 PB GPFS storage



Access

All the login nodes have an identical environment and can be reached with SSH (Secure Shell) protocol using the "collective" hostname:
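For example, using the collective hostname reported above and a placeholder username:

ssh <username>@login.m100.cineca.it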

...

By default, the srun command uses PMI2 as the MPI type.
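For instance, a minimal parallel launch inside a batch script (task count and executable name are placeholders) could be:

srun -n 32 ./myexecutable              # PMI2 is used as the default MPI type
srun --mpi=pmi2 -n 32 ./myexecutable   # equivalent, selecting the MPI type explicitly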

Please note that:

1) The recommended way to launch parallel tasks in SLURM jobs is with srun. By using srun instead of mpirun you will get full support for process tracking, accounting, task affinity, suspend/resume and other features.

...

The information reported here refers to the general M100 user partitions. The production environment of MARCONI_Fusion for EUROfusion users is discussed in a separate document.

...

For more information and examples of job scripts, see section Batch Scheduler SLURM.


Submitting serial Batch jobs


The m100_all_serial partition is available with a maximum walltime of 4 hours, 6 tasks and 18000 MB of memory per job. It runs on two dedicated nodes and is designed for pre/post-processing serial analysis and for moving your data (via rsync, scp, etc.) in case more than 10 minutes are required to complete the data transfer. In order to use this partition you have to specify the SLURM flag "-p":


#SBATCH -p m100_all_serial
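A complete serial job along these lines might look like the following sketch, where the account name, memory, walltime and the data transfer command are placeholders to adapt:

#!/bin/bash
#SBATCH -p m100_all_serial        # serial partition: max 4 hours, 6 tasks, 18000 MB per job
#SBATCH -A <account_name>         # replace with your project account
#SBATCH -n 1                      # a single serial task
#SBATCH --mem=10000               # MB, within the 18000 MB per-job limit
#SBATCH --time=02:00:00           # within the 4-hour walltime limit

rsync -av ./results/ <user>@<remote_host>:/path/to/archive/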


Submitting Batch jobs for production



sinfo -d lists all the partitions available on M100. Some of them are reserved to dedicated classes of users (for example the *_fua_* partitions are for EUROfusion users):


  • m100_fua_prod and m100_fua_dbg are reserved to EUROfusion users, for production and debugging respectively;
  • m100_usr_prod and m100_usr_dbg are open to academic production.
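The characteristics and limits of a specific partition can also be inspected with standard SLURM commands, for example (the partition name here is only an illustration):

sinfo -p m100_usr_prod                   # summary of the partition state and limits
scontrol show partition m100_usr_prod    # full partition configuration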


Each node exposes itself to SLURM as having 32 cores, 4 GPUs and xx GB of memory. SLURM assigns nodes in a shared way, granting each job only the resources it requests and allowing multiple jobs to run on the same node(s). If you want the node(s) in exclusive mode, ask for all the resources of the node (hence, ncpus=32 or ngpus=4 or all the memory).


The maximum memory which can be requested is 182000 MB; this value guarantees that no memory swapping will occur.
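As a sketch, assuming GPUs are exposed as SLURM generic resources (gres), a request for a full node in exclusive mode could therefore contain:

#SBATCH -N 1
#SBATCH --ntasks-per-node=32      # all 32 cores of the node
#SBATCH --gres=gpu:4              # all 4 GPUs of the node
#SBATCH --mem=182000              # maximum requestable memory (MB)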


For example, to request a single node in a production queue the following SLURM job script can be used:


#!/bin/bash
#SBATCH -N 1
#SBATCH -A <account_name>
#SBATCH --mem=180000 <-- replace with the memory corresponding to 1 core
#SBATCH -p m100_usr_prod
#SBATCH --time 00:05:00
#SBATCH --job-name=my_batch_job
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<user_email>


srun ./myexecutable
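Once saved (e.g. as job.sh, a placeholder name), the script can be submitted and monitored with the standard SLURM commands:

sbatch job.sh            # submit the job script
squeue -u $USER          # check the status of your jobs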


Users with exhausted but still active projects are allowed to keep using the cluster resources, although at a very low priority, by adding the "qos_lowprio" QOS to their job, for example:
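#SBATCH --qos=qos_lowprio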


Summary


In the following table you can find the main features and limits imposed on the queues/partitions of M100.



SLURM partition | QOS | # cores / nodes per job | max walltime | max running jobs per user / max n. of cpus/nodes per user | max memory per node (MB) | priority | notes
m100_all_serial (default partition) | noQOS | max = 6 (max mem = 18000 MB) | 04:00:00 | 6 cpus | 18000 | 40 |
 | qos_rcm | min = 1, max = 48 | 03:00:00 | 1 job / 48 cpus | 182000 | to be defined |
m100_usr_dbg | no QOS | min = 1 node, max = 4 nodes | 00:30:00 | 4 jobs / 4 nodes | 182000 | 40 | runs on 24 dedicated nodes
m100_usr_prod | no QOS | min = 1 node, max = 64 nodes | 24:00:00 | 64 nodes | 182000 | 40 |
 | m100_qos_bprod | min = 65 nodes, max = 256 nodes | 24:00:00 | 1 job / 256 nodes (1 job per account) | 182000 | 85 | #SBATCH -p m100_usr_prod / #SBATCH --qos=m100_qos_bprod
 | qos_special | > 256 nodes | > 24:00:00 | max = 64 nodes per user | 182000 | 40 | #SBATCH --qos=qos_special (request to superc@cineca.it)
 | qos_lowprio | max = 64 nodes | 24:00:00 | 64 nodes | 182000 | 0 | #SBATCH --qos=qos_lowprio




Graphic session


If a graphic session is desired we recommend to use the tool RCM (Remote Connection Manager). For additional information visit the Remote Visualization section of our User Guide.