...

hostname: login.m100.cineca.it

...

Architecture: IBM Power 9 AC922
Internal Network: Mellanox Infiniband EDR DragonFly+ 100 Gb/s
Storage: 8 PB (raw) GPFS of local storage

...

Model: IBM Power AC922 (Witherspoon)

Racks: 55 total (49 compute)
Nodes: 980
Processors: 2x16 cores IBM POWER9 AC922 at 2.6(3.1) GHz
Accelerators: 4 x NVIDIA Volta V100 GPUs/node, Nvlink 2.0, 16GB
Cores: 32 cores/node, Hyperthreading x4
RAM: 256 GB/node (242 usable)
Peak Performance: about 32 Pflop/s, 32 TFlops per node
Internal Network: Mellanox IB EDR DragonFly+ 100Gb/s
Disk Space: 8PB raw GPFS storage

...

> login01-ext.m100.cineca.it
> login02-ext.m100.cineca.it 
> login03-ext.m100.cineca.it

For information about data transfer from other computers, please follow the instructions and caveats in the dedicated section Data storage or in the document Data Management.

...

Since all the filesystems are based on the IBM Spectrum Scale™ file system (formerly GPFS), the usual unix command "quota" does not work. Use the local command cindata to query disk usage and quota ("cindata -h" for help):

> cindata

Dedicated node for Data transfer

A time limit of 10 cpu-minutes has been set for processes running on login nodes.
For data transfers that may require more time, we set up a dedicated "data" VM accessible through a dedicated alias.
Login via ssh to this VM is not allowed. Environment variables such as $HOME or $WORK are not defined there, so you always have to specify the complete path to the files you need to copy.
For example, to copy data to M100 using rsync you can run the following command:

rsync -PravzHS </data_path_from/file> <your_username>@data.m100.cineca.it:<complete_data_path_to>

You can also use the "data" VM from the login nodes to move data from Marconi100 to another location with a public IP:

ssh -xt data.m100.cineca.it rsync -PravzHS <complete_data_path_from/file> </data_path_to>

This command opens a session on the VM that is not closed until the rsync command has completed.

You can use the scp and sftp commands in a similar way if you prefer them.
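As an illustration of the scp variant, the following minimal sketch builds the command line against the data.m100.cineca.it alias and prints it instead of running it, so it can be checked before use. The helper name build_scp_cmd, the user name and both paths are hypothetical placeholders; the destination is written as a complete path because $HOME and $WORK are not defined on the "data" VM.

```shell
# Sketch: compose an scp command targeting the M100 "data" VM.
# build_scp_cmd, the user name and the paths are illustrative placeholders.
build_scp_cmd() {
    # $1 = local source file, $2 = M100 username, $3 = complete destination path
    printf 'scp %s %s@data.m100.cineca.it:%s\n' "$1" "$2" "$3"
}

# Print the command (dry run); copy and run it once the paths are correct.
SCP_CMD=$(build_scp_cmd /home/me/results.tar my_user /full/path/on/m100/results.tar)
echo "$SCP_CMD"
```

Printing the command first is only a convenience of this sketch; the same source and destination arguments can be passed to scp directly.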

Modules environment

The software modules are collected in different profiles and organized by functional categories (compilers, libraries, tools, applications, ...). The profiles are of two types: “programming” type (base and advanced) for compilation, debugging and profiling activities, and “domain” type (chem-phys, lifesc, ...) for the production activity. They can be loaded together.

...

Internode communication is based on a Mellanox Infiniband EDR 100 Gb/s network, and the OpenMPI and IBM Spectrum MPI libraries are configured to exploit the Mellanox Fabric Collective Accelerators (also on CUDA memories) and Messaging Accelerators.

...

In the following table you can find the main features and limits imposed on the partitions of M100.

Note: core refers to a physical cpu, with its 4 HTs; cpu refers to a logical cpu (1 HT). Each node has 32 cores/128 cpus.

SLURM partition	Job QOS	# cores/# GPU per job	max walltime	max running jobs per user/max n. of cores/nodes/GPUs per user	priority	notes
m100_all_serial (default partition)	normal	max = 1 core, 1 GPU; max mem = 7600MB	04:00:00	4 cpus/1 GPU	40	
m100_all_serial (default partition)	qos_install	max = 16 cores	04:00:00	max = 16 cores, 1 job per user	40	request to superc@cineca.it
m100_usr_prod	normal	max = 16 nodes	24:00:00		40	runs on 880 nodes
	m100_qos_dbg	max = 2 nodes	02:00:00	2 nodes/64cores/8GPUs	80	runs on 12 nodes
	m100_qos_bprod	min = 17 nodes, max = 256 nodes	24:00:00	256 nodes	60	runs on 512 nodes; min is 17 FULL nodes (544 cores, 2176 cpus)
m100_usr_preempt	normal	max = 16 nodes	24:00:00		1	runs on 99 nodes
m100_fua_prod (EUROFUSION)	normal	max = 16 nodes	24:00:00		40	runs on 87 nodes
	m100_qos_fuadbg	max = 2 nodes	02:00:00		45	runs on 12 nodes
	m100_qos_fuabprod	min = 17 nodes, max = 32 nodes	24:00:00		40	runs on 64 nodes at the same time
all partitions	qos_special	> 32 nodes	> 24:00:00		40	request to superc@cineca.it
all partitions (NO EUROFUSION)	qos_lowprio	max = 16 nodes	24:00:00		0	active projects with exhausted budget; request to superc@cineca.it
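The core/cpu distinction used in the table can be checked with a few lines of shell arithmetic. This is only a sketch of the bookkeeping: the variable names are illustrative, and the figures (32 cores per node, 4 hardware threads per core, 17-node minimum for m100_qos_bprod) are the ones stated in the note and table above.

```shell
# Sketch: core (physical) vs cpu (logical, 1 HT) bookkeeping on M100.
# Variable names are illustrative; the numbers come from the note and table.
CORES_PER_NODE=32
HT_PER_CORE=4
CPUS_PER_NODE=$((CORES_PER_NODE * HT_PER_CORE))

# Smallest request accepted by m100_qos_bprod: 17 full nodes.
BPROD_MIN_NODES=17
BPROD_MIN_CORES=$((BPROD_MIN_NODES * CORES_PER_NODE))
BPROD_MIN_CPUS=$((BPROD_MIN_NODES * CPUS_PER_NODE))

echo "cpus per node: ${CPUS_PER_NODE}"
echo "bprod minimum: ${BPROD_MIN_CORES} cores, ${BPROD_MIN_CPUS} cpus"
```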

The partition m100_usr_preempt allows users to access the additional nodes of the m100_fua_prod partition in preemptable mode (if they are available and not used by the Eurofusion community). Jobs submitted to the m100_usr_preempt partition may be killed if the assigned resources are requested by jobs submitted to the higher-priority partition (m100_fua_prod); hence we recommend its use only with restartable applications.
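A restartable job on this partition could be organized along the following lines. This is a hedged sketch, not a site-provided template: the checkpoint file name, the account placeholder, the node count and the commented launch line are all assumptions, and the way an application actually resumes depends on the application itself.

```shell
#!/bin/bash
#SBATCH --partition=m100_usr_preempt
#SBATCH --nodes=1
#SBATCH --time=24:00:00
#SBATCH --account=<your_account>   # placeholder, replace with your project account

# Hypothetical checkpoint file written periodically by the application.
CHECKPOINT="${CHECKPOINT:-checkpoint.dat}"

# Resume from the checkpoint if a previous (preempted) run left one behind.
if [ -f "$CHECKPOINT" ]; then
    START_MODE="restart"
else
    START_MODE="fresh"
fi
echo "start mode: ${START_MODE}"

# Hypothetical launch line; the executable and its option are assumptions:
# mpirun ./my_app --mode "$START_MODE"
```

Resubmitting the same script after a preemption then continues from the last checkpoint instead of starting over.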

Users with exhausted but still active projects are allowed to keep using the cluster resources, even if at a very low priority, by adding the "qos_lowprio" flag to their job:

#SBATCH --qos=qos_lowprio

This QOS is not active for EUROFusion projects: a different dedicated QOS (qos_fualowprio) is automatically associated with Eurofusion users once their projects exhaust the budget before their expiry date.
All other users should ask superc@cineca.it for the QOS association.

Eurofusion users can also use the computing resources at low priority before their budget is exhausted, in case they wish to run non-urgent jobs without consuming the budget of the granted project. Please ask superc@cineca.it to be added to the account FUAC6_LOWPRIO, and specify this account and the qos_fualowprio QOS in your submission script.

...

#SBATCH --ntasks-per-node=8  
#SBATCH --cpus-per-task=4

In this way each task will correspond to one (physical) core.
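The correspondence between tasks and physical cores follows from simple arithmetic, sketched below. The variable names are illustrative, and the sketch assumes the 4 hardware threads per core stated for M100.

```shell
# Sketch: why --ntasks-per-node=8 with --cpus-per-task=4 gives one core per task.
# Variable names are illustrative; 4 HTs per core is the M100 figure.
NTASKS_PER_NODE=8
CPUS_PER_TASK=4
HT_PER_CORE=4

# Logical cpus requested on the node, and the physical cores they occupy.
CPUS_REQUESTED=$((NTASKS_PER_NODE * CPUS_PER_TASK))
CORES_REQUESTED=$((CPUS_REQUESTED / HT_PER_CORE))
CORES_PER_TASK=$((CPUS_PER_TASK / HT_PER_CORE))

echo "logical cpus requested: ${CPUS_REQUESTED}"
echo "physical cores requested: ${CORES_REQUESTED}"
echo "physical cores per task: ${CORES_PER_TASK}"
```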

M100 GPU use report

Statistics on the GPU utilization during a job, provided by the NVIDIA dcgmi tool, can be obtained at the end of the job by explicitly requesting the constraint "gpureport":

#SBATCH --constraint=gpureport (or -C gpureport)

This option will result, at the end of the job, in producing a file for each of the nodes assigned to the job with the relevant information on the employed GPUs (performance statistics, such as Energy Consumed, Power Usage, Max GPU Memory Used, GPU and Memory Used etc.; Event Stats, as ECC Errors etc.; Slowdown Stats; Overall Health). The files are named "dcgmi_stats_<node_name>_<jobid>.out".
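To locate the reports after a job, the file names can be rebuilt from the naming pattern above. The helper below is a minimal sketch; the function name, the node name and the job id in the example call are hypothetical.

```shell
# Sketch: rebuild the gpureport file name for a given node and job id.
# gpu_report_name and the example values are illustrative placeholders.
gpu_report_name() {
    # $1 = node name, $2 = SLURM job id
    printf 'dcgmi_stats_%s_%s.out\n' "$1" "$2"
}

REPORT=$(gpu_report_name node001 1234567)
echo "$REPORT"
```

Inside a job, the node names could be taken from `scontrol show hostnames "$SLURM_JOB_NODELIST"` and the job id from `$SLURM_JOB_ID`.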

Submitting serial batch jobs

...

m100_fua_* partitions are reserved for EUROFusion users.
m100_usr_* partitions are open to academic production.

...

16 full cores and 2 GPUs are requested. The 16x4 (virtual) cpus are used for 4 MPI tasks and 16 OMP threads per task. The -m flag in the srun command specifies the desired process distribution between nodes/sockets/cores (the default is block:cyclic). Please refer to the srun manual for more details on process distribution and binding. The --map-by socket:PE=4 option will assign and bind 4 consecutive physical cores to each process (see process mapping and binding in the official IBM Spectrum MPI manual).

> salloc  -N1 --ntasks-per-node=32 --cpus-per-task=4 --gres=gpu:2 --partition=...

...

Other batch job examples on M100 are available in a dedicated page. You can find more information on process mapping and binding in the official IBM Spectrum MPI manual.

Graphic session

If a graphic session is desired, we recommend using the RCM (Remote Connection Manager) tool. For additional information, visit the Remote Visualization section of our User Guide.

...

Invocations	Usage
pgcc	Compile C source files.
pgc++	Compile C++ source files.
pgf77	Compile FORTRAN 77 source files
pgf90	Compile FORTRAN 90 source files
pgf95	Compile FORTRAN 95 source files

...

On Marconi100 only the Command Line Interface (CLI) is available, since the GUI does not support Power9 nodes. We suggest running the CLI inside your job script in order to generate the qdrep files. You can then download the qdrep files to your local PC and visualize them with the Nsight Systems GUI available on your workstation.

The profiler is available under the modules hpc-sdk, cuda/11.0 and later versions.

Standard usage of an MPI job running on GPU is

...

This will place the temporary outputs of nsys in your TMPDIR folder, which by default is /scratch_local/slurm_job.$SLURM_JOB_ID, where you have 1 TB of free space.
This workaround may cause conflicts between multiple jobs running this profiler on a compute node at the same time, so we strongly suggest also requesting the compute node exclusively:

#SBATCH --exclusive

Nsight Systems can also collect kernel IP samples and backtraces. However, this is prevented by the perf event paranoid level being set to 2 on Marconi100. It is possible to bypass this restriction by adding the SLURM directive:

#SBATCH --gres=sysfs

Use it together with the --exclusive directive shown above.

MPI environment

We offer two options for MPI environment on Marconi100:

...

Here you can find some useful details on how to use them on Marconi100.

Warning: When you compile your code using the XL compiler with Spectrum-MPI parallel library (our recommended software stack) you have to use mpirun (not srun) to execute your program.

Spectrum-MPI

Spectrum-MPI is an IBM implementation of MPI. Together with the XL compiler, it is the recommended environment on Marconi100.
Compared with OpenMPI, it adds unique features optimized for IBM systems, such as CPU affinity features, dynamic selection of interface libraries, workload manager integrations and better performance.
Spectrum-MPI supports both CUDA-aware and GPUDirect technologies.

...
