Table of Contents

...

hostname:                login.m100.cineca.it

early availability:  April 20, 2020

start of production: to be defined (April 27, 2020)

...

This system will be in production at the beginning of 2020 as an upgrade of the "non conventional" partition of the Marconi Tier-0 system. It is an accelerated cluster based on Power9 chips and Volta NVIDIA GPUs, acquired by Cineca within the PPI4HPC European initiative.

System Architecture

...

More technical details on this architecture can be found in the IBM RedBooks series:

https://www.redbooks.ibm.com/redpapers/pdfs/redp5494.pdf

...

For information about data transfer from other computers, please follow the instructions and caveats in the dedicated section Data storage or in the document Data Management.

...

For accounting information please consult our dedicated section.

The account_no (or project) is important for batch executions. You need to indicate the account_no to be charged for the job in the scheduler, using the flag "-A".

...

With the "saldo -b" command you can list all the account_no associated to with your username. 

saldo -b   (reports projects defined on M100 )
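As a minimal sketch, the account can be specified either on the command line or inside the job script (both <account_no> and job.sh are placeholders; use one of the projects reported by "saldo -b"):

> sbatch -A <account_no> job.sh

#SBATCH -A <account_no>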

Please note that the accounting is in terms of consumed core hours, but it also strongly depends on the requested memory and number of GPUs; please refer to the dedicated section.

Budget Linearization policy

...

Starting from the first day of each month, the collaborators of any account are allowed to use the quota at full priority. As the budget is consumed, the jobs submitted from the account gradually lose priority, until the monthly budget (monthTotal) is fully consumed. At that point, their jobs will still be considered for execution, but with a lower priority than the jobs from accounts that still have some monthly quota left.

This policy is similar to those already applied by other major HPC centers in Europe and worldwide. The goal is to improve the response time, giving users the opportunity to use the cpu hours assigned to their project in relation to the project's actual size (total amount of core-hours).

...

The storage organization conforms to the CINECA infrastructure (see Section Data Storage and Filesystems). 

In addition to the home directory $HOME, a scratch area $CINECA_SCRATCH is defined for each user: a large disk for the storage of run-time data and files.
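As a quick illustration of how the two areas are typically used (the directory name is only a placeholder):

> cd $HOME                  # small, permanent files (sources, configuration)
> cd $CINECA_SCRATCH        # large run-time data and files
> mkdir -p myrun            # placeholder working directory for a run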

...

GPU0     X     NV3    SYS    SYS    0-63
GPU1    NV3     X     SYS    SYS    0-63
GPU2    SYS    SYS     X     NV3    64-127
GPU3    SYS    SYS    NV3     X     64-127

From the output of the command it is possible to see that GPU0 and GPU1 are connected with NVLink, as is the couple GPU2 & GPU3. The first couple is connected to cpus 0-63, the second to cpus 64-127. The cpus are numbered from 0 to 127 because of the four-way hyperthreading: 32 physical cores x 4 → 128 cpus.

Knowledge of the node topology is important to correctly distribute the parallel threads of your applications in order to get the best performance.
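As an illustrative sketch only (the resource numbers and the program name are placeholders, to be adapted to your application and to the Slurm configuration of the cluster), a hybrid MPI/OpenMP job can follow this topology by placing one task per GPU and binding each task's threads to cores:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4          # one task per GPU
#SBATCH --cpus-per-task=32           # example value: one quarter of the 128 logical cpus per task
#SBATCH --gres=gpu:4

export OMP_NUM_THREADS=8             # example value: one thread per physical core assigned to the task
export OMP_PLACES=cores

srun --cpu-bind=cores ./myprogram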

...

The software modules are collected in different profiles and organized by functional category (compilers, libraries, tools, applications, ...).

...

Since the nodes can be shared by users, Slurm has been configured to allocate one task per (physical) core by default. Without this option, one task would be allocated per thread on nodes with more than one ThreadsPerCore (as is the case on Marconi100).

...

As mentioned above, the accounting of the consumed core hours also takes into account the memory and the number of requested GPUs (see the dedicated section). For instance, a job using one core and one GPU for one hour (with the default memory per core) will consume 8 core-hours: since each node is equipped with 32 physical cores and 4 V100 GPUs, one GPU corresponds to a quarter of a node, i.e. 8 cores.

...

For all the mentioned cases, SLURM automatically exports the environment variables you defined in the source shell, so that if you need to run your program "myprogram" in a controlled environment (e.g. specific library paths or options), you can prepare the environment in the origin shell and be sure to find it again in the interactive shell (whether started with srun or with salloc).
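A minimal sketch of this workflow (the library path and the resource request are placeholders; add the account and partition options you normally use):

> export LD_LIBRARY_PATH=/path/to/mylibs:$LD_LIBRARY_PATH    # prepare the environment in the login shell
> salloc -N1 --ntasks-per-node=1                             # example interactive allocation
> srun ./myprogram                                           # the exported variables are available here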

...

which shows, for each partition, the total number of nodes and the number of nodes by state, in the format "Allocated/Idle/Other/Total".

...

For more information and examples of job scripts, see section Batch Scheduler SLURM.

Submitting serial Batch jobs

The m100_all_serial partition is available with a maximum walltime of 4 hours, 1 task and 7600 MB of memory per job. It runs on two dedicated nodes (equipped with 4 Volta GPUs) and is designed for serial pre/post-processing analysis (with or without the GPUs) and for moving your data (via rsync, scp, etc.) in case more than 10 minutes are required to complete the data transfer. This is the default partition, assumed by SLURM if you do not explicitly request a partition with the flag "--partition" or "-p". You can however explicitly request it in your batch script with the directive:

...
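As a minimal sketch of a job for this partition (the account and script names are placeholders):

#!/bin/bash
#SBATCH --partition=m100_all_serial
#SBATCH --time=04:00:00        # within the 4-hour limit of the partition
#SBATCH --ntasks=1
#SBATCH --mem=7600             # MB, the per-job limit of the partition
#SBATCH -A <account_no>

./my_postprocessing.sh         # placeholder for the serial analysis or data-transfer commands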

Summary


In the following table, you can find all the main features and limits imposed on the queues/Partitions of M100. 

...

SLURM partition | Job QOS | # cores / # GPUs per job | max walltime | max running jobs per user / max n. of cpus/nodes/GPUs per user | max memory per node (MB) | priority | notes
m100_all_serial (default partition) | normal | max = 1 core, 1 GPU (max mem = 7600 MB) | 04:00:00 | 4 cpus / 1 GPU | - | -40 | -
m100_usr_prod | m100_qos_dbg | max = 2 nodes | 02:00:00 | 2 nodes / 64 cpus / 8 GPUs | 246000 | 45 | runs on 12 nodes; #SBATCH -p m100_usr_prod, #SBATCH --qos=m100_qos_dbg
m100_usr_prod | normal | max = 16 nodes | 24:00:00 | 10 jobs | 246000 | 40 | runs on 880 nodes; #SBATCH -p m100_usr_prod
m100_usr_prod | m100_qos_bprod | min = 17 nodes, max = 256 nodes | 24:00:00 | 256 nodes | 246000 | 85 | runs on 256 nodes; #SBATCH -p m100_usr_prod, #SBATCH --qos=m100_qos_bprod
m100_fua_prod | m100_qos_fuadbg | max = 2 nodes | 02:00:00 | - | 246000 | 45 | runs on 12 nodes; #SBATCH -p m100_fua_prod, #SBATCH --qos=m100_qos_fuadbg
m100_fua_prod | normal | max = 16 nodes | 1-00:00:00 | - | 246000 | 40 | runs on 68 nodes; #SBATCH -p m100_fua_prod
- | qos_special | > 256 nodes | > 24:00:00 | - | 246000 | 40 | #SBATCH --qos=qos_special; request to superc@cineca.it
- | qos_lowprio | max = 16 nodes | 24:00:00 | - | 246000 | 0 | #SBATCH --qos=qos_lowprio; Non-Eurofusion users: request to superc@cineca.it

...

If a graphic session is desired, we recommend using the tool RCM (Remote Connection Manager). For additional information visit the Remote Visualization section of our User Guide.

...

The programming environment of the M100 cluster consists of a choice of compilers for the main scientific languages (Fortran, C and C++), debuggers to help users find bugs and errors in their codes, and profilers to help in code optimisation.

In general, you must "load" the correct environment also to use programming tools like compilers, since "native" compilers are not available.
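A minimal sketch of the usual module workflow (the module name is a placeholder; actual names depend on the profile loaded on M100):

> module av                        # list the modules available in the current profile
> module load <compiler_module>    # e.g. a GNU or XL compiler module
> module list                      # verify what is currently loaded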

...

> man gfortran
> man gcc

Some miscellaneous flags are described in the following:

...

In GPU-accelerated applications, the sequential part of the workload runs on the CPU – which is optimized for single-threaded performance – while the compute-intensive portion of the application runs on thousands of GPU cores in parallel. When using CUDA, developers program in popular languages such as C, C++, Fortran, Python and MATLAB and express parallelism through extensions in the form of a few basic keywords. We refer to the NVIDIA CUDA Parallel Computing Platform documentation.
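As a minimal sketch (the source file name is a placeholder, and it is assumed that a CUDA toolkit module has already been loaded), a CUDA source file can be compiled with the nvcc compiler driver:

> nvcc -arch=sm_70 -O3 -o myprog myprog.cu    # sm_70 targets the Volta V100 GPUs of M100
> ./myprog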

...

Whatever your decision, you need to enable compiler runtime checks by adding specific flags during the compilation phase. In the following we describe those flags for the different Fortran compilers; if you are using a C or C++ compiler, please check beforehand because the flags may differ.
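As an indicative example only (the authoritative flags are those listed for each compiler below), a debug compilation with GNU gfortran typically looks like:

> gfortran -O0 -g -fcheck=all -fbacktrace -o myexec myprog.f90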

...

In software engineering, profiling is the investigation of a program's behavior using information gathered as the program executes. The usual purpose of this analysis is to determine which sections of a program to optimize, in order to increase its overall speed, decrease its memory requirement, or both.

A (code) profiler is a performance analysis tool that most commonly measures the frequency and duration of function calls, but there are other specific types of profilers (e.g. memory profilers) as well as more comprehensive profilers capable of gathering extensive performance data.

...

It is also possible to profile at line level (see "man gprof" for other options). In this case, you must also use the "-g" flag at compilation time:

>  gfortran -pg -g -O3 -o myexec myprog.f90
> ./myexec
> ls -ltr
   .......
   -rw-r--r-- 1 aer0 cineca-staff    506 Apr  6 15:33 gmon.out
> gprof --annotated-source myexec gmon.out


It is possible to profile MPI programs. In this case, the environment variable GMON_OUT_PREFIX must be defined in order to allow each task to write a different statistics file. Setting

...