Table of Contents

...

hostname:                login.m100.cineca.it

early availability:  April 20, 2020

start of production: to be defined (April 27, 2020)

...

This system will be in production at the beginning of 2020 as an upgrade of the "non conventional" partition of the Marconi Tier-0 system. It is an accelerated cluster based on Power9 chips and Volta NVIDIA GPUs, acquired by Cineca within the PPI4HPC European initiative.

System Architecture

...

More technical details on this architecture can be found in the IBM RedBooks series:

https://www.redbooks.ibm.com/redpapers/pdfs/redp5494.pdf

...

For information about data transfer from other computers, please follow the instructions and caveats in the dedicated section Data storage or in the document Data Management.

...

For accounting information please consult our dedicated section.

The account_no (or project) is important for batch executions. You need to indicate the account_no to be charged for the job in the scheduler, using the flag "-A".

...

With the "saldo -b" command you can list all the account_no associated to with your username. 

saldo -b   (reports projects defined on M100 )
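As a minimal sketch, the account can be specified either on the command line or inside the job script (both <account_no> and job.sh are placeholders; use one of the projects reported by "saldo -b"):

> sbatch -A <account_no> job.sh

#SBATCH -A <account_no>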

Please note that the accounting is in terms of consumed core hours, but it also strongly depends on the requested memory and number of GPUs; please refer to the dedicated section.

Budget Linearization policy

...

Starting from the first day of each month, the collaborators of any account are allowed to use the quota at full priority. As the budget is consumed, the jobs submitted from the account gradually lose priority, until the monthly budget (monthTotal) is fully consumed. At that point, their jobs will still be considered for execution, but with a lower priority than the jobs from accounts that still have some monthly quota left.

This policy is similar to those already applied by other major HPC centers in Europe and worldwide. The goal is to improve the response time, giving users the opportunity to use the cpu hours assigned to their project in relation to the project's actual size (total amount of core-hours).

...

The storage organization conforms to the CINECA infrastructure (see Section Data Storage and Filesystems). 

In addition to the home directory $HOME, a scratch area $CINECA_SCRATCH is defined for each user: a large disk for the storage of run-time data and files.
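As a quick illustration of how the two areas are typically used (the directory name is only a placeholder):

> cd $HOME                  # small, permanent files (sources, configuration)
> cd $CINECA_SCRATCH        # large run-time data and files
> mkdir -p myrun            # placeholder working directory for a run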

...

GPU0     X     NV3    SYS    SYS    0-63
GPU1    NV3     X     SYS    SYS    0-63
GPU2    SYS    SYS     X     NV3    64-127
GPU3    SYS    SYS    NV3     X     64-127

From the output of the command it is possible to see that GPU0 and GPU1 are connected with NVLink, as is the couple GPU2 & GPU3. The first couple is connected to cpus 0-63, the second to cpus 64-127. The cpus are numbered from 0 to 127 because of the four-way hyperthreading: 32 physical cores x 4 → 128 cpus.

Knowledge of the node topology is important to correctly distribute the parallel threads of your applications in order to get the best performance.
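As an illustrative sketch only (the resource numbers and the program name are placeholders, to be adapted to your application and to the Slurm configuration of the cluster), a hybrid MPI/OpenMP job can follow this topology by placing one task per GPU and binding each task's threads to cores:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4          # one task per GPU
#SBATCH --cpus-per-task=32           # example value: one quarter of the 128 logical cpus per task
#SBATCH --gres=gpu:4

export OMP_NUM_THREADS=8             # example value: one thread per physical core assigned to the task
export OMP_PLACES=cores

srun --cpu-bind=cores ./myprogram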

...

The software modules are collected in different profiles and organized by functional category (compilers, libraries, tools, applications, ...).

...

Since the nodes can be shared by users, Slurm has been configured to allocate one task per (physical) core by default. Without this option, one task would be allocated per thread on nodes with more than one ThreadsPerCore (as is the case on Marconi100).

...

As mentioned above, the accounting of the consumed core hours also takes into account the memory and the number of requested GPUs (see the dedicated section). For instance, a job using one core and one GPU for one hour (with the default memory per core) will consume 8 core-hours: since each node is equipped with 32 physical cores and 4 V100 GPUs, one GPU corresponds to a quarter of a node, i.e. 8 cores.

...

For all the mentioned cases, SLURM automatically exports the environment variables you defined in the source shell, so that if you need to run your program "myprogram" in a controlled environment (e.g. specific library paths or options), you can prepare the environment in the origin shell and be sure to find it again in the interactive shell (whether started with srun or with salloc).
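A minimal sketch of this workflow (the library path and the resource request are placeholders; add the account and partition options you normally use):

> export LD_LIBRARY_PATH=/path/to/mylibs:$LD_LIBRARY_PATH    # prepare the environment in the login shell
> salloc -N1 --ntasks-per-node=1                             # example interactive allocation
> srun ./myprogram                                           # the exported variables are available here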

...

which shows, for each partition, the total number of nodes and the number of nodes by state, in the format "Allocated/Idle/Other/Total".

...

For more information and examples of job scripts, see section Batch Scheduler SLURM.

Submitting serial Batch jobs

The m100_all_serial partition is available with a maximum walltime of 4 hours, 1 task and 7600 MB of memory per job. It runs on two dedicated nodes (equipped with 4 Volta GPUs) and is designed for serial pre/post-processing analysis (with or without the GPUs) and for moving your data (via rsync, scp, etc.) in case more than 10 minutes are required to complete the data transfer. This is the default partition, assumed by SLURM if you do not explicitly request a partition with the flag "--partition" or "-p". You can however explicitly request it in your batch script with the directive:

...
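As a minimal sketch of a job for this partition (the account and script names are placeholders):

#!/bin/bash
#SBATCH --partition=m100_all_serial
#SBATCH --time=04:00:00        # within the 4-hour limit of the partition
#SBATCH --ntasks=1
#SBATCH --mem=7600             # MB, the per-job limit of the partition
#SBATCH -A <account_no>

./my_postprocessing.sh         # placeholder for the serial analysis or data-transfer commands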

Summary


In the following table, you can find all the main features and limits imposed on the queues/Partitions of M100. 

...

SLURM partition | Job QOS | # cores / # GPUs per job | max walltime | max running jobs per user / max n. of cpus/nodes/GPUs per user | max memory per node (MB) | priority | notes
m100_all_serial (default partition) | normal | max = 1 core, 1 GPU (max mem = 7600 MB) | 04:00:00 | 4 cpus / 1 GPU | - | -40 | -
m100_usr_prod | m100_qos_dbg | max = 2 nodes | 02:00:00 | 2 nodes / 64 cpus / 8 GPUs | 246000 | 45 | runs on 12 nodes; #SBATCH -p m100_usr_prod, #SBATCH --qos=m100_qos_dbg
m100_usr_prod | normal | max = 16 nodes | 24:00:00 | 10 jobs | 246000 | 40 | runs on 880 nodes; #SBATCH -p m100_usr_prod
m100_usr_prod | m100_qos_bprod | min = 17 nodes, max = 256 nodes | 24:00:00 | 256 nodes | 246000 | 85 | runs on 256 nodes; #SBATCH -p m100_usr_prod, #SBATCH --qos=m100_qos_bprod
m100_fua_prod | m100_qos_fuadbg | max = 2 nodes | 02:00:00 | - | 246000 | 45 | runs on 12 nodes; #SBATCH -p m100_fua_prod, #SBATCH --qos=m100_qos_fuadbg
m100_fua_prod | normal | max = 16 nodes | 1-00:00:00 | - | 246000 | 40 | runs on 68 nodes; #SBATCH -p m100_fua_prod
- | qos_special | > 256 nodes | > 24:00:00 | - | 246000 | 40 | #SBATCH --qos=qos_special; request to superc@cineca.it
- | qos_lowprio | max = 16 nodes | 24:00:00 | - | 246000 | 0 | #SBATCH --qos=qos_lowprio; Non-Eurofusion users: request to superc@cineca.it

...

If a graphic session is desired, we recommend using the tool RCM (Remote Connection Manager). For additional information visit the Remote Visualization section of our User Guide.

...

The programming environment of the M100 cluster consists of a choice of compilers for the main scientific languages (Fortran, C and C++), debuggers to help users find bugs and errors in their codes, and profilers to help in code optimisation.

In general, you must "load" the correct environment also to use programming tools like compilers, since "native" compilers are not available.
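A minimal sketch of the usual module workflow (the module name is a placeholder; actual names depend on the profile loaded on M100):

> module av                        # list the modules available in the current profile
> module load <compiler_module>    # e.g. a GNU or XL compiler module
> module list                      # verify what is currently loaded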

...

> man gfortran
> man gcc

Some miscellaneous flags are described in the following:

...

In GPU-accelerated applications, the sequential part of the workload runs on the CPU – which is optimized for single-threaded performance – while the compute-intensive portion of the application runs on thousands of GPU cores in parallel. When using CUDA, developers program in popular languages such as C, C++, Fortran, Python and MATLAB and express parallelism through extensions in the form of a few basic keywords. We refer to the NVIDIA CUDA Parallel Computing Platform documentation.
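As a minimal sketch (the source file name is a placeholder, and it is assumed that a CUDA toolkit module has already been loaded), a CUDA source file can be compiled with the nvcc compiler driver:

> nvcc -arch=sm_70 -O3 -o myprog myprog.cu    # sm_70 targets the Volta V100 GPUs of M100
> ./myprog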

...

Whatever your decision, you need to enable compiler runtime checks by adding specific flags during the compilation phase. In the following we describe those flags for the different Fortran compilers; if you are using a C or C++ compiler, please check beforehand because the flags may differ.
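As an indicative example only (the authoritative flags are those listed for each compiler below), a debug compilation with GNU gfortran typically looks like:

> gfortran -O0 -g -fcheck=all -fbacktrace -o myexec myprog.f90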

...

In software engineering, profiling is the investigation of a program's behavior using information gathered as the program executes. The usual purpose of this analysis is to determine which sections of a program to optimize, in order to increase its overall speed, decrease its memory requirement, or both.

A (code) profiler is a performance analysis tool that most commonly measures the frequency and duration of function calls, but there are other specific types of profilers (e.g. memory profilers) as well as more comprehensive profilers capable of gathering extensive performance data.

...

It is also possible to profile at line level (see "man gprof" for other options). In this case, you must also use the "-g" flag at compilation time:

>  gfortran -pg -g -O3 -o myexec myprog.f90
> ./myexec
> ls -ltr
   .......
   -rw-r--r-- 1 aer0 cineca-staff    506 Apr  6 15:33 gmon.out
> gprof --annotated-source myexec gmon.out


It is possible to profile MPI programs. In this case, the environment variable GMON_OUT_PREFIX must be defined in order to allow each task to write a different statistics file. Setting

...