...

A $WORK area is defined for each active project on the system and is reserved for all the collaborators of the project. It is a safe storage area for keeping run-time data for the whole life of the project.



Area              Total Dimension (TB)   Quota (GB)   Notes

$HOME             200                    50
  • permanent/backed up, user specific, local

$CINECA_SCRATCH   2,000                  no quota
  • temporary, user specific, local
  • no backup
  • automatic cleaning procedure of data older than 40 days (the time interval can be reduced in case of a critical usage ratio of the area; in this case, users will be notified via HPC-News)

$WORK             4,000                  1,024
  • permanent, project specific, local
  • no backup
  • extensions can be considered if needed (mailto: superc@cineca.it)


The $DRES environment variable points to the shared repository where Data RESources are maintained. This is a data archive area, available only on request, shared with all CINECA HPC systems and among different projects. $DRES is not mounted on the compute nodes of the production partitions and can be accessed only from the login nodes and from the nodes of the serial partition. This means that you cannot access it within a standard batch job: all data needed during batch execution has to be moved to $WORK or $CINECA_SCRATCH before the run starts, either from the login nodes or via a job submitted to the serial partition.
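For example, a minimal staging job for the serial partition could look like the following sketch (the dataset paths and account name are placeholders):

#!/bin/bash
#SBATCH -p m100_all_serial
#SBATCH -A <account_name>
#SBATCH --time 01:00:00     # format: HH:MM:SS
#SBATCH --job-name=stage_data

# copy input data from the archive area to the project work area
rsync -av $DRES/<my_dataset>/ $WORK/<my_dataset>/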


...

The large production runs are executed in batch mode: the user writes a list of commands into a file (for example script.x) and then submits it to a scheduler (SLURM for Marconi100) that will search for the required resources in the system. As soon as the resources are available, script.x is executed and the results are sent back to the user.

This is an example of a script file:

#!/bin/bash
#SBATCH -A <account_name>
#SBATCH -p m100_usr_prod
#SBATCH --time 00:10:00     # format: HH:MM:SS
#SBATCH -N 1                # 1 node
#SBATCH --ntasks-per-node=8 # 8 tasks out of 128
#SBATCH --gres=gpu:1        # 1 GPU per node out of 4
#SBATCH --mem=7100          # memory per node out of 246000MB
#SBATCH --job-name=my_batch_job
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<user_email>
srun ./myexecutable

You can write your script file (for example script.x) using any editor, then submit it using the command:

> sbatch script.x
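After submission you can monitor your job, and cancel it if needed, with the standard SLURM commands, for instance:

> squeue -u $USER      # list your pending and running jobs
> scancel <job_id>     # cancel a job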

The script file must contain both directives to SLURM and commands to be executed, as described in more detail in the section Batch Scheduler SLURM.

Using SLURM directives you indicate the account_name (-A: which project pays for this work), where to run the job (-p: partition) and the maximum duration of the run (--time: time limit). You also indicate the resources needed, in terms of cores, GPUs and memory.

One of the commands will probably be the launch of a parallel MPI application. In this case the right command is srun, used as an alternative to the usual mpirun command. In this way you will get full support for process tracking, accounting, task affinity, suspend/resume and other features.


SLURM partitions


A list of the partitions defined on the cluster, with access rights and resource definitions, can be displayed with the command sinfo:


> sinfo -o "%10D %20F %P"


The command returns a more readable output which shows, for each partition, the total number of nodes and the number of nodes by state in the format "Allocated/Idle/Other/Total".
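For example, to restrict the report to a single partition you can add the -p option:

> sinfo -p m100_usr_prod -o "%10D %20F %P"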


In the following table you can find the main features and limits imposed on the partitions of M100.


SLURM partition       Job QOS           # cores/# GPUs per job    max walltime   max running jobs per user /          priority   notes
                                                                                 max n. of cpus/nodes/GPUs per user

m100_all_serial       normal            max = 1 core, 1 GPU       04:00:00       4 cpus/1 GPU                         40
(default partition)                     (max mem = 7600 MB)

m100_usr_prod         m100_qos_dbg      max = 2 nodes             02:00:00       2 nodes/64 cpus/8 GPUs               45         runs on 12 nodes

m100_usr_prod         normal            max = 16 nodes            24:00:00       10 jobs                              40         runs on 880 nodes

m100_usr_prod         m100_qos_bprod    min = 17 nodes            24:00:00       256 nodes                            85         runs on 256 nodes
                                        max = 256 nodes

m100_fua_prod         m100_qos_fuadbg   max = 2 nodes             02:00:00                                            45         runs on 12 nodes

m100_fua_prod         normal            max = 16 nodes            24:00:00                                            40         runs on 68 nodes

                      qos_special       > 16 nodes                > 24:00:00                                          40         request to superc@cineca.it

                      qos_lowprio       max = 16 nodes            24:00:00                                            0          active projects with exhausted budget
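As an illustration, a job matching the m100_qos_bprod row above could request (a sketch; the account name is a placeholder):

#SBATCH -A <account_name>
#SBATCH -p m100_usr_prod
#SBATCH --qos=m100_qos_bprod
#SBATCH -N 32
#SBATCH --time 24:00:00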


M100 specific information

In the following we report information specific to M100, as well as examples suited for this kind of system.

Each node exposes itself to SLURM as having 128 (virtual) cpus, 4 GPUs and 246,000 MB of memory. SLURM assigns nodes in a shared way, giving each job only the resources it requested and allowing multiple jobs to run on the same node(s). If you want a node in exclusive mode, use the SLURM option "--exclusive" together with "--gres=gpu:4".
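For example, to obtain a whole node for your job:

#SBATCH --exclusive
#SBATCH --gres=gpu:4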

The maximum memory which can be requested is 246,000 MB (average memory per physical core ~7 GB); this value guarantees that no memory swapping will occur.

Even if the nodes are shared among users, exclusivity is guaranteed for each single physical core and each single GPU. When you ask for "tasks" (--ntasks-per-node), SLURM gives you the requested number of (virtual) cpus rounded up to a multiple of four. For example:

#SBATCH --ntasks-per-node=1  (or 2, 3, 4)    → 1 core
#SBATCH --ntasks-per-node=13 (or 14, 15, 16) → 4 cores

By default the number of (virtual) cpus per task is one, but you can change it. 

#SBATCH --ntasks-per-node=8  
#SBATCH --cpus-per-task=4

In this way each task will correspond to one (physical) core.

Users with exhausted but still active projects are allowed to keep using the cluster resources, though at a very low priority, by adding the "qos_lowprio" flag to their job:

#SBATCH --qos=qos_lowprio

This QOS is automatically associated with EUROfusion users once their projects exhaust the budget before the expiry date. All other users should write to superc@cineca.it to request the QOS association.


Submitting serial batch jobs


The m100_all_serial partition is available with a maximum walltime of 4 hours, 1 core and 7600 MB per job. It runs on two dedicated nodes (equipped with 4 Volta GPUs) and is designed for serial pre/post-processing analysis (with or without GPUs), for moving your data (via rsync, scp etc., for instance when more than 10 minutes are needed to complete the transfer), and for programming tools. You can request it explicitly in your batch script with the directive:


#SBATCH -p m100_all_serial


This is the default partition, assumed by SLURM if you do not explicitly request a partition with the "--partition" or "-p" flag. Its use is free of charge and available to all users on the cluster.
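A complete serial job sketch (the executable name is a placeholder):

#!/bin/bash
#SBATCH -p m100_all_serial
#SBATCH --ntasks=1
#SBATCH --mem=7600          # maximum memory for this partition
#SBATCH --time 04:00:00     # maximum walltime for this partition
#SBATCH -A <account_name>

./my_postprocessing_tool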


Submitting batch jobs for production


Not all of the partitions are open to the academic community: some are reserved for dedicated classes of users:

  • m100_fua_* partitions are reserved for EUROfusion users
  • m100_usr_* partitions are open to academic production.


In these partitions you can also use QOS directives in order to modulate your request:


#SBATCH -p m100_usr_prod
#SBATCH --qos=m100_qos_dbg


(debug queue for academic users)


#SBATCH -p m100_usr_prod
(production queue for academic users)
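For example, a short debugging job for academic users could combine these directives as follows (a sketch; the account name and executable are placeholders):

#!/bin/bash
#SBATCH -A <account_name>
#SBATCH -p m100_usr_prod
#SBATCH --qos=m100_qos_dbg
#SBATCH -N 2                # within the 2-node limit of the debug QOS
#SBATCH --gres=gpu:4
#SBATCH --time 02:00:00     # maximum walltime for the debug QOS
srun ./myexecutable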


Examples


 > sbatch -N1 --ntasks-per-node=2 --cpus-per-task=4 --gres=gpu:2 …
export OMP_NUM_THREADS=4
srun ./myprogram


Two full cores on one node are requested, as well as 2 GPUs. A hybrid code is executed with 2 MPI tasks and 4 OpenMP threads per task, exploiting the HT capability of M100. Since 2 GPUs are used, 16 cores will be accounted to this job.


> sbatch  -N1 --ntasks-per-node=16 --cpus-per-task=4 --gres=gpu:2 ...   
export OMP_NUM_THREADS=16
srun --ntasks-per-node=4 (--cpu-bind=core) --cpus-per-task=16 -m block:block ./myprogram


16 full cores and 2 GPUs are requested. The 16x4 (virtual) cpus are used for 4 MPI tasks and 16 OpenMP threads per task. The -m flag in the srun command specifies the desired process distribution over nodes/sockets/cores (the default is block:cyclic). Please refer to the srun manual for more details on process distribution and binding.
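Along the same lines, a pure MPI run filling one node might be requested as follows (a sketch, consistent with the accounting model described above):

> sbatch -N1 --ntasks-per-node=32 --cpus-per-task=4 --gres=gpu:4 ...
export OMP_NUM_THREADS=1
srun ./myprogram

The 32 tasks are mapped onto the 32 physical cores (32x4 virtual cpus) of the node, and all 4 GPUs are used.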

Graphic session

If a graphic session is desired, we recommend the tool RCM (Remote Connection Manager). For additional information, visit the Remote Visualization section of our User Guide.

Programming environment

Marconi100 login and compute nodes host four Tesla Volta (V100) GPUs per node (CUDA compute capability 7.0). The most recent versions of the NVIDIA CUDA toolkit and of the Community Edition PGI compilers (supporting CUDA Fortran) are available in the module environment, together with a set of GPU-enabled libraries, applications and tools.

The programming environment of the M100 cluster consists of a choice of compilers for the main scientific languages (Fortran, C and C++), debuggers to help users find bugs and errors in their codes, and profilers to help with code optimisation.

In general, you must "load" the correct environment to use programming tools such as compilers, since "native" compilers are not available.

If you use a given set of compilers and libraries to create your executable, you will most likely have to load the same "environment" when you run it: since linking is dynamic by default on Linux systems, at runtime the application needs the compiler's shared libraries as well as other proprietary libraries. This means that you have to "module load" compilers and libraries both at compile time and at run time. To minimize the number of modules needed at runtime, compile your applications with static linking.

Compilers

You can check the complete list of available compilers on M100 with the command:

> module available

and checking the "compilers" section. The available compilers are:

  • XL
  • PGI
  • GNU
  • CUDA

XL 

The XL compiler family offers C, C++, and Fortran compilers designed for optimization and improved code generation, exploiting the inherent opportunities of the Power architecture. This is the recommended software stack on M100, together with the Spectrum MPI parallel library and the ESSL scientific library.

The xl module provides:

  • IBM XL C/C++ and Fortran compilers 
  • IBM XL Shared-memory parallelism (SMP) runtime library/environment 
  • Mathematical Acceleration Subsystem (MASS) Libraries 

The names of the XL C/C++ and Fortran compilers are:

Invocation    Usage (supported standards)

xlc           Compile C source files (ANSI C89, ISO C99, IBM language extensions)
xlc++         Compile C++ source files
xlf           Compile Fortran 77 source files
xlf90         Compile Fortran 90 source files
xlf95         Compile Fortran 95 source files
xlcuf         Compile CUDA Fortran source files

To learn more about the XL Fortran for Linux compiler, access the online product documentation in IBM Knowledge Center for the XL C/C++ compiler and the XL Fortran compiler.        

OpenMP parallelization is enabled by the -qsmp compiler option. If -qsmp=omp is specified, strict OpenMP compliance is applied to the program being compiled. Please refer to the official documentation on OpenMP support in IBM XL compilers.
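For instance, an OpenMP Fortran code could be compiled as follows (a sketch; flags other than -qsmp=omp are illustrative):

> module load xl
> xlf90 -qsmp=omp -O2 -o myprog myprog.f90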


PORTLAND Group (PGI)

The names of the PGI compilers are:

  • pgf77: Fortran77 compiler
  • pgf90: Fortran90 compiler
  • pgf95: Fortran95 compiler
  • pghpf: High Performance Fortran compiler
  • pgcc: C compiler
  • pgCC: C++ compiler

The documentation can be obtained with the man command after loading the pgi module:

> man pgf95
> man pgcc
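A minimal compile-line sketch after loading the pgi module:

> module load pgi
> pgf90 -O2 -o myprog myprog.f90
> pgcc -O2 -o myprog myprog.c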

GNU compilers

The GNU compilers are always available, and although they are not the best optimizing compilers they ensure maximum portability. A default version can be used without loading any module; to initialize the environment for the version provided in the module tree, use the module command:

Initialize the environment with the module command:

> module load gnu

The names of the GNU compilers are:

...

The documentation can be obtained with the man command after loading the gnu module:

> man gfortran
> man gcc

Some miscellaneous flags are described in the following:

-ffixed-line-length-132        Extend fixed-form Fortran 77 sources beyond the standard line-length limit
-ffree-form / -ffixed-form     Treat the source as free-form / fixed-form Fortran
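For instance, a legacy fixed-form source with extended line length could be compiled as (a sketch):

> gfortran -ffixed-form -ffixed-line-length-132 -O2 -o myprog legacy.f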


CUDA

Compute Unified Device Architecture is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units (GPUs). With CUDA, developers are able to dramatically speed up computing applications by harnessing the power of GPUs. 

In GPU-accelerated applications, the sequential part of the workload runs on the CPU – which is optimized for single-threaded performance – while the compute intensive portion of the application runs on thousands of GPU cores in parallel. When using CUDA, developers program in popular languages such as C, C++, Fortran, Python and MATLAB and express parallelism through extensions in the form of a few basic keywords. We refer to the NVIDIA CUDA Parallel Computing Platform documentation....
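As a minimal sketch (sm_70 matches the compute capability 7.0 of the V100 GPUs; the module version depends on what "module available" reports):

> module load cuda
> nvcc -arch=sm_70 -O2 -o saxpy saxpy.cu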

Debugger and Profilers

If your code dies at runtime, there is a problem. To solve it, you can either analyze the core file (not available with PGI compilers) or run your code under the debugger.


Compiler flags

Whatever your decision, you need to enable compiler runtime checks by passing specific flags during the compilation phase. In the following we describe those flags for the different Fortran compilers; if you are using the C or C++ compiler, check the documentation first because the flags may differ.

...

Other flags are compiler specific and are described in the following:

XL Fortran compiler

to be added...

PORTLAND Group (PGI) Compilers

The following flags are useful (in addition to "-O0 -g") for debugging your code:

-C                     Add array bounds checking
-Ktrap=ovf,divz,inv    Controls the behavior of the processor when exceptions occur: 
                       FP overflow, divide by zero, invalid operands
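For example (a sketch):

> pgf90 -O0 -g -C -Ktrap=ovf,divz,inv -o myprog myprog.f90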

GNU Fortran compilers

The following flags are useful (in addition to "-O0 -g") for debugging your code:

...

If the environment variable is not set, every task will write to the same gmon.out file.
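For example, assuming the variable in question is the standard GMON_OUT_PREFIX honored by the GNU profiling runtime, a batch script can set it before the parallel launch (a sketch):

export GMON_OUT_PREFIX=gmon.out
srun ./myexecutable    # each task then writes its own gmon.out.<pid>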

Scientific libraries

ESSL: Engineering and Scientific Subroutine Library

...

by IBM

Scientific libraries designed for the Power architecture, included in the XL compiler package.

> module load essl/6.2.1

Documentation: https://www.ibm.com/support/knowledgecenter/SSFHY8/essl_content.html
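A link-line sketch (ESSL is typically used together with the XL compilers):

> module load xl essl/6.2.1
> xlf90 -O2 -o myprog myprog.f90 -lessl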

...