...

Production environment

Since LEONARDO is a general purpose system and is used by several users at the same time, long production jobs must be submitted using a queuing system (scheduler). The scheduler guarantees that the access to the resources is as fair as possible. The production environment on LEONARDO Data Centric General Purpose (DCGP) partition is based on the SLURM scheduler.

...

Roughly speaking, there are two different modes to use an HPC system: Interactive and Batch. For a general discussion see the section Production Environment.

Interactive

A serial program can be executed in the standard UNIX way:

...

Please do not execute parallel applications on the login nodes!

Batch

As usual on HPC systems, large production runs are executed in batch mode. This means that the user writes a list of commands into a file (for example script.x) and then submits it to a scheduler (SLURM for LEONARDO), which searches the system for the required resources. As soon as the resources are available, script.x is executed and the results are sent back to the user.
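As a rough sketch of this workflow, the snippet below writes a minimal script.x and shows, in a comment, how it could be submitted. The partition name is the one listed in the SLURM partitions table; the job name, resource values, account placeholder, and the executable ./myexec are assumptions chosen for illustration, not site defaults.

```shell
# Write a minimal batch script. All values are illustrative assumptions.
cat > script.x <<'EOF'
#!/bin/bash
# Hypothetical job name and output file (%j expands to the job ID)
#SBATCH --job-name=my_job
#SBATCH --output=my_job.%j.out
# Requested resources and walltime limit (hh:mm:ss)
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00
# Partition as listed by sinfo; replace the account placeholder with your own
#SBATCH --partition=dcgp_usr_prod
#SBATCH --account=<account_name>

# Hypothetical executable
./myexec
EOF

# Submit from a login node (left commented out so the snippet has no side effects):
# sbatch script.x
```

Once the job is queued, the scheduler starts script.x when the requested node becomes free and writes its standard output to the file named by --output.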

...

Please note: the "mail" directive #SBATCH --mail-user is not effective yet.

SLURM partitions

A list of partitions defined on the cluster, with access rights and resources definition, can be displayed with the command sinfo:

...

SLURM partition	Job QOS	# cores / # GPUs per job	max walltime	max nodes/cores/mem per user; max nodes/cores per account	priority	notes
lrd_all_serial (default)	normal	max = 4 physical cores (8 logical cpus) max mem = 30800 MB	04:00:00	1 node / 4 cores / 30800 MB	40	No GPUs; Hyperthreading x2
dcgp_usr_prod	normal	max = 16 nodes	24:00:00	512 nodes per account	40
	dcgp_qos_dbg	max = 2 nodes	00:30:00	2 nodes / 224 cores per user; 512 nodes per account	80
	dcgp_qos_bprod	min = 17 nodes max = 128 nodes	24:00:00	128 nodes per user; 512 nodes per account	60	runs on 1536 nodes; min is 17 FULL nodes
	dcgp_qos_lprod	max = 3 nodes	4-00:00:00	3 nodes / 336 cores per user; 512 nodes per account	40
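
The limits in the table can be turned into a small helper that suggests a QOS for the dcgp_usr_prod partition from a node count and a walltime in minutes. This is only an illustrative sketch: the function name pick_dcgp_qos and the selection order are assumptions, and dcgp_qos_dbg is deliberately never chosen automatically.

```shell
# Illustrative helper (hypothetical name): map nodes and walltime in minutes
# to a QOS of dcgp_usr_prod, using the limits from the table above.
pick_dcgp_qos() {
  local nodes=$1 minutes=$2
  if [ "$nodes" -le 16 ] && [ "$minutes" -le 1440 ]; then
    echo "normal"              # up to 16 nodes, 24:00:00
  elif [ "$nodes" -le 3 ] && [ "$minutes" -le 5760 ]; then
    echo "dcgp_qos_lprod"      # up to 3 nodes, 4-00:00:00
  elif [ "$nodes" -ge 17 ] && [ "$nodes" -le 128 ] && [ "$minutes" -le 1440 ]; then
    echo "dcgp_qos_bprod"      # 17 to 128 nodes, 24:00:00
  else
    echo "none"                # no QOS in the table fits the request
    return 1
  fi
}

pick_dcgp_qos 2 600     # prints: normal
pick_dcgp_qos 3 4000    # prints: dcgp_qos_lprod
pick_dcgp_qos 64 1440   # prints: dcgp_qos_bprod
```

The result can then be passed to the scheduler, for example with the --qos option of sbatch.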

Note: a maximum of 512 nodes per account is also imposed on the dcgp_usr_prod partition, meaning that, for each account, all the jobs associated with it cannot run on more than 512 nodes at the same time (if you submit a job that would exceed this limit, it will stay pending until a

Programming environment

LEONARDO Data Centric compute nodes are not provided with GPUs, so applications that run on GPUs can be used only on the Booster partition. The programming environment includes a set of compilers, debuggers, and profiling tools suitable for programming on CPUs.

Compilers

You can check the complete list of available compilers on LEONARDO with the command

...

For this reason, CUDA-aware compilers, such as GNU, NVIDIA nvhpc, and CUDA compilers, are suitable and recommended for the LEONARDO Booster partition, and they are described in the dedicated page.

Intel OneAPI Compilers

Initialize the environment with the module command:

...

After loading the module, the documentation can be obtained with the man command:

$ man ifort
$ man icc

Debugger and Profilers

If your code crashes at runtime, you can either analyze the core file (core files are not available with PGI compilers) or run your code under a debugger.

Compiler flags

Whichever approach you choose, you need to enable the compiler's runtime checks by adding specific flags at compile time. The flags for the different Fortran compilers are described below; if you are using a C or C++ compiler, please check its documentation first, because the flags may differ.

...

Other flags are compiler specific and are described in the following.

PORTLAND Group (PGI) Compilers

The following flags are useful (in addition to "-O0 -g") for debugging your code:

-C                     Add array bounds checking
-Ktrap=ovf,divz,inv    Controls the behavior of the processor when exceptions occur: 
                       FP overflow, divide by zero, invalid operands

GNU Fortran compilers

The following flags are useful (in addition to "-O0 -g") for debugging your code:

-Wall             Enables warnings pertaining to usage that should be avoided
-fbounds-check    Enables run-time checks on array subscripts (deprecated alias of -fcheck=bounds)
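
To see these flags in action, the sketch below writes a small Fortran program that deliberately writes past the end of an array and, if gfortran is available, compiles it with the flags listed above. The file and program names (bounds_demo) are hypothetical.

```shell
# Write a test program with an out-of-bounds write (a has 10 elements, the loop runs to 11).
cat > bounds_demo.f90 <<'EOF'
program bounds_demo
  implicit none
  integer :: a(10), i, n
  n = 11
  do i = 1, n
     a(i) = i
  end do
  print *, a(1)
end program bounds_demo
EOF

# Compile and run only if the GNU Fortran compiler is on the PATH.
if command -v gfortran >/dev/null 2>&1; then
  gfortran -O0 -g -Wall -fbounds-check -o bounds_demo bounds_demo.f90
  # With bounds checking enabled the program is expected to abort with a run-time error.
  ./bounds_demo || echo "run-time check stopped the program (exit status $?)"
fi
```

Without -fbounds-check the same program may run to completion and silently corrupt memory, which is why the check is worth enabling while debugging.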

Debuggers available

GNU: gdb (serial debugger)

GDB is the GNU Project debugger and allows you to see what is going on 'inside' your program while it executes -- or what the program was doing at the moment it crashed.

VALGRIND

Valgrind is a framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile your programs in detail. The Valgrind distribution currently includes six production-quality tools: a memory error detector, two thread error detectors, a cache and branch-prediction profiler, a call-graph generating cache profiler, and a heap profiler.

Valgrind is Open Source / Free Software, and is freely available under the GNU General Public License, version 2.

Profilers

In software engineering, profiling is the investigation of a program's behavior using information gathered as the program executes. The usual purpose of this analysis is to determine which sections of a program to optimize - to increase its overall speed, decrease its memory requirement or sometimes both.

A (code) profiler is a performance analysis tool that, most commonly, measures only the frequency and duration of function calls, but there are other specific types of profilers (e.g. memory profilers) in addition to more comprehensive profilers, capable of gathering extensive performance data.

gprof

The GNU profiler gprof is a useful tool for measuring the performance of a program. It records the number of calls to each function and the amount of time spent there, on a per-function basis. Functions which consume a large fraction of the run-time can be identified easily from the output of gprof. Efforts to speed up a program should concentrate first on those functions which dominate the total run-time.
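
A typical serial workflow can be sketched as follows, assuming the GNU compiler and a hypothetical source file prof_demo.f90: compile with -pg, run the program so that it writes gmon.out, then pass the executable and gmon.out to gprof.

```shell
# Write a small test program whose time is dominated by one function.
cat > prof_demo.f90 <<'EOF'
program prof_demo
  implicit none
  integer :: i
  real(8) :: s
  s = 0.0d0
  do i = 1, 2000000
     s = s + work(i)
  end do
  print *, s
contains
  real(8) function work(k)
    integer, intent(in) :: k
    work = sqrt(real(k, 8))
  end function work
end program prof_demo
EOF

# Profile only if both the compiler and gprof are on the PATH.
if command -v gfortran >/dev/null 2>&1 && command -v gprof >/dev/null 2>&1; then
  gfortran -O0 -g -pg -o prof_demo prof_demo.f90   # -pg adds profiling instrumentation
  ./prof_demo                                      # writes gmon.out in the current directory
  gprof ./prof_demo gmon.out > prof_report.txt     # flat profile and call graph
fi
```

The report in prof_report.txt lists, per function, the number of calls and the time spent, which is the information described above.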

...

If the environment variable is not set, every task will write to the same gmon.out file.

MPI environment

The Intel MPI implementation, Intel-OneAPI-MPI, is recommended on the LEONARDO Data Centric partition; it does not support CUDA. Some useful details on how to use it on this partition are given below.

See the page dedicated to the LEONARDO Booster partition for a description of OpenMPI, which is installed there with CUDA support.

Compiling

Intel-OneAPI-MPI

To compile MPI applications with Intel MPI, you have to load the intel-oneapi-mpi module (use the "modmap -m intel-oneapi-mpi" command to see the available versions).

...

e.g. Compiling Fortran code:

$ module load intel-oneapi-compilers/<version>
$ module load intel-oneapi-mpi/<version>
$ mpiifort -o myexec  myprog.f90 (uses the ifort compiler)

You can add any option supported by the backend compiler (the "-show" flag prints the underlying compiler command, e.g. "mpicc -show"). To list the available options, use the "man" command:

$ man mpiifort

Running

To run MPI applications there are two ways:

using mpirun launcher
using srun launcher

mpirun launcher

To use mpirun launcher on LEONARDO Data Centric partition, the intel-oneapi-mpi module needs to be loaded:

...

$ sbatch -N 2 my_batch_script.sh (allocate a job of 2 nodes) 
$ cat my_batch_script.sh
#!/bin/sh
mpirun ./mpi_exec

srun launcher

MPI applications can also be launched directly with the SLURM launcher srun

...

$ sbatch -N 2 my_batch_script.sh (allocate a job of 2 nodes) 
$ cat my_batch_script.sh
#!/bin/sh
srun -N 2 ./mpi_exec
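
Putting the pieces together, a complete two-node job script might look like the sketch below. The value of 112 tasks per node is inferred from the QOS table (2 nodes / 224 cores) and should be checked against the node description; the walltime, the account placeholder, and loading the modules inside the script are assumptions. The <version> and <account_name> placeholders must be replaced before submitting.

```shell
# Write an illustrative two-node MPI job script (values are assumptions, see above).
cat > my_batch_script.sh <<'EOF'
#!/bin/sh
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=112
#SBATCH --time=00:30:00
#SBATCH --partition=dcgp_usr_prod
#SBATCH --account=<account_name>

# Same modules used at compile time; replace <version> with an installed version.
module load intel-oneapi-compilers/<version>
module load intel-oneapi-mpi/<version>

# srun inherits the node and task counts from the #SBATCH directives above.
srun ./mpi_exec
EOF

# Submit from a login node (left commented out so the snippet has no side effects):
# sbatch my_batch_script.sh
```

Because the resources are declared in the script, no -N option is needed on the sbatch or srun command lines in this variant.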

Scientific libraries

Libraries listed in this section do not support CUDA (see LEONARDO Booster section for GPU-accelerated libraries).

Linear Algebra

BLAS: openblas, intel-oneapi-mkl
LAPACK: openblas, intel-oneapi-mkl
SCALAPACK: netlib-scalapack, intel-oneapi-mkl
SPARSE MATRICES: PETSc (multi-node), SuperLU-dist (multi-node)

PETSc and SuperLU-dist are GPU-accelerated libraries and are also listed in the page dedicated to LEONARDO Booster. They are reported here as well because they are frequently used in non-accelerated applications.

Fast Fourier Transform

FFTW (single and multi-node)

...
