
The Production environment and Programming environment sections are specific to the two partitions, Booster and Data Centric General Purpose (DCGP):

...

hostname: login.leonardo.cineca.it

                       login01-ext.leonardo.cineca.it

                       login02-ext.leonardo.cineca.it

                       login05-ext.leonardo.cineca.it

                       login07-ext.leonardo.cineca.it

early availability:      March, 2023 (Booster)

start of pre-production: June, 2023 (Booster)

                                   January 2024 (DCGP)

start of production: August 2023 (Booster)

                                         February 2024 (DCGP)

...

This system is the new pre-exascale Tier-0 EuroHPC Joint Undertaking supercomputer hosted by CINECA and currently built in the Bologna Technopole, Italy. It is supplied by EVIDEN ATOS and is based on two new, specifically designed compute blades, which are available through two distinct SLURM partitions on the cluster:

  • X2135 GPU blade based on NVIDIA Ampere A100-64 accelerators - LEONARDO Booster partition
  • X2140 CPU-only blade based on Intel Sapphire Rapids processors - LEONARDO Data Centric General Purpose (DCGP) partition

The overall system architecture uses NVIDIA Mellanox InfiniBand High Data Rate (HDR) connectivity, with smart in-network computing acceleration engines that enable extremely low latency and high data throughput to provide the highest AI and HPC application performance and scalability.

System Architecture

...

The system also includes a Capacity Tier and a Fast Tier storage, based on DDN Exascaler.

The Operating System is RedHat Enterprise Linux 8.6.

Login nodes: 4 nodes, icelake no-gpu




Booster

DCGP

Model

Atos BullSequana X2135 "Da Vinci"

...

single-node GPU blade

Atos BullSequana X2140 three-node CPU blade

Racks

116
22

Nodes

Login nodes: in β production 1 (16 later): login14 accessible via IP 131.175.43.130, icelake no-gpu

NVIDIA HDR 2×100 Gb/s cards
1x Nvidia HDR100 100 Gb/s card

Booster

Data Centric

Model

Atos BullSequana XH2135 "Da Vinci" blade

Atos BullSequana X2610 compute blade

Racks

150

Nodes

3456

1536

Processors

single socket 32 cores Intel Ice Lake CPU

1 x Intel Xeon Platinum 8358, 2.60 GHz, TDP 250W

dual socket 56 cores Intel Sapphire Rapids CPU

2 x Intel Xeon Platinum 8480p, 2.00 GHz, TDP 350W

Accelerators

4 x NVIDIA Ampere GPUs/node, 64GB HBM2e, NVLink 3.0 (200GB/s)

-

Cores

32 cores/node

112 cores/node

RAM

512 (8x64) GB DDR4 3200 MHz

512 (16 x 32) GB DDR5 4800 MHz

Peak Performance

about 309 Pflop/s

9 Pflop/s


Internal Network

DragonFly+ 200 Gbps (NVIDIA Mellanox InfiniBand HDR)

Disk Space

106PB Large capacity storage
5.4 PB of High performance storage

The following guide refers already to the production configuration. The pre-production phase will begin in the next few days, with a mandatory access via 2FA. Please refer to the Access section below in the Leonardo User Guide.

dual port HDR100 per node

 single port HDR100 per node

Storage
(raw capacity)

137.6 PB based on DDN ES7990X and Hard Drive Disks (Capacity Tier)
5.7 PB based on DDN ES400NVX2 and Solid State Drives (Fast Tier) 







Peak performance details

Node Performance

Theoretical Peak Performance:
  CPU (nominal/peak freq.): 1680 Gflops
  GPU: 75000 Gflops
  Total: 76680 Gflops
Memory Bandwidth (nominal/peak freq.): 24.4 GB/s

Access

All the login nodes have an identical environment and can be reached with SSH (Secure Shell) protocol using the "collective" hostname:

$ login.leonardo.cineca.it

which establishes a connection to one of the available login nodes. To connect to LEONARDO you can also explicitly indicate one of the login nodes:

$ login01-ext.leonardo.cineca.it
$ login02-ext.leonardo.cineca.it
$ login05-ext.leonardo.cineca.it
$ login07-ext.leonardo.cineca.it
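
As a sketch of how the hostnames above are used with an SSH client, the snippet below assembles the connection command for the collective hostname and for one explicit login node. The username jdoe0000 is a placeholder (an assumption, not a real account). The commands are printed rather than executed, so the snippet can be tried on any machine.

```shell
# Placeholder username (assumption): replace with your own CINECA username.
LEONARDO_USER="jdoe0000"

# Collective hostname: lands on one of the available login nodes.
COLLECTIVE_CMD="ssh ${LEONARDO_USER}@login.leonardo.cineca.it"

# Explicit login node, e.g. to return to a session left on login02.
EXPLICIT_CMD="ssh ${LEONARDO_USER}@login02-ext.leonardo.cineca.it"

# Print the commands; run one of them by hand once 2FA has been set up.
echo "${COLLECTIVE_CMD}"
echo "${EXPLICIT_CMD}"
```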

Access to LEONARDO requires two-factor authentication (2FA). Please refer to this link of the User Guide to activate and connect via 2FA. For information about data transfer from other computers please follow the instructions and caveats in the dedicated section Data storage or the document Data Management.

Accounting

The accounting (consumed budget) is active from the start of the production phase. For accounting information please consult our dedicated section.

The account_name (or project) is important for batch executions. You need to indicate an account_name to be accounted for in the scheduler, using the flag "-A"

#SBATCH -A <account_name>

With the "saldo -b" command you can list all the account_name associated with your username. 

$ saldo -b          (reports projects defined on LEONARDO Booster)
$ saldo --dcgp -b   (reports projects defined on LEONARDO DCGP)

Please note that the accounting is in terms of consumed core hours, but it also strongly depends on the requested memory, local storage and number of GPUs. Please refer to the dedicated section.

Budget Linearization policy

On LEONARDO, as on the other HPC clusters in CINECA, a linearization policy for the usage of project budgets has been defined and implemented. The goal is to improve the response time, giving users the opportunity to use the CPU hours assigned to their project in relation to their actual size (total amount of core-hours).


Disks and Filesystems

The storage organization conforms to the CINECA infrastructure (see Section Data Storage and Filesystems). 

In addition to the home directory $HOME, a scratch area $SCRATCH (or $CINECA_SCRATCH) is defined for each user: a large disk for the storage of run-time data and files.
A new user-specific area $PUBLIC is defined on LEONARDO, useful for example to share installations with other users (it is the default directory for SPACK sub-directories, see more details in the dedicated page).
A $WORK area is defined for each active project on the system, reserved to all the collaborators of the project. A corresponding $FAST area is defined for each active project on the scratch filesystem, on its subset of "fast" NVMe SSD flash drives. As for $WORK, the $FAST area is reserved to all the collaborators of the project. An extension of the default $WORK quota (1 TB) can be granted if justified and essential for the course of the project's activity, while the use of $FAST is limited to 1 TB of space per project.


Total dimension, quota and notes for each area:

$HOME: 0.46 PiB, quota 50GB per user
  • permanent
  • backed up
  • user specific
$CINECA_SCRATCH: 40 PiB, no quota
  • HDD storage
  • temporary
  • user specific
  • no backup
  • automatic cleaning procedure of data older than 40 days (the time interval can be reduced in case of critical usage ratio of the area; in this case, users will be notified via HPC-News)
$PUBLIC: 0.46 PiB, quota 50GB per user
  • permanent
  • user specific
  • no backup
$WORK: 30 PB, quota 1TB per project
  • permanent
  • project specific
  • no backup
  • extensions can be considered if needed (mailto: superc@cineca.it)
$FAST: 3.5PB, quota 1TB per project
  • permanent
  • project specific
  • no backup

The automatic cleaning of the scratch area is NOT active yet, but it will soon be enforced.


A temporary storage area local to the nodes is also available on login and compute nodes (on the latter it is created when the job starts and removed when it ends), accessible via the environment variable $TMPDIR. For more details please see the dedicated section of UG2.5: Data storage and FileSystems. On LEONARDO the $TMPDIR local area has 1 TB of available space.


GPU and intra/inter connection environment

It will be described soon.

Production environment

Since LEONARDO is a general purpose system and is used by several users at the same time, long production jobs must be submitted using a queuing system (scheduler). The scheduler guarantees that the access to the resources is as fair as possible.

The production environment on LEONARDO is based on the SLURM scheduler, which is already in place on the cluster but is still incomplete and in a pre-production configuration.

Leonardo is based on a policy of node sharing among different jobs, i.e. a job can ask for resources and these can also be a part of a node, for example few cores and 1GPU. This means that, at a given time, one physical node can be allocated to multiple jobs of different users. Nevertheless, exclusivity at the level of the single core is guaranteed by low-level mechanisms.

Roughly speaking, there are two different modes to use an HPC system: Interactive and Batch. For a general discussion see the section Production Environment.

Interactive

A serial program can be executed in the standard UNIX way:

> ./program

This is allowed only for very short runs on the login nodes. Soon we will impose 10 minutes cpu-time limit  for the interactive processes.  Please do not execute parallel applications on the login nodes! 

Batch

As usual on HPC systems, the large production runs are executed in batch mode. This means that the user writes a list of commands into a file (for example script.x) and then submits it to a scheduler (SLURM for Leonardo) that will search for the required resources in the system. As soon as the resources are available, script.x is executed and the results are sent back to the user.

This is an example of script file:

...

  • Please refer to the general online guide to slurm and on task/thread bindings, and please pay attention to the setting of the SRUN_CPUS_PER_TASK for hybrid applications dispatched with "srun". 

You can write your script file (for example script.x) using any editor, then you submit it using the command:

> sbatch script.x

The script file must contain both directives to SLURM and commands to be executed, as better described in the section  Batch Scheduler SLURM. 

Using SLURM directives you indicate the account_number (-A: which project pays for this work), where to run the job (-p: partition), what is the maximum duration of the run (--time: time limit). Moreover you indicate the resources needed, in terms of cores, GPUs (later) and memory. 

One of the commands will probably be the launch of a parallel MPI application. In this case the right command is srun, as an alternative to the usual mpirun command. In this way you will get full support for process tracking, accounting, task affinity, suspend/resume and other features.
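
A minimal sketch of such a script for the Booster partition, combining the directives described above, could look as follows. The job name, executable name and resource sizes are placeholders chosen for illustration, and <account_name> must be filled in. The partition name boost_usr_prod is the one mentioned in the partition notes of this page, and the GPU request uses the standard SLURM --gres=gpu:<n> syntax. The script is written to a file with a here-document so that it can be inspected before submission.

```shell
# Write a sketch job script to script.x; nothing is submitted here.
cat > script.x <<'EOF'
#!/bin/bash
# Project that pays for the job (placeholder, see "saldo -b")
#SBATCH -A <account_name>
# Partition where the job runs
#SBATCH -p boost_usr_prod
# Maximum duration of the run (hh:mm:ss)
#SBATCH --time=00:30:00
# One full Booster node: 4 tasks x 8 cores = 32 cores, 4 GPUs
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:4
# Hypothetical job name
#SBATCH --job-name=sketch

# Let srun inherit the per-task core count (relevant for hybrid codes).
export SRUN_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK}

# Hypothetical executable, launched with srun rather than mpirun.
srun ./myexec
EOF

echo "written: script.x"
```

One MPI task per GPU is only one possible layout (an assumption of this sketch); adapt the task and core counts to your application before submitting with "sbatch script.x".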

Please note: the "mail" directives are not effective yet.

SLURM partitions

A list of partitions defined on the cluster, with access rights and resources definition, can be displayed with the command sinfo:

> sinfo -o "%10D %20F %P"

The command returns a more readable output which shows, for each partition, the total number of nodes and the number of nodes by state in the format "Allocated/Idle/Other/Total".
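
To illustrate how the "Allocated/Idle/Other/Total" field can be read, the sketch below splits it with awk. The sample line uses made-up numbers in the same three-column layout; it is not real output from the cluster.

```shell
# Hypothetical sample line in the "%10D %20F %P" layout (made-up numbers);
# on the cluster, pipe the real output of sinfo into the same awk program.
sample="3456       3200/200/56/3456     boost_usr_prod"

# Split the second field on "/" into allocated, idle, other and total nodes.
summary=$(echo "$sample" | awk '{
  split($2, s, "/")
  printf "%s: %d allocated, %d idle, %d other, %d total", $3, s[1], s[2], s[3], s[4]
}')
echo "$summary"
```

On the cluster the same awk program can be fed directly, for instance with sinfo -h -o "%10D %20F %P", where -h suppresses the header line.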

In the following table you can find the main features and limits imposed on the partitions of Leonardo.

...

SLURM

partition

...

max running jobs per user/

max n. of cores/nodes/GPUs per user

...

lrd_all_serial
(default)

not yet available

...

min = 33 nodes

max =256 nodes *

...

max = 3 nodes

...

  • *For the "boost_usr_prod" partition you can use at most 32 nodes (MaxTime=24:00:00). Please request the boost_qos_bprod QOS to go up to 512 nodes (MaxTime=10:00:00). This limit will be in place until May 25, when it will be reduced to 256 nodes with MaxTime=24:00:00 (production environment).
  • For EUROFusion users and their dedicated queues please refer to the dedicated document.

Graphic session

It will be available soon. 

Programming environment

Leonardo compute nodes host four A100 GPUs per node (CUDA compute capability 8.0). The most recent versions of the NVIDIA CUDA toolkit and of the NVIDIA nvhpc compilers (ex PGI, supporting CUDA Fortran) are available in the module environment.

Compilers

You can check the complete list of available compilers on LEONARDO with the command:

> module available

and checking the "compilers" section. The available compilers are:

  • Gnu Compilers Collection (GCC)
  • NVIDIA nvhpc (ex PGI)
  • CUDA

NVIDIA nvhpc (ex PORTLAND PGI + NVIDIA CUDA)

As of August 5, 2020, the "PGI Compilers and Tools" technology is a part of the NVIDIA HPC SDK product, available as a free download from NVIDIA.

...

This area is:

  • on the local SSD disks on login nodes (14 TB of capacity), mounted as /scratch_local (TMPDIR=/scratch_local). This is a shared area with no quota: please remove your files once they are no longer needed. A cleaning procedure will be enforced in case of improper use of the area.
  • on the local SSD disks on the serial node (lrd_all_serial, 14TB of capacity), managed via the slurm job_container/tmpfs plugin. This plugin provides a job-specific, private temporary file system space, with private instances of /tmp and /dev/shm in the job's user space (TMPDIR=/tmp, visible via the command "df -h"), removed at the end of the serial job. You can request the resource via sbatch directive or srun option "--gres=tmpfs:XX" (for instance: --gres=tmpfs:200GB), with a maximum of 1 TB for the serial jobs. If not explicitly requested, the /tmp has the default dimension of 10 GB.
  • on the local SSD disks on DCGP nodes (3 TB  of capacity). As for the serial node, the local /tmp and /dev/shm areas are managed via plugin, which at the start of the jobs mounts private instances of /tmp and /dev/shm in the job's user space (TMPDIR=/tmp, visible via the command "df -h /tmp"), and unmounts them at the end of the job (all data will be lost). You can request the resource via sbatch directive or srun option "--gres=tmpfs:XX", with a maximum of all the available 3 TB for DCGP nodes. As for the serial node, if not explicitly requested, the /tmp has the default dimension of 10 GB. Please note: for the DCGP jobs the requested amount of gres/tmpfs resource contributes to the consumed budget, changing the number of accounted equivalent core hours, see the dedicated section on the Accounting
  • on RAM on the diskless booster nodes (with a fixed size of 10 GB, no increase is allowed, and the gres/tmpfs resource is disabled).
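
As a sketch of how the gres/tmpfs resource described above can be requested, the snippet below writes a small job script for the serial partition. The file names, the executable and the use of $WORK as staging source are hypothetical; <account_name> must be filled in.

```shell
# Write a sketch job script to serial_tmp.x; nothing is submitted here.
cat > serial_tmp.x <<'EOF'
#!/bin/bash
#SBATCH -A <account_name>
#SBATCH -p lrd_all_serial
#SBATCH --time=00:10:00
# Ask for 200 GB of private /tmp instead of the 10 GB default.
#SBATCH --gres=tmpfs:200GB

# Inside the job, TMPDIR points at the private /tmp instance.
df -h "$TMPDIR"

# Stage data in, work locally, and copy results back before the job
# ends: the private /tmp is removed when the job finishes.
cp "$WORK/input.dat" "$TMPDIR/"
cd "$TMPDIR"
"$WORK/myexec" input.dat > output.dat
cp output.dat "$WORK/"
EOF

echo "written: serial_tmp.x"
```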

For a general discussion on the TMPDIR area, please see the dedicated section of Data storage and FileSystems.

Since all the filesystems are based on Lustre, the usual unix command "quota" is not working. Use the local command cindata to query for disk usage and quota ("cindata -h" for help):

$ cindata

or the tool "cinQuota" available in the module cintools

$ cinQuota

For more details about both these commands, please consult the section dedicated to how to monitor the occupancy.

Software environment

Module environment

The software modules are collected in different profiles and organized by functional categories (compilers, libraries, tools, applications, ...). The profiles are of two types: “programming” type (base and advanced) for compilation, debugging and profiling activities, and  “domain” type (chem-phys, lifesc, ...) for the production activity. They can be loaded together.

"Base" profile is the default. It is automatically loaded after login and it contains basic modules for the programming activities (ibm, gnu, pgi, cuda compilers, math libraries, profiling and debugging tools, ...).

If you want to use a module placed under other profiles, for example an application module, you first have to load the corresponding profile:

$ module load profile/<profile name>
$ module load <module name>

Almost all the software on LEONARDO was installed with the Spack manager, which automatically loads any dependencies, so the "autoload" command is unnecessary.

For listing all profiles you have loaded you can use the following command:

$ module list

In order to detect all profiles, categories and modules available on LEONARDO, the command “modmap” is available as for the other clusters. With modmap you can see if the desired module is available and which profile you have to load to use it.

$ modmap -m <module_name>

Note: on LEONARDO you can find modules compiled to support GPUs and modules suitable only for CPUs. You can check the compiler in the full name of the module, where the version is specified (e.g. gromacs/2022.3--intel-oneapi-mpi--2021.10.0--oneapi--2023.2.0). Remember that modules compiled with gcc, nvhpc or cuda should be used only on the Booster partition, while modules compiled with intel oneapi are suitable for running on the DCGP partition. Please refer to the specific sections of the two partitions for more details on the available compilers: Booster Programming environment and DCGP Programming environment.

 Spack environment

If you cannot find the software you are interested in, you can install it by yourself.
For this purpose, on LEONARDO we offer the possibility to use the “spack” environment by loading the corresponding module. Please refer to the dedicated section in UG2.6: Production Environment.

Please note that we are still optimizing LEONARDO software stack, and more installations may be added/replaced. Always check with "module av" (the hash in the module name can change).

Remember that, on LEONARDO (unlike other CINECA clusters), the default area where Spack directories are created (/cache, /install, /modules, /user_cache) is the $PUBLIC one (described in section Disks and Filesystems).




You can proceed with the sections related to Production environment and Programming environment in the specific pages for the two partitions:

For legacy reasons, the NVIDIA nvhpc suite also offers the PGI C, C++, and Fortran compilers with their original names:

...

To enable CUDA C++ or CUDA Fortran, and link with the CUDA runtime libraries, use the -cuda option (-Mcuda is deprecated). Use the -gpu option to tailor the compilation of target accelerator regions.

The OpenACC parallelization is enabled by the -acc flag. GPU targeting and code generation can be controlled by adding the -gpu flag to the compiler command line.

The OpenMP parallelization is enabled by the -mp compiler option. The GPU offload via OpenMP is enabled by the -mp=gpu option.
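
The flags above can be combined as in the following sketch. The source and executable names are hypothetical, and -gpu=cc80 is chosen here to match the compute capability 8.0 of the A100 GPUs. The compile commands run only if the nvhpc compilers are on the PATH, i.e. after loading the nvhpc module.

```shell
# Flag sets for the three GPU programming models described above.
ACC_FLAGS="-acc -gpu=cc80"        # OpenACC
OMP_FLAGS="-mp=gpu -gpu=cc80"     # OpenMP offload
CUDA_FLAGS="-cuda -gpu=cc80"      # CUDA Fortran / CUDA C++

# Hypothetical source files: adapt the names to your own code.
if command -v nvc >/dev/null 2>&1; then
  nvc       ${ACC_FLAGS}  -o saxpy_acc saxpy_acc.c
  nvc       ${OMP_FLAGS}  -o saxpy_omp saxpy_omp.c
  nvfortran ${CUDA_FLAGS} -o saxpy_cuf saxpy.cuf
else
  echo "nvc not found: load the nvhpc module first"
fi
```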

GNU compiler collection

The GNU compilers are always available. GCC version 8.5.0 is available without the need to load any gcc module; more recent versions can be found in the module environment.

The names of the GNU compilers are:

  • g77: Fortran77 compiler
  • gfortran: Fortran95 compiler
  • gcc: C compiler
  • g++: C++ compiler

The documentation can be obtained with the man command after loading the gnu module:

> man gfortran
> man gcc

CUDA

Compute Unified Device Architecture is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units (GPUs). With CUDA, developers are able to dramatically speed up computing applications by harnessing the power of GPUs. 

In GPU-accelerated applications, the sequential part of the workload runs on the CPU – which is optimized for single-threaded performance – while the compute intensive portion of the application runs on thousands of GPU cores in parallel. When using CUDA, developers program in popular languages such as C, C++, Fortran, Python and MATLAB and express parallelism through extensions in the form of a few basic keywords. We refer to the NVIDIA CUDA Parallel Computing Platform documentation.

Debugger and Profilers

If at runtime your code dies, then there is a problem. In order to solve it, you can decide to analyze the core file (core not available with PGI compilers) or to run your code using the debugger.

Compiler flags

Whatever your decision, in any case, you need to enable compiler runtime checks, by putting specific flags during the compilation phase. In the following we describe those flags for the different Fortran compilers: if you are using the C or C++ compiler, please check before because the flags may differ.

The following flags are generally available for all compilers and are mandatory for an easier debugging session:

-O0     Lower level of optimization
-g      Produce debugging information

Other flags are compiler specific and are described in the following:

PORTLAND Group (PGI) Compilers

The following flags are useful (in addition to "-O0 -g") for debugging your code:

-C                     Add array bounds checking
-Ktrap=ovf,divz,inv    Controls the behavior of the processor when exceptions occur: 
                       FP overflow, divide by zero, invalid operands

GNU Fortran compilers

The following flags are useful (in addition to "-O0 -g") for debugging your code:

-Wall             Enables warnings pertaining to usage that should be avoided
-fbounds-check    Checks for array subscripts.
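
As a small illustration of what these flags buy you, the sketch below builds a deliberately faulty Fortran program with the GNU flags listed above. The program and file names are invented for this example. With bounds checking enabled, the out-of-range write is expected to stop the program with a runtime error instead of silently corrupting memory.

```shell
# Write a tiny test program that writes one element past the end of an array.
cat > bounds_demo.f90 <<'EOF'
program bounds_demo
  implicit none
  integer :: a(10), i, n
  n = 11
  do i = 1, n          ! deliberately runs one element past the end of a
    a(i) = i
  end do
  print *, a(1)
end program bounds_demo
EOF

# Compile with the debugging flags and run, if gfortran is available.
if command -v gfortran >/dev/null 2>&1; then
  gfortran -O0 -g -Wall -fbounds-check -o bounds_demo bounds_demo.f90
  ./bounds_demo || echo "the runtime check stopped the program"
else
  echo "gfortran not found: load a gcc module first"
fi
```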

Debuggers available

GNU: gdb (serial debugger)

GDB is the GNU Project debugger and allows you to see what is going on 'inside' your program while it executes -- or what the program was doing at the moment it crashed.

VALGRIND

Valgrind is a framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile your programs in detail. The Valgrind distribution currently includes six production-quality tools: a memory error detector, two thread error detectors, a cache and branch-prediction profiler, a call-graph generating cache profiler, and a heap profiler.

Valgrind is Open Source / Free Software, and is freely available under the GNU General Public License, version 2.

Profilers

In software engineering, profiling is the investigation of a program's behavior using information gathered as the program executes. The usual purpose of this analysis is to determine which sections of a program to optimize - to increase its overall speed, decrease its memory requirement or sometimes both.

A (code) profiler is a performance analysis tool that, most commonly, measures only the frequency and duration of function calls, but there are other specific types of profilers (e.g. memory profilers) in addition to more comprehensive profilers, capable of gathering extensive performance data.

gprof

The GNU profiler gprof is a useful tool for measuring the performance of a program. It records the number of calls to each function and the amount of time spent there, on a per-function basis. Functions which consume a large fraction of the run-time can be identified easily from the output of gprof. Efforts to speed up a program should concentrate first on those functions which dominate the total run-time.

gprof uses data collected by the -pg compiler flag to construct a text display of the functions within your application (call tree and CPU time spent in every subroutine). It also provides quick access to the profiled data, which let you identify the functions that are the most CPU-intensive. The text display also lets you manipulate the display in order to focus on the application's critical areas.

Usage:

> gfortran -pg -O3 -o myexec myprog.f90
> ./myexec
> ls -ltr
   .......
   -rw-r--r-- 1 aer0 cineca-staff    506 Apr  6 15:33 gmon.out
> gprof myexec gmon.out

It is also possible to profile at code line-level (see "man gprof" for other options). In this case, you must use also the “-g” flag at compilation time:

> gfortran -pg -g -O3 -o myexec myprog.f90
> ./myexec
> ls -ltr
   .......
   -rw-r--r-- 1 aer0 cineca-staff    506 Apr  6 15:33 gmon.out
> gprof --annotated-source myexec gmon.out

It is possible to profile MPI programs. In this case, the environment variable GMON_OUT_PREFIX must be defined so that each task writes a different statistical file. After setting

export GMON_OUT_PREFIX=<name>

once the run is finished each task will have created a file with its process ID (PID) as extension:

<name>.$PID

If the environment variable is not set, every task will write the same gmon.out file.
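
A possible way to post-process the per-task files is sketched below. The prefix gmon.myexec and the output file names are arbitrary choices made for this example. The loop does nothing if no profile file is present yet.

```shell
# One gmon file per MPI task will be written as <prefix>.<pid>.
export GMON_OUT_PREFIX="gmon.myexec"

# The run step belongs in the job script on the cluster, for example:
#   srun ./myexec

# Post-processing: produce one text profile per task file found.
for f in "${GMON_OUT_PREFIX}".*; do
  [ -e "$f" ] || continue               # no profile files yet
  gprof myexec "$f" > "profile.${f##*.}.txt"
done
```

If a single merged profile is preferred, gprof can also sum several data files with its -s option, which writes the result to gmon.sum.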

Nvidia Nsight System (GPU profiler)

NVIDIA Nsight Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs, from large servers to the smallest SoC.
You can find general info on how to use it in the dedicated NVIDIA User Guide pages.

Our suggestion is to run the CLI inside your job script in order to generate the qdrep files. Then you can download the qdrep files on your local PC and visualize them with the Nsight System GUI available on your workstation.

The profiler is available under the module nvhpc.

Standard usage of an MPI job running on GPU is

> mpirun <options> nsys profile -o ${PWD}/output_%q{OMPI_COMM_WORLD_RANK} -f true --stats=true --cuda-memory-usage=true <your_code> <input> <output>

On the single node you can also run the profiler as "nsys profile mpirun", but keep in mind that with this syntax nsys will put everything in a single report.

Unfortunately nsys usually generates several files in /tmp dir of the compute node even if a TMPDIR environment variable is set. These files may be big causing the filling of the /tmp folder and, as a consequence, the crash of the compute node and the failure of the job.
In order to avoid such a problem we strongly suggest to include in your sbatch script the following lines around your mpirun call as a workaround:

> rm -rf /tmp/nvidia
> ln -s $TMPDIR /tmp/nvidia
> mpirun ... nsys profile ...
> rm -rf /tmp/nvidia

This will place the temporary outputs of the nsys code in your TMPDIR folder that by default is /dev/shm/slurm_job.$SLURM_JOB_ID where you have about 250 GB of free space.
This workaround may cause conflicts between multiple jobs running this profiler on a compute node at the same time, so we strongly suggest also to request the compute node exclusively:

#SBATCH --exclusive
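
Putting the pieces together, a job script applying the workaround could look like the sketch below. The executable name, the resource sizes and the time limit are placeholders, and <account_name> and <version> must be filled in. The partition name boost_usr_prod is the one mentioned in the partition notes of this page.

```shell
# Write a sketch job script to nsys_job.x; nothing is submitted here.
cat > nsys_job.x <<'EOF'
#!/bin/bash
#SBATCH -A <account_name>
#SBATCH -p boost_usr_prod
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --exclusive

module load nvhpc/<version>
module load openmpi/<version>

# Redirect the nsys temporary files from /tmp to the job's TMPDIR.
rm -rf /tmp/nvidia
ln -s "$TMPDIR" /tmp/nvidia

mpirun -np 4 nsys profile -o "${PWD}/output_%q{OMPI_COMM_WORLD_RANK}" \
    -f true --stats=true --cuda-memory-usage=true ./myexec

# Remove the link so that later jobs on the node start clean.
rm -rf /tmp/nvidia
EOF

echo "written: nsys_job.x"
```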

MPI environment

We offer two options for MPI environment on LEONARDO:

  • Open MPI
  • Intel-OneAPI-MPI

Here you can find some useful details on how to use them on LEONARDO.

OpenMPI

Open MPI, the most commonly used MPI implementation, is installed inside the GNU environment.
It is configured to support both CUDA-aware MPI and GPUDirect.

To install MPI applications using Open MPI you have to load openmpi module (use "modmap -m openmpi" command to see the available Open MPI versions) and select the MPI compiler wrapper for Fortran, C or C++ codes.

The openmpi module provides the following wrappers:

...

Compiler

...

Wrapper

...

Usage

...

mpic++
mpiCC
mpicxx

...

mpif77
mpif90
mpifort

...

e.g. Compiling C code

> module load openmpi/<version>
> mpicc -o myexec  myprog.c (uses the gcc compiler)

You can add all options available for the backend compiler (you can show it  by "-show" flag, e.g. "mpicc -show").  In order to list them type the "man" command:

> man mpicc

Intel-OneAPI-MPI

This is the MPI implementation of Intel and doesn't support CUDA.

To install MPI applications using Intel MPI you have to load the intel-oneapi-mpi module (use the "modmap -m intel-oneapi-mpi" command to see the available versions).

The intel-oneapi-mpi module provides the following wrappers for classic intel compilers and oneapi ("x") compilers:

...

Compiler

...

Wrapper

...

Usage

icpc

icpx

mpiicpc

mpiicpc -cxx=icpx

...

Compile C++ source files with classic Intel

Compile C++ source files with oneapi 

icc

icx

mpiicc

mpiicc -cc=icx

...

Compile C source files with classic Intel

Compile C source files with oneapi 

ifort

ifx 

mpiifort (Fortran90/77)

mpiifort -fc=ifx

...

Compile FORTRAN source files with classic Intel

Compile FORTRAN source files with oneapi

e.g. Compiling Fortran code

> module load intel-oneapi-mpi/<version>
> mpiifort -o myexec  myprog.f90 (uses the ifort compiler)

You can add all options available for the backend compiler (you can show it with the "-show" flag, e.g. "mpiifort -show"). In order to list them type the "man" command:

> man mpiifort

Scientific libraries

Linear Algebra

GPU accelerated

The NVIDIA math libraries are available by loading the "nvhpc" module (use the "modmap -m nvhpc" command to see the available versions of nvhpc).

Math libraries not from NVIDIA but installed with CUDA support are available by loading the corresponding module, e.g. "module load magma/<vers>". Notice that when you load the module of any of these libraries, the CUDA module is not automatically loaded.

  • BLAS: nvidia cublas, magma 
  • LAPACK: nvidia cusolver, magma 
  • SCALAPACK:  slate 
  • EIGENVALUE SOLVERS: nvidia cusolver, magma (single-node), slate, elpa and slepC (multi-node) 
  • SPARSE MATRICES: nvidia cuSPARSE, PETSc (multi-node), SuperLU-dist (multi-node)
  • Hypre (multi-node)

CUDA not supported

  • BLAS: openblas,  intel-oneapi-mkl
  • LAPACK: openblas, intel-oneapi-mkl
  • SCALAPACK:  netlib-scalapack, intel-oneapi-mkl

Fast Fourier Transform

GPU accelerated

The NVIDIA math libraries are available by loading the "nvhpc" module (use the "modmap -m nvhpc" command to see the available versions of nvhpc).

  • nvidia cuFFT/cuFFTW (single-node)

CUDA not supported

  • FFTW (single and multi-node)