...

This system is the new pre-exascale Tier-0 EuroHPC supercomputer hosted by CINECA and currently being built in the Bologna Technopole, Italy. It is supplied by ATOS and based on BullSequana XH2135 supercomputer nodes, each with four NVIDIA Tensor Core GPUs and a single Intel CPU. It also uses NVIDIA Mellanox HDR 200 Gb/s InfiniBand connectivity, with smart in-network computing acceleration engines that enable extremely low latency and high data throughput for AI and HPC application performance and scalability.

System Architecture

Architecture: Atos BullSequana XH2135 "Da Vinci" blade - Booster
              Atos BullSequana X2610 compute blade - Data-centric (will be available in the last quarter of 2023)
Internal Network: NVIDIA Mellanox HDR DragonFly+ 200 Gb/s
Storage: 106 PB (raw) Large capacity storage, 620 GB/s
         High Performance Storage 5.4 PB, 1.4 TB/s, based on 31 x DDN Exascaler ES400NVX2

...

The following guide already refers to the production configuration. The pre-production phase will begin in the next few days, with mandatory access via 2FA. Please refer to the Access section below in the Leonardo User Guide.

Peak performance details

Node Performance

                                        Theoretical Peak Performance
CPU (nominal/peak freq.)                1680 Gflops
GPU                                     75000 Gflops
Total                                   76680 Gflops
Memory Bandwidth (nominal/peak freq.)   24.4 GB/s
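
The total in the table is the sum of the CPU and GPU figures. The short shell sketch below checks that arithmetic and derives a per-GPU share. The even split of the GPU figure across the four GPUs of a node is an assumption made here for illustration, not a figure stated in this guide.

```shell
# Figures copied from the node performance table (Gflops).
CPU_GFLOPS=1680
GPU_GFLOPS=75000
GPUS_PER_NODE=4

# Node total: CPU plus GPU contributions.
TOTAL_GFLOPS=$((CPU_GFLOPS + GPU_GFLOPS))
# Assumption: the GPU figure is shared evenly by the GPUs of one node.
PER_GPU_GFLOPS=$((GPU_GFLOPS / GPUS_PER_NODE))

echo "Total per node: ${TOTAL_GFLOPS} Gflops"
echo "Per GPU (assumed even split): ${PER_GPU_GFLOPS} Gflops"
```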

Access

IMPORTANT: Leonardo is still not in production. The hostname indicated below is disabled until official communication via HPC News.

...

Access to Leonardo requires two-factor authentication (2FA). Please refer to this link of the User Guide to activate and connect via 2FA. For information about data transfer from other computers, please follow the instructions and caveats in the dedicated section Data storage or the document Data Management.

Accounting

The accounting is still unavailable in this pre-production phase and will soon be implemented.

...

Please note that the accounting is in terms of consumed core hours, but it also depends strongly on the requested memory and number of GPUs. Please refer to the dedicated section.

Budget Linearization policy

On LEONARDO, as on the other HPC clusters in Cineca, a linearization policy for the usage of project budgets has been defined and implemented. The goal is to improve response time by giving users the opportunity to use the CPU hours assigned to their project in proportion to the project's actual size (total amount of core-hours).


Disks and Filesystems

The storage organization conforms to the CINECA infrastructure (see Section Data Storage and Filesystems). 

...

Since all the filesystems are based on Lustre, the usual Unix command "quota" does not work. Use the local command cindata, which will be available soon, to query disk usage and quota ("cindata -h" for help).

> cindata


Software environment

Module environment

The software modules are collected in different profiles and organized by functional categories (compilers, libraries, tools, applications,...). The profiles are of two types: “programming” type (base and advanced) for compilation, debugging and profiling activities, and  “domain” type (chem-phys, lifesc,..) for the production activity. They can be loaded together.

...

To list all profiles, categories and modules available on LEONARDO, the command “modmap” will soon be available, as on the other clusters. With modmap you can see whether the desired module is available and which profile you have to load to use it.

>modmap -m <module_name> 

Spack environment

If you don't find the software you are interested in, you can install it yourself.
For this purpose, on Leonardo we also offer the possibility of using the “spack” environment by loading the corresponding module. Please refer to the dedicated section in UG2.6: Production Environment

Please note that we are still optimizing the Leonardo software stack, and installations may be added or replaced. Always check with "module av" (the hash in the module name can change).
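
A minimal sketch of checking for and loading the module follows. The module name "spack" is an assumption based on the description above; confirm the exact name (and hash) with "module av" on the system.

```shell
# Assumed module name; the real name may carry a version or hash.
MODULE_NAME="spack"

if command -v module >/dev/null 2>&1; then
    module av "$MODULE_NAME"      # list the matching modules
    module load "$MODULE_NAME"    # load the environment
else
    echo "module command not found: run this in a LEONARDO login shell"
fi
```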

GPU and intra/inter connection environment

It will be described soon.

Production environment

Since LEONARDO is a general purpose system and is used by several users at the same time, long production jobs must be submitted using a queuing system (scheduler). The scheduler guarantees that access to the resources is as fair as possible.

...

Roughly speaking, there are two different modes to use an HPC system: Interactive and Batch. For a general discussion see the section Production Environment.

Interactive

A serial program can be executed in the standard UNIX way:

...

This is allowed only for very short runs on the login nodes. Soon we will impose a 10-minute CPU-time limit on interactive processes. Please do not execute parallel applications on the login nodes!

Batch

As usual on HPC systems, large production runs are executed in batch mode. This means that the user writes a list of commands into a file (for example script.x) and then submits it to a scheduler (SLURM for Leonardo) that will search for the required resources in the system. As soon as the resources are available, script.x is executed and the results are sent back to the user.
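
As an illustration of this workflow, the sketch below writes a small script.x and submits it. The resource values, the account name my_account and the executable my_program are placeholders chosen for this example; only the file name script.x and the partition name boost_usr_prod come from this guide.

```shell
# Write a minimal job script; all resource values are illustrative.
cat > script.x <<'EOF'
#!/bin/bash
# One node, four tasks, ten minutes of wall time.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:10:00
#SBATCH --partition=boost_usr_prod
# Placeholder: replace with your own project account.
#SBATCH --account=my_account

# Hypothetical executable.
srun ./my_program
EOF

# Submit only where SLURM is installed (for example on a login node).
if command -v sbatch >/dev/null 2>&1; then
    sbatch script.x
else
    echo "sbatch not found: script.x written but not submitted"
fi
```

After submission, the job state can be followed with the standard SLURM command "squeue -u $USER".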

...

Please note: the "mail" directives are not effective yet.

SLURM partitions

A list of partitions defined on the cluster, with access rights and resources definition, can be displayed with the command sinfo:

...

  • *For the "boost_usr_prod" partition you can use at most 32 nodes (MaxTime=24:00:00). Please request the boost_qos_bprod QOS to go up to 512 nodes (MaxTime=10:00:00). This limit will be in place until May 25, when it will be reduced to 256 nodes with MaxTime=24:00:00 (production environment).
  • For EUROFusion users and their dedicated queues please refer to the dedicated document.
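
The node limits above can be turned into a small helper that selects the sbatch options for a given node count. This is only a sketch: slurm_opts_for_nodes is a hypothetical function written for this example, and the 512-node value must be lowered to 256 once the reduced limit applies.

```shell
# Node limits quoted in the list above.
MAX_NODES_DEFAULT=32     # boost_usr_prod without an extra QOS
MAX_NODES_BPROD=512      # boost_usr_prod with the boost_qos_bprod QOS

# Print the sbatch options for the requested node count,
# or return 1 if the request exceeds both limits.
slurm_opts_for_nodes() {
    nodes=$1
    if [ "$nodes" -le "$MAX_NODES_DEFAULT" ]; then
        echo "--partition=boost_usr_prod --nodes=$nodes"
    elif [ "$nodes" -le "$MAX_NODES_BPROD" ]; then
        echo "--partition=boost_usr_prod --qos=boost_qos_bprod --nodes=$nodes"
    else
        echo "request of $nodes nodes exceeds the documented limits" >&2
        return 1
    fi
}

slurm_opts_for_nodes 16
slurm_opts_for_nodes 128
```

The output can be passed straight to the scheduler, for example: sbatch $(slurm_opts_for_nodes 128) my_batch_script.sh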

Graphic session

It will be available soon. 

Programming environment

Leonardo compute nodes host four A100 GPUs per node (CUDA compute capability 8.0). The most recent versions of the NVIDIA CUDA toolkit and of the NVIDIA nvhpc compilers (ex PGI, supporting CUDA Fortran) are available in the module environment.
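
Compute capability 8.0 is what you pass to the compilers when targeting these GPUs. The sketch below shows the corresponding flag for nvcc and for the nvhpc compilers; the source files saxpy.cu and saxpy.cuf are hypothetical, and the commands are skipped when the compilers or files are not present.

```shell
# Compute capability 8.0 expressed in each compiler's own flag syntax.
NVCC_ARCH_FLAG="-arch=sm_80"   # CUDA toolkit compiler (nvcc)
NVHPC_GPU_FLAG="-gpu=cc80"     # nvhpc compilers (nvc, nvc++, nvfortran)

# CUDA C/C++ source (hypothetical file name).
if command -v nvcc >/dev/null 2>&1 && [ -f saxpy.cu ]; then
    nvcc "$NVCC_ARCH_FLAG" -o saxpy saxpy.cu
fi

# CUDA Fortran source (hypothetical file name); -cuda enables CUDA Fortran.
if command -v nvfortran >/dev/null 2>&1 && [ -f saxpy.cuf ]; then
    nvfortran -cuda "$NVHPC_GPU_FLAG" -o saxpy_f saxpy.cuf
fi
```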

Compilers

You can check the complete list of available compilers on LEONARDO with the command:

...

In GPU-accelerated applications, the sequential part of the workload runs on the CPU (which is optimized for single-threaded performance), while the compute-intensive portion of the application runs on thousands of GPU cores in parallel. When using CUDA, developers program in popular languages such as C, C++, Fortran, Python and MATLAB and express parallelism through extensions in the form of a few basic keywords. For details, refer to the NVIDIA CUDA Parallel Computing Platform documentation.

Debugger and Profilers

If your code dies at runtime, there is a problem to track down. To solve it, you can either analyze the core file (core not available with PGI compilers) or run your code under a debugger.

Compiler flags

In either case, you need to enable compiler runtime checks by adding specific flags at compilation time. In the following we describe those flags for the different Fortran compilers. If you are using the C or C++ compiler, please check beforehand, because the flags may differ.

...

-O0     Lowest level of optimization
-g      Produce debugging information

Other flags are compiler specific and are described in the following:

PORTLAND Group (PGI) Compilers

The following flags are useful (in addition to "-O0 -g") for debugging your code:

-C                     Add array bounds checking
-Ktrap=ovf,divz,inv    Controls the behavior of the processor when exceptions occur: 
                       FP overflow, divide by zero, invalid operands

GNU Fortran compilers

The following flags are useful (in addition to "-O0 -g") for debugging your code:

-Wall             Enables warnings pertaining to usage that should be avoided
-fbounds-check    Checks array subscripts against the declared bounds
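
To see the effect of these flags, the sketch below builds a tiny Fortran program with a deliberate out-of-bounds write using the GNU flags listed above. The file name bounds_demo.f90 and the program itself are invented for this example; the compile and run steps are skipped when gfortran is not available.

```shell
# A tiny Fortran program with a deliberate out-of-bounds write.
cat > bounds_demo.f90 <<'EOF'
program bounds_demo
  implicit none
  integer :: a(5), i, n
  n = 6
  do i = 1, n          ! the last iteration writes a(6), past the end of a
     a(i) = i
  end do
  print *, a(1)
end program bounds_demo
EOF

DEBUG_FLAGS="-O0 -g -Wall -fbounds-check"

if command -v gfortran >/dev/null 2>&1; then
    # DEBUG_FLAGS is left unquoted on purpose so it splits into separate flags.
    gfortran $DEBUG_FLAGS -o bounds_demo bounds_demo.f90
    # With bounds checking enabled, the run is expected to stop with a
    # runtime error instead of silently corrupting memory.
    ./bounds_demo || echo "runtime check stopped the program"
else
    echo "gfortran not found: bounds_demo.f90 written but not compiled"
fi
```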

Debuggers available

GNU: gdb (serial debugger)

GDB is the GNU Project debugger and allows you to see what is going on 'inside' your program while it executes -- or what the program was doing at the moment it crashed.
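
A minimal sketch of a non-interactive gdb session follows. The executable name my_program is hypothetical and is assumed to have been built with "-O0 -g"; the block does nothing when gdb or the executable is missing.

```shell
# Hypothetical executable built with "-O0 -g".
EXE=./my_program

if command -v gdb >/dev/null 2>&1 && [ -x "$EXE" ]; then
    # Run the program under gdb and print a backtrace when it stops.
    gdb -batch -ex run -ex bt "$EXE"
    # Post-mortem analysis of a core file, if one was produced:
    #   gdb "$EXE" core
else
    echo "gdb or $EXE not available: nothing to debug"
fi
```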

VALGRIND

Valgrind is a framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile your programs in detail. The Valgrind distribution currently includes six production-quality tools: a memory error detector, two thread error detectors, a cache and branch-prediction profiler, a call-graph generating cache profiler, and a heap profiler.

Valgrind is Open Source / Free Software, and is freely available under the GNU General Public License, version 2.
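
As a sketch of the memory error detector, the block below runs a program under Valgrind's memcheck tool with full leak reporting. The executable name my_program is hypothetical, and the block does nothing when valgrind or the executable is missing.

```shell
# Hypothetical executable, ideally built with "-O0 -g" for readable reports.
EXE=./my_program
VALGRIND_OPTS="--tool=memcheck --leak-check=full"

if command -v valgrind >/dev/null 2>&1 && [ -x "$EXE" ]; then
    # VALGRIND_OPTS is left unquoted on purpose so it splits into separate options.
    valgrind $VALGRIND_OPTS "$EXE"
else
    echo "valgrind or $EXE not available: nothing to check"
fi
```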

Profilers

In software engineering, profiling is the investigation of a program's behavior using information gathered as the program executes. The usual purpose of this analysis is to determine which sections of a program to optimize - to increase its overall speed, decrease its memory requirement or sometimes both.

A (code) profiler is a performance analysis tool that, most commonly, measures only the frequency and duration of function calls, but there are other specific types of profilers (e.g. memory profilers) in addition to more comprehensive profilers, capable of gathering extensive performance data.

gprof

The GNU profiler gprof is a useful tool for measuring the performance of a program. It records the number of calls to each function and the amount of time spent there, on a per-function basis. Functions which consume a large fraction of the run-time can be identified easily from the output of gprof. Efforts to speed up a program should concentrate first on those functions which dominate the total run-time.
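
A minimal serial gprof workflow might look like the sketch below: compile with -pg, run the program to produce gmon.out, then pass both to gprof. The program prof_demo.f90 is invented for this example, and the steps are skipped when gfortran or gprof is not available.

```shell
# A small Fortran program with one function worth profiling.
cat > prof_demo.f90 <<'EOF'
program prof_demo
  implicit none
  integer :: i
  real(8) :: s
  s = 0.0d0
  do i = 1, 1000000
     s = s + work(i)
  end do
  print *, s
contains
  real(8) function work(k)
    integer, intent(in) :: k
    work = sqrt(real(k, 8))
  end function work
end program prof_demo
EOF

if command -v gfortran >/dev/null 2>&1 && command -v gprof >/dev/null 2>&1; then
    # -pg adds the profiling instrumentation.
    gfortran -pg -g -o prof_demo prof_demo.f90
    # Running the instrumented program writes gmon.out in the current directory.
    ./prof_demo
    # Produce the flat profile and call graph.
    gprof ./prof_demo gmon.out > prof_report.txt
else
    echo "gfortran or gprof not found: prof_demo.f90 written but not profiled"
fi
```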

...

If the environment variable is not set, every task will write to the same gmon.out file.

NVIDIA Nsight Systems (GPU profiler)

NVIDIA Nsight Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs, from large servers to the smallest SoC.
You can find general information on how to use it in the dedicated NVIDIA User Guide pages.

...

This will place the temporary outputs of nsys in your TMPDIR folder, which by default is /dev/shm/slurm_job.$SLURM_JOB_ID, where you have about 250 GB of free space.
This workaround may cause conflicts between multiple jobs running this profiler on a compute node at the same time, so we strongly suggest also requesting the compute node exclusively:

#SBATCH --exclusive
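
A sketch of a job script that combines the exclusive request with a profiler run is shown below. The script name nsys_job.sh, the report name my_report and the executable my_program are hypothetical, the other resource values are illustrative, and the TMPDIR setting discussed above is not repeated here.

```shell
# Write a job script that profiles one run on an exclusively allocated node.
cat > nsys_job.sh <<'EOF'
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:30:00
#SBATCH --partition=boost_usr_prod
#SBATCH --exclusive

# Hypothetical executable and report name.
nsys profile -o my_report ./my_program
EOF

# Submit only where SLURM is installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch nsys_job.sh
else
    echo "sbatch not found: nsys_job.sh written but not submitted"
fi
```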

MPI environment

We offer two options for the MPI environment on LEONARDO:

...

Here you can find some useful details on how to use them on LEONARDO.

Compiling

OpenMPI

This is the most common MPI implementation and is installed inside the GNU environment.
It is configured with CUDA-aware support.
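
One way to confirm CUDA-aware support in the loaded OpenMPI build is to query ompi_info, as sketched below. On a CUDA-aware build the matching line is expected to end in "value:true". The block only prints a hint when ompi_info is not in the PATH.

```shell
# Parameter reported by ompi_info for CUDA-aware builds.
CUDA_KEY="mpi_built_with_cuda_support:value"

if command -v ompi_info >/dev/null 2>&1; then
    ompi_info --parsable --all | grep "$CUDA_KEY" \
        || echo "no CUDA support entry reported by ompi_info"
else
    echo "ompi_info not found: load the openmpi module first"
fi
```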

...

You can add all options available for the backend compiler (you can show it with the "-show" flag, e.g. "mpicc -show"). To list them, use the "man" command:

> man mpicc

Intel-OneAPI-MPI

This is the MPI implementation of Intel and doesn't support CUDA.

...

You can add all options available for the backend compiler (you can show it with the "-show" flag, e.g. "mpicc -show"). To list them, use the "man" command:

> man mpiifort

Running

There are two ways to run MPI applications:

  • using the mpirun launcher
  • using the srun launcher

mpirun launcher

To use the mpirun launcher, the openmpi or intel-oneapi-mpi module needs to be loaded:

...

> sbatch -N 2 my_batch_script.sh (allocate a job of 2 nodes) 
> cat my_batch_script.sh
#!/bin/sh
mpirun ./mpi_exec

srun launcher

MPI applications can be launched directly with the SLURM launcher srun.

...

> sbatch -N 2 my_batch_script.sh (allocate a job of 2 nodes) 
> vi my_batch_script.sh
#!/bin/sh
srun -N 2  ./mpi_exec

Scientific libraries

Linear Algebra

GPU accelerated

The NVIDIA math libraries are available by loading the "nvhpc" module (use the "modmap -m nvhpc" command to see the available versions of nvhpc).

...

  • BLAS: NVIDIA cuBLAS, magma
  • LAPACK: NVIDIA cuSOLVER, magma
  • SCALAPACK: slate
  • EIGENVALUE SOLVERS: NVIDIA cuSOLVER, magma (single-node), slate, elpa and SLEPc (multi-node)
  • SPARSE MATRICES: NVIDIA cuSPARSE, PETSc (multi-node), SuperLU-dist (multi-node)
  • Hypre (multi-node)

CUDA not supported

  • BLAS: openblas, intel-oneapi-mkl
  • LAPACK: openblas, intel-oneapi-mkl
  • SCALAPACK: netlib-scalapack, intel-oneapi-mkl

Fast Fourier Transform

GPU accelerated

The NVIDIA math libraries are available by loading the "nvhpc" module (use the "modmap -m nvhpc" command to see the available versions of nvhpc).

  • NVIDIA cuFFT/cuFFTW (single-node)

CUDA not supported

  • FFTW (single and multi-node)

...