hostname: login.lrd.cineca.it

early availability: March 2023

start of production: April 2023 (Booster), last quarter of 2023 (Data Centric)


This system is the new pre-exascale Tier-0 EuroHPC supercomputer hosted by CINECA and currently being built at the Bologna Technopole, Italy. It is supplied by ATOS and based on BullSequana XH2135 supercomputer nodes, each with four NVIDIA Tensor Core GPUs and a single Intel CPU. It also uses NVIDIA Mellanox HDR 200 Gb/s InfiniBand connectivity, with smart in-network computing acceleration engines that enable extremely low latency and high data throughput for AI and HPC applications.

System Architecture

Architecture:
  • Booster (in preproduction): Atos BullSequana XH21355 "Da Vinci" blade
  • Data Centric (will be available in the last quarter of 2023): Atos BullSequana X2610 compute blade
Internal Network: NVIDIA Mellanox HDR DragonFly+ 200 Gb/s
Storage:
  • Large capacity storage: 106 PB (raw), 620 GB/s
  • High performance storage: 5.4 PB, 1.4 TB/s, based on 31 x DDN Exascaler ES400NVX2

Login nodes: 16 ?



| | Booster | Data Centric |
|---|---|---|
| Model | Atos BullSequana XH21355 "Da Vinci" blade | Atos BullSequana X2610 compute blade |
| Racks | 150 | |
| Nodes | 3456 | 1536 |
| Processors | 32 cores Intel Ice Lake at xx(xx) GHz | 56 cores Intel Sapphire Rapids |
| Accelerators | 4 x NVIDIA Ampere GPUs/node, 64 GB HBM2; 2 x NVIDIA HDR 2×100 Gb/s cards | 1 x NVIDIA HDR100 100 Gb/s card |
| Cores | 32 cores/node | 56 cores/node |
| RAM | 512 (8 x 64) GB DDR4 3200 MHz | (16 x 32) GB DDR5 4800 MHz |
| Peak Performance | about 309 Pflop/s | 9 Pflop/s |

Internal Network: NVIDIA Mellanox HDR DragonFly+ 200 Gb/s

Disk Space: 106 PB of large capacity storage; 5.4 PB of high performance storage








Peak performance details


Node Performance

Theoretical Peak Performance
  • CPU (nominal/peak freq.):
  • GPU:
  • Total:
Memory Bandwidth (nominal/peak freq.):

Access

All the login nodes have an identical environment and can be reached with SSH (Secure Shell) protocol using the "collective" hostname:

> login.lrd.cineca.it

For information about data transfer from other computers, please follow the instructions and caveats in the dedicated section Data storage or in the document Data Management.
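As a minimal sketch of the SSH access described above, the helper below builds the user@host target from the collective hostname and opens a session. The function names and the username passed to them are hypothetical; only the hostname comes from this page.

```shell
#!/bin/bash
LEONARDO_HOST="login.lrd.cineca.it"   # collective hostname given on this page

# Build the user@host target for a given username (hypothetical helper).
leonardo_target() {
    local user="${1:?usage: leonardo_target <username>}"
    printf '%s@%s\n' "$user" "$LEONARDO_HOST"
}

# Open an interactive SSH session on one of the login nodes.
leonardo_login() {
    ssh "$(leonardo_target "$1")"
}

# Example: leonardo_login <username>
```

The collective hostname resolves to one of the login nodes, so the same command works whichever node you land on.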

Accounting

For accounting information please consult our dedicated section.

The account_no (or project) is important for batch executions. You need to indicate the account_no to be charged in the scheduler, using the flag "-A":

#SBATCH -A <account_no>

With the "saldo -b" command you can list all the account_no associated with your username. 

> saldo -b   (reports projects defined on LEONARDO )

Please note that accounting is expressed in consumed core hours, but it also depends strongly on the requested memory and number of GPUs. Please refer to the dedicated section.

Budget Linearization policy

On Leonardo, as on the other HPC clusters at CINECA, a linearization policy for the usage of project budgets has been defined and implemented. The goal is to improve the response time, giving users the opportunity to use the CPU hours assigned to their project in proportion to the project's actual size (total amount of core-hours).

Disks and Filesystems

The storage organization conforms to the CINECA infrastructure (see Section Data Storage and Filesystems). 

In addition to the home directory $HOME, a scratch area $CINECA_SCRATCH is defined for each user: a large disk for the storage of run time data and files.

$WORK area is defined for each active project on the system, reserved for all the collaborators of the project. This is a safe storage area to keep run time data for the whole life of the project.



| Area | Total Dimension | Quota | Notes |
|---|---|---|---|
| $HOME | 0.46 PiB | 70 GB per user | permanent/backed up, user specific, local |
| $CINECA_SCRATCH | 41.4 PiB | no quota | /leonardo_scratch/fast (confined to the flash OSTs); large (confined to the HDD OSTs); temporary, user specific, local; no backup; automatic cleaning procedure of data older than 40 days (the time interval can be reduced in case of critical usage ratio of the area; in this case, users will be notified via HPC-News) |
| $WORK | 10 PB | | permanent, project specific, local; no backup; extensions can be considered if needed (mailto: superc@cineca.it) |
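Because data older than 40 days is removed automatically from $CINECA_SCRATCH, it can be useful to check which files are at risk. The sketch below lists files by age; list_stale_files is a hypothetical helper, and judging age by modification time (find -mtime) is an assumption, since this page does not state which timestamp the cleaning procedure uses.

```shell
#!/bin/bash
# List regular files under a directory that were last modified more than
# N full days ago (default 40).
# Assumption: age is measured on modification time, not access time.
list_stale_files() {
    local dir="${1:?usage: list_stale_files <dir> [days]}"
    local days="${2:-40}"
    find "$dir" -type f -mtime +"$days" -print
}

# Example (on the cluster): list_stale_files "$CINECA_SCRATCH" 40
```

Files reported by such a check should be copied to a permanent area such as $WORK if they are still needed.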

A temporary storage area local to the compute nodes is also available; it is created when the job starts and is accessible via the environment variable $TMPDIR. For more details please see the dedicated section of UG2.5: Data storage and FileSystems. On Marconi100 the $TMPDIR local area has 1 TB of available space.

$DRES environment variable points to the shared repository where Data RESources are maintained. This is a data archive area available only on-request, shared with all CINECA HPC systems and among different projects. $DRES is not mounted on the compute nodes of the production partitions and can be accessed only from login nodes and from the nodes of the serial partition. This means that you cannot access it within a standard batch job: all data needed during the batch execution has to be moved to $WORK or $CINECA_SCRATCH before the run starts, either from the login nodes or via a job submitted to the serial partition. 
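Since $DRES cannot be read from a standard batch job, input data has to be staged beforehand. The following is a minimal sketch of such a staging step, meant to be run on a login node or in a job on the serial partition; stage_inputs and the directory names in the example are hypothetical.

```shell
#!/bin/bash
# Copy the contents of a source directory into a destination directory,
# creating the destination if needed. Intended for staging data from an
# archive area to a run area before the batch job is submitted.
stage_inputs() {
    local src="${1:?usage: stage_inputs <source_dir> <dest_dir>}"
    local dst="${2:?usage: stage_inputs <source_dir> <dest_dir>}"
    [ -d "$src" ] || { echo "stage_inputs: $src not found" >&2; return 1; }
    mkdir -p "$dst" && cp -a "$src"/. "$dst"/
}

# Example (directory names are placeholders):
# stage_inputs "$DRES/my_dataset" "$WORK/my_dataset"
```

The batch job can then read its inputs from the destination directory under $WORK or $CINECA_SCRATCH.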

Since all the filesystems are based on Lustre, the usual Unix command "quota" does not work. Use the local command cindata to query disk usage and quota ("cindata -h" for help):

> cindata


Modules environment

The software modules are collected in different profiles and organized by functional categories (compilers, libraries, tools, applications,..). The profiles are of two types: “programming” type (base and advanced) for compilation, debugging and profiling activities, and  “domain” type (chem-phys, lifesc,..) for the production activity. They can be loaded together.

"Base" profile is the default. It is automatically loaded after login and it contains basic modules for the programming activities (ibm, gnu, pgi, cuda compilers, math libraries, profiling and debugging tools,..).

If you want to use a module placed under another profile, for example an application module, you first have to load the corresponding profile:

> module load profile/<profile name>
> module load autoload <module name>

For listing all profiles you have loaded you can use the following command:

>module list

In order to detect all profiles, categories and modules available on LEONARDO the command “modmap” is available. With modmap you can see if the desired module is available and which profile you have to load to use it.

>modmap -m <module_name>
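The two-step loading pattern described in this section can be wrapped in a small function. This is only a sketch: load_from_profile is a hypothetical helper, it relies on the cluster's module command being available in the shell, and the profile and module names are whatever modmap reports for the software you need.

```shell
#!/bin/bash
# Load a profile first, then a module that lives under it, resolving the
# module's dependencies through "autoload".
load_from_profile() {
    local profile="${1:?usage: load_from_profile <profile> <module>}"
    local modname="${2:?usage: load_from_profile <profile> <module>}"
    module load "profile/$profile" || return 1
    module load autoload "$modname"
}

# Example: load_from_profile chem-phys <module_name>
```

Stopping at the first failure avoids loading a module against a profile that is not actually available.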

 Spack environment

If you cannot find a piece of software you are interested in, you can install it yourself.
For this purpose, on Marconi100 we also offer the possibility of using the “spack” environment by loading the corresponding module. Please refer to the dedicated section in UG2.6: Production Environment.

GPU and intra/inter connection environment


Production environment

Since Leonardo is a general purpose system and it is used by several users at the same time, long production jobs must be submitted using a queuing system (scheduler). This guarantees that the access to the resources is as fair as possible. On Leonardo the available scheduler is SLURM.

Leonardo is based on a policy of node sharing among different jobs, i.e. a job can ask for resources that amount to only part of a node, for example a few cores and 1 GPU. This means that, at a given time, one physical node can be allocated to multiple jobs of different users. Nevertheless, exclusivity at the level of the single core and GPU is guaranteed by low-level mechanisms.

Roughly speaking, there are two different modes to use an HPC system: Interactive and Batch. For a general discussion see the section Production Environment.


Interactive

A serial program can be executed in the standard UNIX way:

> ./program

This is allowed only for very short runs on the login nodes, since the interactive environment has a 10-minute cpu-time limit. Please do not execute parallel applications on the login nodes!



Batch

As usual on HPC systems, large production runs are executed in batch mode. This means that the user writes a list of commands into a file (for example script.x) and then submits it to a scheduler (SLURM for Leonardo) that will search for the required resources in the system. As soon as the resources are available, script.x is executed and the results are sent back to the user.

This is an example of script file:

#!/bin/bash
#SBATCH -A <account_no>
#SBATCH -p xxx_usr_prod
#SBATCH --time 00:10:00     # format: HH:MM:SS
#SBATCH -N 1                # 1 node
#SBATCH --ntasks-per-node=8 # 8 tasks out of 128
#SBATCH --gres=gpu:1        # 1 GPU per node out of 4
#SBATCH --mem=7100          # memory per node (MB) out of 246000 MB
#SBATCH --job-name=my_batch_job
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<user_email>

# Use exactly one of the two launch commands below:
# mpirun ./myexecutable     # in case you compiled with spectrum-mpi
srun ./myexecutable         # in all the other cases


Please note that by requesting --ntasks-per-node=8 your job will be assigned 8 logical cpus (hence, the first 2 cpus with their 4 HTs). You can write your script file (for example script.x) using any editor, then you submit it using the command:

> sbatch script.x

The script file must contain both directives to SLURM and commands to be executed, as described in more detail in the section Batch Scheduler SLURM.

Using SLURM directives you indicate the account_number (-A: which project pays for this work), where to run the job (-p: partition), what is the maximum duration of the run (--time: time limit). Moreover you indicate the resources needed, in terms of cores, GPUs and memory. 
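To make the role of each directive concrete, the sketch below prints a small job script from three parameters: the account, the partition and the time limit. make_job_script is a hypothetical helper and the resource values in the generated script are illustrative only; adapt them to the limits of the partition you use.

```shell
#!/bin/bash
# Print a minimal SLURM job script on stdout.
# Arguments: account (-A), partition (-p), time limit (--time, HH:MM:SS).
make_job_script() {
    local account="${1:?usage: make_job_script <account_no> <partition> <HH:MM:SS>}"
    local partition="${2:?missing partition}"
    local walltime="${3:?missing time limit}"
    cat <<EOF
#!/bin/bash
#SBATCH -A ${account}
#SBATCH -p ${partition}
#SBATCH --time ${walltime}
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:1
srun ./myexecutable
EOF
}

# Example: make_job_script <account_no> lrd_usr_prod 00:10:00 > script.x
```

Generating the header this way keeps the account and partition in one place when several similar jobs are prepared.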

One of the commands will probably be the launch of a parallel MPI application. In this case the right command is srun, as an alternative to the usual mpirun command. In this way you will get full support for process tracking, accounting, task affinity, suspend/resume and other features.
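After submission, sbatch prints a confirmation line of the form "Submitted batch job <id>". The sketch below extracts the job ID from that line so it can be passed to other SLURM commands; job_id_from_sbatch is a hypothetical helper name.

```shell
#!/bin/bash
# Read sbatch's confirmation line on stdin and print the numeric job ID.
job_id_from_sbatch() {
    awk '/^Submitted batch job [0-9]+/ { print $4; exit }'
}

# Example (on the cluster):
# jobid="$(sbatch script.x | job_id_from_sbatch)"
# squeue -j "$jobid"     # check the state of the job
# scancel "$jobid"       # cancel the job if needed
```

SLURM also offers "sbatch --parsable", which prints the job ID directly and can replace the parsing step.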


SLURM partitions


 A list of partitions defined on the cluster, with access rights and resources definition, can be displayed with the command sinfo:

> sinfo -o "%10D %20F %P"

The command returns a more readable output which shows, for each partition, the total number of nodes and the number of nodes by state in the format "Allocated/Idle/Other/Total".
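The "Allocated/Idle/Other/Total" field can be split to see at a glance how many nodes are idle in each partition. The following sketch reads the output of the sinfo command above from standard input; idle_nodes_per_partition is a hypothetical helper, and it assumes the three-column layout produced by that exact format string.

```shell
#!/bin/bash
# Read `sinfo -o "%10D %20F %P"` output on stdin and print, for each
# partition, its name followed by the number of idle nodes.
# The first line is the column header and is skipped.
idle_nodes_per_partition() {
    awk 'NR > 1 { split($2, s, "/"); print $3, s[2] }'
}

# Example (on the cluster):
# sinfo -o "%10D %20F %P" | idle_nodes_per_partition
```

In the sinfo output the default partition is marked with a trailing asterisk, which this sketch leaves untouched.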

In the following table you can find the main features and limits imposed on the partitions of Leonardo.

Note: core refers to a physical cpu, with its 4 HTs; cpu refers to a logical cpu (1 HT). Each node has 32 cores/128 cpus.
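The relation between cores and cpus in the note above is simple arithmetic, sketched here with the figures quoted on this page (32 cores per node, 4 hardware threads per core). The function names are hypothetical.

```shell
#!/bin/bash
# Logical cpus on a node: physical cores times hardware threads per core.
cpus_per_node() { echo $(( $1 * $2 )); }

# Physical cores covered by a request for N logical cpus, rounded up.
cores_for_cpus() { echo $(( ($1 + $2 - 1) / $2 )); }

# cpus_per_node 32 4    -> 128
# cores_for_cpus 8 4    -> 2
```

This matches the batch example, where --ntasks-per-node=8 corresponds to 8 logical cpus, i.e. 2 physical cores.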


| SLURM partition | Job QOS | # cores/# GPU per job | max walltime | max running jobs per user / max n. of cores/nodes/GPUs per user | priority | notes |
|---|---|---|---|---|---|---|
| lrd_all_serial (default) | normal | max = 1 core, 1 GPU | 04:00:00 | 4 cpus / 1 GPU | 40 | |
| lrd_all_serial (default) | qos_install | max = 16 cores | 04:00:00 | max = 16 cores, 1 job per user | 40 | request to superc@cineca.it |
| lrd_usr_prod | normal | max = 32 nodes | 24:00:00 | | 40 | runs on all nodes |
| lrd_usr_prod | lrd_qos_dbg | max = 2 nodes | 02:00:00 | 2 nodes / 64 cores / 8 GPUs | 80 | runs on 24 nodes |
| lrd_usr_prod | lrd_qos_bprod | min = 33 nodes, max = 256 nodes | 24:00:00 | 256 nodes | 60 | runs on 512 nodes; min is 33 FULL nodes |
| lrd_usr_prod | lrd_qos_lprod | max = 2 nodes | 100:00:00 | 4 nodes | 40 | |
| lrd_usr_install | qos_install + feature | 1 node per type | | | | internal use at start |
| all partitions | qos_special | > 32 nodes | > 24:00:00 | | 40 | request to superc@cineca.it |
| all partitions | qos_lowprio | max = 16 nodes | 24:00:00 | | 0 | active projects with exhausted budget; request to superc@cineca.it |
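As a rough illustration of how the node limits in the table above translate into a choice of QOS on lrd_usr_prod, the sketch below maps a requested node count to a QOS name. suggest_qos is a hypothetical helper that looks only at the node count; walltime limits, the debug and long-production QOS, and the fact that qos_special must be requested from superc@cineca.it are not handled.

```shell
#!/bin/bash
# Suggest a QOS for a job on lrd_usr_prod from the number of nodes requested.
#   up to 32 nodes    -> normal
#   33 to 256 nodes   -> lrd_qos_bprod
#   above 256 nodes   -> qos_special (available on request only)
suggest_qos() {
    local nodes="${1:?usage: suggest_qos <nodes>}"
    if   [ "$nodes" -le 32 ];  then echo "normal"
    elif [ "$nodes" -le 256 ]; then echo "lrd_qos_bprod"
    else                            echo "qos_special"
    fi
}

# Example: suggest_qos 64
```

A QOS other than normal is then selected in the job script with the SLURM option --qos.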
























Submitting serial batch jobs

 

Submitting batch jobs for production



Examples




Graphic session

It will be available soon. 

Programming environment

Compilers



Debugger and Profilers


