Link to the new User Guide https://docs.hpc.cineca.it/index.html


Production Environment

Since LEONARDO is a general purpose system and is used by several users at the same time, long production jobs must be submitted using a queuing system (scheduler). The scheduler guarantees that the access to the resources is as fair as possible. The production environment on LEONARDO Data Centric General Purpose (DCGP) partition is based on the SLURM scheduler.

LEONARDO adopts a policy of node sharing among different jobs: a job can request only part of a node, for example a few cores. This means that, at any given time, one physical node can be allocated to multiple jobs of different users. Nevertheless, exclusivity at the level of the single core is guaranteed by low-level mechanisms.

There are two main modes of using compute nodes:

  • Batch Mode: This mode is intended for production runs. Users must prepare a shell script with all the operations to be executed once the requested resources are available. The job will then run on the compute nodes. Store all your data, programs, and scripts in the $WORK or $SCRATCH filesystems, as these are best for compute node access. You must have valid active projects to run batch jobs, and be aware of any specific policies regarding project budgets on our systems.

The script file must contain both directives to SLURM and commands to be executed, as described in more detail in the section Batch Scheduler SLURM.

With SLURM directives you specify the account_name (-A: the project that pays for the work), where to run the job (-p: partition), and the maximum duration of the run (--time: time limit). You also specify the resources needed, in terms of cores and memory.

Please note that the recommended way to launch parallel MPI applications in SLURM jobs is with srun. By using srun instead of mpirun you will get full support for process tracking, accounting, task affinity, suspend/resume and other features.

 

sbatch script example
#!/bin/bash
#SBATCH -A <account_name>
#SBATCH -p dcgp_usr_prod
#SBATCH --time 00:10:00     # format: HH:MM:SS
#SBATCH -N 1                # 1 node
#SBATCH --ntasks-per-node=4 # 4 tasks out of 112
#SBATCH --mem=123000          # memory per node out of 494000MB (481GB)
#SBATCH --job-name=my_batch_job

srun ./myexecutable  


To submit the sbatch script:

$ sbatch script.x
Please refer to the general online guide to SLURM for details on task/thread bindings, and pay particular attention to setting SRUN_CPUS_PER_TASK for hybrid applications launched with "srun".
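For a hybrid MPI/OpenMP run, a job script could look like the sketch below. The file name hybrid_job.sh, the job name, and the split into 8 tasks of 14 cores each are illustrative assumptions, not site recommendations. The snippet only writes the script to disk; the submission line is left commented out.

```shell
# Write a hypothetical hybrid MPI+OpenMP job script (file name and sizes are assumptions).
cat > hybrid_job.sh <<'EOF'
#!/bin/bash
#SBATCH -A <account_name>
#SBATCH -p dcgp_usr_prod
#SBATCH --time 00:10:00          # format: HH:MM:SS
#SBATCH -N 1                     # 1 node
#SBATCH --ntasks-per-node=8      # 8 MPI tasks (illustrative)
#SBATCH --cpus-per-task=14       # 14 cores per task: 8 x 14 = 112 cores
#SBATCH --job-name=my_hybrid_job

# One OpenMP thread per core allocated to each task.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# Pass the per-task core count on to srun explicitly.
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

srun ./myexecutable
EOF

# Submit from a login node once the placeholders are filled in:
# sbatch hybrid_job.sh
```

Exporting SRUN_CPUS_PER_TASK inside the script is one way to make sure that the srun step sees the same cores-per-task value that was requested with --cpus-per-task.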
  • Interactive Mode: Jobs submitted in this mode are similar to batch mode in that the user must specify the resources to allocate. The job is then managed like any other submitted job. The key difference from batch mode is that once the job is running, the user can interactively execute applications within the limits of the allocated resources. All allocated resources are available for the entire requested walltime (and consequently billed) during the submission process.

Note:  interactive Mode under SLURM has a different meaning compared to the common understanding of interactive execution of an application under a Linux shell or prompt. Interactive execution of applications is allowed on compute nodes only via SLURM (see the next sections).

On login nodes, you may perform tasks such as data movement, archiving, code development, compilation, basic debugging, and very short test runs, provided they do not exceed 10 minutes of CPU time. These tasks are free of charge under the current billing policy.

For a general discussion see the section Scheduler and Job Submission.

SLURM partitions

A list of partitions defined on the cluster, with access rights and resources definition, can be displayed with the command sinfo:

$ sinfo -o "%10D %20F %P"

The command returns a more readable output which shows, for each partition, the total number of nodes and the number of nodes by state in the format "Allocated/Idle/Other/Total".
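When only some of the counts are needed in a script, the "Allocated/Idle/Other/Total" field can be split on the "/" character. The sketch below works on a made-up sample line rather than on live sinfo output; the helper name summarize_aiot and all the numbers are hypothetical.

```shell
# summarize_aiot: read lines in the layout "%10D %20F %P" on stdin
# (node count, A/I/O/T field, partition name) and print the four counts.
summarize_aiot() {
    awk '{
        split($2, s, "/")
        printf "%s: %d allocated, %d idle, %d other, %d total\n", $3, s[1], s[2], s[3], s[4]
    }'
}

# Made-up sample line; the numbers do not describe the real state of the cluster.
sample="1536       1200/250/86/1536     dcgp_usr_prod"
printf '%s\n' "$sample" | summarize_aiot
```

On the cluster the same helper could be fed with the real output, e.g. sinfo -h -o "%10D %20F %P" | summarize_aiot, where -h suppresses the header line.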


In the following table you can find the main features and limits imposed on the partitions of LEONARDO Data Centric.

| SLURM partition | Job QOS | # cores / # GPU per job | max walltime | max n. of nodes/cores/mem per user | max n. of nodes per account | priority | Notes |
|---|---|---|---|---|---|---|---|
| lrd_all_serial (default) | normal | max = 4 physical cores (8 logical cpus), max mem = 30800 MB | 04:00:00 | 1 node / 4 cores / 30800 MB | | 40 | No GPUs, Hyperthreading x2 |
| dcgp_usr_prod | normal | max = 16 nodes | 24:00:00 | | 512 nodes per account | 40 | |
| dcgp_usr_prod | dcgp_qos_dbg | max = 2 nodes | 00:30:00 | 2 nodes / 224 cores per user | 512 nodes per account | 80 | |
| dcgp_usr_prod | dcgp_qos_bprod | min = 17 nodes, max = 128 nodes | 24:00:00 | 128 nodes per user | 512 nodes per account | 60 | GrpTRES=1536 node; min is 17 FULL nodes |
| dcgp_usr_prod | dcgp_qos_lprod | max = 3 nodes | 4-00:00:00 | 3 nodes / 336 cores per user | 512 nodes per account | 40 | |

Note: a maximum of 512 nodes per account is also imposed on the dcgp_usr_prod partition, meaning that, for each account, all the jobs associated with it cannot run on more than 512 nodes at the same time.
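
The node and walltime limits listed for the dcgp_usr_prod QOS can be turned into a quick pre-submission check. The helper below is only a sketch: the function name suggest_qos is hypothetical, it encodes just the per-job node and walltime limits from the table, and it ignores the per-user and per-account caps.

```shell
# suggest_qos NODES MINUTES: print a QOS whose per-job limits fit the request.
suggest_qos() {
    nodes=$1
    minutes=$2
    if [ "$nodes" -le 2 ] && [ "$minutes" -le 30 ]; then
        echo "dcgp_qos_dbg"        # max 2 nodes, 00:30:00
    elif [ "$nodes" -le 16 ] && [ "$minutes" -le 1440 ]; then
        echo "normal"              # max 16 nodes, 24:00:00
    elif [ "$nodes" -ge 17 ] && [ "$nodes" -le 128 ] && [ "$minutes" -le 1440 ]; then
        echo "dcgp_qos_bprod"      # 17 to 128 full nodes, 24:00:00
    elif [ "$nodes" -le 3 ] && [ "$minutes" -le 5760 ]; then
        echo "dcgp_qos_lprod"      # max 3 nodes, 4-00:00:00
    else
        echo "none"                # no QOS in the table fits the request
    fi
}

suggest_qos 1 20      # prints dcgp_qos_dbg
suggest_qos 8 600     # prints normal
suggest_qos 64 1200   # prints dcgp_qos_bprod
suggest_qos 2 4000    # prints dcgp_qos_lprod
```

A non-default QOS is then requested in the job script with a directive such as #SBATCH --qos=dcgp_qos_dbg.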

Programming environment

LEONARDO Data Centric compute nodes are not equipped with GPUs, so GPU applications can run only on the Booster partition. The programming environment includes a set of compilers, debuggers, and profilers suitable for programming on CPUs.

Compilers

You can check the complete list of available compilers on LEONARDO with the command:

$ modmap -c compilers

The native, and recommended, compilers for the LEONARDO Data Centric partition are the Intel ones: the architecture is based on Intel processors, so using the Intel compilers may result in a significant improvement in the performance and stability of your code. On the other hand, since Intel compilers do not support CUDA, they are not recommended when working on GPUs on the LEONARDO Booster partition.

For this reason, CUDA-aware compilers, such as GNU, NVIDIA nvhpc, and CUDA compilers, are suitable and recommended for the LEONARDO Booster partition; they are described in the dedicated page.

Intel OneAPI Compilers

Initialize the environment with the module command:

$ module load intel-oneapi-compilers/<VERSION>


The suite contains the new Intel oneAPI nextgen compilers (icx, icpx, ifx) and the classic compilers (icc, icpc, ifort):


| | Classic | oneAPI |
|---|---|---|
| C/C++ compilers | icc/icpc | icx/icpx |
| Fortran compilers | ifort | ifx |

Notes on the C/C++ compilers:
  • ICX is the Intel nextgen compiler based on Clang/LLVM technology plus Intel proprietary optimizations and code generation.
  • ICX enables OpenMP TARGET offload to Intel GPU targets (irrelevant on LEONARDO DCGP).
  • ICX and ICC Classic use different compiler drivers. The ICC Classic drivers are icc, icpc, and icl. The ICX drivers are icx and icpx. Use icx to compile and link C programs, and icpx for C++ programs.
  • Unlike the icc driver, icx does not use the file extension to determine whether to compile as C or C++. Users must invoke icpx to compile C++ files. In addition to providing a core C++ Compiler, ICX is the base compiler for the Intel oneAPI Data Parallel C++ Compiler and its new driver, dpcpp.
  • Intel still recommends ICC/ICPC for standard C/C++ applications.

Notes on the Fortran compilers:
  • The Intel Fortran Compiler (Beta) IFX is a new compiler based on the Intel Fortran Compiler Classic (ifort) frontend and runtime libraries, using LLVM backend technology. ifx is released as a Beta version for users interested in trying offloading to supported Intel GPUs using OpenMP TARGET directives, which ifort does not support (irrelevant on LEONARDO DCGP).
  • Intel recommends IFORT for standard Fortran applications.

Note

  • ICX is a new compiler. It has functional and behavioural differences compared to ICC, so you can expect that some porting will be needed for existing applications built with ICC. According to Intel, the transition from ICC Classic to ICX is smooth and effortless; nevertheless, you must port and tune any existing applications when moving from ICC Classic to ICX. Please refer to the official Intel Porting Guide for ICC Users to DPCPP or ICX.
  • IFX is a completely new compiler. According to Intel, although considerable effort is being made to make the transition from ifort to ifx as smooth and effortless as possible, customers can expect that some effort may be required to tune their applications. IFORT will remain Intel's recommended production compiler until ifx has performance and features superior to ifort. Please refer to the official Intel Porting Guide for ifort Users to ifx.
  • Please refer to the official Intel C++ Developer Guide and Reference and Fortran Developer Guide and Reference for an exhaustive list of compiler options.
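
Because icx does not infer the language from the file extension, a build script that uses the oneAPI drivers has to pick the driver itself. The sketch below shows one way to do it; the function name intel_driver, the source file names, and the -O2 -c options are illustrative assumptions. The commands are only printed, not executed.

```shell
# intel_driver FILE: print the oneAPI driver to use for FILE, chosen by extension.
intel_driver() {
    case "$1" in
        *.c)                    echo "icx"     ;;   # C sources
        *.cpp|*.cc|*.cxx|*.C)   echo "icpx"    ;;   # C++ sources
        *.f|*.f90|*.F|*.F90)    echo "ifx"     ;;   # Fortran sources
        *)                      echo "unknown" ;;
    esac
}

# Hypothetical source files; print the compile line for each one.
for src in solver.c mesh.cpp kernels.f90; do
    echo "$(intel_driver "$src") -O2 -c $src"
done
```

The same mapping works for the classic drivers by replacing icx, icpx, and ifx with icc, icpc, and ifort.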

After loading the module, the documentation can be obtained with the man command:

$ man ifort
$ man icc

Debugger and Profilers

If your code dies at runtime, there is a problem to track down. To solve it, you can either analyze the core file (core files are not available with PGI compilers) or run your code under a debugger.

Compiler flags

Whichever approach you choose, you need to enable compiler runtime checks by adding specific flags at compilation time. The flags described below are for the different Fortran compilers; if you are using a C or C++ compiler, check its documentation first, because the flags may differ.

The following flags are generally available for all compilers and are mandatory for an easier debugging session:

-O0    Lowest level of optimization
-g     Produce debugging information

Other flags are compiler specific and are described in the following.

PORTLAND Group (PGI) Compilers

The following flags are useful (in addition to "-O0 -g") for debugging your code:

-C                     Add array bounds checking
-Ktrap=ovf,divz,inv    Controls the behavior of the processor when exceptions occur: 
                       FP overflow, divide by zero, invalid operands

GNU Fortran compilers

The following flags are useful (in addition to "-O0 -g") for debugging your code:

-Wall             Enables warnings pertaining to usage that should be avoided
-fbounds-check    Enables run-time checks of array subscripts against the declared bounds
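
The flags listed in this section can be collected in a small helper so that a debug build uses the same set every time. This is a sketch only: the function name debug_flags is hypothetical, pgfortran is assumed to be the PGI Fortran driver, and the compile line is printed rather than executed.

```shell
# debug_flags COMPILER: print the debugging flags listed in this section.
debug_flags() {
    common="-O0 -g"
    case "$1" in
        pgfortran) echo "$common -C -Ktrap=ovf,divz,inv" ;;   # PGI: bounds checking + FP traps
        gfortran)  echo "$common -Wall -fbounds-check"   ;;   # GNU: warnings + bounds checking
        *)         echo "$common"                        ;;   # flags common to all compilers
    esac
}

FC=gfortran
echo "$FC $(debug_flags "$FC") -o myexec myprog.f90"
```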

Available debuggers

GNU: gdb (serial debugger)

GDB is the GNU Project debugger and allows you to see what is going on 'inside' your program while it executes -- or what the program was doing at the moment it crashed.

VALGRIND

Valgrind is a framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile your programs in detail. The Valgrind distribution currently includes six production-quality tools: a memory error detector, two thread error detectors, a cache and branch-prediction profiler, a call-graph generating cache profiler, and a heap profiler.

Valgrind is Open Source / Free Software, and is freely available under the GNU General Public License, version 2.

Profilers

In software engineering, profiling is the investigation of a program's behavior using information gathered as the program executes. The usual purpose of this analysis is to determine which sections of a program to optimize - to increase its overall speed, decrease its memory requirement or sometimes both.

A (code) profiler is a performance analysis tool that, most commonly, measures only the frequency and duration of function calls, but there are other specific types of profilers (e.g. memory profilers) in addition to more comprehensive profilers, capable of gathering extensive performance data.

gprof

The GNU profiler gprof is a useful tool for measuring the performance of a program. It records the number of calls to each function and the amount of time spent there, on a per-function basis. Functions which consume a large fraction of the run-time can be identified easily from the output of gprof. Efforts to speed up a program should concentrate first on those functions which dominate the total run-time.

gprof uses data collected by the -pg compiler flag to construct a text display of the functions within your application (call tree and CPU time spent in every subroutine). It also provides quick access to the profiled data, which lets you identify the functions that are the most CPU-intensive. The text display also lets you manipulate the display in order to focus on the application's critical areas.

Usage:

$ gfortran -pg -O3 -o myexec myprog.f90
$ ./myexec
$ ls -ltr
   .......
   -rw-r--r-- 1 aer0 cineca-staff    506 Apr  6 15:33 gmon.out
$ gprof myexec gmon.out


It is also possible to profile at source-line level (see "man gprof" for other options). In this case, you must also use the "-g" flag at compilation time:

$ gfortran -pg -g -O3 -o myexec myprog.f90
$ ./myexec
$ ls -ltr
   .......
   -rw-r--r-- 1 aer0 cineca-staff    506 Apr  6 15:33 gmon.out
$ gprof --annotated-source myexec gmon.out

MPI programs can also be profiled. In this case, the environment variable GMON_OUT_PREFIX must be defined so that each task writes its own statistics file.

$ export GMON_OUT_PREFIX=<name>

Once the run is finished, each task will have created a file whose extension is its process ID (PID):

<name>.$PID

If the environment variable is not set, every task will write to the same gmon.out file.
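
With one statistics file per task, gprof has to be run once per file. The loop below is a sketch of how this could be scripted. The prefix gmon_mpi, the executable name myexec, the report file names, and the helper list_profiles are all assumptions, and the gprof commands are only printed, not executed.

```shell
# Hypothetical prefix; in a real job, export it before launching the MPI program.
export GMON_OUT_PREFIX=gmon_mpi

# list_profiles PREFIX: print one gprof command per per-task statistics file.
list_profiles() {
    for f in "$1".*; do
        [ -e "$f" ] || continue                  # no match: the glob stays literal
        # ${f##*.} is the part after the last dot, i.e. the PID of the task.
        echo "gprof myexec $f > report_${f##*.}.txt"
    done
}

# Prints nothing until files named gmon_mpi.<PID> exist in the current directory.
list_profiles "$GMON_OUT_PREFIX"
```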

MPI environment

The MPI implementation of Intel, i.e. Intel-OneAPI-MPI, is recommended on the LEONARDO Data Centric partition, and it doesn't support CUDA. Here you can find some useful details on how to use it on this partition.

See the page dedicated to LEONARDO Booster partition for a description of OpenMPI, which instead is installed for supporting CUDA.

Compiling

Intel-OneAPI-MPI

To compile MPI applications with Intel MPI, load the intel-oneapi-mpi module (use the "modmap -m intel-oneapi-mpi" command to see the available versions).

The intel-oneapi-mpi module provides the following wrappers for classic Intel compilers and OneAPI ("x") compilers.

| Compiler | Wrapper | Usage |
|---|---|---|
| icpc | mpiicpc | Compile C++ source files with classic Intel |
| icpx | mpiicpc -cxx=icpx | Compile C++ source files with oneAPI |
| icc | mpiicc | Compile C source files with classic Intel |
| icx | mpiicc -cc=icx | Compile C source files with oneAPI |
| ifort | mpiifort (Fortran90/77) | Compile Fortran source files with classic Intel |
| ifx | mpiifort -fc=ifx | Compile Fortran source files with oneAPI |

e.g. Compiling Fortran code:

$ module load intel-oneapi-compilers/<VERSION>
$ module load intel-oneapi-mpi/<VERSION>
$ mpiifort -o myexec myprog.f90    # uses the ifort compiler


You can add any option accepted by the backend compiler (the underlying compiler command can be shown with the "-show" flag, e.g. "mpiifort -show"). To list the wrapper options, use the "man" command:

$ man mpiifort

Running

To run MPI applications there are two ways:

  • using mpirun launcher
  • using srun launcher 
mpirun launcher 

To use mpirun launcher on LEONARDO Data Centric partition, the intel-oneapi-mpi module needs to be loaded:

$ module load intel-oneapi-mpi/<VERSION>

After loading the module, MPI applications can be directly  launched as:

$ mpirun ./mpi_exec

or via salloc:

$ salloc -N 2    # allocate a job of 2 nodes
$ mpirun ./mpi_exec

or via sbatch:

$ sbatch -N 2 my_batch_script.sh    # allocate a job of 2 nodes
$ cat my_batch_script.sh
#!/bin/sh
mpirun ./mpi_exec

srun launcher 

MPI applications can also be launched directly with the SLURM launcher srun:

$ srun -N 2  ./mpi_exec

or via salloc:

$ salloc -N 2    # allocate a job of 2 nodes
$ srun ./mpi_exec

or via sbatch:

$ sbatch -N 2 my_batch_script.sh    # allocate a job of 2 nodes
$ cat my_batch_script.sh
#!/bin/sh
srun -N 2 ./mpi_exec

Scientific libraries

Libraries listed in this section do not support CUDA (see LEONARDO Booster section for GPU-accelerated libraries).

Linear Algebra

  • BLAS: openblas,  intel-oneapi-mkl
  • LAPACK: openblas, intel-oneapi-mkl
  • SCALAPACK:  netlib-scalapack, intel-oneapi-mkl
  • SPARSE MATRICES: PETSc (multi-node), SuperLU_DIST (multi-node)

PETSc and SuperLU_DIST are GPU-accelerated libraries and are also listed in the LEONARDO Booster dedicated page. They are reported here because they are also frequently used in non-accelerated applications.

Fast Fourier Transform

  • FFTW (single and multi-node)

Hardware locality

Each compute node in the DCGP partition is equipped with:

  • 2 sockets, each containing one multi-core processor.
  • 112 cores in total (56 cores per socket).
  • 503 GiB of available RAM, divided into 8 NUMA nodes (4 per socket).

The multi-core processors are Intel Xeon Platinum 8480+ (3.80 GHz, Turbo enabled), featuring:

  • 56 cores per processor, each with 2 MiB of L2 cache and 80 KiB of L1 cache.
  • 105 MiB of L3 cache, shared across all cores.
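
From the figures above, each NUMA node groups 112 / 8 = 14 cores. The arithmetic below is a small sketch of how a core ID maps to its NUMA node. It assumes that cores are numbered contiguously within each NUMA node, as in the numactl output reported on this page; the function name numa_of_core is hypothetical.

```shell
CORES_PER_NODE=112
NUMA_NODES=8
CORES_PER_NUMA=$((CORES_PER_NODE / NUMA_NODES))   # 14 cores per NUMA node

# numa_of_core CORE_ID: NUMA node that owns the core, assuming contiguous numbering.
numa_of_core() {
    echo $(( $1 / CORES_PER_NUMA ))
}

echo "cores per NUMA node: $CORES_PER_NUMA"
echo "core 57 -> NUMA node $(numa_of_core 57)"
echo "core 111 -> NUMA node $(numa_of_core 111)"
```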


A detailed description of the node topology can be obtained by running the lstopo-no-graphics command as follows:

[<username>@lrdnXXXX ~]$ lstopo-no-graphics
Machine (503GB total)
  Package L#0 + L3 L#0 (105MB)
    Group0 L#0
      NUMANode L#0 (P#0 62GB)
      L2 L#0 (2048KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (2048KB) + L1d L#1 (48KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (2048KB) + L1d L#2 (48KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (2048KB) + L1d L#3 (48KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (2048KB) + L1d L#4 (48KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (2048KB) + L1d L#5 (48KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
      L2 L#6 (2048KB) + L1d L#6 (48KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (2048KB) + L1d L#7 (48KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
      L2 L#8 (2048KB) + L1d L#8 (48KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (2048KB) + L1d L#9 (48KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (2048KB) + L1d L#10 (48KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (2048KB) + L1d L#11 (48KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
      L2 L#12 (2048KB) + L1d L#12 (48KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (2048KB) + L1d L#13 (48KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
      HostBridge
        PCIBridge
          PCI 01:00.0 (Ethernet)
            Net "enp1s0"
        PCIBridge
          PCIBridge
            PCI 03:00.0 (VGA)
        PCIBridge
          PCI 04:00.0 (NVMExp)
            Block(Disk) "nvme0n1"
        PCI 00:17.0 (SATA)
      HostBridge
        PCI 6b:00.0 (Co-Processor)
      HostBridge
        PCI 6d:00.0 (Co-Processor)
    Group0 L#1
      NUMANode L#1 (P#1 63GB)
      L2 L#14 (2048KB) + L1d L#14 (48KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (2048KB) + L1d L#15 (48KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
      L2 L#16 (2048KB) + L1d L#16 (48KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (2048KB) + L1d L#17 (48KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
      L2 L#18 (2048KB) + L1d L#18 (48KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (2048KB) + L1d L#19 (48KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
      L2 L#20 (2048KB) + L1d L#20 (48KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (2048KB) + L1d L#21 (48KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (2048KB) + L1d L#22 (48KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (2048KB) + L1d L#23 (48KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
      L2 L#24 (2048KB) + L1d L#24 (48KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#24)
      L2 L#25 (2048KB) + L1d L#25 (48KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#25)
      L2 L#26 (2048KB) + L1d L#26 (48KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
      L2 L#27 (2048KB) + L1d L#27 (48KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
    Group0 L#2
      NUMANode L#2 (P#2 63GB)
      L2 L#28 (2048KB) + L1d L#28 (48KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#28)
      L2 L#29 (2048KB) + L1d L#29 (48KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#29)
      L2 L#30 (2048KB) + L1d L#30 (48KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#30)
      L2 L#31 (2048KB) + L1d L#31 (48KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#31)
      L2 L#32 (2048KB) + L1d L#32 (48KB) + L1i L#32 (32KB) + Core L#32 + PU L#32 (P#32)
      L2 L#33 (2048KB) + L1d L#33 (48KB) + L1i L#33 (32KB) + Core L#33 + PU L#33 (P#33)
      L2 L#34 (2048KB) + L1d L#34 (48KB) + L1i L#34 (32KB) + Core L#34 + PU L#34 (P#34)
      L2 L#35 (2048KB) + L1d L#35 (48KB) + L1i L#35 (32KB) + Core L#35 + PU L#35 (P#35)
      L2 L#36 (2048KB) + L1d L#36 (48KB) + L1i L#36 (32KB) + Core L#36 + PU L#36 (P#36)
      L2 L#37 (2048KB) + L1d L#37 (48KB) + L1i L#37 (32KB) + Core L#37 + PU L#37 (P#37)
      L2 L#38 (2048KB) + L1d L#38 (48KB) + L1i L#38 (32KB) + Core L#38 + PU L#38 (P#38)
      L2 L#39 (2048KB) + L1d L#39 (48KB) + L1i L#39 (32KB) + Core L#39 + PU L#39 (P#39)
      L2 L#40 (2048KB) + L1d L#40 (48KB) + L1i L#40 (32KB) + Core L#40 + PU L#40 (P#40)
      L2 L#41 (2048KB) + L1d L#41 (48KB) + L1i L#41 (32KB) + Core L#41 + PU L#41 (P#41)
      HostBridge
        PCIBridge
          PCI 38:00.0 (InfiniBand)
            Net "ib0"
            OpenFabrics "mlx5_0"
          8 x { PCI 38:00.1-01.0 (InfiniBand) }
    Group0 L#3
      NUMANode L#3 (P#3 63GB)
      L2 L#42 (2048KB) + L1d L#42 (48KB) + L1i L#42 (32KB) + Core L#42 + PU L#42 (P#42)
      L2 L#43 (2048KB) + L1d L#43 (48KB) + L1i L#43 (32KB) + Core L#43 + PU L#43 (P#43)
      L2 L#44 (2048KB) + L1d L#44 (48KB) + L1i L#44 (32KB) + Core L#44 + PU L#44 (P#44)
      L2 L#45 (2048KB) + L1d L#45 (48KB) + L1i L#45 (32KB) + Core L#45 + PU L#45 (P#45)
      L2 L#46 (2048KB) + L1d L#46 (48KB) + L1i L#46 (32KB) + Core L#46 + PU L#46 (P#46)
      L2 L#47 (2048KB) + L1d L#47 (48KB) + L1i L#47 (32KB) + Core L#47 + PU L#47 (P#47)
      L2 L#48 (2048KB) + L1d L#48 (48KB) + L1i L#48 (32KB) + Core L#48 + PU L#48 (P#48)
      L2 L#49 (2048KB) + L1d L#49 (48KB) + L1i L#49 (32KB) + Core L#49 + PU L#49 (P#49)
      L2 L#50 (2048KB) + L1d L#50 (48KB) + L1i L#50 (32KB) + Core L#50 + PU L#50 (P#50)
      L2 L#51 (2048KB) + L1d L#51 (48KB) + L1i L#51 (32KB) + Core L#51 + PU L#51 (P#51)
      L2 L#52 (2048KB) + L1d L#52 (48KB) + L1i L#52 (32KB) + Core L#52 + PU L#52 (P#52)
      L2 L#53 (2048KB) + L1d L#53 (48KB) + L1i L#53 (32KB) + Core L#53 + PU L#53 (P#53)
      L2 L#54 (2048KB) + L1d L#54 (48KB) + L1i L#54 (32KB) + Core L#54 + PU L#54 (P#54)
      L2 L#55 (2048KB) + L1d L#55 (48KB) + L1i L#55 (32KB) + Core L#55 + PU L#55 (P#55)
  Package L#1 + L3 L#1 (105MB)
    Group0 L#4
      NUMANode L#4 (P#4 63GB)
      L2 L#56 (2048KB) + L1d L#56 (48KB) + L1i L#56 (32KB) + Core L#56 + PU L#56 (P#56)
      L2 L#57 (2048KB) + L1d L#57 (48KB) + L1i L#57 (32KB) + Core L#57 + PU L#57 (P#57)
      L2 L#58 (2048KB) + L1d L#58 (48KB) + L1i L#58 (32KB) + Core L#58 + PU L#58 (P#58)
      L2 L#59 (2048KB) + L1d L#59 (48KB) + L1i L#59 (32KB) + Core L#59 + PU L#59 (P#59)
      L2 L#60 (2048KB) + L1d L#60 (48KB) + L1i L#60 (32KB) + Core L#60 + PU L#60 (P#60)
      L2 L#61 (2048KB) + L1d L#61 (48KB) + L1i L#61 (32KB) + Core L#61 + PU L#61 (P#61)
      L2 L#62 (2048KB) + L1d L#62 (48KB) + L1i L#62 (32KB) + Core L#62 + PU L#62 (P#62)
      L2 L#63 (2048KB) + L1d L#63 (48KB) + L1i L#63 (32KB) + Core L#63 + PU L#63 (P#63)
      L2 L#64 (2048KB) + L1d L#64 (48KB) + L1i L#64 (32KB) + Core L#64 + PU L#64 (P#64)
      L2 L#65 (2048KB) + L1d L#65 (48KB) + L1i L#65 (32KB) + Core L#65 + PU L#65 (P#65)
      L2 L#66 (2048KB) + L1d L#66 (48KB) + L1i L#66 (32KB) + Core L#66 + PU L#66 (P#66)
      L2 L#67 (2048KB) + L1d L#67 (48KB) + L1i L#67 (32KB) + Core L#67 + PU L#67 (P#67)
      L2 L#68 (2048KB) + L1d L#68 (48KB) + L1i L#68 (32KB) + Core L#68 + PU L#68 (P#68)
      L2 L#69 (2048KB) + L1d L#69 (48KB) + L1i L#69 (32KB) + Core L#69 + PU L#69 (P#69)
      HostBridge
        PCI e8:00.0 (Co-Processor)
      HostBridge
        PCI ea:00.0 (Co-Processor)
    Group0 L#5
      NUMANode L#5 (P#5 63GB)
      L2 L#70 (2048KB) + L1d L#70 (48KB) + L1i L#70 (32KB) + Core L#70 + PU L#70 (P#70)
      L2 L#71 (2048KB) + L1d L#71 (48KB) + L1i L#71 (32KB) + Core L#71 + PU L#71 (P#71)
      L2 L#72 (2048KB) + L1d L#72 (48KB) + L1i L#72 (32KB) + Core L#72 + PU L#72 (P#72)
      L2 L#73 (2048KB) + L1d L#73 (48KB) + L1i L#73 (32KB) + Core L#73 + PU L#73 (P#73)
      L2 L#74 (2048KB) + L1d L#74 (48KB) + L1i L#74 (32KB) + Core L#74 + PU L#74 (P#74)
      L2 L#75 (2048KB) + L1d L#75 (48KB) + L1i L#75 (32KB) + Core L#75 + PU L#75 (P#75)
      L2 L#76 (2048KB) + L1d L#76 (48KB) + L1i L#76 (32KB) + Core L#76 + PU L#76 (P#76)
      L2 L#77 (2048KB) + L1d L#77 (48KB) + L1i L#77 (32KB) + Core L#77 + PU L#77 (P#77)
      L2 L#78 (2048KB) + L1d L#78 (48KB) + L1i L#78 (32KB) + Core L#78 + PU L#78 (P#78)
      L2 L#79 (2048KB) + L1d L#79 (48KB) + L1i L#79 (32KB) + Core L#79 + PU L#79 (P#79)
      L2 L#80 (2048KB) + L1d L#80 (48KB) + L1i L#80 (32KB) + Core L#80 + PU L#80 (P#80)
      L2 L#81 (2048KB) + L1d L#81 (48KB) + L1i L#81 (32KB) + Core L#81 + PU L#81 (P#81)
      L2 L#82 (2048KB) + L1d L#82 (48KB) + L1i L#82 (32KB) + Core L#82 + PU L#82 (P#82)
      L2 L#83 (2048KB) + L1d L#83 (48KB) + L1i L#83 (32KB) + Core L#83 + PU L#83 (P#83)
    Group0 L#6
      NUMANode L#6 (P#6 63GB)
      L2 L#84 (2048KB) + L1d L#84 (48KB) + L1i L#84 (32KB) + Core L#84 + PU L#84 (P#84)
      L2 L#85 (2048KB) + L1d L#85 (48KB) + L1i L#85 (32KB) + Core L#85 + PU L#85 (P#85)
      L2 L#86 (2048KB) + L1d L#86 (48KB) + L1i L#86 (32KB) + Core L#86 + PU L#86 (P#86)
      L2 L#87 (2048KB) + L1d L#87 (48KB) + L1i L#87 (32KB) + Core L#87 + PU L#87 (P#87)
      L2 L#88 (2048KB) + L1d L#88 (48KB) + L1i L#88 (32KB) + Core L#88 + PU L#88 (P#88)
      L2 L#89 (2048KB) + L1d L#89 (48KB) + L1i L#89 (32KB) + Core L#89 + PU L#89 (P#89)
      L2 L#90 (2048KB) + L1d L#90 (48KB) + L1i L#90 (32KB) + Core L#90 + PU L#90 (P#90)
      L2 L#91 (2048KB) + L1d L#91 (48KB) + L1i L#91 (32KB) + Core L#91 + PU L#91 (P#91)
      L2 L#92 (2048KB) + L1d L#92 (48KB) + L1i L#92 (32KB) + Core L#92 + PU L#92 (P#92)
      L2 L#93 (2048KB) + L1d L#93 (48KB) + L1i L#93 (32KB) + Core L#93 + PU L#93 (P#93)
      L2 L#94 (2048KB) + L1d L#94 (48KB) + L1i L#94 (32KB) + Core L#94 + PU L#94 (P#94)
      L2 L#95 (2048KB) + L1d L#95 (48KB) + L1i L#95 (32KB) + Core L#95 + PU L#95 (P#95)
      L2 L#96 (2048KB) + L1d L#96 (48KB) + L1i L#96 (32KB) + Core L#96 + PU L#96 (P#96)
      L2 L#97 (2048KB) + L1d L#97 (48KB) + L1i L#97 (32KB) + Core L#97 + PU L#97 (P#97)
    Group0 L#7
      NUMANode L#7 (P#7 63GB)
      L2 L#98 (2048KB) + L1d L#98 (48KB) + L1i L#98 (32KB) + Core L#98 + PU L#98 (P#98)
      L2 L#99 (2048KB) + L1d L#99 (48KB) + L1i L#99 (32KB) + Core L#99 + PU L#99 (P#99)
      L2 L#100 (2048KB) + L1d L#100 (48KB) + L1i L#100 (32KB) + Core L#100 + PU L#100 (P#100)
      L2 L#101 (2048KB) + L1d L#101 (48KB) + L1i L#101 (32KB) + Core L#101 + PU L#101 (P#101)
      L2 L#102 (2048KB) + L1d L#102 (48KB) + L1i L#102 (32KB) + Core L#102 + PU L#102 (P#102)
      L2 L#103 (2048KB) + L1d L#103 (48KB) + L1i L#103 (32KB) + Core L#103 + PU L#103 (P#103)
      L2 L#104 (2048KB) + L1d L#104 (48KB) + L1i L#104 (32KB) + Core L#104 + PU L#104 (P#104)
      L2 L#105 (2048KB) + L1d L#105 (48KB) + L1i L#105 (32KB) + Core L#105 + PU L#105 (P#105)
      L2 L#106 (2048KB) + L1d L#106 (48KB) + L1i L#106 (32KB) + Core L#106 + PU L#106 (P#106)
      L2 L#107 (2048KB) + L1d L#107 (48KB) + L1i L#107 (32KB) + Core L#107 + PU L#107 (P#107)
      L2 L#108 (2048KB) + L1d L#108 (48KB) + L1i L#108 (32KB) + Core L#108 + PU L#108 (P#108)
      L2 L#109 (2048KB) + L1d L#109 (48KB) + L1i L#109 (32KB) + Core L#109 + PU L#109 (P#109)
      L2 L#110 (2048KB) + L1d L#110 (48KB) + L1i L#110 (32KB) + Core L#110 + PU L#110 (P#110)
      L2 L#111 (2048KB) + L1d L#111 (48KB) + L1i L#111 (32KB) + Core L#111 + PU L#111 (P#111)

For a compact representation of the available NUMA nodes on the system, you can use the numactl command:

[<username>@lrdnXXXX ~]$ numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13
node 0 size: 63569 MB
node 0 free: 1417 MB
node 1 cpus: 14 15 16 17 18 19 20 21 22 23 24 25 26 27
node 1 size: 64467 MB
node 1 free: 39104 MB
node 2 cpus: 28 29 30 31 32 33 34 35 36 37 38 39 40 41
node 2 size: 64508 MB
node 2 free: 53559 MB
node 3 cpus: 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 3 size: 64508 MB
node 3 free: 54245 MB
node 4 cpus: 56 57 58 59 60 61 62 63 64 65 66 67 68 69
node 4 size: 64508 MB
node 4 free: 54351 MB
node 5 cpus: 70 71 72 73 74 75 76 77 78 79 80 81 82 83
node 5 size: 64508 MB
node 5 free: 55113 MB
node 6 cpus: 84 85 86 87 88 89 90 91 92 93 94 95 96 97
node 6 size: 64508 MB
node 6 free: 54544 MB
node 7 cpus: 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 7 size: 64505 MB
node 7 free: 55128 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  12  12  12  21  21  21  21 
  1:  12  10  12  12  21  21  21  21 
  2:  12  12  10  12  21  21  21  21 
  3:  12  12  12  10  21  21  21  21 
  4:  21  21  21  21  10  12  12  12 
  5:  21  21  21  21  12  10  12  12 
  6:  21  21  21  21  12  12  10  12 
  7:  21  21  21  21  12  12  12  10 

Among other information, numactl reports the memory latency distance matrix between the available NUMA nodes. For example, the distance between NUMA node 0 and node 4 is 21 (i.e. 2.1x, relative to the local distance of 10), indicating that if node 0 accesses memory on node 4 (or vice versa), the access latency will be about 2.1 times higher than for local memory.
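
The relative latency between two NUMA nodes can be read off the matrix with a short awk filter. The sketch below is fed with the distance matrix reported on this page through a here-document; the function name relative_latency is hypothetical.

```shell
# relative_latency FROM TO: read "numactl -H" output on stdin and print
# distance(FROM,TO) divided by the local distance(FROM,FROM).
relative_latency() {
    awk -v from="$1" -v to="$2" '
        /^node distances:/ { in_matrix = 1; next }
        in_matrix && $1 == (from ":") {
            # Field 1 is the row label "N:", field 2+k is the distance to node k.
            printf "%.1f\n", $(2 + to) / $(2 + from)
        }'
}

relative_latency 0 4 <<'EOF'
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  12  12  12  21  21  21  21
  1:  12  10  12  12  21  21  21  21
  2:  12  12  10  12  21  21  21  21
  3:  12  12  12  10  21  21  21  21
  4:  21  21  21  21  10  12  12  12
  5:  21  21  21  21  12  10  12  12
  6:  21  21  21  21  12  12  10  12
  7:  21  21  21  21  12  12  12  10
EOF
```

Inside a job on a compute node, the live output can be piped in directly, e.g. numactl -H | relative_latency 0 4.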

Note: the memory latency distance matrix can also be obtained with lstopo-no-graphics using the -v or --verbose flag.

Remember that the hardware of the login nodes differs from that of the compute nodes. Therefore, to reproduce the output shown above, you have to run the lstopo-no-graphics and numactl commands within a SLURM job.
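
One way to collect the topology of a compute node is a short batch job such as the sketch below. The file name topology_job.sh, the job name, and the 5-minute time limit are assumptions. The snippet only writes the script to disk; the submission line is left commented out.

```shell
# Write a hypothetical one-node job script that prints the compute-node topology.
cat > topology_job.sh <<'EOF'
#!/bin/bash
#SBATCH -A <account_name>
#SBATCH -p dcgp_usr_prod
#SBATCH --time 00:05:00          # format: HH:MM:SS
#SBATCH -N 1                     # 1 node
#SBATCH --ntasks-per-node=1      # a single task is enough
#SBATCH --job-name=topology

srun lstopo-no-graphics
srun numactl -H
EOF

# Submit from a login node after replacing <account_name>:
# sbatch topology_job.sh
```

Unless an output file is specified, SLURM writes the output of both commands to slurm-<jobid>.out in the submission directory.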
