Link to the new User Guide https://docs.hpc.cineca.it/index.html


Production Environment

Since LEONARDO is a general purpose system and is used by several users at the same time, long production jobs must be submitted using a queuing system (scheduler). The scheduler guarantees that the access to the resources is as fair as possible. The production environment on LEONARDO Booster partition is based on the SLURM scheduler.

LEONARDO adopts a node-sharing policy: a job can request only part of a node, for example a few cores and 1 GPU. This means that, at a given time, one physical node can be allocated to multiple jobs of different users. Nevertheless, exclusivity at the level of the single core is guaranteed by low-level mechanisms.

There are two main modes of using compute nodes:

  • Batch Mode: This mode is intended for production runs. Users must prepare a shell script with all the operations to be executed once the requested resources are available. The job will then run on the compute nodes. Store all your data, programs, and scripts in the $WORK or $SCRATCH filesystems, as these are best for compute node access. You must have valid active projects to run batch jobs, and be aware of any specific policies regarding project budgets on our systems.

The script file must contain both directives to SLURM and commands to be executed, as described in more detail in the section Batch Scheduler SLURM.

Using SLURM directives you indicate the account_name (-A: which project pays for this work), where to run the job (-p: partition), what is the maximum duration of the run (--time: time limit). Moreover you indicate the resources needed, in terms of cores, GPUs and memory.

Please note that the recommended way to launch parallel MPI applications in SLURM jobs is with srun. By using srun instead of mpirun you will get full support for process tracking, accounting, task affinity, suspend/resume and other features.

 

sbatch script example
#!/bin/bash
#SBATCH -A <account_name>
#SBATCH -p boost_usr_prod
#SBATCH --time 00:10:00     # format: HH:MM:SS
#SBATCH -N 1                # 1 node
#SBATCH --ntasks-per-node=4 # 4 tasks out of 32
#SBATCH --gres=gpu:4        # 4 gpus per node out of 4
#SBATCH --mem=123000        # memory per node out of 494000MB (481GB)
#SBATCH --job-name=my_batch_job

srun ./myexecutable 


To submit the sbatch script:

$ sbatch script.x

For further details, please refer to the general online guide to SLURM and to the section on task/thread bindings. For hybrid applications launched with "srun", pay particular attention to the setting of the SRUN_CPUS_PER_TASK environment variable.

  • Interactive Mode: Jobs submitted in this mode are similar to batch jobs in that the user must specify the resources to allocate, and the job is then managed like any other submitted job. The key difference from batch mode is that, once the job is running, the user can interactively execute applications within the limits of the allocated resources. All allocated resources remain reserved for the job (and are consequently billed) for its whole duration, up to the requested walltime.

Note:  interactive Mode under SLURM has a different meaning compared to the common understanding of interactive execution of an application under a Linux shell or prompt. Interactive execution of applications is allowed on compute nodes only via SLURM (see the next sections).
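As an illustration of the interactive mode described above, the following is a minimal sketch that uses standard SLURM options only; the resource values are arbitrary examples, and the exact procedure recommended on LEONARDO may differ from this one.

```shell
# Sketch: open an interactive shell on a Booster compute node.
# Illustrative request: 1 node, 4 tasks, 1 GPU, 30 minutes.
# Replace <account_name> with one of your active projects.
srun -A "<account_name>" -p boost_usr_prod --time=00:30:00 \
     -N 1 --ntasks-per-node=4 --gres=gpu:1 --pty /bin/bash
```

Once the prompt appears you are on the compute node; leaving the shell with "exit" ends the job and releases the resources.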

On login nodes, it is permitted to perform tasks such as data movement, archiving, code development, compilation, basic debugging, and very short test runs, provided they do not exceed 10 minutes of CPU time. These tasks are free of charge under the current billing policy.

For a general discussion see the section User Environment Customization.

Job Managing and SLURM Scheduler

A list of partitions defined on the cluster, with access rights and resources definition, can be displayed with the command sinfo:

$ sinfo -o "%10D %20F %P"

With this format string, the command returns a more readable output which shows, for each partition, the total number of nodes and the number of nodes by state in the format "Allocated/Idle/Other/Total".
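If you need the node counts in a script, the "Allocated/Idle/Other/Total" field can be split with plain shell parameter expansion. The following is a minimal sketch: the helper name idle_nodes is hypothetical and the sample value is made up, not real LEONARDO output.

```shell
#!/bin/sh
# Hypothetical helper: print the idle-node count from a field
# formatted as Allocated/Idle/Other/Total.
idle_nodes() {
    rest=${1#*/}          # drop the leading "Allocated/" part
    echo "${rest%%/*}"    # keep what precedes the next "/"
}

# Made-up sample value, for illustration only.
idle_nodes "3000/120/336/3456"

# On the cluster the helper could be fed by sinfo (-h suppresses the header):
#   sinfo -h -p boost_usr_prod -o "%F" | while read -r f; do idle_nodes "$f"; done
```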


In the following table you can find the main features and limits imposed on the partitions of LEONARDO Booster.

| SLURM partition | Job QOS | # cores / # GPUs per job | max walltime | max running jobs per user / max n. of nodes/cores/GPUs per user | priority | notes |
|---|---|---|---|---|---|---|
| lrd_all_serial (default) | normal | max = 4 physical cores (8 logical cpus), max mem = 30800 MB | 04:00:00 | 1 node / 4 cores / 30800 MB | 40 | No GPUs; Hyperthreading x2 |
| boost_usr_prod | normal | max = 64 nodes | 24:00:00 | | 40 | |
| boost_usr_prod | boost_qos_dbg | max = 2 nodes | 00:30:00 | 2 nodes / 64 cores / 8 GPUs | 80 | |
| boost_usr_prod | boost_qos_bprod | min = 65 nodes, max = 256 nodes | 24:00:00 | 256 nodes | 60 | runs on 1536 nodes; min is 65 FULL nodes |
| boost_usr_prod | boost_qos_lprod | max = 3 nodes | 4-00:00:00 | 3 nodes / 12 GPUs | 40 | |

For EUROFusion users and their dedicated queues please refer to the dedicated document.
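A QOS from the table above is normally selected with the standard SLURM --qos directive, next to the partition. The following is a minimal sketch of a debug job; the resource values are illustrative and must stay within the limits listed for boost_qos_dbg.

```shell
#!/bin/bash
#SBATCH -A <account_name>
#SBATCH -p boost_usr_prod
#SBATCH --qos=boost_qos_dbg   # debug QOS: at most 2 nodes and 00:30:00 of walltime
#SBATCH --time 00:30:00       # format: HH:MM:SS
#SBATCH -N 2                  # 2 nodes
#SBATCH --ntasks-per-node=4   # 4 tasks per node
#SBATCH --gres=gpu:4          # 4 gpus per node

srun ./myexecutable
```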

Programming environment

LEONARDO Booster compute nodes host four A100 GPUs per node (CUDA compute capability 8.0). The most recent versions of NVIDIA CUDA toolkit and of the NVIDIA nvhpc compilers (ex PGI, supporting CUDA Fortran) are available in the module environment.

Compilers

You can check the complete list of available compilers on LEONARDO with the command:

$ modmap -c compilers

The available CUDA-aware compilers are:

  • GNU Compilers Collection (GCC)
  • NVIDIA nvhpc (ex PGI)
  • CUDA

Intel compilers are also available, but do not support CUDA, thus they are described in the page dedicated to LEONARDO Data Centric partition.

GNU Compiler Collection (GCC)

The GNU compilers are always available. GCC version 8.5.0 is available without loading any module; more recent versions can be found in the module environment.

The names of the GNU compilers are:

  • g77: Fortran77 compiler
  • gfortran: Fortran95 compiler
  • gcc: C compiler
  • g++: C++ compiler

The documentation can be obtained with the "man" command after loading the GNU module:

$ man gfortran
$ man gcc

NVIDIA nvhpc (ex PORTLAND PGI + NVIDIA CUDA)

As of August 5, 2020, the "PGI Compilers and Tools" technology is a part of the NVIDIA HPC SDK product, available as a free download from NVIDIA.

| Invocation | Usage |
|---|---|
| nvc | Compile C source files (C11 compiler. It supports GPU programming with OpenACC, and supports multicore CPU programming with OpenACC and OpenMP) |
| nvc++ | Compile C++ source files (C++17 compiler. It supports GPU programming with C++17 parallel algorithms (pSTL) and OpenACC, and supports multicore CPU programming with OpenACC and OpenMP) |
| nvfortran | Compile FORTRAN source files (supports ISO Fortran 2003 and many features of ISO Fortran 2008. It supports GPU programming with CUDA Fortran and OpenACC, and supports multicore CPU programming with OpenACC and OpenMP) |
| nvcc | CUDA C and CUDA C++ compiler driver for NVIDIA GPUs |

For legacy reasons, the NVIDIA nvhpc suite also offers the PGI C, C++, and Fortran compilers with their original names, as follows.

| Invocation | Usage |
|---|---|
| pgcc | Compile C source files |
| pgc++ | Compile C++ source files |
| pgf77 | Compile FORTRAN 77 source files |
| pgf90 | Compile FORTRAN 90 source files |
| pgf95 | Compile FORTRAN 95 source files |

To enable CUDA C++ or CUDA Fortran, and link with the CUDA runtime libraries, use the -cuda option (-Mcuda is deprecated). Use the -gpu option to tailor the compilation of target accelerator regions.

The OpenACC parallelization is enabled by the -acc flag. GPU targeting and code generation can be controlled by adding the -gpu flag to the compiler command line.

The OpenMP parallelization is enabled by the -mp compiler option. The GPU offload via OpenMP is enabled by the -mp=gpu option.
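The options above can be combined as in the following sketch. The source and executable names are hypothetical; -gpu=cc80 targets compute capability 8.0 (the A100 GPUs of the Booster nodes) and -Minfo=accel asks the compiler to report what it did with the accelerator regions.

```shell
# Replace <version> with one of the versions listed by: modmap -m nvhpc
module load "nvhpc/<version>"

# OpenACC offload to the GPU, with a compiler report on the accelerated regions
nvc -acc -gpu=cc80 -Minfo=accel -o myexec_acc myprog.c

# OpenMP offload to the GPU
nvc++ -mp=gpu -gpu=cc80 -o myexec_omp myprog.cpp

# CUDA Fortran, linked with the CUDA runtime libraries
nvfortran -cuda -gpu=cc80 -o myexec_cuf myprog.cuf
```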

CUDA

Compute Unified Device Architecture (CUDA) is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units (GPUs). With CUDA, developers are able to dramatically speed up computing applications by harnessing the power of GPUs. 

In GPU-accelerated applications, the sequential part of the workload runs on the CPU – which is optimized for single-threaded performance – while the compute intensive portion of the application runs on thousands of GPU cores in parallel. When using CUDA, developers program in popular languages such as C, C++, Fortran, Python and MATLAB and express parallelism through extensions in the form of a few basic keywords. For details, please refer to the NVIDIA CUDA Parallel Computing Platform documentation.

CUDA compilers are available inside the nvhpc module, as well as in a stand-alone module.

Debugger and Profilers

If your code crashes at runtime, you can either analyze the core file (core files are not available with PGI compilers) or run the code under a debugger.

Compiler flags

In either case, you need to enable the compiler runtime checks by adding specific flags at compile time. The flags described below refer to the Fortran compilers; if you are using a C or C++ compiler, please check its documentation first, because the flags may differ.

The following flags are generally available for all compilers and are needed for an effective debugging session:

-O0    Lowest level of optimization (optimizations disabled)
-g     Produce debugging information

Other flags are compiler specific and are described in the following.

PORTLAND Group (PGI) Compilers

The following flags are useful (in addition to "-O0 -g") for debugging your code:

-C                     Add array bounds checking
-Ktrap=ovf,divz,inv    Controls the behaviour of the processor when exceptions occur: 
                       FP overflow, divide by zero, invalid operands

GNU Fortran compilers

The following flags are useful (in addition to "-O0 -g") for debugging your code:

-Wall             Enables warnings pertaining to usage that should be avoided
-fbounds-check    Checks that array subscripts are within the declared bounds (deprecated alias of -fcheck=bounds)

Debuggers available

GNU: gdb (serial debugger)

GDB is the GNU Project debugger and allows you to see what is going on 'inside' your program while it executes -- or what the program was doing at the moment it crashed.

VALGRIND

Valgrind is a framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile your programs in detail. The Valgrind distribution currently includes six production-quality tools: a memory error detector, two thread error detectors, a cache and branch-prediction profiler, a call-graph generating cache profiler, and a heap profiler.

Valgrind is Open Source / Free Software, and is freely available under the GNU General Public License, version 2.
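As an illustration of how the debugging flags and the tools above fit together, here is a minimal sketch with hypothetical source and executable names. It builds a Fortran code with runtime checks, runs it under gdb in batch mode to get a backtrace, and then runs it under the Valgrind memory error detector.

```shell
# Build without optimization, with debug symbols and bounds checking
gfortran -O0 -g -Wall -fcheck=bounds -o myexec myprog.f90

# Run under gdb non-interactively: start the program and, when it stops
# (for instance on a crash), print the backtrace
gdb -batch -ex run -ex bt --args ./myexec

# Run under Valgrind's memory error detector with full leak details
valgrind --leak-check=full --track-origins=yes ./myexec
```

Both tools must run on a compute node within a SLURM job if the run exceeds what is allowed on the login nodes.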

Profilers

In software engineering, profiling is the investigation of a program's behavior using information gathered as the program executes. The usual purpose of this analysis is to determine which sections of a program to optimize - to increase its overall speed, decrease its memory requirement or sometimes both.

A (code) profiler is a performance analysis tool that, most commonly, measures only the frequency and duration of function calls, but there are other specific types of profilers (e.g. memory profilers) in addition to more comprehensive profilers, capable of gathering extensive performance data.

gprof

The GNU profiler gprof is a useful tool for measuring the performance of a program. It records the number of calls to each function and the amount of time spent there, on a per-function basis. Functions which consume a large fraction of the run-time can be identified easily from the output of gprof. Efforts to speed up a program should concentrate first on those functions which dominate the total run-time.

gprof uses the data collected when the program is compiled with the -pg flag to build a text display of the functions within your application (call tree and CPU time spent in every subroutine). It also provides quick access to the profiled data, which lets you identify the most CPU-intensive functions and focus on the critical areas of the application.

Usage:

$ gfortran -pg -O3 -o myexec myprog.f90
$ ./myexec
$ ls -ltr
   .......
   -rw-r--r-- 1 aer0 cineca-staff    506 Apr  6 15:33 gmon.out
$ gprof myexec gmon.out


It is also possible to profile at the level of single source lines (see "man gprof" for other options). In this case, you must also use the “-g” flag at compilation time:

$ gfortran -pg -g -O3 -o myexec myprog.f90
$ ./myexec
$ ls -ltr
   .......
   -rw-r--r-- 1 aer0 cineca-staff    506 Apr  6 15:33 gmon.out
$ gprof -annotated-source myexec gmon.out

It is possible to profile MPI programs. In this case, the environment variable GMON_OUT_PREFIX must be defined so that each task writes its own profile data file.

$ export GMON_OUT_PREFIX=<name>

Once the run is finished, each task will have created a file with its process ID (PID) as extension:

<name>.$PID

If the environment variable is not set, all tasks write to the same gmon.out file.
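The per-task files can then be analysed one by one or merged into a single profile. The following is a minimal sketch, assuming a hypothetical prefix gmon_rank and an executable built with -pg; the -s option of gprof sums the given profile files into a file called gmon.sum.

```shell
# Inside the job script: one profile data file per MPI task
export GMON_OUT_PREFIX=gmon_rank     # hypothetical prefix
srun ./myexec                        # each task writes gmon_rank.<PID>

# After the run: merge all per-task files into gmon.sum ...
gprof -s ./myexec gmon_rank.*

# ... and produce a single report for the whole run
gprof ./myexec gmon.sum > profile_all_tasks.txt
```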

NVIDIA Nsight Systems (GPU profiler)

NVIDIA Nsight Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms, help you identify the largest opportunities to optimize, and tune to scale efficiently across any quantity or size of CPUs and GPUs, from large servers to the smallest SoC.
You can find general information on how to use it in the dedicated NVIDIA User Guide pages.

Our suggestion is to run the CLI inside your job script in order to generate the qdrep files. You can then download the qdrep files to your local PC and visualize them with the Nsight Systems GUI installed on your workstation.

The profiler is available under the module nvhpc.

Standard usage of an MPI job running on GPU is:

$ mpirun <options> nsys profile -o ${PWD}/output_%q{OMPI_COMM_WORLD_RANK} -f true --stats=true --cuda-memory-usage=true <your_code> <input> <output>

On a single node you can also run the profiler as "nsys profile mpirun", but keep in mind that with this syntax nsys will put everything in a single report.

Unfortunately, nsys usually generates several files in the /tmp directory of the compute node even if the TMPDIR environment variable is set (valid for versions up to nvhpc/23.11). These files may be large enough to fill the /tmp folder, causing the compute node to crash and the job to fail.
To avoid this problem, we strongly suggest including the following lines around your mpirun call in your sbatch script, as a workaround:

$ export TMPDIR=/dev/shm           #or $SCRATCH depending on your needs
$ ln -s $TMPDIR /tmp/nvidia
$ mpirun ... nsys profile ...

This will place the temporary outputs of nsys in your TMPDIR folder, which by default is /dev/shm/slurm_job.$SLURM_JOB_ID, where you have about 250 GB of free space.
This workaround may cause conflicts between multiple jobs running the profiler on the same compute node at the same time, so we also strongly suggest requesting the compute node exclusively:

#SBATCH --exclusive

Important update: Since nvhpc/24.3, nsys profile writes its temporary output in the $TMPDIR area instead of /tmp, so the above workaround is no longer needed. If you need more than 10 GB for the temporary files of the profiler, it is sufficient to point TMPDIR to the /dev/shm area (max 252 GB of space, according to the memory requested in the job) or to the $SCRATCH area (no limits).

MPI environment

OpenMPI is the most common MPI implementation. It is installed within the GNU environment and is configured to support CUDA. Below you can find some useful details on how to use OpenMPI on the LEONARDO Booster partition.

Intel's MPI implementation (Intel-OneAPI-MPI) is also available but does not support CUDA; details are therefore given in the Data Centric partition section.

Compiling

OpenMPI

To compile MPI applications with OpenMPI, load the openmpi module (use the "modmap -m openmpi" command to see the available OpenMPI versions) and select the MPI compiler wrapper for Fortran, C or C++ codes.

The openmpi module provides the following wrappers:

| Compiler | Wrapper | Usage |
|---|---|---|
| g++ | mpic++, mpiCC, mpicxx | Compile C++ source files with GNU |
| gcc | mpicc | Compile C source files with GNU |
| gfortran | mpif77, mpif90, mpifort | Compile FORTRAN source files with GNU |

e.g. Compiling C code:

$ module load openmpi/<version>
$ mpicc -o myexec myprog.c    # uses the gcc compiler

You can add any option supported by the backend compiler (the full backend command line can be displayed with the "-show" flag, e.g. "mpicc -show"). To list the available options, use the "man" command:

$ man mpicc

Running

To run MPI applications there are two ways:

  • using mpirun launcher
  • using srun launcher

mpirun launcher

To use mpirun launcher on LEONARDO Booster partition, the openmpi module needs to be loaded:

$ module load openmpi/<VERSION>

After loading the module, MPI applications can be directly  launched as:

$ mpirun ./mpi_exec

or via salloc:

$ salloc -N 2                       # allocate a job of 2 nodes
$ mpirun ./mpi_exec

or via sbatch:

$ cat my_batch_script.sh
#!/bin/sh
mpirun ./mpi_exec
$ sbatch -N 2 my_batch_script.sh    # submit a job of 2 nodes

srun launcher 

MPI applications can also be launched directly with the SLURM launcher srun:

$ srun -N 2  ./mpi_exec

or via salloc:

$ salloc -N 2                       # allocate a job of 2 nodes
$ srun ./mpi_exec

or via sbatch:

$ cat my_batch_script.sh
#!/bin/sh
srun -N 2 ./mpi_exec
$ sbatch -N 2 my_batch_script.sh    # submit a job of 2 nodes
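For hybrid MPI+OpenMP applications launched with srun, the number of cores per task has to reach srun as well. The following is a minimal sketch, assuming an illustrative layout of 4 tasks per node with 8 cores each; in recent SLURM releases srun does not inherit --cpus-per-task from sbatch, which is why SRUN_CPUS_PER_TASK is exported explicitly.

```shell
#!/bin/bash
#SBATCH -A <account_name>
#SBATCH -p boost_usr_prod
#SBATCH --time 00:10:00        # format: HH:MM:SS
#SBATCH -N 2                   # 2 nodes
#SBATCH --ntasks-per-node=4    # 4 MPI tasks per node
#SBATCH --cpus-per-task=8      # 4 tasks x 8 cores = 32 cores per node
#SBATCH --gres=gpu:4           # 4 gpus per node

# Replace <version> with one listed by: modmap -m openmpi
module load "openmpi/<version>"

# Pass the per-task core count on to srun and to the OpenMP runtime
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./mpi_exec
```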

Scientific libraries

Libraries listed in this section are GPU-accelerated and support CUDA (see the LEONARDO Data Centric section for libraries that are not CUDA-aware).

The NVIDIA math libraries are available by loading the "nvhpc" module (use the "modmap -m nvhpc" command to see the available versions of nvhpc).

Non-NVIDIA math libraries installed with CUDA support are available by loading the corresponding module, e.g. "module load magma/<vers>". Notice that when you load the module of any of these libraries, the CUDA module is not automatically loaded.

Linear Algebra

  • BLAS: NVIDIA cuBLAS, MAGMA
  • LAPACK: NVIDIA cuSOLVER, MAGMA
  • SCALAPACK: SLATE
  • EIGENVALUE SOLVERS: NVIDIA cuSOLVER, MAGMA (single-node), SLATE, ELPA and SLEPc (multi-node)
  • SPARSE MATRICES: NVIDIA cuSPARSE, PETSc (multi-node), SuperLU_DIST (multi-node)
  • Hypre (multi-node)

Fast Fourier Transform

  • NVIDIA cuFFT/cuFFTW (single-node)

Hardware locality (CPU)

Each compute node in the Booster partition is equipped with one Intel Xeon Platinum 8358 Processor (3.40 GHz, Turbo enabled), featuring:

  • 32 cores, each with 1.25 MiB of L2 cache and 80 KiB of L1 cache.
  • 48 MiB of L3 cache, shared across all cores.
  • 503 GiB of available RAM, divided into 2 NUMA nodes.


A detailed description of the node topology can be obtained by running the lstopo-no-graphics command as follows:

[<username>@lrdnXXXX ~]$ lstopo-no-graphics 
Machine (503GB total) + Package L#0 + L3 L#0 (48MB)
  Group0 L#0
    NUMANode L#0 (P#0 251GB)
    L2 L#0 (1280KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (1280KB) + L1d L#1 (48KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
    L2 L#2 (1280KB) + L1d L#2 (48KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (1280KB) + L1d L#3 (48KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
    L2 L#4 (1280KB) + L1d L#4 (48KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
    L2 L#5 (1280KB) + L1d L#5 (48KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
    L2 L#6 (1280KB) + L1d L#6 (48KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
    L2 L#7 (1280KB) + L1d L#7 (48KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
    L2 L#8 (1280KB) + L1d L#8 (48KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
    L2 L#9 (1280KB) + L1d L#9 (48KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
    L2 L#10 (1280KB) + L1d L#10 (48KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
    L2 L#11 (1280KB) + L1d L#11 (48KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
    L2 L#12 (1280KB) + L1d L#12 (48KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
    L2 L#13 (1280KB) + L1d L#13 (48KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
    L2 L#14 (1280KB) + L1d L#14 (48KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
    L2 L#15 (1280KB) + L1d L#15 (48KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
    HostBridge
      PCI 00:17.0 (SATA)
      PCIBridge
        PCI 01:00.0 (Ethernet)
          Net "enp1s0f0"
        PCI 01:00.1 (Ethernet)
          Net "enp1s0f1"
      PCIBridge
        PCIBridge
          PCI 04:00.0 (VGA)
    HostBridge
      PCIBridge
        PCIBridge
          PCIBridge
            PCI 1a:00.0 (InfiniBand)
              Net "ib0"
              OpenFabrics "mlx5_0"
          PCIBridge
            PCIBridge
              PCIBridge
                PCI 1d:00.0 (3D)
    HostBridge
      PCIBridge
        PCIBridge
          PCIBridge
            PCI 53:00.0 (InfiniBand)
              Net "ib1"
              OpenFabrics "mlx5_1"
          PCIBridge
            PCIBridge
              PCIBridge
                PCI 56:00.0 (3D)
    HostBridge
      PCIBridge
        PCIBridge
          PCIBridge
            PCI 8c:00.0 (InfiniBand)
              Net "ib2"
              OpenFabrics "mlx5_2"
          PCIBridge
            PCIBridge
              PCIBridge
                PCI 8f:00.0 (3D)
    HostBridge
      PCIBridge
        PCIBridge
          PCIBridge
            PCI c5:00.0 (InfiniBand)
              Net "ib3"
              OpenFabrics "mlx5_3"
          PCIBridge
            PCIBridge
              PCIBridge
                PCI c8:00.0 (3D)
  Group0 L#1
    NUMANode L#1 (P#1 252GB)
    L2 L#16 (1280KB) + L1d L#16 (48KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
    L2 L#17 (1280KB) + L1d L#17 (48KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
    L2 L#18 (1280KB) + L1d L#18 (48KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
    L2 L#19 (1280KB) + L1d L#19 (48KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
    L2 L#20 (1280KB) + L1d L#20 (48KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
    L2 L#21 (1280KB) + L1d L#21 (48KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
    L2 L#22 (1280KB) + L1d L#22 (48KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
    L2 L#23 (1280KB) + L1d L#23 (48KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
    L2 L#24 (1280KB) + L1d L#24 (48KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#24)
    L2 L#25 (1280KB) + L1d L#25 (48KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#25)
    L2 L#26 (1280KB) + L1d L#26 (48KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
    L2 L#27 (1280KB) + L1d L#27 (48KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
    L2 L#28 (1280KB) + L1d L#28 (48KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#28)
    L2 L#29 (1280KB) + L1d L#29 (48KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#29)
    L2 L#30 (1280KB) + L1d L#30 (48KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#30)
    L2 L#31 (1280KB) + L1d L#31 (48KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#31)

For a compact representation of the available NUMA nodes on the system, you can use the numactl command:

[<username>@lrdnXXXX ~]$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 256926 MB
node 0 free: 237337 MB
node 1 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 1 size: 257999 MB
node 1 free: 242985 MB
node distances:
node   0   1 
  0:  10  11 
  1:  11  10 

Among other information, numactl reports the memory latency distance matrix between the available NUMA nodes. Specifically, the distance between NUMA Node 0 and Node 1 is 11, against a local distance of 10 (i.e. 1.1x), indicating that when a core on Node 0 accesses memory on Node 1 (or vice versa), the access latency is about 1.1 times that of local memory.
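The 1.1x factor is simply the remote distance divided by the local one. The following is a minimal sketch that computes it from the distance matrix; the helper name numa_ratio is hypothetical and the sample rows are copied from the output above.

```shell
#!/bin/sh
# Hypothetical helper: read "numactl -H"-style output on stdin and print
# the ratio between the remote (0 -> 1) and the local (0 -> 0) distance.
numa_ratio() {
    awk '$1 == "0:" { printf "%.1f\n", $3 / $2 }'
}

# Sample rows from the output above; on a compute node use: numactl -H | numa_ratio
numa_ratio <<'EOF'
node distances:
node   0   1 
  0:  10  11 
  1:  11  10 
EOF
```

With the sample rows the helper prints 1.1.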

Note: the memory latency distance matrix can also be obtained with lstopo-no-graphics using the -v or --verbose flag.

Remember that the hardware of the login nodes differs from that of the Booster compute nodes. Therefore, to obtain the output shown above, you have to run the lstopo-no-graphics and numactl commands within a SLURM job.

Intra node connection environment

Each compute node in the Booster partition is equipped with 4 NVIDIA A100 GPUs and 2 dual-port HDR100 NICs (providing 100 Gbps per GPU and 400 Gbps per node).

The GPUs are interconnected in an all-to-all topology, with each GPU linked by 4 bonded sets of NVLinks (NV4). All GPUs are closer to the first NUMA node, resulting in a GPU-to-core affinity of cores 0-15 for the 4 GPUs.

The node's topology can be visualized by running the nvidia-smi command as follows:

[<username>@lrdnXXXX ~]$ nvidia-smi topo -m

        GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity
GPU0    X       NV4     NV4     NV4     PXB     NODE    NODE    NODE    0-15            0
GPU1    NV4     X       NV4     NV4     NODE    PXB     NODE    NODE    0-15            0
GPU2    NV4     NV4     X       NV4     NODE    NODE    PXB     NODE    0-15            0
GPU3    NV4     NV4     NV4     X       NODE    NODE    NODE    PXB     0-15            0
NIC0    PXB     NODE    NODE    NODE    X       NODE    NODE    NODE        
NIC1    NODE    PXB     NODE    NODE    NODE    X       NODE    NODE        
NIC2    NODE    NODE    PXB     NODE    NODE    NODE    X       NODE        
NIC3    NODE    NODE    NODE    PXB     NODE    NODE    NODE    X         

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
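The GPU-to-NIC pairing can also be extracted from the matrix in a script. The following is a minimal sketch: the helper name closest_nic is hypothetical, the sample rows are copied from the output above, and the parsing assumes a node with at least two GPUs, as in the matrix shown.

```shell
#!/bin/sh
# Hypothetical helper: read "nvidia-smi topo -m"-style output on stdin and,
# for each GPU row, print the NIC column whose link type is PXB.
closest_nic() {
    awk '
        # Header row: first two fields are both GPU names; remember the column names
        $1 ~ /^GPU/ && $2 ~ /^GPU/ { for (i = 1; i <= NF; i++) head[i] = $i; next }
        # GPU rows: field 1 is the row label, so data column i maps to header i-1
        $1 ~ /^GPU/ { for (i = 2; i <= NF; i++) if ($i == "PXB") print $1, head[i - 1] }
    '
}

# Sample rows from the output above; on a compute node use: nvidia-smi topo -m | closest_nic
closest_nic <<'EOF'
        GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    NIC2    NIC3    CPU Affinity    NUMA Affinity
GPU0    X       NV4     NV4     NV4     PXB     NODE    NODE    NODE    0-15            0
GPU1    NV4     X       NV4     NV4     NODE    PXB     NODE    NODE    0-15            0
GPU2    NV4     NV4     X       NV4     NODE    NODE    PXB     NODE    0-15            0
GPU3    NV4     NV4     NV4     X       NODE    NODE    NODE    PXB     0-15            0
EOF
```

With the sample rows each GPU is paired with the NIC of the same index (GPU0 with NIC0, and so on), which the NIC legend maps to mlx5_0 ... mlx5_3.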