...

Model: DGX A100

Architecture: Linux InfiniBand Cluster
Nodes: 3
Processors: Dual AMD Rome 7742, 128 cores total, 2.25 GHz (base), 3.4 GHz (max boost)
Accelerators: 8x NVIDIA A100 GPUs per node, NVLink 3.0,
320 GB GPU memory per node
RAM: 1 TB/node
Internal Network: Mellanox IB EDR, fully connected topology

Storage: 15 TB/node NVMe RAID 0
         95 TB Gluster storage

Peak performance (single node):  5 petaFLOPS (AI)
                                10 petaOPS (INT8)
               


Access

Access to the DGX cluster is via the SSH (Secure Shell) protocol, using its hostname:

...

The DGX login node is a virtual machine with 2 CPUs and an x86_64 architecture, without GPUs. The login node is intended only for accessing the system, transferring data, and submitting jobs to the DGX compute nodes.
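For example, using the login hostname that appears in the data-transfer example later on this page (replace <username> with your own account, as in the other examples here):

```shell
# Open an SSH session on the DGX login node
ssh <username>@login.dgx.cineca.it
```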

Accounting

For accounting information please consult our dedicated section.

...

Please note that accounting is in terms of consumed core hours, but it also depends strongly on the requested memory and number of GPUs. Please refer to the dedicated section.


Disks and Filesystems


The storage organization conforms to the CINECA infrastructure (see the section "Data storage and Filesystems") as far as the user areas are concerned. In addition to the home directory ($HOME), a scratch area ($CINECA_SCRATCH) is defined for each user, a disk for storing run-time data and files. On this DGX cluster there is no $WORK directory.

...

Please note that /raid/scratch_local is a local area of each compute node. So, if a dataset is present on dgx01, the user needs to request dgx01 in the job script using the directive

#SBATCH --nodelist=dgx01


Data Management on DGX

DGX compute nodes can download data from a public website directly to the local storage (/raid/scratch_local) through wget or curl commands. If the dataset is public and could be useful for the community, we encourage you to write an email to superc@cineca.it so we can install the dataset on local storage to avoid multiple copies of the same dataset, given the limited space of the local storage.
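As a sketch, a download from inside a job running on a compute node could look like the following (the URL is a placeholder, not a real dataset location):

```shell
# Fetch a public dataset archive directly to the node-local NVMe storage
wget -P /raid/scratch_local/ https://example.org/dataset.tar.gz

# or, equivalently, with curl
curl -o /raid/scratch_local/dataset.tar.gz https://example.org/dataset.tar.gz
```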

...

> ssh -N -L 2222:dgx01:22 <username>@login.dgx.cineca.it &
> rsync -auve "ssh -p 2222" file.tar.gz <username>@localhost:/raid/scratch_local/

Modules environment

The DGX module environment is based on lmod and is minimal; only a few modules are available on the cluster:

...

We have installed a few modules to encourage users to run their simulations using container technology (via Singularity); see the section How to use Singularity on DGX below. In most cases you don't need to build your own container: you can pull (download) one from the NVIDIA, Docker, or Singularity repositories.

AI datasets modules

The AI datasets are organized into audio, images, and videos. The datasets are stored both in a shared directory AND on the NVMe disks, which provide significantly faster access inside jobs. Loading these modules hence defines two sets of variables: one pointing to the location accessible on the login node, and one pointing to the location available ONLY on the compute nodes (which we encourage you to use inside jobs). For instance, the imagenet ILSVRC2012 dataset module defines the following variables:

...

With the command "module help <module_name>" you will find the list of variables defined for each data module.
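As a sketch of how the two locations might be used (the variable names and paths below are hypothetical, for illustration only; the real names are listed by "module help <module_name>"), a job script could prefer the fast compute-node copy when it is defined:

```shell
# Hypothetical variable names and paths, for illustration only:
# the real names are shown by "module help <module_name>".
ILSVRC2012_SHARED=/shared/datasets/imagenet   # copy accessible on the login node (hypothetical)
ILSVRC2012_LOCAL=/raid/datasets/imagenet      # compute-node NVMe copy (hypothetical)

# Inside a job, prefer the NVMe location when it is defined
DATASET_DIR="${ILSVRC2012_LOCAL:-$ILSVRC2012_SHARED}"
echo "$DATASET_DIR"
```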

Production environment

Since DGX is a general purpose system and several users use it at the same time, long production jobs must be submitted using a queuing system (scheduler). This guarantees that access to the resources is as fair as possible. On DGX the available scheduler is SLURM.

...

Roughly speaking, there are two different modes to use an HPC system: Interactive and Batch. For a general discussion, see the section Production Environment.

Interactive

A serial program can be executed in the standard UNIX way:

...

A more detailed description of the options used by salloc/srun to allocate resources is given in the "Batch" section, since they are the same as those of the sbatch command described there.
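As a sketch (the resource values are illustrative, not recommendations), an interactive session on a compute node could be requested with srun, using the dgx_usr_prod partition described in the Summary table:

```shell
# Request 1 GPU and 8 cores interactively for 30 minutes on the production partition
srun --partition=dgx_usr_prod --nodes=1 --ntasks-per-node=1 \
     --cpus-per-task=8 --gres=gpu:1 --time=00:30:00 \
     --pty /bin/bash
```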

Batch

The production runs are executed in batch mode: the user writes a list of commands into a file (for example script.x) and then submits it to a scheduler (SLURM for DGX) that will search for the required resources in the system. As soon as the resources are available script.x is executed.

...

One of the commands will probably be the launch of a Singularity container application (see the section How to use Singularity on DGX).
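A minimal script.x might look like the following sketch (the account name and container image are placeholders, and the resource values are illustrative):

```shell
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --partition=dgx_usr_prod
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --time=04:00:00
#SBATCH --account=<account_name>

# Launch the application inside a Singularity container with GPU support
singularity exec --nv my_container.sif python train.py
```

The script is then submitted with "sbatch script.x".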


Summary

In the following table, you can find all the main features and limits imposed on the SLURM partitions and QOS.

| SLURM partition  | QOS             | #cores/#GPUs per job                                    | max walltime | max running jobs per user | max n. of cores/GPUs/nodes per user | priority | notes                                                                                                                 |
|------------------|-----------------|---------------------------------------------------------|--------------|---------------------------|-------------------------------------|----------|-----------------------------------------------------------------------------------------------------------------------|
| dgx_usr_prod     | dgx_qos_sprod   | max = 32 cores (64 cpus) / 2 GPUs; max mem = 245000 MB  | 48 h         | 1                         | 32 cores (64 cpus) / 2 GPUs         | 30       |                                                                                                                         |
| dgx_usr_prod     | normal (no QOS) | max = 128 cores (256 cpus) / 8 GPUs; max mem = 980000 MB | 4 h          | 1                         | 8 GPUs / 1 node                     | 40       |                                                                                                                         |
| dgx_usr_preempt  | dgx_qos_sprod   | max = 32 cores (64 cpus) / 2 GPUs; max mem = 245000 MB  | 48 h         | no limit                  |                                     | 1        | free of charge; your jobs may be killed at any moment if a high-priority job requests resources in the dgx_usr_prod partition |
| dgx_usr_preempt  | normal (no QOS) | max = 128 cores (256 cpus) / 8 GPUs; max mem = 980000 MB | 24 h         | no limit                  |                                     | 1        | free of charge; your jobs may be killed at any moment if a high-priority job requests resources in the dgx_usr_prod partition |


How to use Singularity on DGX

Singularity is installed in the default path on each node of the DGX cluster. You don't need to load the singularity module in order to use it, but we have created the module to provide some examples to users via the command "module help singularity". If you need more information, type singularity --help or visit the Singularity documentation website.

Pull a container

To pull a container from a repository, execute the command singularity pull <container location>. For example, if you want to download the pytorch container from the Docker repository you can use this command:

...

  1. no space left on device. This error often happens because Singularity uses the /tmp directory to store temporary files. Since /tmp is quite small, when it fills up you get this error and the container pull fails. We suggest setting the variable SINGULARITY_TMPDIR to point Singularity to a different temporary directory, e.g.
    $ export SINGULARITY_TMPDIR=$CINECA_SCRATCH/tmp
  2. CPU time limit exceeded. This error happens when the container you want to pull is quite large and the last stage (creating the SIF image) needs too much CPU time. On the login node each process has a CPU time limit of 10 minutes. To avoid this issue we suggest requesting an interactive job to pull the container, or pulling it on your laptop and copying it to the DGX cluster.
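Putting the first workaround into practice, a pull might look like the following sketch (the container image and tag are illustrative, taken from the NVIDIA NGC registry naming scheme):

```shell
# Point Singularity's temporary directory at the (much larger) scratch area
mkdir -p "$CINECA_SCRATCH/tmp"
export SINGULARITY_TMPDIR="$CINECA_SCRATCH/tmp"

# Pull a PyTorch container from the NVIDIA NGC registry (tag is illustrative)
singularity pull docker://nvcr.io/nvidia/pytorch:21.03-py3
```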

Run a container

Once you have pulled a container, running it is quite easy. Here is an interactive example with the pytorch container:

...

Here you can find more examples for pytorch.

Build your own container

Please refer to the relative section of this page in our User Guide.

...