...

Model: DGX A100

Architecture: Linux InfiniBand Cluster
Nodes: 3
Processors: Dual AMD Rome 7742, 128 cores total, 2.25 GHz (base), 3.4 GHz (max boost)
Accelerators: 8x NVIDIA A100 GPUs per node, NVLink 3.0,
320 GB GPU memory per node
RAM: 1 TB/node
Internal Network: Mellanox IB EDR, fully connected topology

Storage: 15 TB/node NVMe RAID 0
         95 TB Gluster storage

Peak performance (single node):  5 petaFLOPS (AI)
                                10 petaOPS (INT8)
               


Access

Access to the DGX cluster is via the SSH (Secure Shell) protocol, using its hostname:

...

The DGX login node is a virtual machine with 2 CPUs and an x86_64 architecture, without GPUs. The login node is intended only for accessing the system, transferring data, and submitting jobs to the DGX compute nodes.
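For example, using the login hostname that appears in the data-transfer example later on this page (replace <username> with your own account, as in the other examples here):

```shell
# Open an SSH session on the DGX login node
ssh <username>@login.dgx.cineca.it
```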

Accounting

For accounting information please consult our dedicated section.

...

Please note that accounting is in terms of consumed core hours, but it also depends strongly on the requested memory and number of GPUs. Please refer to the dedicated section.


Disks and Filesystems


The storage organization conforms to the CINECA infrastructure (see the section "Data storage and Filesystems") as far as the user areas are concerned. In addition to the home directory ($HOME), a scratch area ($CINECA_SCRATCH) is defined for each user, a disk for storing run-time data and files. On this DGX cluster there is no $WORK directory.

...

Please note that /raid/scratch_local is a local area of each compute node. So, if a dataset is present on dgx01, the user needs to request dgx01 in the job script using the directive

#SBATCH --nodelist=dgx01


Data Management on DGX

DGX compute nodes can download data from a public website directly to the local storage (/raid/scratch_local) through wget or curl commands. If the dataset is public and could be useful for the community, we encourage you to write an email to superc@cineca.it so we can install the dataset on local storage to avoid multiple copies of the same dataset, given the limited space of the local storage.
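As a sketch, a download from inside a job running on a compute node could look like the following (the URL is a placeholder, not a real dataset location):

```shell
# Fetch a public dataset archive directly to the node-local NVMe storage
wget -P /raid/scratch_local/ https://example.org/dataset.tar.gz

# or, equivalently, with curl
curl -o /raid/scratch_local/dataset.tar.gz https://example.org/dataset.tar.gz
```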

...

> ssh -N -L 2222:dgx01:22 <username>@login.dgx.cineca.it &
> rsync -auve "ssh -p 2222" file.tar.gz <username>@localhost:/raid/scratch_local/

Modules environment

The DGX module environment is based on lmod and is minimal; only a few modules are available on the cluster:

...

We have installed a few modules to encourage users to run their simulations using container technology (via Singularity); see the section How to use Singularity on DGX below. In most cases you don't need to build your own container: you can pull (download) one from the NVIDIA, Docker, or Singularity repositories.

AI datasets modules

The AI datasets are organized into audio, images, and videos. The datasets are stored both in a shared directory AND on the NVMe disks, which provide significantly faster access inside jobs. Loading these modules hence defines two sets of variables: one pointing to the location accessible on the login node, and one pointing to the location available ONLY on the compute nodes (which we encourage you to use inside jobs). For instance, the imagenet ILSVRC2012 dataset module defines the following variables:

...

With the command "module help <module_name>" you will find the list of variables defined for each data module.
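As a sketch of how the two locations might be used (the variable names and paths below are hypothetical, for illustration only; the real names are listed by "module help <module_name>"), a job script could prefer the fast compute-node copy when it is defined:

```shell
# Hypothetical variable names and paths, for illustration only:
# the real names are shown by "module help <module_name>".
ILSVRC2012_SHARED=/shared/datasets/imagenet   # copy accessible on the login node (hypothetical)
ILSVRC2012_LOCAL=/raid/datasets/imagenet      # compute-node NVMe copy (hypothetical)

# Inside a job, prefer the NVMe location when it is defined
DATASET_DIR="${ILSVRC2012_LOCAL:-$ILSVRC2012_SHARED}"
echo "$DATASET_DIR"
```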

Production environment

Since DGX is a general purpose system and several users use it at the same time, long production jobs must be submitted using a queuing system (scheduler). This guarantees that access to the resources is as fair as possible. On DGX the available scheduler is SLURM.

...

Roughly speaking, there are two different modes to use an HPC system: Interactive and Batch. For a general discussion, see the section Production Environment.

Interactive

A serial program can be executed in the standard UNIX way:

...

A more detailed description of the options used by salloc/srun to allocate resources is given in the "Batch" section, since they are the same as those of the sbatch command described there.
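As a sketch (the resource values are illustrative, not recommendations), an interactive session on a compute node could be requested with srun, using the dgx_usr_prod partition described in the Summary table:

```shell
# Request 1 GPU and 8 cores interactively for 30 minutes on the production partition
srun --partition=dgx_usr_prod --nodes=1 --ntasks-per-node=1 \
     --cpus-per-task=8 --gres=gpu:1 --time=00:30:00 \
     --pty /bin/bash
```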

Batch

The production runs are executed in batch mode: the user writes a list of commands into a file (for example script.x) and then submits it to a scheduler (SLURM for DGX) that will search for the required resources in the system. As soon as the resources are available script.x is executed.

...

One of the commands will probably be the launch of a Singularity container application (see the section How to use Singularity on DGX).
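A minimal script.x might look like the following sketch (the account name and container image are placeholders, and the resource values are illustrative):

```shell
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --partition=dgx_usr_prod
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --time=04:00:00
#SBATCH --account=<account_name>

# Launch the application inside a Singularity container with GPU support
singularity exec --nv my_container.sif python train.py
```

The script is then submitted with "sbatch script.x".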


Summary

In the following table, you can find all the main features and limits imposed on the SLURM partitions and QOS.

| SLURM partition  | QOS             | #cores/#GPUs per job                                    | max walltime | max running jobs per user | max n. of cores/GPUs/nodes per user | priority | notes                                                                                                                 |
|------------------|-----------------|---------------------------------------------------------|--------------|---------------------------|-------------------------------------|----------|-----------------------------------------------------------------------------------------------------------------------|
| dgx_usr_prod     | dgx_qos_sprod   | max = 32 cores (64 cpus) / 2 GPUs; max mem = 245000 MB  | 48 h         | 1                         | 32 cores (64 cpus) / 2 GPUs         | 30       |                                                                                                                         |
| dgx_usr_prod     | normal (no QOS) | max = 128 cores (256 cpus) / 8 GPUs; max mem = 980000 MB | 4 h          | 1                         | 8 GPUs / 1 node                     | 40       |                                                                                                                         |
| dgx_usr_preempt  | dgx_qos_sprod   | max = 32 cores (64 cpus) / 2 GPUs; max mem = 245000 MB  | 48 h         | no limit                  |                                     | 1        | free of charge; your jobs may be killed at any moment if a high-priority job requests resources in the dgx_usr_prod partition |
| dgx_usr_preempt  | normal (no QOS) | max = 128 cores (256 cpus) / 8 GPUs; max mem = 980000 MB | 24 h         | no limit                  |                                     | 1        | free of charge; your jobs may be killed at any moment if a high-priority job requests resources in the dgx_usr_prod partition |


How to use Singularity on DGX

Singularity is installed in the default path on each node of the DGX cluster. You don't need to load the singularity module in order to use it, but we have created the module to provide some examples to users via the command "module help singularity". If you need more information, type singularity --help or visit the Singularity documentation website.

Pull a container

To pull a container from a repository, execute the command singularity pull <container location>. For example, if you want to download the pytorch container from the Docker repository you can use this command:

...

  1. no space left on device. This error often happens because Singularity uses the /tmp directory to store temporary files. Since /tmp is quite small, when it fills up you get this error and the container pull fails. We suggest setting the variable SINGULARITY_TMPDIR to point Singularity to a different temporary directory, e.g.
    $ export SINGULARITY_TMPDIR=$CINECA_SCRATCH/tmp
  2. CPU time limit exceeded. This error happens when the container you want to pull is quite large and the last stage (creating the SIF image) needs too much CPU time. On the login node each process has a CPU time limit of 10 minutes. To avoid this issue we suggest requesting an interactive job to pull the container, or pulling it on your laptop and copying it to the DGX cluster.
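Putting the first workaround into practice, a pull might look like the following sketch (the container image and tag are illustrative, taken from the NVIDIA NGC registry naming scheme):

```shell
# Point Singularity's temporary directory at the (much larger) scratch area
mkdir -p "$CINECA_SCRATCH/tmp"
export SINGULARITY_TMPDIR="$CINECA_SCRATCH/tmp"

# Pull a PyTorch container from the NVIDIA NGC registry (tag is illustrative)
singularity pull docker://nvcr.io/nvidia/pytorch:21.03-py3
```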

Run a container

Once you have pulled a container, running it is quite easy. Here is an interactive example with the pytorch container:

...

Here you can find more examples for pytorch.

Build your own container

Please refer to the relative section of this page in our User Guide.

...