...
| Model | DGX A100 |
|---|---|
| Architecture | Linux InfiniBand Cluster |
| Storage | 15 TB/node NVMe RAID 0 |
| Peak performance (single node) | 5 petaFLOPS (AI) |
Access
Access to DGX is through the SSH (Secure Shell) protocol, using its hostname:
...
The DGX login node is a virtual machine with 2 CPUs and an x86_64 architecture, without GPUs. The login node is used only for accessing the system, transferring data, and submitting jobs to the DGX nodes.
Accounting
For accounting information please consult our dedicated section.
...
Please note that accounting is in terms of consumed core hours, but it also depends strongly on the requested memory and number of GPUs. Please refer to the dedicated section.
Disks and Filesystems
The storage organization conforms to the CINECA infrastructure (see the section "Data storage and Filesystems") as for the user areas. In addition to the home directory ($HOME), a scratch area ($CINECA_SCRATCH) is defined for each user: a disk for storing run-time data and files. On this DGX cluster there is no $WORK directory.
...
Please note that /raid/scratch_local is a local area of each compute node. So, if the dataset is present on dgx01, the user needs to request dgx01 in the job script using the directive
#SBATCH --nodelist=dgx01
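A minimal job script using this directive might look like the following sketch; the job name, account, resource amounts, and walltime are assumptions to be adapted to your project:

```shell
#!/bin/bash
#SBATCH --job-name=train_local       # hypothetical job name
#SBATCH --partition=dgx_usr_prod     # production partition of the DGX cluster
#SBATCH --nodelist=dgx01             # pin the job to dgx01, where the dataset resides
#SBATCH --gres=gpu:1                 # number of GPUs (example value)
#SBATCH --time=01:00:00              # walltime (example value)
#SBATCH --account=<account_name>     # your project account

# the dataset was previously staged on dgx01's node-local NVMe storage
ls /raid/scratch_local
```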
Data Management on DGX
DGX compute nodes can download data from a public website directly to the local storage (/raid/scratch_local) through wget or curl commands. If the dataset is public and could be useful for the community, we encourage you to write an email to superc@cineca.it so we can install the dataset on local storage to avoid multiple copies of the same dataset, given the limited space of the local storage.
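As a sketch, a job step could fetch a public archive straight into the node-local area; the URL and archive name below are placeholders:

```shell
# run on a compute node (e.g. inside a batch job): download a public
# dataset directly to the node-local NVMe storage and unpack it
cd /raid/scratch_local
wget https://example.org/dataset.tar.gz   # placeholder URL
tar -xzf dataset.tar.gz
```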
...
> ssh -N -L 2222:dgx01:22 <username>@login.dgx.cineca.it &
> rsync -auve "ssh -p 2222" file.tar.gz <username>@localhost:/raid/scratch_local/
Modules environment
The DGX module environment is based on Lmod and is minimal: only a few modules are available on the cluster:
...
We have installed a few modules to encourage users to run their simulations using container technology (via Singularity); see the section "How to use Singularity on DGX" below. In most cases you don't need to build your own container: you can pull (download) one from the NVIDIA, Docker, or Singularity repositories.
AI datasets modules
The AI datasets are organized into audio, images, and videos. The datasets are stored both in a shared directory AND on the NVMe memories, which provide significantly faster access inside jobs. Two sets of variables are hence defined by loading these modules: one pointing to the location accessible on the login node, and one pointing to the location available ONLY on the compute nodes (which we encourage you to use inside jobs). For instance, the imagenet ILSVRC2012 dataset module defines the following variables:
...
With the command "module help <module_name>" you will find the list of variables defined for each data module.
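A possible inspection workflow, assuming the imagenet dataset module mentioned above (the exact module name may differ; check "module avail" first):

```shell
# list the modules available on the cluster
module avail
# load a dataset module (module name is an assumption)
module load imagenet
# print the variables the module defines and their meaning
module help imagenet
```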
Production environment
Since DGX is a general-purpose system used by several users at the same time, long production jobs must be submitted via a queuing system (scheduler). This guarantees that access to the resources is as fair as possible. On DGX the available scheduler is SLURM.
...
Roughly speaking, there are two different modes to use an HPC system: Interactive and Batch. For a general discussion, see the section Production Environment.
Interactive
A serial program can be executed in the standard UNIX way:
...
A more detailed description of the options used by salloc/srun to allocate resources is given in the "Batch" section, since they are the same as those of the sbatch command described there.
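An interactive allocation might be requested as in the following sketch; the resource amounts, walltime, account name, and program name are assumptions:

```shell
# request an interactive allocation with 1 GPU for one hour (example values)
salloc -N 1 --ntasks-per-node=8 --gres=gpu:1 -p dgx_usr_prod -t 01:00:00 -A <account_name>
# once the allocation is granted, launch commands on the compute node
srun ./myprogram
```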
Batch
Production runs are executed in batch mode: the user writes a list of commands into a file (for example, script.x) and then submits it to a scheduler (SLURM for DGX), which will search for the required resources in the system. As soon as the resources are available, script.x is executed.
...
One of the commands will probably be the launch of a Singularity container application (see the section "How to use Singularity on DGX").
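A sketch of such a script.x, assuming a previously pulled container image (the image name, training script, and account are placeholders):

```shell
#!/bin/bash
#SBATCH --partition=dgx_usr_prod     # production partition
#SBATCH --gres=gpu:2                 # GPUs for the job (example value)
#SBATCH --time=04:00:00              # walltime (example value)
#SBATCH --account=<account_name>     # your project account

# run the training script inside the container; --nv exposes the host GPUs
singularity exec --nv ./pytorch.sif python train.py
```

The script would then be submitted with `sbatch script.x`.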
Summary
In the following table, you can find all the main features and limits imposed on the SLURM partitions and QOS.
| SLURM partition | QOS | # cores / # GPUs per job | max walltime | max running jobs per user / max n. of cores/GPUs/nodes per user | priority | notes |
|---|---|---|---|---|---|---|
| dgx_usr_prod | dgx_qos_sprod | max = 32 cores (64 cpus) / 2 GPUs, max mem = 245000 MB | 48 h | 1 job per user, 32 cores (64 cpus) / 2 GPUs | 30 | |
| dgx_usr_prod | normal (no QOS) | max = 128 cores (256 cpus) / 8 GPUs, max mem = 980000 MB | 4 h | 1 job per user, 8 GPUs / 1 node per user | 40 | |
| dgx_usr_preempt | dgx_qos_sprod | max = 32 cores (64 cpus) / 2 GPUs, max mem = 245000 MB | 48 h | no limit | 1 | free of charge; your jobs may be killed at any moment if a high-priority job requests resources in the dgx_usr_prod partition |
| dgx_usr_preempt | normal (no QOS) | max = 128 cores (256 cpus) / 8 GPUs, max mem = 980000 MB | 24 h | no limit | 1 | free of charge; your jobs may be killed at any moment if a high-priority job requests resources in the dgx_usr_prod partition |
How to use Singularity on DGX
On each node of the DGX cluster, Singularity is installed in the default path. You don't need to load the singularity module in order to use it, but we have created one to provide some examples to users via the command "module help singularity". If you need more information, type singularity --help or visit the Singularity documentation website.
Pull a container
To pull a container from a repository, execute the command singularity pull <container location>. For example, to download the PyTorch container from the Docker repository you can use this command:
...
no space left on device
This error often happens because Singularity uses the /tmp directory to store temporary files. Since /tmp is quite small, when it fills up you get this error and the pull fails. We suggest setting the variable SINGULARITY_TMPDIR to point Singularity to a different temporary directory, e.g.
$ export SINGULARITY_TMPDIR=$CINECA_SCRATCH/tmp
CPU time limit exceeded
This error happens when the container you want to pull is quite large and the last stage (creating the SIF image) needs too much CPU time. On the login node each process has a CPU-time limit of 10 minutes. To avoid this issue, we suggest requesting an interactive job and performing the pull there, or pulling the container on your laptop and copying it to the DGX cluster.
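For instance, the pull could be performed inside a short interactive job, where the login node's CPU-time limit does not apply; the walltime and account below are example values:

```shell
# allocate a compute node interactively and perform the pull there
salloc -N 1 -p dgx_usr_prod -t 00:30:00 -A <account_name>
srun singularity pull docker://pytorch/pytorch   # example image
```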
Run a container
Once you have pulled a container, running it is quite easy. Here is an interactive example with the PyTorch container:
...
Here you can find more examples with PyTorch.
Build your own container
Please refer to the relevant section of our User Guide.
...