As of January 2020, activity on D.A.V.I.D.E. has been discontinued.
In this page:
hostname: login.davide.cineca.it
early availability: July 2018
start of production: (to be defined)
end of production: Jan 20, 2020
D.A.V.I.D.E. (Development of an Added Value Infrastructure Designed in Europe) is an energy-aware, petaflops-class high-performance cluster based on the Power architecture and coupled with NVIDIA Tesla Pascal GPUs with NVLink. The innovative design of D.A.V.I.D.E. was developed by E4 Computer Engineering for PRACE, with the ultimate goal of producing a leading-edge HPC cluster offering higher performance, reduced power consumption and ease of use.
D.A.V.I.D.E. is based on the OpenPOWER platform and is among the harbingers of a new generation of HPC systems that deliver high performance while being environmentally conscious. It has been built using best-in-class components plus custom hardware and innovative middleware system software.
A key feature of D.A.V.I.D.E. is an innovative technology for measuring, monitoring and capping the power consumption of the node and of the whole system, through the collection of data from the relevant components (processors, memory, GPUs, fans) to further improve energy efficiency. The technology has been developed in collaboration with the University of Bologna.
FEATURES
Off-the-shelf components
High speed and accurate per-node power sensing synchronized among the nodes
Data accessible out-of-band and without processor intervention
Out-of-Band and synchronized fine grain performance sensing
Dedicated data-collection subsystem running on management nodes
Predictive Power Aware job scheduler and power manager
Model: E4 Cluster Open rack
Architecture: OpenPOWER with NVIDIA NVLink
Peak performance: ~1 PFlop/s
System Architecture
Compute nodes: There are 45 nodes connected through an InfiniBand EDR 100 Gb/s network, with a total peak performance of 990 TFlops and an estimated power consumption of less than 2 kW per node. Each node has a 2 OU OCP form factor and hosts two IBM POWER8 processors with NVIDIA NVLink and four Tesla P100 (Pascal) GPUs, with the intra-node communication layout optimized for best performance.
Access
DAVIDE can be accessed via SSH (Secure Shell) protocol using the hostname:
> login.davide.cineca.it
which establishes a connection to one of the available login nodes.
For information about data transfer from other computers, please follow the instructions and caveats in the dedicated section Data storage, or the document Data Management.
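As a minimal sketch, a file can be copied to DAVIDE over the same SSH connection, e.g. with scp (the file name is a placeholder; see the Data storage section for the actual storage areas):

```shell
# copy a local file to your DAVIDE home directory over SSH;
# <username> is a placeholder for your CINECA account name
scp mydata.tar.gz <username>@login.davide.cineca.it:~/
```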
Accounting
For accounting information please consult our dedicated section.
The account_no (or project) is important for batch executions. You need to indicate the account_no to be charged for the job in the scheduler, using the "-A" flag:
#SBATCH -A <account_no>
With the "saldo -b" command you can list all the account_no associated to your username.
Disks and Filesystems
The storage organization conforms to the CINECA infrastructure (see Section Data Storage and Filesystems).
In addition to the home directory $HOME, a scratch area $CINECA_SCRATCH is defined for each user: a large disk for storing run-time data and files.
No $WORK storage area is currently defined for active projects on DAVIDE.
The scratch storage is a BeeGFS parallel filesystem.
Modules environment
As usual, the software modules are collected in different profiles and organized by functional category (compilers, libraries, tools, applications,..).
"Base" profile is the default. It is automatically loaded after login and it contains basic modules for the programming activities (intel e gnu compilers, math libraries, profiling and debugging tools,..).
If you want to use a module from another profile, for example an application module, you must first load the corresponding profile:
>module load profile/<profile name>
>module load autoload <module name>
To list all the profiles and modules you have loaded, use the following command:
>module list
To see all the profiles, categories and modules available on Davide, the "modmap" command is available:
>modmap
Production environment
Roughly speaking, there are two different modes to use an HPC system: Interactive and Batch. For a general discussion see the section Production Environment and Tools.
Interactive
A serial program can be executed in the standard UNIX way:
> ./program
This is allowed only for very short runs, since the interactive environment has a 10-minute time limit: for longer runs please use the "batch" mode.
A parallel program can be executed interactively only within an "Interactive" SLURM batch job, using the "srun" command: the job is queued and scheduled as any other job, but when executed, the standard input, output, and error streams are connected to the terminal session from which srun was launched.
For example, to start an interactive session with the MPI program myprogram, using one node, one processor and one GPU, launch the command:
> salloc -N1 --ntasks-per-node=1 --gres=gpu:1 -A <account_name> -p dvd_usr_prod
SLURM will then schedule your job to start, and your shell will be unresponsive until the resources are allocated for you.
When the shell comes back with the prompt, you can execute your program by typing:
> srun ./myprogram
or
> mpirun ./myprogram
The srun command will take by default PMI2 as MPI type.
SLURM automatically exports the environment variables defined in the source shell, so if your program myprogram needs a controlled environment (e.g. specific library paths or options), you can prepare that environment in the origin shell and be sure to find it in the interactive shell.
Batch
As usual on systems using SLURM, you can submit a script script.x using the command:
> sbatch script.x
You can get a list of defined partitions with the command:
> sinfo
You can simplify the output reported by the sinfo command specifying the output format via the "-o" option. A minimal output is reported, for instance, with:
> sinfo -o "%10D %20F %P"
which shows, for each partition, the total number of nodes and the number of nodes by state in the format "Allocated/Idle/Other/Total".
For more information and examples of job scripts, see section Batch Scheduler SLURM.
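As a sketch, a complete DAVIDE job script might combine the directives described in this page (the account name and executable are placeholders; the module versions are those listed in the Compilers section below):

```shell
#!/bin/bash
#SBATCH -N 1                     # one node
#SBATCH --ntasks-per-node=4      # one task per GPU
#SBATCH --gres=gpu:4             # all four P100 GPUs of the node
#SBATCH -t 01:00:00              # requested walltime
#SBATCH -A <account_no>          # project account from "saldo -b"
#SBATCH -p dvd_usr_prod          # production partition

module load gnu/6.4.0 cuda/9.2.88
srun ./myprogram                 # placeholder executable
```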
Submitting batch jobs on DAVIDE
On DAVIDE it is possible to submit jobs requiring different resources by specifying the corresponding partition.
Summary
In the following table you can find the main features and limits imposed on the partitions. In addition to the user-based limits (max cores, memory and walltime per job), partition-based limits are imposed (see "Max global resources per partition") on the maximum number of nodes that can be used by the jobs running on each partition.
The preemptable partition dvd_usr_preempt has no such limitation: it gives access to additional nodes beyond the quantity allotted for academic production (5 nodes), in a preemptable modality. Jobs submitted to dvd_usr_preempt may be assigned nodes if available, and may be killed if the assigned resources are requested by jobs submitted to higher-priority partitions (dvd_fua_prod and dvd_usr_prod); hence, we recommend its use only with restartable applications.
For EUROfusion users there are dedicated queues, please refer to the dedicated document.
SLURM Partition | QOS | max cores per job | GPUs per node | max walltime | max running jobs per user | max memory per job | priority | Max global resources per partition | notes
---|---|---|---|---|---|---|---|---|---
dvd_all_serial (default partition) | noQOS | max = 1 (max mem = 15200MB) | 0 | 04:00:00 | - | 15200MB | 1 | 4 cores | defined on the login node
dvd_usr_prod | noQOS | min = 1, max = 32 (2 nodes)* | 4 | 08:00:00 | - | 246000MB | 100 | 5 nodes | --gres=gpu:N (N=1,4)
dvd_usr_prod | dvd_qos_dbg | min = 1, max = 32 (2 nodes)* | 4 | 00:30:00 | - | 246000MB | 100 | 5 nodes | --gres=gpu:N (N=1,4)
dvd_usr_preempt | noQOS | min = 1, max = 32 (2 nodes)* | 4 | 08:00:00 | - | 246000MB | 10 | 45 nodes | --gres=gpu:N (N=1,4)
(*) 16 cores per node, 8 threads per core
Programming environment
The programming environment of the DAVIDE cluster consists of a choice of compilers for the main scientific languages (Fortran, C and C++), debuggers to help users find bugs and errors in their codes, and profilers to help with code optimization.
In general you must "load" the correct environment in order to use programming tools such as compilers, since "native" compilers are not available.
If you use a given set of compilers and libraries to create your executable, very probably you have to define the same "environment" when you want to run it. This is because, since by default linking is dynamic on Linux systems, at runtime the application will need the compiler shared libraries as well as other proprietary libraries. This means that you have to specify "module load" for compilers and libraries, both at compile time and at run time. To minimize the number of needed modules at runtime, use static linking to compile the applications.
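For example, assuming a Fortran source built with the GNU module listed below, the same module should be loaded both at compile time and at run time (the source and executable names are placeholders):

```shell
# at compile time
module load gnu/6.4.0
gfortran -O2 myprog.f90 -o myprog   # dynamically linked against the gnu/6.4.0 runtime

# at run time (e.g. inside the job script), load the same module again
module load gnu/6.4.0
./myprog
```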
Compilers
You can check the complete list of available compilers on DAVIDE with the command:
> module available
and checking the "compilers" section. In general, the available compilers are currently:
GNU (gcc, g77, g95):
> module load gnu/6.4.0
LLVM Compiler Infrastructure:
> module load llvm/6.0.0
CUDA:
> module load cuda/9.2.88
After loading the appropriate module, use the "man" command to get the complete list of the flags supported by the compiler, for example:
> module load gnu
> man gfortran
There are some flags that are common to all these compilers; others are more specific. The most common ones are reported below for each compiler.
1. If you want to use a specific library or a particular include file, you have to provide their paths with the following options:
-I/path_include_files specifies the path of the include files
-L/path_lib_files -l<xxx> specifies a library lib<xxx>.a in /path_lib_files
2. If you want to debug your code, you have to turn off optimisation and turn on run-time checks: these flags are described in the following section.
3. If you want to compile your code for normal production, you have to turn on optimisation by choosing a higher optimisation level:
-O2 or -O3 higher optimisation levels
Other flags are available for specific compilers and are reported later.
GNU compilers
The GNU compilers are available by loading the corresponding module, although they are not the best optimizing compilers.
The names of the GNU compilers are:
- g77: Fortran77 compiler
- gfortran: Fortran95 compiler
- gcc: C compiler
- g++: C++ compiler
The documentation can be obtained with the man command:
> man gfortran
NOTE: keep in mind that D.A.V.I.D.E. is based on POWER8, so the architecture-specific compiling flags should be looked up in the "PowerPC" section of the man documentation. Flags like "-mavx", used on other CINECA clusters based on x86, will not work here.
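For instance, on POWER8 the architecture-tuning flags belong to the "-mcpu"/"-mtune" family documented in the PowerPC section; a sketch with the GNU C compiler (the source file name is a placeholder):

```shell
# x86-style flags such as -mavx are rejected by the POWER8 toolchain;
# the PowerPC equivalents select and tune for the target processor instead
gcc -O3 -mcpu=power8 -mtune=power8 myprog.c -o myprog
```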
XL compiler
Not available.
Optimized Scientific Libraries
IBM ESSL are not available.
Parallel programming
OpenMPI
The OpenMPI libraries and wrappers are available on Davide. The package has been configured to support Mellanox Hierarchical Collectives (hcoll), and Mellanox Messaging support (mxm).
Additional information
Process/thread affinity:
Each node: 2 Power8 sockets (CoresPerSocket=8, ThreadsPerCore=8), Gres=gpu:tesla:4
The multithreading is active with 8 threads per physical core (128 logical cpus):
$ ppc64_cpu --info
Core 0: 0* 1* 2* 3* 4* 5* 6* 7*
Core 1: 8* 9* 10* 11* 12* 13* 14* 15*
Core 2: 16* 17* 18* 19* 20* 21* 22* 23*
Core 3: 24* 25* 26* 27* 28* 29* 30* 31*
Core 4: 32* 33* 34* 35* 36* 37* 38* 39*
Core 5: 40* 41* 42* 43* 44* 45* 46* 47*
Core 6: 48* 49* 50* 51* 52* 53* 54* 55*
Core 7: 56* 57* 58* 59* 60* 61* 62* 63*
Core 8: 64* 65* 66* 67* 68* 69* 70* 71*
Core 9: 72* 73* 74* 75* 76* 77* 78* 79*
Core 10: 80* 81* 82* 83* 84* 85* 86* 87*
Core 11: 88* 89* 90* 91* 92* 93* 94* 95*
Core 12: 96* 97* 98* 99* 100* 101* 102* 103*
Core 13: 104* 105* 106* 107* 108* 109* 110* 111*
Core 14: 112* 113* 114* 115* 116* 117* 118* 119*
Core 15: 120* 121* 122* 123* 124* 125* 126* 127*
Due to how the hardware is detected on a Power8 architecture, the numbering of (logical) cores follows the order of threading.
Since the nodes can be shared by users, Slurm has been configured to allocate one task per (physical) core by default. Without this configuration, Slurm would by default allocate one task per thread on nodes with more than one ThreadsPerCore configured (as is the case on Davide).
As a result, for each requested task a physical core with all its 8 threads is allocated to the task.
Since a physical core (8 hardware threads) is assigned to each task, a maximum of 16 tasks per node can be requested (--ntasks-per-node), each receiving, as mentioned, 8 logical cpus. The use of --cpus-per-task as an sbatch directive is hence discouraged, as it can lead to incorrect allocations.
For instance:
#SBATCH -N 1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH -A <account_name>
#SBATCH -p dvd_usr_prod
will result in 32 logical cpus in the job cpuset, for instance 0-15 and 64-79. Slurm automatically assigns the cpuset most appropriate for accessing the (2-per-socket) GPUs:
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 mlx5_0 mlx5_1 CPU Affinity
GPU0 X NV2 SYS SYS NODE SYS 0-15
GPU1 NV2 X SYS SYS NODE SYS 0-15
GPU2 SYS SYS X NV2 SYS NODE 64-79
GPU3 SYS SYS NV2 X SYS NODE 64-79
mlx5_0 NODE NODE SYS SYS X SYS
mlx5_1 SYS SYS NODE NODE SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
If you have a hybrid MPI/OpenMP application, request a number of tasks such that the number of logical cores obtained equals the number of MPI processes times the number of OpenMP threads per task. For instance, for 4 MPI processes and 16 OpenMP threads per task you need 64 logical cores, hence 8 physical cores:
#SBATCH -N 1
#SBATCH --ntasks-per-node=8
#SBATCH -A <account_name>
#SBATCH -p dvd_usr_prod
Once the resources are allocated (with, for instance, a cpuset given by 0-15,64-79), you can run your job specifying the desired OMP_NUM_THREADS and an adequate number of MPI processes:
export OMP_NUM_THREADS=16 (or set the --cpus-per-task as below)
srun -N 1 -n 4 ( --cpu-bind=core ) --cpus-per-task=16 -m block:block <exe>
The -m flag allows you to specify the desired process distribution between nodes/sockets/cores (the default is block:cyclic). Please refer to the srun manual for more details on process distribution and binding. Note that the --cpu-bind flag is required to obtain the correct process binding when the -m flag is not used.
You can then set the OMP affinity to threads exporting the OMP_PLACES variable.
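The task-sizing rule above can be checked with a quick shell calculation (values taken from the example: 4 MPI ranks, 16 OpenMP threads each, 8 hardware threads per physical core):

```shell
MPI_RANKS=4
OMP_THREADS=16
THREADS_PER_CORE=8
LOGICAL_CPUS=$((MPI_RANKS * OMP_THREADS))      # logical cpus needed by the job
NTASKS=$((LOGICAL_CPUS / THREADS_PER_CORE))    # value for --ntasks-per-node
echo "$LOGICAL_CPUS logical cpus -> --ntasks-per-node=$NTASKS"
```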
GPU Environment
NVIDIA GPUDirect RDMA technology is not supported on Davide because of hardware limitations: the Minsky nodes do not have PCIe switches, and any attempt to use GPUDirect RDMA will cause system crashes.
Hence, CUDA-aware MPI is not supported: CUDA APIs have to be used by MPI or serial processes to copy data between GPUs and host, via the NVLink interconnect (intra-socket, GPUDirect Peer-to-Peer) or PCIe/InfiniBand (inter-socket and inter-node). GPUDirect Peer-to-Peer is supported (via CUDA APIs) over the NVLinks.
On systems with CUDA-aware MPI, GPU device buffers can be passed directly to MPI calls:
//MPI rank 0
MPI_Send(s_buf_d,size,MPI_CHAR,1,100,MPI_COMM_WORLD);
//MPI rank 1
MPI_Recv(r_buf_d,size,MPI_CHAR,0,100,MPI_COMM_WORLD, &status);
On DAVIDE you need to stage the data through host buffers instead, recoding as follows:
//MPI rank 0: copy the device buffer to the host, then send
cudaMemcpy(s_buf_h,s_buf_d,size,cudaMemcpyDeviceToHost);
MPI_Send(s_buf_h,size,MPI_CHAR,1,100,MPI_COMM_WORLD);
//MPI rank 1: receive into a host buffer, then copy it to the device
MPI_Recv(r_buf_h,size,MPI_CHAR,0,100,MPI_COMM_WORLD,&status);
cudaMemcpy(r_buf_d,r_buf_h,size,cudaMemcpyHostToDevice);