...
The topology of the node devices is as follows:
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 CPU Affinity
GPU0 X NV3 SYS SYS 0-63
GPU1 NV3 X SYS SYS 0-63
GPU2 SYS SYS X NV3 64-127
GPU3 SYS SYS NV3 X 64-127
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
Internode communication is based on a Mellanox InfiniBand EDR network, and the OpenMPI and IBM Spectrum MPI libraries are configured to exploit the Mellanox Fabric Collective Accelerators (also on CUDA memories) and Messaging Accelerators.
NVIDIA GPUDirect technology is fully supported (shared memory, peer-to-peer, RDMA, async), enabling the use of CUDA-aware MPI.
Modules environment
As usual, the software modules are collected in different profiles and organized by functional category (compilers, libraries, tools, applications, ...).
The profiles are of two types: the “domain” type (chem, phys, lifesc, ...) for production activity and the “programming” type (base and advanced) for compilation, debugging and profiling activities. They can be loaded together.
The "Base" profile is the default one. It is automatically loaded after login and contains the basic modules for programming activities (XL, PGI and GNU compilers, math libraries, profiling and debugging tools, ...).
If you want to use a module placed under another profile, for example an application module, you will have to load the corresponding profile:
>module load profile/<profile name>
>module load autoload <module name>
To list all the profiles you have loaded, use the following command:
>module list
To list all the profiles, categories and modules available on M100, use the “modmap” command:
>modmap
Spack
...
Production environment
Since M100 is a general purpose system and it is used by several users at the same time, long production jobs must be submitted using a queuing system. This guarantees that the access to the resources is as fair as possible.
Roughly speaking, there are two different modes to use an HPC system: Interactive and Batch. For a general discussion see the section Production Environment and Tools.
Each node of Marconi100 consists of 2 Power9 sockets, each with 16 cores and 2 Volta GPUs (32 cores and 4 GPUs per node). Multi-threading is active with 4 threads per physical core (128 logical cpus in total).
Due to how the hardware is detected on a Power9 architecture, the numbering of (logical) cores follows the order of threading:
$ ppc64_cpu --info
Core 0: 0* 1* 2* 3*
Core 1: 4* 5* 6* 7*
Core 2: 8* 9* 10* 11*
Core 3: 12* 13* 14* 15*
.............. (Cores from 4 to 27)........................
Core 28: 112* 113* 114* 115*
Core 29: 116* 117* 118* 119*
Core 30: 120* 121* 122* 123*
Core 31: 124* 125* 126* 127*
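The numbering shown by ppc64_cpu can be reproduced with a little arithmetic. The following is a minimal sketch, assuming 4 hardware threads per core numbered consecutively as in the listing; the function name logical_cpus_of_core is illustrative and not part of any system tool.

```shell
#!/bin/bash
# Sketch: list the logical cpu ids (hardware threads) of one physical core,
# assuming 4 consecutive threads per core as reported by "ppc64_cpu --info".
THREADS_PER_CORE=4

# logical_cpus_of_core is a hypothetical helper name used only for illustration.
logical_cpus_of_core() {
    local core=$1
    local first=$(( core * THREADS_PER_CORE ))
    local cpus=()
    local t
    for (( t = 0; t < THREADS_PER_CORE; t++ )); do
        cpus+=( $(( first + t )) )
    done
    echo "${cpus[*]}"
}

logical_cpus_of_core 31   # prints: 124 125 126 127
```

This shows why a task bound to physical core N sees the logical cpus 4N to 4N+3.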
Since the nodes can be shared by users, Slurm has been configured to allocate one task per (physical) core by default. Without this option, one task would be allocated per thread by default on nodes with more than one ThreadsPerCore (as is the case on Marconi100).
As a result of this configuration, each requested task is allocated a physical core with all its 4 threads. The use of --cpus-per-task as an sbatch directive is hence discouraged, as it can lead to incorrect allocation. You can then exploit the multithreading capability by running 4 MPI processes per physical core or by suitably combining MPI processes and OpenMP threads, if appropriate for your application.
Since a physical core (4 HTs) is assigned to each task, a maximum of 32 tasks per node can be requested (--ntasks-per-node), each of which (as mentioned) receives 4 logical cpus.
Interactive
A serial program can be executed in the standard UNIX way:
> ./program
This is allowed only for very short runs on the login nodes, since the interactive environment has a 10-minute time limit.
A serial (or multithreaded) program using GPUs and needing more than 10 minutes can be executed interactively within an "Interactive" SLURM batch job, using the "srun" command: the job is queued and scheduled as any other job but, when executed, the remote standard input, output, and error streams are connected to the terminal session from which srun was launched.
For example, to start an interactive session on one node with one GPU, launch the command:
> srun -N1 --ntasks-per-node=1 --gres=gpu:1 -A <account_name> --time=01:00:00 --pty /bin/bash
SLURM will then schedule your job to start, and your shell will be unresponsive until free resources are allocated for you. When the shell comes back with the prompt (the hostname at the prompt will be that of the assigned node), launch the program in the standard way:
> ./program
As mentioned above, the accounting of the consumed core hours also takes into account the memory and the number of requested GPUs (see the dedicated section). For instance, a job using one core and one GPU for one hour (with the default memory per core) will consume 8 core-hours, since each node is equipped with 32 physical cores and 4 V100 GPUs and one GPU therefore counts as 8 cores.
A parallel (MPI) program using GPUs and needing more than 10 minutes can likewise be executed in an interactive SLURM batch job, using the "salloc" command in place of "srun --pty bash". For instance:
> salloc -N1 --ntasks-per-node=16 --gres=gpu:2 -A <account_name> --time=01:00:00
Again, the job is queued and scheduled as any other job and, when executed, a new session starts on the login node from which salloc was launched (the hostname at the prompt will be that of the login node). You can now run your parallel program on the assigned compute node(s) as in any slurm parallel job:
> srun ./myprogram
or
> mpirun ./myprogram
srun/mpirun will dispatch the tasks of the program myprogram to the assigned compute node, i.e., the tasks do not run on the login node hosting the salloc session.
Please note that the recommended way to launch parallel tasks in slurm jobs is with srun. By using srun vs mpirun you will get full support for process tracking, accounting, task affinity, suspend/resume and other features.
A hybrid parallel (MPI/OpenMP) program using GPUs and needing more than 10 minutes can also be executed in an interactive SLURM batch job with the "salloc" command. For instance:
> salloc -N1 --ntasks-per-node=4 --cpus-per-task=4 --gres=gpu:2 -A <account_name> --time=01:00:00
> export OMP_NUM_THREADS=4
> srun ./myprogram
The above request reflects the configuration in which each task is assigned a physical core with its four threads. However, you can choose the tasks/threads ratio that best suits your application, and ask for a number of tasks that yields a number of logical cores equal to the number of MPI processes times the number of OMP threads per task. For instance, for 4 MPI processes and 16 OMP threads per task, you need 64 logical cores, hence 16 physical cores:
> salloc -N1 --ntasks-per-node=16 --gres=gpu:2 -A <account_name> --time=01:00:00 # this will assign 16 physical cores with 4 HTs each
> export OMP_NUM_THREADS=16
> srun --ntasks-per-node=4 (--cpu-bind=core ) --cpus-per-task=16 -m block:block ./myprogram
The -m flag specifies the desired process distribution across nodes/sockets/cores (the default is block:cyclic). Please refer to the srun manual for more details on process distribution and binding. Note that the binding flag (--cpu-bind=core, shown in parentheses above) is required to obtain the correct process binding when the -m flag is not used.
You can then set the OpenMP thread affinity by exporting the OMP_PLACES variable.
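The sizing rule described above (logical cores = MPI processes times OMP threads, with 4 hardware threads per physical core) can be sketched as follows. The helper name physical_cores_needed is hypothetical, and the OMP_PLACES value is only one possible choice among the standard OpenMP settings, not a site recommendation.

```shell
#!/bin/bash
# Sketch: how many physical cores (i.e. tasks to request with salloc/sbatch)
# are needed for a given MPI x OpenMP layout, assuming 4 threads per core.
THREADS_PER_CORE=4

# physical_cores_needed is a hypothetical helper name used only for illustration.
physical_cores_needed() {
    local mpi_tasks=$1
    local omp_threads=$2
    local logical=$(( mpi_tasks * omp_threads ))
    # Round up to a whole number of physical cores.
    echo $(( (logical + THREADS_PER_CORE - 1) / THREADS_PER_CORE ))
}

physical_cores_needed 4 16   # 4 x 16 = 64 logical cpus, prints: 16

# Assumption: pin one OpenMP thread per hardware thread; "threads" is a
# standard OMP_PLACES value, but the best setting depends on the application.
export OMP_NUM_THREADS=16
export OMP_PLACES=threads
```

The printed value is the number to pass as --ntasks-per-node when requesting the allocation.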
For all the mentioned cases, SLURM automatically exports the environment variables you defined in the source shell. If you need to run your program "myprogram" in a controlled environment (e.g. with specific library paths or options), you can therefore prepare the environment in the originating shell and be sure to find it in the interactive shell (started with either srun or salloc).
Batch
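The information reported here refers to the general user M100 partition. The production environment for EUROfusion users is discussed in a separate document.
As usual on systems using SLURM, you can submit a script script.x using the command: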
> sbatch script.x
You can get a list of defined partitions with the command:
> sinfo
You can simplify the output of the sinfo command by specifying the output format via the "-o" option. A minimal output is obtained, for instance, with:
> sinfo -o "%10D %20F %P"
which shows, for each partition, the total number of nodes and the number of nodes by state in the format "Allocated/Idle/Other/Total".
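The "Allocated/Idle/Other/Total" field is easy to split in a script. The following is a minimal sketch, assuming the field has exactly that four-part form; the helper name idle_nodes and the sample string are illustrative, not real sinfo output.

```shell
#!/bin/bash
# Sketch: extract the idle-node count from an "Allocated/Idle/Other/Total"
# field such as the one printed by the %F format of sinfo.

# idle_nodes is a hypothetical helper name used only for illustration.
idle_nodes() {
    local aiot=$1
    local allocated idle other total
    IFS=/ read -r allocated idle other total <<< "$aiot"
    echo "$idle"
}

idle_nodes "850/20/10/880"   # made-up sample value, prints: 20

# On the cluster, the field could be fed in from sinfo (not run here):
#   sinfo -h -o "%F %P" | while read -r aiot partition; do
#       echo "$partition: $(idle_nodes "$aiot") idle nodes"
#   done
```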
Please note that the recommended way to launch parallel tasks in slurm jobs is with srun. By using srun vs mpirun you will get full support for process tracking, accounting, task affinity, suspend/resume and other features.
For more information and examples of job scripts, see section Batch Scheduler SLURM.
Submitting serial Batch jobs
The m100_all_serial partition is available with a maximum walltime of 4 hours, 1 task and 7600 MB per job. It runs on two dedicated nodes (equipped with 4 Volta GPUs), and it is designed for serial pre/post-processing and analysis (with or without GPUs), and for moving your data (via rsync, scp, etc.) when the transfer takes more than 10 minutes. This is the default partition, which SLURM assumes if you do not explicitly request a partition with the flag "--partition" or "-p". You can nevertheless request it explicitly in your batch script with the directive:
#SBATCH -p m100_all_serial
Submitting Batch jobs for production
Not all of the partitions are open to the academic community, as some are reserved for dedicated classes of users (for example, the *_fua_* partitions are for EUROfusion users):
- m100_fua_prod and m100_fua_dbg are reserved for EUROfusion users, for production and debugging respectively
- m100_usr_prod and m100_usr_dbg are open to academic production.
Each node exposes itself to SLURM as having 32 cores, 4 GPUs and XXXX memory. SLURM assigns nodes in a shared way, giving each job only the resources it requires and allowing multiple jobs to run on the same node(s). If you want the node(s) in exclusive mode, ask for all the resources of the node (either ntasks-per-node=32 or mem=XXXX).
The maximum memory which can be requested is XXXX MB (average memory per physical core ~ 7 GB); this value guarantees that no memory swapping will occur.
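As a rough guide to sizing the --mem request in proportion to the number of cores, the sketch below scales the node memory by the fraction of cores used. It assumes 246000 MB per node, which is the "max memory per node" value listed in the summary table and not a figure confirmed by this paragraph; the helper name mem_for_cores is hypothetical.

```shell
#!/bin/bash
# Sketch: memory (in MB) proportional to the number of requested cores.
# Assumption: 246000 MB per node, taken from the "max memory per node"
# column of the summary table; adjust if the actual limit differs.
NODE_MEM_MB=246000
CORES_PER_NODE=32

# mem_for_cores is a hypothetical helper name used only for illustration.
mem_for_cores() {
    local cores=$1
    echo $(( cores * NODE_MEM_MB / CORES_PER_NODE ))
}

mem_for_cores 4   # prints: 30750
```

Under this assumption one core corresponds to about 7687 MB, consistent with the ~7 GB per core quoted above.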
For example, to request one core and one GPU in a production queue the following SLURM job script can be used:
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1 # this refers to the number of requested gpus per node, and can vary between 1 and 4
#SBATCH -A <account_name>
#SBATCH --mem=7100 # this refers to the requested memory per node with a maximum of XXXXXX
#SBATCH -p m100_usr_prod
#SBATCH --time 00:10:00 # format: HH:MM:SS
#SBATCH --job-name=my_batch_job
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<user_email>
srun ./myexecutable
Users with exhausted but still active projects are allowed to keep using the cluster resources, albeit at a very low priority, by adding the "qos_lowprio" QOS to their job:
#SBATCH --qos=qos_lowprio
This QOS is automatically associated with EUROfusion users once their projects exhaust the budget before their expiry date. All other users should write to superc@cineca.it to request the QOS association.
Summary
In the following table you can find all the main features and limits imposed on the queues/Partitions of M100.
SLURM partition | Job QOS | # cores/# GPU per job | max walltime | max running jobs per user / max n. of cpus/nodes/GPUs per user | max memory per node (MB) | priority | notes |
---|---|---|---|---|---|---|---|
m100_all_serial (default partition) | normal | max = 1 core, 1 GPU (max mem = 7600 MB) | 04:00:00 | 4 cpus/1 GPU | - | 40 | |
m100_usr_prod | m100_qos_dbg | max = 2 nodes | 02:00:00 | 2 nodes/64 cpus/8 GPUs | 246000 | 45 | runs on 12 nodes; #SBATCH -p m100_usr_prod; #SBATCH --qos=m100_qos_dbg |
m100_usr_prod | normal | max = 16 nodes | 24:00:00 | 10 jobs | 246000 | 40 | runs on 880 nodes; #SBATCH -p m100_usr_prod |
m100_usr_prod | m100_qos_bprod | min = 17 nodes, max = 256 nodes | 24:00:00 | 256 nodes | 246000 | 85 | runs on 256 nodes; #SBATCH -p m100_usr_prod; #SBATCH --qos=m100_qos_bprod |
m100_fua_prod | m100_qos_fuadbg | max = 2 nodes | 02:00:00 | | 246000 | 45 | runs on 12 nodes; #SBATCH -p m100_fua_prod; #SBATCH --qos=m100_qos_fuadbg |
m100_fua_prod | normal | max = 16 nodes | 1-00:00:00 | | 246000 | 40 | runs on 68 nodes; #SBATCH -p m100_fua_prod |
 | qos_special | >256 nodes | >24:00:00 | | 246000 | 40 | #SBATCH --qos=qos_special; request to superc@cineca.it |
 | qos_lowprio | max = 16 nodes | 24:00:00 | | 246000 | 0 | #SBATCH --qos=qos_lowprio; Non-Eurofusion users: request to superc@cineca.it |
Graphic session
If a graphic session is desired, we recommend using the RCM (Remote Connection Manager) tool. For additional information, see the Remote Visualization section of our User Guide.
Programming environment
The programming environment of the M100 cluster consists of a choice of compilers for the main scientific languages (Fortran, C and C++), debuggers to help users find bugs and errors in their codes, and profilers to help with code optimisation.
In general, you must "load" the correct environment even to use programming tools like compilers, since "native" compilers are not available.
If you use a given set of compilers and libraries to create your executable, you will very probably have to define the same "environment" when you run it. This is because linking is dynamic by default on Linux systems, so at runtime the application needs the compiler shared libraries as well as other proprietary libraries. This means that you have to specify "module load" for compilers and libraries both at compile time and at run time. To minimize the number of modules needed at runtime, use static linking when compiling your applications.
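One way to see which shared libraries an executable still needs at runtime is the standard ldd tool. The following is a minimal sketch; the helper name missing_libs and the path ./myexecutable are illustrative, and the check only reports what the dynamic loader cannot resolve in the current environment.

```shell
#!/bin/bash
# Sketch: report shared libraries that cannot be resolved for a binary in the
# current environment (e.g. because the corresponding module is not loaded).

# missing_libs is a hypothetical helper name used only for illustration.
missing_libs() {
    local binary=$1
    # ldd prints "not found" next to every unresolved shared library.
    ldd "$binary" 2>/dev/null | grep "not found"
}

# ./myexecutable is a placeholder path.
if missing_libs ./myexecutable; then
    echo "Some libraries are unresolved: load the modules used at compile time."
fi
```

If the function prints nothing, every dynamic dependency was resolved with the modules currently loaded.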
Compilers
You can check the complete list of available compilers on MARCONI with the command:
> module available
and checking the "compilers" section. The available compilers are:
- XL
- PGI
- GNU
- CUDA
XL
The XL compiler family offers C, C++, and Fortran compilers designed for optimization and improvement of code generation, exploiting the inherent opportunities in Power Architecture.
The xl/16.1.1--binary module provides:
- IBM XL C/C++ and Fortran compilers 16.1.1
- IBM XL Shared-memory parallelism (SMP) runtime library/environment 5.1.1
- Mathematical Acceleration Subsystem (MASS) Libraries 9.1.1
The invocation commands of the XL C/C++ and Fortran compilers are:
Invocations | Usage (supported standards) |
---|---|
xlc, xlc_r | Compile C source files. (ANSI C89, ISO C99, IBM language extensions) |
xlc++, xlc++_r, xlC, xlC_r | Compile C++ source files. |
cc, cc_r | Compile legacy code that does not conform to Standard C. (pre-ANSI C) |
c89, c89_r | Compile C source files with strict conformance to the C89 standard. (ANSI C89) |
c99, c99_r | Compile C source files with strict conformance to the C99 standard. (ISO C99) |
xlf, xlf_r, f77, fort77 | Compile FORTRAN 77 source files |
xlf90, xlf90_r, f90 | Compile FORTRAN 90 source files |
xlf95, xlf95_r, f95 | Compile FORTRAN 95 source files |
xlf2003, xlf2003_r, f2003 | Compile FORTRAN 2003 source files |
xlf2008, xlf2008_r, f2008 | Compile FORTRAN 2008 source files |
xlcuf | Compile CUDA FORTRAN source files |
The main difference between these commands is that they use different default options (which are set in the configuration files /cineca/prod/opt/compilers/xl/16.1.1/binary/xlC/16.1.1/etc/xlc.cfg.rhel.7.6.gcc.8.4.0.cuda.10.1 and /cineca/prod/opt/compilers/xl/16.1.1/binary/xlf/16.1.1/etc/xlf.cfg.rhel.7.6.gcc.8.4.0.cuda.10.1 respectively for the C/C++ and Fortran compilers).
All the invocation commands can be used to link programs that use multithreading. The _r versions are for backward-compatibility.
To learn more about the XL compilers for Linux, access the online product documentation in the IBM Knowledge Center for the XL C/C++ compiler and the XL Fortran compiler.
OpenMP parallelization is enabled by the -qsmp compiler option. If -qsmp=omp is specified, strict OpenMP compliance is enforced on the compiled program. Please refer to the official documentation on OpenMP support in the IBM XL compilers.
PORTLAND Group (PGI)
Initialize the environment with the module command:
...