The topology of the node devices is as follows:

$ nvidia-smi topo -m

        GPU0   GPU1   GPU2   GPU3   CPU Affinity
GPU0     X     NV3    SYS    SYS    0-63
GPU1    NV3     X     SYS    SYS    0-63
GPU2    SYS    SYS     X     NV3    64-127
GPU3    SYS    SYS    NV3     X     64-127

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks


Internode communication is based on a Mellanox InfiniBand EDR network, and the OpenMPI and IBM Spectrum MPI libraries are configured to exploit the Mellanox Fabric Collective Accelerators (also on CUDA memories) and Messaging Accelerators.

NVIDIA GPUDirect technology is fully supported (shared memory, peer-to-peer, RDMA, async), enabling the use of CUDA-aware MPI.

Modules environment

As usual, the software modules are collected in different profiles and organized by functional category (compilers, libraries, tools, applications, ...).

The profiles are of two types: the “domain” type (chem, phys, lifesc, ...) for production activity and the “programming” type (base and advanced) for compilation, debugging and profiling activities. They can be loaded together.

The "Base" profile is the default one. It is automatically loaded after login and it contains basic modules for programming activities (xl, pgi and gnu compilers, math libraries, profiling and debugging tools, ...).

If you want to use a module placed under another profile, for example an application module, you will have to load the corresponding profile:

>module load profile/<profile name>
>module load autoload <module name>

For listing all profiles you have loaded use the following command:

>module list

In order to detect all profiles, categories and modules available on M100 the command “modmap” is available:

>modmap


Spack

...

Production environment

Since M100 is a general purpose system and it is used by several users at the same time, long production jobs must be submitted using a queuing system. This guarantees that the access to the resources is as fair as possible.
Roughly speaking, there are two different modes to use an HPC system: Interactive and Batch. For a general discussion see the section Production Environment and Tools.

Each node of Marconi100 consists of 2 Power9 sockets, each with 16 cores and 2 Volta GPUs (32 cores and 4 GPUs per node). Multi-threading is active with 4 threads per physical core (128 logical cpus in total).

Due to how the hardware is detected on a Power9 architecture, the numbering of (logical) cores follows the order of threading:

$ ppc64_cpu --info

Core   0:    0*    1*    2*    3*
Core   1:    4*    5*    6*    7*
Core   2:    8*    9*   10*   11* 
Core   3:   12*   13*   14*   15*

.............. (Cores from 4 to 27)........................

Core  28:  112*  113*  114*  115*
Core  29:  116*  117*  118*  119*  
Core  30:  120*  121*  122*  123*
Core  31:  124*  125*  126*  127*


Since the nodes can be shared by users, Slurm has been configured to allocate one task per (physical) core by default. Without this option, one task would be allocated per thread by default on nodes with more than one ThreadsPerCore (as is the case on Marconi100).

As a result of this configuration, each requested task is allocated a physical core with all its 4 threads. The use of --cpus-per-task as an sbatch directive is hence discouraged, as it can lead to incorrect allocation. You can then exploit the multithreading capability by running 4 MPI processes per physical core or by suitably combining MPI processes and OpenMP threads, if adequate for your application.

Since a physical core (4 HTs) is assigned to each task, a maximum of 32 tasks per node can be requested (--ntasks-per-node), with each task receiving (as mentioned) 4 logical cpus.
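The numbering described above (4 consecutive logical cpus per physical core) can be sketched with a small bash helper. This is only an illustration: the function names are hypothetical and the value 4 is taken from the text, not queried from the system.

```shell
#!/bin/bash
# Sketch: map physical cores to logical cpus on a node with 4 hardware
# threads per core, following the numbering shown by ppc64_cpu --info.
THREADS_PER_CORE=4   # from the text: 4 threads per physical core

# Print the logical cpu ids that belong to physical core $1 (0..31).
logical_cpus_of_core() {
    local core=$1 t
    local cpus=()
    for (( t = 0; t < THREADS_PER_CORE; t++ )); do
        cpus+=( $(( core * THREADS_PER_CORE + t )) )
    done
    echo "${cpus[*]}"
}

# Print the physical core that owns logical cpu $1 (0..127).
core_of_logical_cpu() {
    echo $(( $1 / THREADS_PER_CORE ))
}

logical_cpus_of_core 28    # prints: 112 113 114 115
core_of_logical_cpu 127    # prints: 31
```

A task placed on physical core 28 therefore has logical cpus 112 to 115 available for its threads.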

Interactive

A serial program can be executed in the standard UNIX way:

> ./program

This is allowed only for very short runs on the login nodes, since the interactive environment has a 10-minute time limit.

A serial (or multithreaded) program using GPUs and needing more than 10 minutes can be executed interactively within an "Interactive" SLURM batch job, using the "srun" command: the job is queued and scheduled as any other job but, when executed, the remote standard input, output, and error streams are connected to the terminal session from which srun was launched.

For example, to start an interactive session on one node and one GPU launch the command:

> srun -N1 --ntasks-per-node=1 --gres=gpu:1 -A <account_name> --time=01:00:00 --pty /bin/bash

SLURM will then schedule your job to start, and your shell will be unresponsive until free resources are allocated for you. When the shell comes back with the prompt (the hostname at the prompt will be that of the assigned node), launch the program in the standard way:

> ./program

As mentioned above, the accounting of the consumed core hours also takes into account the memory and the number of requested GPUs (see the dedicated section). For instance, a job using one core and one GPU for one hour (with the default memory per core) will consume 8 core-hours (each node has 32 physical cores and 4 V100 GPUs, so one GPU is accounted as 8 cores).

A parallel (MPI) program using GPUs and needing more than 10 minutes can also be executed within an interactive SLURM batch job, using the "salloc" command in place of "srun --pty bash". For instance:

> salloc -N1 --ntasks-per-node=16 --gres=gpu:2 -A <account_name> --time=01:00:00

Again, the job is queued and scheduled as any other job and, when executed, a new session starts on the login node from which salloc was launched (the hostname at the prompt will be that of the login node). You can now run your parallel program on the assigned compute node(s) as in any slurm parallel job:

> srun ./myprogram

or

> mpirun ./myprogram

srun/mpirun will dispatch the tasks of the program myprogram to the assigned compute node, i.e., the tasks do not run on the login node hosting the salloc session.

Please note that the recommended way to launch parallel tasks in slurm jobs is with srun. By using srun instead of mpirun you will get full support for process tracking, accounting, task affinity, suspend/resume and other features.

A hybrid parallel (MPI/OpenMP) program using GPUs and needing more than 10 minutes can also be executed in an interactive SLURM batch job with the "salloc" command. For instance:

> salloc -N1 --ntasks-per-node=4 --cpus-per-task=4 --gres=gpu:2 -A <account_name> --time=01:00:00 
> export OMP_NUM_THREADS=4
> srun ./myprogram

The above request reflects the configuration in which each task is assigned a physical core with its four threads. However, you can choose the tasks/threads ratio that best suits your application, and ask for a number of tasks that gives you a number of logical cores equal to the number of MPI processes multiplied by the number of OMP threads per task. For instance, for 4 MPI processes and 16 OMP threads per task, you need 64 logical cores, hence 16 physical cores:

> salloc -N1 --ntasks-per-node=16 --gres=gpu:2 -A <account_name> --time=01:00:00   # this will assign 16 physical cores with 4 HTs each
> export OMP_NUM_THREADS=16
> srun --ntasks-per-node=4 (--cpu-bind=core )  --cpus-per-task=16 -m block:block ./myprogram

The -m flag lets you specify the desired process distribution between nodes/sockets/cores (the default is block:cyclic). Please refer to the srun manual for more details on process distribution and binding. Note that the binding flag is required in order to obtain the correct process binding if the -m flag is not used.

You can then set the OMP affinity to threads by exporting the OMP_PLACES variable.
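The sizing rule above (logical cores = MPI processes × OMP threads per task, with 4 logical cores per physical core) can be sketched as a small bash helper. The function name is hypothetical, and the affinity values at the end are only one possible choice among the standard OpenMP settings, not a site recommendation.

```shell
#!/bin/bash
# Sketch: number of tasks (physical cores) to request with
# --ntasks-per-node so that <mpi_procs> x <omp_threads> logical cores
# are available on the node.
THREADS_PER_CORE=4   # from the text: 4 hardware threads per physical core

# physical_cores_needed <mpi_procs> <omp_threads_per_proc>
physical_cores_needed() {
    local logical=$(( $1 * $2 ))
    # round up to a whole number of physical cores
    echo $(( (logical + THREADS_PER_CORE - 1) / THREADS_PER_CORE ))
}

physical_cores_needed 4 16   # prints: 16
physical_cores_needed 4 4    # prints: 4

# One possible OpenMP affinity setting (assumed values, tune them for
# your application): one place per hardware thread, threads kept close.
export OMP_PLACES=threads
export OMP_PROC_BIND=close
```

The first call reproduces the example in the text: 4 MPI processes with 16 OMP threads each need 64 logical cores, hence 16 physical cores.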

In all the cases mentioned, SLURM automatically exports the environment variables you defined in the source shell. If you need to run your program "myprogram" in a controlled environment (e.g. with specific library paths or options), you can prepare the environment in the origin shell and be sure to find it in the interactive shell (started with either srun or salloc).

Batch

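The information reported here refers to the general user M100 partition. The production environment for EUROfusion users is discussed in a separate document.

As usual on systems using SLURM, you can submit a script script.x using the command: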

> sbatch script.x

You can get a list of defined partitions with the command:

> sinfo

You can simplify the output reported by the sinfo command specifying the output format via the "-o" option. A minimal output is reported, for instance, with:

> sinfo -o "%10D %20F %P"

which shows, for each partition, the total number of nodes and the number of nodes by state in the format "Allocated/Idle/Other/Total".
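As an illustration of the "Allocated/Idle/Other/Total" field described above, the following bash sketch extracts the idle-node count per partition from lines in that layout. The function name and the sample numbers are hypothetical. Lines whose second field is not four slash-separated integers (such as the header) are skipped.

```shell
#!/bin/bash
# Sketch: read lines laid out as "<nodes> <A/I/O/T> <partition>" on
# stdin and print "<partition> <idle nodes>" for each of them.
idle_nodes_per_partition() {
    local nodes states partition alloc idle other total
    while read -r nodes states partition; do
        # skip the header and any line without an A/I/O/T field
        [[ $states =~ ^[0-9]+/[0-9]+/[0-9]+/[0-9]+$ ]] || continue
        IFS=/ read -r alloc idle other total <<< "$states"
        echo "$partition $idle"
    done
}

# Hypothetical sample line, not real M100 data:
printf '%s\n' "980        900/60/20/980        m100_usr_prod" | idle_nodes_per_partition
# prints: m100_usr_prod 60

# On the cluster the same filter could be fed by sinfo:
#   sinfo -o "%10D %20F %P" | idle_nodes_per_partition
```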


Please note that the recommended way to launch parallel tasks in slurm jobs is with srun. By using srun instead of mpirun you will get full support for process tracking, accounting, task affinity, suspend/resume and other features.

For more information and examples of job scripts, see section Batch Scheduler SLURM.


Submitting serial Batch jobs


The m100_all_serial partition is available with a maximum walltime of 4 hours, 1 task and 7600 MB per job. It runs on two dedicated nodes (equipped with 4 Volta GPUs), and it is designed for serial pre/post-processing analysis (with or without GPUs) and for moving your data (via rsync, scp etc.) when the transfer needs more than 10 minutes to complete. This is the default partition, which SLURM assumes if you do not explicitly request a partition with the flag "--partition" or "-p". You can however explicitly request it in your batch script with the directive:

#SBATCH -p m100_all_serial

Submitting Batch jobs for production

Not all of the partitions are open to the academic community, as some are reserved for dedicated classes of users (for example, the *_fua_* partitions are for EUROfusion users):

  • m100_fua_prod and m100_fua_dbg are reserved for EUROfusion users, for production and debugging respectively
  • m100_usr_prod and m100_usr_dbg are open to academic production.

Each node exposes itself to SLURM as having 32 cores, 4 GPUs and XXXX memory. SLURM assigns nodes in a shared way, giving each job only the resources it requires and allowing multiple jobs to run on the same node(s). If you want the node(s) in exclusive mode, ask for all the resources of the node (either ntasks-per-node=32 or mem=XXXX).

The maximum memory which can be requested is XXXXMB (average memory per physical core ~ 7GB) and this value guarantees that no memory swapping will occur. 

For example, to request one core and one GPU in a production queue the following SLURM job script can be used:

#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1 # this refers to the number of requested gpus per node, and can vary between 1 and 4
#SBATCH -A <account_name>
#SBATCH --mem=7100 # this refers to the requested memory per node with a maximum of XXXXXX
#SBATCH -p m100_usr_prod
#SBATCH --time 00:10:00 # format: HH:MM:SS
#SBATCH --job-name=my_batch_job
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<user_email>


srun ./myexecutable


Users with exhausted but still active projects are allowed to keep using the cluster resources, although at a very low priority, by adding the "qos_lowprio" flag to their job:

#SBATCH --qos=qos_lowprio

This QOS is automatically associated with EUROfusion users once their projects exhaust the budget before their expiry date. All other users should write to superc@cineca.it to request the QOS association.

Summary


In the following table you can find all the main features and limits imposed on the queues/Partitions of M100. 



SLURM                Job QOS          # cores/# GPUs               max          max running jobs per user/  max memory  priority
partition                             per job                      walltime     max n. of cpus/nodes/GPUs   per node
                                                                                per user                    (MB)

m100_all_serial      normal           max = 1 core, 1 GPU          04:00:00     4 cpus/1 GPU                -           40
(default partition)                   (max mem = 7600 MB)
m100_usr_prod        m100_qos_dbg     max = 2 nodes                02:00:00     2 nodes/64 cpus/8 GPUs      246000      45
m100_usr_prod        normal           max = 16 nodes               24:00:00     10 jobs                     246000      40
m100_usr_prod        m100_qos_bprod   min = 17 nodes               24:00:00     256 nodes                   246000      85
                                      max = 256 nodes
m100_fua_prod        m100_qos_fuadbg  max = 2 nodes                02:00:00     -                           246000      45
m100_fua_prod        normal           max = 16 nodes               1-00:00:00   -                           246000      40
                     qos_special      >256 nodes                   >24:00:00    -                           246000      40
                     qos_lowprio      max = 16 nodes               24:00:00     -                           246000      0

Notes:

  • m100_usr_prod with m100_qos_dbg: runs on 12 nodes; use "#SBATCH -p m100_usr_prod" and "#SBATCH --qos=m100_qos_dbg"
  • m100_usr_prod with normal QOS: runs on 880 nodes; use "#SBATCH -p m100_usr_prod"
  • m100_usr_prod with m100_qos_bprod: runs on 256 nodes; use "#SBATCH -p m100_usr_prod" and "#SBATCH --qos=m100_qos_bprod"
  • m100_fua_prod with m100_qos_fuadbg: runs on 12 nodes; use "#SBATCH -p m100_fua_prod" and "#SBATCH --qos=m100_qos_fuadbg"
  • m100_fua_prod with normal QOS: runs on 68 nodes; use "#SBATCH -p m100_fua_prod"
  • qos_special: use "#SBATCH --qos=qos_special"; request to superc@cineca.it
  • qos_lowprio: use "#SBATCH --qos=qos_lowprio"; Non-EUROfusion users: request to superc@cineca.it
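As a rough illustration of how the limits in the table above translate into a choice of QOS on m100_usr_prod, here is a bash sketch. The function name is hypothetical, the walltime is simplified to whole hours, and the thresholds are simply those listed in the table. It is not an official selection tool.

```shell
#!/bin/bash
# Sketch: suggest a QOS for a job on m100_usr_prod from its node count
# and walltime (whole hours), using the limits listed in the table.
# "normal" means that no --qos directive is needed.
suggest_qos() {
    local nodes=$1 hours=$2
    if   (( nodes > 256 || hours > 24 )); then echo "qos_special"     # must be requested to superc@cineca.it
    elif (( nodes >= 17 ));               then echo "m100_qos_bprod"  # 17 to 256 nodes
    elif (( nodes <= 2 && hours <= 2 ));  then echo "m100_qos_dbg"    # short debug jobs
    else                                       echo "normal"          # up to 16 nodes, 24 hours
    fi
}

suggest_qos 1 1      # prints: m100_qos_dbg
suggest_qos 16 24    # prints: normal
suggest_qos 64 12    # prints: m100_qos_bprod
```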




Graphic session


If a graphic session is desired, we recommend using the tool RCM (Remote Connection Manager). For additional information visit the Remote Visualization section of our User Guide.

Programming environment

The programming environment of the M100 cluster consists of a choice of compilers for the main scientific languages (Fortran, C and C++), debuggers to help users find bugs and errors in their codes, and profilers to help with code optimisation.

In general you must "load" the correct environment even to use programming tools like compilers, since "native" compilers are not available.

If you use a given set of compilers and libraries to create your executable, you will most likely have to define the same "environment" when you want to run it. Since linking is dynamic by default on Linux systems, at runtime the application will need the compiler shared libraries as well as other proprietary libraries. This means that you have to specify "module load" for compilers and libraries both at compile time and at run time. To minimize the number of modules needed at runtime, use static linking when compiling your applications.
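One way to check whether the runtime environment matches the one used at compile time is to look for unresolved shared libraries in the output of ldd. The following bash sketch filters that output. The function name and the library name in the sample are hypothetical.

```shell
#!/bin/bash
# Sketch: list the shared libraries that the dynamic loader cannot
# resolve, reading ldd output on stdin. An unresolved entry usually
# means that a module used at compile time is not loaded in this shell.
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Hypothetical sample of ldd output (library names are placeholders):
printf '\t%s\n' "libexample.so.1 => not found" \
                "libc.so.6 => /lib64/libc.so.6 (0x0000000000000000)" | missing_libs
# prints: libexample.so.1

# Typical use (myexecutable is a placeholder name):
#   ldd ./myexecutable | missing_libs
```

If the list is not empty, loading the modules used at compile time should normally resolve the missing entries.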


Compilers

You can check the complete list of available compilers on M100 with the command:

> module available

and checking the "compilers" section. The available compilers are:

  • XL
  • PGI
  • GNU
  • CUDA

XL 

The XL compiler family offers C, C++, and Fortran compilers designed for optimization and improvement of code generation, exploiting the inherent opportunities in Power Architecture.

The xl/16.1.1--binary module provides:

  • IBM XL C/C++ and Fortran compilers 16.1.1
  • IBM XL Shared-memory parallelism (SMP) runtime library/environment 5.1.1
  • Mathematical Acceleration Subsystem (MASS) Libraries 9.1.1

The names of the XL C/C++ and Fortran compilers are:

Invocations                   Usage (supported standards)

xlc, xlc_r                    Compile C source files (ANSI C89, ISO C99, IBM language extensions)
xlc++, xlc++_r, xlC, xlC_r    Compile C++ source files
cc, cc_r                      Compile legacy code that does not conform to Standard C (pre-ANSI C)
c89, c89_r                    Compile C source files with strict conformance to the C89 standard (ANSI C89)
c99, c99_r                    Compile C source files with strict conformance to the C99 standard (ISO C99)

xlf, xlf_r, f77, fort77       Compile FORTRAN 77 source files
xlf90, xlf90_r, f90           Compile FORTRAN 90 source files
xlf95, xlf95_r, f95           Compile FORTRAN 95 source files
xlf2003, xlf2003_r, f2003     Compile FORTRAN 2003 source files
xlf2008, xlf2008_r, f2008     Compile FORTRAN 2008 source files
xlcuf                         Compile CUDA FORTRAN source files


The main difference between these commands is that they use different default options (which are set in the configuration files /cineca/prod/opt/compilers/xl/16.1.1/binary/xlC/16.1.1/etc/xlc.cfg.rhel.7.6.gcc.8.4.0.cuda.10.1 and /cineca/prod/opt/compilers/xl/16.1.1/binary/xlf/16.1.1/etc/xlf.cfg.rhel.7.6.gcc.8.4.0.cuda.10.1 respectively for the C/C++ and Fortran compilers).

All the invocation commands can be used to link programs that use multithreading. The _r versions are for backward-compatibility.

To learn more about the XL compilers for Linux, access the online product documentation in IBM Knowledge Center for the XL C/C++ compiler and the XL Fortran compiler.

OpenMP parallelization is enabled by the -qsmp compiler option. If -qsmp=omp is specified, strict OpenMP compliance is applied to the programs being compiled. Please refer to the official documentation on OpenMP support in the IBM XL compilers.


PORTLAND Group (PGI)


Initialize the environment with the module command:

...