UG3.2: MARCONI100, the NVIDIA-accelerated partition of MARCONI

hostname: login.m100.cineca.it

early availability: April 2020

start of production: to be defined (2020)

This system will be in production at the beginning of 2020 as an upgrade of the "non conventional" partition of the Marconi Tier-0 system. It is an accelerated cluster based on Power9 chips and Volta NVIDIA GPUs, acquired by Cineca within the PPI4HPC European initiative.

System Architecture

Architecture: IBM Power 9 AC922
Internal Network: Mellanox Infiniband EDR DragonFly+
Storage: 8 PB (raw) GPFS of local storage

Login nodes: 8 Login IBM Power9 LC922 (similar to the compute nodes)

Model: IBM Power AC922 (Witherspoon)

Racks: 55 total (49 compute)
Nodes: 980
Processors: 2x16 cores IBM POWER9 AC922 at 3.1 GHz
Accelerators: 4 x NVIDIA Volta V100 GPUs, Nvlink 2.0, 16GB
Cores: 32 cores/node
RAM: 256 GB/node (230 GB/node usable)
Peak Performance: about 32 Pflop/s
Internal Network: Mellanox Infiniband EDR DragonFly+
Disk Space: 8 PB GPFS storage

Access

All the login nodes have an identical environment and can be reached with SSH (Secure Shell) protocol using the "collective" hostname:

> login.m100.cineca.it

which establishes a connection to one of the available login nodes.

For information about data transfer from other computers, please follow the instructions and caveats in the dedicated section Data storage, or in the document Data Management.

Accounting

For accounting information please consult our dedicated section.

The account_no (or project) is important for batch executions. You must specify the account_no to be charged by the scheduler, using the flag "-A":

#SBATCH -A <account_no>

Please remember that different projects are usually active on different hosts. With the "saldo -b" command you can list all the account_no associated with your username.

> saldo -b (reports projects defined on M100)

Please note that the accounting of the consumed core hours takes into account the requested memory and number of GPUs; please refer to the dedicated section.

Budget Linearization policy

On M100 a linearization policy for the usage of project budgets has been defined and implemented. For each account, a monthly quota is defined as:

monthTotal = (total_budget / total_no_of_months)

Starting from the first day of each month, the collaborators of any account are allowed to use the quota at full priority. As the budget is consumed, the jobs submitted from the account gradually lose priority, until the monthly budget (monthTotal) is fully consumed. From that moment on, their jobs are still considered for execution, but with a lower priority than the jobs from accounts that still have some monthly quota left.
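As a rough illustration of the formula above, the following sketch computes the monthly quota for a project. The budget and duration are invented values for a hypothetical account, and the integer division is a simplification; the actual figures for your projects are those reported by "saldo -b".

```shell
#!/bin/bash
# Hypothetical project figures (not taken from any real account).
total_budget=1200000      # total core-hours granted to the project
total_no_of_months=12     # duration of the project in months

# monthTotal = total_budget / total_no_of_months (integer division)
monthTotal=$(( total_budget / total_no_of_months ))

echo "monthTotal = ${monthTotal} core-hours"
```

With these values the account can use 100000 core-hours per month at full priority before its jobs start to lose priority.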

This policy is similar to those already applied by other major HPC centers in Europe and worldwide. The goal is to improve response time by giving users the opportunity to use the CPU hours assigned to their project in proportion to the project's actual size (total amount of core-hours).

Disks and Filesystems

The storage organization conforms to the CINECA infrastructure (see Section Data Storage and Filesystems).

In addition to the home directory $HOME, each user has a scratch area $CINECA_SCRATCH, a large disk for the storage of run time data and files.

A $WORK area is defined for each active project on the system, reserved for all the collaborators of the project. This is a safe storage area to keep run time data for the whole life of the project.

	Total Dimension (TB)	Quota (GB)	Notes
$HOME	200	50	permanent/backed up, user specific, local
$CINECA_SCRATCH	2,000	no quota	temporary, user specific, local; no backup; automatic cleaning procedure of data older than 40 days (the time interval can be reduced in case of critical usage ratio of the area; in this case, users will be notified via HPC-News)
$WORK	4,000	1,024	permanent, project specific, local; no backup; extensions can be considered if needed (mailto: superc@cineca.it)
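Because $CINECA_SCRATCH is cleaned automatically, it can be useful to check which files are candidates for removal. The sketch below is only an approximation: it assumes that "older than 40 days" refers to the modification time, which this guide does not state, and the function name list_old_files is made up for this example.

```shell
#!/bin/bash
# List regular files under a directory that have not been modified for more
# than 40 days.
# ASSUMPTION: "older" is interpreted as modification time (-mtime); the real
# cleaning procedure may use a different criterion.
list_old_files() {
    find "$1" -type f -mtime +40
}

# On the cluster the variable is defined by the CINECA environment; the check
# only keeps this sketch harmless on machines where it is not set.
if [ -n "${CINECA_SCRATCH:-}" ]; then
    list_old_files "$CINECA_SCRATCH"
fi
```

Files that must be kept for the whole life of the project belong in $WORK rather than in the scratch area.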

$DRES environment variable points to the shared repository where Data RESources are maintained. This is a data archive area available only on-request, shared with all CINECA HPC systems and among different projects. $DRES is not mounted on the compute nodes of the production partitions and can be accessed only from login nodes and from the nodes of the serial partition. This means that you cannot access it within a standard batch job: all data needed during the batch execution has to be moved to $WORK or $CINECA_SCRATCH before the run starts, either from the login nodes or via a job submitted to the serial partition.

Since all the filesystems are based on the IBM Spectrum Scale™ file system (formerly GPFS), the usual Unix command "quota" does not work. Use the local command cindata to query disk usage and quota ("cindata -h" for help):

> cindata

GPU and intra/inter connection environment

Marconi100 login and compute nodes host four Tesla Volta (V100) GPUs per node (CUDA compute capability 7.0). The most recent versions of the NVIDIA CUDA toolkit and of the Community Edition PGI compilers (supporting CUDA Fortran) are available in the module environment, together with a set of GPU-enabled libraries, applications and tools.

The topology of the node devices is as follows:

$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      NV3     SYS     SYS     0-63
GPU1    NV3      X      SYS     SYS     0-63
GPU2    SYS     SYS      X      NV3     64-127
GPU3    SYS     SYS     NV3      X      64-127

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks

Internode communication is based on a Mellanox Infiniband EDR network, and the OpenMPI and IBM Spectrum MPI libraries are configured to exploit the Mellanox Fabric Collective Accelerators (also on CUDA memories) and Messaging Accelerators.

NVIDIA GPUDirect technology is fully supported (shared memory, peer-to-peer, RDMA, async), enabling the use of CUDA-aware MPI.

Modules environment

As usual, the software modules are collected in different profiles and organized by functional category (compilers, libraries, tools, applications, ...).

The profiles are of two types: “domain” type (chem, phys, lifesc, ...) for the production activity and “programming” type (base and advanced) for compilation, debugging and profiling activities. They can be loaded together.

The "Base" profile is the default one. It is automatically loaded after login and it contains basic modules for the programming activities (xl, pgi and gnu compilers, math libraries, profiling and debugging tools, ...).

If you want to use a module placed under another profile, for example an application module, you will have to load the corresponding profile:

> module load profile/<profile name>

> module load autoload <module name>

To list all the profiles you have loaded, use the following command:

> module list

To find all the profiles, categories and modules available on M100, use the command “modmap”:

> modmap

Spack

...

Production environment

Since M100 is a general purpose system and it is used by several users at the same time, long production jobs must be submitted using a queuing system. This guarantees that the access to the resources is as fair as possible.
Roughly speaking, there are two different modes to use an HPC system: Interactive and Batch. For a general discussion see the section Production Environment and Tools.

Each node of Marconi100 consists of 2 Power9 sockets, each with 16 cores and 2 Volta GPUs (32 cores and 4 GPUs per node). Multi-threading is active with 4 threads per physical core (128 logical cpus in total).

Due to how the hardware is detected on a Power9 architecture, the numbering of (logical) cores follows the order of threading:

$ ppc64_cpu --info
Core   0:    0*    1*    2*    3*
Core   1:    4*    5*    6*    7*
Core   2:    8*    9*   10*   11*
Core   3:   12*   13*   14*   15*
.............. (Cores from 4 to 27)........................
Core  28:  112*  113*  114*  115*
Core  29:  116*  117*  118*  119*
Core  30:  120*  121*  122*  123*
Core  31:  124*  125*  126*  127*

Since the nodes can be shared by users, Slurm has been configured to allocate one task per physical core by default. Without this setting, the default on nodes with more than one ThreadsPerCore (as on Marconi100) would be to allocate one task per thread.

As a result of this configuration, each requested task is allocated a physical core with all its 4 threads. The use of --cpus-per-task as an sbatch directive is hence discouraged, since it can lead to incorrect allocation. You can then exploit the multithreading capability by running 4 MPI processes per physical core, or by suitably combining MPI processes and OpenMP threads, if adequate for your application.

Since a physical core (4 hardware threads, HTs) is assigned to each task, a maximum of 32 tasks per node can be requested (--ntasks-per-node), each task receiving (as mentioned) 4 logical cpus.
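The relation between physical cores and logical cpus can be sketched with a small helper. It assumes the regular numbering reported by ppc64_cpu, where physical core N owns the logical cpus 4N to 4N+3; the function name logical_cpus_of_core is made up for this example.

```shell
#!/bin/bash
# Print the logical cpu ids (hardware threads) that belong to a physical core.
# ASSUMPTION: core N owns logical cpus 4N .. 4N+3, as in the ppc64_cpu listing.
threads_per_core=4

logical_cpus_of_core() {
    core=$1
    t=0
    out=""
    while [ "$t" -lt "$threads_per_core" ]; do
        out="$out $(( core * threads_per_core + t ))"
        t=$(( t + 1 ))
    done
    echo "${out# }"    # drop the leading blank
}

logical_cpus_of_core 0     # prints: 0 1 2 3
logical_cpus_of_core 31    # prints: 124 125 126 127
```

A task placed on physical core 31 therefore has the logical cpus 124 to 127 available for its threads.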

Interactive

A serial program can be executed in the standard UNIX way:

> ./program

This is allowed only for very short runs on the login nodes, since the interactive environment has a 10-minute time limit.

A serial (or multithreaded) program using GPUs and needing more than 10 minutes can be executed interactively within an "Interactive" SLURM batch job, using the "srun" command: the job is queued and scheduled as any other job but, when executed, the remote standard input, output, and error streams are connected to the terminal session from which srun was launched.

For example, to start an interactive session on one node and one GPU launch the command:

> srun -N1 --ntasks-per-node=1 --gres=gpu:1 -A <account_name> --time=01:00:00 --pty /bin/bash

SLURM will then schedule your job to start, and your shell will be unresponsive until free resources are allocated for you. When the shell comes back with the prompt (the hostname at the prompt will be that of the assigned node), launch the program in the standard way:

> ./program

As mentioned above, the accounting of the consumed core hours takes into account also the memory and the number of requested GPUs (see the dedicated section). For instance, a job using one core and one GPU for one hour (with the default memory per core) will consume 8 core-hours (each node being equipped with 32 physical cores and 4 V100 GPUs).
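The 8 core-hours of the example above can be reproduced with the following sketch. It is not the official accounting rule, which is described in the dedicated section: it only assumes that each GPU counts as 32/4 = 8 cores and that a job is charged for the larger of its requested cores and its GPU-equivalent cores. Memory is deliberately ignored, and the function name charged_core_hours is made up.

```shell
#!/bin/bash
# ASSUMPTION: a job is charged for the larger of (a) the cores it requests and
# (b) the share of the node implied by its GPUs (32 cores / 4 GPUs = 8 cores
# per GPU). The requested memory is NOT considered in this sketch.
cores_per_node=32
gpus_per_node=4

charged_core_hours() {
    cores=$1
    gpus=$2
    hours=$3
    gpu_equiv=$(( gpus * cores_per_node / gpus_per_node ))
    if [ "$cores" -gt "$gpu_equiv" ]; then
        billed=$cores
    else
        billed=$gpu_equiv
    fi
    echo $(( billed * hours ))
}

charged_core_hours 1 1 1     # one core, one GPU, one hour: prints 8
charged_core_hours 16 2 1    # 16 cores, two GPUs, one hour: prints 16
```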

A parallel (MPI) program using GPUs and needing more than 10 minutes can also be executed in an interactive SLURM batch job, using the "salloc" command in place of "srun --pty bash". For instance:

> salloc -N1 --ntasks-per-node=16 --gres=gpu:2 -A <account_name> --time=01:00:00

Again, the job is queued and scheduled as any other job and, when executed, a new session starts on the login node from which salloc was launched (the hostname at the prompt will be that of the login node). You can now run your parallel program on the assigned compute node(s) as in any slurm parallel job:

> srun ./myprogram

or

> mpirun ./myprogram

srun/mpirun will dispatch the tasks of the program myprogram to the assigned compute node, i.e., the tasks do not run on the login node hosting the salloc session.

Please note that the recommended way to launch parallel tasks in slurm jobs is with srun. By using srun vs mpirun you will get full support for process tracking, accounting, task affinity, suspend/resume and other features.

A hybrid parallel (MPI/OpenMP) program using GPUs and needing more than 10 minutes can also be executed in an interactive SLURM batch job with the "salloc" command. For instance:

> salloc -N1 --ntasks-per-node=4 --cpus-per-task=4 --gres=gpu:2 -A <account_name> --time=01:00:00

> export OMP_NUM_THREADS=4

> srun ./myprogram

The above request reflects the configuration of assigning a physical core with its four threads to each task. However, you can choose the tasks/threads ratio that best suits your application, and request a number of tasks such that the number of logical cores equals the number of MPI processes times the number of OMP threads per task. For instance, for 4 MPI processes and 16 OMP threads per task, you need 64 logical cores, hence 16 physical cores:

> salloc -N1 --ntasks-per-node=16 --gres=gpu:2 -A <account_name> --time=01:00:00   # this will assign 16 physical cores with 4 HTs each

> export OMP_NUM_THREADS=16

> srun --ntasks-per-node=4 (--cpu-bind=core )  --cpus-per-task=16 -m block:block ./myprogram

The -m flag lets you specify the desired process distribution across nodes/sockets/cores (the default is block:cyclic). Please refer to the srun manual for more details on process distribution and binding. Note that the binding flag (--cpu-bind=core) is required to obtain the correct process binding when the -m flag is not used.
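The arithmetic behind the hybrid example (4 MPI processes with 16 OMP threads each requiring 16 physical cores) can be sketched as follows. The helper name physical_cores_needed is made up, and rounding up to whole cores is an assumption for the case where the product is not a multiple of 4.

```shell
#!/bin/bash
# Number of physical cores (the value to pass to --ntasks-per-node at
# allocation time) needed for a hybrid MPI/OpenMP run on one node, given
# 4 hardware threads per physical core.
threads_per_core=4

physical_cores_needed() {
    mpi_procs=$1
    omp_threads=$2
    logical=$(( mpi_procs * omp_threads ))
    # Round up, so that a partially used core is still requested as a whole.
    echo $(( (logical + threads_per_core - 1) / threads_per_core ))
}

physical_cores_needed 4 16    # 64 logical cores: prints 16
physical_cores_needed 4 4     # 16 logical cores: prints 4
```

The result must not exceed 32, the number of physical cores of a node.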

You can then set the affinity of the OpenMP threads by exporting the OMP_PLACES variable.

For all the mentioned cases, SLURM automatically exports the environment variables you defined in the source shell. If you need to run your program "myprogram" in a controlled environment (e.g. with specific library paths or options), you can therefore prepare the environment in the original shell and be sure to find it in the interactive shell (started with either srun or salloc).

Batch

As usual on systems using SLURM, you can submit a script script.x using the command:

> sbatch script.x

You can get a list of defined partitions with the command:

> sinfo

You can simplify the output reported by the sinfo command specifying the output format via the "-o" option. A minimal output is reported, for instance, with:

> sinfo -o "%10D %20F %P"

which shows, for each partition, the total number of nodes and the number of nodes by state in the format "Allocated/Idle/Other/Total".
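If the node counts are needed in a script, the "Allocated/Idle/Other/Total" field can be split into its four numbers. The sketch below works on a made-up sample value, not on real sinfo output, and the function name summarize_nodes is hypothetical.

```shell
#!/bin/bash
# Split one "Allocated/Idle/Other/Total" field (the %F column of sinfo)
# into its four counts.
summarize_nodes() {
    old_ifs=$IFS
    IFS=/
    # Word splitting on "/" assigns the four counts to $1..$4.
    set -- $1
    IFS=$old_ifs
    echo "allocated=$1 idle=$2 other=$3 total=$4"
}

# Made-up sample value, for illustration only.
summarize_nodes "850/20/10/880"
```

For the sample value this prints "allocated=850 idle=20 other=10 total=880".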

Please note that the recommended way to launch parallel tasks in slurm jobs is with srun. By using srun vs mpirun you will get full support for process tracking, accounting, task affinity, suspend/resume and other features.

For more information and examples of job scripts, see section Batch Scheduler SLURM.

Submitting serial Batch jobs

The m100_all_serial partition is available with a maximum walltime of 4 hours, 1 task and 7600 MB per job. It runs on two dedicated nodes (equipped with 4 Volta GPUs), and it is designed for serial pre/post-processing analysis (with or without the GPUs) and for moving your data (via rsync, scp, etc.) when more than 10 minutes are required to complete the data transfer. This is the default partition, which SLURM assumes if you do not explicitly request a partition with the flag "--partition" or "-p". You can however explicitly request it in your batch script with the directive:

#SBATCH -p m100_all_serial

Submitting Batch jobs for production

Not all of the partitions are open to access by the academic community, as some are reserved to dedicated classes of users (for example, *_fua_* partitions are for EUROfusion users):

m100_fua_prod and m100_fua_dbg are reserved to EUROfusion users, respectively for production and debugging
m100_usr_prod and m100_usr_dbg are open to academic production.

Each node exposes itself to SLURM as having 32 cores, 4 GPUs and XXXX memory. SLURM assigns nodes in a shared way, giving each job only the resources it requires and allowing multiple jobs to run on the same node(s). If you want the node(s) in exclusive mode, ask for all the resources of the node (either ntasks-per-node=32 or mem=XXXX).

The maximum memory which can be requested is XXXXMB (average memory per physical core ~ 7GB) and this value guarantees that no memory swapping will occur.

For example, to request one core and one GPU in a production queue the following SLURM job script can be used:

#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1            # this refers to the number of requested gpus per node, and can vary between 1 and 4
#SBATCH -A <account_name>
#SBATCH --mem=7100              # this refers to the requested memory per node with a maximum of XXXXXX
#SBATCH -p m100_usr_prod
#SBATCH --time 00:10:00         # format: HH:MM:SS
#SBATCH --job-name=my_batch_job
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<user_email>

srun ./myexecutable
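For comparison, a job that takes a whole node can request all 32 physical cores and all 4 GPUs. The sketch below only writes such a job script to a file; the file name full_node_job.sh, the job name and the 30-minute walltime are arbitrary choices for this example, and <account_name> and ./myexecutable are placeholders to be replaced.

```shell
#!/bin/bash
# Write a sketch of a full-node job script to full_node_job.sh (hypothetical
# file name). The quoted 'EOF' keeps the placeholders from being expanded.
cat > full_node_job.sh << 'EOF'
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=32    # all 32 physical cores of the node
#SBATCH --gres=gpu:4            # all 4 GPUs of the node
#SBATCH -A <account_name>
#SBATCH -p m100_usr_prod
#SBATCH --time 00:30:00         # format: HH:MM:SS
#SBATCH --job-name=full_node_job

srun ./myexecutable
EOF

echo "Wrote full_node_job.sh; submit it with: sbatch full_node_job.sh"
```

Keep in mind that a job requesting the whole node is accounted for all of its cores and GPUs.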

Users with exhausted but still active projects are allowed to keep using the cluster resources, even if at a very low priority, by adding the "qos_lowprio" flag to their job:

#SBATCH --qos=qos_lowprio

This QOS is automatically associated with EUROfusion users once their projects exhaust the budget before their expiry date. All other users should write to superc@cineca.it to request the QOS association.

Summary

In the following table you can find all the main features and limits imposed on the queues/Partitions of M100.

SLURM partition	Job QOS	# cores/# GPUs per job	max walltime	max running jobs per user / max n. of cpus/nodes/GPUs per user	max memory per node (MB)	priority	notes
m100_all_serial (default partition)	normal	max = 1 core, 1 GPU (max mem = 7600 MB)	04:00:00	4 cpus/1 GPU	-	40	
m100_usr_prod	m100_qos_dbg	max = 2 nodes	02:00:00	2 nodes/64 cpus/8 GPUs	246000	45	runs on 12 nodes; #SBATCH -p m100_usr_prod; #SBATCH --qos=m100_qos_dbg
m100_usr_prod	normal	max = 16 nodes	24:00:00	10 jobs	246000	40	runs on 880 nodes; #SBATCH -p m100_usr_prod
m100_usr_prod	m100_qos_bprod	min = 17 nodes, max = 256 nodes	24:00:00	256 nodes	246000	85	runs on 256 nodes; #SBATCH -p m100_usr_prod; #SBATCH --qos=m100_qos_bprod
m100_fua_prod	m100_qos_fuadbg	max = 2 nodes	02:00:00		246000	45	runs on 12 nodes; #SBATCH -p m100_fua_prod; #SBATCH --qos=m100_qos_fuadbg
m100_fua_prod	normal	max = 16 nodes	1-00:00:00		246000	40	runs on 68 nodes; #SBATCH -p m100_fua_prod
	qos_special	>256 nodes	>24:00:00		246000	40	#SBATCH --qos=qos_special; request to superc@cineca.it
	qos_lowprio	max = 16 nodes	24:00:00		246000	0	#SBATCH --qos=qos_lowprio; Non-EUROfusion users: request to superc@cineca.it
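The node ranges in the table can be turned into a small helper that suggests a QOS for a job on m100_usr_prod. This is only a reading aid: it looks at the node count alone, ignores walltime, priority and the optional debug QOS, and the function name suggest_qos is made up.

```shell
#!/bin/bash
# Suggest the QOS for a job on m100_usr_prod from its node count, following
# the node ranges listed in the table (walltime limits are not checked).
suggest_qos() {
    nodes=$1
    if [ "$nodes" -le 16 ]; then
        echo "normal"
    elif [ "$nodes" -le 256 ]; then
        echo "m100_qos_bprod"
    else
        echo "qos_special"    # must be requested to superc@cineca.it
    fi
}

suggest_qos 8      # prints: normal
suggest_qos 64     # prints: m100_qos_bprod
suggest_qos 300    # prints: qos_special
```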

Graphic session

If a graphic session is desired, we recommend using the tool RCM (Remote Connection Manager). For additional information, visit the Remote Visualization section of our User Guide.

Programming environment

The programming environment of the M100 cluster consists of a choice of compilers for the main scientific languages (Fortran, C and C++), debuggers to help users find bugs and errors in their codes, and profilers to help with code optimisation.

In general you must "load" the correct environment also for using programming tools like compilers, since "native" compilers are not available.

If you use a given set of compilers and libraries to create your executable, you will most likely have to define the same "environment" when you want to run it. Since linking is dynamic by default on Linux systems, at runtime the application needs the compiler shared libraries as well as other proprietary libraries. This means that you have to specify "module load" for compilers and libraries both at compile time and at run time. To minimize the number of modules needed at runtime, use static linking when compiling your applications.

Compilers

You can check the complete list of available compilers on MARCONI with the command:

> module available

and checking the "compilers" section. The available compilers are:

XL
PGI
GNU
CUDA

XL

The XL compiler family offers C, C++, and Fortran compilers designed for optimization and improvement of code generation, exploiting the inherent opportunities in Power Architecture.

The xl/16.1.1--binary module provides:

IBM XL C/C++ and Fortran compilers 16.1.1
IBM XL Shared-memory parallelism (SMP) runtime library/environment 5.1.1
Mathematical Acceleration Subsystem (MASS) Libraries 9.1.1

The names of the XL C/C++ and Fortran compiler invocations are:

Invocations	Usage (supported standards)
xlc, xlc_r	Compile C source files. (ANSI C89, ISO C99, IBM language extensions)
xlc++, xlc++_r, xlC, xlC_r	Compile C++ source files.
cc, cc_r	Compile legacy code that does not conform to Standard C. (pre-ANSI C)
c89, c89_r	Compile C source files with strict conformance to the C89 standard. (ANSI C89)
c99, c99_r	Compile C source files with strict conformance to the C99 standard. (ISO C99)

xlf, xlf_r, f77, fort77	Compile FORTRAN 77 source files
xlf90, xlf90_r, f90	Compile Fortran 90 source files
xlf95, xlf95_r, f95	Compile Fortran 95 source files
xlf2003, xlf2003_r, f2003	Compile Fortran 2003 source files
xlf2008, xlf2008_r, f2008	Compile Fortran 2008 source files
xlcuf	Compile CUDA Fortran source files

The main difference between these commands is that they use different default options (which are set in the configuration files /cineca/prod/opt/compilers/xl/16.1.1/binary/xlC/16.1.1/etc/xlc.cfg.rhel.7.6.gcc.8.4.0.cuda.10.1 and /cineca/prod/opt/compilers/xl/16.1.1/binary/xlf/16.1.1/etc/xlf.cfg.rhel.7.6.gcc.8.4.0.cuda.10.1 respectively for the C/C++ and Fortran compilers).

All the invocation commands can be used to link programs that use multithreading. The _r versions are for backward-compatibility.

To learn more about the XL compilers for Linux, access the online product documentation in IBM Knowledge Center for the XL C/C++ compiler and the XL Fortran compiler.

OpenMP parallelization is enabled by the -qsmp compiler option. If -qsmp=omp is specified, strict OpenMP compliance is enforced on the compiled programs. Please refer to the official documentation on OpenMP support in the IBM XL compilers.

PORTLAND Group (PGI)

Initialize the environment with the module command:

> module load pgi

The names of the PGI compilers are:

pgf77: Fortran77 compiler
pgf90: Fortran90 compiler
pgf95: Fortran95 compiler
pghpf: High Performance Fortran compiler
pgcc: C compiler
pgCC: C++ compiler

The documentation can be obtained with the man command after loading the relevant module:

> man pgf95
> man pgcc

Some miscellaneous flags are described below:

-Mextend            Allow fixed-form source lines longer than the standard 72 columns
-Mfree / -Mfixed    Free/Fixed form for Fortran
-fast               Chooses generally optimal flags for the target platform
-fastsse            Chooses generally optimal flags for a processor that supports SSE instructions

GNU compilers

The GNU compilers are always available. They are not the best optimizing compilers, but they ensure maximum portability. You do not need to load the module to use them.

Initialize the environment with the module command:

> module load gnu

The names of the GNU compilers are:

g77: Fortran77 compiler
gfortran: Fortran95 compiler
gcc: C compiler
g++: C++ compiler

The documentation can be obtained with the man command:

> man gfortran

> man gcc

Some miscellaneous flags are described below:

-ffixed-line-length-132       Allow fixed-form source lines up to 132 columns (beyond the standard 72)
-ffree-form / -ffixed-form    Free/Fixed form for Fortran

CUDA

Compute Unified Device Architecture is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units (GPUs). With CUDA, developers are able to dramatically speed up computing applications by harnessing the power of GPUs.

In GPU-accelerated applications, the sequential part of the workload runs on the CPU (which is optimized for single-threaded performance), while the compute-intensive portion of the application runs on thousands of GPU cores in parallel. When using CUDA, developers program in popular languages such as C, C++, Fortran, Python and MATLAB and express parallelism through extensions in the form of a few basic keywords. For details, refer to the NVIDIA CUDA Parallel Computing Platform documentation.

...

Debugger and Profilers

If at runtime your code dies, then there is a problem. In order to solve it, you can decide to analyze the core file (core not available with PGI compilers) or to run your code using the debugger.

Compiler flags

Whatever your decision, you need to enable the compiler runtime checks by adding specific flags at compile time. In the following we describe those flags for the different Fortran compilers; if you are using the C or C++ compiler, please check the compiler documentation first, because the flags may differ.

The following flags are generally available for all compilers and are mandatory for an easier debugging session:

-O0     Lowest level of optimization (optimization disabled)
-g      Produce debugging information

Other flags are compiler specific and are described in the following:

XL Fortran compiler

to be added...

PORTLAND Group (PGI) Compilers

The following flags are useful (in addition to "-O0 -g") for debugging your code:

-C                     Add array bounds checking
-Ktrap=ovf,divz,inv    Controls the behavior of the processor when exceptions occur: 
                       FP overflow, divide by zero, invalid operands

GNU Fortran compilers

The following flags are useful (in addition to "-O0 -g") for debugging your code:

-Wall             Enables warnings pertaining to usage that should be avoided
-fbounds-check    Checks array subscripts against the declared bounds at run time

Debuggers available

Totalview

The TotalView debugger is a programmable tool that lets you debug, analyze, and tune the performance of complex serial, multiprocessor, and multithreaded programs.
TotalView has many features and it gives you a great number of tools for finding your program's problems.

Details on how to use TotalView can be found at

https://docs.roguewave.com/en/totalview/current/html/

Scalasca

Scalasca is a tool for profiling parallel scientific and engineering applications that make use of MPI and OpenMP.

Details on how to use Scalasca can be found at
http://www.scalasca.org/software/scalasca-2.x/documentation.html

PGI: pgdbg (serial/parallel debugger)

pgdbg is the Portland Group Inc. symbolic source-level debugger for F77, F90, C, C++ and assembly language programs. It is capable of debugging applications that exhibit various levels of parallelism.

GNU: gdb (serial debugger)

GDB is the GNU Project debugger and allows you to see what is going on 'inside' your program while it executes -- or what the program was doing at the moment it crashed.

VALGRIND

Valgrind is a framework for building dynamic analysis tools. There are Valgrind tools that can automatically detect many memory management and threading bugs, and profile your programs in detail. The Valgrind distribution currently includes six production-quality tools: a memory error detector, two thread error detectors, a cache and branch-prediction profiler, a call-graph generating cache profiler, and a heap profiler.

Valgrind is Open Source / Free Software, and is freely available under the GNU General Public License, version 2.

Profilers (gprof)

In software engineering, profiling is the investigation of a program's behavior using information gathered as the program executes. The usual purpose of this analysis is to determine which sections of a program to optimize, in order to increase its overall speed, decrease its memory requirement, or sometimes both.

A (code) profiler is a performance analysis tool that, most commonly, measures only the frequency and duration of function calls, but there are other specific types of profilers (e.g. memory profilers) in addition to more comprehensive profilers, capable of gathering extensive performance data.

gprof

The GNU profiler gprof is a useful tool for measuring the performance of a program. It records the number of calls to each function and the amount of time spent there, on a per-function basis. Functions which consume a large fraction of the run-time can be identified easily from the output of gprof. Efforts to speed up a program should concentrate first on those functions which dominate the total run-time.

gprof uses data collected by the -pg compiler flag to construct a text display of the functions within your application (call tree and CPU time spent in every subroutine). It also provides quick access to the profiled data, which let you identify the functions that are the most CPU-intensive. The text display also lets you manipulate the display in order to focus on the application's critical areas.

Usage:

>  gfortran -pg -O3 -o myexec myprog.f90
> ./myexec
> ls -ltr
   .......
   -rw-r--r-- 1 aer0 cineca-staff    506 Apr  6 15:33 gmon.out
> gprof myexec gmon.out

It is also possible to profile at code line-level (see "man gprof" for other options). In this case you must also use the “-g” flag at compilation time:

>  gfortran -pg -g -O3 -o myexec myprog.f90
> ./myexec
> ls -ltr
   .......
   -rw-r--r-- 1 aer0 cineca-staff    506 Apr  6 15:33 gmon.out
> gprof --annotated-source myexec gmon.out

It is also possible to profile MPI programs. In this case the environment variable GMON_OUT_PREFIX must be defined so that each task writes a different statistics file. After setting

export GMON_OUT_PREFIX=<name>

each task will, once the run is finished, create a file with its process ID (PID) as extension:

<name>.$PID

If the environment variable is not set, every task will write to the same gmon.out file.

Scientific libraries

Engineering and Scientific Subroutine Library (ESSL)

Scientific libraries designed for the Power architecture, included in the XL compiler package:

> module load essl/6.2.1

Documentation: https://www.ibm.com/support/knowledgecenter/SSFHY8/essl_content.html
