hostname: login.eurora.cineca.it

First EURORA system (GPU & MIC enabled):

A separate document about GPU usage is available in the Other documents repository.

A quick guide to Intel MIC usage is available here.

The fastest GPU-based Italian cluster for public research

The new prototype supercomputer EURORA (EURopean many integrated cORe Architecture) is the result of a project funded within the PRACE 2IP framework. The goal of the project was the evaluation of a new, fully European computing architecture for the next generation of Tier-0 systems. The original proposal was submitted to PRACE in November 2011 by CINECA and its European partners (GRNET, Greece; IPB, Serbia; NCSA, Bulgaria) and approved in May 2012. The procurement was committed to the Italian vendor Eurotech, which delivered the supercomputer to CINECA in January 2013.

The new supercomputer design addresses the most important current HPC constraints (sustainable performance, space occupancy and cost) by combining hybrid technology with efficient cooling and a custom interconnection system. It is equipped with state-of-the-art accelerators (Intel Xeon Phi and Kepler GPUs), and its primary use should exploit these features.

System Architecture

Model: Eurora prototype
Architecture: Linux Infiniband Cluster
Processors Type: 
 - Intel Xeon (Eight-Core SandyBridge) E5-2658 2.10 GHz (Compute)
 - Intel Xeon (Eight-Core SandyBridge) E5-2687W 3.10 GHz (Compute)
 - Intel Xeon (Six-Core Westmere) E5645 2.4 GHz (Login)
Number of nodes: 64 Compute + 1 Login
Number of cores: 1024 (compute) + 12 (login)
Number of accelerators: 64 nVIDIA Tesla K20 (Kepler) + 64 Intel Xeon Phi (MIC)
RAM: 1.1 TB (16 GB/Compute node + 32 GB/Fat node)
OS: RedHat CentOS release 6.3, 64 bit

Eurora is a Cluster made of 65 nodes of different types:

Compute Nodes: There are 64 16-core compute cards (nodes). Half of the nodes contain 2 Intel(R) Xeon(R) SandyBridge eight-core E5-2658 processors, with a clock rate of about 2 GHz, while the other half contain 2 Intel(R) Xeon(R) SandyBridge eight-core E5-2687W processors, with a clock rate of about 3 GHz. 58 compute nodes have 16 GB of memory, but the memory that can safely be allocated on a node is 14 GB (see PBS resources: memory allocation). The remaining 6 nodes (with processors at 3 GHz clock rate) have 32 GB of RAM. The Eurora cores are capable of 8 floating point operations per cycle (hyper-threading disabled). 32 compute cards have two nVIDIA K20 (Kepler) GPU cards and 32 compute cards have two Intel Xeon Phi (MIC) accelerators.

Login nodes: The login node has 2 Intel(R) Xeon(R) six-core Westmere E5645 processors at 2.4 GHz.

All the nodes are interconnected through a custom Infiniband network, allowing for a low latency/high bandwidth interconnection.

Disks and Filesystems

Eurora storage organisation conforms to the CINECA infrastructure (see Section  "Data storage and Filesystems"). In addition to the home directory ($HOME), each user can access the scratch area ($CINECA_SCRATCH) for storing run time data and files.

If your run produces or relies on large data files, the use of $CINECA_SCRATCH is mandatory. All data needed during the batch execution has to be moved to $CINECA_SCRATCH before the run starts.
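For example (directory and file names are purely illustrative), input data can be staged to the scratch area before submitting the job:

> mkdir -p $CINECA_SCRATCH/myrun
> cp myinput.dat $CINECA_SCRATCH/myrun/
> cd $CINECA_SCRATCH/myrun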

Since all the filesystems, in particular those with a quota defined on them ($HOME), are based on GPFS (General Parallel File System), the usual unix command "quota" does not work. Use the local command "cindata" to query disk usage and quota ("cindata -h" for help):

  > cindata

Production Environment

Since Eurora is a general purpose system and it is used by several users at the same time, long production jobs must be submitted using a queuing system. This guarantees that the access to our resources is as fair as possible.

There are two different modes to use an HPC system: Interactive and Batch. In order to request the computational resources, either in interactive or in batch, you need to have a "budget" of hours to spend, organized in projects (called accounts). Please, refer to the Billing Policy features in the Accounting Section. The command "saldo -b" displays the list of accounts you have at your disposal, together with their validity period, the initial and the consumed amounts of hours:

  > saldo -b

Detailed information on how to use the GPUs on this system is available in a separate document (in the Other documents repository), while for a general discussion see the section "Production Environment and Tools".

 

Interactive

A serial program can be executed on Eurora login node in the standard UNIX way:

> ./program

This is allowed only for very short runs, since the interactive environment on the login node has a 10 minutes time limit: for longer runs please use the "batch" mode.

A parallel program can be executed interactively only within an "Interactive" PBS batch job, using the "-I" (capital i) option: the job is queued and scheduled as any PBS batch job but, when executed, the standard input, output, and error streams are connected to the terminal session from which qsub was submitted. You can then specify the resources you need (number of nodes, number of processors per node, number of GPUs, etc.) by using the "-l" (lower-case L) flag.

For example, for starting an interactive session with the MPI program myprogram, using 2 processors, enter the commands:

> qsub -A <account_no> -I -l select=1:ncpus=2:mpiprocs=2 -q debug -- /bin/bash
    qsub: waiting for job 2612.node129 to start
    qsub: job 2612.node129 ready
> mpirun ./myprogram
> ^D

As described in the System Architecture section, there are two processor types with different clock speed (2 and 3 GHz). You can select the nodes with 3 GHz of clock via the -l flag. For example, to require interactively 4 nodes with processors at 3 GHz enter the command:

> qsub -A <account_no> -I -l select=4:ncpus=16:cpuspeed=3GHz -q debug

Please notice that, if you don't specify the cpuspeed, the scheduler will first try to gather the requested resources from a single cpuspeed group (either 2 or 3 GHz) and, if it fails, it can select nodes from both groups (hence using nodes with different cpu speeds).

The default memory assigned to a job, if not explicitly requested, is 1GB per node.  Please, ensure to request the memory you need for your processes to run via the -l flag of qsub:

> qsub -A <account_no> -I -l select=4:ncpus=16:cpuspeed=3GHz:mem=14GB -q parallel

The default walltime limit defined on the available queues (see next section for their description) is one hour on all queues except the debug (where the default equals the maximum time allowed, i.e. 30 minutes). You can define the walltime you need via the -l flag:

> qsub -A <account_no> -I -l select=4:ncpus=16:cpuspeed=3GHz:mem=14GB,walltime=04:00:00 -q parallel

If you want to export variables to the interactive session, use the -v option. For example, if myprogram is not compiled statically, you have to define and export the LD_LIBRARY_PATH variable:

> export LD_LIBRARY_PATH= ...
> qsub -I -v LD_LIBRARY_PATH ...

Batch

Batch jobs are managed by the PBS batch scheduler, that is described in section "EURORA batch Scheduler PBS".

The resource usage on Eurora is enforced by the application of a fairshare policy. Fairshare is a method for ordering the start times of jobs based on the resource usage history of site members: jobs are started in order of how deserving they are according to this history.

On Eurora it is possible to submit jobs on different queues, each identified by different resource allocation. You submit a script with the command:

> qsub script

You can get a list of defined queues with the command:

> qstat -Q

In order to favour development activity (experimenting with either the Xeon Phi accelerators or the Kepler GPUs) on working days, different scheduling policies are defined in two time intervals: primetime (10 am - 6 pm weekdays) and non-primetime (6 pm - 10 am weekdays, Friday 6 pm - Monday 10 am). Primetime queues have a "p_" prefix, non-primetime queues have an "np_" prefix, while queues active in both intervals have no prefix.

job type      Max nodes   Max CPUs /      max wall time   time slot        Usage
                          GPUs or MICs
-------------------------------------------------------------------------------------------
debug             2          32 /  4         0:30:00      always           Debugging
p_devel           2          32 /  4         1:00:00      primetime        Developing
parallel         32         512 / 64         4:00:00      always           Parallel production
np_longpar        9         144 / 18         8:00:00      non-primetime    Long parallel production

The Usage column reports the suggested employment of each queue, reflecting both the maximum number of resources (CPUs, GPUs, MICs) and the maximum walltime allowed for each job type. Please note that:

  • with the exception of the p_devel queue, you do not need to specify the job type: according to the resources (number of cpus/gpus/mics) and the walltime specified, the scheduler will automatically route your job to the appropriate queue. If you do not specify resources and walltime, the job will be assigned one CPU and 30 minutes of walltime, and it will be scheduled in the debug queue.
  • the p_devel queue, which hosts jobs aimed at code development, debugging, and optimization on the (GPU/MIC) accelerators, needs instead to be specifically requested via the -q flag of qsub or the directive #PBS -q <queue_name>. The queue is defined only on specific nodes, configured to fully support profiling and performance tools (such as Intel VTune Amplifier) in event-based sampling data collection.

Please remember that the default memory assigned to a job, if not explicitly requested, is 1 GB per node.
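As a minimal sketch (job name, resource values and modules are illustrative and should be adapted to your case), a batch job script may look like the following:

#!/bin/bash
#PBS -N myjob
#PBS -A <account_no>
#PBS -l select=2:ncpus=16:mpiprocs=16:mem=14GB
#PBS -l walltime=2:00:00

cd $PBS_O_WORKDIR
module load intel openmpi/1.6.4--intel--cs-xe-2013--binary
mpirun ./myprogram

Submit it with "qsub script": given the requested resources and walltime, the scheduler will route a job like this one to the parallel queue.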

For more information and examples of job scripts, see the section for "EURORA Batch Scheduler PBS".  

 

Programming environment

The programming environment of the Eurora machine consists of a choice of compilers for the main scientific languages (Fortran, C and C++), debuggers to help users find bugs and errors in their codes, and profilers to help in code optimisation. In general you must "load" the correct environment also for using programming tools like compilers, since "native" compilers are not available.

If you use a given set of compilers and libraries to create your executable, you will most probably have to define the same "environment" when you want to run it. This is because, since linking is dynamic by default on Linux systems, at runtime the application will need the compiler shared libraries as well as other proprietary libraries. This means that you have to "module load" compilers and libraries both at compile time and at run time. To minimize the number of modules needed at runtime, use static linking when compiling your applications.
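For example (file names are illustrative), the same module has to be loaded both when building and when running a dynamically linked executable:

> module load intel
> ifort -o myexec myprog.f90
  ...
> module load intel       (same environment needed at run time, e.g. inside the batch job)
> ./myexec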

As described in the System Architecture section, the login node has a different type of processor (Xeon Westmere, with the SSE instruction set) with respect to the compute nodes (Xeon Sandy Bridge). The Sandy Bridge chip supports the Advanced Vector Extensions (AVX) instruction set which, if exploited, can enhance the performance of your code. By using the proper compilation flags, you can instruct the compiler to generate optimized code specialized for the Intel processor that executes your program.

If you want to start programming in a mixed environment using GPUs, please refer to a separate document in the documents repository.

If you want to start programming in a mixed environment using MIC, please refer to the quick guide.

Compilers

You can check the complete list of compilers available on Eurora with the command:

> module available

and checking the "compilers" section.

In general the available compilers are:

  • INTEL (ifort, icc, icpc) : ► module load intel
  • PGI - Portland Group (pgfortran, pghpf, pgcc, pgCC): ► module load pgi
  • GNU (gfortran, gcc, g++): ► module load gnu

After loading the appropriate module, use the "man" command to get the complete list of the flags supported by the compiler, for example:

> module load intel
> man ifort

There are some flags that are common to all these compilers; others are more specific. The most common ones are reported below for each compiler.

1. If you want to use a specific library or a particular include file, you have to give their paths, using the following options

-I/path_include_files specify the path of the include files
-L/path_lib_files -l<xxx> specify a library lib<xxx>.a in /path_lib_files

2. If you want to debug your code you have to turn off optimisation and turn on run time checkings: these flags are described in the following section.

3. If you want to compile your code for normal production you have to turn on optimisation by choosing a higher optimisation level

-O2 or -O3 
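For example (paths, library name and source file are illustrative), the options above can be combined in a single compilation command:

> icc -O2 -I/path_include_files -L/path_lib_files -o myexec mycode.c -lxxx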

Other flags are available for specific compilers and are reported later.

INTEL Compiler

Initialize the environment with the module command:

 > module load intel

The names of the Intel compilers are:

  • ifort: Fortran77 and Fortran90 compiler
  • icc: C compiler
  • icpc: C++ compiler

The documentation can be obtained with the man command after loading the relevant module:

> man ifort
> man icc

Some miscellaneous flags are described in the following:

-xavx             Generate optimized code for AVX
-extend_source    Extend over the 77 column F77's limit
-free / -fixed    Free/Fixed form for Fortran
-openmp           Enables the parallelizer to generate multi-threaded code based on OpenMP directives
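As an example (file names are illustrative), an OpenMP Fortran code can be built for the AVX-capable compute nodes with:

> module load intel
> ifort -O3 -xavx -openmp -o myexec myprog.f90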

PORTLAND Group (PGI)

Initialize the environment with the module command:

> module load pgi

The names of the PGI compilers are:

  • pgf77: Fortran77 compiler
  • pgf90: Fortran90 compiler
  • pgf95: Fortran95 compiler
  • pghpf: High Performance Fortran compiler
  • pgcc: C compiler
  • pgCC: C++ compiler

The documentation can be obtained with the man command after loading the relevant module:

> man pgf95
> man pgcc

Some miscellaneous flags are described in the following:

-tp=sandybridge-64   Specify the type of the sandybridge target processor
-Mvect=simd:256      Use vector AVX instructions (to be used with -tp=sandybridge-64)
-Mextend             Extend over the 77 column F77's limit
-Mfree / -Mfixed     Free/Fixed form for Fortran
-fast                Chooses generally optimal flags for the target platform
-mp                  Enables the parallelizer to generate multi-threaded code based on OpenMP directives
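As an example (file names are illustrative), a Fortran code can be built with the PGI compiler for the compute nodes as follows:

> module load pgi
> pgf90 -fast -tp=sandybridge-64 -Mvect=simd:256 -mp -o myexec myprog.f90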

GNU compilers

The names of the GNU compilers are:

  • gfortran: Fortran compiler
  • gcc: C compiler
  • g++: C++ compiler

The documentation can be obtained with the man command:

> man gfortran
> man gcc

Please notice that "man gfortran" only describes the additional flags of the Fortran compiler with respect to gcc; refer to the gcc manual for the common flags. Some miscellaneous flags are described in the following:

-mavx                        To enable the use of AVX instructions
-ffixed-line-length-132      To extend over the 77 column F77's limit
-ffree-form / -ffixed-form   Free/Fixed form for Fortran
-fopenmp                     Enables the parallelizer to generate multi-threaded code based on OpenMP directives
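As an example (file names are illustrative), an OpenMP Fortran code can be built with the GNU compilers as follows:

> module load gnu
> gfortran -O3 -mavx -fopenmp -o myexec myprog.f90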

 

GPU programming

The Eurora system is equipped with two nVIDIA K20 GPUs on each of the 32 GPU-equipped compute nodes. They can be addressed within C or Fortran programs by means of directives for a "pre-processor" made available by the CUDA toolkit. Detailed information about GPU programming is contained in a specific document (see GPGPU (General Purpose Graphics Processing Unit)).

Debuggers

If your code dies at runtime, there is a problem. In order to solve it, you can decide to analyse the core file (core files are not available with PGI compilers) or to run your code under a debugger.

Compiler flags

Whatever your decision, in any case you need to enable the compiler runtime checks by adding specific flags during the compilation phase. In the following we describe those flags for the different Fortran compilers: if you are using the C or C++ compiler, please check beforehand, because the flags may differ.

The following flags are generally available for all compilers and are mandatory for an easier debugging session:

-O0     Lower level of optimisation
-g      Produce debugging information

Other flags are compiler specific and are described in the following.

INTEL Fortran compiler

The following flags are useful (in addition to "-O0 -g") for debugging your code:

-traceback       generate extra information to provide source file traceback at run time
-fp-stack-check  generate extra code to ensure that the floating-point stack is in the expected state
-check bounds    enables checking for array subscript expressions
-fpe0            allows some control over floating-point exception handling at run-time
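For example (file names are illustrative), a debug build with the Intel Fortran compiler may look like:

> ifort -O0 -g -traceback -fp-stack-check -check bounds -fpe0 -o myexec myprog.f90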

PORTLAND Group (PGI) Compilers

The following flags are useful (in addition to "-O0 -g") for debugging your code:

-C                    Add array bounds checking
-traceback            Generate extra information to provide source file traceback at run time
-Ktrap=ovf,divz,inv   Controls the behavior of the processor when exceptions occur: FP overflow, divide by zero, invalid operands
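For example (file names are illustrative), a debug build with the PGI Fortran compiler may look like:

> pgf90 -O0 -g -C -traceback -Ktrap=ovf,divz,inv -o myexec myprog.f90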

GNU Fortran compilers

The following flags are useful (in addition to "-O0 -g") for debugging your code:

-Wall                                        Enables warnings pertaining to usage that should be avoided
-fbounds-check                               Checks for array subscripts
-ffpe-trap=zero,overflow,invalid,underflow   Specify the list of IEEE exceptions on which a Floating Point Exception should be raised
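For example (file names are illustrative; the list of trapped exceptions can be adjusted), a debug build with gfortran may look like:

> gfortran -O0 -g -Wall -fbounds-check -ffpe-trap=zero,overflow,invalid -o myexec myprog.f90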

Profilers (gprof)

In software engineering, profiling is the investigation of a program's behavior using information gathered as the program executes. The usual purpose of this analysis is to determine which sections of a program to optimize - to increase its overall speed, decrease its memory requirement or sometimes both.

A (code) profiler is a performance analysis tool that, most commonly, measures only the frequency and duration of function calls, but there are other specific types of profilers (e.g. memory profilers) in addition to more comprehensive profilers, capable of gathering extensive performance data.

gprof

The GNU profiler gprof is a useful tool for measuring the performance of a program. It records the number of calls to each function and the amount of time spent there, on a per-function basis. Functions which consume a large fraction of the run-time can be identified easily from the output of gprof. Efforts to speed up a program should concentrate first on those functions which dominate the total run-time.

gprof uses data collected via the -pg compiler flag to construct a text display of the functions within your application (call tree and CPU time spent in every subroutine). It also provides quick access to the profiled data, letting you identify the most CPU-intensive functions. The text display also lets you manipulate the output in order to focus on the application's critical areas.

Usage:

>  gfortran -pg -O3 -o myexec myprog.f90
> ./myexec
> ls -ltr
   .......
   -rw-r--r-- 1 aer0 cineca-staff    506 Apr  6 15:33 gmon.out
> gprof myexec gmon.out

It is also possible to profile at code line-level (see "man gprof" for other options). In this case you must also use the "-g" flag at compilation time:

>  gfortran -pg -g -O3 -o myexec myprog.f90
> ./myexec
> ls -ltr
   .......
   -rw-r--r-- 1 aer0 cineca-staff    506 Apr  6 15:33 gmon.out
> gprof -annotated-source myexec gmon.out

Scientific libraries (MKL)

MKL

The Intel Math Kernel Library (Intel MKL) enables improving performance of scientific, engineering, and financial software that solves large computational problems. Intel MKL provides a set of linear algebra routines, fast Fourier transforms, as well as vectorized math and random number generation functions, all optimized for the latest Intel processors, including processors with multiple cores.

Intel MKL is thread-safe and extensively threaded using OpenMP technology.

Documentation can be found in:

${MKLROOT}/../Documentation/en_US/mkl

To use the MKL in your code you need to load the module, then define includes and libraries at compile and link time:

> module load mkl
> icc -I$MKL_INC -L$MKL_LIB mycode.c -lmkl_intel_lp64 -lmkl_core -lmkl_sequential

With the Intel Compilers you can simply link the mkl library via the flag -mkl=sequential/parallel, for example:

> ifort -mkl=sequential mycode.f90
> icc -mkl=parallel mycode.c
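Since the parallel MKL is threaded with OpenMP, at run time you can control the number of threads used by the library through the standard OpenMP environment variable (the value below is only an example; the executable produced by the commands above is a.out by default):

> export OMP_NUM_THREADS=8
> ./a.out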

For more information please refer to the documentation.

Parallel programming

Parallel programming is based on the OpenMPI implementation of MPI. The library and special wrappers to compile and link your programs are contained in several "openmpi" modules, one for each supported compiler suite.

The four main parallel (MPI) compilation commands are:

  • mpif90 (Fortran90)
  • mpif77 (Fortran77)
  • mpicc (C)
  • mpiCC (C++)

These command names refer to wrappers around the actual compilers; they behave differently depending on the module you have loaded. On Eurora the "intel", "gnu", and "pgi" versions of OpenMPI are available. To load one of them, check the names with the "module avail" command, then load the relevant module:

> module avail openmpi
   openmpi/1.6.4--gnu--4.6.3
   openmpi/1.6.4--intel--cs-xe-2013--binary
   openmpi/1.6.4--pgi--12.10
> module load gnu openmpi/1.6.4--gnu--4.6.3
> man mpif90
> mpif90 -o myexec myprof.f90      (uses the gfortran compiler)
> module purge
> module load intel openmpi/1.6.4--intel--cs-xe-2013--binary
> man mpif90
> mpif90 -o myexec myprof.f90      (uses the ifort compiler)

In all cases the parallel applications have to be executed with the command:

> mpirun ./myexec

There are limitations on running parallel programs in the login shell. You should use the "Interactive PBS" mode, as described in the "Interactive" section, previously in this page.
