Content

  • Introduction
  • GPGPU at GALILEO
  • Programming environment (how to write GPU enabled applications)
  • Production environment (how to run a GPU enabled application)
  • Accounting

    Introduction

    A GPU is a specialized device designed to rapidly manipulate large numbers of graphical pixels. Historically, GPUs were developed for advanced graphics and video games.

    More recently, interfaces have been built that let GPUs run code unrelated to graphics, for example linear algebra operations.

    General-purpose GPU computing, or GPGPU computing, is the use of a GPU (graphics processing unit) for general-purpose scientific and engineering computing.

    The model for GPU computing is to use a CPU and a GPU together in a heterogeneous co-processing computing model. The sequential part of the application runs on the CPU and the computationally intensive part is accelerated by the GPU. From the user’s perspective, the application simply runs faster because it exploits the high performance of the GPU.

    (courtesy of http://www.nvidia.com/object/GPU_Computing.html)

    The GPU has evolved over the years to have teraflops of floating point performance.

    NVIDIA revolutionized the GPGPU and accelerated computing world in 2006-2007 by introducing its new massively parallel architecture called “CUDA”.

    The success of GPGPUs in the past few years owes much to the ease of programming of the associated CUDA parallel programming model. In this programming model, application developers modify their application to take the compute-intensive kernels and map them to the GPU. The rest of the application remains on the CPU. Mapping a function to the GPU involves rewriting the function to expose its parallelism and adding “C” keywords to move data to and from the GPU.

    The K80 GPUs on GALILEO are based on the “Kepler” architecture. Each K80 consists of a dual-GPU board that combines 24 GB of memory with high memory bandwidth and up to 2.91 Tflops of double-precision performance with NVIDIA GPU Boost™. It is designed for the most demanding computational tasks and is well suited to single- and double-precision workloads that require both leading compute performance and high data throughput.

    GPGPU at GALILEO

    The GPU resources of the GALILEO cluster consist of 2 NVIDIA Tesla K80 "Kepler" boards per node, with compute capability 3.7. Since each K80 is a dual-GPU board, a total of 4 GPU devices are visible on each node.

    All the GPUs are configured with Error Correction Code (ECC) support enabled, which protects data in memory and so improves data integrity and reliability for applications. Registers, L1/L2 caches, shared memory, and DRAM are all ECC protected.

    At present (June 2015), the billing policy is based on the elapsed time on the requested cores and on the amount of memory requested for the job. It does not take the GPUs used into account (these are free of charge). The rationale is to encourage users to take as much advantage as possible of the GPUs. More information on the billing policy can be found here.

    Programming environment (how to write GPU enabled applications)

    All tools and libraries required in the GPU programming environment are contained in the CUDA toolkit. The CUDA toolkit is made available through the “cuda” module. When you need to start a programming session with GPUs, the first thing to do is to load the CUDA module.

    > module load profile/advanced
    > module load cuda

    This loads the most recent version of the package. At present (June 2015), the most recent version on GALILEO is CUDA 7.0.28. To list all the available versions, type:

    > module available cuda
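
    After loading the module, it can be useful to confirm that the compiler is actually reachable. The following is a minimal sketch, not part of the official GALILEO setup: have_tool is a hypothetical helper written for this example, and it only reports what it finds on the PATH.

```shell
# Report whether a tool is reachable through PATH.
# have_tool is a hypothetical helper written for this example.
have_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found (is the cuda module loaded?)"
    fi
}

have_tool nvcc

# Print the toolkit release only when the compiler is actually available.
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
fi
```

    If nvcc is reported as not found, the cuda module is probably not loaded in the current shell.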

    CUDA, in addition to the C compiler, provides optimized GPU-enabled scientific libraries for linear algebra, FFT, random number generators, and basic algorithms (such as sorting, reductions, signal processing, image processing, etc) through the following libraries:

    • CUBLAS: GPU-accelerated BLAS library
    • CUFFT: GPU-accelerated FFT library
    • CUSPARSE: GPU-accelerated Sparse Matrix library
    • CURAND: GPU-accelerated RNG library
    • CUSOLVER: provides a collection of dense and sparse direct solvers which deliver significant acceleration for Computer Vision, CFD, Computational Chemistry, and Linear Optimization applications.
    • CUDA NPP: NVIDIA Performance Primitives
    • THRUST: a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL).

    The CUDA C compiler is nvcc. It is important to remember that the modules for the required compilers or MPI/OpenMP libraries must be loaded before the CUDA module.

    In order to take full advantage of the GPUs' capabilities, you should add the -arch=sm_37 switch to the nvcc command.

    Example 1: how to compile a serial C program with CUDA (using the CUBLAS library)

    cd $CINECA_SCRATCH/test/ 
    module load gnu
    module load cuda
    nvcc -arch=sm_37 -I$CUDA_INC -L$CUDA_LIB -lcublas -o myprog myprog.c

    Example 2: how to compile a C MPI program with CUDA (using a provided makefile)

    module load …
    module load gnu
    module load openmpi/1.8.5--gnu--4.9.2
    module load cuda
    make

    Note that the PGI C and Fortran compilers provide their own CUDA library and CUDA extensions to the programming languages. Therefore, with these compilers you do not need to load any CUDA module.

    Production environment (how to run a GPU enabled application)

    Access to computational resources is granted through job requests to the resource manager. The resource manager is a program that runs on the front-end and listens to users’ requests.

    A job request typically consists of:

    •  resource specification: the kind and amount of resources you want for your job;
    •  job script: a shell script with the sequence of commands and controls needed to carry out your job.

    On GALILEO, it is possible to submit jobs of different types using only one "routing" queue: just declare how many resources you need and your job will be directed into the right queue with the appropriate priority, where it waits until the requested resources become available. The job is then processed and subsequently removed from the queue.

    On our GPGPU systems the resource manager is PBS. More information on PBS can be found in the HPC User Guide.

    Job submission

    Job requests are submitted to the resource manager (PBS) using the qsub command:

    $ qsub [opts] my_job_script.sh

    where [opts] specifies the resources and settings required by the job:

    -l select=<N>:ncpus=<C>:ngpus=<G>:mpiprocs=<P>

    asks for <N> nodes and, for each of them, <C> cores, <G> GPUs and <P> MPI tasks;

    -l walltime=hh:mm:ss 

    specifies the maximum duration of your job;

    -A <account_no>

    specifies the project account to be charged for the job. If needed, you can find it with the “saldo -b” command (see the "Accounting" page for more details).

    Other useful qsub switches can be found in the Batch Scheduler PBS section of the HPC User Guide and in the manual pages (see “man qsub”).

    For example, if you need one core and one GPU for three hours, submit your job as follows:

    $ qsub -l select=1:ncpus=1:ngpus=1 -l walltime=3:00:00 -A <project> my_script.sh

    or, if you need 4 cores and two GPUs for three hours,

    $ qsub -l select=1:ncpus=4:ngpus=2 -l walltime=3:00:00 -A <project> my_script.sh

    As another example, if you want your job script my_script.sh to run for 3 hours on 2 nodes with 16 cores and 4 GPUs for each node (with a total of 32 cores and 8 GPUs), using credit from project gran10, you can use the following command:

    $ qsub -l select=2:ncpus=16:ngpus=4:mpiprocs=2 -l walltime=3:00:00 -A gran10 my_script.sh

    The previous example will assign you two “full” nodes, i.e. 32 cores and 8 GPUs.
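
    The totals quoted in these examples are simply the per-node values multiplied by the number of nodes. As an illustrative sketch (build_select and job_totals are hypothetical helpers written for this page, not PBS or CINECA commands), the following prints the resource string and the totals without submitting anything:

```shell
# Build a PBS "select" resource string.
# Arguments: nodes, cores per node, GPUs per node, MPI tasks per node.
# build_select is a hypothetical helper, not a PBS or CINECA command.
build_select() {
    nodes=$1; ncpus=$2; ngpus=$3; mpiprocs=$4
    echo "select=${nodes}:ncpus=${ncpus}:ngpus=${ngpus}:mpiprocs=${mpiprocs}"
}

# Total cores and GPUs are the per-node values times the number of nodes.
# Arguments: nodes, cores per node, GPUs per node.
job_totals() {
    nodes=$1; ncpus=$2; ngpus=$3
    echo "$((nodes * ncpus)) cores, $((nodes * ngpus)) GPUs"
}

# Print (do not submit) the command for the two-node example above.
echo "qsub -l $(build_select 2 16 4 2) -l walltime=3:00:00 -A gran10 my_script.sh"
job_totals 2 16 4
```

    Printing the command before running it is a cheap way to check the request against what the job really needs.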

    If you do not specify the walltime resource for your job, a default value of 30 minutes is assumed (max. walltime = 24h). Remember also that, if not specified, the default memory assigned to your job is 8GB (max. mem. = 120GB).

    Please do not ask for a whole node if you do not intend to use all its GPUs, since such a request will prevent other users from accessing the GPUs on that same node.

    For any other information regarding the features and limitations of the "routing queue", as well as how to write job scripts, see our HPC User Guide.
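
    As a rough illustration only, the resource requests described above can also be written inside the job script as #PBS directives. The sketch below writes such a script to disk without submitting it. The program name (./myprog), the list of modules and the explicit mem=8GB request are assumptions made for the example; <account_no> is the same placeholder used above and must be replaced with a real project account.

```shell
# Write a sample PBS job script to disk; nothing is submitted here.
# The resource values, module list and program name are placeholders.
cat > my_job_script.sh <<'EOF'
#!/bin/bash
#PBS -l select=1:ncpus=1:ngpus=1:mem=8GB
#PBS -l walltime=0:30:00
#PBS -A <account_no>

# Move to the directory the job was submitted from.
cd "$PBS_O_WORKDIR"

# Load the same modules that were used to build the program
# (compiler modules before the cuda module).
module load profile/advanced
module load gnu
module load cuda

./myprog
EOF

echo "wrote my_job_script.sh"
```

    The script would then be submitted with "qsub my_job_script.sh"; options given on the qsub command line take precedence over the directives in the file.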

    Accounting

    At present, the use of GPUs and other accelerators is not accounted for.

    More details about "Accounting" can be found in the HPC User Guide.
