
Attention

FERMI is no longer available at CINECA

 



LoadLeveler (LL) is the native batch scheduler of the FERMI (IBM BlueGene/Q) machine. The scheduler is responsible for managing jobs on the machine: it allocates partitions of compute nodes to users and returns job output and error files.

The FERMI compute nodes consist of 16 cores and 16 GB of RAM.

1 compute node = 16 cores + 16 GB Memory
  • Compute nodes. A compute node is not sharable. A job can only ask for whole compute nodes (containing 16 cores each). Compute nodes are allocated in groups of fixed size called partitions. You can ask for an arbitrary number of compute nodes, but LL will allocate them for you as fixed (and maybe larger) partitions. In fact, each job must be connected to at least one I/O node, which implies a minimum partition of 64 (or 128) compute nodes, i.e. 1024 (or 2048) cores, depending on the job class. Larger partitions are allocated in multiples of 64 with the bg_size keyword of LL (described below). A batch job is charged for the number of procs (nodes*16) allocated multiplied by the wall clock time spent.
  • Elapsed Time. This is the time after which the job is killed if not yet completed. It is set by the wall_clock_limit keyword in the LoadLeveler script and is important for choosing a specific class of the scheduler.
  • Memory. You cannot ask for a particular quantity of memory in a BlueGene job. You can access all the memory of the nodes in your partition (16 GB per node). If you need more memory, you have to allocate more compute nodes. Jobs with high memory requirements but poor parallel scalability are not suitable for the BlueGene/Q architecture.
    Running one MPI process (rank) per node, there will be approximately 16 GB available to the application. Running n ranks per node, there will be approximately 16/n GB available per rank. Valid values for the number of ranks are: 1, 2, 4, 8, 16, 32, 64.
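The per-rank memory for each valid ranks-per-node value can be sketched with a short shell loop (illustrative only; the 16 GB figure is the node total quoted above, with no allowance for system overhead):

```shell
# Approximate memory available per MPI rank on a 16 GB FERMI node,
# for each valid ranks-per-node value (ignores OS/runtime overhead).
for ranks in 1 2 4 8 16 32 64; do
  mb=$(( 16 * 1024 / ranks ))   # integer MB per rank
  echo "$ranks ranks/node -> ${mb} MB per rank"
done
```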

In order to submit a batch job, a LL script file must be written with directives for the scheduler followed by the commands to be executed. The script file has then to be submitted using the llsubmit command.

IMPORTANT: A job submitted for batch execution runs on the compute nodes of the cluster. Due to CINECA policies, the $CINECA_DATA and $CINECA_PROJECT filesystems are not mounted there, so you have to copy all data and executables needed by a batch job to the $CINECA_SCRATCH area in advance. This is not necessary for "public" executables, accessible via the "module" command.
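As a sketch, staging might look like the following before submitting (paths and file names are hypothetical; on FERMI, $CINECA_SCRATCH is predefined in your environment — the mktemp fallback is only so the commands can be tried anywhere):

```shell
# Stage inputs and the executable to scratch before submitting the job.
# CINECA_SCRATCH is predefined on FERMI; the fallback below is only for
# trying this sketch outside the cluster.
CINECA_SCRATCH=${CINECA_SCRATCH:-$(mktemp -d)}
mkdir -p run_dir
echo "some input" > run_dir/input.dat     # hypothetical input file
cp -r run_dir "$CINECA_SCRATCH/"
ls "$CINECA_SCRATCH/run_dir"              # data is now on scratch
```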

The basic LL commands

llsubmit script    Submit the job described in the "script" file
                  (See below for the scripts syntax) 
llq               Return information about all the jobs waiting and running in the LL queues
llq -u $USER      Return information about your jobs only in the queues
llq -l  <jobID>   Return detailed information about the specific job (long format)
llq -s  <jobID>   Return information about why the job remains queued
llcancel  <jobID>  Cancel a job from the queues, whether it is waiting or running 
llstatus          Return information about the status of the machine

 

 

In general, users do not need to specify a class (queue) name: LL chooses the appropriate class for the user depending on the resources requested in terms of time and nodes. This is not true for the class "archive" (for archiving data on CINECA_DATA/PRJ cartridges) and "keyproject" (for exceptionally long runs): these classes must be selected by declaring them with the #@ class clause.

For a list of available classes and resources limits on the FERMI see the section "FERMI User Guide".

LL script file syntax

In order to submit a batch execution, you need to create a file, e.g. job.cmd, which contains

  • LL directives specifying how many resources (walltime and processors) you wish to allocate to your job.
  • shell commands and programs which you wish to execute. The file paths can be absolute or relative to the directory from which you will submit the job.

Note that if your job tries to use more memory or time than requested, it will be killed by LL. On the contrary, if you require larger resources than needed, you will pay more and/or wait longer before your job is taken into consideration. Design your applications so that they can be easily restarted, and request the right amount of resources.

The first part of the script file contains directives for LL specifying the resources needed by the job. Some of them are general keywords, others are BGQ specific.

General LL keywords

Useful keywords not strictly related to BGQ architecture are:

# @ wall_clock_limit = dd+hh:mm:ss

this defines the maximum wall clock time available to the job;

# @ account_no = <myaccount>

this selects the account to be charged for the job. This keyword is mandatory for the personal usernames on the system. The possible values for account_no are given by the "saldo -b" command.

#@ wall_clock_limit = <hh:mm:ss>   (maximum elapsed time: 24:00:00)
#@ input = <filename>              (file used for std-in)
#@ output = <filename>             (file used for std-out)
#@ error = <filename>              (file used for std-err)
#@ initialdir = <dir>              (initial working dir during job execution;
                                    default: current working dir at submission time)
#@ notification = <value>          (when the user is sent email about the job:
                                    start/complete/error/always/never)
#@ notify_user = <email addr>      (address for email notification)
#@ environment = <value>           (initial environment variables set by LL when
                                    your job step starts: COPY_ALL = all variables
                                    are copied; $var = variable to be copied)
#@ account_no = <value>            (account to be charged for this job)
#@ queue                           (concludes the LL directives)

 

 

BGQ specific LL keywords

The important BlueGene-specific keywords for a LoadLeveler batch job script are:

# @ job_type=bluegene

this keyword is necessary for running on FERMI compute nodes. To run a serial job on the front-end nodes, the job_type keyword must be set to serial: 

# @ job_type=serial 

Another important keyword is 

# @ bg_size=<no. of compute nodes>

partition size in terms of compute nodes. On FERMI you can ask for any partition size (up to a maximum of 2048); the allowed values are bg_size = 64*2^n with n = 0, 1, ..., 5. If you ask for a different value, the scheduler will automatically round it up to the nearest allowed value. You will be informed at submit time how many compute nodes will actually be allocated (to be implemented). The total amount of cpu-hours charged is the number of compute nodes multiplied by 16 (cores/node).
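The rounding rule can be sketched as a small shell function (an illustration of the rule stated above, not a CINECA-provided tool):

```shell
# Round a requested node count up to the nearest allocatable partition
# size (64 * 2^n, n = 0..5, i.e. 64 to 2048), mirroring the scheduler
# behaviour described above.
round_bg_size() {
  local req=$1 size=64
  while [ "$size" -lt "$req" ] && [ "$size" -lt 2048 ]; do
    size=$(( size * 2 ))
  done
  echo "$size"
}
round_bg_size 100    # prints 128
round_bg_size 600    # prints 1024
```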

Since it's easy to burn your budget, please be careful before submitting jobs. For further info read this FAQ.

Optionally it is possible to specify the connection network used to link the allocated nodes:

# @   bg_connectivity = MESH | TORUS | EITHER

The default value is MESH. The fastest connection type is TORUS but it can only be used when bg_size is 512 or higher, otherwise you must use MESH.

#@ job_type = bluegene         (type of job step to process; set to bluegene!)
#@ bg_size = <num>             (number of compute nodes to be allocated)
#@ bg_shape = <int>x<int>x<int>x<int>
                               (shape of the partition to be allocated;
                                bg_size and bg_shape are mutually exclusive)
#@ bg_rotate = <value>         (whether LL should consider all possible rotations
                                of the given job shape when searching for
                                partitions: TRUE/FALSE)
#@ bg_connectivity = <value>   (type of wiring requested for the partition:
                                MESH/TORUS/EITHER)
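The bg_size/bg_connectivity constraint described above can be expressed as a quick sanity check (an illustrative helper for pre-submission scripts, not part of LL):

```shell
# Sanity-check a bg_connectivity request against the rule above:
# TORUS requires bg_size >= 512; MESH and EITHER are always valid.
check_connectivity() {
  local bg_size=$1 conn=$2
  if [ "$conn" = TORUS ] && [ "$bg_size" -lt 512 ]; then
    echo "invalid: TORUS requires bg_size >= 512"
  else
    echo "ok"
  fi
}
check_connectivity 256 TORUS    # prints the invalid message
check_connectivity 512 TORUS    # prints ok
```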
 

 The runjob command

User applications always need to be executed with the "runjob" command in order to run on the compute nodes. All the other commands in the LL script run on the login node; access to the compute nodes happens only through the runjob command.

The runjob command  takes multiple arguments. Please type "runjob --h" to get a list of possible arguments.

Syntax:

runjob [params]
runjob [params] binary [arg1 arg2 ... argn]

The parameters of the runjob command can be set either:

  • by command-line options (higher priority!)
  • by environment variables

For a complete description type

runjob --h 

The most important parameters are reported below. Please note the use of "--" (double dash) introducing each option in command-line mode.

Command-line option / environment variable, and description:

--exe <exec>  /  RUNJOB_EXE=<exec>
    Specifies the full path to the executable (it can also be given as the first argument after ":"):
    runjob --exe /home/user/a.out
    runjob : /home/user/a.out

--args <prg_args>  /  RUNJOB_ARGS=<prg_args>
    Passes arguments to the launched application on the compute nodes:
    runjob : a.out hello world
    runjob --args hello --args world --exe a.out

--envs "ENVVAR=value"  /  RUNJOB_ENVS="ENVVAR=value"
    Sets the environment variable ENVVAR=value in the job environment on the compute nodes.

--exp_env ENVVAR  /  RUNJOB_EXP_ENV=ENVVAR
    Exports the environment variable ENVVAR from the current environment to the job on the compute nodes.

--ranks-per-node <n>
    Number of processes per compute node. Valid values: 1, 2, 4, 8, 16, 32, 64.

--np <n>  /  RUNJOB_NP=<n>
    Number of processes in the job (<= bg_size * ranks-per-node).

--mapping <value>  /  RUNJOB_MAPPING=<mapfile>
    Permutation of ABCDET or a path to a mapping file containing coordinates for each rank.

--start_tool <path>
    Path to a tool to start with the job (debuggers).

--tool_args <args>
    Arguments for the tool (debuggers).

 

Example job scripts

In the following we report some typical jobs. You can use these templates to write your own job script for running on the FERMI system.

EX: Serial job

#!/bin/bash 
# @ job_type = serial
# @ job_name = serial.$(jobid)
# @ output = $(job_name).out
# @ error = $(job_name).err
# @ wall_clock_limit = 0:25:00
# @ class = serial
# @ queue
./executable < input > output

Remember to use the serial compiler for the front-end to generate the executable.

EX1: pure MPI batch job

#!/bin/bash
# @ job_type = bluegene
# @ job_name = bgsize.$(jobid)
# @ output = z.$(jobid).out
# @ error = z.$(jobid).err
# @ shell = /bin/bash
# @ wall_clock_limit = 02:00:00
# @ notification = always
# @ notify_user=<email addr>
# @ bg_size = 256
# @ account_no = <my_account_no>
# @ queue
runjob --ranks-per-node 64 --exe ./program.exe

This batch job asks for 256 compute nodes, corresponding to 256*16 cores. With the runjob command, 64 ranks per node are requested. This means the parallelism of the "program.exe" code is 256*64 MPI processes, each of them with a per-process memory of 16/64 GB.

EX2: pure MPI batch job

#!/bin/bash
# @ job_type = bluegene
# @ job_name = example_2
# @ comment = "BGQ Job by Size"
# @ output = $(job_name).$(jobid).out
# @ error = $(job_name).$(jobid).out
# @ environment = COPY_ALL
# @ wall_clock_limit = 1:00:00
# @ notification = error
# @ notify_user=<email addr>
# @ bg_size = 1024
# @ bg_connectivity = TORUS
# @ account_no = <my_account_no>
# @ queue
runjob --ranks-per-node 16 --exe /gpfs/scratch/.../my_program

 This job asks for 1024 nodes, 16 processes (ranks) per node, for a total of 16K MPI tasks. The TORUS connection is required.

EX4: MPI + OpenMP hybrid job

#!/bin/bash
# @ job_type = bluegene
# @ job_name = example_3
# @ output = $(job_name).$(jobid).out
# @ error = $(job_name).$(jobid).out
# @ environment = COPY_ALL
# @ wall_clock_limit = 12:00:00
# @ bg_size = 1024
# @ bg_connectivity = TORUS
# @ account_no = <my_account_no>
# @ queue
module load scalapack
runjob --ranks-per-node 1 --exe /path/myexe --envs OMP_NUM_THREADS=16

This job requires 1024 nodes (a full BGQ rack), 1 MPI process per node, and 16 threads per task (via the OMP_NUM_THREADS variable in runjob). When you use an MPI+OpenMP application you need to set the environment variable OMP_NUM_THREADS directly on the compute nodes. This is done through the --envs argument of runjob (remember that the whole batch script, except the runjob line, is executed on the login nodes).

Multi-step jobs

The LoadLeveler scheduler allows you to chain several jobs in a single multi-step job. In this case, each step has a name, specified by the keyword

# @ step_name

and is terminated by the # @ queue statement. A given step is executed whenever the condition specified by the keyword

#@ dependency= ...

is satisfied. For example, setting

# @ step_name = step01
# @ dependency = step00 == 0

means that step01 will wait for step00 to complete and succeed (exit status = 0). Note that for each job step you need to specify the keyword # @ bg_size, since it is not inherited by subsequent steps; otherwise the default value bg_size=64 will be assumed.

EX5: Pure MPI multistep job

#!/bin/bash
# @ job_type = bluegene
# @ job_name = DNS
# @ output = $(job_name).$(jobid).out
# @ error = $(job_name).$(jobid).err
# @ wall_clock_limit = 1:00:00       # WCL for each step
# @ account_no = <my_account_no>
#
#************* STEP SECTION ************
# @ step_name=step_0
# @ bg_size = 256
# @ queue
# @ step_name = step_1
# @ bg_size = 64
# @ dependency = step_0 == 0
# @ queue

case $LOADL_STEP_NAME in
  step_0)
    runjob --ranks-per-node 16 : ./your_exe input1 > output1
    ;;
  step_1)
    runjob --ranks-per-node 64 : ./your_exe input2 > output2
    ;;
esac

 
This is an example of a multi-step job script: the first step requires 256 nodes with 16 MPI tasks per node; the second step runs only when the first one has completed (with success) and requires 64 nodes with 64 MPI tasks each.

EX6: Serial and pure MPI multistep job

#!/bin/bash
# @ job_name = gromacs
# @ output = $(job_name).$(jobid).out
# @ error = $(job_name).$(jobid).err
# @ account_no = <my_account_no>
#
#************* STEP SECTION ************
# @ step_name=step_0
# @ job_type = bluegene
# @ wall_clock_limit = 20:00
# @ bg_size = 64
# @ queue

# @ step_name=step_1
# @ dependency = step_0 == 0
# @ job_type = serial
# @ wall_clock_limit = 5:00
# @ class = serial
# @ queue

case $LOADL_STEP_NAME in
step_0)
module load gromacs
mdrun=$(which mdrun_bgq)
exe="$mdrun -v -s dkv12.tpr"
runjob --ranks-per-node 16 --env-all : $exe
;;
step_1)
rsync --timeout=600 -r -avzHS --bwlimit=80000 --block-size=1048576 --progress <firststep_output_data> <path_to>
;;
esac

This is an example of a multi-step job script where the first step is of bluegene type: it requires 64 compute nodes and executes the "gromacs" code. The second step is of serial type and executes the "rsync" command to transfer data created during the first step; it runs on a single front-end node, and only after the first step has completed successfully.

 

Sub-block jobs

It is possible to launch multiple runs within the minimum allocatable block of 64 compute nodes, or within a block of 128 nodes.

The sub-blocking technique enables you to submit jobs with bg_size=64 in which 2, 4, 8, 16, 32, or 64 simulations run simultaneously, each occupying 32, 16, 8, 4, 2, or 1 compute nodes respectively; and jobs with bg_size=128 in which 2, 4, 8, 16, 32, 64, or 128 simulations run simultaneously, each occupying 64, 32, 16, 8, 4, 2, or 1 compute nodes respectively.
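The relation between bg_size, the number of sub-blocks, and nodes per sub-block is simple division (both values are powers of two), e.g.:

```shell
# Nodes per sub-block = bg_size / number of sub-blocks
# (both must be powers of two, as listed above).
for n_subblock in 2 4 8 16 32 64; do
  echo "bg_size=64, $n_subblock sub-blocks -> $(( 64 / n_subblock )) nodes each"
done
```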

Most of the environment variables required to set up the sub-blocks are set in specific files that need to be sourced in the job script. These files are available after having loaded the subblock module: 

module load subblock

Users just need to modify the environment variables in the "User Section" of the following job script templates (see below).

EX7: Pure MPI sub-block jobs

In this example, the bg_size=64 is divided into 2 sub-blocks where two different inputs (pep1 and pep2) are simulated by using the same software, Gromacs, and the same options. 

#!/bin/bash
#@ job_name = ggvvia.$(jobid)
#@ output = gmxrun.$(jobid).out
#@ error = gmxrun.$(jobid).err
#@ shell = /bin/bash
#@ job_type = bluegene
#@ wall_clock_limit = 1:00:00
#@ notification = always
#@ notify_user= <valid email address>
#@ account_no = <account no>
#@ bg_size = 64
#@ queue

#################################################################
### USER SECTION
### Please modify only the following variables
#################################################################

### Dimension of bg_size, the same set in the LoadLeveler keyword
export N_BGSIZE=64

### No. of sub-block you want. For bg_size 64 you can choose between 2, 4, 8, 16, 32, 64
export N_SUBBLOCK=2

### No. of MPI tasks in each sub-block.
export NPROC=512

### No. of MPI tasks in each node.
export RANK_PER_NODE=16

### module load <your applications>
module load gromacs/4.5.5
mdrun=$(which mdrun_bgq)

export WDR=$PWD
inptpr1="$WDR/pep1.tpr"
inptpr2="$WDR/pep2.tpr"
export EXE_1="$mdrun -s $inptpr1 <options>"
export EXE_2="$mdrun -s $inptpr2 <options>"
export EXECUTABLES="$EXE_1 $EXE_2"

#################################################################

echo "work dir: " $WDR
echo "executable: " $EXECUTABLES

module load subblock
source ${SUBBLOCK_HOME}/bgsize_${N_BGSIZE}/npart_${N_SUBBLOCK}.txt

runjob --block $BLOCK_ALLOCATED --corner $(n_cor 1) --shape $SHAPE_SB --np $NPROC \
--ranks-per-node $RANK_PER_NODE : $EXE_1 > out_1 &
runjob --block $BLOCK_ALLOCATED --corner $(n_cor 2) --shape $SHAPE_SB --np $NPROC \
--ranks-per-node $RANK_PER_NODE : $EXE_2 > out_2 &
wait

In this example, a bg_size=128 is divided into 4 sub-blocks where different executables run (executable_1, executable_2, executable_3, executable_4). The output files (out_1, out_2, out_3, out_4) of the four runs will be stored in four different directories, i.e. dir_1, dir_2, dir_3 and dir_4.

#!/bin/bash
# @ job_name = sub_block.$(jobid)
# @ output = $(job_name).out
# @ error = $(job_name).err
# @ environment = COPY_ALL
# @ job_type = bluegene
# @ wall_clock_limit = 00:05:00
# @ bg_size = 128
# @ account_no = <account no>
# @ notification = always
# @ notify_user = <valid email address>
# @ queue

#################################################################
### USER SECTION
### Please modify only the following variables
#################################################################

### Dimension of bg_size, the same set in the LoadLeveler keyword
export N_BGSIZE=128

### No. of required sub-blocks. For N_BGSIZE=128 you can choose between 2, 4, 8, 16, 32, 64, 128.
export N_SUBBLOCK=4

### No. of MPI tasks in each sub-block.
export NPROC=512
### No. of MPI tasks in each node.
export RANK_PER_NODE=16    
### module load <your application>
export WDR=$PWD
export EXE_1=$WDR/executable_1.exe
export EXE_2=$WDR/executable_2.exe
export EXE_3=$WDR/executable_3.exe
export EXE_4=$WDR/executable_4.exe
export EXECUTABLES="$EXE_1,$EXE_2,$EXE_3,$EXE_4"
n_exe () { echo $EXECUTABLES | awk -F',' "{print \$$1}"; }
#################################################################
echo "work dir: " $WDR
echo "executable: " $EXECUTABLES

module load subblock
source ${SUBBLOCK_HOME}/bgsize_${N_BGSIZE}/npart_${N_SUBBLOCK}.txt
for i in `seq 1 $N_SUBBLOCK`; do
  mkdir -p dir_$i
  cd dir_$i
  echo $(n_exe $i)
  runjob --block $BLOCK_ALLOCATED --corner $(n_cor $i) --shape $SHAPE_SB  --np $NPROC \
              --ranks-per-node $RANK_PER_NODE  : $(n_exe $i) > out_$i   &
  cd ..
done
wait

 

Mapping and MPMD

MPMD (Multiple Program Multiple Data) is a parallel programming style that allows different programs (or the same program with different settings) to execute on different parts of the same resource allocation. For instance, one partition of the requested resources can be mapped with 8 ranks-per-node and launch executable A, while another can be mapped with 2 ranks-per-node and launch executable B. What differs from the sub-blocking technique (explained above) is that the partitions can communicate with each other.

In order to exploit the MPMD advantages on FERMI, a mapfile must be created and specified in the job script with the --mapping flag. A mapfile is an ASCII file composed of two parts:
- MPMD directives (which MPI tasks execute which program)
- mapping coordinates (how the tasks are distributed among the allocated resources)

For further details and examples we refer you to the Best Practice Guide for BG/Q available on the PRACE website:
http://www.prace-ri.eu/best-practice-guide-blue-gene-q-html/#id-1.5.4.12

 

Further documentation

(to be completed)



 
