...
SLURM job script and directives
A SLURM job script consists of:
- An optional shell specification
- SLURM directives
- Tasks -- programs or commands to be executed
Once ready, the job must be submitted to SLURM:
> sbatch [options] <name of script>
The shell to be used by SLURM is defined in the first line of the job script (mandatory!):
#!/bin/bash (or #!/bin/sh)
The SLURM directives are used to request resources or set job attributes. A directive begins with the string #SBATCH. One or more directives can follow the shell definition in the job script.
The tasks can be programs or commands. This is where the user specifies the application to run.
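As a minimal sketch of the overall structure (job name, walltime and program name below are placeholders, not values for any specific cluster), a complete job script could look like this:
#!/bin/bash
# SLURM directives: job attributes and resource requests
#SBATCH --job-name=myFirstJob
#SBATCH --time=00:10:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
# task: the program or command to be executed
./myprogram
It would then be submitted with:
> sbatch myFirstJob.sh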
SLURM directives: resources
The type of resources required for a serial or parallel MPI/OpenMP/mixed job must be specified with a SLURM directive:
#SBATCH --<chunk_resource>=<value>
where <chunk_resource> can be one of the following:
- --nodes=NN number of nodes
- --ntasks-per-node=CC number of tasks/processes per node
- --cpus-per-task=TT number of threads/cpus per task
For example, for an MPI or mixed MPI/OpenMP job on 2 nodes with 8 MPI tasks per node:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
For a serial job, for example:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
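As a further sketch (the figures are purely illustrative), a mixed MPI/OpenMP job with 4 MPI tasks per node and 4 threads per task on 2 nodes would request:
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4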
SLURM directives: processing time
The required computing time must be requested with the following syntax:
#SBATCH --time=<value>
where <value> expresses the actual elapsed time (wall clock) in the format hh:mm:ss
for example:
#SBATCH --time=1:00:00 (one hour)
Please note that each system enforces specific limits on the maximum walltime, which also depend on the partition. Check the system-specific guide for more information.
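The partition time limits can also be inspected directly on the cluster, for example with sinfo (the %l format field prints the maximum walltime of each partition):
> sinfo -o "%P %l"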
SLURM directives: memory allocation
The default memory depends on the partition/queue you are working with. It is usually set to the total memory of the node divided by the number of cores in a single node, which we call here the memory per core. So if you request 3 cores, by default you get the equivalent of 3 times the memory per core. Alternatively, you can specify the requested memory with the --mem=<value> directive, up to the maximum memory available on the nodes.
#SBATCH --mem=10000
The default measurement unit for memory requests is the Megabyte (in the example above, we are requesting 10000 MB per node). It is also possible to ask for an amount of memory expressed in GB, like this:
#SBATCH --mem=10GB
However, the default request in MB is preferable, since the memory limits defined for any partition are expressed in these terms. For example, the Marconi SkyLake partition has a limit of 182000 MB, corresponding to approx. 177 GB.
Please note: if you request more memory than the "main amount" for the system, the number of "effective cores" and the cost of your job could increase. For more information check the accounting section.
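As a rough worked example based on the figures quoted above (a 48-core SkyLake node with the 182000 MB limit), the default memory per core is about 182000/48 ≈ 3790 MB, so a job requesting 3 cores gets roughly 11370 MB by default; an explicit --mem request (the 20000 MB below is illustrative) overrides this and may raise the number of "effective cores":
# 3 cores: about 3 x 3790 MB by default
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=3
# explicit request of 20000 MB per node instead of the default
#SBATCH --mem=20000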
SLURM directives: MPI tasks/OpenMP threads affinity
You may have to modify the default affinity in order to ensure optimal performance on Marconi A3.
The SLURM directives that control process binding are the following:
--cpu-bind=<cores|threads>
--cpus-per-task=<number of physical or logical cpus to allocate for each task>
With --cpus-per-task you define the SLURM_CPUS_PER_TASK environment variable, which has to be propagated to srun by exporting:
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
In order to modify them correctly, we suggest following our guidelines.
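As a sketch only (task and thread counts are illustrative, not a recommendation for a specific partition), a hybrid job enforcing binding to physical cores could combine these settings as follows; --cpu-bind is passed to srun at launch time:
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=4

export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores
export OMP_PROC_BIND=true

# bind each MPI task to its own set of physical cores
srun --cpu-bind=cores ./myprogram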
Other SLURM directives
#SBATCH --account=<account_no> --> name of the project to be accounted to ("saldo -b" for a list of projects)
#SBATCH --job-name=<name> --> job name
#SBATCH --partition=<destination> --> partition/queue destination. For a list and description of available partitions, please refer to the specific cluster description of the guide.
#SBATCH --qos=<qos_name> --> quality of service. Please refer to the specific cluster description of the guide.
#SBATCH --output=<out_file> --> redirects the output file (default, if missing, is slurm-<jobid>.out, containing merged output and error)
#SBATCH --error=<err_file> --> redirects error file (as above)
#SBATCH --mail-type=<mail_events> --> specify email notification (NONE, BEGIN, END, FAIL, REQUEUE, ALL)
#SBATCH --mail-user=<user_list> --> set email destination (email address)
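For example (account, partition and e-mail address are placeholders), a job header using some of these directives could be:
#SBATCH --account=<account_no>
#SBATCH --job-name=myAnalysis
#SBATCH --partition=<partition_name>
#SBATCH --output=myJob.out
#SBATCH --error=myJob.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<email_address>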
Directives in contracted form
Some SLURM directives can be written with a contracted syntax. Here are all the possibilities:
#SBATCH -N <NN> --> #SBATCH --nodes=<NN>
#SBATCH -c <TT> --> #SBATCH --cpus-per-task=<TT>
#SBATCH -t <value> --> #SBATCH --time=<value>
#SBATCH -A <account_no> --> #SBATCH --account=<account_no>
#SBATCH -J <name> --> #SBATCH --job-name=<name>
#SBATCH -p <destination> --> #SBATCH --partition=<destination>
#SBATCH -q <qos_name> --> #SBATCH --qos=<qos_name>
#SBATCH -o <out_file> --> #SBATCH --output=<out_file>
#SBATCH -e <err_file> --> #SBATCH --error=<err_file>
Note: the directives --mem, --mail-type, --mail-user and --ntasks-per-node cannot be contracted. Regarding the latter, SLURM does provide a "-n" option for the number of tasks, but it can be misleading, since it indicates the TOTAL number of tasks and not the number of tasks per node. It is therefore not recommended, as it can lead to confusion and unexpected behaviour; use the uncontracted --ntasks-per-node instead.
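As an illustration (all values are placeholders), the contracted syntax allows a compact job header like the following; note that --ntasks-per-node must stay in its full form:
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
#SBATCH -c 4
#SBATCH -t 1:00:00
#SBATCH -A <account_no>
#SBATCH -J myJob
#SBATCH -p <partition_name>
#SBATCH -o myJob.out
#SBATCH -e myJob.err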
Using sbatch options to assign job attributes and request resources
It is also possible to assign the job attributes using the sbatch command options:
> sbatch [--job-name=<name>] [--partition=<queue/partition>] [--out=<out_file>] [--err=<err_file>] [--mail-type=<mail_events>] [--mail-user=<user_list>] <name of script>
Resources can also be requested using the sbatch command options:
> sbatch [--time=<value>] [--ntasks=<value>] [--account=<account_no>] <name of script>
The sbatch command options override script directives if present.
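For example (values and script name are placeholders), the walltime and account set in the script can be overridden at submission time:
> sbatch --time=00:30:00 --account=<account_no> myscript.sh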
...
Here we ask for a single node and a single task, thus allocating 48 physical cpus on SKL or 36 on Galileo. With the export of SRUN_CPUS_PER_TASK and OMP_NUM_THREADS we are setting 48 or 36 OpenMP threads for the single task.
...
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
# 48 on Marconi SKL, 36 on Galileo
#SBATCH --cpus-per-task=48
#SBATCH --partition=<partition_name>
#SBATCH --qos=<qos_name>
#SBATCH --mem=<mem_per_node>
#SBATCH --out=myJob.out
#SBATCH --err=myJob.err
#SBATCH --account=<account_no>

module load intel

export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS=48   # 36 on Galileo

srun myprogram < myinput > myoutput
###################################
...
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --ntasks-per-socket=2
#SBATCH --cpus-per-task=4
#SBATCH --mem=<mem_per_node>
#SBATCH --partition=<partition_name>
#SBATCH --qos=<qos_name>
#SBATCH --job-name=jobMPI
#SBATCH --err=myJob.err
#SBATCH --out=myJob.out
#SBATCH --account=<account_no>
module load intel intelmpi
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS=4
export OMP_PLACES=cores
export OMP_PROC_BIND=true
...
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --nodes=2
#SBATCH --partition=<partition_name>
#SBATCH --qos=<qos_name>
#SBATCH --mem=86000
#SBATCH --out=myJob.out
#SBATCH --err=myJob.err
#SBATCH --account=<account_no>
module load intel
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
export OMP_NUM_THREADS=1
srun myprogram < myinput > myoutput
...