...

Currently, SLURM is the scheduling system of MARCONI, MARCONI100 and GALILEO100. Comprehensive documentation is available on this portal, as well as on the original SchedMD site.

...

Running applications using SLURM

With SLURM you can specify the tasks that you want to be executed; the system takes care of running these tasks and returns the results to the user. If the resources are not available, then SLURM holds your jobs and runs them when they become available.

With SLURM you normally create a batch job which you submit to the scheduler. A batch job is a file (a shell script under UNIX) containing the set of commands that you want to run. It also contains the directives that specify the characteristics (attributes) of the job and the resource requirements (e.g. number of processors and CPU time) that your job needs.

Once you have created your job script, you can reuse it if you wish, or modify it for subsequent runs.

For example, here is a simple SLURM job script to run a user's application, setting a limit of one hour on the maximum wall time and requesting 1 full node with 32 cores:

#!/bin/bash
#SBATCH --nodes=1                    # 1 node
#SBATCH --ntasks-per-node=32         # 32 tasks per node
#SBATCH --time=1:00:00               # time limit: 1 hour
#SBATCH --error=myJob.err            # standard error file
#SBATCH --output=myJob.out           # standard output file
#SBATCH --account=<account_no>       # account name
#SBATCH --partition=<partition_name> # partition name
#SBATCH --qos=<qos_name>             # quality of service
./my_application

Check the maximum value allowed for ntasks-per-node on the cluster you are using (i.e. the number of cores per node, including hyperthreading when enabled). More generally, SLURM is configured differently on the various systems, reflecting their different features. Please refer to the system-specific guides for more detailed information.
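For instance, one way to check the number of cores and the memory per node of each partition is to query sinfo with an explicit output format (the format specifiers below are standard sinfo options):

> sinfo -o "%P %c %m %l"       # partition, CPUs per node, memory per node (MB), max walltime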

Basic SLURM commands

The main SLURM user commands are reported in the table below; please consult the man pages for more information.

sbatch, srun, salloc          Submit a job
squeue                        Lists jobs in the queue
sinfo                         Prints queue information about nodes and partitions
sbatch <batch script>         Submits a batch script to the queue
scancel <jobid>               Cancels a job from the queue
scontrol hold <jobid>         Puts a job on hold in the queue
scontrol release <jobid>      Releases a job from hold
scontrol update <jobid>       Changes the attributes of a submitted job
scontrol requeue <jobid>      Requeues a running, suspended, or finished batch job into pending state
scontrol show job <jobid>     Produces a very detailed report for the job
sacct -k, --timelimit-min     Only send data about jobs with this time limit
sacct -A <account_list>       Displays jobs when a comma-separated list of accounts is given as the argument
sstat                         Displays information about CPU, Task, Node, Resident Set Size, and Virtual Memory
sshare                        Displays fair-share information for a user, an account, a partition, etc.
sprio                         Displays information about a job's scheduling priority from multi-factor priority components
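A few typical invocations (the job id is a placeholder):

> squeue -u $USER              # list only your own jobs
> scontrol show job <jobid>    # detailed report on a specific job
> scancel <jobid>              # remove that job from the queue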


Submit a job:

> sbatch [opts] job_script
> salloc [opts] <command>  (interactive job)
where:
[opts] --> --nodes=<nodes_no> --ntasks-per-node=<tasks_per_node_no> --account=<account_no> --partition=<name> ...

job_script is a SLURM batch job.

The second command is related to a so-called "Interactive job": with salloc the user allocates a set of resources (nodes, cores, etc). The job is queued and scheduled as any SLURM batch job, but when executed with srun, the standard input, output, and error streams of the job are connected to the terminal session in which salloc is running. When the job begins its execution, all the input to the job is taken from the terminal session. You can use CTRL-D or "exit" to close the session.
If you specify a command at the end of your salloc string (like "./myscript"), the job will simply execute the command and then close, printing the standard output and error directly on your working terminal.

salloc -N 1 --ntasks-per-node=8 # here I'm asking for 1 compute node with 8 cores
squeue -u $USER # can be used to check remote allocation is ready
hostname # will run on the front-end NOT ON ALLOCATED RESOURCES
srun hostname # will run on allocated resources showing the name of remote compute node
exit # ends the salloc allocation

WARNING: interactive jobs with SLURM are quite delicate. With salloc, your prompt won't tell you that you are working on a compute node, so it is easy to forget that an interactive job is running. Furthermore, deleting the job with "scancel" while inside the job itself will not kick you out of the nodes, and will invalidate your interactive session, since every subsequent command looks for a jobid that no longer exists. If you are stuck in this situation, you can always return to your original front-end session with "CTRL-D" or "exit".
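If you need to get rid of an interactive allocation, a safe sequence is to leave the session first and only then, if necessary, cancel the job from the front-end, for example:

> exit                   # close the interactive session and go back to the front-end
> squeue -u $USER        # check whether the allocation is still in the queue
> scancel <jobid>        # if so, cancel it from the front-end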

WARNING: interactive jobs may also be created by launching the command

...

SLURM job script and directives

A SLURM job script consists of:

  • An optional shell specification
  • SLURM directives
  • Tasks -- programs or commands to be executed

Once ready, the job must be submitted to SLURM:

> sbatch [options] <name of script>

The shell to be used by SLURM is defined in the first line of the job script (mandatory!):

#!/bin/bash (or #!/bin/sh)

The SLURM directives are used to request resources or set attributes. A directive begins with the string #SBATCH. One or more directives can follow the shell definition in the job script.

The tasks can be programs or commands. This is where the user specifies the application to run.
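Putting the three parts together, a minimal skeleton (program name and account are placeholders) looks like this:

#!/bin/bash                        # shell specification
#SBATCH --nodes=1                  # SLURM directives
#SBATCH --time=0:10:00
#SBATCH --account=<account_no>

./my_program                       # task to be executed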

SLURM directives: resources

The type of resources required for a serial or parallel MPI/OpenMP/mixed job must be specified with a SLURM directive:

#SBATCH --<chunk_resource>=<value>

where <chunk_resource> can be one of the following:

  • --nodes=NN                number of nodes
  • --ntasks-per-node=CC      number of tasks/processes per node
  • --cpus-per-task=TT        number of threads/cpus per task

For example, for an MPI (or mixed MPI/OpenMP) job requesting 2 nodes with 8 MPI processes per node:

#SBATCH --nodes=2 
#SBATCH --ntasks-per-node=8

For a serial job, instead:

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
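For a mixed MPI/OpenMP job, the threads per task are requested with --cpus-per-task; as a sketch (the numbers are purely illustrative):

#SBATCH --nodes=2                # 2 nodes
#SBATCH --ntasks-per-node=4      # 4 MPI processes per node
#SBATCH --cpus-per-task=8        # 8 OpenMP threads per MPI process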

SLURM directives: processing time

Resources such as computing time must be requested with the following syntax:

#SBATCH --time=<value>

where <value> expresses the actual elapsed time (wall clock) in the format hh:mm:ss

for example:

#SBATCH --time=1:00:00 (one hour)
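According to the sbatch man page, other time formats are also accepted, for example:

#SBATCH --time=30                  # 30 minutes
#SBATCH --time=2-12:00:00          # 2 days and 12 hours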

Please note that there are specific limitations on the maximum walltime on a system, also depending on the partition. Check the system specific guide for more information.

SLURM directives: memory allocation

The default memory depends on the partition/queue you are working with. It is usually set to the total memory of the node divided by the number of cores of the node (what we call here "memory-per-core"): if you request, say, 3 cores, by default you get the equivalent of 3 times the memory-per-core. Alternatively, you can specify the requested memory explicitly with the --mem=<value> directive, up to the maximum memory available on the nodes.

#SBATCH --mem=10000

The default measurement unit for memory requests is the Megabyte (in the example above, we are requesting 10000 MB per node). It is also possible to ask for an amount of memory expressed in GB, like this:

#SBATCH --mem=10GB

However, the default request method in MB is preferable, since the memory limits defined for any partition are expressed in these terms. For example, Marconi SkyLake partition has 182000MB as a limit, corresponding to approx. 177GB.

Please note: if you request more memory than the default amount for the system (i.e. the memory-per-core times the requested cores), the number of "effective cores" and the cost of your job may increase. For more information, check the accounting section.
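After the job has ended, you can check how much memory it actually used, for instance with a sacct query along these lines (the fields are standard sacct format options):

> sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed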

SLURM directives: MPI tasks/OpenMP threads affinity

You may have to modify the default affinity in order to ensure optimal performance on Marconi A3.

The SLURM directives that concern process binding are the following:

--cpu-bind=<cores|threads>
--cpus-per-task=<number of physical or logical cpus to allocate for a single task>

The value of --cpus-per-task defines the SLURM_CPUS_PER_TASK environment variable. When launching with "srun" (whenever possible), this variable is not inherited by srun and has to be exported with

export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

In order to set these values correctly, we suggest following our guidelines.
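As an illustrative sketch (application name, account and numbers are placeholders), a hybrid job following these indications might look like:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4            # 4 MPI tasks
#SBATCH --cpus-per-task=8              # 8 cpus (threads) per task
#SBATCH --time=1:00:00
#SBATCH --account=<account_no>
#SBATCH --partition=<partition_name>

export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK   # make srun aware of the cpus-per-task request
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK      # one OpenMP thread per allocated cpu

srun --cpu-bind=cores ./my_hybrid_application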

Other SLURM directives

#SBATCH --account=<account_no>          --> name of the project to be accounted to ("saldo -b" for a list of projects)
#SBATCH --job-name=<name> --> job name
#SBATCH --partition=<destination> --> partition/queue destination. For a list and description of available partitions, please refer to the specific cluster description of the guide.
#SBATCH --qos=<qos_name> --> quality of service. Please refer to the specific cluster description of the guide.
#SBATCH --output=<out_file> --> redirects output file (default, if missing, is slurm-<Pid> containing merged output and error file)
#SBATCH --error=<err_file> --> redirects error file (as above)
#SBATCH --mail-type=<mail_events> --> specify email notification (NONE, BEGIN, END, FAIL, REQUEUE, ALL)
#SBATCH --mail-user=<user_list> --> set email destination (email address)
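For example, a job header using some of these directives (job name, file names and e-mail address are of course placeholders; multiple mail events can be combined with commas) could look like:

#SBATCH --account=<account_no>
#SBATCH --job-name=my_test
#SBATCH --partition=<partition_name>
#SBATCH --qos=<qos_name>
#SBATCH --output=my_test.out
#SBATCH --error=my_test.err
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=<email_address>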

Directives in contracted form

Some SLURM directives can be written with a contracted syntax. Here are all the possibilities:

#SBATCH -N <NN>                         --> #SBATCH --nodes=<NN>
#SBATCH -c <TT> --> #SBATCH --cpus-per-task=<TT>
#SBATCH -t <value> --> #SBATCH --time=<value>
#SBATCH -A <account_no>                 --> #SBATCH --account=<account_no>
#SBATCH -J <name> --> #SBATCH --job-name=<name>
#SBATCH -p <destination> --> #SBATCH --partition=<destination>
#SBATCH -q <qos_name> --> #SBATCH --qos=<qos_name>
#SBATCH -o <out_file> --> #SBATCH --output=<out_file>
#SBATCH -e <err_file> --> #SBATCH --error=<err_file>

Note: the directives --mem, --mail-type, --mail-user and --ntasks-per-node cannot be contracted. Regarding the latter, a SLURM directive "-n" does exist for the number of tasks, but it can be misleading since it indicates the TOTAL number of tasks and not the number of tasks per node. Its use is therefore discouraged, as it can lead to confusion and unexpected behaviour; use the uncontracted --ntasks-per-node instead.
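To illustrate the difference with a sketch (numbers purely illustrative), the following two requests usually end up with 8 tasks on each of 2 nodes, but only the second one fixes the per-node count explicitly:

#SBATCH -N 2
#SBATCH -n 16                 # 16 tasks in TOTAL, distributed by SLURM across the 2 nodes

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8   # exactly 8 tasks on EACH node (16 in total)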

Using sbatch attributes to assign job attributes and resource request

It is also possible to assign the job attributes using the sbatch command options:

> sbatch [--job-name=<name>]  [--partition=<queue/partition>]  [--output=<out_file>] [--error=<err_file>] [--mail-type=<mail_events>] [--mail-user=<user_list>] <name of script>

And the resources can also be requested using the sbatch command options:

> sbatch  [--time=<value>] [--ntasks=<value>] [--account=<account_no>] <name of script>

The sbatch command options override script directives if present.
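For example (script name and values are illustrative), if my_job.sh contains "#SBATCH --time=1:00:00", the following submission runs it with a 30-minute limit instead:

> sbatch --time=0:30:00 my_job.sh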


...