Comment: updated links to GALILEO100 User Guide

...

Currently, SLURM is the scheduling system of MARCONI100 and GALILEO100. Comprehensive documentation is on this portal, as well as on the original SchedMD site.

...

Running applications using SLURM

With SLURM you specify the tasks that you want to be executed; the system takes care of running these tasks and returns the results to you. If the resources are not available, SLURM holds your jobs and runs them when the resources become available.

With SLURM you normally create a batch job, which you then submit to the scheduler. A batch job is a file (a shell script under UNIX) containing the set of commands that you want to run. It also contains the directives that specify the characteristics (attributes) of the job and its resource requirements (e.g. number of processors and CPU time).

Once you have created your job script, you can reuse it as it is or modify it for subsequent runs.

For example, here is a simple SLURM job script to run a user's application by setting a limit (one hour) to the maximum wall time, requesting 1 full node with 36 cores:

#!/bin/bash
#SBATCH --nodes=1                     # 1 node
#SBATCH --ntasks-per-node=36          # 36 tasks per node
#SBATCH --time=1:00:00                # time limit: 1 hour
#SBATCH --error=myJob.err             # standard error file
#SBATCH --output=myJob.out            # standard output file
#SBATCH --account=<account_no>        # account name
#SBATCH --partition=<partition_name>  # partition name
#SBATCH --qos=<qos_name>              # quality of service
./my_application

SLURM is configured differently on the various systems, reflecting their different features. Please refer to the system-specific guides for more detailed information.

Basic SLURM commands

The main SLURM user commands are listed in the table below; please consult the man pages for more information.

Command                      Description
sbatch, srun, salloc         Submit a job
squeue                       List the jobs in the queue
sinfo                        Print queue information about nodes and partitions
sbatch <batch script>        Submit a batch script to the queue
scancel <jobid>              Cancel a job from the queue
scontrol hold <jobid>        Put a job on hold in the queue
scontrol release             Release a job from hold
scontrol update              Change the attributes of a submitted job
scontrol requeue             Requeue a running, suspended, or finished Slurm batch job into pending state
scontrol show job <jobid>    Produce a very detailed report for the job
sacct -k, --timelimit-min    Only send data about jobs with this time limit
sacct -A account_list        Display jobs for the accounts given as a comma-separated list
sstat                        Display information about CPU, Task, Node, Resident Set Size, and Virtual Memory
sshare                       Display information about shares for a user, a repo, a job, a partition, etc.
sprio                        Display information about a job's scheduling priority from multi-factor priority components


Submit a job:

> sbatch [opts] job_script
> salloc [opts] <command>  (interactive job)
where:
[opts] --> --nodes=<nodes_no> --ntasks-per-node=<tasks_per_node_no> --account=<account_no> --partition=<name> ...

job_script is a SLURM batch job.

The second command starts a so-called "interactive job": with salloc the user allocates a set of resources (nodes, cores, etc.). The job is queued and scheduled like any SLURM batch job, but when it is executed with srun, the standard input, output, and error streams of the job are connected to the terminal session in which salloc is running. When the job begins its execution, all the input to the job is taken from the terminal session. You can use CTRL-D or "exit" to close the session.
If you specify a command at the end of your salloc string (like "./myscript"), the job simply executes the command and closes, printing the standard output and error directly on your working terminal.

WARNING: interactive jobs with SLURM are quite delicate. With salloc, your prompt does not tell you that you are working on a compute node, so it is easy to forget that an interactive job is running. Furthermore, deleting the job with "scancel" from inside the job itself will not log you out of the nodes, and it will invalidate your interactive session, because every command then looks for a jobid that no longer exists. If you are stuck in this situation, you can always return to your original front-end session with "CTRL-D" or "exit".
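Since the prompt gives no hint, one possible safeguard is to check whether SLURM_JOB_ID is set in the current shell. The following is a minimal sketch, assuming a bash-compatible shell; the function name slurm_session_status is hypothetical and not part of SLURM.

```shell
#!/bin/bash
# Report whether the current shell carries a SLURM job environment.
# SLURM_JOB_ID is exported by SLURM to batch jobs and to salloc sessions.
slurm_session_status() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "inside SLURM job ${SLURM_JOB_ID}"
    else
        echo "not inside a SLURM job"
    fi
}

slurm_session_status
```

Note that the variable only reflects the environment in which the shell was started: it may stay set after the job has been cancelled, so "squeue --job <job_id>" remains the way to confirm that the job is still alive.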


Displaying Job Status:

> squeue                           (lists all jobs, default format)
> squeue --format=...              (lists all jobs, more readable format)
> squeue -u $USER                  (lists only jobs submitted by you)
> squeue --job <job_id>            (only the specified job)
> squeue --job <job_id> -l         (full display of the specified job)
> scontrol show job <job_id>       (detailed information about your job)


Displaying Queue Status:

The command sinfo displays information about nodes and partitions (queues).

It offers several options; here is a template that you may find useful.

> sinfo -o "%20P %10a %10l %15F %10z"

This displays a straightforward summary: the available partitions, their status, their time limit, the node counts as A/I/O/T (allocated, idle, other, total) and the node specifications as S:C:T (sockets:cores:threads).
The numbers in the format string are field widths and should be chosen so that the data fit.

Other useful options are:

> sinfo
> sinfo -p <partition>  (information restricted to the specified partition, e.g. gll_usr_prod)
> sinfo -d              (information about the offline nodes; the list of available partitions is also easier to read)
> sinfo --all           (displays more details, including partitions that are hidden by default)
> sinfo -i <n>          (top-like display, refreshed every "n" seconds)
> sinfo -l or --long    (displays additional information, such as the reason why specific nodes are down/drained; usually used together with -N)
> sinfo -n <node>       (shows information about a specific node, e.g. sinfo -N -n r033c01s01)

To view a complete list of all options and their descriptions, use man sinfo, or access the SchedMD webpage.


Delete a job:

> scancel <jobID> 

More information about these commands is available with the man command.


The User Environment

There are a number of environment variables provided to the SLURM job. Some of them are taken from the user's environment and carried with the job. Others are created by SLURM.

All SLURM-provided environment variable names start with the characters SLURM_.

Some of the more useful variables are listed below, with typical values as examples:

SLURM_JOB_NAME=job
SLURM_NNODES (or SLURM_JOB_NUM_NODES)=2
SLURM_JOBID (or SLURM_JOB_ID)=453919
SLURM_JOB_NODELIST=node1,node2,...
SLURM_SUBMIT_DIR=/marconi_scratch/userexternal/username
SLURM_SUBMIT_HOST=node1
SLURM_CLUSTER_NAME=cluster1
SLURM_JOB_PARTITION=partition1

You can use these environment variables to make your job scripts more flexible. For example, SLURM_JOB_NAME holds the name of the SLURM job. Another commonly used variable is SLURM_SUBMIT_DIR, which contains the name of the directory from which the user submitted the job.
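As an illustration, a job script might print a short header built from these variables and move explicitly to the submission directory. This is only a sketch: the function name print_job_header and the fallback values after ":-" are assumptions, used so that the snippet also runs outside a job.

```shell
#!/bin/bash
# Print a one-line job header from SLURM-provided variables.
# The fallbacks after ":-" apply only when the script runs outside a SLURM job.
print_job_header() {
    echo "job=${SLURM_JOB_NAME:-none} id=${SLURM_JOB_ID:-none} nodes=${SLURM_JOB_NUM_NODES:-0}"
}

# Work from the directory the job was submitted from.
cd "${SLURM_SUBMIT_DIR:-$PWD}" || exit 1
print_job_header
```

With the example values listed above, the header would read "job=job id=453919 nodes=2".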

WARNING: $SLURM_JOB_NODELIST shows the node names in a contracted form: for consecutive nodes you get their range instead of the full list. The square brackets contain the IDs of the first and the last node of the range, and all the nodes between them are also part of the actual node list.
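When a full list is needed, the contracted form can be expanded with "scontrol show hostnames", which prints one host name per line. A minimal sketch follows; the wrapper name expand_nodelist and the sample value node[1-3] are hypothetical, and the fallback branch exists only so that the snippet does not fail on a machine without SLURM.

```shell
#!/bin/bash
# Print one host name per line for a contracted SLURM node list.
# Falls back to echoing the input unchanged when scontrol is not available.
expand_nodelist() {
    nodelist="$1"
    if command -v scontrol >/dev/null 2>&1; then
        scontrol show hostnames "$nodelist"
    else
        printf '%s\n' "$nodelist"
    fi
}

# node[1-3] is a made-up value used only outside a job.
expand_nodelist "${SLURM_JOB_NODELIST:-node[1-3]}"
```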

Job TMPDIR:

When a job starts, a temporary area is defined on the storage local to each compute node:

TMPDIR=/scratch_local/slurm_job.$SLURM_JOB_ID

which can be used exclusively by the job's owner. During your jobs, you can access the area through the (local) variable $TMPDIR. The directory is removed at the end of the job, so remember to copy any data stored there to a permanent directory before the job ends. Please note that the area is located on local disks, so it can be accessed only by the processes running on that node. For multinode jobs, if you need all the processes to access some data, please use the shared filesystems $HOME, $WORK, $CINECA_SCRATCH.
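A typical pattern is to run inside $TMPDIR and copy the output back to a shared filesystem before the job finishes. The sketch below assumes a single-node job; the helper name run_in_tmpdir and the "results" directory are hypothetical.

```shell
#!/bin/bash
# run_in_tmpdir <result_dir> <command...>
# Runs the command inside the node-local $TMPDIR, then copies everything
# found there to <result_dir>, which should be on a shared filesystem.
run_in_tmpdir() {
    result_dir="$1"; shift
    if [ -z "${TMPDIR:-}" ]; then
        echo "TMPDIR is not set" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$TMPDIR" && "$@" ) || return 1
    mkdir -p "$result_dir" && cp -r "$TMPDIR"/. "$result_dir"/
}

# Possible use inside a job script (paths are placeholders):
# run_in_tmpdir "$SLURM_SUBMIT_DIR/results" "$SLURM_SUBMIT_DIR/my_application"
```

The application is called with an absolute path here because the command runs from $TMPDIR, not from the submission directory.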


SLURM Resources

A job requests resources through the SLURM syntax; SLURM matches requested resources with the available ones, according to the rules defined by the administrator. When the resources are allocated to the job, the job can be executed.

There are different types of resources: server level resources, like walltime; chunk resources, like the number of cpus or nodes; and generic resources (GRES), like GPUs on the systems that have them.

Other resources may be added to manage access to software resources, for example when such resources are limited and their unavailability would cause jobs to abort once they are scheduled for execution. More details may be found in the module help of the application you are trying to execute.

The syntax of the request depends on the type of resource:

#SBATCH --<resource>=<value>          (server level resources, e.g. walltime)
#SBATCH --<chunk_resource>=<value>    (chunk resources, e.g. cpus, nodes,...)
#SBATCH --gres=gpu:<value>            (generic resources, e.g. gpus)


For example:

#SBATCH --time=10:00:00
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:2

Resources can be requested either:

1) using SLURM directives in the job script

2) using options of the sbatch/salloc command
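The second form passes the same requests as options on the command line instead of as #SBATCH lines. A small sketch, assuming a job script called job.sh; the wrapper submit_job and its DRY_RUN switch are hypothetical and serve only to print the command before it is actually submitted.

```shell
#!/bin/bash
# submit_job <script> [sbatch options...]
# With DRY_RUN=1 (the default here) the sbatch command is printed, not run.
submit_job() {
    script="$1"; shift
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "sbatch $* $script"
    else
        sbatch "$@" "$script"
    fi
}

# Same requests as the directives in the example above.
submit_job job.sh --time=10:00:00 --ntasks-per-node=1 --gres=gpu:2
```

When an option is given both on the command line and in the script, sbatch is expected to use the command line value, which makes this form convenient for one-off changes to an existing script.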

...

More specific information about partitions and QOS (quality of service), limits, and available features is given on the "system specific" pages of this Guide, for MARCONI, MARCONI100 and GALILEO100, as well as in the "man" pages of the SLURM commands:

...