...

| SLURM partition | Job QOS | # cores / # GPUs per job | max walltime | max running jobs per user / max n. of cpus/nodes/GPUs per user | priority | notes |
|---|---|---|---|---|---|---|
| m100_all_serial (def. partition) | normal | max = 1 core, 1 GPU, max mem = 7600 MB | 04:00:00 | 4 cpus / 1 GPU | 40 | |
| m100_usr_prod | m100_qos_dbg | max = 2 nodes | 02:00:00 | 2 nodes / 64 cpus / 8 GPUs | 45 | runs on 12 nodes |
| m100_usr_prod | normal | max = 16 nodes | 24:00:00 | | 40 | runs on 880 nodes |
| m100_usr_prod | m100_qos_bprod | min = 17 nodes, max = 256 nodes | 24:00:00 | 256 nodes | 85 | runs on 512 nodes |
| m100_usr_preempt | normal | max = 16 nodes | 24:00:00 | | 1 | runs on 99 nodes |
| m100_fua_prod (EUROFUSION) | m100_qos_fuadbg | max = 2 nodes | 02:00:00 | | 45 | runs on 12 nodes |
| m100_fua_prod (EUROFUSION) | normal | max = 16 nodes | 24:00:00 | | 40 | runs on 68 nodes |
| m100_fua_prod (EUROFUSION) | m100_qos_fuabprod | max = 32 nodes | 24:00:00 | | 40 | runs on 64 nodes at the same time |
| all partitions | qos_special | > 32 nodes | > 24:00:00 | | 40 | request to superc@cineca.it |
| all partitions | qos_lowprio | max = 16 nodes | 24:00:00 | | 0 | active projects with exhausted budget |
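For example, a short test job targeting the debug QOS of m100_usr_prod could begin with the following sketch, staying within the limits listed above (the account name and executable are placeholders):

```shell
#!/bin/bash
# Sketch: 2-node debug job within the m100_qos_dbg limits (max 2 nodes, 02:00:00).
#SBATCH --partition=m100_usr_prod
#SBATCH --qos=m100_qos_dbg
#SBATCH --nodes=2
#SBATCH --time=00:30:00
#SBATCH --account=<account_name>   # placeholder: your project account

srun ./my_application              # placeholder executable
```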

M100 specific information

In the following we report information specific to M100, as well as examples suited for this kind of system.

Each node exposes itself to SLURM as having 128 (virtual) cpus, 4 GPUs and 246,000 MB of memory. SLURM assigns nodes in a shared way, giving each job only the resources it requests and allowing multiple jobs to run on the same node(s). If you want the node(s) in exclusive mode, use the SLURM option “--exclusive” together with “--gres=gpu:4”.
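For instance, a job header requesting a full node in exclusive mode might look like the following sketch (the account name and executable are placeholders):

```shell
#!/bin/bash
# Sketch: request one full M100 node in exclusive mode.
#SBATCH --partition=m100_usr_prod
#SBATCH --nodes=1
#SBATCH --exclusive               # reserve the whole node for this job
#SBATCH --gres=gpu:4              # together with --exclusive, all 4 GPUs
#SBATCH --account=<account_name>  # placeholder: your project account

./my_gpu_application              # placeholder executable
```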

The maximum memory that can be requested is 246,000 MB (average memory per physical core ~ 7 GB); this value guarantees that no memory swapping will occur.

Even if the nodes are shared among users, exclusivity is guaranteed for each single physical core and each single GPU. When you ask for “tasks” (--ntasks-per-node), SLURM grants the requested number of (virtual) cpus rounded up to a multiple of four. For example:

#SBATCH --ntasks-per-node=1  (or 2, 3, 4)    → 1 core
#SBATCH --ntasks-per-node=13 (or 14, 15, 16) → 4 cores
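The rounding rule above (4 virtual cpus, i.e. one physical core, per group of up to 4 tasks) can be sketched with a small shell helper; `cores_for_tasks` is a hypothetical name, not a SLURM command:

```shell
# Hypothetical helper mirroring SLURM's rounding on M100 (SMT4):
# the number of physical cores is the task count divided by 4, rounded up.
cores_for_tasks() {
  local ntasks=$1
  echo $(( (ntasks + 3) / 4 ))
}

cores_for_tasks 1    # prints 1  (tasks 1-4 fit on one physical core)
cores_for_tasks 13   # prints 4  (tasks 13-16 need four physical cores)
```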

By default the number of (virtual) cpus per task is one, but you can change it:

#SBATCH --ntasks-per-node=8  
#SBATCH --cpus-per-task=4

In this way each task will correspond to one (physical) core.
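Putting the two directives together, a complete job header could look like this sketch (the account name and executable are placeholders):

```shell
#!/bin/bash
# Sketch: 8 tasks per node, each bound to one physical core (4 virtual cpus).
#SBATCH --partition=m100_usr_prod
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=4         # one physical core per task
#SBATCH --account=<account_name>  # placeholder: your project account

srun ./my_application             # placeholder executable
```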

Users with exhausted but still active projects are allowed to keep using the cluster resources, even if at a very low priority, by adding the  "qos_lowprio" flag to their job:

#SBATCH --qos=qos_lowprio

This QOS is automatically associated with Eurofusion users once their projects exhaust their budget before the expiry date. All other users should ask superc@cineca.it for the QOS association.



The preemptable partition m100_usr_preempt allows users to access additional nodes of the m100_fua_prod partition in preemptable mode (if available and not in use by the Eurofusion community). Jobs submitted to the m100_usr_preempt partition may be killed if the assigned resources are requested by jobs submitted to the higher-priority partition (m100_fua_prod); hence we recommend its use only with restartable applications.


Eurofusion users can also use the computing resources at low priority before their budget is exhausted, in case they wish to run non-urgent jobs without consuming the budget of the granted project. Please ask superc@cineca.it to be added to the account FUAC4_LOWPRIO, and specify this account and the qos_lowprio QOS in your submission script.
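As a sketch, such a submission script would combine the account and the QOS, for example (node count, walltime and executable are placeholders):

```shell
#!/bin/bash
# Sketch: Eurofusion low-priority job that does not consume the project budget.
#SBATCH --partition=m100_fua_prod
#SBATCH --account=FUAC4_LOWPRIO   # granted on request to superc@cineca.it
#SBATCH --qos=qos_lowprio
#SBATCH --nodes=1
#SBATCH --time=01:00:00

./my_application                  # placeholder executable
```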



Submitting serial batch jobs

...