...

| SLURM partition | Job QOS | # cores / # GPUs per job | max walltime | max running jobs per user / max n. of cpus/nodes/GPUs per user | priority | notes |
|---|---|---|---|---|---|---|
| m100_all_serial (def. partition) | normal | max = 1 core, 1 GPU, max mem = 7600 MB | 04:00:00 | 4 cpus / 1 GPU | 40 | |
| m100_usr_prod | m100_qos_dbg | max = 2 nodes | 02:00:00 | 2 nodes / 64 cpus / 8 GPUs | 45 | runs on 12 nodes |
| m100_usr_prod | normal | max = 16 nodes | 24:00:00 | | 40 | runs on 880 nodes |
| m100_usr_prod | m100_qos_bprod | min = 17 nodes, max = 256 nodes | 24:00:00 | 256 nodes | 85 | runs on 512 nodes |
| m100_usr_preempt | normal | max = 16 nodes | 24:00:00 | | 1 | runs on 99 nodes |
| m100_fua_prod (EUROFUSION) | m100_qos_fuadbg | max = 2 nodes | 02:00:00 | | 45 | runs on 12 nodes |
| m100_fua_prod (EUROFUSION) | normal | max = 16 nodes | 24:00:00 | | 40 | runs on 68 nodes |
| m100_fua_prod (EUROFUSION) | m100_qos_fuabprod | max = 32 nodes | 24:00:00 | | 40 | runs on 64 nodes at the same time |
| all partitions | qos_special | > 32 nodes | > 24:00:00 | | 40 | request to superc@cineca.it |
| all partitions | qos_lowprio | max = 16 nodes | 24:00:00 | | 0 | active projects with exhausted budget |
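For example, a short test job targeting the debug QOS of m100_usr_prod could begin with the following sketch, staying within the limits listed above (the account name and executable are placeholders):

```shell
#!/bin/bash
# Sketch: 2-node debug job within the m100_qos_dbg limits (max 2 nodes, 02:00:00).
#SBATCH --partition=m100_usr_prod
#SBATCH --qos=m100_qos_dbg
#SBATCH --nodes=2
#SBATCH --time=00:30:00
#SBATCH --account=<account_name>   # placeholder: your project account

srun ./my_application              # placeholder executable
```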

M100 specific information

In the following we report information specific to M100, as well as examples suited for this kind of system.

Each node exposes itself to SLURM as having 128 (virtual) cpus, 4 GPUs and 246,000 MB of memory. SLURM assigns nodes in a shared way, giving each job only the resources it requests and allowing multiple jobs to run on the same node(s). If you want the node(s) in exclusive mode, use the SLURM option “--exclusive” together with “--gres=gpu:4”.
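For instance, a job header requesting a full node in exclusive mode might look like the following sketch (the account name and executable are placeholders):

```shell
#!/bin/bash
# Sketch: request one full M100 node in exclusive mode.
#SBATCH --partition=m100_usr_prod
#SBATCH --nodes=1
#SBATCH --exclusive               # reserve the whole node for this job
#SBATCH --gres=gpu:4              # together with --exclusive, all 4 GPUs
#SBATCH --account=<account_name>  # placeholder: your project account

./my_gpu_application              # placeholder executable
```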

The maximum memory that can be requested is 246,000 MB (average memory per physical core ~ 7 GB); this value guarantees that no memory swapping will occur.

Even if the nodes are shared among users, exclusivity is guaranteed for each single physical core and each single GPU. When you ask for “tasks” (--ntasks-per-node), SLURM grants the requested number of (virtual) cpus rounded up to a multiple of four. For example:

#SBATCH --ntasks-per-node=1  (or 2, 3, 4)    → 1 core
#SBATCH --ntasks-per-node=13 (or 14, 15, 16) → 4 cores
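The rounding rule above (4 virtual cpus, i.e. one physical core, per group of up to 4 tasks) can be sketched with a small shell helper; `cores_for_tasks` is a hypothetical name, not a SLURM command:

```shell
# Hypothetical helper mirroring SLURM's rounding on M100 (SMT4):
# the number of physical cores is the task count divided by 4, rounded up.
cores_for_tasks() {
  local ntasks=$1
  echo $(( (ntasks + 3) / 4 ))
}

cores_for_tasks 1    # prints 1  (tasks 1-4 fit on one physical core)
cores_for_tasks 13   # prints 4  (tasks 13-16 need four physical cores)
```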

By default the number of (virtual) cpus per task is one, but you can change it:

#SBATCH --ntasks-per-node=8  
#SBATCH --cpus-per-task=4

In this way each task will correspond to one (physical) core.
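Putting the two directives together, a complete job header could look like this sketch (the account name and executable are placeholders):

```shell
#!/bin/bash
# Sketch: 8 tasks per node, each bound to one physical core (4 virtual cpus).
#SBATCH --partition=m100_usr_prod
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=4         # one physical core per task
#SBATCH --account=<account_name>  # placeholder: your project account

srun ./my_application             # placeholder executable
```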

Users with exhausted but still active projects are allowed to keep using the cluster resources, even if at a very low priority, by adding the  "qos_lowprio" flag to their job:

#SBATCH --qos=qos_lowprio

This QOS is automatically associated with Eurofusion users once their projects exhaust their budget before the expiry date. All other users should ask superc@cineca.it for the QOS association.



The preemptable partition m100_usr_preempt allows users to access additional nodes of the m100_fua_prod partition in preemptable mode (if available and not in use by the Eurofusion community). Jobs submitted to the m100_usr_preempt partition may be killed if the assigned resources are requested by jobs submitted to the higher-priority partition (m100_fua_prod); hence we recommend its use only with restartable applications.


Eurofusion users can also use the computing resources at low priority before their budget is exhausted, in case they wish to run non-urgent jobs without consuming the budget of the granted project. Please ask superc@cineca.it to be added to the account FUAC4_LOWPRIO, and specify this account and the qos_lowprio QOS in your submission script.
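As a sketch, such a submission script would combine the account and the QOS, for example (node count, walltime and executable are placeholders):

```shell
#!/bin/bash
# Sketch: Eurofusion low-priority job that does not consume the project budget.
#SBATCH --partition=m100_fua_prod
#SBATCH --account=FUAC4_LOWPRIO   # granted on request to superc@cineca.it
#SBATCH --qos=qos_lowprio
#SBATCH --nodes=1
#SBATCH --time=01:00:00

./my_application                  # placeholder executable
```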



Submitting serial batch jobs

...