...

On some clusters (for example on GALILEO100 or LEONARDO) you can choose to allocate only part of a node for your job; you are not forced to allocate all of it, as happens on clusters (like MARCONI) running in exclusive mode. In this case, the accounting procedure also takes into account the amount of memory you request for your job. If you ask for an amount of memory larger than the share corresponding to the number of cores requested, the job will be billed for a larger number of cores than the ones you have reserved.
The billing always follows the basic idea illustrated above, but a generalized parameter for the number of reserved cores, accounting for the memory request, is now used:

...

This rule applies to each cluster based on its total amount of memory and cores.
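
By analogy with the LocalStorageFactor defined in the next section, the memory rule can be sketched in a few lines of Python. This is a hypothetical illustration, not the accounting system's actual code, and the 48-core / 384 GB node figures are assumptions used only as defaults:

    def accounted_hours(elapsed_hours, reserved_cores, reserved_mem_gb,
                        total_cores=48, total_mem_gb=384):
        # Fraction of the node reserved via cores vs. via memory.
        core_fraction = reserved_cores / total_cores
        mem_fraction = reserved_mem_gb / total_mem_gb
        # The memory factor is 1 while the memory request stays within
        # the share granted by the reserved cores; otherwise the job is
        # billed for the larger node fraction pinned by the memory.
        mem_factor = max(1.0, mem_fraction / core_fraction)
        return elapsed_hours * reserved_cores * mem_factor

    # E.g. 1 core but half of the node's memory, for 10 hours:
    print(accounted_hours(10, 1, 192))   # 1 core x factor 24 x 10 h = 240.0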

"Non-exclusive" nodes: TMPDIR matters (LEONARDO only)!

On LEONARDO, the job's TMPDIR local area is managed by the slurm job_container/tmpfs plugin and can be explicitly requested on the diskful nodes. In this case, the accounting procedure also takes into account the amount of local storage (gres/tmpfs) you request for your job. If you ask for an amount of local storage larger than the share corresponding to the number of cores requested, the job will be billed for a larger number of cores than the ones you have reserved.
The billing always follows the basic idea illustrated above, but a generalized parameter for the number of reserved cores, accounting for the local storage request, is now used:

accounted hours = ElapsedTime x ReservedCoresEquiv

where

ReservedCoresEquiv = ReservedCores x LocalStorageFactor

  • LocalStorageFactor = 1
    If the local storage you ask for (in terms of the equivalent number of cores) is smaller than or equal to the number of reserved cores. In this case, the amount of cpu-hours billed depends only on the number of requested cores.
  • LocalStorageFactor = (ReservedLocalStorage / TotalLocalStorage) / (ReservedCores / TotalCores)
If the local storage you ask for (in terms of the equivalent number of cores) is larger than the number of reserved cores. In this case, the amount of cpu-hours billed also depends on the amount of local storage requested (i.e. the actual percentage of the node allocated).

For example, on the LEONARDO DCGP partition, the TotalLocalStorage used to calculate the LocalStorageFactor is 3 TB, and each compute node has 112 cores:

  • TotalLocalStorage = 3 TB
  • TotalCores = 112

If you ask for only one core and 1 TB of local storage (thus allocating for yourself one third of the node's local storage even though you are using a single core), the LocalStorageFactor is:

  • ReservedLocalStorage = 1 TB
  • ReservedCores = 1
  • → LocalStorageFactor = (1 TB / 3 TB) / (1 / 112) = 112/3 ≈ 37

Hence, with such a request, for each hour of computation your budget will be billed for 37 equivalent CPUs, i.e., for 37 core-hours.
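
The same computation, expressed as a short Python sketch (a hypothetical helper, not the accounting system's code; the 3 TB and 112-core defaults are the DCGP figures quoted above):

    def local_storage_factor(reserved_cores, reserved_storage_tb,
                             total_cores=112, total_storage_tb=3.0):
        core_fraction = reserved_cores / total_cores
        storage_fraction = reserved_storage_tb / total_storage_tb
        # The factor stays at 1 while the storage request is covered
        # by the share corresponding to the reserved cores.
        return max(1.0, storage_fraction / core_fraction)

    # Worked example from the text: 1 core and 1 TB of local storage.
    factor = local_storage_factor(reserved_cores=1, reserved_storage_tb=1.0)
    print(factor)   # (1/3) / (1/112) = 112/3 ~ 37.3, billed as 37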

At present the slurm job_container/tmpfs plugin is ONLY enabled on LEONARDO.

Accounting and accelerators

Recently, the accounting system has been extended to nodes equipped with accelerators. The principle is the same as for memory accounting: asking for a number of accelerators that makes you allocate a bigger portion of the node than is implied by the number of cores requested will increase the consumption accordingly.

For LEONARDO, every GPU is treated as 8 cores in terms of accounting. That is because GPU nodes have 32 CPUs and 4 GPUs each, so allocating 1 GPU is equivalent to allocating a quarter of the node (i.e. 8 CPUs).

Some examples based on LEONARDO (1 node), with a short sketch of the rule after the list:

  • cpus=24, gpus=1 ==> the number of GPUs requested is equal to having requested 8 CPUs, but since 24 CPUs have been requested in the standard way, the GPU request is not taken into account. Thus 24 CPUs will be billed;
  • cpus=6, gpus=1 ==> the number of GPUs requested is equal to having requested 8 CPUs, which is higher than the number of CPUs requested. Thus 8 CPUs will be billed;
  • cpus=24, gpus=4 ==> the number of GPUs requested is equal to having requested 32 CPUs, while the 24 CPUs requested in the standard way are not enough to cover the GPU request. Therefore 32 CPUs will be billed;
  • cpus=24, gpus=1, mem=500GB ==> the situation is similar to the first example (so 24 CPUs billed), but the memory request is higher than what is guaranteed by the simple allocation of the CPUs or GPUs, since it is equivalent to allocating the entire node. So, 32 CPUs will be billed.
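
As referenced above, here is a short Python sketch of this rule. It is a hypothetical illustration: the node figures (32 CPUs, 4 GPUs, roughly 500 GB of requestable memory, so one GPU counts as 8 CPUs) are assumptions read off the examples, and the real accounting system may round differently:

    import math

    # Assumed LEONARDO GPU-node figures, taken from the examples above.
    TOTAL_CPUS, TOTAL_GPUS, TOTAL_MEM_GB = 32, 4, 500
    CPUS_PER_GPU = TOTAL_CPUS // TOTAL_GPUS   # 8

    def billed_cpus(cpus, gpus=0, mem_gb=0):
        # The job is billed for the largest node fraction implied by
        # any one of the requested resources.
        gpu_equiv = gpus * CPUS_PER_GPU
        mem_equiv = math.ceil(mem_gb / TOTAL_MEM_GB * TOTAL_CPUS)
        return max(cpus, gpu_equiv, mem_equiv)

    # The four examples from the list above:
    print(billed_cpus(cpus=24, gpus=1))              # 24
    print(billed_cpus(cpus=6,  gpus=1))              # 8
    print(billed_cpus(cpus=24, gpus=4))              # 32
    print(billed_cpus(cpus=24, gpus=1, mem_gb=500))  # 32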

Low priority production jobs for active projects with exhausted budget

...