...

> saldo -b (reports projects defined on M100)

Accounting of GPU resources

Please note that the accounting of the consumed core hours takes into account the requested memory and the number of GPUs; please refer to the dedicated section.
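As a rough, unofficial sketch of the idea (assuming each node offers 32 physical cores and 4 GPUs, so one GPU is accounted as 8 cores, and assuming that the largest of the core and GPU requests determines the charge; memory is omitted here for simplicity and the authoritative rule is the one described in the dedicated accounting section):

cores=1; gpus=1; hours=1                               # resources requested by the job
gpu_equiv=$(( gpus * 32 / 4 ))                         # 1 GPU accounted as 8 cores
charged=$(( cores > gpu_equiv ? cores : gpu_equiv ))   # the largest request drives the charge
echo "accounted core-hours: $(( charged * hours ))"    # prints 8 for 1 core + 1 GPU for 1 hour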

Budget Linearization policy

...

This is allowed only for very short runs on the login nodes, since the interactive environment has a 10-minute time limit.
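For example, a short serial test of a (hypothetical) executable myprogram, completing well within the 10-minute limit, can be launched directly at the login prompt:

> ./myprogram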

A serial (or multithreaded) program using GPUs and needing more than 10 minutes can be executed interactively within an "Interactive" SLURM batch job, using the "srun" command: the job is queued and scheduled as any other job but, when executed, the remote standard input, output, and error streams are connected to the terminal session from which srun was launched.

...

> srun -N1 --ntasks-per-node=1 --gres=gpu:1 -A <account_name> --time=01:00:00 --pty /bin/bash

SLURM will then schedule your job to start, and your shell will be unresponsive until free resources are allocated for you. When the shell comes back with the prompt (the hostname at the prompt will be that of the assigned node), launch the program in the standard way:

> ./program

Please note that the accounting of the consumed core hours takes into account also the memory and the number of requested GPUs (see the dedicated section). For instance, a job using one core and one GPU for one hour (with the default memory per core) will consume 8 core-hours (each node being equipped with 32 physical cores and 4 V100 GPUs).

A parallel (MPI) program using GPUs and needing more than 10 minutes can as well be executed within an "Interactive" SLURM batch job, using the "salloc" command in place of "srun --pty bash". For example, to start an interactive session with the MPI program myprogram, using one node, two processes, and two GPUs, launch the command:

> salloc -N1 --ntasks-per-node=2 --gres=gpu:2 -A <account_name> --time=01:00:00


SLURM will then schedule your job to start, and your shell will be unresponsive until free resources are allocated for you. The job is queued and scheduled as any other job and, when executed, a new session starts on the login node from which salloc was launched (the hostname at the prompt will be that of the login node). You can now run your parallel program on the assigned compute node(s) as in any SLURM parallel job:

> srun ./myprogram

or

> mpirun ./myprogram

srun/mpirun will dispatch the tasks of the program myprogram to the assigned compute node, i.e., the tasks do not run on the login node hosting the salloc session.
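As a quick, purely illustrative check, compare the output of hostname typed at the salloc prompt, which prints the login node name, with srun hostname, which prints the name of the assigned compute node(s):

> hostname

> srun hostname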

Please note that:

1) The recommended way to launch parallel tasks in SLURM jobs is with srun. By using srun instead of mpirun you will get full support for process tracking, accounting, task affinity, suspend/resume, and other features.

2) Controlling process and thread affinity is crucial to ensure optimal performance on M100. Do not rely on SLURM auto-affinity; use the appropriate SLURM --cpu-bind option.
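As a minimal illustration (to be adapted to your actual MPI/OpenMP layout), tasks can be explicitly bound to physical cores with:

> srun --cpu-bind=cores ./myprogram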

SLURM automatically exports the environment variables defined in the source shell, so if you need to run your program "myprogram" in a controlled environment (e.g. specific library paths or options), you can prepare the environment in the originating shell and be sure to find it again in the interactive shell (started with either srun or salloc).
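For example (with a purely hypothetical library path), a variable exported before requesting the interactive session is also visible inside it:

> export LD_LIBRARY_PATH=/path/to/my/lib:$LD_LIBRARY_PATH

> srun -N1 --ntasks-per-node=1 --gres=gpu:1 -A <account_name> --time=01:00:00 --pty /bin/bash

> echo $LD_LIBRARY_PATH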

Batch

The information reported here refers to the general user M100 partition. The production environment for EUROfusion users is discussed in a separate document.

...