Version 4.5
Version 4.5 of the code is pure MPI, while version 4.6 adopts a mixed MPI-OpenMP approach for the parallel computation. The number of MPI processes per node is selected with the --ranks-per-node option of runjob, while the number of OpenMP threads per MPI process is set with the -ntomp flag of mdrun.
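As a rough sketch of how the two settings combine (the node count, thread count, and input file name here are illustrative placeholders, not one of the benchmark configurations below): with 2 nodes, 16 MPI ranks per node, and 4 OpenMP threads per rank, all 64 hardware threads of each Blue Gene/Q node are used.
mdrun=$(which mdrun_bgq)
# 2 nodes x 16 ranks per node = 32 MPI tasks
# 16 ranks per node x 4 OpenMP threads = 64 hardware threads per node
runjob -n 32 --ranks-per-node 16 --env-all : $mdrun -s topol.tpr -ntomp 4   # topol.tpr is a placeholder input file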
As a first example, we have chosen a small system simulated by means of Replica Exchange to exploit the large number of cores available on Fermi. As shown in Graphic 1, the scaling is good up to bg_size = 1024, which makes it possible to simulate about 2.3 microseconds per day.
Graphic 1: GROMACS performance (simulation time in ns/day) as a function of the number of cores, using 256 replicas of the system
RUN REMD GMX 4.5
bg_size=128                       # number of Blue Gene/Q nodes requested
mdrun=$(which mdrun_bgq)
nmulti="256"                      # number of replicas; one tpr per replica (remd_0.tpr, remd_1.tpr, ...)
replex="2500"                     # attempt a replica exchange every 2500 MD steps
exe="$mdrun -s remd_ -multi $nmulti -replex $replex"
# launch mdrun on all the back-end nodes (note the : which must be present)
# 128 nodes x 32 ranks per node = 4096 MPI tasks
runjob -n 4096 --ranks-per-node 32 --env-all : $exe
In this case we have used --ranks-per-node 32, but 64 could be more efficient for other systems. Additional benchmarks for larger systems will be published soon.
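For example, a variant of the REMD launch above using 64 ranks per node would keep everything else unchanged and simply double the number of MPI tasks (128 nodes x 64 ranks per node = 8192, i.e. 32 MPI tasks per replica); this is a sketch, not a benchmarked configuration:
# 128 nodes x 64 ranks per node = 8192 MPI tasks
runjob -n 8192 --ranks-per-node 64 --env-all : $exe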
Version 4.6
The effect of the mixed MPI+OpenMP parallelization scheme is shown in Graphic 2, where the product of MPI processes and threads per node is kept constant at the maximum available (64). These data suggest using between 4 and 16 threads per MPI process (and therefore between 16 and 4 MPI processes per node) in order to achieve the best performance.
Graphic 2: GROMACS performance (simulation time in ns/day) as a function of the number of OpenMP threads per MPI process, with the product of MPI processes and threads kept constant at 64
RUN SINGLE MD GMX 4.6
mdrun=$(which mdrun_bgq)
exe="$mdrun -s remd_0.tpr -ntomp 4"   # 4 OpenMP threads per MPI rank
# 4 MPI ranks on a single node, each running 4 OpenMP threads
runjob -n 4 --ranks-per-node 4 --env-all : $exe
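To fill all 64 hardware threads of the node, as suggested by Graphic 2, the same single-node run could be launched, for instance, with 16 ranks per node and 4 OpenMP threads each; the following is a sketch of one such combination, not a benchmarked configuration:
mdrun=$(which mdrun_bgq)
exe="$mdrun -s remd_0.tpr -ntomp 4"
# 16 MPI ranks on one node x 4 OpenMP threads each = 64 hardware threads
runjob -n 16 --ranks-per-node 16 --env-all : $exe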