Version 4.5
Version 4.5 of the code is pure MPI, while version 4.6 adopts a mixed MPI-OpenMP approach for the parallel computation. The number of MPI processes per node is selected with the --ranks-per-node option of runjob, while the number of OpenMP threads per MPI process is set with the -ntomp flag of mdrun.
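As a rough sketch of how the two settings combine (the node count, thread count, and input file name here are illustrative placeholders, not one of the benchmark configurations below): with 2 nodes, 16 MPI ranks per node, and 4 OpenMP threads per rank, all 64 hardware threads of each Blue Gene/Q node are used.
mdrun=$(which mdrun_bgq)
# 2 nodes x 16 ranks per node = 32 MPI tasks
# 16 ranks per node x 4 OpenMP threads = 64 hardware threads per node
runjob -n 32 --ranks-per-node 16 --env-all : $mdrun -s topol.tpr -ntomp 4   # topol.tpr is a placeholder input file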
As a first example, we have chosen a small system simulated by means of Replica Exchange to exploit the large number of cores available on Fermi. As shown in Graphic 1, the scaling is good up to bg_size = 1024, which makes it possible to simulate about 2.3 microseconds per day.
Graphic 1: GROMACS performance (simulation time in ns/day) as a function of the number of cores, using 256 replicas of the system
RUN REMD GMX 4.5
bg_size=128                       # number of Blue Gene/Q nodes requested
mdrun=$(which mdrun_bgq)
nmulti="256"                      # number of replicas; one tpr per replica (remd_0.tpr, remd_1.tpr, ...)
replex="2500"                     # attempt a replica exchange every 2500 MD steps
exe="$mdrun -s remd_ -multi $nmulti -replex $replex"
# launch mdrun on all the back-end nodes (note the : which must be present)
# 128 nodes x 32 ranks per node = 4096 MPI tasks
runjob -n 4096 --ranks-per-node 32 --env-all : $exe
In this case we have used --ranks-per-node 32, but 64 could be more efficient for other systems. Additional benchmarks for larger systems will be published soon.
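For example, a variant of the REMD launch above using 64 ranks per node would keep everything else unchanged and simply double the number of MPI tasks (128 nodes x 64 ranks per node = 8192, i.e. 32 MPI tasks per replica); this is a sketch, not a benchmarked configuration:
# 128 nodes x 64 ranks per node = 8192 MPI tasks
runjob -n 8192 --ranks-per-node 64 --env-all : $exe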
Version 4.6
The effect of the mixed MPI+OpenMP parallelization scheme is shown in Graphic 2, where the product of MPI processes and threads per node is kept constant at the maximum available (64). These data suggest using between 4 and 16 threads per MPI process (and therefore between 16 and 4 MPI processes per node) in order to achieve the best performance.
Graphic 2: GROMACS performance (simulation time in ns/day) as a function of the number of OpenMP threads per MPI process, with the product of MPI processes and threads kept constant at 64
RUN SINGLE MD GMX 4.6
mdrun=$(which mdrun_bgq)
exe="$mdrun -s remd_0.tpr -ntomp 4"   # 4 OpenMP threads per MPI rank
# 4 MPI ranks on a single node, each running 4 OpenMP threads
runjob -n 4 --ranks-per-node 4 --env-all : $exe
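To fill all 64 hardware threads of the node, as suggested by Graphic 2, the same single-node run could be launched, for instance, with 16 ranks per node and 4 OpenMP threads each; the following is a sketch of one such combination, not a benchmarked configuration:
mdrun=$(which mdrun_bgq)
exe="$mdrun -s remd_0.tpr -ntomp 4"
# 16 MPI ranks on one node x 4 OpenMP threads each = 64 hardware threads
runjob -n 16 --ranks-per-node 16 --env-all : $exe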