
In this page

  • FERMI
    • Version 2.9 (BG/Q optimized)
    • Version 2.9 with plumed 1.3

Version 2.9 (BG/Q optimized)

Input (from http://www.ks.uiuc.edu/Research/namd/utilities/):

STMV (virus) benchmark: 1,066,628 atoms, periodic boundary conditions, PME (config files available as 27.5M .tar.gz or .zip archives).

This version of the code, optimized by IBM, adopts a mixed MPI-thread approach to parallel computation. The number of MPI processes per node is selected with the --ranks-per-node option of the runjob launcher, while the number of threads per MPI process is set with the +ppn flag of namd2.

Having fixed the number of nodes with bg_size, we tested different combinations of the --ranks-per-node and +ppn options; Graphic 1 reports the best performance obtained for the STMV benchmark.
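
For reference, a minimal sketch of the corresponding LoadLeveler script (the keywords follow the standard BG/Q setup; job name, wall-clock limit, and paths are placeholders to adapt to your environment):

#!/bin/bash
# @ job_name = namd_stmv
# @ job_type = bluegene
# @ output = $(job_name).$(jobid).out
# @ error = $(job_name).$(jobid).err
# @ wall_clock_limit = 1:00:00
# @ bg_size = 128
# @ queue

# 4 MPI ranks per node on 128 nodes = 512 ranks, 16 threads each (64 threads per node)
runjob --np 512 --ranks-per-node 4 : $NAMD_HOME/namd2 stmv_ori.namd +ppn16 > stmv.out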

 

Graphic 1: NAMD performance (simulated ns per day) vs. the number of cores

 

Scalability is good up to the highest number of nodes tested.

 

In all cases except bg_size = 1024 and 2048, the best combination is ranks-per-node = 4 with +ppn16, or ranks-per-node = 8 with +ppn8 (both making use of the maximum of 64 threads per node).

For example, Graphic 2 reports the results for calculations with bg_size = 128 and the maximum number of threads (8192 in total), corresponding to the following running options:

runjob --np 2048 --ranks-per-node 16 :  $NAMD_HOME/namd2 stmv_ori.namd +ppn4 > stmv.out

runjob --np 1024 --ranks-per-node  8 :  $NAMD_HOME/namd2 stmv_ori.namd +ppn8 > stmv.out

runjob --np  512 --ranks-per-node  4 :  $NAMD_HOME/namd2 stmv_ori.namd +ppn16 > stmv.out

runjob --np  256 --ranks-per-node  2 :  $NAMD_HOME/namd2 stmv_ori.namd +ppn32 > stmv.out

runjob --np  128 --ranks-per-node  1 :  $NAMD_HOME/namd2 stmv_ori.namd +ppn64 > stmv.out
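
All of these combinations keep ranks-per-node × ppn = 64 threads per node. A shell loop sketch that reproduces the scan (output file names are illustrative):

for RPN in 1 2 4 8 16; do
  PPN=$(( 64 / RPN ))    # keep RPN * PPN = 64 threads per node
  NP=$(( 128 * RPN ))    # total MPI ranks for bg_size = 128
  runjob --np $NP --ranks-per-node $RPN : $NAMD_HOME/namd2 stmv_ori.namd +ppn$PPN > stmv_rpn${RPN}.out
done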

 

Note that the namd output can be misleading in how it labels processes and threads. For example, when running with these options:

runjob --np  512 --ranks-per-node  4 :  $NAMD_HOME/namd2 stmv_ori.namd +ppn16 > stmv.out

you can find this line in the namd output:

 Info: Running on 8192 processors, 512 nodes, 128 physical nodes.

where "nodes" actually counts the MPI processes and "processors" the total number of threads (512 ranks × 16 threads = 8192), while "physical nodes" reports the real number of nodes (512 / 4 = 128).
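
To recover the performance figure plotted in the graphics (ns per day), the benchmark lines of the namd output can be parsed; a sketch assuming the usual "Info: Benchmark time: ... s/step ... days/ns" format, which may vary between NAMD versions:

# convert the measured days/ns into ns/day
grep "Benchmark time:" stmv.out | awk '{ for (i = 1; i < NF; i++) if ($(i+1) == "days/ns") printf "%.2f ns/day\n", 1/$i }'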

 

Graphic 2: NAMD performance (simulated ns per day) vs. the number of ranks-per-node (fixed bg_size = 128 and fixed total number of threads)

 

The dependence on the number of threads per process at fixed bg_size and ranks-per-node is shown in Graphic 3. The plot clearly shows that the best performance is obtained with the maximum number of threads per node (64 = 16 threads per MPI process × 4 processes).
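
The scan of Graphic 3 corresponds to fixing --np and --ranks-per-node while varying only the +ppn value, e.g. (output file names are illustrative):

for PPN in 1 2 4 8 16; do
  # bg_size = 128, 4 ranks per node = 512 ranks; only the threads per rank change
  runjob --np 512 --ranks-per-node 4 : $NAMD_HOME/namd2 stmv_ori.namd +ppn$PPN > stmv_ppn${PPN}.out
done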

 

Graphic 3: NAMD performance (simulated ns per day) vs. the number of threads per MPI process (fixed bg_size = 128, fixed ranks-per-node = 4)

 

 

This is no longer true when the number of nodes increases. For example, Graphic 4 shows the same scan for bg_size = 1024 and 2048.

 

Graphic 4: NAMD performance (simulated ns per day) vs. the number of threads per MPI process (fixed bg_size = 1024 and 2048, fixed ranks-per-node = 4)


This graph shows that with bg_size = 1024 the best performance is obtained with two threads per core (i.e. +ppn8, giving 4 × 8 = 32 threads on the 16 cores of each node), while with bg_size = 2048 one thread per core (+ppn4) is already enough.

 

Note that even though the options --ranks-per-node = 4 and +ppn16 were the best over a wide range of cores for this benchmark, this may not hold for your system, since MD performance depends strongly on system size and topology.
