Version 2.12 (BDW A1 partition) - work in progress
Input: (http://www.ks.uiuc.edu/Research/namd/utilities/)
STMV (virus) benchmark (1,066,628 atoms, periodic, PME)
Graphic 1: the NAMD performance (simulation time in ns/day) is reported vs. the increasing number of cores
Input: (http://www.ks.uiuc.edu/Research/namd/utilities/)
APOA1 (apolipoprotein A1) benchmark (92,224 atoms, periodic, PME)
Graphic 2: the NAMD performance (simulation time in ns/day) is reported vs. the increasing number of cores
MARCONI (KNL- A2 partition)
Problem
STMV (virus) benchmark (1,066,628 atoms, periodic, PME)
Code Block |
---|
namd2 +p 136 apoa1/apoa1.namd +pemap 0-135 |
NB: the SMP version of NAMD 2.12 was downloaded directly from the NAMD website.
Note the drop in performance beyond 32 cores, particularly with hyperthreading (136 cores).
For this reason Intel were given access to Marconi: they obtained the same results.
Reason for poor performance
The origin of the poor single node performance on KNL was eventually found to be the gettimeofday() system call when called in parallel:
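| | Marconi | Marconi | Endeavour | Endeavour |
|---|---|---|---|---|
| cores | 54 | 64 | 54 | 64 |
| Total time | 28.59 | 43.64 | 23.92 | 20.75 |
| __vdso_gettimeofday | 5.25 | 21.29 | 0.77 | 1.44 |
| performance, ns/day, HB | 3.78 | 3.38 | 3.76 | 4.47 |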
(Endeavour is the KNL cluster based at Intel.)
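To make the comparison easier to read, the share of total time spent in `__vdso_gettimeofday` can be worked out from the figures reported above. The following is only a small arithmetic sketch: the numbers are copied from the table, while the dictionary layout and the function name `gettimeofday_share` are illustrative choices, not part of NAMD or of any profiling tool.

```python
# Figures copied from the profile table above (same units as reported there).
PROFILE = {
    ("Marconi", 54): {"total": 28.59, "gettimeofday": 5.25},
    ("Marconi", 64): {"total": 43.64, "gettimeofday": 21.29},
    ("Endeavour", 54): {"total": 23.92, "gettimeofday": 0.77},
    ("Endeavour", 64): {"total": 20.75, "gettimeofday": 1.44},
}


def gettimeofday_share(system: str, cores: int) -> float:
    """Return the percentage of total time spent in __vdso_gettimeofday."""
    row = PROFILE[(system, cores)]
    return 100.0 * row["gettimeofday"] / row["total"]


if __name__ == "__main__":
    for system, cores in PROFILE:
        share = gettimeofday_share(system, cores)
        print(f"{system:9s} {cores} cores: {share:.1f}% in gettimeofday")
```

On these figures the call accounts for roughly 18% (54 cores) and 49% (64 cores) of the total time on Marconi, against roughly 3% and 7% on Endeavour.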
This call is heavily used at the beginning of NAMD in the Dynamic Load Balancing (DLB) phase and leads to a slowdown at least in the first few hundred steps.
The difference between the Marconi and Endeavour versions of gettimeofday() lies in the value of the Linux current_clocksource setting. On Marconi:
Code Block |
---|
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet |
On Endeavour and Cori, for example, the setting is:
Code Block |
---|
# Cori setting
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc |
tsc is based on processor clock ticks, whereas hpet is a separate hardware timer that is known to be inefficient when read by many cores at once:
TSC is the fastest since the cycle value is stored in a CPU register, which can be quickly retrieved using the RDTSC instruction. HPET is a hardware timer and access overhead is pretty high when multiple CPUs try to access it (since access is serialized in hardware)
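The check shown in the commands above can also be scripted, for instance to run it on every node of a partition. The following is a minimal sketch, assuming a Linux system that exposes the usual sysfs clocksource directory; the function names `read_clocksource` and `describe` are hypothetical helpers introduced here for illustration.

```python
from pathlib import Path

# Standard sysfs location on Linux; same directory as in the commands above.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base_dir=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources as reported by the kernel."""
    base = Path(base_dir)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def describe(current):
    """Hypothetical helper: flag the setting associated with the slowdown."""
    if current == "hpet":
        return "hpet in use: timer reads may serialize across cores"
    return f"{current} in use"


if __name__ == "__main__":
    current, available = read_clocksource()
    print(describe(current))
    print("available:", ", ".join(available))
```

The directory is passed as a parameter so that the function can be exercised against a copy of the files, not only on the live system.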