Version 2.12 (A1 partition) - work in progress

Input:  (http://www.ks.uiuc.edu/Research/namd/utilities/)

STMV (virus) benchmark (1,066,628 atoms, periodic, PME)

Graphic 1: the NAMD performance (simulation time in ns/day) is reported vs. the increasing number of cores

 

 

 

Input:  (http://www.ks.uiuc.edu/Research/namd/utilities/)
APOA1 (apolipoprotein A-I) benchmark (92,224 atoms, periodic, PME)

Graphic 1: the NAMD performance (simulation time in ns/day) is reported vs. the increasing number of cores

 

    

MARCONI (KNL - A2 partition)

Problem

 

Intel publishes NAMD performance data for KNL on its website, but until recently it had been impossible to reproduce those figures. In particular, single-node performance was very poor:
APOA1 benchmark, 92K atoms

 

STMV (virus) benchmark (1,066,628 atoms, periodic, PME)

 

Code Block
namd2 +p 136 apoa1/apoa1.namd +pemap 0-135
cores   days/ns     ns/day
8       1.53863     0.649929
16      0.798383    1.25253
32      0.415942    2.40419
68      0.33502     2.98491
136     6.17328     0.16198

NB: the SMP version of NAMD 2.12 was downloaded directly from the NAMD website.

Notice how scaling degrades beyond 32 cores (68 cores deliver only about 1.24x the 32-core throughput) and how performance collapses with hyperthreading (136 threads: 0.16 ns/day).
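The scaling behaviour can be quantified directly from the ns/day figures above. The following is a minimal sketch, not part of the original benchmark scripts: the function name parallel_efficiency and the choice of the 8-core run as the baseline are assumptions made for illustration.

```python
# ns/day values copied from the APOA1 single-node results above
NS_PER_DAY = {8: 0.649929, 16: 1.25253, 32: 2.40419, 68: 2.98491, 136: 0.16198}


def parallel_efficiency(results, base=8):
    """Return {cores: (speedup, efficiency)} relative to the `base` run.

    speedup    = throughput / throughput of the base run
    efficiency = speedup / (cores / base cores); 1.0 means ideal scaling
    """
    base_perf = results[base]
    out = {}
    for cores, perf in sorted(results.items()):
        speedup = perf / base_perf
        out[cores] = (speedup, speedup / (cores / base))
    return out


if __name__ == "__main__":
    for cores, (s, e) in parallel_efficiency(NS_PER_DAY).items():
        print(f"{cores:4d} cores: speedup {s:5.2f}x, efficiency {e:6.1%}")
```

With these numbers the efficiency stays above 90% up to 32 cores, falls to roughly 54% at 68 cores and is below 2% at 136 threads.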

For this reason Intel were given access to Marconi: they obtained the same results.

Reason for poor performance

The origin of the poor single node performance on KNL was eventually found to be the gettimeofday() system call when called in parallel:

 

                          Marconi           Endeavour
cores                     54      64        54      64
Total time                28.59   43.64     23.92   20.75
__vdso_gettimeofday       5.25    21.29     0.77    1.44
performance, ns/day, HB   3.78    3.38      3.76    4.47

(Endeavour is the KNL cluster based at Intel.)
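To make the comparison easier to read, the share of the profiled time spent in __vdso_gettimeofday can be computed from the figures above. This is a small sketch that assumes "Total time" and the __vdso_gettimeofday entry are expressed in the same units (the source does not state them); the helper name timer_fraction is hypothetical.

```python
# (total time, time in __vdso_gettimeofday) copied from the profile above.
# Assumption: both values use the same unit, so their ratio is meaningful.
PROFILE = {
    ("Marconi", 54): (28.59, 5.25),
    ("Marconi", 64): (43.64, 21.29),
    ("Endeavour", 54): (23.92, 0.77),
    ("Endeavour", 64): (20.75, 1.44),
}


def timer_fraction(profile):
    """Return {(system, cores): fraction of total time spent in the timer call}."""
    return {key: timer / total for key, (total, timer) in profile.items()}


if __name__ == "__main__":
    for (system, cores), frac in timer_fraction(PROFILE).items():
        print(f"{system:10s} {cores:3d} cores: {frac:6.1%} in __vdso_gettimeofday")
```

On Marconi the timer call accounts for roughly 18% of the time at 54 cores and almost half at 64 cores, while on Endeavour it stays below 7% in both runs.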

This call is heavily used at the beginning of NAMD in the Dynamic Load Balancing (DLB) phase and leads to a slowdown at least in the first few hundred steps.

The difference in gettimeofday() behaviour between Marconi and Endeavour lies in the Linux current_clocksource setting. On Marconi:

Code Block
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet

On Endeavour and Cori, for example, the setting is:

Code Block
# Cori setting
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc

tsc is based on the processor's clock-tick counter, while hpet is a separate hardware timer that is known to be inefficient when read by many cores at once:

 

Info Why TSC over HPET:
TSC is the fastest since the cycle value is stored in a CPU register, which can be quickly retrieved using the RDTSC instruction. HPET is a hardware timer and access overhead is pretty high when multiple CPUs try to access it (since access is serialized in hardware) 

(https://www.quora.com/Linux-Kernel-What-are-the-advantages-and-disadvantages-of-TSC-and-HPET-as-a-clocksource)
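The clocksource check shown earlier can be scripted, for example as a sanity test at the start of a job. The sketch below reads the same sysfs file used above, plus the sibling available_clocksource file that Linux exposes in the same directory. The helper names read_clocksource and clocksource_advice and the wording of the messages are hypothetical. Changing the setting itself requires root privileges and is outside the scope of this sketch.

```python
from pathlib import Path

# Standard Linux sysfs directory, as used in the cat commands above
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def read_clocksource(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources read from sysfs."""
    base = Path(base)
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


def clocksource_advice(current, available):
    """Return a short verdict string (hypothetical helper, for illustration)."""
    if current == "tsc":
        return "ok: tsc is in use"
    if "tsc" in available:
        return f"warning: {current} is in use although tsc is available"
    return f"warning: {current} is in use and tsc is not offered by the kernel"


if __name__ == "__main__":
    try:
        cur, avail = read_clocksource()
    except OSError as exc:  # e.g. not running on Linux
        print(f"cannot read clocksource from sysfs: {exc}")
    else:
        print(clocksource_advice(cur, avail))
```

On a node configured like Marconi at the time of these tests, the script would be expected to report a warning for hpet; on a node configured like Endeavour or Cori, it would report that tsc is in use.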

 
 

 
