(Updated October 2023)

G100

version 7.1

CNT10POR8 : R&G scaling on the CPU

  • Performance analysis for a carbon nanotube functionalized with two porphyrin molecules: about 1500 atoms, 8000 bands, 1 k-point.
  • The average time per iteration is reported as a function of the number of nodes.


Table 1: Performance of the pure MPI run


N° nodes    Time (s)
8           248
16          134
32          75
64          48


Graphic 1: QE performance (simulation time in s) as a function of the number of nodes
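
The scaling in Table 1 can be summarized as speedup and parallel efficiency relative to the smallest run. The snippet below is a minimal sketch: the 8-node run is taken as the reference (an assumption, since no single-node time is reported), and the names `times` and `scaling` are illustrative only.

```python
# Strong-scaling speedup and parallel efficiency for the pure MPI runs (Table 1).
# Values are the average time per iteration in seconds, keyed by node count.
times = {8: 248.0, 16: 134.0, 32: 75.0, 64: 48.0}


def scaling(times):
    """Return {nodes: (speedup, efficiency)} relative to the smallest run."""
    base_nodes = min(times)  # assumption: smallest run is the reference
    base_time = times[base_nodes]
    result = {}
    for nodes, t in sorted(times.items()):
        speedup = base_time / t
        # Ideal speedup would be nodes / base_nodes.
        efficiency = speedup * base_nodes / nodes
        result[nodes] = (speedup, efficiency)
    return result


table1 = scaling(times)
for nodes, (s, e) in table1.items():
    print(f"{nodes:>3} nodes: speedup {s:.2f}, efficiency {e:.0%}")
```

With these numbers the efficiency relative to 8 nodes falls to roughly 65% at 64 nodes.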

Leonardo

version 7.2

CNT10POR8 : R&G scaling on the GPUs

  • Performance analysis for a carbon nanotube functionalized with two porphyrin molecules: about 1500 atoms, 8000 bands, 1 k-point.
  • The average time per iteration is reported as a function of the number of nodes.


Table 2: Performance of the MPI (1 task per GPU) + GPU (4 per node) + OpenMP (8 threads per task) run

N° nodes    Time (s)
8           21.34
16          14.06
20          12.18
24          11.60


 

...

The following benchmark reports the time to solution (s) when distributing the run over 1 to 1024 nodes (columns), together with the efficiency of the entire simulation (black dashed line) and the efficiency of the main kernel (violet line).


Nodes   phqscf (s)   dynmat0 (s)   solve_linter (s)   sth_kernel (s)   h_psi (s)   walltime (s)
4       31770.01     36.58         30897.59           27444.56         12649.38    31860.00
8       17837.22     34.7          16981.20           14023.52         6422.33     17880.00
16      10570.72     31.56         9766.42            7051.89          3189.66     10620.00
32      5290.35      30.7          4891.41            3535.66          1598.94     5340.00
64      2700.97      31.7          2496.78            1805.39          817.76      2733.14
128     1375.29      32.25         1270.39            917.08           415.09      1401.46
256     711.83       31.81         657.49             474.75           214.19      757.98
512     382.94       31.96         353.18             253.74           125.36      447.93
1024    223.69       31.55         206.78             148.51           98.76       319.31
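
The two efficiency curves mentioned above can be recomputed from the tabulated timings. This is a minimal sketch under two assumptions: the 4-node run (the smallest one listed in the table) is the reference, and the "main kernel" is taken to be the sth_kernel column. The function name `efficiency` is illustrative.

```python
# Parallel efficiency of the whole run (walltime) and of the sth_kernel column,
# relative to the 4-node run. Timings in seconds, copied from the table above.
nodes = [4, 8, 16, 32, 64, 128, 256, 512, 1024]
walltime = [31860.00, 17880.00, 10620.00, 5340.00, 2733.14,
            1401.46, 757.98, 447.93, 319.31]
sth_kernel = [27444.56, 14023.52, 7051.89, 3535.66, 1805.39,
              917.08, 474.75, 253.74, 148.51]


def efficiency(nodes, times):
    """Efficiency E(N) = (N0 * T0) / (N * T), with the first entry as reference."""
    ref = nodes[0] * times[0]
    return [ref / (n * t) for n, t in zip(nodes, times)]


eff_wall = efficiency(nodes, walltime)
eff_kernel = efficiency(nodes, sth_kernel)
for n, ew, ek in zip(nodes, eff_wall, eff_kernel):
    print(f"{n:>5} nodes: walltime {ew:.2f}, sth_kernel {ek:.2f}")
```

Under these assumptions the kernel keeps a noticeably higher efficiency than the full run at 1024 nodes, consistent with dynmat0 staying at about 30 s regardless of the node count.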


Graphic 3: Time to solution (m) and efficiencies from 1 to 1024 nodes. The simulation is scaled over pools up to 16 nodes; beyond that, images are used to further distribute the computation. The labels on top of the columns (ni,nk,npw,omp) define the parallelization used: ni is the number of images, nk the number of pools, npw the number of R&G processes, and omp the number of OpenMP threads.
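
To read a (ni,nk,npw,omp) label, it helps to translate it into MPI ranks and nodes. The sketch below assumes that band groups are not used, so the number of MPI ranks is ni × nk × npw. The `Layout` class and the example values are hypothetical and are not taken from the plotted runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Parallelization label (ni, nk, npw, omp) as shown on the columns."""
    ni: int   # number of images
    nk: int   # number of pools
    npw: int  # number of R&G processes per pool
    omp: int  # OpenMP threads per MPI rank

    @property
    def mpi_ranks(self) -> int:
        # Assumption: no band groups, so ranks = images * pools * R&G processes.
        return self.ni * self.nk * self.npw

    @property
    def threads(self) -> int:
        return self.mpi_ranks * self.omp

    def nodes(self, ranks_per_node: int) -> int:
        """Nodes needed when each node hosts `ranks_per_node` MPI ranks."""
        full, rest = divmod(self.mpi_ranks, ranks_per_node)
        if rest:
            raise ValueError("ranks do not fill an integer number of nodes")
        return full


# Hypothetical example: one pool per GPU, 4 GPUs (MPI ranks) per node.
example = Layout(ni=1, nk=64, npw=1, omp=8)
print(example.mpi_ranks, example.threads, example.nodes(ranks_per_node=4))
```

For the hypothetical label (1,64,1,8) with 4 ranks per node, this gives 64 MPI ranks, 512 threads, and 16 nodes.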

...

  • CPU (G100) - GPU (Leonardo) speedup
  • Scaling with an increasing number of pools (one pool per GPU) for a single irreducible representation


Nodes   phqscf    ortho    sth_kernel   GPU time (s)   CPU time ref (s)
1       2136.07   121.23   2047.89      2203.88        -
1       1099.34   64.39    1047.00      1137.37        -
1       584.01    34.88    543.51       607.72         6120.00
2       302.31    18.34    272.62       318.61         3179.40
4       161.82    9.68     138.19       174.65         1549.20
8       91.60     4.94     70.84        102.5          800.40
16      55.81     2.66     36.04        66.44          408.00


Graphic 4: Time to solution (m) for the CPU (G100) and GPU (Leonardo) runs from 1 to 16 nodes. The CPU and GPU runs use different parallelizations, defined by the labels on top of the columns (ni,nk,npw,omp): ni is the number of images, nk the number of pools, npw the number of R&G processes, and omp the number of OpenMP threads. The yellow line shows the CPU-GPU speedup.
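
The CPU-GPU speedup can be recomputed from the table as the ratio of the CPU reference time to the GPU time at the same node count. This is a minimal sketch; only rows that report a CPU reference time are used, and the names `gpu_time`, `cpu_time`, and `speedup` are illustrative.

```python
# CPU (G100) to GPU (Leonardo) speedup at equal node count.
# Times in seconds, keyed by node count, copied from the table above.
gpu_time = {1: 607.72, 2: 318.61, 4: 174.65, 8: 102.5, 16: 66.44}
cpu_time = {1: 6120.00, 2: 3179.40, 4: 1549.20, 8: 800.40, 16: 408.00}

# Speedup = CPU time / GPU time for every node count present in both runs.
speedup = {n: cpu_time[n] / gpu_time[n] for n in sorted(gpu_time) if n in cpu_time}

for n, s in speedup.items():
    print(f"{n:>2} nodes: GPU run is {s:.1f}x faster than the CPU reference")
```

With these values the speedup is about 10x on 1 node and about 6x on 16 nodes.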

...