The HPC monitoring daemon (hpcmd) is a lightweight middleware designed to measure performance data of your running jobs on HPC compute nodes, to compute derived metrics, and to write the results to syslog, to standard output, or to a file. The software has been deployed on the Marconi cluster and is currently in a pre-production phase.

hpcmd on the Marconi cluster

The hpcmd software, written mainly in Python, is installed and runs as a systemd service on the Marconi compute nodes of the skl_fua_prod partition. The daemon is always active: at regular intervals (epochs), for each detected running job, it runs standard Linux command-line tools (perf, ps), reads virtual file system entries (such as /proc/loadavg), and queries proprietary tools (mmpmon, opainfo) to obtain the related metrics. The hpcmd daemon is extremely lightweight and operates in the background, invisible to the user.

hpcmd daemon:

  • supports Intel Skylake processors and computes the job performance in GFLOPS (see the sketch after this list);
  • collects performance metrics from the OPA network and the GPFS file systems, to obtain the network and disk I/O bandwidths;
  • fully integrates with the SLURM scheduler: it detects SLURM jobs, correlates the performance metrics with each job, and gathers additional information such as the job ID, the requested number of nodes, threads, etc.;
  • computes derived metrics and writes the data as syslog lines, which can be collected via rsyslog and finally stored in a database for subsequent analysis and visualization.
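
This page does not spell out which hardware events hpcmd actually uses, but on Skylake the floating-point throughput can be derived from the fp_arith_inst_retired event family. The following is only a minimal sketch of such a derivation, not hpcmd's implementation; the 230-second window matches the "awake" interval described below, and the packed-width weights (2/4/8 double-precision operations per 128/256/512-bit instruction) give an approximate FLOP count:

# Count double-precision FP instructions on all cores for 230 s (CSV output).
perf stat -a -x, -o fp.csv \
    -e fp_arith_inst_retired.scalar_double \
    -e fp_arith_inst_retired.128b_packed_double \
    -e fp_arith_inst_retired.256b_packed_double \
    -e fp_arith_inst_retired.512b_packed_double \
    -- sleep 230

# GFLOPS = (scalar + 2*128b + 4*256b + 8*512b) / (230 * 10^9)
awk -F, '/scalar_double/ {s=$1}
         /128b_packed/   {x=$1}
         /256b_packed/   {y=$1}
         /512b_packed/   {z=$1}
         END {printf "%.2f GFLOPS\n", (s + 2*x + 4*y + 8*z) / (230 * 1e9)}' fp.csv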

Tools used to perform the measurements (illustrative invocations are sketched after the list):

  • perf: queries Performance Monitoring Unit (PMU) events, processor core counters, and software events counted by the Linux kernel; these data are aggregated per socket
  • ps: obtains information about the threads running on the different cores and, independently, about the RSS memory
  • opainfo: queries OmniPath (OPA) network metrics; per-node data
  • mmpmon: queries GPFS file system metrics; per-node data
  • SLURM commands such as scontrol, squeue and sacct: gather additional information about the jobs
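
The exact options hpcmd passes to these tools are not documented here; the commands below are only illustrative equivalents of the per-epoch queries (note that mmpmon requires root privileges, and the job ID is taken from the SLURM job environment):

ps -eLo pid,tid,psr,rss,comm              # thread-to-core placement and RSS memory
cat /proc/loadavg                         # node load average
opainfo                                   # OPA link state and counters (per node)
echo io_s | /usr/lpp/mmfs/bin/mmpmon -p   # GPFS I/O statistics, parseable output
scontrol show job "$SLURM_JOB_ID"         # job details from SLURM
sacct -j "$SLURM_JOB_ID" --format=JobID,NNodes,Elapsed   # accounting information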

Epochs:

Performance data is sampled by hpcmd at regular intervals. It calls the perf tool in a blocking way: perf sets up the counters, measures for a fixed window (epoch = 240 s, awake = 230 s), and finally reads the counters.

All the other tools called by hpcmd return within a fraction of a second. At the beginning of each epoch, i.e. every 4 minutes, a new measurement is started.
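
Conceptually, each epoch therefore looks like the loop below. This is a simplified sketch, not the actual hpcmd code (which is written in Python); the perf event list and the logger call are placeholders for the real measurements:

#!/bin/bash
EPOCH=240   # epoch length in seconds
AWAKE=230   # blocking perf measurement window in seconds

while true; do
    # perf blocks for the whole awake window, then reports the counters
    perf stat -a -x, -o /tmp/epoch_perf.csv -e cycles,instructions -- sleep "$AWAKE"
    # the remaining tools (ps, opainfo, mmpmon, SLURM commands) all return
    # within a fraction of a second, well inside the rest of the epoch
    logger -t hpcmd "loadavg: $(cat /proc/loadavg)"
    sleep "$((EPOCH - AWAKE))"   # wait for the next epoch boundary
done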

Data management and visualization 

The syslog messages generated by hpcmd on each node (every 4 minutes for running jobs) are transported and collected by the rsyslog tool (over Ethernet) and recorded in a database.
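
On a compute node, the same messages can also be inspected locally, before they are shipped by rsyslog; for instance (assuming the systemd unit is named hpcmd, which is an assumption here):

journalctl -u hpcmd --since "10 min ago"   # recent messages of the hpcmd unit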

The data collected for the jobs is available for visualization.

 

Interference with other performance tools

For some special categories of jobs it may be necessary to suspend the hpcmd system service for the duration of the job: hpcmd continuously queries the hardware counters through the Linux perf tool, and those counters cannot be accessed simultaneously by a second tool. This may be necessary, for example, when one of the following or similar tools is used in a job: Intel VTune, Intel Advisor, PAPI, perf...

To suspend the service, insert "hpcmd_suspend" after "srun" in the batch job script, before the application to be executed:

srun hpcmd_suspend <your_exe>

Once the job has finished, the hpcmd systemd service is automatically re-enabled for subsequent jobs, so no action is needed by the user.
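
For example, a minimal batch job script with hpcmd suspended could look as follows (the resource requests are illustrative, and <your_exe> stands for the actual application):

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --partition=skl_fua_prod
#SBATCH --nodes=1
#SBATCH --time=01:00:00

# hpcmd_suspend stops the hpcmd service on the allocated nodes, freeing
# the hardware counters for tools such as VTune, Advisor, PAPI or perf
srun hpcmd_suspend <your_exe>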

 

About this software 

The hpcmd software is open source and available at https://gitlab.mpcdf.mpg.de/mpcdf/hpcmd

Reference: Stanisic L., Reuter K., "MPCDF HPC Performance Monitoring System: Enabling Insight via Job-Specific Analysis", Euro-Par 2019, Lecture Notes in Computer Science, vol. 11997 (2020) (arXiv)


Presentation

Please find below the slides shown at the "webinar: HPC performance monitoring tool on MARCONI cluster", held on June 16th, 2023:

hpcmd_mon_tool_15-06-23.pdf
