CINECA has added 3600 nodes, each housing Intel's second-generation Xeon Phi processor (Knights Landing, or KNL). This document provides a getting-started guide for using the KNL nodes. For further help please email superc@cineca.it.

System Configuration

Each KNL node consists of:

...

16 GB of MCDRAM and 96 GB of DDR4 RAM

...

They support the Intel AVX-512 instruction set extensions. The same three login nodes serve the Marconi-Broadwell (Marconi-A1) and the Marconi-KNL (Marconi-A2) partitions and queueing systems. The storage devices are also shared between the two partitions.
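
As a quick sanity check (a minimal sketch; the exact flag names reported depend on the kernel version), you can verify that a node exposes AVX-512 by inspecting /proc/cpuinfo from within a job running on it:

grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep avx512

On a KNL node this should list flags such as avx512f, avx512pf, avx512er and avx512cd.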

Compiling Code for KNL

Since the KNL nodes are binary-compatible with the legacy x86 instruction set, any code compiled for the standard Marconi A1 nodes will run on them. However, a specific compiler option is needed to generate AVX-512 instructions and obtain better performance from these nodes.

Intel Compilers

Versions 15.0 and newer of the Intel compilers can generate these instructions when you specify the -xMIC-AVX512 flag:

 

...

icpc -xMIC-AVX512 -O3 -o executable source.cc
ifort -xMIC-AVX512 -O3 -o executable source.f
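
If you need a single executable that runs on both the Broadwell (A1) and the KNL (A2) nodes, the Intel compilers can also emit multiple code paths; the following is a hedged sketch (please check the exact option combination in the Intel compiler documentation):

icc -xCORE-AVX2 -axMIC-AVX512 -O3 -o executable source.c

Here -xCORE-AVX2 sets the baseline instruction set for the Broadwell nodes and -axMIC-AVX512 adds an AVX-512 code path that is selected automatically at run time on the KNL nodes.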

Production Environment

The production environment is accessible via the module environment, which is shared with the Broadwell partition. At present not all software has been recompiled into KNL-optimized versions, but the binaries built for the Broadwell nodes can also run on the KNL nodes. We will announce via newsletter when optimized versions of all software become available.
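
For example (module names below are indicative; check "module avail" on Marconi for the actual names), the usual module commands apply:

module avail            # list the software accessible through the module environment
module load intel       # load the Intel compiler suite
module list             # show the currently loaded modules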

Submitting and Checking Jobs

The Marconi-A2 production environment is based on the latest release of the PBS scheduler/resource manager.

The KNL nodes are accessible via the PBS "knl" routing queue, defined on the login nodes. The routing queue is in charge of assigning jobs to the actual queues defined on the PBS server installed for the KNL partition. The A2 queues are knldebug and knlprod, but they are not directly accessible: you need to submit your job to the A1 queue "knl"; the job is then received by the A2 queue knlroute, which routes it either to knldebug or to knlprod, depending on the number of requested nodes and the walltime.
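
For instance, the limits of the routing queue can be inspected from a login node with the standard PBS query (a hedged example; the output format depends on the PBS version in use):

qstat -Qf knl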

Each KNL node exposes itself to PBS as having 68 cores (corresponding to the physical cores of the KNL processor). Jobs should request the entire node (hence ncpus=68), and the KNL PBS server is configured to assign the KNL nodes in an exclusive way (even if fewer ncpus are requested). Hyper-threading is enabled, hence you can run up to 272 processes/threads on each assigned node.
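
As an illustration only (the ranks/threads split below is an assumption, not a recommendation), a hybrid MPI+OpenMP job could exploit the hyper-threads by requesting 4 MPI tasks per node with 68 OpenMP threads each (4 x 68 = 272):

#PBS -l select=1:ncpus=68:mpiprocs=4:ompthreads=68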

The preliminary configuration of the Marconi-A2 partition allows users to explore different HBM modes (the on-package high-bandwidth memory based on multi-channel dynamic random access memory, MCDRAM) and different clustering modes of cache operations. Please refer to the official Intel documentation for a description of the different modes. The accessible KNL racks have been configured in the following modes:

  • cache/SNC-2
  • cache/quadrant
  • cache/hemisphere
  • flat/SNC-2
  • flat/quadrant
  • flat/hemisphere

Correspondingly, two "custom resources" have been defined at the chunk level (mcdram and numa) to request nodes in a specific configuration. The mcdram resource can assume the value "flat" or "cache"; the numa resource can be snc2, quadrant or hemisphere.

For example, to request a single KNL node with all of its memory in "flat" mode (108 GB; the maximum memory for nodes in "cache" mode is instead 93 GB) and the sub-NUMA clustering SNC-2 mode, a PBS job script such as the following is required:

 

#!/bin/bash
#PBS -q knl
#PBS -l select=1:ncpus=68:mpiprocs=68:mem=108GB:mcdram=flat:numa=snc2
#PBS -l walltime=00:30:00
#PBS -A <account_no>
... # Other PBS resource requests

cd $PBS_O_WORKDIR   # move to the directory from which the job was submitted

PATH_TO_EXECUTABLE > output_file
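
Once saved in a file (named, purely as an example, job_knl.sh), the script is submitted from a login node with qsub:

qsub job_knl.sh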

 

This will enqueue the job on the knldebug queue. In this preliminary configuration, all jobs requesting up to two KNL nodes and less than 30 minutes of walltime will be queued on knldebug (defined on a pool of reserved nodes). All other jobs will end up in the knlprod queue.
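
Note that in "flat" mode the MCDRAM is exposed to the operating system as one or more separate NUMA nodes. As a hedged sketch (the NUMA node numbering depends on the clustering mode and should be checked on the assigned node), you can inspect the layout and ask for allocations to be placed preferably in MCDRAM with numactl:

numactl --hardware                                       # show DDR4 and, in flat mode, MCDRAM NUMA nodes
numactl --preferred=1 PATH_TO_EXECUTABLE > output_file   # prefer the MCDRAM NUMA node (numbering assumed)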

An additional routing queue (knltest) is also defined on the login nodes, which routes jobs to the A2 queue "knltest". This queue is defined on nodes in cache or flat mode and SNC-2/SNC-4 clustering mode. It is a reserved queue; please write to superc@cineca.it if you are interested in running tests with such configurations.

The following table summarizes the requests currently possible:

 

Routing queue (#PBS -q) | Destination queue (from route only) | mcdram / numa | Max ncpus per job | Max memory per node | Max walltime
knl | knldebug | mcdram=<cache/flat>, numa=<snc2> | 136 (2 KNL nodes) | 93 GB (mcdram=cache), 108 GB (mcdram=flat) | 30 min
knl | knlprod | mcdram=<cache/flat>, numa=<snc2> | 68000 (1000 KNL nodes) | 93 GB (mcdram=cache), 108 GB (mcdram=flat) | 24 h
knltest (restricted access, ask superc@cineca.it) | knltest | mcdram=<cache/flat>, numa=<snc4,quadrant,hemisphere> | 4896 (72 KNL nodes) | 93 GB (mcdram=cache), 108 GB (mcdram=flat) | -

 

Please note that in this pre-production phase jobs will not be accounted (and the #PBS -A flag is not required).

To list the jobs submitted to the A2 partition, the A2 PBS server must be queried explicitly. At present you need to know which of the two A2 PBS servers, knl1 or knl2, is acting as master. The command:

qstat @knl1

or

qstat @knl2

will return the list of jobs submitted to the A2 queues, depending on whether knl1 or knl2 is acting as the PBS server at that moment.

To list the jobs submitted by a specific user (assuming, e.g., that the first server is the master one):

qstat -u $USER @knl1

To obtain the full status display of a job you also need to specify the server:

qstat -f <job_id>@knl1

The same applies when deleting a job. Hence, if knl1 is the master server, given a job_id reported by the qstat command, you need to type:

qdel <job_id>@knl1

Please note that this is a PRELIMINARY configuration; we will inform you promptly of any configuration change.

Optimizing Code for KNL - Vectorization 

There are certain considerations to take into account before running legacy codes on the KNL nodes. Most importantly, the effective use of vector instructions is critical to achieving good performance on the KNL cores. Guidelines on how to obtain vectorization information and improve code vectorization ("How to Improve Code Vectorization") have been integrated into the general UserGuide about the MARCONI Environment.
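
As a hedged example (the option names below refer to recent Intel compilers and should be double-checked against their documentation), a detailed vectorization report can be requested at compile time:

icc -xMIC-AVX512 -O3 -qopt-report=5 -qopt-report-phase=vec -o executable source.c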