...
Example:
module load intel-cc /16 .0.3.210 icc -g -qopenmp -O2 example.c –o example |
...
-g | Build application with debug information to allow binary-to-source correlation in the reports. |
-qopenmp | Enable generation of multi-threaded code if OpenMP directives/pragmas exist. |
-O2 (or higher) | Request compiler optimization. |
-vec | Enable vectorization if option O2 or higher is in effect (enabled by default). |
-simd | Enable SIMD directives/pragmas (enabled by default). |
For details of these options refer to man page or documentations of Intel compilers.
2) Submit a PBS job which executes the binary and runs the Intel Advisor command line tool advixe-cl to collect vectorisation information:
Example (PBS scripting part is omitted):
...... module load intel-advisor /2016 .1.40.463413 advixe-cl --collect survey --project- dir . /advi . /example |
For a 4-process MPI program, collect survey data into the shared ./advi project directory:
...... module load intel-advisor /2016 .1.40.463413 mpirun -n 4 advixe-cl --collect survey --project- dir . /advi . /mpi_example_serial |
3) Once the job finishes, launch advixe-gui on a login node to visualize the data collected by advixe-cl:
Example:
module load intel-advisor /2016 .1.40.463413 advixe-gui & |
|
- Choose “Open Result” tab and then select the .advixeexp file generated by advixe-cli in previous step.
- In the “Summary” part the summary of the report generated by Vectorization Advisor is shown. Vector instruction sets used, vectorization gain/efficiency are shown. Below is a screenshot of of the summary for a code run on a KNL node:
- The Survey Report provides detailed compiler report data and performance data regarding vectorization such as:
- Which loops are vectorized, the location in the source code
- Vectorization issues
- The reason why a loop is not vectorized
- Vector ISA used
- Vectorization efficiency, speedup
- Vector length (# of elements processed in the SIMD instruction)
- Vectorization instructions used
Below is a screenshot of of the Survey Report for a code run on a KNL node:
If a loop cannot be vectorized with automatic vectorization, Intel Advisor will provide the reason and advices on how to fix the vectorization issues specific to your code, such as dependency analysis and memory access pattern analysis. Users should follow these advices, modify their source code and give compiler more hints to improve vectorization, by using compiler options or adding directives/pragmas to the source code (explicit vectorization).
Explicit Vectorization
Compiler SIMD directives/pragmas
Users can add compiler SIMD directives/pragmas to the source code to tell the compiler that dependency does not exist, so that the compiler can vectorize the loop when the user re-compiles the modified source code. Such SIMD directives/pragmas include:
#pragma vector always: instruct to vectorize a loop if it is safe to do so #pragma vector align: assert that data within the loop is aligned on 16B boundary #pragma ivdep: instruct the compiler to ignore potential data dependencies #pragma simd: enforce vectorization of a loop |
OpenMP directives/pragmas
Users can use OpenMP 4.0 new directives/pragmas for explicit vectorization:
#pragma omp simd: enforce vectorization of a loop #pragma omp declare simd: instruct the compiler to vectorize a function #pragma omp parallel for simd: target same loop for threading and SIMD, with each thread executing SIMD instructions |
Compiler options and macros
Users can also use compiler options and macros for explicit vectorizaiton:
-D NOALIAS /-noalias : assert that there is no aliasing of memory references (array addresses or pointers) -D REDUCTION: apply an omp simd directive with a reduction clause -D NOFUNCCALL: remove the function and inline the loop -D ALIGNED /-align : assert that data is aligned on 16B boundary -fargument-noalias: function arguments cannot alias each other |
SIMD enabled functions
Users can also declare and use SIMD enabled functions. In the example below, function foo is declared as a SIMD enabled function (vector function), so it is vectorized. So is the for loop in which it is called.
__attribute(vector) float foo( float ); void vfoo( float *restrict a, float *restrict b, int n){ int i; for (i=0; i<n; i++) { a[i] = foo(b[i]); } } float foo( float x) { ... } |
Programming Guidelines for Writing Vectorizable Code
- Use simple loops, avoid variant upper iteration limit and data-dependent loop exit conditions
- Write straight-line code: avoid branches, most function calls or if constructs
- Use array notations instead of pointers
- Use unit stride (increment 1 for each iteration) in inner loops
- Use aligned data layout (memory addresses)
- Use structure of arrays instead of arrays of structures
- Use only assignment statements in the innermost loops
- Avoid data dependencies between loop iterations, such as read-after-write, write-after-read, write-after-write
- Avoid indirect addressing
- Avoid mixing vectorizable types in the same loop
- Avoid functions calls in innermost loop, except math library calls