When you select a Target System of Intel Xeon Phi or Offload to Intel Xeon Phi coprocessor, additional modeling parameters appear below the Runtime Modeling area, under Intel Xeon Phi Advanced Modeling:
- Select Consider Code Vectorization if you plan to modify your parallel code later to improve vector parallel execution. When it is checked, you can specify:
  - Reference CPU Vectorization Speedup: the speedup multiplier you expect vectorization to achieve for the current site on the reference CPU (dual-socket 8-core Intel Xeon processor E5-26xx product family at 2.7 GHz, 16 cores total). Base your estimate on the target device characteristics and on your knowledge of how much of this code can be vectorized, and how well.
  - Intel Xeon Phi Vectorization Speedup: the speedup multiplier you expect vectorization to achieve for the current site on an Intel Xeon Phi coprocessor. Base your estimate on the target device characteristics and on your knowledge of how much of this code can be vectorized, and how well.
- When you choose Offload to Intel Xeon Phi as the Target System, use Offload Transfer Data Size to specify the amount of data you expect to transfer (in KB).
- Click Apply after modifying any of these values.
In some cases, you can restructure your code to enable more efficient vector operations. Loop vectorization allows the hardware to process independent data elements together in small fixed-size units (usually 64 bytes), as in operations on data arrays.
One way to enable more efficient vector operations is to split a single loop into a new outer loop and an inner loop that together cover the same iteration space. This technique, called strip-mining, allows the innermost loop to use vector operations on small chunks.
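The strip-mining transformation described above can be sketched in C as follows. This is a minimal illustration, not code from the product: the function names scale_add and scale_add_strip_mined and the strip parameter are hypothetical. A strip of 16 floats corresponds to the 64-byte unit mentioned earlier, but the best value depends on the target hardware and compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Original form: one loop covers the whole iteration space. */
void scale_add(float *y, const float *x, float a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Strip-mined form: the outer loop advances one strip at a time and the
   inner loop covers a single strip, so both loops together visit the
   same indices 0..n-1 as the original loop. The strip length is an
   assumption chosen by the caller (for example, 16 floats = 64 bytes). */
void scale_add_strip_mined(float *y, const float *x, float a,
                           size_t n, size_t strip)
{
    assert(strip > 0); /* a zero strip would never advance the outer loop */

    for (size_t start = 0; start < n; start += strip) {
        /* The last strip may be shorter than the others. */
        size_t end = (n - start < strip) ? n : start + strip;

        for (size_t i = start; i < end; ++i)
            y[i] = a * x[i] + y[i];
    }
}
```

Both functions compute the same result. Whether the inner loop is actually vectorized depends on the compiler and its options, so check the compiler vectorization report.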
Other ways to enable more efficient vector operations include examining outermost loops where threading parallelism might already be used, and considering vectorization of their innermost loops and/or callee functions:
- Certain innermost loops may benefit from OpenMP 4 constructs. That is, under certain conditions you can use both an omp parallel for threading pragma and an omp simd (or similar) vectorization pragma (see the compiler vectorization report and the descriptions at http://openmp.org).
- Certain innermost loops may benefit from the Intel Cilk Plus cilk_for keyword. With Intel Cilk Plus, under certain conditions a single cilk_for can result in threading parallelism of outer loops and vector parallelism of inner loops (see the compiler vectorization report).
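The OpenMP combination described above can be sketched as an outer loop that carries the threading pragma and an inner loop that carries the simd pragma. The function row_sums and its row-major matrix layout are hypothetical and serve only as illustration. The sketch assumes a compiler that supports OpenMP 4.0 or later with OpenMP enabled (for example, -fopenmp for GCC and Clang). Without that option the pragmas are ignored and the code runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Sums each row of a rows x cols matrix stored in row-major order.
   Outer loop: rows are distributed across threads.
   Inner loop: the compiler is asked to vectorize the per-row reduction. */
void row_sums(const float *a, float *sums, int rows, int cols)
{
    assert(rows >= 0 && cols >= 0);

    #pragma omp parallel for
    for (int r = 0; r < rows; ++r) {
        float acc = 0.0f; /* private to each outer iteration */

        #pragma omp simd reduction(+:acc)
        for (int c = 0; c < cols; ++c)
            acc += a[(size_t)r * cols + c];

        sums[r] = acc;
    }
}
```

Each outer iteration writes only its own element of sums, so the threads do not interfere with one another. The simd reduction clause tells the compiler that the additions into acc may be reordered, which can change floating-point rounding slightly compared with a strictly serial sum.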
The processor microarchitecture determines the type of vector instructions that will be supported and thus the size of data the hardware can process efficiently (see http://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures).
For a description of the Intel® Xeon Phi™ coprocessor architecture, visit the Intel® Developer Zone and read such articles as http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner.