Note
This type of analysis is supported only by the Intel® VTune™ Amplifier XE.
Prerequisites:
To analyze OpenMP parallel regions, make sure to compile and run your code with the Intel® Compiler 13.1 Update 2 or higher (part of the Intel Composer XE 2013 Update 2). If an obsolete version of the OpenMP runtime libraries is detected, VTune Amplifier provides a warning message. In this case the collection results may be incomplete.
OpenMP is a fork-join parallel model, which starts with an OpenMP program running with a single master serial-code thread. When a parallel region is encountered, that thread forks into multiple threads, which then execute the parallel region. At the end of the parallel region, the threads join at a barrier, and then the master thread continues executing serial code. It is possible to write an OpenMP program more like an MPI program, where the master thread immediately forks to a parallel region and constructs such as barrier and single are used for work coordination. But it is far more common for an OpenMP program to consist of a sequence of parallel regions interspersed with serial code.
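The common pattern described above, parallel regions interspersed with serial code, might look like the following minimal sketch. The function name scale_and_sum and the computation itself are illustrative assumptions, not taken from any Intel sample.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: two parallel regions separated by serial code.
   Each "#pragma omp parallel for" forks a team of threads; the implicit
   barrier at the end of the region joins them again. */
double scale_and_sum(double *data, size_t n, double factor)
{
    assert(data != NULL || n == 0);
    long count = (long)n;

    /* Parallel region 1: every thread scales its share of the array. */
    #pragma omp parallel for
    for (long i = 0; i < count; i++)
        data[i] *= factor;

    /* Serial section: only the master thread runs this line while the
       worker threads wait in the OpenMP runtime. */
    double offset = (count > 0) ? data[0] : 0.0;

    /* Parallel region 2: per-thread partial sums combined by a reduction. */
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < count; i++)
        sum += data[i] - offset;

    return sum;
}
```

If the code is compiled without OpenMP support, the pragmas are ignored and the function runs serially with the same result.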
Ideally, parallelized applications have working threads doing useful work from the beginning to the end of execution, utilizing 100% of available CPU core processing time. In real life, useful CPU utilization is likely to be less when working threads are waiting, either actively spinning (for performance, expecting to have a short wait) or waiting passively, not consuming CPU. There are several major reasons why working threads wait, not doing useful work:
Execution of serial portions (outside of any parallel region): When the master thread is executing a serial region, the worker threads are in the OpenMP runtime waiting for the next parallel region.
Load imbalance: When a thread finishes its part of workload in a parallel region, it waits at a barrier for the other threads to finish.
Not enough parallel work: The number of loop iterations is less than the number of working threads so several threads from the team are waiting at the barrier not doing useful work at all.
Synchronization on locks: When synchronization objects are used inside a parallel region, threads can wait on a lock release, contending with other threads for a shared resource.
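To make the load imbalance and insufficient work cases concrete, the following sketch models the time a team spends waiting at the closing barrier of one parallel region. It is a simplified model written for illustration, not a formula used by VTune Amplifier, and the name total_barrier_wait is hypothetical.

```c
#include <assert.h>

/* Simplified model (an assumption, not a VTune Amplifier formula):
   busy[i] is the useful time thread i spends inside one parallel region.
   Every thread waits at the closing barrier until the slowest thread
   arrives, so its wait is max(busy) - busy[i].
   Returns the wait time summed over the whole team. */
double total_barrier_wait(const double *busy, int threads)
{
    assert(busy != 0 && threads > 0);

    /* Find the slowest thread; it determines when the barrier releases. */
    double slowest = busy[0];
    for (int i = 1; i < threads; i++)
        if (busy[i] > slowest)
            slowest = busy[i];

    /* Accumulate the idle time of every other thread. */
    double wait = 0.0;
    for (int i = 0; i < threads; i++)
        wait += slowest - busy[i];
    return wait;
}
```

A thread with a busy time of zero, for example one that received no loop iterations, waits for the full duration of the region.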
VTune Amplifier, together with Intel Composer XE 2013 Update 2 or later, helps you understand how an application utilizes the available CPUs and identify the causes of CPU underutilization.
Configuring OpenMP Parallel Region Analysis
The OpenMP runtime library in the Intel Composer XE provides special markers for applications running under profiling that can be used by the VTune Amplifier to decipher the statistics of OpenMP parallel regions and distinguish serial parts of the application code.
You can use this OpenMP region analysis on both the host and the Intel Xeon Phi™ coprocessor.
Interpreting OpenMP Analysis Data
VTune Amplifier provides OpenMP analysis data in the Hotspots and Hotspots by CPU Usage viewpoints.
Start your analysis with the CPU Usage Histogram in the Summary window of your application results. It displays the Elapsed time of your application, broken down by CPU utilization levels. The histogram shows only useful utilization so the CPU cycles that were spent by the application burning CPU in spin loops (active wait) are not counted. You can adjust sliders from the default levels if you intentionally use a number of OpenMP working threads less than the number of available hardware threads.
If the bars are close to Ideal utilization, you might need to look deeper, at algorithm or microarchitecture tuning opportunities, to find performance improvements. If not, explore the OpenMP Analysis section of the Summary window for inefficiencies in parallelization of the application:
This section of the Summary window shows the Collection Time as well as the duration of serial (outside of any parallel region) and parallel portions of the program. If the serial portion is significant, consider options to minimize serial execution, either by introducing more parallelism or by doing algorithm or microarchitecture tuning for sections that seem unavoidably serial. For high thread-count machines, serial sections have a severe negative impact on potential scaling (Amdahl's Law) and should be minimized as much as possible.
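The effect of Amdahl's Law on high thread-count machines can be estimated with a short helper such as the sketch below. The function name amdahl_speedup is an assumption made for this example; the formula is the standard one, speedup = 1 / (s + (1 - s) / N) for serial fraction s and N threads.

```c
#include <assert.h>

/* Amdahl's Law: upper bound on the speedup when serial_fraction
   (0.0 to 1.0) of the run stays serial and the remainder scales
   perfectly across the given number of threads. */
double amdahl_speedup(double serial_fraction, int threads)
{
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(threads > 0);
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads);
}
```

For example, with a serial fraction of 0.1 and a hypothetical count of 240 threads, the bound is roughly 9.6, and it can never exceed 10 regardless of the thread count.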
To analyze the serially executed code, switch to the Bottom-up window, select the /OpenMP Region/Thread/Function grouping, and filter the view by the OMP Master Thread of [Serial - outside any region] row.
To estimate the efficiency of CPU utilization in the parallel part of the code, use the Potential Gain metric. This metric estimates the difference in the Elapsed time between the actual measurement and an idealized execution of parallel regions, assuming perfectly balanced threads and zero overhead of the OpenMP runtime on work arrangement. Use this data to understand the maximum time that you may save by improving parallel execution.
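One way to read this definition is sketched below for a single region instance. This is an interpretation for illustration only: the exact formula VTune Amplifier uses may differ, and the name potential_gain and its parameters are hypothetical.

```c
#include <assert.h>

/* Illustrative model only; the actual VTune Amplifier formula may differ.
   elapsed   - measured wall-clock time of one parallel region instance
   useful[i] - useful CPU time of thread i inside that instance
   The idealized time assumes the total useful work is spread evenly over
   the team and the OpenMP runtime adds no overhead. */
double potential_gain(double elapsed, const double *useful, int threads)
{
    assert(useful != 0 && threads > 0);

    double total = 0.0;
    for (int i = 0; i < threads; i++)
        total += useful[i];

    double ideal = total / threads;   /* perfectly balanced execution */
    double gain = elapsed - ideal;
    return gain > 0.0 ? gain : 0.0;   /* a gain cannot be negative */
}
```

Under this model, a region that takes 8 seconds while its four threads do 8, 4, 2, and 2 seconds of useful work has an idealized time of 4 seconds and a potential gain of 4 seconds.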
The Summary window provides a detailed table listing the top five parallel regions with the highest Potential Gain metric values. For each parallel region defined by a #pragma omp parallel directive, this metric is the sum of the potential gains of all instances of that parallel region.
If Potential Gain for a region is significant, you can go deeper and select the link on a region name to navigate to a Bottom-up view employing an OpenMP Region dominant grouping and the region of interest selection.
Bottom-up view enables classifying inefficiencies by presenting a breakdown of CPU time spent in the region: high Spin Time values can signal a parallel region imbalance. As a potential solution, you may try dynamic scheduling to reduce the imbalance. High Overhead Time values can result from parallel work divided too finely, resulting in excessive scheduling cost. In this case consider increasing the parallel work executed by each working thread, for example, moving the parallel region to an outer loop or using larger chunking for dynamic scheduling. To get more details of the reasons for high Spin or Overhead Time values, you may click to expand the column and analyze time distribution by the following metrics:
Overhead Time reasons for OpenMP regions: Creation, Scheduling, Reduction
Spin Time reasons for OpenMP regions: Imbalance or Serial Spinning, Lock Contention
Spin or Overhead Time that is not classified is provided in the Other Time metric column. Hover over a column header to view the metric description and the formula used for its calculation. A pink cell signals a performance issue. Hover over such a value to get more details on the detected problem.
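The advice on dynamic scheduling and chunk size can be sketched as follows. The workload is a made-up example in which iteration i costs i inner steps, so equal static blocks would leave the thread with the last block working longest; the function name triangular_work is hypothetical.

```c
#include <assert.h>

/* Hypothetical imbalanced workload: iteration i performs i inner steps.
   Returns the total number of inner steps, n * (n - 1) / 2. */
long triangular_work(int n, int chunk)
{
    assert(n >= 0 && chunk > 0);
    long total = 0;

    /* schedule(dynamic, chunk): a thread that finishes early takes the
       next chunk of iterations instead of spinning at the barrier.
       A larger chunk means fewer scheduling operations (lower overhead)
       but coarser load balancing. */
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+:total)
    for (int i = 0; i < n; i++) {
        long local = 0;
        for (int j = 0; j < i; j++)
            local += 1;
        total += local;
    }
    return total;
}
```

A suitable chunk size depends on the workload; comparing Spin Time and Overhead Time for a few values is one way to choose it.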
To analyze the source of a performance-critical OpenMP parallel region, double-click the region identifier in the grid, sorted by the OpenMP Region/.. grouping level. VTune Amplifier opens the source view at the beginning of the selected OpenMP region in the pseudo function created by the Intel compiler.
To explore the impact of synchronization on locks inside a region (for example, #pragma omp critical), use the Locks and Waits predefined analysis type. The Locks and Waits analysis is based on synchronization object function tracing, so heavy contention on synchronization objects can cause significant runtime overhead from the analysis itself. Where possible, avoid synchronization inside regions by using OpenMP reduction or thread-local storage.
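As an illustration of replacing a lock with a reduction, the sketch below counts even values in two ways. Both function names are hypothetical, and the example is deliberately small.

```c
#include <assert.h>

/* Lock-based version: every matching iteration enters the critical
   section, so threads contend for the same lock. */
long count_even_critical(const int *values, int n)
{
    assert(values != 0 || n == 0);
    long count = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        if (values[i] % 2 == 0) {
            #pragma omp critical
            count++;
        }
    }
    return count;
}

/* Reduction version: each thread updates a private copy of count and the
   runtime combines the copies once at the end of the region. */
long count_even_reduction(const int *values, int n)
{
    assert(values != 0 || n == 0);
    long count = 0;
    #pragma omp parallel for reduction(+:count)
    for (int i = 0; i < n; i++)
        if (values[i] % 2 == 0)
            count++;
    return count;
}
```

Both functions return the same result; the second avoids lock contention inside the region.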
Use the OpenMP Region Duration histogram in the Summary window to analyze instances of an OpenMP region, explore the time distribution of instance durations and identify Fast/Good/Slow region instances. Initial distribution of region instances by Fast/Good/Slow categories is done as a ratio of 20/40/20 between min and max region time values. Adjust the thresholds as needed.
Use this data for further detailed analysis in the grid views with /OpenMP Region/Region Type grouping levels.
Note
By default, the Intel compiler does not add a source file name to region names, so the unknown string shows up in the OpenMP parallel region name. To get the source file name in the region name, use the -parallel-source-info=2 option during compilation.
Limitations
VTune Amplifier supports the analysis of parallel OpenMP regions with the following limitations:
Maximum number of supported lexical parallel regions is 512, which means that no region annotations will be emitted for regions whose scope is reached after 512 other parallel regions are encountered.
Regions from nested parallelism are not supported. Only top-level items emit regions.
VTune Amplifier does not support static linkage of OpenMP libraries.
For an MPI analysis result including more than one process with OpenMP regions, the VTune Amplifier does not provide OpenMP statistics in the Summary window and provides a limited list of OpenMP metrics in the Bottom-up window. For complete OpenMP statistics, make sure to launch MPI analysis on a single rank. Note that region instance marks on the timeline will be available for multi-process OpenMP results only for hierarchical groupings starting with Process/.. level.