OpenMP* Support

Prerequisites:

  • To collect and view OpenMP-specific information with all features, make sure you have a version of the OpenMP runtime libraries required by the VTune Amplifier. If an obsolete version of the OpenMP runtime libraries is detected, the VTune Amplifier provides a warning message. In this case the collection results may be incomplete.

  • To analyze OpenMP parallel frames, compile and run your code with the Intel® Compiler 13.1 Update 2 or higher (part of the Intel Composer XE 2013 Update 2). Starting with this version, the OpenMP runtime of the Intel compiler uses the Frame API to emit notifications at fork and join points, which correspond to the start and end points of a parallel region.

The most frequent sources of OpenMP* runtime overhead are load imbalance and serial time. OpenMP is a fork-join parallel model, which means that an OpenMP program starts with a single master thread executing serial code. When a parallel region is encountered, that thread forks into multiple threads, which then execute the parallel region. At the end of the parallel region, the threads join at a barrier, and then the master thread continues executing serial code. It is possible to write an OpenMP program more like an MPI program, where the master thread immediately forks to a parallel region and constructs such as barrier and single are used for synchronization. But it is far more common for an OpenMP program to consist of a sequence of parallel regions interspersed with serial code. In such a program, threads spend time waiting in the OpenMP runtime in two cases:

  1. Serial time: When the master thread is executing a serial region, the worker threads are in the OpenMP runtime waiting for the next parallel region.

  2. Load imbalance: When a thread finishes a parallel region, it waits in a barrier for the other threads to finish.
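Both cases can be reproduced in a few lines. The following is a minimal sketch, not code from the product documentation: the function names `process_item` and `serial_then_parallel` are hypothetical, and the workload is deliberately skewed so that later iterations cost more than earlier ones. It assumes the code is built with the compiler's OpenMP option enabled; without it the pragma is ignored and the loop simply runs on one thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical workload: the cost of item i grows with i, so with equal-sized
   static chunks the thread that owns the last chunk has the most work. */
static double process_item(size_t i)
{
    double acc = 0.0;
    for (size_t k = 0; k <= i; ++k)
        acc += (double)k;
    return acc;
}

/* Runs a serial region followed by one parallel region.
   Returns the checksum computed in the serial part; fills out[0..n-1]. */
double serial_then_parallel(double *out, size_t n)
{
    assert(out != NULL);

    /* Case 1, serial time: only the master thread executes this loop.
       Any other threads in the team wait in the OpenMP runtime. */
    double checksum = 0.0;
    for (size_t i = 0; i < n; ++i)
        checksum += (double)i;

    /* Fork: the parallel region starts here.
       Case 2, load imbalance: schedule(static) hands each thread one
       contiguous chunk, and the chunks differ in cost. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)
        out[i] = process_item((size_t)i);
    /* Join: threads that finish early wait at the implicit barrier. */

    return checksum;
}
```

In a profile of such a program, the first loop would be expected to show up as time outside any parallel region, and the barrier wait as time in the runtime attributed to the parallel region. Trying a different schedule, such as `schedule(dynamic)`, is one common way to experiment with the imbalance.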

Intel® VTune™ Amplifier, together with Intel Composer XE 2013 Update 2 or later, helps you understand where an OpenMP program is serial and where it is imbalanced, and provides a mechanism to correlate the time spent in the OpenMP runtime with the source code of the program. The OpenMP runtime library in the Intel Composer XE contains frame markers that the VTune Amplifier uses to break down the time spent in OpenMP by parallel region and by the serial part of the code.

When results are grouped by frame domain, the frame domain is an OpenMP region number as reported by the instrumented libiomp5.so contained in the Intel Composer XE. The name consists of the (mangled) name of the function containing the parallel region together with the beginning and ending line numbers of the region. You can also see the number of times the region was executed (Frame Count) and the total wall-clock time spent in the region (Frame Time). There is also an entry for the serial part of the code (Outside any frame). You can expand the frames to see the functions called in each frame (or in the serial part), which gives a function profile by parallel region. You can also see the time spent by each thread in the parallel region, which helps you determine which threads, if any, are starved for work.

Configuring OpenMP Parallel Frame Analysis

You can use frame analysis for OpenMP regions on both the host and the Intel® Xeon Phi™ coprocessor cards.

To analyze OpenMP parallel frames on the host:

Set the KMP_FORKJOIN_FRAMES environment variable to 1 via the User-defined Environment Variables dialog box available from the Project Properties dialog box.

To configure frame collection on the Intel Xeon Phi coprocessor card, set up the user API collection by propagating the environment variables from the host to the Intel Xeon Phi coprocessor card.

Viewing OpenMP Data

For Intel runtime libraries supporting the OpenMP standard, the VTune Amplifier provides extended information on OpenMP functions. VTune Amplifier shows all synchronization-related pragmas (for example, #pragma omp critical) as synchronization objects with the OMP- prefix. If you developed your application with the Intel® runtime libraries, you can run the Locks and Waits analysis to get detailed information on the OpenMP synchronization objects that limit the parallel performance of your multithreaded application.
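As an illustration of the kind of construct that would be reported as a synchronization object, here is a minimal sketch with a critical section inside a parallel loop. The function name `count_multiples` is hypothetical and chosen only for this example; it assumes a build with the compiler's OpenMP option enabled (without it the pragmas are ignored and the loop runs serially with the same result).

```c
#include <assert.h>

/* Hypothetical example: counts the multiples of `divisor` in [0, n). */
long count_multiples(long n, long divisor)
{
    assert(divisor > 0);
    long count = 0;

    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        if (i % divisor == 0) {
            /* Only one thread at a time may execute this block, so threads
               that arrive together have to wait for each other here. */
            #pragma omp critical
            {
                ++count;
            }
        }
    }
    return count;
}
```

If an analysis shows significant wait time on a critical section like this one, a `reduction(+:count)` clause on the loop is a common alternative that lets each thread accumulate privately and combine the results once at the end.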

VTune Amplifier provides data on the OpenMP parallel regions (frame domains) in the following views:

  • Summary window: Identify the most time-consuming OpenMP functions and use Frame Rate histograms to identify parallel regions with the highest number of slow frames.

  • Bottom-up pane: Select the Frame Domain grouping level and analyze CPU time spent in OpenMP frame domains. Focus on the [No frame domain - Outside any frame] nodes that represent the serial time spent outside any frame. This time should be as close to zero as possible since this is a serial component of the Amdahl's Law equation that limits scalability of your code as the number of available threads increases.

    Note

    By default, the Intel compiler does not add a source file name to region names, so the unknown string shows up in the OpenMP frame domain name. To get the source file name in the region name, use the -parallel-source-info=2 option during compilation.

  • Top-down Tree pane: Explore the logical program flow of OpenMP regions restored with the help of the stack stitching option. This option is enabled by default for OpenMP applications using Intel runtime libraries and compiled with the Intel® Compiler 13.1 Update 3 or higher (part of the Intel Composer XE 2013 Update 3).

  • Tasks and Frames pane: Correlate information on thread activity and frame rate for each OpenMP region. Identify functions with a low frame rate, switch to the Bottom-up pane, locate these hot functions, and double-click to open and edit the source.
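The Bottom-up guidance above refers to the serial component of Amdahl's Law. As a rough sketch of why the time outside any frame matters, the function below evaluates the standard form of the law, speedup = 1 / (s + (1 - s) / n), where s is the serial fraction of the run time and n is the number of threads. The function name `amdahl_speedup` is hypothetical, and the formula is an idealized bound that ignores runtime overhead and load imbalance.

```c
#include <assert.h>

/* Idealized Amdahl's Law bound on speedup.
   serial_fraction: share of run time spent outside parallel regions, in [0, 1].
   threads:         number of threads, at least 1. */
double amdahl_speedup(double serial_fraction, int threads)
{
    assert(serial_fraction >= 0.0 && serial_fraction <= 1.0);
    assert(threads >= 1);

    double parallel_fraction = 1.0 - serial_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / (double)threads);
}
```

For example, if 10% of the run time is serial, the bound approaches 1 / 0.1 = 10 no matter how many threads are added, which is why reducing the [No frame domain - Outside any frame] time pays off as the thread count grows.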

Limitations

VTune Amplifier supports the analysis of parallel OpenMP regions with the following limitations:

  • The maximum number of supported parallel regions is 512, which means that no frames are emitted for regions whose scope is reached after 512 other parallel regions have been encountered.

  • Frames from nested parallelism are not supported. Only top-level teams emit frames.

  • Opening a source file from the Tasks and Frames window does not work.

  • VTune Amplifier does not support static linkage of OpenMP libraries.
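To make the nested parallelism limitation concrete, the sketch below contains a top-level parallel region with a second parallel region inside it. The function name `nested_sum` is hypothetical. Based on the limitation stated above, only the outer region would be expected to appear as a frame domain; time spent in the inner region would not get a frame of its own. It assumes a build with the compiler's OpenMP option enabled, and the result is the same when the pragmas are ignored.

```c
#include <assert.h>

/* Hypothetical example: sums i * j over 0 <= i < outer, 0 <= j < inner. */
long nested_sum(int outer, int inner)
{
    assert(outer >= 0 && inner >= 0);
    long total = 0;

    /* Top-level parallel region: its team is the one that emits frames. */
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < outer; ++i) {
        long row = 0;

        /* Nested parallel region: created from inside the outer team,
           so per the limitation above it emits no frames. */
        #pragma omp parallel for reduction(+:row)
        for (int j = 0; j < inner; ++j)
            row += (long)i * (long)j;

        total += row;
    }
    return total;
}
```

If per-region timing of the inner loop matters, one option is to restructure the code so that the loop of interest is the top-level parallel region.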

Supplemental documentation specific to a particular Intel Studio may be available at <install-dir>\<studio>\documentation\ .
