Intel® VTune™ Amplifier XE 2013
A performance profiler for serial and parallel performance analysis. Overview, training, support.
New for Update 17!
- Optional update unless you need:
- Improved OpenMP* region analysis
- Added new analysis type “TSX Exploration” for 4th generation Intel® Core™ processors
- Extended Summary window
- Changed the default Call Stack Mode setting
- Updated product toolbar
- Added remote system configuration options
- Automatically disabled NMI Watchdog (nmi_watchdog) timer only during data collection
- Updated Event Reference for Intel microarchitectures code name Ivy Bridge and Haswell
- Added Ubuntu* 14.04 support
- Added stability and performance improvements
Note: We are now labeling analysis tool updates as "Recommended for all users" or "Optional update unless you need…". Recommended updates will be available about once a quarter for users who do not want to update frequently. Optional updates may be released more frequently, providing access to new processor support, new features, and critical fixes.
Resources
- Learn (“How to” videos, technical articles, documentation, …)
- Support
- Release Notes
Contents
File: vtune_amplifier_xe_2013_update17.tar.gz
Installer for Intel® VTune™ Amplifier XE 2013 Update 17 for Linux*
File: VTune_Amplifier_XE_2013_update17_setup.exe
Installer for Intel® VTune™ Amplifier XE 2013 Update 17 for Windows*
* Other names and brands may be claimed as the property of others.
Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.
Details:
Improved OpenMP* region analysis
Two common sources of overhead in an OpenMP* program are serial time and load imbalance. OpenMP is a fork-join parallel model, which means that an OpenMP program starts with a single master thread executing serial code. Parallel regions cause the master thread to fork into multiple threads, which then execute the parallel region. At the end of the parallel region, the threads join at a barrier, and then the master thread continues executing serial code. It is possible to write an OpenMP program more like an MPI program, where the master thread immediately forks to a parallel region and constructs such as barrier and single are used for synchronization. But it is far more common for an OpenMP program to consist of a sequence of parallel regions interspersed with serial code. In such a program, time is spent waiting in the OpenMP runtime in two cases:
- Serial time: When the master thread is executing a serial region, the slave threads in the OpenMP runtime are waiting for the next parallel region.
- Load imbalance: When a thread finishes a parallel region, it waits in a barrier for the other threads to finish.
Intel® VTune™ Amplifier together with Intel Composer XE 2013 Update 2 or later helps you understand where an OpenMP program is serial and where it is imbalanced. It also provides a mechanism to correlate the time spent in the OpenMP runtime with the source code of the program. The OpenMP runtime library in the Intel Composer XE contains markers that can be used by the VTune Amplifier to break out the time in OpenMP by parallel region and serial code. The following paragraphs highlight the enhancements.
Summary pane: Use the OpenMP Region Duration histogram to analyze instances of each OpenMP region, explore the time distribution per instance, and identify Fast/Good/Slow region instances. You can then focus on performance outlier instances in the Grid and Timeline views. The initial distribution of region instances across the Fast/Good/Slow categories uses a ratio of 20/40/20 between the minimum and maximum region time values.
Bottom-up pane: Select the OpenMP Region grouping level and analyze the CPU, Spin, and Overhead time spent in OpenMP regions. High Spin time values signal an imbalance in a parallel region. As a potential solution, you may set dynamic scheduling to reduce the imbalance. High Overhead time values can result from overly fine-grained parallel work with a high scheduling cost. In this case, consider increasing the amount of parallel work executed by each worker thread, for example, by defining the region for an outer loop.
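As an illustration of the dynamic scheduling suggestion, here is a minimal sketch, not taken from the product documentation. The functions `work` and `sum_dynamic` and the chunk size of 16 are hypothetical choices made for this example. The cost of an iteration grows with its index, so an even static split would leave some threads spinning at the barrier, while `schedule(dynamic, 16)` hands out small chunks on demand.

```cpp
#include <cstdint>

// Hypothetical workload: the cost of iteration i grows with i,
// so an even static split of the loop would be imbalanced.
std::uint64_t work(int i) {
    std::uint64_t acc = 0;
    for (int k = 0; k < i; ++k) {
        acc += static_cast<std::uint64_t>(k) % 7;
    }
    return acc;
}

// Sums work(i) for i in [0, n). With OpenMP enabled at compile time,
// chunks of 16 iterations are assigned to threads as they become idle.
// Without OpenMP the pragma is ignored and the loop runs serially.
std::uint64_t sum_dynamic(int n) {
    std::uint64_t total = 0;
    #pragma omp parallel for schedule(dynamic, 16) reduction(+ : total)
    for (int i = 0; i < n; ++i) {
        total += work(i);
    }
    return total;
}
```

A chunk size that is too small can raise the Overhead time instead, so the value is worth tuning against the Bottom-up pane results.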
Top-down Tree pane: Explore the logical program flow of OpenMP regions. Call stacks of worker threads are joined with the corresponding fork point (OMP parallel for or OMP parallel directives) in the master thread, so you can see the full control flow for a hotspot in the worker threads.
Timeline pane: Explore markers on the Timeline ruler area corresponding to OpenMP region instance duration. Hover over a marker to see the details on the region instance executed at this particular moment of time or click the marker to select the region on the timeline and filter data by region time.
Added new analysis type “TSX Exploration” for 4th generation Intel® Core™ processors
With Intel® Core™ processors based on the Intel microarchitecture code name Haswell, use the special VTune Amplifier analysis type TSX Exploration for tuning applications that use Intel® Transactional Synchronization Extensions (Intel® TSX). The analysis relies on performance counter-based profiling to understand transactional execution behavior and the causes of transactional aborts. For more information on Intel TSX, see Web Resources about Intel® Transactional Synchronization Extensions.
NOTE: Perform the analysis on Haswell processors without the "K" designator; for example, the Intel® Core™ i7-4770K does not support Intel TSX.
The tuning process consists of two steps:
- Measuring transactional success
The first step is to measure the transactional success in an application. Select the TSX Exploration analysis type and choose "1. Transactional success" from the Analysis Step combo box.
Note that three metrics are collected:
a) Clockticks - total number of unhalted cycles collected
b) Transactional Cycles - number of cycles spent during transactions. If it is near zero, then the application is either not using Intel TSX-based synchronization or not using a synchronization library enabled for lock elision through the Intel TSX instructions.
c) Abort Cycles - number of cycles spent during transactions that were eventually aborted. If it is small relative to Transactional Cycles, then the transactional success rate is high and additional tuning is not required. If it is almost the same as Transactional Cycles (and not very small), then most transactional regions are aborting and lock elision is not going to be beneficial. In that case, identify the causes of the transactional aborts and reduce them, as described in the next step.
- Sampling transactional aborts
Select the TSX Exploration analysis type and choose the "2. Aborts" option from the Analysis Step combo box.
As a result of this analysis, you’ll see where the transaction aborts are happening and for what reason. Possible reasons include:
a) Instruction - Some instructions, such as CPUID and IO instructions, may cause a transactional execution to abort.
b) Data Conflict - A conflicting data access occurs if another logical processor either reads a location that is part of the transactional region's write-set or writes a location that is a part of either the read- or write-set of the transactional region. Since Intel TSX detects data conflicts at the granularity of a cache line, unrelated data locations placed in the same cache line will be detected as conflicts.
c) Capacity - Transactional aborts may also occur due to limited transactional resources. For example, the amount of data accessed in the region may exceed an implementation-specific capacity.
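The Data Conflict case can be triggered by unrelated variables that merely share a cache line. The sketch below shows one common way to separate them. It assumes a 64-byte cache line, and the names `kCacheLine`, `PackedCounters`, `PaddedCounters`, and `same_cache_line` are hypothetical, introduced only for this example.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size in bytes; check the target processor.
constexpr std::size_t kCacheLine = 64;

// Two counters updated by different threads. Placed side by side,
// they normally land in one cache line, so a write to one can be
// detected as a conflict with a transaction that touches the other.
struct PackedCounters {
    std::uint64_t a;
    std::uint64_t b;
};

// Aligning each field to the cache line size puts the counters in
// separate lines, at the cost of a larger structure.
struct PaddedCounters {
    alignas(kCacheLine) std::uint64_t a;
    alignas(kCacheLine) std::uint64_t b;
};

// Returns true if both addresses fall into the same cache line.
inline bool same_cache_line(const void* p, const void* q) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(q) / kCacheLine;
}
```

Padding trades memory for fewer conflicts, so it is best applied only to data that the abort analysis actually points to.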
Extended Summary window with hyperlinks for Top Hotspots and performance metrics navigating to the Bottom-up grid view
The Summary pane has been enriched with hyperlinks for Top Hotspots, performance metrics, and General Exploration issues. Each link takes you to the Bottom-up grid view with the respective function selected or the respective metric column sorted.
Changed the default Call Stack Mode setting from “Only user functions” to “User functions + 1” for better understanding of library usage
Default settings for the Call Stack Mode drop-down menu on the filter bar have been changed to "User functions + 1".
With the previous default Call Stack Mode, "Only user functions", customers were often surprised not to see library code in the results even though they were sure the application used MKL, IPP, or some other library. VTune Amplifier usually treats such libraries as “system” code, and in this mode it attributes all system code back to the calling user function. Attributing everything to user functions created some confusion.
The User functions + 1 mode filters out all system functions except those called directly from user functions, so you can see which top function is hot and which function calls it.
NOTE: The change is visible only in newly created VTune Amplifier projects or in existing projects where you never changed the Call Stack mode. Otherwise, the Call Stack mode is inherited from the project properties.
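The difference between the two modes can be sketched as a filter over a call stack. This is an illustration of the behavior described above, not VTune Amplifier's implementation; the `Frame` type, the `filter_stack` function, and the function names in the usage note are hypothetical.

```cpp
#include <string>
#include <vector>

// One call stack entry; is_system marks code that the profiler
// classifies as system or library code (assumed classification).
struct Frame {
    std::string name;
    bool is_system;
};

// Filters a stack ordered from the outermost caller to the leaf.
// plus_one == false: keep user functions only.
// plus_one == true:  also keep a system function when its direct
//                    caller is a user function.
std::vector<std::string> filter_stack(const std::vector<Frame>& stack,
                                      bool plus_one) {
    std::vector<std::string> kept;
    bool caller_is_user = false;
    for (const Frame& frame : stack) {
        if (!frame.is_system) {
            kept.push_back(frame.name);
            caller_is_user = true;
        } else {
            if (plus_one && caller_is_user) {
                kept.push_back(frame.name);
            }
            caller_is_user = false;
        }
    }
    return kept;
}
```

For a stack such as main, solve, cblas_dgemm, dgemm_kernel, where the last two are library code, the user-only filter keeps main and solve, while the "+ 1" filter also keeps cblas_dgemm, the library entry point called directly from user code.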
Updated the product toolbar to provide quick access to the product documentation through the new Help button and to the Import dialog box (standalone only) through the Import Result button.
Added remote system configuration options
The Target tab of the Project Properties has been enhanced to specify a path to the VTune Amplifier installed on the remote machine and a path to a remote temporary directory used for storing performance results.
When collecting data remotely, the VTune Amplifier XE looks for the collectors on the target system in the default install location: /opt/intel/vtune_amplifier_xe_2013. It also temporarily stores performance results on the target system in the /tmp directory.
If you installed the VTune Amplifier XE to a different location on the target or need to specify another temporary directory, use the appropriate configuration options on the Target tab of the Project Properties in the GUI, or the command line collection knobs -target-install-dir and -target-tmp-dir.
Automatically disabled NMI Watchdog (nmi_watchdog) timer only during data collection
The Non-Maskable Interrupt (NMI) watchdog causes incorrect results in the PMU event-based sampling (EBS) analysis. For that reason, previous releases of the VTune Amplifier XE refused to perform EBS collection if the nmi_watchdog was ON, and the user had to disable it manually.
Effective with VTune Amplifier XE 2013 Update 17, the nmi_watchdog timer is disabled automatically, and only during the EBS collection period. It is automatically re-enabled after collection completes.
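To confirm the watchdog state before and after a collection, you can read the Linux setting directly. The following is a minimal sketch, assuming a Linux system that exposes the setting at /proc/sys/kernel/nmi_watchdog; the helper names `nmi_watchdog_enabled_from` and `nmi_watchdog_enabled` are hypothetical and are not part of VTune Amplifier.

```cpp
#include <fstream>
#include <string>

// Interprets the text read from the nmi_watchdog setting:
// a leading '1' means the watchdog is enabled, anything else
// (including empty input) is treated as disabled.
bool nmi_watchdog_enabled_from(const std::string& content) {
    const std::string::size_type pos = content.find_first_not_of(" \t\r\n");
    return pos != std::string::npos && content[pos] == '1';
}

// Reads the setting from the given path. If the file cannot be
// opened, the content stays empty and the result is false.
bool nmi_watchdog_enabled(const char* path = "/proc/sys/kernel/nmi_watchdog") {
    std::ifstream in(path);
    std::string content;
    std::getline(in, content);
    return nmi_watchdog_enabled_from(content);
}
```

Calling `nmi_watchdog_enabled()` before and after an EBS run should return the same value if the timer was restored as described.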