If you enable stack collection for a hardware event-based sampling analysis, the Intel® VTune™ Amplifier extends the traditional event-based analysis by providing performance, parallelism, and power efficiency metrics that are correlated with each other and with the actual code execution paths.
To interpret the performance data provided during the hardware event-based sampling analysis with stack collection enabled, you may follow the steps below:
Analyze Performance
Select the Hardware Event Counts viewpoint and click the PMU Events tab to open the PMU Events window. By default, the data in the grid is sorted by the Clockticks (CPU_CLK_UNHALTED) event count, which places the primary hotspots at the top of the list.
Click the plus sign to expand each hotspot node (a function, by default) into a series of call paths, along which the hotspot was executed. VTune Amplifier decomposes all hardware events per call path based on the frequency of the path execution.
The counts of the hardware events of all execution paths leading to a sampled node sum up to the event count of that node. For example, for the SerialSumTree function, which is the top hotspot of the application, the CPU_CLK_UNHALTED event count equals the sum of event counts for three calling sequences: 720 051 881 = 718 433 328 + 815 001 + 803 552.
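The summation rule above can be illustrated with a minimal Python sketch. It assumes a simple proportional split with largest-remainder rounding; this is an illustration of the arithmetic only, not a description of how VTune Amplifier computes the decomposition internally. The function name and the call path labels (path_1, path_2, path_3) are hypothetical; the counts are the ones quoted in the example.

```python
def decompose_event_count(node_count, path_frequencies):
    """Split an integer event count across call paths in proportion to
    how often each path was observed, so that the parts sum exactly to
    node_count (largest-remainder rounding)."""
    total_freq = sum(path_frequencies.values())
    if total_freq <= 0:
        raise ValueError("at least one call path must have a positive frequency")

    # Integer share for each path, plus the remainder left by floor division.
    shares = {p: node_count * f // total_freq for p, f in path_frequencies.items()}
    remainders = {p: node_count * f % total_freq for p, f in path_frequencies.items()}

    # Hand out the few leftover counts to the paths with the largest remainders.
    leftover = node_count - sum(shares.values())
    for path in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        shares[path] += 1
    return shares


# Counts from the SerialSumTree example; the path labels are hypothetical.
serial_sum_tree_paths = {
    "path_1": 718_433_328,
    "path_2": 815_001,
    "path_3": 803_552,
}
node_total = sum(serial_sum_tree_paths.values())
per_path = decompose_event_count(node_total, serial_sum_tree_paths)
print(node_total, per_path)
```

Whatever the path frequencies are, the per-path counts returned by this sketch always add up to the count of the node, which is the property the grid relies on.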
Such a decomposition is extremely important if a hotspot is in a third-party library function whose code cannot be modified, or whose behavior depends on input parameters. In this case, the only way to optimize is to analyze the callers and eliminate excessive invocations of the function, or to learn which parameters or conditions cause most of the performance degradation.
Explore Parallelism
When call stack collection is enabled, the VTune Amplifier analyzes context switches and displays thread activity data using the context switch performance metrics.
Click the Synchronization Context Switches column header to sort the data by this metric. Synchronization hotspots with the highest number of context switches and high Wait time values typically signal thread contention on this stack.
Select the Context Switch Time type in the drop-down menu of the Call Stack pane and explore the Timeline pane that shows each separate thread execution quantum. A dark-green bar represents a single thread activity quantum, and light-green bars represent thread inactivity periods (context switches). Hover over a context switch region in the Timeline pane to view details on its duration, start time, and the reason for thread inactivity.
When you select a context switch region in the Timeline pane, the Call Stack pane displays the call sequence at which the preceding quantum was interrupted.
You may also select a hardware or software event from the Timeline drop-down menu and see how the event maps to the thread activity quanta (or to the inactivity periods).
Correlate the data you obtained during the performance and parallelism analysis. Execution paths that are listed both as performance hotspots with the highest event counts and as synchronization hotspots are obvious candidates for optimization. Your next step could be analyzing power metrics to understand the cost of such a synchronization scheme in terms of energy.
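If you copy the per-path metrics out of the grid, the correlation step can be sketched as a simple set intersection. The sketch below is a minimal illustration under stated assumptions: the call path names and metric values are made up, and the helper names (top_n, optimization_candidates) are hypothetical, not part of any VTune Amplifier interface.

```python
def top_n(metric_by_path, n):
    """Return the n call paths with the highest value of the metric."""
    ranked = sorted(metric_by_path.items(), key=lambda item: item[1], reverse=True)
    return {path for path, _ in ranked[:n]}


def optimization_candidates(clockticks, sync_context_switches, n=2):
    """Call paths that rank in the top n by Clockticks AND by
    Synchronization Context Switches."""
    return top_n(clockticks, n) & top_n(sync_context_switches, n)


# Hypothetical data for four call paths.
clockticks = {"A->f": 900, "B->f": 700, "C->g": 100, "D->h": 50}
sync_context_switches = {"A->f": 12, "B->f": 340, "C->g": 410, "D->h": 3}

candidates = optimization_candidates(clockticks, sync_context_switches)
print(candidates)
```

With this data, only the path that is hot in both rankings is reported, which matches the idea of looking for paths that are expensive both in CPU time and in synchronization.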
Note
The rate at which data is generated (proportional to the sampling frequency and the intensity of thread synchronization or contention) may exceed the rate at which the data is saved to a trace file. In this case, the profiler tries to match the incoming data rate to the outgoing data rate by preventing threads of the profiled program from being scheduled for execution. As a result, paused regions appear on the timeline even if no pause was explicitly requested. In extreme cases, when this procedure fails to limit the incoming data rate, the profiler begins losing sample records but still keeps the counts of hardware events. If this happens, the hardware event counts of the lost sample records are attributed to a special node: [Events Lost on Trace Overflow].
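The attribution of lost records can be pictured with the following minimal sketch. It assumes a simplified record format, a (node, event_count) pair in which node is None when the sample record was lost; this format and the function name are hypothetical and serve only to show that the total event count is preserved.

```python
from collections import Counter

LOST_NODE = "[Events Lost on Trace Overflow]"


def aggregate_event_counts(records):
    """Sum event counts per node. A record whose node is None stands for
    a lost sample record; its count goes to the special node."""
    totals = Counter()
    for node, event_count in records:
        totals[LOST_NODE if node is None else node] += event_count
    return totals


# Hypothetical records: the second one lost its sample data on overflow.
records = [("SerialSumTree", 500), (None, 300), ("main", 200)]
totals = aggregate_event_counts(records)
print(dict(totals))
```

The sum over all nodes, including the special node, still equals the total number of events counted.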
Analyze Idle Power Consumption
If you configured your collection to analyze idle power consumption, the VTune Amplifier displays the following metrics in the result: Inactive Time, Idle Time, Idle Wakeups, and Cx Residency.
The scheme below shows that when a thread becomes inactive, the system may start executing other threads (including threads from other applications) or even go to the idle state. If a thread whose execution immediately follows a period of idleness belongs to the process being profiled, the profiler registers this as an Idle Wakeup and adds the duration of the idle period to the Idle Time.

Note
The system can spend more time in the idle state than reported in the profile, because the VTune Amplifier registers only those idle periods that were interrupted by threads of the application being profiled.
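The accounting rule described above can be sketched as follows. The timeline model is an assumption made for illustration: each slice of processor time is labeled "profiled", "other", or "idle", and the type and function names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Slice(NamedTuple):
    start: int   # start time, arbitrary units
    end: int     # end time, same units
    owner: str   # "profiled", "other", or "idle"


def idle_metrics(timeline: List[Slice]) -> Tuple[int, int]:
    """Return (idle_wakeups, idle_time). An idle period is counted only
    when the slice that immediately follows it belongs to the profiled
    process."""
    wakeups = 0
    idle_time = 0
    for current, following in zip(timeline, timeline[1:]):
        if current.owner == "idle" and following.owner == "profiled":
            wakeups += 1
            idle_time += current.end - current.start
    return wakeups, idle_time


# Hypothetical timeline for one logical processor.
timeline = [
    Slice(0, 2, "profiled"),
    Slice(2, 5, "idle"),      # ended by the profiled process: counted
    Slice(5, 6, "profiled"),
    Slice(6, 10, "idle"),     # ended by another application: not counted
    Slice(10, 11, "other"),
    Slice(11, 12, "idle"),    # ended by the profiled process: counted
    Slice(12, 13, "profiled"),
]
idle_wakeups, idle_time = idle_metrics(timeline)
system_idle = sum(s.end - s.start for s in timeline if s.owner == "idle")
print(idle_wakeups, idle_time, system_idle)
```

In this sketch the system is idle for longer than the reported Idle Time, because the idle period that was ended by another application is not counted.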
Modern processors (for example, those based on Intel microarchitecture code name Nehalem or Sandy Bridge) implement special low-power states to help the operating system save energy while idle. These low-power states are referred to as sleep states, or Cx-states, where x is a number denoting the depth of the processor 'sleep': the greater the number, the deeper the sleep and the lower the power consumption. With sleep states, the operating system can shut down various processor blocks (for example, various levels in the cache hierarchy) when there is no task to execute. However, shutting the processor down comes at a cost: the deeper the sleep state, the longer it takes to enter and leave that state. It is therefore inefficient to use deep sleep states for short time intervals, while lighter sleep states are faster to enter but make the processor consume more power.
VTune Amplifier displays the amount of time spent in each of the low-power states as Cx Residency metrics, which can be used to determine whether the processor was effectively 'sleeping' during idle periods.
Note
The presence, actual number, type, and quality of low-power states are processor specific. Refer to your processor’s Software Developer Manual for details.
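One way to read the Cx Residency metrics is to convert the time spent in each state into a share of the total idle time. The sketch below is a minimal illustration with made-up numbers; the state names C1, C3, and C6 are used as examples only, since the available states depend on the processor, and the function name is hypothetical.

```python
def residency_shares(time_in_state):
    """Convert time spent in each Cx-state into a fraction of the total
    time spent in all listed states."""
    total = sum(time_in_state.values())
    if total <= 0:
        raise ValueError("total residency time must be positive")
    return {state: t / total for state, t in time_in_state.items()}


# Hypothetical residency times in milliseconds.
cx_residency_ms = {"C1": 10, "C3": 30, "C6": 60}
shares = residency_shares(cx_residency_ms)
deepest_state = max(shares, key=shares.get)
print(shares, deepest_state)
```

A large share in the deepest available state suggests that the processor slept effectively during idle periods, while a large share in a light state suggests frequent short idle intervals.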
Analyze Active Power Consumption
Some Intel processors (for instance, those based on Intel microarchitecture code name Sandy Bridge) implement energy counters that can be used to evaluate the energy dissipated by processor cores, integrated graphics, DRAM, or the entire processor package while executing an application.
If you enabled the active power consumption analysis for your collection, the VTune Amplifier displays energy metrics as follows:
The Energy Core column shows how many microjoules of energy the processor cores dissipated while executing different functions and call paths of the application being profiled. The Energy Pack column shows the energy consumed by the entire processor package. The Energy GFX column shows the energy consumed by the integrated graphics. If the application does not produce any graphical output, the Energy GFX metric is zero, while the Energy Core and Energy Pack metrics contain non-zero values. The energy values correlate with the processor Clockticks event, and the amount of energy consumed by the processor package is always greater than the energy consumed by the cores.
The energy counters are specific neither to a core nor to a logical processor. As a result, a thread also accumulates energy counts for other threads running simultaneously on neighboring processors of the same package, so the energy metrics tend to be overcounted.
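The overcounting effect can be illustrated with a minimal sketch. It assumes, for simplicity, a constant package power and a model in which each thread is charged the full package energy consumed while it runs; the thread names, intervals, and power value are hypothetical.

```python
def attributed_energy(package_power_watts, thread_intervals):
    """Energy (joules) charged to each thread: the package-wide energy
    consumed during the interval in which that thread was running."""
    return {
        name: package_power_watts * (end - start)
        for name, (start, end) in thread_intervals.items()
    }


# Two threads on the same package; they overlap from t=5 s to t=10 s.
thread_intervals = {"thread_A": (0, 10), "thread_B": (5, 10)}
package_power_watts = 2

per_thread = attributed_energy(package_power_watts, thread_intervals)
attributed_total = sum(per_thread.values())
actual_package_energy = package_power_watts * (10 - 0)
print(per_thread, attributed_total, actual_package_energy)
```

The sum of the per-thread values exceeds the energy the package actually consumed, because the overlapping interval is charged to both threads. Keep this in mind when you compare energy metrics across threads.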
Note
The presence and type of energy counters are processor specific. Refer to your processor’s Software Developer Manual for details.