The General Exploration analysis type uses event-based sampling collection and targets the Intel® Xeon Phi™ coprocessor.
This analysis is a good way to triage hardware issues in programs running on the Intel Xeon Phi coprocessor. Once you have used Hotspots analysis to determine hotspots in your code, you can perform General Exploration analysis to understand how efficiently your code is utilizing the Intel Xeon Phi coprocessor architecture.
The Intel Xeon Phi coprocessor is ideally suited for highly parallel applications that feature a high ratio of computation to data access. It is composed of up to 61 CPU cores connected on-die via a bi-directional ring bus. Each core can switch between up to 4 hardware threads in a round-robin manner, for a total of up to 244 hardware threads. Each core consists of an in-order, dual-issue x86 pipeline, a local L1 and L2 cache, and a separate vector processing unit (VPU). Because the coprocessor is an in-order machine, it can be sensitive to stalls on memory access, so round-robin thread scheduling and aggressive compiler-generated software prefetching are used to mitigate them. It is also important that each hardware thread uses as much of the available vectorization width as possible.
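
The arithmetic behind these figures can be sketched as follows. The core and thread counts come from the paragraph above. The 512-bit vector width is an assumption not stated in this topic, and the helper names are hypothetical, chosen only for illustration.

```python
# Figures taken from the description above.
MAX_CORES = 61
THREADS_PER_CORE = 4

# Assumption (not stated in this topic): the VPU operates on 512-bit vectors.
VECTOR_WIDTH_BITS = 512


def hardware_threads(cores: int = MAX_CORES,
                     threads_per_core: int = THREADS_PER_CORE) -> int:
    """Total number of hardware threads available on the coprocessor."""
    return cores * threads_per_core


def vector_lanes(element_bits: int,
                 vector_bits: int = VECTOR_WIDTH_BITS) -> int:
    """Number of elements of the given size that fit in one vector."""
    return vector_bits // element_bits


def lane_utilization(active_elements: float, element_bits: int) -> float:
    """Fraction of the available lanes used by an average vector instruction."""
    return active_elements / vector_lanes(element_bits)
```

Under these assumptions, `hardware_threads()` gives 244, and a loop that keeps on average 4 double-precision elements active per vector instruction (`lane_utilization(4, 64)`) uses only half of the available width.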
To help you dig into possible issues, the General Exploration analysis type can collect the following groups of metrics:
- L1 cache usage, and estimated maximum latency for L1 cache misses
- L2 cache usage. Use the data for additional L2 cache events with caution, since it includes cache misses from software prefetch instructions.
- Vectorization usage
- TLB usage efficiency
All of the metrics in this analysis type measure activity within one Intel Xeon Phi coprocessor.
The metrics that you get after collecting one or several groups have predefined thresholds. When a metric value falls outside its threshold, the cell for the corresponding hotspot turns pink, hinting that more investigation may be warranted. Memory bandwidth can be calculated using another profile as an additional performance metric. For details on tuning methodology and metrics, see the Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors article.
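
The threshold check described above can be pictured with a minimal sketch. The metric names and limit values below are hypothetical placeholders, not the thresholds programmed into the tool. The sketch only shows the idea of flagging a value that falls on the wrong side of a limit.

```python
from typing import Dict, List, Tuple

# Hypothetical thresholds, for illustration only.
# "min" flags values below the limit, "max" flags values above it.
THRESHOLDS: Dict[str, Tuple[str, float]] = {
    "l1_hit_ratio": ("min", 0.95),
    "l1_tlb_miss_ratio": ("max", 0.01),
}


def flag_metrics(metrics: Dict[str, float]) -> List[str]:
    """Return the names of metrics whose values are outside their threshold."""
    flagged = []
    for name, value in metrics.items():
        rule = THRESHOLDS.get(name)
        if rule is None:
            continue  # no threshold defined for this metric
        kind, limit = rule
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            flagged.append(name)
    return flagged
```

For example, `flag_metrics({"l1_hit_ratio": 0.90, "l1_tlb_miss_ratio": 0.005})` flags only `l1_hit_ratio`, which corresponds to the one cell that would be highlighted for that hotspot.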
To see the full list of events used for the General Exploration analysis type:
- Click the New Analysis toolbar button (standalone GUI or Visual Studio IDE).
The Analysis Type window opens.
- From the left pane, select Microarchitecture Analysis > General Exploration.
The General Exploration configuration pane opens on the right. The Details section provides a table with the processor events used for this analysis type.
Note
Analysis on the Intel Xeon Phi coprocessor is supported with the VTune Amplifier XE only. The analysis types applicable to coprocessor analysis are listed only when you specify the Intel Xeon Phi coprocessor (native) or Intel Xeon Phi coprocessor (host launch) target system type in the Project Properties: Target tab.
You can choose to view General Exploration analysis results in any of the following viewpoints:
Viewpoint | Description |
---|---|
General Exploration | Helps identify where the application is not making the best use of available hardware resources. This viewpoint displays metrics derived from hardware events. The Summary window reports the overall metrics for the entire execution along with explanations of the metrics. From the Bottom-up and Top-down Tree windows you can locate the hardware issues in your application. Cells are highlighted when potential opportunities to improve performance are detected. Hover over the highlighted metrics in the grid to see explanations of the issues. |
Hardware Event Counts | Displays the event count for all collected processor events. While the Hardware Event Sample Counts viewpoint provides the actual number of samples collected for an event, the Hardware Event Counts viewpoint estimates the number of times this event occurred during the collection. |
Hardware Event Sample Counts | Displays the sample count for all collected processor events. While the Hardware Event Counts viewpoint estimates the number of times an event occurred during the collection, the Hardware Event Sample Counts viewpoint provides the actual number of samples collected for this event. |
Hardware Issues | Helps identify where the application is not making the best use of available hardware resources. This viewpoint displays metrics derived from hardware performance counters. Hover over the highlighted metric values in the grid to read why an extreme value might represent a performance problem. |
Hotspots | Helps identify hotspots - code regions in the application that consume a lot of CPU time. |
Bandwidth | Helps identify where the application is generating significant bandwidth to DRAM. Memory bandwidth, in GB/sec, is plotted in the timeline, while events often associated with DRAM requests are shown in the grid. In the timeline, select a region of high bandwidth, and filter that region in. Use the grid to discover where in the code DRAM accesses are being generated. |
Task Time | Visualizes tasks, logical units of work on specific threads, based on ITT API annotations. Identify tasks with the highest execution time and analyze threads responsible for a particular task. |
Note
The Bandwidth viewpoint is available only if you enable the Analyze memory bandwidth option in the General Exploration configuration pane.
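
The difference between the Hardware Event Sample Counts and Hardware Event Counts viewpoints described in the table can be illustrated with a small sketch. It assumes that each sample is recorded after a fixed number of event occurrences (called `sample_after_value` here, a name not used in this topic), so the event count is estimated by scaling the sample count. The function name is hypothetical.

```python
def estimate_event_count(sample_count: int, sample_after_value: int) -> int:
    """Estimate how many times an event occurred during collection.

    Assumes one sample is taken every `sample_after_value` occurrences,
    so the result is an estimate, not an exact count.
    """
    if sample_count < 0 or sample_after_value <= 0:
        raise ValueError("sample_count must be >= 0 and sample_after_value > 0")
    return sample_count * sample_after_value
```

Under this assumption, 1,500 samples of an event sampled once per 2,000,000 occurrences correspond to an estimated 3,000,000,000 occurrences.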
These viewpoints may include the following windows:
- Summary window displays statistics on the overall application execution.
- Bottom-up window displays performance data per metric (event ratio/event count/sample count) for each hotspot function.
- Top-down Tree window displays hotspot functions in the call tree, with performance metrics for a function only (Self value) and for a function and its children together (Total value).
- PMU Events window displays a count of PMU events selected for the analysis.
- Uncore Events window displays a count of uncore events selected for the analysis. If there are no uncore events, the upper pane of the window is empty.
- Tasks, Tasks over Time, and Tasks by Threads windows provide details on tasks specified in your code with the Task API.
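
The Self and Total values shown in the Top-down Tree window relate to each other as in the following sketch. It assumes a simple call tree without recursion. The function names in the usage note are hypothetical, and this is not how the tool itself computes the values, only an illustration of the relationship.

```python
from typing import Dict, List


def total_values(self_values: Dict[str, float],
                 children: Dict[str, List[str]]) -> Dict[str, float]:
    """Compute Total = Self + sum of the children's Total for each function.

    Assumes the call graph is a tree (no cycles).
    """
    totals: Dict[str, float] = {}

    def visit(function: str) -> float:
        if function not in totals:
            totals[function] = self_values.get(function, 0.0) + sum(
                visit(child) for child in children.get(function, [])
            )
        return totals[function]

    for function in self_values:
        visit(function)
    return totals
```

For example, with Self values `{"main": 1.0, "compute": 6.0, "load": 3.0}` and `main` calling `compute` and `load`, the Total value for `main` is 10.0, while the Total values of the two leaf functions equal their Self values.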
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804