The Write Bandwidth analysis type uses event-based sampling collection and targets Intel® processors code name Beckton or Eagleton.
The Write Bandwidth analysis type measures the data written to DRAM via the processor's integrated memory controller. After data collection, the Intel® VTune™ Amplifier displays the write bandwidth-over-time chart. Use this chart to identify regions of high memory bandwidth and compare that bandwidth with your machine's theoretical bandwidth limits. This helps you determine whether the code is saturating the available bandwidth, which is a significant performance limitation. Once you have identified and filtered in a region of high bandwidth, use the memory-access-correlated grid metrics to identify the thread, process, module, source file, or function responsible for the performance bottleneck.
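The comparison described above can be sketched in a few lines of Python. This is only an illustration, not how the VTune Amplifier computes its chart: it assumes that each counted DRAM write moves one 64-byte cache line, and the function names, the sample counts, the 11 GB/sec limit, and the 80% threshold are all hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumption: each counted write moves one 64-byte cache line


def write_bandwidth_gbs(write_counts, interval_sec):
    """Convert per-interval DRAM write counts into GB/sec values."""
    return [count * CACHE_LINE_BYTES / interval_sec / 1e9 for count in write_counts]


def saturated_intervals(bandwidth_gbs, limit_gbs, threshold=0.8):
    """Return indices of intervals at or above `threshold` of the limit."""
    return [i for i, bw in enumerate(bandwidth_gbs) if bw >= threshold * limit_gbs]


# Hypothetical data: DRAM write counts collected once per second.
counts = [10_000_000, 150_000_000, 160_000_000, 20_000_000]
series = write_bandwidth_gbs(counts, interval_sec=1.0)

# Hypothetical machine limit of 11 GB/sec; flag intervals above 80% of it.
hot = saturated_intervals(series, limit_gbs=11.0)
print(series, hot)
```

Intervals flagged this way correspond to the regions you would select and filter in on the timeline.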
To determine theoretical bandwidth limits for your machine, use a benchmark that is bandwidth-limited by design (like the STREAM benchmarks).
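As a rough illustration of what a bandwidth-limited benchmark does, the sketch below times a bulk memory copy in Python. It is not the STREAM benchmark and is not a substitute for it: it is single-threaded, so it will typically understate what a multi-socket machine can sustain, and the function name, buffer size, and repeat count are arbitrary choices made for this example.

```python
import time


def copy_bandwidth_gbs(size_bytes=32 * 1024 * 1024, repeats=5):
    """Estimate sustained copy bandwidth in GB/sec (best of `repeats` runs)."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # bulk copy: reads size_bytes and writes size_bytes
        best = min(best, time.perf_counter() - start)
    # Count both the bytes read and the bytes written, as a copy kernel does.
    return 2 * size_bytes / best / 1e9


measured_limit = copy_bandwidth_gbs()
print(f"approximate copy bandwidth: {measured_limit:.1f} GB/sec")
```

The buffers are deliberately much larger than a typical last-level cache so that the copy is served from DRAM and not from cache.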
To see the full list of events used for this analysis type:
Click the New Analysis toolbar button (in the standalone GUI or the Visual Studio IDE).
The Analysis Type window opens.
From the left pane, select Microarchitecture Analysis > CPU Specific Analysis > Nehalem/Westmere Analysis > Write Bandwidth Analysis.
The Write Bandwidth Analysis configuration pane opens on the right. The Details section provides a table with the processor events used for this analysis type.
You can choose to view Write Bandwidth analysis results in any of the following viewpoints:
Viewpoint | Description |
---|---|
Hardware Event Counts | Displays the event count for all collected processor events. While the Hardware Event Sample Counts viewpoint provides the actual number of samples collected for an event, the Hardware Event Counts viewpoint estimates the number of times this event occurred during the collection. |
Hardware Event Sample Counts | Displays the sample count for all collected processor events. While the Hardware Event Counts viewpoint estimates the number of times an event occurred during the collection, the Hardware Event Sample Counts viewpoint provides the actual number of samples collected for this event. |
Hardware Issues | Helps identify where the application is not making the best use of available hardware resources. This viewpoint displays metrics derived from hardware performance counters. Hover over the highlighted metric values in the grid to read why an extreme value might represent a performance problem. |
Bandwidth | Helps identify where the application is generating significant bandwidth to DRAM. Memory bandwidth, in GB/sec, is plotted in the timeline, while events often associated with DRAM requests are shown in the grid. In the timeline, select a region of high bandwidth, and filter that region in. Use the grid to discover where in the code DRAM accesses are being generated. |
Task Time | Visualizes tasks, logical units of work on specific threads, based on ITT API annotations. Identify tasks with the highest execution time and analyze threads responsible for a particular task. |
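
The relationship between the two hardware event viewpoints can be sketched as follows. In event-based sampling, one sample is typically recorded each time an event counter reaches its sample-after value, so the event count is usually estimated as the sample count multiplied by that value. The function name and the numbers below are hypothetical and serve only to illustrate the arithmetic.

```python
def estimated_event_count(sample_count, sample_after_value):
    """Estimate how many times an event occurred from the number of samples.

    Assumes one sample is recorded every `sample_after_value` occurrences.
    """
    return sample_count * sample_after_value


# Hypothetical values: 1,500 samples with a sample-after value of 2,000,003.
estimate = estimated_event_count(1_500, 2_000_003)
print(estimate)
```

The result is an estimate because occurrences after the last recorded sample are not counted.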
These viewpoints may include the following windows:
Summary window displays statistics on the overall application execution.
Bottom-up window displays memory-access-correlated performance metrics (event ratio, event count, or sample count) for each program unit.
Top-down Tree window displays hotspot functions in the call tree, with performance metrics for the function alone (Self value) and for the function together with its children (Total value).
PMU Events window displays counts for the PMU events selected for the analysis.
Uncore Events window displays counts for the uncore events selected for the analysis. If there are no uncore events, the upper pane of the window is empty.
Tasks, Tasks over Time, and Tasks by Threads windows provide details on tasks specified in your code with the Task API.
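The difference between the Self and Total values shown in the Top-down Tree window can be illustrated with a small call tree. This is a sketch of the general idea only; the `Node` class, the function names, and the metric values are hypothetical and do not reflect the VTune Amplifier's internal data model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    self_value: int  # metric attributed to the function itself
    children: list = field(default_factory=list)


def total_value(node):
    """Total value: the function's Self value plus the totals of its callees."""
    return node.self_value + sum(total_value(child) for child in node.children)


# Hypothetical call tree with per-function Self values.
store = Node("store_results", 15)
compute = Node("compute", 70, [store])
tree = Node("main", 10, [compute, Node("report", 5)])

print(total_value(tree), total_value(compute), total_value(store))
```

A function with a small Self value but a large Total value, such as `main` here, spends most of its cost in the functions it calls.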
Limitations
Unlike processors based on the Intel microarchitecture code name Sandy Bridge, Intel processors code name Beckton or Eagleton (based on Intel microarchitecture code name Nehalem) are multi-socket and are therefore prone to NUMA (Non-Uniform Memory Access) issues. Each socket has its own DRAM, and a socket can access its local DRAM significantly faster than the DRAM attached to any other socket. This has the following ramifications for the Write Bandwidth analysis:
If the target application generates a large amount of memory traffic, it may be bandwidth-limited. If it is, ensure (via core affinity or a similar approach) that the data is allocated on the package where it is used most.
BIOS settings can drastically affect your system's NUMA behavior. Before running the Write Bandwidth analysis, consider enabling NUMA in the BIOS. If NUMA is not enabled, the system may allocate pages of memory in a round-robin fashion, one page to each package's DRAM in turn, and the VTune Amplifier may treat the whole system as having only one package. Choosing the proper BIOS settings may improve performance by 20% or more. This setting is also required to interpret the collected bandwidth data correctly.
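Both points can be checked from a script on Linux. The sketch below is one possible approach and not part of the VTune Amplifier: it reads the NUMA topology that the operating system exposes under `/sys/devices/system/node` and pins the current process to the cores of one node with `os.sched_setaffinity`. The helper names are hypothetical. On systems without that sysfs directory or without affinity support, the helpers return an empty result or `False`.

```python
import glob
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8-11" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def numa_nodes():
    """Map NUMA node id -> set of CPUs, as reported by Linux sysfs (may be empty)."""
    nodes = {}
    for path in glob.glob("/sys/devices/system/node/node[0-9]*"):
        node_id = int(os.path.basename(path)[len("node"):])
        try:
            with open(os.path.join(path, "cpulist")) as f:
                nodes[node_id] = parse_cpulist(f.read())
        except OSError:
            continue
    return nodes


def pin_to_node(node_id):
    """Restrict the current process to one node's CPUs; return True on success."""
    cpus = numa_nodes().get(node_id)
    if not cpus or not hasattr(os, "sched_setaffinity"):
        return False
    try:
        os.sched_setaffinity(0, cpus)
    except OSError:
        return False
    return True


# A multi-socket machine that reports a single node may have NUMA disabled in the BIOS.
print(f"NUMA nodes visible to the OS: {len(numa_nodes())}")
```

Pinning helps with data placement because, under the default Linux policy, a page is normally allocated on the node of the thread that first touches it. Calling `pin_to_node(0)` before the application allocates and initializes its data therefore tends to keep that data in the DRAM of package 0.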