Intel® VTune™ Amplifier provides mechanisms to analyze applications that use Intel® HD Graphics for rendering, video processing, and computations, to interpret the collected performance data, and to get tuning advice. Consider following these steps to analyze your application with the VTune Amplifier:
Note
To analyze rendering activity, use Intel® Graphics Performance Analyzer.
Analyze GPU Usage
Configure the VTune Amplifier to analyze DirectX* pipeline events to explore GPU usage over time and understand whether your application, or some of its phases, is CPU or GPU bound. This is the least intrusive analysis available for applications running on Intel HD Graphics as well as on third-party GPUs supported by the VTune Amplifier.
To analyze how busy the GPU is, examine the GPU software queue, and correlate CPU and GPU activity on your system, choose an analysis type in the Analysis Type window and enable the Analyze DirectX pipeline events option.
VTune Amplifier collects data and provides the analysis result on the Timeline pane of the Graphics window.
In theory, your application is GPU bound if the Timeline pane shows that the GPU is busy most of the time with only small idle gaps between busy intervals, and the GPU software queue rarely drops to zero. If the gaps between busy intervals are large and the CPU is busy during these gaps, your application is CPU bound. Such clear-cut situations are rare, however, and you need a detailed analysis to understand all dependencies. For example, an application may be mistakenly considered GPU bound when GPU engine usage is serialized (for example, when the GPU engines responsible for video processing and for rendering are loaded in turns). In this case, the inefficient scheduling on the GPU results from the application code running on the CPU.
When the GPU is intensely busy over time, you can look deeper to understand what kind of work it is running (rendering or computations), whether it is used effectively, and whether there is room for improvement. This analysis is possible with the hardware metrics that the VTune Amplifier collects for the Render and GPGPU engine of Intel HD Graphics.
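The rule of thumb above can be made concrete with a small sketch. The Python snippet below is only an illustration and is not part of the VTune Amplifier: the interval and queue-depth inputs, the `classify_bound` helper, and all three thresholds are hypothetical values chosen to demonstrate the reasoning. It assumes busy intervals lie inside the analyzed window and do not overlap.

```python
def covered_time(intervals, start, end):
    """Total time inside [start, end) covered by (begin, finish) intervals.

    Assumes the intervals do not overlap each other.
    """
    return sum(max(0.0, min(f, end) - max(b, start)) for b, f in intervals)


def idle_gaps(intervals, start, end):
    """Return the gaps inside [start, end) not covered by any interval."""
    gaps, cursor = [], start
    for b, f in sorted(intervals):
        if b > cursor:
            gaps.append((cursor, b))
        cursor = max(cursor, f)
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


def classify_bound(gpu_busy, cpu_busy, queue_depths, start, end,
                   busy_threshold=0.9, empty_queue_threshold=0.1,
                   cpu_threshold=0.5):
    """Apply the heuristic; every threshold here is an assumption."""
    gpu_fraction = covered_time(gpu_busy, start, end) / (end - start)
    # Share of queue-depth samples where the GPU software queue was empty.
    queue_empty = queue_depths.count(0) / len(queue_depths)
    if gpu_fraction >= busy_threshold and queue_empty <= empty_queue_threshold:
        return "possibly GPU bound"

    gaps = idle_gaps(gpu_busy, start, end)
    gap_time = sum(f - b for b, f in gaps)
    cpu_in_gaps = sum(covered_time(cpu_busy, b, f) for b, f in gaps)
    if (gpu_fraction < busy_threshold and gap_time > 0
            and cpu_in_gaps / gap_time >= cpu_threshold):
        return "possibly CPU bound"
    return "inconclusive"
```

The results are worded as "possibly" on purpose: a pattern such as serialized engine usage can look GPU bound in this kind of check, so a detailed analysis is still needed.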
The example below shows an analysis result for a GPU bound application. From the Summary window, you see that GPU Time is a substantial fraction of Elapsed time:
The Timeline pane for the same result shows no gaps on the GPU Usage band:
This example also shows activity on the Render and GPGPU engine (yellow) as well as activity on the Video Codec engine (blue).
Note
The Analyze DirectX pipeline events option introduces the least overhead during collection, the Analyze Processor Graphics hardware events option adds medium overhead, and the Trace OpenCL kernels on Processor Graphics option adds the most overhead.
If you have already identified that your application, or some of its stages, is GPU bound, GPU hardware metrics provide the next level of detail for analyzing GPU activity and determining whether any performance improvements are possible. To collect GPU hardware metrics on the Render and GPGPU engine of Intel HD Graphics, enable the Analyze Processor Graphics hardware events option during analysis configuration.
Typically, start with the Overview group of GPU event metrics, which analyze the general activity of GPU execution units, the sampler, and general memory and cache accesses, and then move to the Global/local memory accesses group to analyze accesses to different types of GPU memory. Global/local memory accesses metrics are most effective when you analyze computing work on a GPU with the Analyze DirectX pipeline events option enabled, which allows you to correlate GPU hardware metrics with the exact GPU load.
VTune Amplifier collects data for the selected analysis type and displays the collected data in the default viewpoint. To analyze GPU performance data, open the Graphics window and first focus on the Timeline pane. The list of GPU metrics displayed in the Graphics window depends on the group of Processor Graphics hardware events selected during the analysis configuration.
The example below shows the Overview group metrics collected for the GPU bound application analyzed in the previous section:
The first metric to look at is the GPU Core Activity: EU Array Idle metric. Idle cycles are wasted cycles: no threads are scheduled and the computational resources of the execution units (EUs) are not being utilized. If EU Array Idle is zero, the GPU is reasonably loaded and all EUs have threads scheduled on them.
In most cases the optimization strategy is to minimize the EU Array Stalled metric and maximize the EU Array Active. The exception is memory bandwidth-bound algorithms and workloads where optimization should strive to achieve a memory bandwidth close to the peak for the specific platform (rather than maximize EU Array Active).
Memory accesses are the most frequent reason for stalls. The importance of memory layout and carefully designed memory accesses cannot be overstated. If the EU Array Stalled metric value is non-zero and correlates with the GPU L3 Misses, and if the algorithm is not memory bandwidth-bound, you should try to optimize memory accesses and layout.
Sampler accesses are expensive and can easily cause stalls. Sampler accesses are measured by the Sampler Is Bottleneck and Sampler Busy metrics.
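The way these metrics feed into an optimization decision can be summarized as a short checklist. The Python sketch below is hypothetical and is not a VTune Amplifier API: the function name, its parameters, and the hint strings are illustrative, and the metric values are assumed to be percentages read from the Graphics window by hand.

```python
def triage_eu_metrics(eu_idle, eu_stalled, stalls_track_l3_misses,
                      bandwidth_bound, sampler_is_bottleneck=0.0):
    """Turn Overview-group metric readings into a list of tuning hints.

    All inputs are supplied by the reader: percentages for the metrics,
    and two booleans that reflect the reader's own judgment.
    """
    hints = []
    if eu_idle > 0:
        # Idle cycles mean no threads were scheduled on some EUs.
        hints.append("EU Array Idle > 0: some cycles have no threads scheduled")
    if bandwidth_bound:
        # The exception case: aim for peak bandwidth instead of EU activity.
        hints.append("bandwidth-bound: target peak memory bandwidth")
    elif eu_stalled > 0 and stalls_track_l3_misses:
        hints.append("stalls follow GPU L3 Misses: optimize memory accesses and layout")
    if sampler_is_bottleneck > 0:
        hints.append("sampler is a bottleneck: review sampler accesses")
    return hints
```

For example, `triage_eu_metrics(10.0, 30.0, True, False)` yields the idle hint followed by the memory-access hint, while passing `bandwidth_bound=True` replaces the memory-access hint with the bandwidth one.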
Explore OpenCL™ Kernels Execution
If you know that your application uses OpenCL software technology and the GPU Compute Shader Activity metric in the Timeline pane confirms that your application is doing substantial computation work on the GPU, you may continue your analysis and capture the timing (and other information) of OpenCL kernels running on Intel HD Graphics. To run this analysis, enable the Trace OpenCL kernels on Processor Graphics option during configuration.
To view information about all OpenCL kernels running on the GPU, in the Graphics window switch Grouping to Computing Task Purpose / Computing Task (GPU) / Instance. VTune Amplifier identifies the following computing task purposes: Compute (kernels), Transfer (OpenCL routines responsible for transferring data from the host to a GPU), and Synchronization (for example, clEnqueueBarrierWithWaitList).
The corresponding columns show the overall time a kernel ran on the GPU, the average time for a single invocation (corresponding to one call of clEnqueueNDRangeKernel), work-group sizes, and averaged GPU hardware metrics collected for a kernel. Hover over a metric column header to read the metric description and view the formula used for the metric calculation. If a metric value for a computing task exceeds a threshold set up by Intel architects for the metric, this value is highlighted in pink, which signals a performance issue. Hover over such a value to read the issue description.
Analyze and optimize hot kernels with the longest Total Time values first. These include kernels with long average time values as well as kernels whose average time is not long but that are invoked more frequently than the others. Both groups deserve attention.
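To see why both groups matter, consider total time as average time multiplied by the number of invocations. The Python sketch below uses made-up kernel names and numbers (nothing here comes from a real VTune Amplifier result) to show that a short but frequently invoked kernel can outrank a long one.

```python
from collections import namedtuple

# Hypothetical per-kernel statistics, as they might be read from the grid.
KernelStat = namedtuple("KernelStat", "name average_time_ms instances")


def total_time_ms(stat):
    """Total time approximated as average time per invocation times count."""
    return stat.average_time_ms * stat.instances


def rank_by_total_time(stats):
    """Order kernels from the longest to the shortest total time."""
    return sorted(stats, key=total_time_ms, reverse=True)


stats = [
    KernelStat("blur", 8.0, 10),      # long average time, few invocations
    KernelStat("reduce", 0.5, 400),   # short average time, many invocations
    KernelStat("init", 2.0, 1),
]
ranked = [s.name for s in rank_by_total_time(stats)]
```

With these illustrative numbers, `reduce` totals 200 ms and ranks ahead of `blur` at 80 ms, even though a single `blur` invocation is sixteen times longer.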
To view details on OpenCL kernels submission and analyze the time spent in the queue, explore the Computing Queue data in the Timeline pane.
Analyze 3D Rendering Activity
If your application is GPU bound, but the correlated data shows that the GPU is busy while the GPU Compute Shader Activity metric values are zero over the same time range, your application was rendering objects rather than running computing tasks. In this case, use the Intel® GPA Platform Analyzer for further analysis.
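The correlation described above can be sketched as a simple filter over timeline samples. This Python snippet is a hypothetical illustration only: the sample tuples, the `rendering_timestamps` helper, and the 50% busy threshold are assumptions, not data or functionality exported by the VTune Amplifier.

```python
def rendering_timestamps(samples, busy_threshold=50.0):
    """Return timestamps where the GPU is busy but does no compute work.

    Each sample is (timestamp, gpu_busy_percent, compute_shader_activity_percent).
    The busy threshold is an arbitrary value picked for this example.
    """
    return [t for t, busy, compute in samples
            if busy >= busy_threshold and compute == 0]


samples = [
    (0.0, 95.0, 0.0),   # busy, no compute: likely rendering
    (1.0, 90.0, 40.0),  # busy with compute shader activity
    (2.0, 10.0, 0.0),   # mostly idle
    (3.0, 80.0, 0.0),   # busy, no compute: likely rendering
]
likely_rendering = rendering_timestamps(samples)
```

Time ranges that this kind of filter flags are the candidates to examine with a rendering-oriented tool.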