
Analyzing Applications Using Intel® HD Graphics


Intel® VTune™ Amplifier provides mechanisms to analyze applications that use Intel® HD Graphics for rendering, video processing, and computations, to interpret the collected performance data, and to get tuning advice. Consider following these steps to analyze your application with the VTune Amplifier:

  1. Analyze GPU usage.

  2. Analyze performance per GPU hardware metrics.

  3. Explore OpenCL™ kernels execution.

Note

  • VTune Amplifier XE provides GPU analysis on Windows* platforms only. You cannot run GPU data collection via a Remote Desktop connection. To run the GPU data collection, run the VTune Amplifier from the target computer's console or access the computer via VNC.

  • VTune Amplifier for Systems provides GPU analysis only on Android* systems running on processors with Intel HD Graphics.

  • To monitor general GPU busyness over time, run the VTune Amplifier as an Administrator.

Analyze GPU Usage

Run the CPU/GPU Concurrency analysis to explore GPU usage over time and understand whether your application, or some of its phases, is CPU bound or GPU bound. This is the least intrusive analysis available for applications running on Windows platforms with Intel HD Graphics, as well as with third-party GPUs supported by the VTune Amplifier.

VTune Amplifier collects data and provides the analysis result in the Platform View viewpoint, with the Platform window opened by default. This window provides basic metrics to analyze GPU usage per DMA packet on a software queue and to correlate this data with the CPU usage on the timeline.

In theory, if the Platform window shows that the GPU is busy most of the time, with only small idle gaps between busy intervals, and the GPU software queue rarely drops to zero, your application is GPU bound. If the gaps between busy intervals are large and the CPU is busy during these gaps, your application is CPU bound. Such clear-cut situations are rare, however, and you need a detailed analysis to understand all dependencies. For example, an application may be mistakenly considered GPU bound when the usage of GPU engines is serialized (for example, when the GPU engines responsible for video processing and for rendering are loaded in turns). In this case, the ineffective scheduling on the GPU results from the application code running on the CPU.
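The rule of thumb above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the busy intervals are hypothetical (start, end) pairs in seconds that you have read off the timeline yourself, the function names are made up for this sketch, and the 0.8 threshold is an arbitrary choice rather than a value used by the VTune Amplifier.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # hypothetical (start, end) pair in seconds


def idle_gaps(busy: List[Interval], elapsed: float) -> List[Interval]:
    """Return the idle gaps left between busy intervals within [0, elapsed]."""
    gaps, cursor = [], 0.0
    for start, end in sorted(busy):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < elapsed:
        gaps.append((cursor, elapsed))
    return gaps


def overlap(a: List[Interval], b: List[Interval]) -> float:
    """Total time during which an interval from a overlaps an interval from b.

    Assumes the intervals inside each list do not overlap one another.
    """
    return sum(
        max(0.0, min(a_end, b_end) - max(a_start, b_start))
        for a_start, a_end in a
        for b_start, b_end in b
    )


def classify(gpu_busy: List[Interval], cpu_busy: List[Interval],
             elapsed: float, threshold: float = 0.8) -> str:
    """Apply the rule of thumb; the threshold is an assumption of this sketch."""
    gaps = idle_gaps(gpu_busy, elapsed)
    gap_time = sum(end - start for start, end in gaps)
    if 1.0 - gap_time / elapsed >= threshold:
        return "likely GPU bound"
    # Share of the GPU idle time during which the CPU was busy.
    cpu_in_gaps = overlap(gaps, cpu_busy) / gap_time if gap_time else 0.0
    if cpu_in_gaps >= threshold:
        return "likely CPU bound"
    return "inconclusive"
```

A verdict from such a check is only a starting point. It does not look at the software queue depth, and it cannot detect the serialized engine usage described above, where the GPU appears busy although the scheduling is driven by code running on the CPU.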

When the GPU is intensely busy over time, you may switch to the Graphics window and look deeper to understand what kind of work it is running (rendering or computations) per thread, whether it is used effectively, and whether there is room for improvement. Such an analysis is possible with the hardware metrics that the VTune Amplifier collects for the Render and GPGPU engine of Intel HD Graphics and presents in the Graphics window > Timeline pane.

The example below shows an analysis result for a GPU bound application. From the Summary window, you see that GPU Time is a substantial fraction of Elapsed time:

In the Graphics window, the Timeline pane for the same result shows no gaps on the GPU Usage band:

This example also shows activity on the Render and GPGPU engine (yellow) as well as on the Video Codec engine (blue).

Note

You may configure any Algorithm analysis type to collect GPU usage data. To do this, select the Analyze GPU usage option in the analysis configuration. This option introduces the least overhead during the collection, the Analyze Processor Graphics hardware events option adds medium overhead, and the Trace OpenCL kernels on Processor Graphics option adds the highest overhead.

Analyze GPU Hardware Metrics

If you have already identified that your application, or some of its stages, is GPU bound, GPU hardware metrics give you the next level of detail to analyze GPU activity and determine whether any performance improvements are possible.

VTune Amplifier can collect two groups of GPU event metrics on the Render and GPGPU engine of Intel HD Graphics: Overview and Global/local memory accesses. You may enable their collection for any Algorithm analysis type by selecting the Analyze Processor Graphics hardware events option during analysis configuration and specifying the required group. Typically, it is recommended to start with the Overview group, which analyzes the general activity of GPU execution units, the sampler, and general memory and cache accesses, and then move to the Global/local memory accesses group to analyze accesses to different types of GPU memory. Global/local memory accesses metrics are most effective when you analyze computing work on a GPU with the Analyze GPU usage option enabled, which allows you to correlate GPU hardware metrics with the exact GPU load. CPU/GPU Concurrency analysis automatically collects the Overview metrics.

VTune Amplifier collects data for the selected analysis type and displays the collected data in the default viewpoint. To analyze GPU performance data, open the Graphics window and focus on the Timeline pane. The list of GPU metrics displayed in the Graphics window depends on the group of Processor Graphics hardware events selected during the analysis configuration.

The example below shows the Overview group metrics collected for the GPU bound application analyzed in the previous section:

The first metric to look at is the GPU Execution Units: EU Array Idle metric. Idle cycles are wasted cycles: no threads are scheduled, and the computational resources of the EUs are not being utilized. If EU Array Idle is zero, the GPU is reasonably loaded and all EUs have threads scheduled on them.

In most cases the optimization strategy is to minimize the EU Array Stalled metric and maximize the EU Array Active. The exception is memory bandwidth-bound algorithms and workloads where optimization should strive to achieve a memory bandwidth close to the peak for the specific platform (rather than maximize EU Array Active).

Memory accesses are the most frequent reason for stalls. The importance of memory layout and carefully designed memory accesses cannot be overestimated. If the EU Array Stalled metric value is non-zero and correlates with the GPU L3 Misses, and if the algorithm is not memory bandwidth-bound, you should try to optimize memory accesses and layout.
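The check described above (a non-zero EU Array Stalled value that correlates with GPU L3 Misses, in an algorithm that is not memory bandwidth-bound) can be sketched as follows. This is a minimal illustration, not something the VTune Amplifier provides: the sample lists are hypothetical per-interval values that you would have copied from the Timeline pane, the function names are invented for this sketch, and the 0.7 correlation cutoff is an arbitrary assumption.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equally long series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def suggest_next_step(stalled: Sequence[float], l3_misses: Sequence[float],
                      bandwidth_bound: bool, min_corr: float = 0.7) -> str:
    """Turn the guidance into a hint; min_corr is an assumption of this sketch."""
    if bandwidth_bound:
        # Exception from the text: aim for bandwidth, not for EU Array Active.
        return "aim for peak memory bandwidth"
    if any(s > 0 for s in stalled) and pearson(stalled, l3_misses) >= min_corr:
        return "optimize memory accesses and layout"
    return "check other stall sources"
```

For example, `suggest_next_step([10, 20, 30, 40], [100, 210, 290, 405], bandwidth_bound=False)` points to memory accesses and layout, because the stall samples rise together with the L3 miss samples.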

Sampler accesses are expensive and can easily cause stalls. Sampler accesses are measured by the Sampler Is Bottleneck and Sampler Busy metrics.

Explore OpenCL™ Kernels Execution (VTune Amplifier XE only)

If you know that your application uses OpenCL software technology and the GPU Computing Threads Dispatch metric in the Timeline pane of the Graphics window confirms that your application is doing substantial computation work on the GPU, you may continue your analysis and capture the timing (and other information) of OpenCL kernels running on Intel HD Graphics. To run this analysis, enable the Trace OpenCL kernels on Processor Graphics option during Algorithm analysis configuration. CPU/GPU Concurrency analysis enables this option by default.

To view information about all OpenCL kernels running on the GPU, in the Graphics window switch Grouping to Computing Task Purpose / Computing Task (GPU) / Instance. VTune Amplifier identifies the following computing task purposes: Compute (kernels), Transfer (OpenCL routines responsible for transferring data from the host to a GPU), and Synchronization (for example, clEnqueueBarrierWithWaitList).

The corresponding columns show the overall time a kernel ran on the GPU, the average time for a single invocation (corresponding to one call of clEnqueueNDRangeKernel), the work-group sizes, and the averaged GPU hardware metrics collected for a kernel. Hover over a metric column header to read the metric description and view the formula used for the metric calculation. If a metric value for a computing task exceeds a threshold set up by Intel architects for the metric, this value is highlighted in pink, which signals a performance issue. Hover over such a value to read the issue description.

Analyze and optimize hot kernels with the longest Total Time values first. These include kernels with long average time values as well as kernels whose average time is short but that are invoked more frequently than the others. Both groups deserve attention.
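To see why both groups matter, consider a small sketch that ranks kernels by total time, computed here as average time multiplied by the number of invocations. The kernel names and numbers are invented for illustration, and the `KernelStats` type is hypothetical; it is not part of the VTune Amplifier or of OpenCL.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KernelStats:
    """Hypothetical per-kernel figures, as you might copy them from the grid."""
    name: str
    average_ms: float   # average time of a single invocation
    invocations: int    # number of times the kernel was enqueued

    @property
    def total_ms(self) -> float:
        return self.average_ms * self.invocations


def hottest_first(kernels: List[KernelStats]) -> List[KernelStats]:
    """Sort kernels by total time, longest first."""
    return sorted(kernels, key=lambda k: k.total_ms, reverse=True)


# Made-up sample data: one slow kernel, one short but frequent kernel,
# and one kernel that runs once.
kernels = [
    KernelStats("blur", average_ms=4.0, invocations=10),
    KernelStats("reduce", average_ms=0.05, invocations=2000),
    KernelStats("init", average_ms=1.0, invocations=1),
]

for k in hottest_first(kernels):
    print(f"{k.name}: {k.total_ms:.1f} ms total")
```

In this made-up data set the short `reduce` kernel ends up ahead of the much slower `blur` kernel, because it is invoked far more often.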

To view details on OpenCL kernels submission and analyze the time spent in the queue, explore the Computing Queue data in the Timeline pane of the Graphics or Platform window.
