Working with Multiple Accelerators in C++ AMP
Combining task parallelism with data parallelism is often referred to as braided parallelism. This pattern has obvious applications when it comes to programming today's heterogeneous computers. For maximum performance, your application should make use of all the available processors, both CPU and GPU.
So far the examples in this chapter have used the CPU just to orchestrate work being executed on C++ AMP-enabled accelerators. Braided parallelism can be taken further because it allows you to leverage the power of both the CPU’s cores and any available GPUs. If some parts of your application lend themselves to massive data parallelism on the GPU but others are more suitable to execution on the CPU, then it’s possible to combine the PPL and C++ AMP to take advantage of both.
When deciding which parts are best placed on the GPU and which should remain on the CPU, you should think carefully about your application’s overall workflow. Even some data-parallel algorithms might be better suited to executing on the CPU. For example, if the algorithm doesn’t use enough data to keep the majority of the GPU’s threads occupied or can’t meet the restrictions required of code running in a C++ AMP kernel, it’s a poor fit for C++ AMP. You should also consider reorganizing your workflow to minimize both the number of data transfers between the GPU and CPU and the volume of data transferred.
The Cartoonizer case study in Chapter 10 illustrates using braided parallelism to process images and video using a task-parallel pipeline on the CPU combined with data-parallel image processing on the GPU. The pipeline on the CPU loads, reformats, and resizes images or video frames. The GPU(s) are used to cartoonize the images before the CPU finally displays the result. Here, PPL tasks running on the CPU execute part of the processing and orchestrate C++ AMP accelerators.
When designing a braided application, it might be tempting to simply measure and profile your application and then rewrite the data-parallelizable hotspots as C++ AMP kernels so that they execute on the GPU. Although this will certainly make some parts of your application run faster, Amdahl's law will eventually limit overall performance. Taking a more holistic view during (re)design will probably uncover more exploitable opportunities for parallelism and, consequently, better application performance.
The PPL, the Standard Library, and C++ AMP all provide support for creating parallel workflows using asynchronous methods. This allows you to create applications that execute work on both the CPU and GPU(s) concurrently, maximizing your application’s performance.
Although a full discussion of asynchronous programming and the Futures and Task Graph patterns is outside the scope of this book, good introductions can be found on MSDN in "Parallel Programming with Microsoft Visual C++, 5: Futures" at http://msdn.microsoft.com/en-us/library/gg663533 and on the Berkeley Patterns Wiki at http://parlab.eecs.berkeley.edu/wiki/patterns/patterns.
Many of the tradeoffs and guidelines for designing braided parallel applications are the same as those for designing all parallel applications. There is a significant overhead associated with moving data to and from a discrete GPU, and the design should seek to minimize this. In some cases, this might mean reordering your application workflow to reduce the number of data copies. In others, it might mean implementing some parts of your workflow in C++ AMP even though they are more suited to a task-parallel implementation on the CPU.
The design should also account for the very different performance characteristics of GPUs for different workloads. They perform data-parallel work very efficiently but perform poorly when the workload can’t be (re)written in a data-parallel way. Some types of computation are hard to implement in a data-parallel manner—for example, code that makes heavy use of branching. These parts of your application might be better executed on the CPU.
The Cartoonizer case study in Chapter 10 covers a complete application implemented with braided parallelism.