C++ AMP provides a flexible model for selecting the right accelerator for your application. When no GPUs are available, your application can fall back on the CPU using the WARP accelerator (on Windows 8) to execute your data-parallel code. If your algorithm can be expressed more efficiently in a task-parallel way on the CPU, then your application can also provide an alternative implementation using the PPL. You will also need to implement a CPU version of your algorithm if you intend to support target machines running Windows 7 that do not have a C++ AMP-capable GPU.
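The fallback order described above can be sketched as a small selection policy. This is a minimal illustration in portable C++, not code from the case studies: `DeviceInfo`, `Backend`, and `choose_backend` are hypothetical names, and the struct only mirrors the `device_path` and `is_emulated` properties that `concurrency::accelerator` exposes. In a real application the list would presumably be built from `accelerator::get_all()`, and the path string used for WARP here is assumed to match `accelerator::direct3d_warp`.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the properties a C++ AMP accelerator reports.
struct DeviceInfo {
    std::wstring device_path;  // assumed to mirror accelerator::device_path
    bool is_emulated;          // assumed to mirror accelerator::is_emulated
};

// The three execution strategies discussed in the text.
enum class Backend { Gpu, Warp, CpuTasks };

// Prefer real GPU hardware, then WARP, then a task-parallel CPU version.
Backend choose_backend(const std::vector<DeviceInfo>& devices) {
    // Any non-emulated device is treated as a C++ AMP-capable GPU.
    for (const DeviceInfo& d : devices)
        if (!d.is_emulated) return Backend::Gpu;

    // No hardware GPU: fall back to WARP if it is present (Windows 8).
    for (const DeviceInfo& d : devices)
        if (d.device_path == L"direct3d\\warp") return Backend::Warp;

    // Neither is available (for example, Windows 7 without a capable GPU):
    // run the separate CPU implementation, for instance one built on the PPL.
    return Backend::CpuTasks;
}
```

Keeping the policy separate from the device query makes the decision easy to test on a machine that has no GPU at all.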
C++ AMP and the PPL can also be combined to leverage the power of multiple GPUs and multiple CPU cores. Running on more than one GPU can deliver significant performance gains, provided the algorithm can be split efficiently between the GPUs and the overhead of synchronization and data copying is kept to a minimum. Braided parallelism provides further opportunities to maximize application performance by combining data parallelism on the GPU with task parallelism on the CPU.
The NBody case study from Chapter 2 shows how to use C++ AMP with multiple GPUs. The NBodyAmpMultiTiled class, defined in NBodyAmpMultiTiled.h, implements n-body on more than one GPU accelerator by dividing the particle update calculation among the available GPUs. At the end of each time step, the new particle positions and velocities are copied back to the CPU, and the updated data for all particles is then sent back to the GPUs. The Cartoonizer case study, presented next in Chapter 10, discusses using C++ AMP on more than one GPU in more detail. In that case, the GPUs share only image halo data after each stage of the calculation.
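The split, update, and gather cycle described for n-body can be sketched in portable C++. This is an illustration under stated assumptions rather than the NBodyAmpMultiTiled implementation: `Particle`, `split_range`, and `step` are hypothetical names, the simulation is one-dimensional with a simple softened attraction, and an ordinary loop stands in for dispatching one kernel per accelerator.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Particle { float pos; float vel; };  // 1D for brevity

using Range = std::pair<std::size_t, std::size_t>;  // half-open [first, second)

// Split [0, n) into `parts` contiguous ranges whose sizes differ by at most one.
std::vector<Range> split_range(std::size_t n, std::size_t parts) {
    assert(parts > 0);
    std::vector<Range> ranges;
    const std::size_t base = n / parts, extra = n % parts;
    std::size_t begin = 0;
    for (std::size_t i = 0; i < parts; ++i) {
        const std::size_t len = base + (i < extra ? 1 : 0);
        ranges.emplace_back(begin, begin + len);
        begin += len;
    }
    return ranges;
}

// One time step: every "device" sees all particles but updates only its range.
void step(std::vector<Particle>& host, std::size_t device_count, float dt) {
    const std::vector<Range> ranges = split_range(host.size(), device_count);

    // Send the full particle set to each device (modeled as one copy per device).
    const std::vector<std::vector<Particle>> device_data(device_count, host);

    std::vector<Particle> next(host.size());
    for (std::size_t d = 0; d < device_count; ++d) {
        // On real hardware this loop body would run on accelerator d.
        const std::vector<Particle>& old = device_data[d];
        for (std::size_t i = ranges[d].first; i < ranges[d].second; ++i) {
            float acc = 0.0f;
            for (std::size_t j = 0; j < old.size(); ++j) {
                const float r = old[j].pos - old[i].pos;
                const float d2 = r * r + 1e-3f;   // softening avoids division by zero
                acc += r / (d2 * std::sqrt(d2));
            }
            next[i].vel = old[i].vel + acc * dt;
            next[i].pos = old[i].pos + next[i].vel * dt;
        }
    }

    // Gather: the per-device results become the new host state for the next step.
    host = next;
}
```

Because each particle's update reads only the previous step's data, the result does not depend on how many devices share the work. The sketch also makes the cost visible: every step copies the full particle set to each device, which is the synchronization and copying overhead the text warns about.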