
Working with Multiple Accelerators in C++ AMP

This chapter from C++ AMP shows how to choose among different C++ AMP accelerators and select the best ones for your code. It also covers running C++ AMP on more than one accelerator and using Parallel Patterns Library (PPL) code running on the CPU to orchestrate the GPU accelerators or execute work more suited to the CPU.

So far, the examples in the book have covered using C++ AMP with a single accelerator: a single physical GPU, a WARP accelerator, or the reference (REF) accelerator. Each of these accelerator types is described in this chapter. Although using a single accelerator is probably the most common scenario today, the computer running your application might soon have more than one accelerator. This could be a combination of one or more discrete GPUs, a GPU integrated with the CPU, or both. If your application wants to use all of the available compute power, it needs to efficiently orchestrate executing parts of the work on each accelerator and combining the results to give the final answer. This chapter shows how to choose among different C++ AMP accelerators and select the best ones for your code. It also covers running C++ AMP on more than one accelerator and using Parallel Patterns Library (PPL) code running on the CPU to orchestrate the GPU accelerators or execute work more suited to the CPU. These strategies will maximize your application’s performance.

Choosing Accelerators

C++ AMP allows you to enumerate the available accelerators and choose the ones on which your application will run. Your application can also filter the accelerators based on their properties and select a default accelerator.

Enumerating Accelerators

The following code uses accelerator::get_all() to enumerate all the available accelerators and print their device paths and descriptions to the console:

std::vector<accelerator> accls = accelerator::get_all();

std::wcout << "Found " << accls.size() << " C++ AMP accelerator(s):" << std::endl;
std::for_each(accls.cbegin(), accls.cend(), [](const accelerator& a)
{
    std::wcout << "  " << a.device_path << std::endl
        << "    " << a.description << std::endl << std::endl;
});

The description property provides a user-friendly name for the accelerator, but the device_path property provides a persistent unique identifier that is more useful for programmatically selecting accelerators. The device path is also persistent across processes and Microsoft Windows–based sessions, provided the hardware isn’t changed or the system reinstalled. For example, your application can use the device_path to refer to an accelerator selected by the user in a previous application session.
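As a sketch of that pattern, the code below looks up an accelerator by a device path persisted from an earlier session. It uses a mock record so it runs without the AMP runtime; MockAccelerator and PickSaved are illustrative names, not part of the C++ AMP API, and the fallback-to-first policy is just one reasonable choice:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Mock stand-in for concurrency::accelerator, just to show the lookup pattern.
struct MockAccelerator
{
    std::wstring device_path;   // persistent unique identifier
    std::wstring description;   // user-friendly name
};

// Find the accelerator whose device_path matches one saved in a previous
// session; fall back to the first available accelerator if it is gone.
// Assumes accls is non-empty.
MockAccelerator PickSaved(const std::vector<MockAccelerator>& accls,
                          const std::wstring& savedPath)
{
    auto it = std::find_if(accls.begin(), accls.end(),
        [&](const MockAccelerator& a) { return a.device_path == savedPath; });
    return (it != accls.end()) ? *it : accls.front();
}
```

With the real accelerator class you can pass a saved device path directly to the accelerator constructor; checking for the path first, as above, avoids constructing from a path that no longer exists on the machine.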

You can run this example by loading the Chapter9\Chapter9.sln solution. Build the sample in Release configuration and run it using Ctrl+F5 to start it without the debugger attached. Here’s some example output from this code:

Using device : NVIDIA GeForce GTX 570

Enumerating accelerators

Found 5 C++ AMP accelerator(s):
    NVIDIA GeForce GTX 570

    NVIDIA GeForce GTX 570

    Microsoft Basic Render Driver

    Software Adapter

    CPU accelerator

Found 2 C++ AMP hardware accelerator(s):
Has WARP accelerator: true

Looking for accelerator with display and 1MB of dedicated memory...
  Suitable accelerator found.

Setting default accelerator to one with display and 1MB of dedicated memory...
  Default accelerator is now: NVIDIA GeForce GTX 570

The list shows all the available C++ AMP accelerators. In this example, it shows the following accelerators:

  • Two GPUs, each with unique device paths and a description containing the GPU’s friendly name.
  • The WARP accelerator with description “Microsoft Basic Render Driver.”
  • The reference, or REF accelerator, also referred to as the “Software Adapter.”
  • The CPU accelerator.

Your application can select a device using one of the following device path names that are predefined as static properties on the C++ AMP accelerator class:

  • accelerator::direct3d_ref The REF accelerator, also called the Reference Rasterizer or “Software Adapter” accelerator. It emulates a generic graphics card in software on the CPU to provide Direct3D functionality. It is used for debugging and is also the default accelerator if no other accelerators are available. As the name suggests, the REF accelerator serves as the reference implementation: you can compare its results against a hardware accelerator’s if you suspect a bug in your hardware vendor’s driver. Typically, your application will not want to use the REF accelerator because it is much slower than hardware-based accelerators and will be slower than just running a C++ implementation of your algorithm on the CPU.
  • accelerator::cpu_accelerator The CPU accelerator can be used only for creating arrays that are accessible to the CPU and used for data staging. Your application can’t use this for executing C++ AMP code in the first release of C++ AMP. Further details on using the CPU accelerator to create staging arrays and host arrays are covered in Chapter 7, “Optimization.”
  • accelerator::direct3d_warp The WARP accelerator, or Microsoft Basic Render Driver, allows the C++ AMP run time to run on the CPU. The WARP accelerator uses the WARP software rasterizer, which is part of the Direct3D 11 run time, and takes advantage of multicore and data-parallel Single Instruction Multiple Data (SIMD) instructions to execute data-parallel code very efficiently on the CPU. Your application can use WARP as a fallback when no physical GPU is present. The WARP accelerator supports only single-precision math, so it can’t be used as a fallback for kernels that require full or limited double-precision support. An overview of WARP can be found in “Windows Advanced Rasterization Platform (WARP) Guide” on MSDN: http://msdn.microsoft.com/en-us/library/gg615082.aspx.
  • accelerator::default_accelerator The current default accelerator. See the next section for more information on the default accelerator.

Note that although the WARP accelerator runs directly on the CPU, it is also considered to be an emulated accelerator. The accelerator::is_emulated property is true for both the REF and WARP accelerators.

You can filter out accelerators by examining each accelerator’s properties, as shown in the following code:

std::vector<accelerator> accls = accelerator::get_all();
accls.erase(std::remove_if(accls.begin(), accls.end(), [](accelerator& a)
    {
        return a.is_emulated;
    }), accls.end());
std::wcout << "Found " << accls.size() << " C++ AMP hardware accelerator(s):" << std::endl;

Now accls contains only the GPU accelerators available. Similarly, accelerator device paths can be used to test for the presence of a particular type of accelerator. For example, your application might check for the presence of a WARP accelerator and give the user an option to fall back on this if no C++ AMP-capable GPUs are present.

std::vector<accelerator> accls = accelerator::get_all();
bool hasWarp = std::find_if(accls.begin(), accls.end(), [=](accelerator& a)
    {
        return a.device_path.compare(accelerator::direct3d_warp) == 0;
    }) != accls.end();
std::wcout << "Has WARP accelerator: " << (hasWarp ? "true" : "false") << std::endl;

The accelerator class also provides properties to query various attributes of an accelerator: the amount of dedicated memory, whether a display is attached, double-precision support, version number, and whether a debug layer is enabled. For example, the following code searches for a GPU accelerator with at least 2 MB of dedicated memory (the dedicated_memory property is reported in kilobytes), limited double-precision support, and a connected display:

std::vector<accelerator> accls = accelerator::get_all();
bool found = std::find_if(accls.begin(), accls.end(), [=](accelerator& a)
    {
        return !a.is_emulated && a.dedicated_memory >= 2048 &&
            a.supports_limited_double_precision && a.has_display;
    }) != accls.end();
std::wcout << "Suitable accelerator " << (found ? "found." : "not found.") << std::endl;
std::wcout << "Suitable accelerator " << (found ? "found." : "not found.") << std::endl;

See Chapter 12, “Tips, Tricks, and Best Practices,” for further discussion of double, limited-double, and single-precision support. See the “accelerator Class” topic on MSDN for further details about the properties and methods on accelerator for filtering: http://msdn.microsoft.com/en-us/library/hh350895.
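The precision-based fallback decision can be sketched in ordinary C++ over a mock record, without the AMP runtime. MockAccelerator, KernelPrecision, and ChooseForKernel are illustrative names, not part of the C++ AMP API; the point is that an emulated (WARP) fallback is only valid for kernels with no double-precision needs:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mock of the precision-related accelerator properties.
struct MockAccelerator
{
    bool is_emulated;                        // true for WARP and REF
    bool supports_double_precision;          // full double support
    bool supports_limited_double_precision;  // subset of double operations
};

enum KernelPrecision { SinglePrecision, LimitedDouble, FullDouble };

// Return the index of the first accelerator able to run a kernel with the
// given precision needs, or accls.size() if none qualifies.  The explicit
// is_emulated check mirrors the text above: WARP supports only
// single-precision math, so it never qualifies for double-precision work.
std::size_t ChooseForKernel(const std::vector<MockAccelerator>& accls,
                            KernelPrecision need)
{
    for (std::size_t i = 0; i < accls.size(); ++i)
    {
        const MockAccelerator& a = accls[i];
        if (need == FullDouble && !a.supports_double_precision) continue;
        if (need == LimitedDouble && !a.supports_limited_double_precision) continue;
        if (need != SinglePrecision && a.is_emulated) continue;
        return i;
    }
    return accls.size();
}
```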

The Default Accelerator

The C++ AMP run time selects the default accelerator according to the following rules. If the application is being debugged under the GPU debugger, then the default accelerator is specified by the project properties setting (see Chapter 6, “Debugging”). When the application is not launched in debug mode, the CPPAMP_DEFAULT_ACCELERATOR environment variable, if defined, is used to determine the default accelerator. Otherwise, the default will be set to the nonemulated accelerator with the largest amount of dedicated memory. When more than one such accelerator has the same amount of dedicated memory, the first accelerator without a display is chosen. This is an implementation detail and might change in subsequent releases. Regardless of the implementation specifics, the C++ AMP run time will always try to pick the best accelerator as the default.
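The selection heuristic described above can be sketched in plain C++ over a mock record. MockAccelerator and PickDefault are illustrative names; the real run time applies this logic internally, and, as noted, the tie-breaking detail might change in later releases:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mock record with just the properties the heuristic examines.
struct MockAccelerator
{
    bool is_emulated;
    std::size_t dedicated_memory;  // in kilobytes, as the real property reports
    bool has_display;
};

// Mirror of the documented rule: pick the non-emulated accelerator with the
// most dedicated memory; on a tie, prefer the first one without a display.
// Returns accls.size() if no hardware accelerator is present.
std::size_t PickDefault(const std::vector<MockAccelerator>& accls)
{
    std::size_t best = accls.size();
    for (std::size_t i = 0; i < accls.size(); ++i)
    {
        const MockAccelerator& a = accls[i];
        if (a.is_emulated) continue;
        if (best == accls.size()
            || a.dedicated_memory > accls[best].dedicated_memory
            || (a.dedicated_memory == accls[best].dedicated_memory
                && accls[best].has_display && !a.has_display))
            best = i;
    }
    return best;
}
```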

The C++ AMP run time sets the default accelerator when your code asks for it, either with an explicit call or by creating an array. The default accelerator is also set by a call to parallel_for_each that does not either explicitly specify an accelerator_view or capture an array or texture that would implicitly specify one. Before that point in your code, you can set the default accelerator yourself, using the accelerator::set_default() method. Calls to set_default() after the run time has already set a default will return false, indicating that the call failed to change the default accelerator. The following example sets the default accelerator to a GPU with 1 MB of memory and a connected display:

std::vector<accelerator> accls = accelerator::get_all();
std::vector<accelerator>::iterator usefulAccls = std::find_if(accls.begin(), accls.end(),
    [=](accelerator& a)
    {
        return !a.is_emulated && (a.dedicated_memory >= 1024) && a.has_display;
    });
if (usefulAccls != accls.end())
{
    accelerator::set_default(usefulAccls->device_path);
    std::wcout << "  Default accelerator is now "
        << accelerator(accelerator::default_accelerator).description << std::endl;
}
else
    std::wcout << "  No suitable accelerator available" << std::endl;

As discussed in Chapter 3, “C++ AMP Fundamentals,” all C++ AMP kernels run on an accelerator_view. An accelerator_view represents a logical, isolated view on a particular accelerator. If no accelerator_view is specified, an accelerator_view on the default accelerator is used. You can specify which accelerator to use by passing an accelerator_view associated with a particular accelerator to the invocation of parallel_for_each or by capturing an array or texture stored on the desired accelerator. In this example, accls is a std::vector containing two or more accelerator instances. The default accelerator is set to accls[0], but the array, dataOn1, is initialized with an additional accelerator_view parameter associating it with accls[1].default_view. The following kernel runs on accls[1] even though accls[0] is the default accelerator because the parallel_for_each captures dataOn1, an array associated with the default accelerator_view of accls[1]:

accelerator::set_default(accls[0].device_path);    // Accelerator 0 is now the default
array<int> dataOn1(10000, accls[1].default_view);

parallel_for_each(dataOn1.extent, [&dataOn1](index<1> idx) restrict(amp)
{
    dataOn1[idx] = // ...
});

If your kernel uses array_view rather than array, the accelerator_view must be passed as an additional parameter to the parallel_for_each. Again, the following kernel executes on accls[1]:

std::vector<int> dataOnCpu(10000, 1);
array_view<int, 1> dataView(10000, dataOnCpu);

parallel_for_each(accls[1].default_view,
    dataView.extent, [dataView](index<1> idx) restrict(amp)
{
    dataView[idx] = // ...
});

Attempting to execute a kernel on one accelerator that contains references to an array stored on a different accelerator will result in a concurrency::runtime_exception. If the kernel references an array_view that wraps data stored on a different accelerator, the data will be implicitly copied onto the accelerator specified by the parallel_for_each invocation.