Overview and C++ AMP Approach

  • 9/15/2012

The C++ AMP Approach

C++ AMP is a library and a small language extension that enables heterogeneous computing within a single C++ application. (AMP stands for Accelerated Massive Parallelism.) Visual Studio has new tools and capabilities to support debugging and profiling C++ AMP applications, including GPU debugging and GPU concurrency visualization. With C++ AMP, mainstream C++ developers can use familiar tools to create portable, future-proof applications that achieve dramatic acceleration on data-parallel-friendly workloads.

C++ AMP Brings GPGPU (and More) into the Mainstream

One mission of C++ AMP is to bring GPGPU programming to every developer whose applications can benefit from it. The video cards required to support it are now almost ubiquitous. The overarching mission, however, is larger than just GPGPU: C++ AMP is a way to harness heterogeneous computing platforms, such as GPUs and CPU vector units, and make them accessible to millions of mainstream developers in ways that are not otherwise possible. Although the shift to data-parallel programming—and especially to portable implementations expressed in C++—is an enormous undertaking, it is not the first such transformation that has happened to the software development experience.

Many of the techniques or technologies that change our industry and our world start out in research or academia and are used by only a tiny number of developers who use very specialized tools and are able to do very difficult things. To change the industry and the world, those techniques have to reach the masses and become mainstream. This process has happened with other technologies, such as graphical user interfaces (GUIs). At first only a few developers had the specialized skills required to work with controls, react to mouse events, and so on. As libraries, frameworks, and tools were developed and released, more and more developers were able to produce GUI applications, and they are now considered the norm. Some libraries, frameworks, and tools are more popular than others, and all contribute to the ecosystem that supports GUI development.

A similar process happened with object-oriented development. At first a few researchers were advocating a new way of designing and building software while the mainstream continued to develop procedural applications. As frameworks and tools have been developed and released, adoption has increased to a point where object-oriented development is considered the norm and used, to varying degrees, by essentially all developers in the majority of mainstream languages.

Such a change might be happening with touch and with natural user interfaces, and it is definitely happening with the concurrency revolution. The first phase was CPU concurrency. The second phase is heterogeneous concurrency. Bringing that ease and normality to heterogeneous computing will require tools, libraries, and frameworks. C++ AMP and Visual Studio are just what mainstream developers need to harness the power of the GPU and beyond.

An interesting possibility is that mainstream developers might find themselves benefiting from C++ AMP without directly using it. If library developers adopt C++ AMP, code that uses those libraries will gain the speedup without having to understand how it was done. The opportunity to create domain-specific libraries could be significant.

C++ AMP Is C++, Not C

There are a number of other approaches to GPGPU development and all of them involve C-like languages. Although C is a powerful and high-performance language, C++ is clearly the number one choice for performance-conscious developers who’d like to work in a modern programming language. C++ provides abstraction and type-safe genericity that enable developers to tackle larger problems and use more powerful libraries and constructs, and these features are available when using C++ AMP, too. You can use templates, overloading, and exceptions in the same way as you do in other parts of your applications.

Because C++ AMP is C++, not C and not a C-like language, the extra types you need for concurrent development are not extensions or additions to the language; they are template types. This gives you type-safe genericity—you can distinguish between an array of floats and an array of ints—while reducing your learning curve. Adding abstractions and useful types to C is one of the very problems C++ was designed to solve.

In the past, standard C++ (say, C++11) has supported only CPU programming. The C++ Parallel Patterns Library, PPL, offers a set of types and algorithms in the style of the Standard Library that support multicore development in C++. This lets C++ developers take advantage of new hardware by using the language and tools they are already using. C++ AMP brings that same comfort and convenience to heterogeneous computing.

C++ AMP Leverages Tools You Know

C++ AMP is fully supported by Visual Studio 2012 and will be usable on Windows machines right away. That alone will open the doors to all the developers who use C++ in Visual Studio. These developers will not need to learn a new tool or a new language to start using the power of the GPU. They will have to learn to think in a data-parallel way and to evaluate the costs, calculated in execution time or watts consumed, of their decisions about algorithms and data structures. Using familiar tools will make the overall skills gap one that can be bridged. Visual Studio provides IntelliSense, GPU debugging, profiling, and other features that enable developers to do far more than just write and compile code.

Visual Studio is popular even with developers who aren’t targeting Windows. What’s more, C++ AMP development is not necessarily restricted to Windows or to Visual Studio users; it has been released as an open specification, and work is underway for other vendors to add C++ AMP to their toolsets. For example, AMD will put it into their FSA reference compiler for Windows and non-Windows platforms.

C++ AMP Is Almost All Library

The key to writing in the language you know is to keep it as the language you know. C++ AMP is an extension to C++ and does include a couple of keywords that are not in C++11. However, it adds just two keywords (restrict and tile_static), not a large collection of language changes. Further, the main new keyword, restrict, is already a keyword in C99, so it is unlikely to collide with identifiers in existing codebases. Everything else that makes C++ AMP work involves a library of types and functions. Developers who are comfortable with the Standard Library or with PPL will immediately be comfortable with C++ AMP.

Here’s a simple example. Consider this traditional code for adding two vectors. None of this is parallel:

void AddArrays(int n, const int* const pA, const int* const pB, int* const pC)
{
    for (int i = 0; i < n; ++i)
    {
        pC[i] = pA[i] + pB[i];
    }
}

The preceding code is both easy to read and easy to understand. The following code shows the types of changes that make this operation massively parallel and leverage the GPU:

#include <amp.h>
using namespace concurrency;

void AddArrays(int n, const int* const pA, const int* const pB, int* const pC)
{
    array_view<const int, 1> a(n, pA);
    array_view<const int, 1> b(n, pB);
    array_view<int, 1> c(n, pC);

    parallel_for_each(c.extent, [=](index<1> idx) restrict(amp)
        {
            c[idx] = a[idx] + b[idx];
        });
}

As you can see, the code wasn’t really changed much. The changes include the following:

  1. Including amp.h to use the library
  2. Because the types and functions are in the concurrency namespace, adding a using directive to reduce your typing
  3. Using array views to manage copying the data to or from the accelerator
  4. Changing the for loop, a language construct, to parallel_for_each, a library function, and passing a lambda as the last parameter to that function call
  5. Using the restrict(amp) clause to identify accelerator-compatible code

These are the only changes required. There are no changes to project settings or environment variables. There is no code elsewhere that this needs to call. This is the whole thing.

What happens behind the scenes? One simplified explanation is that the lambda, the kernel that is passed to the parallel_for_each, is compiled to HLSL when your application is compiled. The run time for C++ AMP, a DLL that is included with the Visual C++ redistributable package, compiles the HLSL bytecode to hardware-specific machine code at run time. You don’t need to know this to use C++ AMP; it is taken care of by the library.

In the code sample just presented, you don’t see any code to copy the two input arrays, pA and pB, to the accelerator or any code to copy the result back into pC. The array_view objects handle this. An array_view is a portable view that works with, and abstracts over, CPU and GPU memories, whether they share the same chip or are physically separate. You can build an array_view that wraps a C-style array, as in this example, or one that wraps a std::vector, if that is where your data is.
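To make the std::vector case concrete, here is a minimal sketch of the same addition with array_view objects wrapped around vectors. The function name AddVectors and the SKETCH_HAS_AMP guard are assumptions made for illustration, not part of the library. The guard lets the sketch fall back to a plain loop on toolchains that do not ship amp.h. The explicit synchronize() call is one way, not the only one, to make sure the results are back in the vector before it is returned.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: use C++ AMP only where the header is actually available.
#if defined(_MSC_VER) && __has_include(<amp.h>)
#define _SILENCE_AMP_DEPRECATION_WARNINGS  // recent Visual Studio releases deprecate amp.h
#include <amp.h>
#define SKETCH_HAS_AMP 1
#else
#define SKETCH_HAS_AMP 0
#endif

// Hypothetical helper: returns the element-wise sum of two equally sized vectors.
std::vector<int> AddVectors(const std::vector<int>& a, const std::vector<int>& b)
{
    assert(a.size() == b.size());
    std::vector<int> sum(a.size());
    if (sum.empty())
        return sum;  // an array_view needs a non-empty extent

#if SKETCH_HAS_AMP
    using namespace concurrency;
    const int n = static_cast<int>(sum.size());

    // Each array_view wraps the vector's storage; no explicit copy code is written.
    array_view<const int, 1> av(n, a);
    array_view<const int, 1> bv(n, b);
    array_view<int, 1> sv(n, sum);

    parallel_for_each(sv.extent, [=](index<1> idx) restrict(amp)
    {
        sv[idx] = av[idx] + bv[idx];
    });

    sv.synchronize();  // copy the results back into sum before returning it
#else
    // Fallback when C++ AMP is not available: the original serial loop.
    for (std::size_t i = 0; i < sum.size(); ++i)
        sum[i] = a[i] + b[i];
#endif
    return sum;
}
```

The input views use const int so that the data only has to travel to the accelerator, never back.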

You may also hint about copy requirements. Consider the following start of a function:

void MatrixMultiply(std::vector<float>& C,
    const std::vector<float>& A, const std::vector<float>& B,
    int M, int N, int W)
{
    array_view<const float, 2> a(M, W, A);
    array_view<const float, 2> b(W, N, B);
    array_view<float, 2> c(M, N, C);
    c.discard_data();

The first two array_view objects specify that they are arrays of const float. This means there is no need to sync them back from the accelerator after the processing is complete—they can take a one-way trip there. Similarly, the third array_view is of float, but although it is associated with C, the call to discard_data() indicates that whatever values happen to be in the memory are not meaningful to anyone, so there is no need to copy the initial values in C over to the accelerator. This makes setting up the array_view very quick. The results will be copied back from the accelerator when the array_view objects are accessed on the CPU or when they go out of scope, whichever happens first.
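The function start shown above can be completed along the following lines. This is a sketch under stated assumptions rather than the definitive implementation: it assumes row-major storage (A is M x W, B is W x N, C is M x N), uses a simple untiled kernel, and adds a hypothetical SKETCH_HAS_AMP guard so that the same source falls back to a serial triple loop where amp.h is not available.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: use C++ AMP only where the header is actually available.
#if defined(_MSC_VER) && __has_include(<amp.h>)
#define _SILENCE_AMP_DEPRECATION_WARNINGS  // recent Visual Studio releases deprecate amp.h
#include <amp.h>
#define SKETCH_HAS_AMP 1
#else
#define SKETCH_HAS_AMP 0
#endif

// Computes C = A * B for row-major matrices: A is M x W, B is W x N, C is M x N.
void MatrixMultiply(std::vector<float>& C,
    const std::vector<float>& A, const std::vector<float>& B,
    int M, int N, int W)
{
    assert(M > 0 && N > 0 && W > 0);
    assert(A.size() == static_cast<std::size_t>(M) * W);
    assert(B.size() == static_cast<std::size_t>(W) * N);
    assert(C.size() == static_cast<std::size_t>(M) * N);

#if SKETCH_HAS_AMP
    using namespace concurrency;
    array_view<const float, 2> a(M, W, A);  // input only: one-way trip
    array_view<const float, 2> b(W, N, B);  // input only: one-way trip
    array_view<float, 2> c(M, N, C);
    c.discard_data();                       // initial contents of C are not needed

    // One accelerator thread per element of the result matrix.
    parallel_for_each(c.extent, [=](index<2> idx) restrict(amp)
    {
        const int row = idx[0];
        const int col = idx[1];
        float sum = 0.0f;
        for (int k = 0; k < W; ++k)
            sum += a(row, k) * b(k, col);
        c[idx] = sum;
    });

    c.synchronize();  // make the results visible in C on the CPU
#else
    // Fallback when C++ AMP is not available: plain serial multiplication.
    for (int row = 0; row < M; ++row)
    {
        for (int col = 0; col < N; ++col)
        {
            float sum = 0.0f;
            for (int k = 0; k < W; ++k)
                sum += A[row * W + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }
#endif
}
```

The kernel body is the same dot product a serial version would compute for one element. Only the two outer loops are replaced by the extent passed to parallel_for_each.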

This hinting needs no new language keywords. It is accomplished with an ordinary template argument (const float) and an ordinary member function call (discard_data()). There is no new paradigm for the developer to learn.

The original mathematical logic (such as it is) remains untouched and perfectly readable. There’s no mention of polygons, triangles, meshes, vertices, textures, memory, or anything other than adding array elements to get a sum. This is why C++ AMP can make heterogeneous computing mainstream.

The details of the parameters to parallel_for_each and the use of the new restrict keyword are covered in the case study in the next chapter.

C++ AMP Makes Portable, Future-Proof Executables

Once your code is compiled, the same executable can run on a variety of machines, as long as the machine has a DirectX 11 driver, which is available on Windows 7 and later and on Windows Server 2008 R2 and later. You are not restricted to a particular vendor or video card family.

When coded appropriately, your application can react to the environment in which it’s running and take advantage of whatever acceleration is available. If the machine has hardware with a DirectX 11 driver, the application will run faster. Deployment is simply a matter of copying the executable and its dependent dynamic-link libraries (DLLs), which are included in the Visual C++ redistributable, to the target machine.

For example, a single executable was written and copied to several different machines. It produces the following output on a virtual machine without access to the GPU:

CPU exec time: 112.206 (ms)
No accelerator available

And it produces the following output on a machine (more powerful than the laptop hosting the virtual machine) with a typical recent mainstream video card, the NVIDIA GeForce GT 420:

CPU exec time: 27.2373 (ms)
GPU exec time including copy-in/out: 19.8738 (ms)

This ability to adapt is made possible by a simple query that establishes which accelerators are available:

std::vector<accelerator> accelerators = accelerator::get_all();

You can then check the returned vector. If it’s empty, no accelerators are available. It’s a best practice to always ensure that there is an accelerator before trying to execute code that depends on one. Getting into that habit enables your applications to work on a variety of target machines while imposing minimal restrictions on your end users. As a developer with Visual Studio installed, you will always have an accelerator (which might just be an emulator provided for debugging), so forgetting to check at run time for the existence of at least one accelerator could easily lead to the classic “works on my development machine” scenario.
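The check-then-fall-back habit described above might look like the following sketch. The names HasHardwareAccelerator, AddArraysAdaptive, and SKETCH_HAS_AMP are hypothetical. As a further assumption, the check is stricter than testing for an empty vector: it also skips emulated devices and the CPU entry, because the returned list may include them. The kernel runs on the default accelerator, which this sketch assumes is the hardware device when one is present.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Assumption: use C++ AMP only where the header is actually available.
#if defined(_MSC_VER) && __has_include(<amp.h>)
#define _SILENCE_AMP_DEPRECATION_WARNINGS  // recent Visual Studio releases deprecate amp.h
#include <amp.h>
#define SKETCH_HAS_AMP 1
#else
#define SKETCH_HAS_AMP 0
#endif

// Returns true when at least one non-emulated, non-CPU accelerator is listed.
bool HasHardwareAccelerator()
{
#if SKETCH_HAS_AMP
    using concurrency::accelerator;
    const std::vector<accelerator> all = accelerator::get_all();
    return std::any_of(all.begin(), all.end(), [](const accelerator& acc)
    {
        return !acc.get_is_emulated()
            && acc.get_device_path() != accelerator::cpu_accelerator;
    });
#else
    return false;  // no C++ AMP on this toolchain, so nothing to accelerate with
#endif
}

// Adds pA and pB into pC. Returns true if the accelerator path was taken,
// false if the serial CPU loop was used instead.
bool AddArraysAdaptive(int n, const int* pA, const int* pB, int* pC)
{
    assert(n >= 0);
    if (n == 0)
        return false;

#if SKETCH_HAS_AMP
    if (HasHardwareAccelerator())
    {
        using namespace concurrency;
        array_view<const int, 1> a(n, pA);
        array_view<const int, 1> b(n, pB);
        array_view<int, 1> c(n, pC);
        c.discard_data();

        parallel_for_each(c.extent, [=](index<1> idx) restrict(amp)
        {
            c[idx] = a[idx] + b[idx];
        });

        c.synchronize();  // copy the results back into pC
        return true;
    }
#endif

    // CPU fallback: the original serial loop.
    for (int i = 0; i < n; ++i)
        pC[i] = pA[i] + pB[i];
    return false;
}
```

The caller gets correct results either way. The return value only reports which path ran, which is the kind of information behind output such as "No accelerator available".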

C++ AMP not only makes executables that work on a variety of machines, but it’s also designed to be future-proof. In the future, code you write to take advantage of GPU acceleration might be deployed to the cloud and might run over a number of machines, or it could run multithreaded on the CPU only. Heterogeneity in the future will mean more than just CPU+GPU; therefore, C++ AMP is not just a GPU solution, but also a heterogeneous computing solution that supports efficient mapping of data-parallel algorithms to many hardware platforms.

With multicore programming now becoming mainstream, you can leverage 4, 8, or 16 cores on a relatively ordinary computer. With some additional effort, you could also leverage the vector unit on each of these cores (using SSE, AVX, or WARP). GPGPU programming means you can spread your work across hundreds of hardware threads today and even more in the near future. With the cloud, using Infrastructure as a Service (IaaS) or Hardware as a Service (HaaS) offerings, you could conceivably leverage tens of thousands of hardware threads. But imagine being able to combine the two and reach the GPU cores on those cloud machines, reaching tens of millions of hardware threads. What could that enable?