Overview and C++ AMP Approach

  • 9/15/2012

Technologies for CPU Parallelism

One way to reduce the amount of time spent in the sequential portion of your application is to make it less sequential—to redesign the application to take advantage of CPU parallelism as well as GPU parallelism. Although the GPU can have thousands of threads at once and the CPU far less, leveraging CPU parallelism as well still contributes to the overall speedup. Ideally, the technologies used for CPU parallelism and GPU parallelism would be compatible. A number of approaches are possible.


An important way to make processing faster is SIMD, which stands for Single Instruction, Multiple Data. In a typical application, instructions must be fetched one at a time and different instructions are executed as control flows through your application. But if you are performing a large data-parallel operation like matrix addition, the instructions (the actual addition of the integers or floating-point numbers that comprise the matrices) are the same over and over again. This means that the cost of fetching an instruction can be spread over a large number of operations, performing the same instruction on different data (for example, different elements of the matrices.) This can amplify your speed tremendously or reduce the power consumed to perform your calculation.

Vectorization refers to transforming your application from one that processes a single piece of data at a time, each with its own instructions, into one that processes a vector of information all at once, applying the same instruction to each element of the vector. Some compilers can do this automatically to some loops and other parallelizable operations.

Microsoft Visual Studio 2012 supports manual vectorization using SSE (Streaming SIMD Extensions) intrinsics. Intrinsics appear to be functions in your code, but they map directly to a sequence of assembly language instructions and do not incur the overhead of a function call. Unlike in inline assembly, the optimizer can understand intrinsics, allowing it to optimize other parts of your code accordingly. Intrinsics are more portable than inline assembly, but they still have some possible portability problems because they rely on particular instructions being available on the target architecture. It is up to the developer to ensure that the target machine has a chip that supports these intrinsics. Not surprisingly, there is an intrinsic for that: __cpuid() generates instructions that fill four integers with information about the capabilities of the processor. (It starts with two underscores because it is compiler-specific.) To check if SSE3 is supported, you would use the following code:

int CPUInfo[4] = { -1 };
__cpuid(CPUInfo, 1);
bool bSSEInstructions = (CpuInfo[3] >> 24 && 0x1);

Which intrinsic you would use depends on how you are designing your work to be more parallel. Consider the case in which you need to add many pairs of numbers. The single intrinsic _mm_hadd_epi32 will add four pairs of 32-bit numbers at once. You fill two memory-aligned 128-bit numbers with the input values and then call the intrinsic to add them all at once, getting a 128-bit result that you can split into the four 32-bit numbers representing the sum of each pair. Here is some sample code from MSDN:

#include <stdio.h>
#include <tmmintrin.h>

int main ()
    __m128i a, b;

    a.m128i_i32[0] = -1;
    a.m128i_i32[1] = 1;
    a.m128i_i32[2] = 0;
    a.m128i_i32[3] = 65535;
    b.m128i_i32[0] = -65535;
    b.m128i_i32[1] = 0;
    b.m128i_i32[2] = 128;
    b.m128i_i32[3] = -32;

    __m128i res = _mm_hadd_epi32(a, b);

    std::wcout << "Original a: " <<
    a.m128i_i32[0] << "\t" << a.m128i_i32[1] << "\t" <<
    a.m128i_i32[2] << "\t" << a.m128i_i32[3] << "\t" << std::endl;
    std::wcout << "Original b: " <<
    b.m128i_i32[0] << "\t" << b.m128i_i32[1] << "\t" <<
    b.m128i_i32[2] << "\t" << b.m128i_i32[3] << std::endl;
    std::wcout << "Result res: " <<
    res.m128i_i32[0] << "\t" << res.m128i_i32[1] << "\t" <<
    res.m128i_i32[2] << "\t" << res.m128i_i32[3] <<std::endl;

    return 0;

The first element of the result contains a0 + a1, the second contains a2 + a3, the third contains b0 + b1, and the fourth contains b2 + b3. If you can redesign your code to do your additions in pairs and to group the pairs into clumps of four, you can parallelize your code using this intrinsic. There are intrinsics to perform a variety of operations (including add, subtract, absolute value, negate—even do dot products using 16 x 16 8-bit integers) in several “widths,” or number of calculations at a time.

One drawback of vectorization with these intrinsics is that the readability and maintainability of the code falls dramatically. Typically, code is written “straight up” first, tested for correctness, and then, when profiling reveals areas of the code that are performance bottlenecks and candidates for vectorization, adapted to this less-readable state.

In addition, Visual Studio 2012 implements auto-vectorization and auto-parallelization of your code. The compiler will automatically vectorize loops if it is possible. Vectorization reorganizes a loop—for example, a summation loop—so that the CPU can execute multiple iterations at the same time. By using auto-vectorization, loops can be up to eight times faster when executed on CPUs that support SIMD instructions. For example, most modern processors support SSE2 instructions, which allow the compiler to instruct the processor to do math operations on four numbers at a time. The speedup is achieved even on single-core machines, and you don’t need to change your code at all.

Auto-parallelization reorganizes a loop so that it can be executed on multiple threads at the same time, taking advantage of multicore CPUs and multiprocessors to distribute chunks of the work to all available processors. Unlike auto-vectorization, you tell the compiler which loops to parallelize with the #pragma parallelize directive. The two features can work together so that a vectorized loop is then parallelized across multiple processors.


OpenMP (the MP stands for multiprocessing) is a cross-language, cross-platform application programming interface (API) for CPU-parallelism that has existed since 1997. It supports Fortran, C, and C++ and is available on Windows and a number of non-Windows platforms. Visual C++ supports OpenMP with a set of compiler directives. The effort of establishing how many cores are available, creating threads, and splitting the work among the threads is all done by OpenMP. Here is an example:

// size is a compile-time constant
double* x = new double[size];
double* y = new double[size + 1];
// get values into y
#pragma omp parallel for
for (int i = 1; i < size; ++i)
    x[i] = (y[i - 1] + y[i + 1]) / 2;

This code fragment uses vectors x and y and visits each element of y to build x. Adding the pragma and recompiling your program with the /openmp flag is all that is needed to split this work among a number of threads—one for each core. For example, if there are four cores and the vectors have 10,000 elements, the first thread might be given i values from 1 to 2,500, the second 2,501 to 5,000, and so on. At the end of the loop, x will be properly populated. The developer is responsible for writing a loop that is parallelizable, of course, and this is the truly hard part of the job. For example, this loop is not parallelizable:

for (int i = 1; i <= n; ++i)
    a[i] = a[i - 1] + b[i];

This code contains a loop-carried dependency. For example, to determine a[2502] the thread must have access to a[2501]—meaning the second thread can’t start until the first has finished. A developer can put the pragma into this code and not be warned of a problem, but the code will not produce the correct result.

One of the major restrictions with OpenMP arises from its simplicity. A loop from 1 to size, with size known when the loop starts, is easy to divide among a number of threads. OpenMP can only handle loops that involve the same variable (i in this example) in all three parts of the for-loop and only when the test and increment also feature values that are known at the start of the loop.

This example:

for (int i = 1; (i * i) <= n; ++i)

cannot be parallelized with #pragma omp parallel for because it is testing the square of i, not just i. This next example:

for (int i = 1; i <= n; i += Foo(abc))

also cannot be parallelized with #pragma omp parallel for because the amount by which i is incremented each time is not known in advance.

Similarly, loops that “read all the lines in a file” or traverse a collection using an iterator cannot be parallelized this way. You would probably start by reading all the lines sequentially into a data structure and then processing them using an OpenMP-friendly loop.

Concurrency Runtime (ConcRT) and Parallel Patterns Library

The Microsoft Concurrency Runtime is a four-piece system that sits between applications and the operating system:

  • PPL (Parallel Patterns Library) Provides generic, type-safe containers and algorithms for use in your code
  • Asynchronous Agents Library Provides an actor-based programming model and in-process message passing for lock-free implementation of multiple operations that communicate asynchronously
  • Task Scheduler Coordinates tasks at run time with work stealing
  • The Resource Manager Used at run time by the Task Scheduler to assign resources such as cores or memory to workloads as they happen

The PPL feels much like the Standard Library, leveraging templates to simplify constructs such as a parallel loop. It is made dramatically more usable by lambdas, added to C++ in C++11 (although they have been available in Microsoft Visual C++ since the 2010 release).

For example, this sequential loop:

for (int i = 1; i < size; ++i)
    x[i] = (y[i - 1] + y[i + 1]) / 2;

can be made into a parallel loop by replacing the for with a parallel_for:

#include <ppl.h>
// . . .
concurrency::parallel_for(1, size, [=](int i)
    x[i] = (y[i-1] + y[i+1])/2;

The third parameter to parallel_for is a lambda that holds the old body of the loop. This still requires the developer to know that the loop is parallelizable, but the library bears all the other work. If you are not familiar with lambdas, see the “Lambdas in C++11” section in Chapter 2, “NBody Case Study,” for an overview.

A parallel_for loop is subject to restrictions: it works with an index variable that is incremented from the start value to one less than the end value (an overload is available that allows incrementing by values other than 1) and doesn’t support arbitrary end conditions. These conditions are very similar to those for OpenMP. Loops that test if the square of the loop variable is less than some limit, or that increment by calling a function to get the increment amount, are not parallelizable with parallel_for, just as they are not parallelizable with OpenMP.

Other algorithms, parallel_for_each and parallel_invoke, support other ways of going through a data set. To work with an iterable container, like those in the Standard Library, use parallel_for_each with a forward iterator, or for better performance use a random access iterator. The iterations will not happen in a specified order, but each element of the container will be visited. To execute a number of arbitrary actions in parallel, use parallel_invoke—for example, passing three lambdas in as arguments.

It’s worth mentioning that the Intel Threading Building Blocks (TBB) 3.0 is compatible with PPL, meaning that using PPL will not restrict your code to Microsoft’s compiler. TBB offers “semantically compatible interfaces and identical concurrent STL container solutions” so that your code can move to TBB if you should need that option.

Task Parallel Library

The Task Parallel Library is a managed (.NET Framework) approach to parallel development. It provides parallel loops as well as tasks and futures for developers who use C#, F#, or VB. The CLR Thread Pool dispatches and manages threads. Managed developers have other parallel options, including PLINQ.

WARP—Windows Advanced Rasterization Platform

The Direct3D platform supports a driver model in which arbitrary hardware can plug into Microsoft Windows and execute graphics-related code. This is how Windows supports GPUs, from simple graphics tasks, such as rendering a bitmap to the screen, all the way to DirectCompute, which allows fairly arbitrary computations to occur on the GPU. However, this framework also allows for having graphics drivers that are implemented using CPU code. In particular, WARP is a software-only implementation of one such graphics device that is shipped together with the operating system. WARP is capable of executing both simple graphics tasks—as well as complicated compute tasks—on the CPU. It leverages both multithreading and vectorization in order to efficiently execute Direct3D tasks. WARP is often used when a physical GPU is not available, or for smaller data sets, where WARP often proves to be the more agile solution.

Technologies for GPU Parallelism

OpenGL, the Open Graphics Library, dates back to 1992 and is a specification for a cross-language, cross-platform API to support 2D and 3D graphics. The GPU calculates colors or other information specifically required to draw an image on the screen. OpenCL, the Open Computing Language, is based on OpenGL and provides GPGPU capabilities. It’s a language of its own similar in appearance to C. It has types and functionality that are not in C and is missing features that are in C. Using OpenCL does not restrict a developer to deployment on specific video cards or hardware. However, because it does not have a binary standard, you might need to deploy your OpenCL source to be compiled as you go or precompile for a specific target machine. A variety of tools are available to write, compile, test, and debug OpenCL applications.

Direct3D is an umbrella term for a number of technologies, including Direct2D and Direct3D APIs for graphics programming on Windows. It also includes DirectCompute, an API to support GPGPU that is similar to OpenCL. DirectCompute uses a nonmainstream language, HLSL (High Level Shader Language) that looks like C but has significant differences from C. HLSL is widely used in game development and has much the same capabilities as the OpenCL language. Developers can compile and run the HLSL parts of their applications from the sequential parts running on the CPU. As with the rest of the Direct3D family, the interaction between the two parts is done using COM interfaces. Unlike OpenCL, DirectCompute compiles to bytecode, which is hardware portable, meaning you can target more architectures. It is, however, Windows-specific.

CUDA, the Compute Device Unified Architecture, refers to both hardware and the language that can be used to program against it. It is developed by NVIDIA and can be used only when the application will be deployed to a machine with NVIDIA graphics cards. Applications are written in “CUDA C,” which is not C but is similar to it. The concepts and capabilities are similar to those of OpenCL and DirectCompute. The language is “higher level” than OpenCL and DirectCompute, providing simpler GPU invocation syntax that is embedded in the language. In addition, it allows you to write code that is shared between the CPU and the GPU. Also, a library of parallel algorithms, called Thrust, takes inspiration from the design of the C++ Standard Library and is aimed at dramatically increasing developer productivity for CUDA developers. CUDA is under active development and continues to gain new capabilities and libraries.

Each of these three approaches to harnessing the power of the GPU has some restrictions and problems. Because OpenCL is cross-platform, cross-hardware (at least in source code form), and cross-language, it is quite complicated. DirectCompute is essentially Windows-only. CUDA is essentially NVIDIA-only. Most important, all three approaches require learning not only a new API and a new way of looking at problems but also an entirely new programming language. Each of the three languages is “C-like” but is not C. Only CUDA is becoming similar to C++; OpenCL and DirectCompute cannot offer C++ abstractions such as type safety and genericity. These restrictions mean that mainstream developers have generally ignored GPGPU in favor of techniques that are more generally accessible.

Requirements for Successful Parallelism

When writing an application that will leverage heterogeneity, you are of course required to be aware of the deployment target. If the application is designed to run on a wide variety of machines, the machines might not all have video cards that support the workloads you intend to send to them. The target might even be a machine with no access to GPU processing at all. Your code should be able to react to different execution environments and at least work wherever it is deployed, although it might not gain any speedup.

In the early days of GPGPU, floating-point calculations were a challenge. At first, double-precision operations weren’t fully available. There were also issues with the accuracy of operations and error-handling in the math libraries. Even today, single-precision floating-point operations are faster than double-precision operations and always will be. It might be necessary to put some effort into establishing what precision your calculations need and whether the GPU can really do those faster than the CPU. In general, GPUs are converging to offer double-precision math and moving toward IEEE 754-compliant math, in addition to the quick-and-dirty math that they have supported in earlier generations of hardware.

It is also important to be aware of the time cost of moving input data to the GPU for processing and retrieving output results from the GPU. If this time cost exceeds the savings from processing the data on the GPU, you have complicated your application for no benefit. A GPU-aware profiler is a must to ensure that actual performance improvement is happening with production quantities of data.

Tool choice is significant for mainstream developers. Past GPGPU applications often had a small corps of users who might have also been the developers. As GPGPU moves into the mainstream, developers who are using the GPU for extra processing are also interacting with regular users. These users make requests for enhancements, want their application to adopt features of new platforms as they are released, and might require changes to the underlying business rules or the calculations that are being performed. The programming model, the development environment, and the debugger must all allow the developer to accommodate these kinds of changes. If you must develop different parts of your application in different tools, if your debugger can handle only the CPU (or only the GPU) parts of your application, or if you don’t have a GPU-aware profiler, you will find developing for a heterogeneous environment extraordinarily difficult. Tool sets that are usable for developers who support a single user or who only support themselves as a user are not necessarily usable for developers who support a community of nondeveloper users. What’s more, developers who are new to parallel programming are unlikely to write ideally parallelized code on the first try; tools must support an iterative approach so that developers can learn about the performance of their applications and the consequences of their decisions on algorithms and data structures.

Finally, developers everywhere would love to return to the days of the “free lunch.” If more hardware gets added to the machine or new kinds of hardware are invented, ideally your code could just benefit from that without having to change much—or at all. It might even be possible to benefit from improved hardware using the same executable that was deployed to the old hardware and not even need to recompile.