Working with Multiple Accelerators in C++ AMP

  • 9/15/2012

Using More Than One GPU

If your application detects more than one C++ AMP-capable GPU accelerator, the question becomes: How can you take advantage of this? The answer is to schedule work on all accelerators concurrently, allocating a portion of the total work to each accelerator, and finally to combine the results. This approach is often called the scatter-gather or master-worker pattern. The CPU divides the work and scatters it among the available workers. Workers complete their portion of the work and the result is gathered back up to the CPU master, which then either uses the final result or scatters more work to the GPU workers.

The following example calculates the weighted average of the elements in a matrix using a single parallel_for_each to execute the computation on the default C++ AMP accelerator. Each thread on the GPU calculates the weighted average of an element in matrix C from the corresponding elements in matrix A using a weighting function, WeightedAverage().

const int rows = 2000, cols = 2000, shift = 60;
std::vector<float> vA(rows * cols);
std::vector<float> vC(rows * cols);
std::iota(vA.begin(), vA.end(), 0.0f);

array_view<const float, 2> a(rows, cols, vA);
array_view<float, 2> c(rows, cols, vC);

extent<2> ext(rows - shift * 2, cols - shift * 2);
parallel_for_each(ext, [=](index<2> idx) restrict(amp)
{
    index<2> idc(idx[0] + shift, idx[1] + shift);
    c[idc] = WeightedAverage(idc, a, shift);
});

The WeightedAverage() function simply calculates an average using the weighted sum of the surrounding pixels, and the parameter shift specifies the size of the surrounding pixel window. The actual function isn’t that important; for the purposes of the example, it is just work being done on the GPU that depends on surrounding values in the matrix. Although this is a trivial example, it serves to demonstrate some of the complexities of partitioning a computation across more than one GPU. The Cartoonizer case study in Chapter 10, “Cartoonizer Case Study,” shows an example of a much more computationally intensive application that uses a similar algorithm.
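The body of WeightedAverage() isn’t listed here. As a minimal CPU-only sketch, assuming a uniform (box) weighting so that the weighted sum reduces to a plain mean over the (2 * shift + 1) x (2 * shift + 1) window, it might look like this (the function name WindowAverage and flat-vector layout are illustrative):

```cpp
#include <vector>

// Mean over a (2 * shift + 1)^2 window centered at (row, col).
// Assumes a uniform weighting; the caller guarantees a full window,
// i.e. (row, col) lies at least `shift` elements from every matrix edge.
float WindowAverage(int row, int col, const std::vector<float>& a,
                    int cols, int shift)
{
    float sum = 0.0f;
    const int n = 2 * shift + 1;
    for (int r = row - shift; r <= row + shift; ++r)
        for (int c = col - shift; c <= col + shift; ++c)
            sum += a[r * cols + c];
    return sum / (n * n);
}
```

Any weighting would do; what matters for partitioning is only that each output element reads a window of neighboring input elements.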


In the diagram, matrix element [4, 3] is calculated based on the 24 surrounding elements with a shift parameter of 2 for an 8 x 8 matrix. Even on a single accelerator, new values can be calculated only for matrix elements sufficiently far away from the edge of the matrix that a full window can be used to calculate the average. These elements are represented by the 8 x 10 shaded area on the next diagram. The border around the edge is called the halo; it holds read-only values that are required to correctly calculate new values for elements that lie inside the halo.

It’s possible to divide the work across several accelerators by creating array_view instances corresponding to subregions of the matrices and executing these on different accelerators. In this case, each accelerator must be passed not only the elements for which it will calculate new values but also the halo elements. This increases the amount of data being transferred. For large arrays, where the halo width is much smaller than the overall matrix dimensions, this does not present a significant additional overhead. The following diagram shows the partitioning of the matrix onto two accelerators. Note that each accelerator is allocated half of the computable matrix, a 4 x 10 region, and the halo elements needed to calculate the result. Now the accelerators can work in parallel to calculate the weighted sum of the respective portions of the matrix allocated to them.

A TaskData structure is used to track the work assigned to each C++ AMP accelerator. It stores the default accelerator_view for each accelerator and the start row and read and write extents of the submatrices that the accelerator will use for its part of the overall calculation. The writeExt holds the dimensions of the shaded (writable) rows, and writeOffset holds the number of rows from the top of the read region down to the top of the shaded area.

struct TaskData
{
    int id;
    accelerator_view view;
    int startRow;
    extent<2> readExt;
    int writeOffset;
    extent<2> writeExt;

    TaskData(accelerator a, int i) : view(a.default_view), id(i) {}
    // ...
};

The TaskData structures are initialized to divide up the rows of the matrix between the available accelerators. The TaskData struct defines a static method to do this.

static std::vector<TaskData> Configure(const std::vector<accelerator>& accls,
    int rows, int cols, int shift)
{
    std::vector<TaskData> tasks;
    int startRow = 0;
    int rowsPerTask = int(rows / accls.size());
    int i = 0;
    std::for_each(accls.cbegin(), accls.cend(),
        [=, &tasks, &i, &startRow](const accelerator& a)
    {
        TaskData t(a, i++);
        t.startRow = std::max(0, startRow - shift);
        int endRow = std::min(startRow + rowsPerTask + shift, rows);
        t.readExt = extent<2>(endRow - t.startRow, cols);
        t.writeOffset = shift;
        t.writeExt = extent<2>(t.readExt[0] - shift -
            ((endRow == rows || startRow == 0) ? shift : 0), cols);
        tasks.push_back(t);
        startRow += rowsPerTask;
    });
    return tasks;
}

Your application can then create array_view instances for the subregions of matrices A and C and execute a C++ AMP kernel on each accelerator to calculate the values for the corresponding subregion of matrix C.

const int rows = 2000, cols = 2000, shift = 60;
std::vector<TaskData> tasks = TaskData::Configure(accls, rows, cols, shift);

std::vector<float> vA(rows * cols);
std::vector<float> vC(rows * cols);
std::iota(vA.begin(), vA.end(), 0.0f);

std::vector<array<float, 2>> avCs;
std::for_each(tasks.cbegin(), tasks.cend(), [&avCs](const TaskData& t)
{
    avCs.push_back(array<float, 2>(t.readExt, t.view));
});

std::for_each(tasks.cbegin(), tasks.cend(), [&](const TaskData& t)
{
    array_view<const float, 2> a(t.readExt, &vA[t.startRow * cols]);
    array_view<float, 2> c = avCs[t.id];
    index<2> writeOffset(t.writeOffset, shift);
    parallel_for_each(t.view, t.writeExt, [=](index<2> idx) restrict(amp)
    {
        index<2> idc = idx + writeOffset;
        c[idc] = WeightedAverage(idc, a, shift);
    });
});

std::for_each(tasks.cbegin(), tasks.cend(), [&](const TaskData& t)
{
    array_view<float, 2> outData(t.writeExt, &vC[(t.startRow + t.writeOffset) * cols]);
    avCs[t.id].section(index<2>(t.writeOffset, 0), t.writeExt).copy_to(outData);
});

This example uses a std::for_each to launch a kernel on each GPU and then a second loop to synchronize the results back to the CPU. The full implementation is in the MatrixMultiGpuSequentialExample function in Main.cpp.

If you run the sample on a machine with more than one C++ AMP-capable GPU, you will see output similar to the following. The exact times will vary based on the GPUs being used, as well as other factors, such as the type of CPU and the speed of the PCI bus and RAM on your computer.

Matrix weighted average 2000 x 2000 matrix, with 121 x 121 window
Matrix size 15625 KB

Single GPU matrix weighted average took                         1198.91 (ms)
2 GPU matrix weighted average (p_f_e) took                       649.923 (ms)
2 GPU matrix weighted average took                               652.042 (ms)

The matrix weighted average on two GPUs is faster, showing an improvement of 84 percent. It is not a full 100 percent because there is some overhead associated with distributing the calculation across two GPUs.

This is a small sample designed to make the code easier to read, but it doesn’t represent the sort of real workloads that will be able to take full advantage of more than one GPU and the CPU. The NBody and Cartoonizer case studies can both be run on multiple GPUs.