Working with Multiple Accelerators in C++ AMP

  • 9/15/2012

Swapping Data among Accelerators

The weighted average example does not need to share any data between accelerators during the calculation. Although the result for each matrix element depends on the surrounding elements, each accelerator holds a halo of additional read-only elements, so every value it needs is available locally. In contrast, iterative calculations that rely on neighboring elements stored on other GPUs must refresh those halo elements before the next iteration step can proceed.

When using multiple GPUs, it’s often necessary to swap some or all of the data between steps in a calculation. Typically, the process of a calculation step looks like this:

  1. Divide current data among different GPUs.
  2. Calculate results on each GPU based on its local data.
  3. Swap some or all result data among GPUs by copying it to CPU memory and then back to the other GPUs.
  4. Go to step 2.

Depending on your application, you might need to share some or all of the result data after each calculation step. For example, the multi-GPU implementation in the NBody case study (in Chapter 2, “NBody Case Study”) shares all the data after each time step. In contrast, the Cartoonizer case study has only to share the edges of each subregion of the image being processed (see Chapter 10). In either case, there must be sufficient computation at each step to outweigh the cost of the additional data transfers.

Let’s consider a modified version of the original multi-GPU weighted average code that repeats the weighting calculation 10 times. This version of the code is defined in the LoopedMatrixMultiGpuExample() function in Main.cpp. Implementing the iterative algorithm efficiently on two accelerators requires swapping additional data, as illustrated in the following diagram:

At the end of each loop iteration, the new results from rows 4 and 5 on accelerator 1 must be copied into the halo cells in rows 0 and 1 of accelerator 2. Similarly, the new values from accelerator 2 must be copied into the halo of accelerator 1. No further calculation can take place while this is happening. The larger the averaging window, the more data will be transferred after each step of the calculation.

In this iterative example, the data is broken up and stored in separate array instances on each accelerator. The std::vector instances arrAs and arrCs store these array instances for each accelerator.

const int rows = 2000, cols = 2000, shift = 60;
std::vector<TaskData> tasks = TaskData::Configure(accls, rows, cols, shift);

std::vector<float> vA(rows * cols);
std::vector<float> vC(rows * cols);
std::iota(vA.begin(), vA.end(), 0.0f);

std::vector<array<float, 2>> arrAs;
std::vector<array<float, 2>> arrCs;

std::for_each(tasks.begin(), tasks.end(), [&](const TaskData& t)
{
    arrAs.push_back(array<float, 2>(t.readExt, &vA[t.startRow * cols], t.view));
    arrCs.push_back(array<float, 2>(t.readExt, t.view));
});

Two additional arrays on the CPU are used to swap the data among the array instances stored on each GPU.

array<float, 2> swapTop = array<float, 2>(extent<2>(shift, cols),
    accelerator(accelerator::cpu_accelerator).default_view);  // CPU-side staging array
array_view<float, 2> swapViewTop = array_view<float, 2>(swapTop);
array<float, 2> swapBottom = array<float, 2>(extent<2>(shift, cols),
    accelerator(accelerator::cpu_accelerator).default_view);
array_view<float, 2> swapViewBottom = array_view<float, 2>(swapBottom);

The new multiaccelerator code is shown below. The full source is in the LoopedMatrixMultiGpuExample() function in Main.cpp. During each loop iteration, it calculates the updates to the submatrices on each accelerator and then swaps just the upper and lower edges to update the halo elements of each matrix. Finally, it swaps the vector containing the results for each accelerator, arrCs, for the vector containing next inputs, arrAs.

for (int i = 0; i < iter; ++i)
{
    //  Calculate a portion of the result on each GPU

    std::for_each(tasks.cbegin(), tasks.cend(), [=, &tasks, &arrAs, &arrCs](const TaskData& t)
    {
        const size_t n = &t - tasks.data();    // index of this task's accelerator
        array<float, 2>& a = arrAs[n];
        array<float, 2>& c = arrCs[n];

        parallel_for_each(t.view, t.readExt, [=, &a, &c](index<2> idx) restrict(amp)
        {
            c[idx] = a[idx];
            if ((idx[0] >= shift) && (idx[0] < (rows - shift)) &&
                (idx[1] >= shift) && (idx[1] < (cols - shift)))
                c[idx] = WeightedAverage(idx, a, shift);
        });
    });

    //  Swap edges

    std::vector<completion_future> copyResults((tasks.size() - 1) * 2);
    parallel_for(0, int(tasks.size() - 1), [=, &arrCs, &copyResults](size_t i)
    {
        array_view<float, 2> topEdge =
            arrCs[i].section(index<2>(tasks[i].writeOffset + tasks[i].writeExt[0] - shift, 0),
                             swapViewTop.extent);
        array_view<float, 2> bottomEdge =
            arrCs[i + 1].section(index<2>(tasks[i + 1].writeOffset, 0),
                                 swapViewBottom.extent);
        copyResults[i * 2] = copy_async(topEdge, swapViewTop);          // two futures per boundary
        copyResults[i * 2 + 1] = copy_async(bottomEdge, swapViewBottom);
    });

    parallel_for_each(copyResults.begin(), copyResults.end(), [=](completion_future& f)
    { f.get(); });

    parallel_for(0, int(tasks.size() - 1), [=, &arrCs, &copyResults](size_t i)
    {
        array_view<float, 2> topEdge =
            arrCs[i].section(index<2>(tasks[i].writeOffset + tasks[i].writeExt[0], 0),
                             swapViewBottom.extent);
        array_view<float, 2> bottomEdge = arrCs[i + 1].section(swapViewTop.extent);
        copyResults[i * 2] = copy_async(swapViewTop, bottomEdge);
        copyResults[i * 2 + 1] = copy_async(swapViewBottom, topEdge);
    });

    parallel_for_each(copyResults.begin(), copyResults.end(), [=](completion_future& f)
    { f.get(); });

    //  Swap results of this iteration with the input matrix
    std::swap(arrAs, arrCs);
}

The swapping steps use copy_async() rather than copy() to minimize the impact of any copy operations by parallelizing them as much as possible. The copying takes place in two phases: first, the edges are copied into CPU memory (operations A and C in the previous diagram), and then they are copied back to the other GPU (operations B and D in the diagram). Each copy operation returns a completion_future. After all the copy operations have been started, the program uses completion_future::get() to wait until all the copy operations have finished before starting the next phase.

The following example also uses copy_async() rather than copy() to move data from the CPU memory to the accelerator.

const int size = 1024 * 1024;
std::vector<float> vA(size, 0.0f);
array<float, 1> arrA(size);

std::cout << "Data copy of " << size << " bytes starting." << std::endl;
completion_future f = copy_async(vA.cbegin(), vA.cend(), arrA);
f.then([=] ()
{
    std::cout << "  Finished asynchronous copy!" << std::endl;
});
std::cout << "Do more work on this thread..." << std::endl;
f.get();
std::cout << "Data copy completed." << std::endl;

The output from this code looks like this. The output from the main application thread, “Do more work on this thread…”, is displayed before the output from the completion_future::then() continuation, “Finished asynchronous copy!”, which executes only after the copy is complete.

Data copy of 1048576 bytes starting.
Do more work on this thread...
  Finished asynchronous copy!
Data copy completed.

The example demonstrates two things: first, copy_async() allows the calling thread to continue doing other work without waiting for the copy to complete; and second, the completion_future::then() method can be used to specify further operations to execute after the copy itself has completed.

On Windows 7, this asynchronous approach to copying is more important because it also minimizes lock contention. On Windows 7, C++ AMP copy operations from the GPU to CPU involve two locking operations: first a process-wide DirectX kernel lock is taken, followed by a read lock on the source data. If the C++ AMP kernel calculating the results still has a write lock on the data being copied, the copy operation will take the DirectX kernel lock and block when attempting to acquire a read lock on the source data until the C++ AMP kernel completes and frees the resource lock. This means that the copy operation holds the DirectX kernel lock for the entire duration of the C++ AMP kernel execution and the data transfer, preventing other threads from submitting work to other GPUs during this period. If your application is executing C++ AMP kernels from more than one CPU thread, this lock contention will result in serialization of kernels that were intended to run concurrently on different GPUs. The end result is that your application will not see the performance gains you expect from adding more GPUs.

The key to getting the best possible performance is to minimize the length of time during which the copy operation takes these locks. The code here can be rewritten to minimize the time that the copy call holds the process-wide DirectX kernel lock.

std::vector<float> resultData(100000, 0.0f);
array<float, 1> resultArr(resultData.size());

// parallel_for_each calculates resultArr data...

copy(resultArr, resultData.begin());

The following code does this by queuing the copy on another thread by using copy_async(). It then waits for all work on the accelerator view to complete before attempting to get the result of the copy. This means that the locks are taken for the shortest possible time.

// parallel_for_each calculates resultArr data...

completion_future f = copy_async(resultArr, resultData.begin());
resultArr.accelerator_view.wait();
f.get();

On Windows 7, the accelerator_view::wait() method has some CPU impact because it is a spin wait, so you should use this approach only where the benefit of improved concurrency when using multiple GPUs outweighs the additional load placed on the CPU.

Finally, after all the iterations have completed, the data is copied back into vC on the CPU.

array_view<float, 2> c(rows, cols, vC);
std::for_each(tasks.crbegin(), tasks.crend(), [=, &tasks, &arrAs, &c](const TaskData& t)
{
    const size_t n = &t - tasks.data();    // index of this task's accelerator
    index<2> ind(t.writeOffset, shift);
    extent<2> ext(t.writeExt[0], t.writeExt[1] - shift * 2);
    array_view<float, 2> outData = c.section(ind, ext);
    arrAs[n].section(ind, ext).copy_to(outData);
});

The output window shows the relative performance of this iterative averaging implementation compared to the single average version described previously. Based on the time for a single average calculation (roughly 579 ms on two GPUs), you would expect the iterative version to take 5790 ms (10 times as long). In fact, it takes 6309 ms, an additional 520 ms. This represents an overhead of roughly 9 percent.

Matrix weighted average 2000 x 2000 matrix, with 121 x 121 window
Matrix size 15625 KB

Single GPU matrix weighted average took                         1070.74 (ms)
2 GPU matrix weighted average (p_f_e) took                       579.947 (ms)
2 GPU matrix weighted average took                               585.191 (ms)

Weighted average executing 10 times

2 GPU matrix weighted average took                              6309.93 (ms)