Working with Multiple Accelerators in C++ AMP

Dynamic Load Balancing

The example above used two identical GPUs and assumed that no other applications were scheduling work on them. What if your application is running on a machine with two or more different GPUs? For example, your computer might have come with an integrated GPU on the motherboard to which you added a more powerful discrete GPU, or other applications might be using some of the available GPUs for other work. In either case, scheduling the same amount of work on each of the available GPUs will not give the best results; the application’s performance will be limited by the slowest GPU.

The answer is to implement a load-balancing algorithm to allocate work between the available GPU accelerators. Your application can load-balance based on the relative performance of each GPU, using either the time taken for a kernel to run or the amount of work completed. In some cases, if the GPUs have wildly differing performance characteristics, then just using the best one or two might be the more efficient solution.
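
For instance, if your application measures how long each accelerator took to process an equal-sized chunk of work, it can hand out subsequent work in proportion to the measured throughput. The following sketch is not part of the book's sample; the ComputeShares helper and its inputs are hypothetical.

#include <numeric>
#include <vector>

// elapsedSeconds[i] is the time accelerator i took to process an equal-sized chunk.
// The result holds the fraction of the next batch of work to assign to each accelerator.
std::vector<double> ComputeShares(const std::vector<double>& elapsedSeconds)
{
    std::vector<double> shares(elapsedSeconds.size());
    for (size_t i = 0; i < elapsedSeconds.size(); ++i)
        shares[i] = 1.0 / elapsedSeconds[i];                     // relative throughput

    const double total = std::accumulate(shares.begin(), shares.end(), 0.0);
    for (double& s : shares)
        s /= total;                                              // normalize to fractions
    return shares;
}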

A common approach for doing this is the master-worker pattern. The master breaks the problem up into tasks and adds them to a queue. The master then assigns tasks from the queue to the available workers. Once a worker completes a task, it returns the results to the master. The master then assigns another task to the worker until no more tasks remain. Finally, it shuts down the workers and hands the results off to the application. The example shown here uses a work-stealing variation of master-worker in which worker GPUs take work from a master task queue on the CPU rather than waiting for it to be assigned to them.

The advantage of this approach is that it automatically load-balances across worker GPUs with different performance characteristics. It also handles GPUs whose effective throughput varies because other applications are scheduling work on them. For effective load balancing there must be enough tasks to keep all the available GPUs busy. Smaller tasks allow finer-grained load balancing but add to the management overhead, so the best task size largely depends on the application.
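
As a rough starting point, you can derive the task size from the number of available accelerators and an oversubscription factor, and then tune it for your workload. The snippet below is only a sketch; tasksPerAccelerator is an illustrative value, and accls and dataSize are assumed to be defined as in the example that follows.

// Illustrative only: more tasks per accelerator gives finer-grained load balancing
// at the cost of more scheduling and data-transfer overhead.
const size_t tasksPerAccelerator = 10;
const size_t taskCount = accls.size() * tasksPerAccelerator;
const size_t taskSize = (dataSize + taskCount - 1) / taskCount;  // round up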

The following is a simple example to show the task partitioning of a C++ AMP kernel that modifies a one-dimensional array. The sample uses a Task type to track the range within the input data associated with each task.

// A task describes a contiguous range of the input data as a (start offset, length) pair.
typedef std::pair<size_t, size_t> Task;

inline size_t GetStart(Task t) { return t.first; }
inline size_t GetEnd(Task t)   { return t.first + t.second; }
inline size_t GetSize(Task t)  { return t.second; }

The example creates a concurrent_queue<Task> of tasks and then uses a parallel_for to start a worker for each accelerator. Each worker pops tasks from the queue and executes a C++ AMP kernel to process the section of the array associated with the task. Once the queue is empty, the parallel_for completes.

    const size_t dataSize = 101000;
    const size_t taskSize = dataSize / 20;
    std::vector<int> theData(dataSize, 1);

    //  Divide the data up into tasks

    concurrent_queue<Task> tasks;
    for (size_t i = 0; i < theData.size(); i += taskSize)
        tasks.push(Task(i, std::min(i + taskSize, theData.size()) - i));

    //  Start a task for each accelerator

    parallel_for(0, int(accls.size()), [=, &theData, &tasks](const unsigned i)
    {
        Task t;
        while (tasks.try_pop(t))
        {
            array_view<int> work(extent<1>(GetSize(t)), theData.data() + GetStart(t));
            parallel_for_each(accls[i].default_view, extent<1>(GetSize(t)),
                [=](index<1> idx) restrict(amp)
            {
                work[idx] = // ...
            });
            // Wait for the kernel to finish so that synchronize() does not block other work
            accls[i].default_view.wait();
            work.synchronize();
        }
    });

For clarity, the code that writes status updates to the console has been removed. You can see the full source code in Chapter8\main.cpp in the WorkStealingExample() function.
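
The listings above assume an accls vector containing the GPU accelerators to use. One way to build such a vector, sketched below rather than taken from the sample, is to enumerate all accelerators and keep only the ones that are not emulated in software (which excludes the WARP, reference, and CPU accelerators).

#include <amp.h>
#include <algorithm>
#include <vector>

using namespace concurrency;

// Hypothetical helper: returns the physical (non-emulated) accelerators on the machine.
std::vector<accelerator> GetGpuAccelerators()
{
    std::vector<accelerator> accls = accelerator::get_all();
    accls.erase(std::remove_if(accls.begin(), accls.end(),
                               [](const accelerator& a) { return a.is_emulated; }),
                accls.end());
    return accls;
}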

As you can see from the output when running a Debug build, tasks are executed on both the available GPUs, but GPU 1 does the majority of the work. In this case, GPU 0 is also being used by other processes and has a display connected to it.

Queued 20 tasks
 Starting tasks on 1: NVIDIA GeForce GTX 570
 Starting tasks on 0: NVIDIA GeForce GTX 570
  Finished task 0 - 5050 on 1
  Finished task 10100 - 15150 on 1
  Finished task 15150 - 20200 on 1
  Finished task 20200 - 25250 on 1
  Finished task 25250 - 30300 on 1
  Finished task 30300 - 35350 on 1
  Finished task 5050 - 10100 on 0
  Finished task 35350 - 40400 on 1
  Finished task 40400 - 45450 on 0
  Finished task 45450 - 50500 on 1
  Finished task 50500 - 55550 on 0
  Finished task 55550 - 60600 on 1
  Finished task 60600 - 65650 on 0
  Finished task 65650 - 70700 on 1
  Finished task 70700 - 75750 on 0
  Finished task 75750 - 80800 on 1
  Finished task 80800 - 85850 on 0
  Finished task 85850 - 90900 on 1
  Finished task 90900 - 95950 on 0
  Finished task 95950 - 101000 on 1
 Finished 7 tasks on 0
 Finished 13 tasks on 1

You might be considering using the WARP accelerator in conjunction with the GPUs to add another accelerator and thereby improve overall performance. In most cases this will not result in any significant improvement. First, the WARP accelerator usually achieves only a small fraction of the performance of a physical GPU, so the extra CPU overhead and code complexity required to coordinate it rarely produce any overall gain. Second, the CPU is already being used to coordinate the task parallelism that controls the GPUs and to copy data to and from them. Running a WARP workload on the CPU alongside a physical GPU might therefore degrade performance rather than improve it, because the WARP accelerator consumes CPU resources that would otherwise be used to distribute work to the GPUs. Therefore, it’s recommended that your application use WARP only as a fallback CPU solution when no GPU accelerators are available.
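
If you follow that recommendation, the fallback can be expressed directly when you select an accelerator. The helper below is a hypothetical sketch, assuming that emulated accelerators (WARP, the reference rasterizer, and the CPU accelerator) report is_emulated as true.

#include <amp.h>

using namespace concurrency;

// Hypothetical helper: prefer a physical GPU, fall back to WARP when none is available.
accelerator PickAccelerator()
{
    for (const accelerator& a : accelerator::get_all())
    {
        if (!a.is_emulated)
            return a;                                    // first physical GPU found
    }
    return accelerator(accelerator::direct3d_warp);      // software fallback on the CPU
}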