Overview and C++ AMP Approach

Why GPGPU? What Is Heterogeneous Computing?

As developers, we are used to adjusting to a changing world. Our industry changes the world almost as a matter of routine. We learn new languages, adopt new methodologies, start using new user interface paradigms, and take for granted that it will always be possible to make our programs better. When it seems we will “hit a wall” following one path to making version n+1 better than version n, we find another path. The newest path some developers are about to follow is the path of heterogeneous computing.

In this chapter you’ll review some of the history of performance improvements to see what wall some developers are facing. You’ll learn the basic differences between a CPU and a GPU, two of the possible components of a heterogeneous computing solution, and what kinds of problems are suitable for acceleration using these parallel techniques. Then you’ll review the CPU and GPU parallel techniques in use today, followed by an introduction to the concepts behind C++ AMP, to lay the groundwork for the details in the subsequent chapters.

History of Performance Improvements

In the mid-seventies, computers intended for use by a single person were not the norm. The phrase “personal computer” dates back only to 1975. Over the decades that followed, the idea of a computer on every desk changed from an ambitious and perhaps impossible goal to something pretty ordinary. In fact, many desks today have more than one computer, and what’s more, so do many living rooms. A lot of people even carry a small computer in their pocket, in the form of a smartphone. For the first 30 years of that expansive growth, computers didn’t just get cheaper and more popular—they also got faster. Each year, manufacturers released chips that had a higher clock speed, more cache, and better performance. Developers got in the habit of adding features and capabilities to their software. When those additions made the software run more slowly, the developers didn’t worry much; in six months to a year, faster machines would be available and the software would again become fast and responsive. This was the so-called “free lunch” enabled by ever-improving hardware performance. Eventually, performance on the level of gigaFLOPS—billions of floating-point operations per second—became attainable and affordable.

Unfortunately, this “free lunch” came to an end in about 2005. Manufacturers continued to increase the number of transistors that could be placed on a single chip, but physical limitations—such as dissipating the heat from the chip—meant that clock speeds could no longer continue to increase. Yet the market, as always, wanted more powerful machines. To meet that demand, manufacturers began to ship multicore machines, with two, four, or more CPUs in a single computer. “One user, one CPU” had once been a lofty goal, but after the free lunch era, users called for more than just one CPU core, first in desktop machines, then in laptops, and eventually in smartphones as well. Over the past five or six years, it’s become common to find a parallel supercomputer on every desk, in every living room, and in everyone’s pocket.

But simply adding cores didn’t make everything faster. Software can be roughly divided into two main groups: parallel-aware and parallel-unaware. The parallel-unaware software typically uses only half, a quarter, or an eighth of the cores available. It churns away on a single core, missing the opportunity to get faster every time users get a new machine with more cores. Developers who have learned how to write software that gets faster as more CPU cores become available achieve close to linear speedups; in other words, a speed improvement that comes close to the number of cores on the machine—almost double for dual-core machines, almost four times for four-core machines, and so on. Knowledgeable consumers might wonder why some developers are ignoring the extra performance that could be available to their applications.

Heterogeneous Platforms

Over the same five- or six-year period that saw the rise of multicore machines with more than one CPU core, the graphics cards in most machines were changing as well. Rather than two or four cores, GPUs were being developed with dozens, or even hundreds, of cores. These cores are very different from those in a CPU. They were originally developed to improve the speed of graphics-related computations, such as determining the color of a particular pixel on the screen. GPUs can do that kind of work faster than a CPU, and because modern graphics cards contain so many of these cores, massive parallelism is possible. Of course, the idea of harnessing a GPU for numerical calculations unrelated to graphics quickly became irresistible. A machine with a mix of CPU and GPU cores, whether on the same chip or not, or even a cluster of machines offering such a mix, is a heterogeneous supercomputer. Clearly, we are headed toward a heterogeneous supercomputer on every desk, in every living room, and in every pocket.

A typical CPU in early 2012 has four cores, is double hyper-threaded, and has about a billion transistors. A top-end CPU can achieve, at peak, about 0.1 TFLOPS, or 100 GFLOPS, doing double-precision calculations. A typical GPU in early 2012 has 32 cores, is 32x-threaded, and has roughly twice as many transistors as the CPU. A top-end GPU can achieve 3 TFLOPS—some 30 times the peak compute speed of the CPU—doing single-precision calculations.

The reason the GPU achieves a higher compute speed lies in differences other than the number of transistors or even the number of cores. A CPU has low memory bandwidth—about 20 gigabytes per second (GB/s)—compared to the GPU’s 150 GB/s. The CPU supports general code with multitasking, I/O, virtualization, deep execution pipelines, and random accesses. In contrast, the GPU is designed for graphics and data-parallel code with programmable and fixed-function processors, shallow execution pipelines, and sequential accesses. The GPU’s speed improvements, in fact, are available only on tasks for which the GPU is designed, not on general-purpose tasks. Possibly even more important than speed is the GPU’s lower power consumption: a CPU can do about 1 GFLOPS per watt, whereas the GPU can do about 10 GFLOPS per watt.

In many applications, the power required to perform a particular calculation might be more important than the time it takes. Handheld devices such as smartphones are battery-powered, so users often wisely choose to replace applications that drain the battery too quickly with more battery-friendly alternatives. Power consumption can also be an issue for laptops, whose users might expect all-day battery life while running applications that perform significant calculations. It’s becoming normal to expect multiple CPU cores even on small devices like smartphones—and to expect those devices to have a GPU. Some devices can power individual cores up and down to extend battery life. In that kind of environment, moving some of your calculation to the GPU might mean the difference between “that app I can’t use away from the office because it just eats battery” and “that app I can’t live without.” At the other end of the spectrum, the cost of running a data center is overwhelmingly the cost of supplying power to it. A 20 percent saving on the watts required to perform a large calculation in a data center or the cloud translates directly into bottom-line savings on a significant energy bill.

Then there is the matter of the memory accessed by these cores. Cache size can outweigh clock speed when it comes to compute speed, so the CPU has a large cache to make sure that there is always data ready to be processed, and the core will rarely have to wait while data is fetched. It’s normal for CPU operations to touch the same data repeatedly, giving a real benefit to caching approaches. In contrast, GPUs have smaller caches but use a massive number of threads, so some threads are always in a position to do work. GPUs can prefetch data to hide memory latency, but because that data is likely to be accessed only once, caching provides less benefit and is less necessary. For this approach to help, you ideally have a huge quantity of data and a fairly simple calculation that operates on the data.

Perhaps the most important difference of all lies in how developers program the two technologies. Many mainstream languages and tools exist for CPU programming. For power and performance, C++ is the number one choice, providing abstractions and powerful libraries without giving up control. For general-purpose GPU programming (GPGPU), the choices are far more restricted and in most cases involve a niche or exotic programming model. This restriction has meant that—until now—only a handful of fields and problems have been able to capitalize on the power of the GPU to tackle their compute-intensive number-crunching, and it has also meant that mainstream developers have avoided learning how to interact with the GPU. Developers need a way to increase the speed of their applications or to reduce the power consumption of a particular calculation. Today that might come from using the GPU. An ideal solution sets developers up to get those benefits now by using the GPU and later by using other forms of heterogeneous computation.

GPU Architecture

As mentioned earlier, GPUs have shallow execution pipelines, small cache, and a massive number of threads performing sequential accesses. These threads are not all independent; they are arranged in groups. These groups are called warps on NVIDIA hardware and wavefronts on AMD hardware. In this book, they are referred to as “warps.” Warps run together and can share memory and cooperate. Local memory can be read in as little as four clock cycles, while the larger (up to four GB) global memory might take 400–600 cycles. If a group of threads is blocked while reading, another group of threads executes. The GPU can switch these groups of threads extremely fast. Memory is read in a way that provides huge speed advantages when adjacent threads use adjacent memory locations. But if some threads in a group are accessing memory that is nowhere near the memory being accessed by other threads in that group, performance will suffer.

There are significant differences between CPU and GPU architectures. Developers using higher-level languages have generally been able to ignore CPU architecture. Lower-level tools such as operating systems and optimizing compilers must have that kind of architectural knowledge, but the compiler and the operating system insulate many “ordinary” applications from hardware details. Best practices or rules of thumb that you might hold as self-evident do not always hold; even on the CPU, a simple integer addition that causes a cache miss might take far longer than a disk read that is satisfied from a nearby buffer cache.

Some developers, finding themselves writing highly performance-sensitive applications, might need to learn just how many instructions can be executed in the time lost to a cache miss or how many clock cycles it takes to read a byte from a file (millions, in many cases). At the moment, this kind of knowledge is unavoidable when working with non-CPU architectures such as the GPU. The layers of protection that compilers and operating systems provide for CPU programming are not entirely in place yet. For example, you might need to know how many threads are in a warp or the size of their shared memory cache. You might arrange your computation so that iterations involve adjacent memory and avoid random accesses. To understand the speedups your application can achieve, you must understand, at least at a conceptual level, the way the hardware is organized.

Candidates for Performance Improvement through Parallelism

The GPU works best on problems that are data-parallel. Sometimes it’s obvious how to split one large problem up into many small problems that a processor can work on independently and in parallel. Take matrix addition, for example: each element in the answer matrix can be calculated entirely independently of the others. Adding a pair of 100 x 100 matrices will take 10,000 additions, but if you could split it among 10,000 threads, all the additions could be done at once. Matrix addition is naturally data-parallel.
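To make this concrete, here is a minimal sketch of data-parallel matrix addition written with C++ AMP, assuming float elements stored row-major in std::vector; the function name and parameter names are illustrative only.

```cpp
#include <amp.h>
#include <vector>
using namespace concurrency;

// One logical GPU thread per element; every addition is independent,
// so the whole matrix sum is naturally data-parallel.
void AddMatrices(const std::vector<float>& a, const std::vector<float>& b,
                 std::vector<float>& sum, int rows, int cols)
{
    array_view<const float, 2> av(rows, cols, a);   // wrap CPU data for the accelerator
    array_view<const float, 2> bv(rows, cols, b);
    array_view<float, 2> sumv(rows, cols, sum);
    sumv.discard_data();                            // old contents of sum need not be copied over

    parallel_for_each(sumv.extent, [=](index<2> idx) restrict(amp)
    {
        sumv[idx] = av[idx] + bv[idx];              // runs once per element, in parallel
    });
    sumv.synchronize();                             // copy the results back into sum
}
```

Each of the rows × cols lambda invocations corresponds to one of the independent additions described above; the array_view and parallel_for_each constructs used here are part of the C++ AMP library introduced in the following chapters.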

In other cases, you need to design your algorithm differently to create work that can be split across independent threads. Consider the problem of finding the highest value in a large collection of numbers. You could traverse the list one element at a time, comparing each element to the “currently highest” value and updating the “currently highest” value each time you come across a larger one. If 10,000 items are in the collection, this will take 10,000 comparisons. Alternatively, you could create some number of threads and give each thread a piece of the collection to work on. 100 threads could take on 100 items each, and each thread would determine the highest value in its portion of the collection. That way you could evaluate every number in the time it takes to do just 100 comparisons. Finally, a 101st thread could compare the 100 “local highest” numbers—one from each thread—to establish the overall highest value. By tweaking the number of threads and thus the number of comparisons each thread makes, you can minimize the elapsed time to find the highest value in the collection. When the comparisons are much more expensive than the overhead of making threads, you might take an extreme approach: 5,000 threads each compare two values, then 2,500 threads each compare the winners of the first round, 1,250 threads compare the winners of the second round, and so on. Using this approach, you’d find the highest value in just 14 rounds—the elapsed time of 14 comparisons, plus the overhead. This “tournament” approach can also work for other operations, such as adding all the values in a collection, counting how many values are in a specified range, and so on. The term reduction is often used for the class of problems that seek a single number (the total, minimum, maximum, or the like) from a large data set.
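To make the chunked approach concrete, here is a CPU-only sketch using standard C++ threads; the default chunk count of 100 mirrors the example above, the function name is illustrative, and the data is assumed to be non-empty.

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Find the highest value by giving each thread its own slice of the data,
// then comparing the per-thread "local highest" results in a final pass.
int FindMax(const std::vector<int>& data, int numChunks = 100)
{
    const size_t n = data.size();                          // assumed non-empty
    numChunks = std::min<int>(numChunks, static_cast<int>(n));

    std::vector<int> localMax(numChunks);
    std::vector<std::thread> workers;
    for (int c = 0; c < numChunks; ++c)
    {
        workers.emplace_back([&, c]
        {
            // Each thread scans only its own portion of the collection.
            const size_t begin = c * n / numChunks;
            const size_t end = (c + 1) * n / numChunks;
            localMax[c] = *std::max_element(data.begin() + begin, data.begin() + end);
        });
    }
    for (auto& t : workers) t.join();

    // The final round compares one "local highest" value per thread.
    return *std::max_element(localMax.begin(), localMax.end());
}
```

The same divide-and-combine structure underlies the tournament-style reduction, with each round halving the number of values that remain to be compared.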

It turns out that almost any problem involving large quantities of data is a natural candidate for parallel processing. Some of the first fields to take this approach include the following:

  • Scientific modeling and simulation: Physics, biology, biochemistry, and similar fields use simple equations to model immensely complicated situations with massive quantities of data. The more data included in the calculation, the more accurate the simulation. Testing theories in a simulation is feasible only if the simulation can be run in a reasonable amount of time.
  • Real-time control systems: Combining data from myriad sensors, determining where operation is out of range, and adjusting controls to restore optimal operation are high-stakes processes. Fire, explosion, expensive shutdowns, and even loss of life are what the software is working to avoid. Usually the number of sensors being read is limited by the time it takes to make the calculations.
  • Financial tracking, simulation, and prediction: Highly complicated calculations often require a great deal of data to establish trends or identify gaps and opportunities for profit. The opportunities must be identified while they still exist, putting a firm upper limit on the time available for the calculation.
  • Gaming: Most games are essentially a simulation of the real world or a carefully modified world with different laws of physics. The more data you can include in the physics calculations, the more believable the game is—yet performance simply cannot lag.
  • Image processing: Whether detecting abnormalities in medical images, recognizing faces on security camera footage, confirming fingerprint matches, or performing any of dozens of similar tasks, you want to avoid both false negatives and false positives, and the time available to do the work is limited.

In these fields, when you achieve a 10x speedup in the application that is crunching the numbers, you gain one of two abilities. In the simplest case, you can now include more data in the calculations without the calculations taking longer. This generally means that the results will be more accurate or that end users of the application can have more confidence in their decisions. Where things really get interesting is when the speedup makes possible things that were impossible before. For example, if you can perform a 20-hour financial calculation in just two hours, you can do that work overnight while the markets are closed, and people can take action in the morning based on the results of that calculation. Now, what if you were to achieve a 100x speedup? A calculation that formerly required 1,000 hours—over 40 days—is likely to be based on stale data by the time it completes. However, if that same calculation takes only 10 hours—overnight—the results are much more likely to still be meaningful.

Time windows aren’t just a feature of financial software—they apply to security scanning, medical imaging, and much more, including a rather scary set of applications in password cracking and data mining. If it took 40 days to crack your password by brute force and you changed it every 30 days, your password was safe. But what happens when the cracking operation takes only 10 hours?

A 10x speedup is relatively simple to achieve, but a 100x speedup is much harder. It’s not that the GPU can’t do it—the problem is the contribution of the nonparallelizable parts of the application. Consider three applications. Each takes 100 arbitrary units of time to perform a task. In one, the nonparallelizable parts (say, sending a report to a printer) take up 25 percent of the total time. In another, they require only 1 percent, and in the third, only 0.1 percent. What happens as you speed up the parallelizable part of each of these applications?

| Speedup of parallel part      | App1  | App2   | App3    |
| ----------------------------- | ----- | ------ | ------- |
| % sequential                  | 25%   | 1%     | 0.1%    |
| Original: sequential time     | 25    | 1      | 0.1     |
| Original: parallel time       | 75    | 99     | 99.9    |
| Original: total time          | 100   | 100    | 100     |
| 10x: sequential time          | 25    | 1      | 0.1     |
| 10x: parallel time            | 7.5   | 9.9    | 9.99    |
| 10x: total time               | 32.5  | 10.9   | 10.09   |
| 10x: speedup                  | 3.08  | 9.17   | 9.91    |
| 100x: sequential time         | 25    | 1      | 0.1     |
| 100x: parallel time           | 0.75  | 0.99   | 0.999   |
| 100x: total time              | 25.75 | 1.99   | 1.099   |
| 100x: speedup                 | 3.88  | 50.25  | 90.99   |
| Infinite: sequential time     | 25    | 1      | 0.1     |
| Infinite: parallel time       | 0     | 0      | 0       |
| Infinite: total time          | 25    | 1      | 0.1     |
| Infinite: speedup             | 4.00  | 100.00 | 1000.00 |

With a 10x speedup in the parallel part, the first application now spends much more time in the sequential part than in the parallelizable part. The overall speedup is a little more than 3x. Finding a 100x speedup in the parallel part doesn’t help much because of the enormous contribution of the sequential part. Even an infinite speedup, reducing the time in the parallel part to zero, can’t erase the sequential part and limits the overall speedup to 4x. The other two applications fare better with the 10x speedup, but the second app can’t benefit from all of the 100x speedup, gaining only 50x overall. Even with an infinite speedup, the second app is limited to 100x overall.

This seeming paradox—that the contribution of the sequential part, no matter how small a fraction it is at first, will eventually be the final determiner of the possible speedup—is known as Amdahl’s Law. It doesn’t mean that 100x speedup isn’t possible, but it does mean that choosing algorithms to minimize the nonparallelizable part of the time spent is very important for maximum improvement. In addition, choosing a data-parallel algorithm that opens the door to using the GPGPU to speed up the application might result in more overall benefit than choosing a very fast and efficient algorithm that is highly sequential and cannot be parallelized. The right decision for a problem with a million data points might not be the right decision for a problem with 100 million data points.
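
Expressed as a formula, if the sequential fraction is s and the parallel fraction (1 - s) is sped up by a factor of x, the overall speedup is 1 / (s + (1 - s) / x). The short sketch below, with illustrative names, reproduces the speedup rows of the table above.

```cpp
#include <cstdio>

// Amdahl's Law: overall speedup when only the parallel fraction (1 - s)
// of the work is accelerated by a factor of x.
double OverallSpeedup(double s, double x)
{
    return 1.0 / (s + (1.0 - s) / x);
}

int main()
{
    const double sequentialFraction[] = { 0.25, 0.01, 0.001 };  // App1, App2, App3
    const double parallelSpeedup[]    = { 10.0, 100.0 };
    for (double s : sequentialFraction)
        for (double x : parallelSpeedup)
            std::printf("s = %.3f, x = %3.0f -> overall speedup %.2fx\n",
                        s, x, OverallSpeedup(s, x));
    return 0;
}
```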