One simple way to think about GPU kernels is in terms of two numbers: how many operations are performed, and how much data is moved. We call the ratio of these two numbers the "arithmetic intensity" of the kernel: the number of operations performed per byte of data moved.
The benefit of this idea is that a kernel's arithmetic intensity can be compared to the arithmetic intensity of the GPU hardware itself, defined as the ratio of its theoretical floating-point throughput to its memory throughput. For example, using the specs for a Titan V GPU (about 14.9 TFLOP/s of single-precision compute and 653 GB/s of memory bandwidth), we have:

14.9 TFLOP/s ÷ 653 GB/s ≈ 23 FLOPs per byte
When a kernel's arithmetic intensity is lower than the hardware's, the kernel is memory-bound: the memory bus cannot deliver data fast enough to keep the compute units busy, so performance is limited by memory throughput.

Similarly, when a kernel's arithmetic intensity is higher than the hardware's, the kernel is compute-bound: data arrives faster than it can be processed, so performance is limited by floating-point throughput.
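As a concrete sketch, the comparison can be written in a few lines of Python. The hardware numbers below are the approximate published Titan V specs used above; the function names are just illustrative.

```python
# Classify a kernel as memory- or compute-bound by comparing its
# arithmetic intensity to the hardware's.
# Hardware figures are approximate published Titan V specs.
TITAN_V_FLOPS = 14.9e12      # ~14.9 TFLOP/s single precision
TITAN_V_BYTES_PER_S = 653e9  # ~653 GB/s memory bandwidth

def arithmetic_intensity(ops, bytes_moved):
    """Operations performed per byte of data moved."""
    return ops / bytes_moved

def classify(ops, bytes_moved,
             peak_flops=TITAN_V_FLOPS, peak_bw=TITAN_V_BYTES_PER_S):
    hw_intensity = peak_flops / peak_bw          # ~23 FLOPs/byte for Titan V
    kernel_intensity = arithmetic_intensity(ops, bytes_moved)
    return "compute-bound" if kernel_intensity > hw_intensity else "memory-bound"

# axpy moves 12 bytes for every 2 operations, far below 23 FLOPs/byte:
print(classify(ops=2, bytes_moved=12))  # memory-bound
```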
Let's look at some common GPU calculations and estimate their arithmetic intensities. We'll assume that all values in these kernels are 32-bit.
| Kernel | Operations | Memory (bytes) | Arithmetic intensity |
|---|---|---|---|
| axpy (n) | 2n | 12n | 1/6 |
| dot product (n) | 2n | 8n | 1/4 |
| dense mat-vec (n) | 2n² | 4n² | 1/2 |
| sparse mat-vec (nnz) | 2·nnz | 4·nnz | 1/2 |
| dense matrix multiplication (n) | 2n³ | 12n² | n/6 |
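The estimates in the table can be reproduced with a short sketch. The operation and byte counts below are taken directly from the table (32-bit values, so 4 bytes each); the function names are just for illustration.

```python
# Arithmetic intensity (ops per byte) for each kernel in the table,
# as a function of the problem size n (or number of nonzeros, nnz).
def axpy_intensity(n):
    return (2 * n) / (12 * n)        # per element: read x, read y, write y

def dot_intensity(n):
    return (2 * n) / (8 * n)         # per element: read x, read y

def matvec_intensity(n):
    return (2 * n**2) / (4 * n**2)   # matrix traffic dominates

def matmul_intensity(n):
    return (2 * n**3) / (12 * n**2)  # = n / 6, grows with n

for n in (64, 1024, 16384):
    print(n, matmul_intensity(n))    # only matmul's intensity grows with n
```

Note that the first three intensities are constants, while matmul's is n/6.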
The first four of these kernels have arithmetic intensities that are smaller than 1 and practically independent of problem size. As a result, since the Titan V's arithmetic intensity is roughly 23 FLOPs per byte, all four will be memory-bound on that hardware.
However, the matrix multiplication kernel has an arithmetic intensity that grows without bound as the matrix dimension n increases: for large enough matrices, it becomes compute-bound.
One way to interpret the arithmetic intensity model is that the GPU is essentially a processor (with some compute throughput) attached to a memory bus (with some maximum data throughput). From there, kernel execution can be thought of as data moving through the memory bus, which is eventually operated on by the processor.
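This process can be sketched as a tiny discrete-time simulation (with made-up hardware parameters, not tied to any real GPU): each step, the bus delivers some bytes into a buffer, and the processor consumes as many of those bytes as its compute budget allows.

```python
# Toy simulation of a processor attached to a memory bus.
# bw: bytes delivered per step; flops: operations available per step;
# intensity: operations required per byte of kernel data.
def simulate(bw, flops, intensity, steps=1000):
    buffered = 0.0   # bytes delivered but not yet processed
    processed = 0.0  # bytes fully processed
    for _ in range(steps):
        buffered += bw                # bus delivers data
        budget = flops / intensity    # bytes the processor can handle this step
        done = min(buffered, budget)  # can't process data that hasn't arrived
        buffered -= done
        processed += done
    return processed / steps          # steady-state bytes processed per step

# Memory-bound: processor could handle 100 bytes/step, bus delivers only 10.
print(simulate(bw=10, flops=200, intensity=2))  # -> 10.0 (limited by the bus)
# Compute-bound: bus delivers 10 bytes/step, processor keeps up with only 5.
print(simulate(bw=10, flops=10, intensity=2))   # -> 5.0 (limited by compute)
```

Whichever resource saturates first sets the steady-state throughput, which is exactly the behavior the interactive simulation below lets you explore.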
Below is an interactive simulation of this process. Try varying the hardware and recipe (i.e. kernel) specifications to see how the system behaves.
What happens to the steady-state recipe throughput if you increase the device's compute throughput for a memory-bound kernel?
What happens to the steady-state recipe throughput if you increase the device's memory throughput for a compute-bound kernel?
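One way to explore these questions numerically is with the standard roofline expression, throughput = min(peak compute, bandwidth × intensity). The sketch below uses made-up hardware numbers purely for illustration.

```python
# Attainable throughput (ops/s) under the roofline model:
# a kernel runs at whichever limit it hits first.
def attainable(peak_flops, bandwidth, intensity):
    return min(peak_flops, bandwidth * intensity)

# A compute-bound kernel (intensity far above peak_flops / bandwidth):
base = attainable(peak_flops=1e12, bandwidth=1e10, intensity=500)
faster_memory = attainable(peak_flops=1e12, bandwidth=2e10, intensity=500)
print(base == faster_memory)  # True: the extra bandwidth goes unused
```

For a memory-bound kernel (low intensity), the same function shows throughput scaling directly with bandwidth instead.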