
Arithmetic Intensity

One simple way to think about GPU kernels is in terms of two numbers: how many operations are performed, and how much data is moved. We call the ratio of these two numbers the "arithmetic intensity" of the kernel, α, measured in ops/byte.

The benefit of this idea is that the arithmetic intensity of the kernel can be compared to the arithmetic intensity of GPU hardware itself (which is defined by the ratio of its theoretical floating point throughput and memory throughput). For example, using the specs for a Titan V GPU, we have:

α_gpu := (10 TFLOP/s) / (650 GB/s) ≈ 15.4 FLOP/byte

When α_kernel ≪ α_gpu, the GPU is capable of processing data faster than it arrives. As a result, we expect the kernel to be bottlenecked by the memory subsystem on this GPU, and we say that the kernel is "memory-bound".

Similarly, when α_kernel ≫ α_gpu, data arrives faster than the GPU can process it. This leads to high compute utilization but low memory throughput, and we say that the kernel is "compute-bound".
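This comparison can be sketched in a few lines of Python (the function names are mine, and the hardware numbers are the Titan V figures from above):

```python
# Classify a kernel as memory- or compute-bound by comparing its
# arithmetic intensity to the hardware's balance point.

def arithmetic_intensity(ops, bytes_moved):
    """Arithmetic intensity in ops/byte."""
    return ops / bytes_moved

def classify(alpha_kernel, alpha_gpu):
    """Label a kernel relative to the hardware's arithmetic intensity."""
    return "memory-bound" if alpha_kernel < alpha_gpu else "compute-bound"

# Titan V figures from the text: ~10 TFLOP/s and ~650 GB/s.
alpha_gpu = 10e12 / 650e9   # ≈ 15.4 FLOP/byte

# axpy on n values: 2n operations, 12n bytes moved.
n = 1_000_000
alpha_axpy = arithmetic_intensity(2 * n, 12 * n)  # = 1/6
print(classify(alpha_axpy, alpha_gpu))  # memory-bound
```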

 

Kernel Examples

Let's look at some common GPU calculations and estimate their arithmetic intensities. We'll assume that all values in these kernels are 32-bit.

| Kernel | Operations | Memory (bytes) | α_kernel (op/byte) |
|---|---|---|---|
| axpy (n) | 2n | 12n | 1/6 |
| dot product (n) | 2n | 8n | 1/4 |
| dense mat-vec (n) | 2n² | 4(n² + 2n) | 1/(2 + 4/n) |
| sparse mat-vec (β = nnz/row) | 2βn | 4(2β + 2)n | 1/(4 + 4/β) |
| dense matrix multiplication (n) | 2n³ | 12n² | n/6 |
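As a sanity check, the α column can be recomputed from the other two columns. Here is a small Python sketch using exact rational arithmetic (the particular n and β values are arbitrary):

```python
from fractions import Fraction

# Recompute each kernel's arithmetic intensity as ops / bytes,
# using the operation and memory counts from the table.
n = Fraction(1000)
beta = Fraction(10)   # nonzeros per row for the sparse mat-vec

kernels = {
    "axpy":          (2 * n,          12 * n),
    "dot product":   (2 * n,          8 * n),
    "dense mat-vec": (2 * n**2,       4 * (n**2 + 2 * n)),
    "sparse mat-vec": (2 * beta * n,  4 * (2 * beta + 2) * n),
    "dense matmul":  (2 * n**3,       12 * n**2),
}

alphas = {name: ops / mem for name, (ops, mem) in kernels.items()}

assert alphas["axpy"] == Fraction(1, 6)
assert alphas["dot product"] == Fraction(1, 4)
assert alphas["dense mat-vec"] == 1 / (2 + 4 / n)
assert alphas["sparse mat-vec"] == 1 / (4 + 4 / beta)
assert alphas["dense matmul"] == n / 6
```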

The first four of these kernels have arithmetic intensities that are less than 1 and practically independent of problem size. Since the Titan V's arithmetic intensity is α_gpu ≈ 15, those four kernels are expected to be strongly memory-bound on that hardware.

However, the matrix multiplication kernel has an arithmetic intensity that grows without bound as n gets larger. This means that when n is small the calculation will be memory-bound, but as n grows the arithmetic intensity eventually exceeds α_gpu and the kernel becomes compute-bound.
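For the Titan V numbers above, the crossover point can be found directly: matrix multiplication's intensity is n/6 op/byte, so we want the smallest n with n/6 > α_gpu ≈ 15.4 (a quick Python check):

```python
# Find the smallest matrix size n at which dense matmul's intensity
# (n/6 op/byte) exceeds the GPU's balance point (~15.4 FLOP/byte).
alpha_gpu = 10e12 / 650e9

n = 1
while n / 6 <= alpha_gpu:
    n += 1

print(n)  # → 93: roughly where matmul turns compute-bound in this model
```

Of course, this is only where the simple model predicts the transition; a real kernel needs good tiling and caching to actually reach that intensity.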

 

Interactive Simulation

One way to interpret the arithmetic intensity model is that the GPU is essentially a processor (with some compute throughput) attached to a memory bus (with some maximum data throughput). From there, kernel execution can be thought of as data moving through the memory bus, which is eventually operated on by the processor.
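A minimal sketch of that mental model in Python (the function and parameter names are mine): the kernel's runtime is bounded below by both the time to move its data and the time to execute its operations, so a simple estimate takes the maximum of the two.

```python
# Simple "processor + memory bus" model: runtime is limited by whichever
# of the two subsystems takes longer (perfect overlap is assumed).

def estimated_time(ops, bytes_moved, flops_per_s, bytes_per_s):
    compute_time = ops / flops_per_s
    memory_time = bytes_moved / bytes_per_s
    return max(compute_time, memory_time)

# Titan V-like hardware from the text.
flops, bw = 10e12, 650e9

# axpy with n = 2**20 values: 2n ops, 12n bytes -> memory time dominates.
n = 2**20
t = estimated_time(2 * n, 12 * n, flops, bw)
print(t)
```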

Below is an interactive simulation of this process. Try varying the hardware and recipe (i.e. kernel) specifications to see how the system behaves.