One simple way to think about GPU kernels is in terms of two numbers: how many operations are performed, and how much data is moved. We call the ratio of these two numbers the "arithmetic intensity" of the kernel: the number of operations performed per byte of data moved.
The benefit of this idea is that a kernel's arithmetic intensity can be compared to the arithmetic intensity of the GPU hardware itself, defined as the ratio of its theoretical floating-point throughput to its memory throughput. For example, using the specs for a Titan V GPU (about 14.9 TFLOP/s of single-precision compute and 653 GB/s of memory bandwidth), we have:

14.9 TFLOP/s ÷ 653 GB/s ≈ 23 FLOPs per byte
When a kernel's arithmetic intensity is lower than the hardware's, the kernel is memory-bound: the memory bus cannot deliver data fast enough to keep the compute units busy, so performance is limited by memory throughput.

Similarly, when a kernel's arithmetic intensity is higher than the hardware's, the kernel is compute-bound: data arrives faster than it can be processed, so performance is limited by floating-point throughput.
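As a concrete sketch, the comparison can be written in a few lines of Python. The hardware numbers below are the approximate published Titan V specs used above; the function names are just illustrative.

```python
# Classify a kernel as memory- or compute-bound by comparing its
# arithmetic intensity to the hardware's.
# Hardware figures are approximate published Titan V specs.
TITAN_V_FLOPS = 14.9e12      # ~14.9 TFLOP/s single precision
TITAN_V_BYTES_PER_S = 653e9  # ~653 GB/s memory bandwidth

def arithmetic_intensity(ops, bytes_moved):
    """Operations performed per byte of data moved."""
    return ops / bytes_moved

def classify(ops, bytes_moved,
             peak_flops=TITAN_V_FLOPS, peak_bw=TITAN_V_BYTES_PER_S):
    hw_intensity = peak_flops / peak_bw          # ~23 FLOPs/byte for Titan V
    kernel_intensity = arithmetic_intensity(ops, bytes_moved)
    return "compute-bound" if kernel_intensity > hw_intensity else "memory-bound"

# axpy moves 12 bytes for every 2 operations, far below 23 FLOPs/byte:
print(classify(ops=2, bytes_moved=12))  # memory-bound
```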
Let's look at some common GPU calculations and estimate their arithmetic intensities. We'll assume that all values in these kernels are 32-bit.
| Kernel | Operations | Memory (bytes) | Arithmetic intensity |
|---|---|---|---|
| axpy (n) | 2n | 12n | 1/6 |
| dot product (n) | 2n | 8n | 1/4 |
| dense mat-vec (n) | 2n² | 4n² | 1/2 |
| sparse mat-vec (nnz) | 2·nnz | 4·nnz | 1/2 |
| dense matrix multiplication (n) | 2n³ | 12n² | n/6 |
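The estimates in the table can be reproduced with a short sketch. The operation and byte counts below are taken directly from the table (32-bit values, so 4 bytes each); the function names are just for illustration.

```python
# Arithmetic intensity (ops per byte) for each kernel in the table,
# as a function of the problem size n (or number of nonzeros, nnz).
def axpy_intensity(n):
    return (2 * n) / (12 * n)        # per element: read x, read y, write y

def dot_intensity(n):
    return (2 * n) / (8 * n)         # per element: read x, read y

def matvec_intensity(n):
    return (2 * n**2) / (4 * n**2)   # matrix traffic dominates

def matmul_intensity(n):
    return (2 * n**3) / (12 * n**2)  # = n / 6, grows with n

for n in (64, 1024, 16384):
    print(n, matmul_intensity(n))    # only matmul's intensity grows with n
```

Note that the first three intensities are constants, while matmul's is n/6.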
The first four of these kernels have arithmetic intensities that are smaller than 1 and practically independent of problem size. As a result, since the Titan V's arithmetic intensity is roughly 23 FLOPs per byte, all four will be memory-bound on that hardware.
However, the matrix multiplication kernel has an arithmetic intensity that grows without bound as the matrix dimension n increases: for large enough matrices, it becomes compute-bound.
One way to interpret the arithmetic intensity model is that the GPU is essentially a processor (with some compute throughput) attached to a memory bus (with some maximum data throughput). From there, kernel execution can be thought of as data moving through the memory bus, which is eventually operated on by the processor.
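This process can be sketched as a tiny discrete-time simulation (with made-up hardware parameters, not tied to any real GPU): each step, the bus delivers some bytes into a buffer, and the processor consumes as many of those bytes as its compute budget allows.

```python
# Toy simulation of a processor attached to a memory bus.
# bw: bytes delivered per step; flops: operations available per step;
# intensity: operations required per byte of kernel data.
def simulate(bw, flops, intensity, steps=1000):
    buffered = 0.0   # bytes delivered but not yet processed
    processed = 0.0  # bytes fully processed
    for _ in range(steps):
        buffered += bw                # bus delivers data
        budget = flops / intensity    # bytes the processor can handle this step
        done = min(buffered, budget)  # can't process data that hasn't arrived
        buffered -= done
        processed += done
    return processed / steps          # steady-state bytes processed per step

# Memory-bound: processor could handle 100 bytes/step, bus delivers only 10.
print(simulate(bw=10, flops=200, intensity=2))  # -> 10.0 (limited by the bus)
# Compute-bound: bus delivers 10 bytes/step, processor keeps up with only 5.
print(simulate(bw=10, flops=10, intensity=2))   # -> 5.0 (limited by compute)
```

Whichever resource saturates first sets the steady-state throughput, which is exactly the behavior the interactive simulation below lets you explore.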
Below is an interactive simulation of this process. Try varying the hardware and recipe (i.e. kernel) specifications to see how the system behaves.
What happens to the steady-state recipe throughput if you increase the device's compute throughput for a memory-bound kernel?
What happens to the steady-state recipe throughput if you increase the device's memory throughput for a compute-bound kernel?
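One way to explore these questions numerically is with the standard roofline expression, throughput = min(peak compute, bandwidth × intensity). The sketch below uses made-up hardware numbers purely for illustration.

```python
# Attainable throughput (ops/s) under the roofline model:
# a kernel runs at whichever limit it hits first.
def attainable(peak_flops, bandwidth, intensity):
    return min(peak_flops, bandwidth * intensity)

# A compute-bound kernel (intensity far above peak_flops / bandwidth):
base = attainable(peak_flops=1e12, bandwidth=1e10, intensity=500)
faster_memory = attainable(peak_flops=1e12, bandwidth=2e10, intensity=500)
print(base == faster_memory)  # True: the extra bandwidth goes unused
```

For a memory-bound kernel (low intensity), the same function shows throughput scaling directly with bandwidth instead.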