
Problem Statement

This kernel comes from someone on /r/CUDA who writes:

For completeness, the code in that gist is given below, verbatim:


After looking at this code a little bit, here are a few things I notice:

With that in mind, when I run the original code with the assumed problem size mentioned above, I get a baseline runtime of 201us on a Titan V.


Optimizations

When I look at the kernel code, the first thing that jumps out at me is that the data loads are not coalesced (i.e. adjacent threads within a warp are not accessing adjacent memory locations):
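As a rough sketch of what such a kernel can look like (the names, layouts, and squared-distance reduction below are assumptions, not the verbatim gist code):

```cpp
// hedged sketch: one thread per output entry, inputs stored descriptor-major,
// i.e. the descriptor_dim components of each point are contiguous in memory
__global__ void pairwise_distance_naive(const float * xs,   // num_x descriptors
                                        const float * ys,   // num_y descriptors
                                        float * out,        // num_x by num_y outputs
                                        int num_x, int num_y, int descriptor_dim) {
  int j = blockIdx.x * blockDim.x + threadIdx.x;  // output column
  int i = blockIdx.y * blockDim.y + threadIdx.y;  // output row
  if (i < num_x && j < num_y) {
    float sum = 0.0f;
    for (int k = 0; k < descriptor_dim; k++) {
      // adjacent threads in a warp have adjacent j, so their loads from ys
      // are descriptor_dim floats apart -- not coalesced
      float diff = xs[i * descriptor_dim + k] - ys[j * descriptor_dim + k];
      sum += diff * diff;
    }
    out[i * num_y + j] = sum;
  }
}
```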

Depending on the value of descriptor_dim, each thread within a warp could potentially be accessing its own cache line! For small values of descriptor_dim this kernel is memory-bound, which means that its performance is dictated by how quickly it can move data between global memory and the SMs, rather than by how fast it can do arithmetic.

This is a common issue and there are a few ways to address it.

Transpose the Data Layout

Loads are said to be "coalesced" when the stride associated with the thread index is 1. However, in this kernel it's the index within each descriptor that has stride 1, not the index associated with the thread. So, one way to coalesce these loads is to "transpose" the layout of the input arrays:
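As a sketch of that change (same assumed names as before), storing the inputs component-major makes the point index the unit-stride one:

```cpp
// hedged sketch: inputs transposed to component-major layout, so xs_t is
// descriptor_dim x num_x and ys_t is descriptor_dim x num_y
__global__ void pairwise_distance_transposed(const float * xs_t, const float * ys_t, float * out,
                                             int num_x, int num_y, int descriptor_dim) {
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  if (i < num_x && j < num_y) {
    float sum = 0.0f;
    for (int k = 0; k < descriptor_dim; k++) {
      // adjacent threads (adjacent j) now read adjacent addresses from ys_t,
      // so the loads coalesce into a small number of transactions
      float diff = xs_t[k * num_x + i] - ys_t[k * num_y + j];
      sum += diff * diff;
    }
    out[i * num_y + j] = sum;
  }
}
```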

This approach is very simple and brings our runtime down to 52.2us, about a 4x speedup. The downside is that changing the data layout of the input arrays affects how other parts of the code access that data as well.

Transpose the Data in Shared Memory

If you don't want to change the data layout, another option is to load the input arrays into __shared__ memory buffers and perform the transposition inside the kernel.
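Here is a sketch of that approach; the tile size, the MAX_DIM bound on descriptor_dim, and the names are assumptions rather than the post's exact code:

```cpp
constexpr int TILE = 32;     // assumed block shape: TILE x TILE threads
constexpr int MAX_DIM = 64;  // assumed compile-time bound on descriptor_dim

__global__ void pairwise_distance_shared(const float * xs,  // original [point][component] layout
                                         const float * ys,
                                         float * out,
                                         int num_x, int num_y, int descriptor_dim) {
  __shared__ float xs_tile[TILE][MAX_DIM];
  __shared__ float ys_tile[TILE][MAX_DIM];

  int i0 = blockIdx.y * TILE;  // first output row handled by this block
  int j0 = blockIdx.x * TILE;  // first output column handled by this block

  // cooperatively stage TILE descriptors from each input in shared memory;
  // consecutive threads copy consecutive floats, so these global loads coalesce
  int tid = threadIdx.y * TILE + threadIdx.x;
  for (int n = tid; n < TILE * descriptor_dim; n += TILE * TILE) {
    int p = n / descriptor_dim;
    int k = n % descriptor_dim;
    xs_tile[p][k] = (i0 + p < num_x) ? xs[(i0 + p) * descriptor_dim + k] : 0.0f;
    ys_tile[p][k] = (j0 + p < num_y) ? ys[(j0 + p) * descriptor_dim + k] : 0.0f;
  }
  __syncthreads();  // the tiles must be fully written before anyone reads them

  int i = i0 + threadIdx.y;
  int j = j0 + threadIdx.x;
  if (i < num_x && j < num_y) {
    float sum = 0.0f;
    for (int k = 0; k < descriptor_dim; k++) {
      // note: ys_tile[threadIdx.x][k] is a strided access across the warp
      float diff = xs_tile[threadIdx.y][k] - ys_tile[threadIdx.x][k];
      sum += diff * diff;
    }
    out[i * num_y + j] = sum;
  }
}
```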

In this version, we start by allocating shared memory buffers for local "tiles" of values to process. Adjacent threads within a warp access global memory with unit stride when copying the data to the shared buffer, so this transfer is coalesced.

The __syncthreads() prevents threads from accessing shared memory locations before the values have been written.

This version of the kernel takes 70.6us to finish, a significant speedup over the original, but slower than explicitly transposing the data layout. Part of the reason is that, while this version fixes the strided access to global memory, we now have strided access to shared memory! Depending on the value of descriptor_dim, this can cause bank conflicts, which reduce throughput to shared memory.

We can avoid the bank conflicts by transposing the layout of the shared memory buffer as follows:
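A sketch of that change, reusing TILE and MAX_DIM from the previous sketch (again, names and sizes are assumptions):

```cpp
__global__ void pairwise_distance_shared_v2(const float * xs, const float * ys, float * out,
                                            int num_x, int num_y, int descriptor_dim) {
  __shared__ float xs_tile[TILE][MAX_DIM];
  __shared__ float ys_tile_T[MAX_DIM][TILE];  // transposed: [component][point]

  int i0 = blockIdx.y * TILE;
  int j0 = blockIdx.x * TILE;

  int tid = threadIdx.y * TILE + threadIdx.x;
  for (int n = tid; n < TILE * descriptor_dim; n += TILE * TILE) {
    int p = n / descriptor_dim;
    int k = n % descriptor_dim;
    xs_tile[p][k]   = (i0 + p < num_x) ? xs[(i0 + p) * descriptor_dim + k] : 0.0f;
    ys_tile_T[k][p] = (j0 + p < num_y) ? ys[(j0 + p) * descriptor_dim + k] : 0.0f;
  }
  __syncthreads();

  int i = i0 + threadIdx.y;
  int j = j0 + threadIdx.x;
  if (i < num_x && j < num_y) {
    float sum = 0.0f;
    for (int k = 0; k < descriptor_dim; k++) {
      // adjacent threads now read adjacent addresses from ys_tile_T (no bank
      // conflicts), while the xs_tile read is a broadcast of a single address
      float diff = xs_tile[threadIdx.y][k] - ys_tile_T[k][threadIdx.x];
      sum += diff * diff;
    }
    out[i * num_y + j] = sum;
  }
}
```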

This small change brings the runtime down to 45.1us, the best total speedup so far.

A question for the reader: why did I transpose the layout of one of the shared memory buffers, but not the other?

Bytes In Flight

It's important for memory-bound kernels to saturate the memory bus as much as practical to maximize throughput. One way to do this is to increase the amount of data processed by an individual thread, so that it can issue multiple memory transactions before stalling.

With this in mind, here's a version of the kernel where each thread is responsible for writing a small rectangular chunk of the output array as opposed to just 1 entry:
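The version in the post also uses vectorized loads (discussed below); the sketch here shows just the chunking and register reuse, with each thread accumulating a 2x2 set of output entries. The chunk shape, thread mapping, and names are assumptions, and TILE and MAX_DIM are reused from the earlier sketches:

```cpp
// each block computes a (2*TILE) x (2*TILE) chunk of the output
__global__ void pairwise_distance_chunked(const float * xs, const float * ys, float * out,
                                          int num_x, int num_y, int descriptor_dim) {
  __shared__ float xs_tile[2 * TILE][MAX_DIM];
  __shared__ float ys_tile_T[MAX_DIM][2 * TILE];

  int i0 = blockIdx.y * (2 * TILE);
  int j0 = blockIdx.x * (2 * TILE);

  // coalesced cooperative copy of 2*TILE descriptors from each input
  int tid = threadIdx.y * TILE + threadIdx.x;
  for (int n = tid; n < 2 * TILE * descriptor_dim; n += TILE * TILE) {
    int p = n / descriptor_dim;
    int k = n % descriptor_dim;
    xs_tile[p][k]   = (i0 + p < num_x) ? xs[(i0 + p) * descriptor_dim + k] : 0.0f;
    ys_tile_T[k][p] = (j0 + p < num_y) ? ys[(j0 + p) * descriptor_dim + k] : 0.0f;
  }
  __syncthreads();

  float sums[2][2] = {};
  for (int k = 0; k < descriptor_dim; k++) {
    // each value read from shared memory is held in a register and reused
    // for two of the four partial sums
    float x0 = xs_tile[threadIdx.y][k];
    float x1 = xs_tile[threadIdx.y + TILE][k];
    float y0 = ys_tile_T[k][threadIdx.x];
    float y1 = ys_tile_T[k][threadIdx.x + TILE];
    sums[0][0] += (x0 - y0) * (x0 - y0);
    sums[0][1] += (x0 - y1) * (x0 - y1);
    sums[1][0] += (x1 - y0) * (x1 - y0);
    sums[1][1] += (x1 - y1) * (x1 - y1);
  }

  // four stores per thread; adjacent threads write adjacent addresses
  for (int a = 0; a < 2; a++) {
    for (int b = 0; b < 2; b++) {
      int i = i0 + threadIdx.y + a * TILE;
      int j = j0 + threadIdx.x + b * TILE;
      if (i < num_x && j < num_y) out[i * num_y + j] = sums[a][b];
    }
  }
}
```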

When the compiler sees accesses to types with 8-byte or 16-byte size and alignment, it can emit optimized vector load instructions that move several values per instruction, keeping more bytes in flight. More information about vectorized loads can be found here. The vec class is a C++ type that satisfies those size and alignment requirements so that the compiler can issue vectorized loads (other types like float4 or double2, which have the appropriate alignment, also work).
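As a rough illustration of such a type (the actual vec class in the post may differ), something like this satisfies the size and alignment requirements:

```cpp
// hedged sketch of an aligned vector type; for power-of-two n the alignas()
// makes loads/stores of the whole struct eligible for the compiler's 8-byte
// or 16-byte vectorized memory instructions
template <typename T, int n>
struct alignas(sizeof(T) * n) vec {
  T values[n];
  __host__ __device__ T & operator[](int i) { return values[i]; }
  __host__ __device__ const T & operator[](int i) const { return values[i]; }
};

// e.g. vec<float, 4> is 16 bytes with 16-byte alignment, analogous to float4
```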

An additional benefit of this implementation is that some of the intermediate values in the summation are loaded into registers and reused multiple times. Accessing shared memory is relatively fast, but registers are even faster.

This version of the kernel runs in 34.7us, roughly a 5.8x speedup over the original implementation.

Summary

Here, we looked at a few simple optimizations that resulted in a significant speedup: coalescing the global memory loads by transposing the data layout (or by staging tiles of the inputs in shared memory), transposing one of the shared memory buffers to avoid bank conflicts, and increasing the number of bytes in flight by having each thread compute a small chunk of output entries.

If we run the problem again with descriptor_dim = 1 to ensure that this is a completely memory-bound kernel, we find that the runtime drops to 17.4us, which is close to the optimistic theoretical estimate of

(2048 × 1024 × 4 bytes) / (600 GB/s) ≈ 14 μs

for my GPU. So, there is still potential for further optimization, but these changes have brought us "close" to the theoretical best-case performance, at least for the case where descriptor_dim is significantly smaller than the output matrix dimensions.

There are also many other detailed analyses of how to optimize dense matrix-matrix multiplication on GPUs which could directly apply to this kernel as well!

Code for the kernel implementations is available here:

https://github.com/samuelpmish/cuda_kernel_optimization_examples/tree/main/pairwise_distance_kernel