smish.dev
symmetric_spmv

Problem Statement

Large, sparse matrices appear frequently in HPC applications. A significant fraction of the overall memory footprint is often spent storing such matrices. Fortunately, many of these matrices also have symmetries which give us a way to avoid actually storing every single entry.

For example, the entries above the diagonal in a symmetric matrix are the same as the ones below the diagonal Aij=Aji. So, one common convention is to only store either the upper or lower triangular part of the matrix, which reduces the memory usage by (about) a factor of 2. However, omitting these entries tends to make certain operations more challenging to implement.

In this document, we'll look at a couple of ways to implement sparse matrix vector multiplication (sparse matvec or SpMV) for symmetric or skew-symmetric matrices. For reference, we will adopt the following notation in the remainder of the document:


Here are some basic stats about the matrices used to test the following SpMV implementations.

 Poisson3D (512x512x512)af_shell10.mtx Cube_Coup_dt0.mtx
n13421772815080652164760
nnz(A)93795123252259885124406070

 

cuSPARSE

cuSPARSE provides performant implementations of SpMV kernels, but at the time of writing it does not directly support symmetric storage. So, if we have an upper triangular matrix B:=D+U, and we wish to compute the action of A, we have two main options:

  1. Form the full matrix A:=B+BD, and call a single kernel to compute y:=Ax

  2. Modify the diagonal entries B^:=B12D, call separate kernels to compute y1:=B^x and y2:=B^x and add them together

Option 1 has two main downsides: it defeats the point of only storing half the matrix in the first place and roughly half of the memory accesses (associated with loading the matrix coefficients) are redundant.

Option 2 lets us reduce the memory footprint by only storing a triangular matrix, but it still has a performance issue since the matrix coefficients are accessed from global memory twice (once by each kernel). The answer from option 2 is also not deterministic, since cuSPARSE's implementation of the transpose action of a CSR matrix does not guarantee reproducible results.

 

The times of these two approaches (as measured on an A100) for 3 different test matrices are given below.

 Poisson3D (512x512x512)af_shell10.mtxCube_Coup_dt0.mtx
cuSPARSE (option 1)13.3 ms0.53 ms1.21 ms
cuSPARSE (option 2)17.8 ms0.64 ms1.35 ms

 

Custom Kernels

Naive SpMV

It's relatively straightforward to write a naive SpMV implementation:

This implementation doesn't address the symmetric storage problem yet, but I include it here to serve as a starting point that we can modify to support symmetric storage. As expected, this implementation is a little bit slower than cuSPARSE (option 1).

 Poisson3D (512x512x512)af_shell10.mtxCube_Coup_dt0.mtx
cuSPARSE (option 1)13.3 ms0.53 ms1.21 ms
cuSPARSE (option 2)17.8 ms0.64 ms1.35 ms
naive SpMV14.5 ms0.58 ms1.36 ms

 

Symmetric SpMV

If we intentionally avoid storing the elements below the diagonal, we still need to account for them in the SpMV if we're going to get the right answer. The kernel assumes that the CSR matrix passed in stores only the upper triangle of a symmetric matrix and as a result, its implementation differs from the previous one in the following ways:

  1. When matrix coefficient Aij is read, its contribution to row i is made in the same way as before. However, if that entry was not on the diagonal, we also need to account for the contribution that should be made to row j from the term Ajixi. So, we include an additional atomicAdd to make that contribution while avoiding a data race.

  2. Since the contributions to a row can come from other CTAs now, we need to make the local contributions to a row use an atomicAdd as well.

 

When we profile this kernel, we see that this implementation is actually faster than cuSPARSE on the test problems. This is almost entirely due to the fact that SpMV is a memory-bound kernel. By only reading matrix entries once rather than twice, we significantly reduce the total memory traffic, which improves the runtime.

 Poisson3D (512x512x512)af_shell10.mtxCube_Coup_dt0.mtx
cuSPARSE (option 1)13.3 ms0.53 ms1.21 ms
cuSPARSE (option 2)17.8 ms0.64 ms1.35 ms
naive SpMV14.5 ms0.58 ms1.36 ms
symmetric SpMV8.47 ms0.48 ms0.97 ms

However, this implementation uses floating-point atomics, so the order of summations is nondeterministic, which means that the output vector is also not reproducible, like cuSPARSE (option 2).

 

Compensated Symmetric SpMV

The outputs from the previous symmetric SpMV implementation vary run to run. Using the af_shell10 matrix as an example, if we run SpMV_sym multiple times on the same inputs and compute the relative error with the outputs computed using a long double precision, we observe

The errors are small for this specific matrix, but the answers are different every single time. Nondeterministic results like this can make replicating bugs challenging or hurt convergence of iterative methods.

Although guaranteeing total reproducibility for symmetric storage SpMV is challenging, one step in that direction is to use compensated arithmetic to improve the accuracy of our implementation. The idea is that: if we can make the outputs accurate to more digits than the underlying floating-point representation allows, the resulting output will be effectively deterministic.

The idea behind compensated arithmetic is to represent numbers by a pair of floating-point numbers, rather than an individual number. One of the numbers in the pair represents the "actual" value, and the other number tries to account for small errors incurred by floating-point arithmetic. After calculations have finished, the value and correction terms are combined to produce a more accurate estimate of the "true" value.

If this interests you, I recommend checking out https://github.com/jhueckelheim/force and Hueckelheim's related paper for more details.

 

With this idea in mind, we can apply compensated summation to the atomics in our kernel to try to mitigate the output variation that comes from adding contributions in different orders:

 

If we run this version multiple times we see that the outputs are actually the same between runs.

It's important to note that this reproducibility is not a guarantee-- there exist inputs to this implementation that generate errors too large to be accounted for by compensated arithmetic. However, it is also true that there are certain matrix and vector inputs that will actually be guaranteed to be deterministic, but it is probably best to not assume that is true in general.

In terms of performance, the compensated version of the kernel is slower than the uncompensated version, due to the extra atomicAdds and also a second kernel to combine y and y_corr to produce the final output.

 Poisson3D (512x512x512)af_shell10.mtxCube_Coup_dt0.mtx
cuSPARSE (option 1)13.3 ms0.53 ms1.21 ms
cuSPARSE (option 2)17.8 ms0.64 ms1.35 ms
naive SpMV14.5 ms0.58 ms1.36 ms
symmetric SpMV8.47 ms0.48 ms0.97 ms
symmetric SpMV (comp.)13.6 ms0.85 ms1.60 ms

 

Summary

We walked through the process of implementing several SpMV kernels that can take advantage of symmetric (or with a minor modification, skew-symmetric) storage. We showed that with only minor modifications, it was possible to support symmetric storage, and that the resulting kernel performs well, sometimes even surpassing the existing optimized library implementations.

We also showed how compensated arithmetic can be used to improve the accuracy of outputs, which can help mitigate some of the effects of nondeterministic order of summation that frequently arises with parallel processors.

Complete sources for the code used in this study are available here. The examples include the ability to import Matrix Market files, so give it a shot on your own datasets!