Large, sparse matrices appear frequently in HPC applications. A significant fraction of the overall memory footprint is often spent storing such matrices. Fortunately, many of these matrices also have symmetries which give us a way to avoid actually storing every single entry.
For example, the entries above the diagonal in a symmetric matrix are the same as the ones below the diagonal
In this document, we'll look at a couple of ways to implement sparse matrix vector multiplication (sparse matvec or SpMV) for symmetric or skew-symmetric matrices. For reference, we will adopt the following notation in the remainder of the document:
Here are some basic stats about the matrices used to test the following SpMV implementations.
| Poisson3D (512x512x512) | af_shell10.mtx | Cube_Coup_dt0.mtx | |
|---|---|---|---|
| 134217728 | 1508065 | 2164760 | |
| nnz( | 937951232 | 52259885 | 124406070 |
cuSPARSE provides performant implementations of SpMV kernels, but at the time of writing it does not directly support symmetric storage. So, if we have an upper triangular matrix
Form the full matrix
Modify the diagonal entries
Option 1 has two main downsides: it defeats the point of only storing half the matrix in the first place and roughly half of the memory accesses (associated with loading the matrix coefficients) are redundant.
Option 2 lets us reduce the memory footprint by only storing a triangular matrix, but it still has a performance issue since the matrix coefficients are accessed from global memory twice (once by each kernel). The answer from option 2 is also not deterministic, since cuSPARSE's implementation of the transpose action of a CSR matrix does not guarantee reproducible results.
The times of these two approaches (as measured on an A100) for 3 different test matrices are given below.
| Poisson3D (512x512x512) | af_shell10.mtx | Cube_Coup_dt0.mtx | |
|---|---|---|---|
| cuSPARSE (option 1) | 13.3 ms | 0.53 ms | 1.21 ms |
| cuSPARSE (option 2) | 17.8 ms | 0.64 ms | 1.35 ms |
It's relatively straightforward to write a naive SpMV implementation:
xxxxxxxxxx__global__ void SpMV_naive(int n, template <typename Value, int THREADS_PER_ROW>__global__ void SpMV_naive(int n, const int* row_offsets, const int* col_indices, const Value* values, const Value* x, Value* y) {
const int lane = threadIdx.x; const int row = blockIdx.x * blockDim.y + threadIdx.y; if (row >= n) return;
Value sum = Value(0);
int start = row_offsets[row] + lane; int end = row_offsets[row + 1]; for (int entry = start; entry < end; entry += THREADS_PER_ROW) { sum += values[entry] * x[col_indices[entry]]; }
const unsigned int mask = rowThreadMask<THREADS_PER_ROW>(); for (int offset = THREADS_PER_ROW / 2; offset > 0; offset /= 2) { sum += __shfl_down_sync(mask, sum, offset, THREADS_PER_ROW); }
if (lane == 0) { y[row] = sum; }
}This implementation doesn't address the symmetric storage problem yet, but I include it here to serve as a starting point that we can modify to support symmetric storage. As expected, this implementation is a little bit slower than cuSPARSE (option 1).
| Poisson3D (512x512x512) | af_shell10.mtx | Cube_Coup_dt0.mtx | |
|---|---|---|---|
| cuSPARSE (option 1) | 13.3 ms | 0.53 ms | 1.21 ms |
| cuSPARSE (option 2) | 17.8 ms | 0.64 ms | 1.35 ms |
| naive SpMV | 14.5 ms | 0.58 ms | 1.36 ms |
If we intentionally avoid storing the elements below the diagonal, we still need to account for them in the SpMV if we're going to get the right answer. The kernel assumes that the CSR matrix passed in stores only the upper triangle of a symmetric matrix and as a result, its implementation differs from the previous one in the following ways:
When matrix coefficient atomicAdd to make that contribution while avoiding a data race.
Since the contributions to a row can come from other CTAs now, we need to make the local contributions to a row use an atomicAdd as well.
xxxxxxxxxxtemplate <typename Value, int THREADS_PER_ROW>__global__ void SpMV_sym(int n, const int* row_offsets, const int* col_indices, const Value* values, const Value* x, Value* y) {
const int lane = threadIdx.x; const int row = blockIdx.x * blockDim.y + threadIdx.y; if (row >= n) return;
const Value x_row = x[row]; Value row_sum = Value(0);
int start = row_offsets[row] + lane; int end = row_offsets[row + 1]; for (int entry = start; entry < end; entry += THREADS_PER_ROW) { const int col = col_indices[entry]; const Value value = values[entry]; row_sum += value * x[col];
// this is the main difference from the previous kernel if (col != row) atomicAdd(&y[col], value * x_row); }
const unsigned int mask = rowThreadMask<THREADS_PER_ROW>(); for (int offset = THREADS_PER_ROW / 2; offset > 0; offset /= 2) { row_sum += __shfl_down_sync(mask, row_sum, offset, THREADS_PER_ROW); }
if (lane == 0) // this is an atomicAdd rather than an assignment now atomicAdd(&y[row], row_sum); }
}
When we profile this kernel, we see that this implementation is actually faster than cuSPARSE on the test problems. This is almost entirely due to the fact that SpMV is a memory-bound kernel. By only reading matrix entries once rather than twice, we significantly reduce the total memory traffic, which improves the runtime.
| Poisson3D (512x512x512) | af_shell10.mtx | Cube_Coup_dt0.mtx | |
|---|---|---|---|
| cuSPARSE (option 1) | 13.3 ms | 0.53 ms | 1.21 ms |
| cuSPARSE (option 2) | 17.8 ms | 0.64 ms | 1.35 ms |
| naive SpMV | 14.5 ms | 0.58 ms | 1.36 ms |
| symmetric SpMV | 8.47 ms | 0.48 ms | 0.97 ms |
However, this implementation uses floating-point atomics, so the order of summations is nondeterministic, which means that the output vector is also not reproducible, like cuSPARSE (option 2).
The outputs from the previous symmetric SpMV implementation vary run to run. Using the af_shell10 matrix as an example, if we run SpMV_sym multiple times on the same inputs and compute the relative error with the outputs computed using a long double precision, we observe
xprecision implementation run ||y - y_ref||_2 / ||y_ref||_2fp64 SpMV_sym 1 1.5588051464354728e-16fp64 SpMV_sym 2 1.5538645830555818e-16fp64 SpMV_sym 3 1.5551923705916996e-16fp64 SpMV_sym 4 1.5557633776700968e-16fp64 SpMV_sym 5 1.5597018499116526e-16fp64 SpMV_sym 6 1.5505143598172276e-16fp64 SpMV_sym 7 1.5612006875822011e-16fp64 SpMV_sym 8 1.5585246871798236e-16fp64 SpMV_sym 9 1.5582669190842955e-16fp64 SpMV_sym 10 1.5611788107187223e-16fp64 SpMV_sym 11 1.5596608630125432e-16fp64 SpMV_sym 12 1.5534243295451767e-16fp64 SpMV_sym 13 1.5620270794584186e-16fp64 SpMV_sym 14 1.5552747836533278e-16fp64 SpMV_sym 15 1.5554549532073915e-16fp64 SpMV_sym 16 1.5598397398829132e-16
The errors are small for this specific matrix, but the answers are different every single time. Nondeterministic results like this can make replicating bugs challenging or hurt convergence of iterative methods.
Although guaranteeing total reproducibility for symmetric storage SpMV is challenging, one step in that direction is to use compensated arithmetic to improve the accuracy of our implementation. The idea is that: if we can make the outputs accurate to more digits than the underlying floating-point representation allows, the resulting output will be effectively deterministic.
The idea behind compensated arithmetic is to represent numbers by a pair of floating-point numbers, rather than an individual number. One of the numbers in the pair represents the "actual" value, and the other number tries to account for small errors incurred by floating-point arithmetic. After calculations have finished, the value and correction terms are combined to produce a more accurate estimate of the "true" value.
If this interests you, I recommend checking out https://github.com/jhueckelheim/force and Hueckelheim's related paper for more details.
With this idea in mind, we can apply compensated summation to the atomics in our kernel to try to mitigate the output variation that comes from adding contributions in different orders:
x
template <typename Value>__device__ Value additionError(Value a, Value b, Value sum) { const Value corr = sum - a; return ((sum - corr) - a) + (corr - b);}
template <typename Value>__device__ void atomicAddCompensated(Value* y, Value* y_corr, int index, Value incr) { const Value old = atomicAdd(&y[index], incr); const Value sum = old + incr; atomicAdd(&y_corr[index], additionError(old, incr, sum));}
template <typename Value, int THREADS_PER_ROW>__global__ void SpMV_sym_compensated(int n, const int* row_offsets, const int* col_indices, const Value* values, const Value* x, Value* y, Value* y_corr) {
// ...
int start = row_offsets[row] + lane; int end = row_offsets[row + 1]; for (int entry = start; entry < end; entry += THREADS_PER_ROW) { // ...
if (col != row) atomicAddCompensated(y, y_corr, col, value * x_row); }
// ...
if (lane == 0) atomicAddCompensated(y, y_corr, row, row_sum); }
}
If we run this version multiple times we see that the outputs are actually the same between runs.
xprecision implementation run ||y - y_ref||_2 / ||y_ref||_2fp64 SpMV_sym_compensated 1 9.5777771970717702e-17fp64 SpMV_sym_compensated 2 9.5777771970717702e-17fp64 SpMV_sym_compensated 3 9.5777771970717702e-17fp64 SpMV_sym_compensated 4 9.5777771970717702e-17fp64 SpMV_sym_compensated 5 9.5777771970717702e-17fp64 SpMV_sym_compensated 6 9.5777771970717702e-17fp64 SpMV_sym_compensated 7 9.5777771970717702e-17fp64 SpMV_sym_compensated 8 9.5777771970717702e-17fp64 SpMV_sym_compensated 9 9.5777771970717702e-17fp64 SpMV_sym_compensated 10 9.5777771970717702e-17fp64 SpMV_sym_compensated 11 9.5777771970717702e-17fp64 SpMV_sym_compensated 12 9.5777771970717702e-17fp64 SpMV_sym_compensated 13 9.5777771970717702e-17fp64 SpMV_sym_compensated 14 9.5777771970717702e-17fp64 SpMV_sym_compensated 15 9.5777771970717702e-17fp64 SpMV_sym_compensated 16 9.5777771970717702e-17
It's important to note that this reproducibility is not a guarantee-- there exist inputs to this implementation that generate errors too large to be accounted for by compensated arithmetic. However, it is also true that there are certain matrix and vector inputs that will actually be guaranteed to be deterministic, but it is probably best to not assume that is true in general.
In terms of performance, the compensated version of the kernel is slower than the uncompensated version, due to the extra atomicAdds and also a second kernel to combine y and y_corr to produce the final output.
| Poisson3D (512x512x512) | af_shell10.mtx | Cube_Coup_dt0.mtx | |
|---|---|---|---|
| cuSPARSE (option 1) | 13.3 ms | 0.53 ms | 1.21 ms |
| cuSPARSE (option 2) | 17.8 ms | 0.64 ms | 1.35 ms |
| naive SpMV | 14.5 ms | 0.58 ms | 1.36 ms |
| symmetric SpMV | 8.47 ms | 0.48 ms | 0.97 ms |
| symmetric SpMV (comp.) | 13.6 ms | 0.85 ms | 1.60 ms |
We walked through the process of implementing several SpMV kernels that can take advantage of symmetric (or with a minor modification, skew-symmetric) storage. We showed that with only minor modifications, it was possible to support symmetric storage, and that the resulting kernel performs well, sometimes even surpassing the existing optimized library implementations.
We also showed how compensated arithmetic can be used to improve the accuracy of outputs, which can help mitigate some of the effects of nondeterministic order of summation that frequently arises with parallel processors.
Complete sources for the code used in this study are available here. The examples include the ability to import Matrix Market files, so give it a shot on your own datasets!