symmetric

Problem Statement

Large, sparse matrices appear frequently in HPC applications. A significant fraction of the overall memory footprint is often spent storing such matrices. Fortunately, many of these matrices also have symmetries which give us a way to avoid actually storing every single entry.

$A_{ij} = A_{ji}$ . So, one common convention is to only store either the upper or lower triangular part of the matrix, which reduces the memory usage by (about) a factor of 2. However, omitting these entries tends to make certain operations more challenging to implement.

In this document, we'll look at a couple of ways to implement sparse matrix vector multiplication (sparse matvec or SpMV) for symmetric or skew-symmetric matrices. For reference, we will adopt the following notation in the remainder of the document:

$\bold{A}$ $n$ rows and columns
$\bold{U}$ $\bold{A}$
$\bold{D}$ $\bold{A}$
$\bold{x}, \bold{y}$ $\mathbb{R}^n$

Here are some basic stats about the matrices used to test the following SpMV implementations.

	Poisson3D (512x512x512)	af_shell10.mtx	Cube_Coup_dt0.mtx
$n$	134217728	1508065	2164760
$\bold{A}$ )	937951232	52259885	124406070

cuSPARSE

$\bold{B} \coloneqq \bold{D} + \bold{U}$ $\bold{A}$ , we have two main options:

$\bold{A} \coloneqq \bold{B} + \bold{B}^\top - \bold{D}$ $\bold{y} \coloneqq \bold{A} \; \bold{x}$
$\bold{\hat{B}} \coloneqq \bold{B} - \frac{1}{2} \bold{D}$ $\bold{y}_1 \coloneqq \bold{\hat{B}} \; \bold{x}$ $\bold{y}_2 \coloneqq \bold{\hat{B}}^\top \bold{x}$ and add them together

Option 1 has two main downsides: it defeats the point of only storing half the matrix in the first place and roughly half of the memory accesses (associated with loading the matrix coefficients) are redundant.

Option 2 lets us reduce the memory footprint by only storing a triangular matrix, but it still has a performance issue since the matrix coefficients are accessed from global memory twice (once by each kernel). The answer from option 2 is also not deterministic, since cuSPARSE's implementation of the transpose action of a CSR matrix does not guarantee reproducible results.

The times of these two approaches (as measured on an A100) for 3 different test matrices are given below.

	Poisson3D (512x512x512)	af_shell10.mtx	Cube_Coup_dt0.mtx
cuSPARSE (option 1)	13.3 ms	0.53 ms	1.21 ms
cuSPARSE (option 2)	17.8 ms	0.64 ms	1.35 ms

Custom Kernels

Naive SpMV

It's relatively straightforward to write a naive SpMV implementation:


xxxxxxxxxx
__global__ void SpMV_naive(int n, 
template <typename Value, int THREADS_PER_ROW>
__global__ void SpMV_naive(int n, 
                           const int* row_offsets,
                           const int* col_indices, 
                           const Value* values,
                           const Value* x, 
                           Value* y) {

  const int lane = threadIdx.x;
  const int row = blockIdx.x * blockDim.y + threadIdx.y;
  if (row >= n) return;

  Value sum = Value(0);

  int start = row_offsets[row] + lane;
  int end = row_offsets[row + 1];
  for (int entry = start; entry < end; entry += THREADS_PER_ROW) {
    sum += values[entry] * x[col_indices[entry]];
  }

  const unsigned int mask = rowThreadMask<THREADS_PER_ROW>();
  for (int offset = THREADS_PER_ROW / 2; offset > 0; offset /= 2) {
    sum += __shfl_down_sync(mask, sum, offset, THREADS_PER_ROW);
  }

  if (lane == 0) {
    y[row] = sum;
  }

}

This implementation doesn't address the symmetric storage problem yet, but I include it here to serve as a starting point that we can modify to support symmetric storage. As expected, this implementation is a little bit slower than cuSPARSE (option 1).

	Poisson3D (512x512x512)	af_shell10.mtx	Cube_Coup_dt0.mtx
cuSPARSE (option 1)	13.3 ms	0.53 ms	1.21 ms
cuSPARSE (option 2)	17.8 ms	0.64 ms	1.35 ms
naive SpMV	14.5 ms	0.58 ms	1.36 ms

Symmetric SpMV

If we intentionally avoid storing the elements below the diagonal, we still need to account for them in the SpMV if we're going to get the right answer. The kernel assumes that the CSR matrix passed in stores only the upper triangle of a symmetric matrix and as a result, its implementation differs from the previous one in the following ways:

$A_{ij}$ $i$ $j$ $A_{ji}\; x_i$ . So, we include an additional atomicAdd to make that contribution while avoiding a data race.
Since the contributions to a row can come from other CTAs now, we need to make the local contributions to a row use an atomicAdd as well.


xxxxxxxxxx
template <typename Value, int THREADS_PER_ROW>
__global__ void SpMV_sym(int n, 
                         const int* row_offsets,
                         const int* col_indices,
                         const Value* values, 
                         const Value* x,
                         Value* y) {

  const int lane = threadIdx.x;
  const int row = blockIdx.x * blockDim.y + threadIdx.y;
  if (row >= n) return;

  const Value x_row = x[row];
  Value row_sum = Value(0);

  int start = row_offsets[row] + lane;
  int end = row_offsets[row + 1];
  for (int entry = start; entry < end; entry += THREADS_PER_ROW) {
    
    const int col = col_indices[entry];
    const Value value = values[entry];
    row_sum += value * x[col];

    // this is the main difference from the previous kernel
    if (col != row) atomicAdd(&y[col], value * x_row);
    
  }

  const unsigned int mask = rowThreadMask<THREADS_PER_ROW>();
  for (int offset = THREADS_PER_ROW / 2; offset > 0; offset /= 2) {
    row_sum += __shfl_down_sync(mask, row_sum, offset, THREADS_PER_ROW);
  }

  if (lane == 0) 
    // this is an atomicAdd rather than an assignment now
    atomicAdd(&y[row], row_sum);
  }

}

When we profile this kernel, we see that this implementation is actually faster than cuSPARSE on the test problems. This is almost entirely due to the fact that SpMV is a memory-bound kernel. By only reading matrix entries once rather than twice, we significantly reduce the total memory traffic, which improves the runtime.

	Poisson3D (512x512x512)	af_shell10.mtx	Cube_Coup_dt0.mtx
cuSPARSE (option 1)	13.3 ms	0.53 ms	1.21 ms
cuSPARSE (option 2)	17.8 ms	0.64 ms	1.35 ms
naive SpMV	14.5 ms	0.58 ms	1.36 ms
symmetric SpMV	8.47 ms	0.48 ms	0.97 ms

However, this implementation uses floating-point atomics, so the order of summations is nondeterministic, which means that the output vector is also not reproducible, like cuSPARSE (option 2).

Compensated Symmetric SpMV

The outputs from the previous symmetric SpMV implementation vary run to run. Using the af_shell10 matrix as an example, if we run SpMV_sym multiple times on the same inputs and compute the relative error with the outputs computed using a long double precision, we observe


x
precision    implementation         run      ||y - y_ref||_2 / ||y_ref||_2
fp64         SpMV_sym               1        1.5588051464354728e-16        
fp64         SpMV_sym               2        1.5538645830555818e-16        
fp64         SpMV_sym               3        1.5551923705916996e-16        
fp64         SpMV_sym               4        1.5557633776700968e-16        
fp64         SpMV_sym               5        1.5597018499116526e-16        
fp64         SpMV_sym               6        1.5505143598172276e-16        
fp64         SpMV_sym               7        1.5612006875822011e-16        
fp64         SpMV_sym               8        1.5585246871798236e-16        
fp64         SpMV_sym               9        1.5582669190842955e-16        
fp64         SpMV_sym               10       1.5611788107187223e-16        
fp64         SpMV_sym               11       1.5596608630125432e-16        
fp64         SpMV_sym               12       1.5534243295451767e-16        
fp64         SpMV_sym               13       1.5620270794584186e-16        
fp64         SpMV_sym               14       1.5552747836533278e-16        
fp64         SpMV_sym               15       1.5554549532073915e-16        
fp64         SpMV_sym               16       1.5598397398829132e-16 
fp64         SpMV_sym               13       1.5620270794584186e-16

The errors are small for this specific matrix, but the answers are different every single time. Nondeterministic results like this can make replicating bugs challenging or hurt convergence of iterative methods.

Although guaranteeing total reproducibility for symmetric storage SpMV is challenging, one step in that direction is to use compensated arithmetic to improve the accuracy of our implementation. The idea is that: if we can make the outputs accurate to more digits than the underlying floating-point representation allows, the resulting output will be effectively deterministic.

The idea behind compensated arithmetic is to represent numbers by a pair of floating-point numbers, rather than an individual number. One of the numbers in the pair represents the "actual" value, and the other number tries to account for small errors incurred by floating-point arithmetic. After calculations have finished, the value and correction terms are combined to produce a more accurate estimate of the "true" value.

If this interests you, I recommend checking out https://github.com/jhueckelheim/force and Hueckelheim's related paper for more details.

With this idea in mind, we can apply compensated summation to the atomics in our kernel to try to mitigate the output variation that comes from adding contributions in different orders:


x
template <typename Value>
__device__ Value additionError(Value a, Value b, Value sum) {
  const Value corr = sum - a;
  return ((sum - corr) - a) + (corr - b);
}

template <typename Value>
__device__ void atomicAddCompensated(Value* y, Value* y_corr, int index, Value incr) {
  const Value old = atomicAdd(&y[index], incr);
  const Value sum = old + incr;
  atomicAdd(&y_corr[index], additionError(old, incr, sum));
}

template <typename Value, int THREADS_PER_ROW>
__global__ void SpMV_sym_compensated(int n, 
                                     const int* row_offsets,
                                     const int* col_indices,
                                     const Value* values, 
                                     const Value* x,
                                     Value* y,
                                     Value* y_corr) {

  // ...

  int start = row_offsets[row] + lane;
  int end = row_offsets[row + 1];
  for (int entry = start; entry < end; entry += THREADS_PER_ROW) {
    
    // ...

    if (col != row) atomicAddCompensated(y, y_corr, col, value * x_row);
    
  }

  // ... 

  if (lane == 0) 
    atomicAddCompensated(y, y_corr, row, row_sum);
  }

}
    atomicAddCompensated(y, y_corr, row, row_sum);

If we run this version multiple times we see that the outputs are actually the same between runs.


x
precision    implementation            run      ||y - y_ref||_2 / ||y_ref||_2
fp64         SpMV_sym_compensated      1        9.5777771970717702e-17        
fp64         SpMV_sym_compensated      2        9.5777771970717702e-17        
fp64         SpMV_sym_compensated      3        9.5777771970717702e-17        
fp64         SpMV_sym_compensated      4        9.5777771970717702e-17        
fp64         SpMV_sym_compensated      5        9.5777771970717702e-17        
fp64         SpMV_sym_compensated      6        9.5777771970717702e-17        
fp64         SpMV_sym_compensated      7        9.5777771970717702e-17        
fp64         SpMV_sym_compensated      8        9.5777771970717702e-17        
fp64         SpMV_sym_compensated      9        9.5777771970717702e-17        
fp64         SpMV_sym_compensated      10       9.5777771970717702e-17        
fp64         SpMV_sym_compensated      11       9.5777771970717702e-17        
fp64         SpMV_sym_compensated      12       9.5777771970717702e-17        
fp64         SpMV_sym_compensated      13       9.5777771970717702e-17        
fp64         SpMV_sym_compensated      14       9.5777771970717702e-17        
fp64         SpMV_sym_compensated      15       9.5777771970717702e-17        
fp64         SpMV_sym_compensated      16       9.5777771970717702e-17
fp64         SpMV_sym_compensated      5        9.5777771970717702e-17

It's important to note that this reproducibility is not a guarantee-- there exist inputs to this implementation that generate errors too large to be accounted for by compensated arithmetic. However, it is also true that there are certain matrix and vector inputs that will actually be guaranteed to be deterministic, but it is probably best to not assume that is true in general.

In terms of performance, the compensated version of the kernel is slower than the uncompensated version, due to the extra atomicAdds and also a second kernel to combine y and y_corr to produce the final output.

	Poisson3D (512x512x512)	af_shell10.mtx	Cube_Coup_dt0.mtx
cuSPARSE (option 1)	13.3 ms	0.53 ms	1.21 ms
cuSPARSE (option 2)	17.8 ms	0.64 ms	1.35 ms
naive SpMV	14.5 ms	0.58 ms	1.36 ms
symmetric SpMV	8.47 ms	0.48 ms	0.97 ms
symmetric SpMV (comp.)	13.6 ms	0.85 ms	1.60 ms

Summary

We walked through the process of implementing several SpMV kernels that can take advantage of symmetric (or with a minor modification, skew-symmetric) storage. We showed that with only minor modifications, it was possible to support symmetric storage, and that the resulting kernel performs well, sometimes even surpassing the existing optimized library implementations.

We also showed how compensated arithmetic can be used to improve the accuracy of outputs, which can help mitigate some of the effects of nondeterministic order of summation that frequently arises with parallel processors.

Complete sources for the code used in this study are available here. The examples include the ability to import Matrix Market files, so give it a shot on your own datasets!