Templated FEM kernels: Benefits and Drawbacks

Many of the low-level FEM kernels in mfem have the following form:

That is, they are function templates where the polynomial order (T_D1D) and quadrature rule (T_Q1D) are template parameters. This has the benefit of making it easy to statically allocate data structures in shared memory (note: MFEM_FORALL_3D is a macro that contains a kernel launch)

The downside is that making this kernel depend on two template parameters means there's potentially about a hundred possible combinations of T_D1D (polynomial order) and T_Q1D (quadrature points per dimension) that a user could request. Covering each of those ~100 possible specializations would mean compiling and optimizing each one independently, having some sort of elaborate case/switch statement to dispatch the appropriate specialization, and making the binary considerably larger.

The approach described above of covering all possible combinations is clearly not practical, and mfem opts for a compromise instead:

The difference in performance between mfem's templated kernels and the dynamic fallback option is significant, especially for low order:


So, although this compromise is preferable to compiling each kernel a hundred times, it still compiles each kernel ~10 times, and admits the possibility of considerable loss of performance if the user requests a kernel that is not part of the curated set.


The question is: why not just implement the kernel where polynomial order and quadrature rules are specified dynamically, but without overallocating shared memory (the approach used by mfem's fallback kernel)?


An Alternative Implementation

The main part of the current mfem kernel implementation that necessitates compile-time knowledge of polynomial order and quadrature rule is the static allocation of shared memory:

So, what if we just allocated the shared memory dynamically?

where ndview is a non-owning, GPU-compatible multidimensional array class (since C++ notoriously lacks such a container in the standard library). Its full implementation is only 30 lines of code, and is given at the end of the document.

Now that there is no requirement to have the kernel take template parameters, one kernel can cover all of the possible combinations of polynomial order and quadrature rule.

By making this small modification, the performance of the new kernel is close to parity with the templated mfem kernels:


Furthermore, making a few other small modifications (most notably, the possibility of processing multiple elements per thread block) can make the alternative dynamic kernel implementation competitive or better than the templated implementation in mfem:



Note: all performance numbers are taken from tests on a NVIDIA GV100 GPU.




Templates are a powerful tool. They enable developers to make certain information available at compile time, which can reveal additional optimizations that improve performance.

However, that doesn't mean that templates are a requirement for performant code.

In this document, we looked at a real-world example where template parameters were assumed to be necessary for getting the best kernel performance, but that wasn't the case. Refactoring the kernel to use fewer templates significantly reduced binary size and compile time, while retaining performance parity.

The takeaway here isn't to never use templates, but to pick the right tool for the job. Like Chandler Carruth emphasizes in this wonderful talk: given the choice between two equivalent implementations, prefer the simpler abstraction!



ndview Implementation




Portability-Layer Wrapper


One naive way to include dynamic shared memory allocation in a "forall" framework (e.g. RAJA, Kokkos, etc)