One of the libraries we use at work has some large mesh data files that take up about 200 MB in its git repo. These files are only used by some of the library's performance tests, so unless you're running those benchmarks or actively developing the library, they aren't really needed. However, there's no good way to opt out of downloading them when cloning the repo, so you end up pulling those large files whether you need them or not.
CMake has a module, ExternalData, that can help address this problem by associating data files with certain targets, and only downloading those files when those targets are built. This makes downloading the large files opt-in, so the costs are only incurred when those files are needed.
CMake's documentation on ExternalData is good, but there aren't many examples of the module in use, so this post walks through an example of how to use it.
Imagine we have a C++ project and one of its tests depends on some data file:
```cpp
// test.cpp
#include <filesystem>
#include <iostream>

int main(int argc, char** argv) {
  std::cout << "checking if file " << argv[1] << " has the appropriate size ... ";

  auto size = std::filesystem::file_size(argv[1]);
  if (size != 64000) {
    std::cout << "test failed" << std::endl;
    return 1; // nonzero exit code, so ctest reports a failure
  } else {
    std::cout << "test passed" << std::endl;
  }
}
```

We could register a test with CMake that runs this executable with a given data file:
```cmake
cmake_minimum_required(VERSION 3.18)
project(external_data_example_project LANGUAGES C CXX)
set(CMAKE_CXX_STANDARD 17) # for std::filesystem

...

if (BUILD_TESTS)
  enable_testing()
  add_executable(check_file_size test.cpp)
  add_test(
    NAME check_file_size_is_64KB
    COMMAND check_file_size data_file.bin
  )
endif()
```

If we did that, then after building we could run ctest and it would run the command we specified:
```
% ctest --verbose
...
test 1
    Start 1: check_file_size_is_64KB

1: Test command: /path/to/check_file_size "data_file.bin"
1: Test timeout computed to be: 10000000
1: checking if file data_file.bin has the appropriate size ... test passed
1/1 Test #1: check_file_size_is_64KB ..........   Passed   0.00 sec
```
So far so good, except this test assumes that data_file.bin is already available. Ideally, the user wouldn't have to download this file at all if BUILD_TESTS were off, and ExternalData lets us do that with a few small changes. First, we need to hash the data file with one of the supported algorithms and write that hash to the file data_file.bin.md5. Using MD5 in this example:
```
% md5 -q data_file.bin | tee data_file.bin.md5
8b8fcc5fd365c2001a49f4e4e64284e2
```
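As an aside, md5 -q is the BSD/macOS tool; on Linux, the usual equivalent is GNU coreutils' md5sum, which prints the hash followed by the filename, so a rough sketch of the same step (an assumption about your platform, not shown in the original walkthrough) would be:

```
% md5sum data_file.bin | awk '{print $1}' | tee data_file.bin.md5
8b8fcc5fd365c2001a49f4e4e64284e2
```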
This .md5 file is what we keep in the repo, and it serves as a placeholder for the actual data. Next, rename the actual data file to the hash value calculated above, and put it somewhere (e.g. on the local filesystem, or in a separate repo on github) in a directory named MD5:
```
% mv data_file.bin /path/to/somewhere/MD5/8b8fcc5fd365c2001a49f4e4e64284e2
```
For this example, I put this data file in a github repo (https://github.com/samuelpmish/ExternalData).
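In that repo, the data is stored content-addressed: a directory named after the hash algorithm, containing files named by their hashes (layout inferred from the URL template used below):

```
ExternalData/        # the data repo
└── MD5/
    └── 8b8fcc5fd365c2001a49f4e4e64284e2   # the renamed data_file.bin
```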
Then, we tell CMake how to locate the actual data files from their hashes, by using ExternalData:
```cmake
# use the ExternalData module
include(ExternalData)

# tell CMake where to find the actual data files for this project
set(ExternalData_URL_TEMPLATES
  "https://raw.githubusercontent.com/samuelpmish/ExternalData/main/%(algo)/%(hash)")
```
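The URL template doesn't have to point at a web server; ExternalData also accepts file:// templates, so the objects could be served straight from a local or network directory using the same %(algo)/%(hash) scheme. A sketch, reusing the hypothetical path from the mv command above:

```cmake
# alternative: fetch objects from a local directory instead of github
set(ExternalData_URL_TEMPLATES "file:///path/to/somewhere/%(algo)/%(hash)")
```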
Finally, we use ExternalData_Add_Test instead of add_test to register the test. Note how the data file name is wrapped in DATA{...}:

```cmake
# add_test(
#   NAME check_file_size_is_64KB
#   COMMAND check_file_size data_file.bin)

# call ExternalData_Add_Test instead of add_test,
# and wrap the data file names in DATA{...}
ExternalData_Add_Test(MY_DATA
  NAME check_file_size_is_64KB
  COMMAND check_file_size DATA{data_file.bin}
)

# create the target that actually performs the downloads
ExternalData_Add_Target(MY_DATA)
```
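ExternalData_Add_Test is a convenience wrapper around add_test; for data consumed by something other than a test, the module also provides ExternalData_Expand_Arguments, which resolves DATA{...} references into real paths that can be passed along to other commands. A rough sketch (the custom command and some_tool below are hypothetical, not part of the example project):

```cmake
# hypothetical: feed the same data file to a custom command instead of a test
ExternalData_Expand_Arguments(MY_DATA
  resolved_args
  DATA{data_file.bin}
)
add_custom_command(
  OUTPUT processed.bin
  COMMAND some_tool ${resolved_args} processed.bin  # some_tool is hypothetical
)
```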
Now, if we configure CMake with -DBUILD_TESTS=TRUE and build, we will see CMake fetch the data files associated with the targets that are being built:

```
% make
...
-- Fetching "https://raw.githubusercontent.com/samuelpmish/ExternalData/main/MD5/8b8fcc5fd365c2001a49f4e4e64284e2"
-- [download 10% complete]
-- [download 26% complete]
-- [download 31% complete]
-- [download 44% complete]
-- [download 48% complete]
-- [download 57% complete]
-- [download 87% complete]
-- [download 91% complete]
-- [download 97% complete]
-- [download 100% complete]
-- Downloaded object: "/path/to/cmake_external_data_example/build/ExternalData/Objects/MD5/8b8fcc5fd365c2001a49f4e4e64284e2"
[100%] Built target MY_DATA
```
If we had configured with -DBUILD_TESTS=FALSE, these downloads would have been skipped, because the targets that require the data files would not have been built. Just like before, we can run ctest and verify that everything works:
```
% ctest --verbose
...
test 1
    Start 1: check_file_size_is_64KB

1: Test command: /path/to/cmake_external_data_example/build/check_file_size "/path/to/cmake_external_data_example/build/data_file.bin"
1: Test timeout computed to be: 10000000
1: checking if file /path/to/cmake_external_data_example/build/data_file.bin has the appropriate size ... test passed
1/1 Test #1: check_file_size_is_64KB ..........   Passed   0.19 sec

100% tests passed, 0 tests failed out of 1
```
The complete example repo can be found on github here.
All in all, this feature of CMake does provide a way to manage these large files, but it seems like a considerable amount of work to set up and use. In practice, I'd rather just put the data files in a separate repo that is optionally included by something like ExternalProject or as a git submodule.
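For comparison, here is a minimal sketch of that alternative, assuming a hypothetical data-only repository and gating the download behind the same BUILD_TESTS option (ExternalProject downloads at build time, so the cost stays opt-in):

```cmake
if (BUILD_TESTS)
  include(ExternalProject)
  # hypothetical data-only repo: just clone it, with no configure/build/install steps
  ExternalProject_Add(test_data
    GIT_REPOSITORY https://github.com/your-org/test-data.git  # hypothetical URL
    GIT_TAG        main
    CONFIGURE_COMMAND ""
    BUILD_COMMAND     ""
    INSTALL_COMMAND   ""
  )
endif()
```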