One of the libraries we use at work keeps some large mesh data files, about 200 MB, in its git repo. These files are only used by some of the library's performance tests, so unless you're running those benchmarks or actively developing the library, they aren't needed. However, when cloning the repo there's no good way to opt out of them, so you end up downloading the large files whether or not you need them.
CMake has a module, ExternalData, that can help address this problem by associating data files with certain targets, and only downloading those files when those targets are built. This makes downloading the large files opt-in, so the cost is only incurred when those files are needed.
CMake's documentation on ExternalData is good, but there aren't many example projects showing it in use, so this post walks through an example of how to use this feature.
Imagine we have a C++ project and one of its tests depends on some data file:
// test.cpp
#include <filesystem>
#include <iostream>

int main(int argc, char ** argv) {
  std::cout << "checking if file " << argv[1] << " has the appropriate size ... ";
  auto size = std::filesystem::file_size(argv[1]);
  if (size != 64000) {
    std::cout << "test failed" << std::endl;
    return 1; // nonzero exit code so ctest reports the failure
  } else {
    std::cout << "test passed" << std::endl;
  }
}
We could register a test with CMake that runs this executable with a given data file:
cmake_minimum_required(VERSION 3.18)
project(external_data_example_project LANGUAGES C CXX)
set(CMAKE_CXX_STANDARD 17) # for std::filesystem

...

if (BUILD_TESTS)
  enable_testing()
  add_executable(check_file_size test.cpp)
  add_test(NAME check_file_size_is_64KB COMMAND check_file_size data_file.bin)
endif()
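(The elided portion of the CMakeLists would define the BUILD_TESTS switch somewhere; a minimal sketch of one way to do that, as a plain cache option defaulting to off, is below. This is just an assumption about the elided setup, not part of the example repo.)

# hypothetical: define the BUILD_TESTS switch, defaulting to OFF
# so that the test-only machinery stays opt-in
option(BUILD_TESTS "build the tests for this project" OFF)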
If we did that, then after building we could run ctest and it would run the command we specified:
% ctest --verbose
...
test 1
Start 1: check_file_size_is_64KB
1: Test command: /path/to/check_file_size "data_file.bin"
1: Test timeout computed to be: 10000000
1: checking if file data_file.bin has the appropriate size ... test passed
1/1 Test #1: check_file_size_is_64KB .......... Passed 0.00 sec
So far so good, except this test assumes that data_file.bin is already available. Ideally, the user wouldn't have to download this file if BUILD_TESTS is off, and ExternalData lets us do that by making a few small changes. First, we need to hash the data file with one of the supported algorithms. Using MD5 in this example, we can execute
% md5 -q data_file.bin | tee data_file.bin.md5
8b8fcc5fd365c2001a49f4e4e64284e2
to hash the data file and write that value to the file data_file.bin.md5. This will be the file that we keep in our repo, and it serves as a placeholder for the actual data. Next, rename the actual data file to the hash value calculated above, and put it somewhere (e.g. on the local filesystem, or in a separate repo on github) in a directory named MD5:
% mv data_file.bin /path/to/somewhere/MD5/8b8fcc5fd365c2001a49f4e4e64284e2
For this example, I put this data file in a github repo, here.
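As an aside, the md5 command used above is macOS-specific; if you want a portable way to generate the placeholder, CMake itself can compute the hash. A hypothetical helper script (not part of the example repo), run as cmake -DINPUT=data_file.bin -P hash_file.cmake:

# hash_file.cmake (hypothetical helper)
# compute the MD5 of the input file and write it to the .md5 placeholder
file(MD5 "${INPUT}" hash)
file(WRITE "${INPUT}.md5" "${hash}")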
Then, we tell CMake how to locate the actual data files from their hashes, by using ExternalData:
# use the ExternalData module
include(ExternalData)
# tell CMake where to find the actual data files for this project
set(ExternalData_URL_TEMPLATES "https://raw.githubusercontent.com/samuelpmish/ExternalData/main/%(algo)/%(hash)")
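ExternalData_URL_TEMPLATES is a list, so more than one location can be given; each template is tried in order, with %(algo) and %(hash) substituted for each candidate. For example, to check a local mirror before falling back to the github repo (the local path here is just an illustration):

# try a local mirror first, then fall back to the github repo
set(ExternalData_URL_TEMPLATES
  "file:///path/to/local/mirror/%(algo)/%(hash)"
  "https://raw.githubusercontent.com/samuelpmish/ExternalData/main/%(algo)/%(hash)")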
Finally, we use ExternalData_Add_Test instead of add_test to register the test. Note how the data file name is wrapped in DATA{...}:
# add_test(
#   NAME check_file_size_is_64KB
#   COMMAND check_file_size data_file.bin)

# call ExternalData_Add_Test instead of add_test,
# and wrap the data file names in DATA{...}
ExternalData_Add_Test(MY_DATA
  NAME check_file_size_is_64KB
  COMMAND check_file_size DATA{data_file.bin})

ExternalData_Add_Target(MY_DATA)
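The DATA{} syntax is not limited to a single file, either: a reference with a trailing slash and a regular expression pulls in every placeholder in a directory, which would suit the mesh files from the motivating example. A hypothetical sketch (the meshes/ directory, test name, and executable are made up for illustration; it would sit alongside the first test, before the ExternalData_Add_Target call):

# fetch every file in meshes/ that has a .md5 placeholder and pass
# the directory path to a (hypothetical) benchmark executable
ExternalData_Add_Test(MY_DATA
  NAME mesh_benchmark
  COMMAND run_mesh_benchmark DATA{meshes/,REGEX:.*})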
Now, if we configure CMake with -DBUILD_TESTS=TRUE and build, we will see CMake fetch the data files associated with the targets that are being built:
% make
...
-- Fetching "https://raw.githubusercontent.com/samuelpmish/ExternalData/main/MD5/8b8fcc5fd365c2001a49f4e4e64284e2"
-- [download 10% complete]
-- [download 26% complete]
-- [download 31% complete]
-- [download 44% complete]
-- [download 48% complete]
-- [download 57% complete]
-- [download 87% complete]
-- [download 91% complete]
-- [download 97% complete]
-- [download 100% complete]
-- Downloaded object: "/path/to/cmake_external_data_example/build/ExternalData/Objects/MD5/8b8fcc5fd365c2001a49f4e4e64284e2"
[100%] Built target MY_DATA
If we had configured with -DBUILD_TESTS=FALSE, these downloads would have been skipped, because the targets that required the data files were not being built. Just like before, we can run ctest and verify that everything is working:
% ctest --verbose
...
test 1
Start 1: check_file_size_is_64KB
1: Test command: /path/to/cmake_external_data_example/build/check_file_size "/path/to/cmake_external_data_example/build/data_file.bin"
1: Test timeout computed to be: 10000000
1: checking if file /path/to/cmake_external_data_example/build/data_file.bin has the appropriate size ... test passed
1/1 Test #1: check_file_size_is_64KB .......... Passed 0.19 sec
100% tests passed, 0 tests failed out of 1
The complete example repo can be found on github here.
All in all, this feature of CMake does provide a way to manage these large files, but it seems like a considerable amount of work to set up and use. In practice, I'd rather just put the data files in a separate repo that is optionally included by something like ExternalProject or as a git submodule.
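For reference, that alternative doesn't take much code either. A minimal sketch of the ExternalProject route, assuming a hypothetical data-only repo that is only fetched when BUILD_TESTS is on:

include(ExternalProject)

if (BUILD_TESTS)
  # clone the (hypothetical) data-only repo; there is nothing to configure,
  # build, or install, we just want the files available in the build tree
  ExternalProject_Add(example_test_data
    GIT_REPOSITORY    https://github.com/your-org/example_test_data.git # hypothetical URL
    GIT_TAG           main
    CONFIGURE_COMMAND ""
    BUILD_COMMAND     ""
    INSTALL_COMMAND   "")
endif()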