The Terraform plugin cache does not support concurrent terraform init runs.
This is a massive inefficiency for Terraform users past a certain scale, since the only workarounds are disabling caching or serializing all inits, which become suboptimal past a certain number of providers or of concurrent plan/apply operations, respectively.
From what I could find in Terraform’s circles, this is a known problem and there’s intent to fix it, but it’s complicated. And since it seems like we’re a long way from a native solution, we had to get creative.
Introducing OverlayFS
OverlayFS is a Linux filesystem which combines the contents of multiple read-only directories with a writable layer on top into a single mount.
Kubernetes and Docker use it, CoreOS used it, live Linux distributions use it, etc.
But what brings a low-level tool like OverlayFS this high up the stack?
We’re going to use OverlayFS to give each of our concurrent Terraform inits the illusion that they’re sharing the same Terraform plugin cache, but redirect writes to their respective writable local layers on top, which we can then sync back if necessary.
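As a standalone illustration of that write redirection (a minimal sketch with made-up /tmp paths, run as root): writes through the merged mount land in the upper directory, while the lower directory stays untouched.

mkdir --parents /tmp/overlay-demo/{lower,upper,work,merged}
echo "already cached" > /tmp/overlay-demo/lower/existing-provider
mount -t overlay overlay -o 'lowerdir=/tmp/overlay-demo/lower,upperdir=/tmp/overlay-demo/upper,workdir=/tmp/overlay-demo/work' /tmp/overlay-demo/merged
ls /tmp/overlay-demo/merged   # existing-provider is visible through the overlay
echo "freshly downloaded" > /tmp/overlay-demo/merged/new-provider
ls /tmp/overlay-demo/upper    # new-provider was redirected here; lower/ is unchanged
umount /tmp/overlay-demo/merged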
Setup
The following steps need to be performed for each terraform init.
1. Before Terraform init
First, we’ll need the following directories:
mkdir --parents /path/to/cache .terraform/cache/{upperdir,workdir,mountdir}
Now, let’s mount our overlay:
mount -t overlay overlay -o 'lowerdir=/path/to/cache,upperdir=.terraform/cache/upperdir,workdir=.terraform/cache/workdir' .terraform/cache/mountdir
This creates an OverlayFS using:
- /path/to/cache as the central cache (lowerdir);
- .terraform/cache/upperdir as the local cache (upperdir); and
- mounts the union of both at .terraform/cache/mountdir.

workdir is internal to OverlayFS and we shall not touch it.
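If you want to sanity-check the mount before moving on, findmnt can confirm the overlay and its layers (output formatting varies by distro):

# Optional sanity check: the overlay should show up with our lowerdir/upperdir in its mount options.
findmnt --types overlay --output TARGET,FSTYPE,OPTIONS .terraform/cache/mountdir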
2. During Terraform init
Now, let’s enable Terraform’s plugin cache:
export TF_PLUGIN_CACHE_DIR=".terraform/cache/mountdir"
If you don’t use .terraform.lock.hcl files, you also have to set:
export TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE="true"
These only need to be set for terraform init; all subsequent Terraform commands will go off what they find in the .terraform directory.
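Put together, and scoping the variables to the one command that needs them, the init step could look roughly like this sketch:

# Set the cache variables only for terraform init itself
# (drop the second variable if you do commit .terraform.lock.hcl files).
TF_PLUGIN_CACHE_DIR=".terraform/cache/mountdir" \
TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE="true" \
terraform init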
3. After Terraform init
Once terraform init has completed, we’ll need to sync any new providers that might’ve been downloaded into the local cache back to the central cache:
if find .terraform/cache/upperdir -type f -print -quit | grep -q .; then
  flock --timeout 60 /path/to/cache \
    rsync --archive --ignore-existing .terraform/cache/upperdir/ /path/to/cache/
fi
Here’s a breakdown of what this does:
- rsync syncs whatever plugins in the local cache are missing from the central cache;
- we then wrap that with flock to serialize writes to the central cache, waiting for the lock no longer than the providers we’re feeding back are worth (60 seconds);
- and finally, we only do this if there’s anything at all we can find in our local cache.
It’s okay to write to the central cache while other Terraform inits are running: changes to lowerdir are not reflected in already-mounted overlay filesystems.
You can tweak the flock timeout to your own particular threshold; that trade-off is yours to make, and 60 seconds is just the default I chose.
Note: if you’re really going to try this, email me; the code here optimizes for illustrating the mechanism rather than performance, and there are better ways to do it.
4. After Terraform plan/apply
Once you’re done with Terraform, make sure to clean up:
umount .terraform/cache/mountdir
This is to prevent “device or resource busy” errors when deleting .terraform.
You should do this in a trap; the code here is just to illustrate the mechanism.
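A minimal sketch of what that could look like, assuming the paths used above:

# Unmount the overlay even if the run fails part-way through.
cleanup() {
  umount .terraform/cache/mountdir 2>/dev/null || true
}
trap cleanup EXIT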
Results
This implements a sort of write-back Terraform plugin cache using OverlayFS:
- providers are downloaded only if they are missing from the cache;
- concurrent Terraform inits write to their own local cache, fixing the original problem;
- and then we sync those back (if any) to the central cache for future inits to reuse.
If two concurrent inits are missing the same provider version, we will download it twice, but since we serialize rsync it will only be written back once to the central cache.
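For reference, here’s a hypothetical end-to-end wrapper stitching the steps above together for a single plan; it’s only a sketch of the mechanism, with error handling and tuning left out:

#!/usr/bin/env bash
# Hypothetical wrapper: one concurrent terraform init/plan sharing /path/to/cache.
set -euo pipefail

central_cache="/path/to/cache"
local_cache=".terraform/cache"

mkdir --parents "$central_cache" "$local_cache"/{upperdir,workdir,mountdir}
mount -t overlay overlay \
  -o "lowerdir=$central_cache,upperdir=$local_cache/upperdir,workdir=$local_cache/workdir" \
  "$local_cache/mountdir"
trap 'umount "$local_cache/mountdir"' EXIT

# Point Terraform at the merged view for this init only
# (add TF_PLUGIN_CACHE_MAY_BREAK_DEPENDENCY_LOCK_FILE if you don't use lock files).
TF_PLUGIN_CACHE_DIR="$local_cache/mountdir" terraform init

# Feed any newly downloaded providers back to the central cache.
if find "$local_cache/upperdir" -type f -print -quit | grep -q .; then
  flock --timeout 60 "$central_cache" \
    rsync --archive --ignore-existing "$local_cache/upperdir/" "$central_cache/"
fi

terraform plan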