Hi there,
We are currently deploying multiple Dragonfly clusters that are configured to snapshot to S3, and we are seeing a very odd problem with one of those clusters.
We have created a bucket - let’s call it df-snapshots - and in each of the Dragonfly instances we define the snapshot config as:

```yaml
snapshot:
  cron: '*/5 * * * *'
  dir: s3://df-snapshots/clusterX
```

where X is a number from 1 to 22.
You will notice that there is no trailing slash on the dir config. We found that our cluster1 cluster was failing to start up: it would log that it was searching for a snapshot, eventually time out, and the readiness probes would kill it.
We realised that, without the slash, it was effectively trying to read from cluster1* - a prefix that also matches cluster10 through cluster19, which at the time contained a lot of snapshots. Adding the trailing slash to the dir seemed to fix it: we restarted the pods and they came up instantly.
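To illustrate the prefix behaviour we believe we were hitting, here is a minimal boto3 sketch (bucket and folder names as above, credentials assumed to be configured in the environment; the first page of results is enough to show the collision):

```python
import boto3

s3 = boto3.client("s3")

# Without a trailing slash, "cluster1" is a bare key prefix and also
# matches keys under cluster10/ ... cluster19/:
resp = s3.list_objects_v2(Bucket="df-snapshots", Prefix="cluster1")
print(sorted({obj["Key"].split("/")[0] for obj in resp.get("Contents", [])}))
# e.g. ['cluster1', 'cluster10', 'cluster11', ..., 'cluster19']

# With the trailing slash, only cluster1's own snapshots are listed:
resp = s3.list_objects_v2(Bucket="df-snapshots", Prefix="cluster1/")
print(sorted({obj["Key"].split("/")[0] for obj in resp.get("Contents", [])}))
# ['cluster1']
```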
However, we later found the same issue happening again: pods hanging on startup while searching for a snapshot.
We can replicate this by deleting the cluster1 folder from S3; the pod then starts with the log line:

```
W20260128 11:34:07.311956 1 server_family.cc:945] Load snapshot: No snapshot found
```

but as soon as a snapshot is created in that location and we roll the pod, we get the same hanging behaviour.
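For completeness, "deleting the cluster1 folder" means deleting every object under the cluster1/ prefix, since S3 has no real folders. Roughly what we do, again as an illustrative boto3 sketch with the names from above:

```python
import boto3

# Delete every object whose key starts with "cluster1/", i.e. the
# "folder" holding that cluster's snapshots.
s3 = boto3.resource("s3")
s3.Bucket("df-snapshots").objects.filter(Prefix="cluster1/").delete()
```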
We can edit the DF config to point to a completely different folder - e.g. cluster123 - and it appears to work just fine.
While trying to figure this out, I temporarily upgraded to 1.36.0 (we’re currently on 1.31.0) and saw an equally perplexing issue. In this case, pointing to the cluster1 folder works just fine: I can see snapshots being created and the pods can restart. However, the pods always log the following on startup, even though snapshots clearly exist at that location:

```
W20260128 11:34:07.311956 1 server_family.cc:945] Load snapshot: No snapshot found
```
What is going on here?