Unusual snapshot behaviour

Hi there,

We are currently deploying multiple dragonfly clusters which are configured to snapshot to S3. We are seeing a very odd problem with one of those clusters.

We have created a bucket - let’s call it df-snapshots - and in each of the Dragonfly instances we define the snapshot config as:

  snapshot:
    cron: '*/5 * * * *'
    dir: s3://df-snapshots/clusterX

where X is a number from 1 to 22.
You will notice that there is no trailing slash on the dir config. We found that our cluster1 cluster was failing to start up: it would log that it was searching for a snapshot, eventually time out, and the readiness probes would kill it.
With no slash in place, we realised it was trying to read everything under cluster1* (which, at the time, held a lot of snapshots - the prefix also matches cluster10 through cluster19, and so on). We resolved this by adding the trailing slash to dir, and the problem looked resolved - we restarted the pods and they came up instantly.
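For what it’s worth, here is a minimal sketch of the prefix behaviour we suspect is at play - S3 listings match keys by raw string prefix, with no notion of a folder boundary (key names below are made up for illustration):

```python
# Simulate S3 ListObjectsV2 prefix matching: keys are matched by plain
# string prefix, so "cluster1" also matches cluster10, cluster12, etc.
def list_keys(keys, prefix):
    return sorted(k for k in keys if k.startswith(prefix))

keys = [
    "cluster1/snap-0001.dfs",
    "cluster10/snap-0001.dfs",
    "cluster12/snap-0001.dfs",
    "cluster2/snap-0001.dfs",
]

# Without the trailing slash, other clusters' snapshots are swept in.
print(list_keys(keys, "cluster1"))
# ['cluster1/snap-0001.dfs', 'cluster10/snap-0001.dfs', 'cluster12/snap-0001.dfs']

# With the trailing slash, only cluster1's own snapshots match.
print(list_keys(keys, "cluster1/"))
# ['cluster1/snap-0001.dfs']
```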
However, we later found that the same issue was happening again - pods failing while searching for snapshot on startup.
We can replicate this by deleting the cluster1 folder from S3; the pod then starts with the log line:
W20260128 11:34:07.311956 1 server_family.cc:945] Load snapshot: No snapshot found
but as soon as a snapshot is created in that location and we roll the pod, we get the same hanging behaviour.
We can edit the DF config to point to a completely different folder - e.g. cluster123 and it appears to work just fine.

While trying to figure this out, I temporarily upgraded to 1.36.0 (we’re currently on 1.31.0) and saw an equally perplexing issue. In this case, pointing to the cluster1 folder works just fine - I can see snapshots being created and the pods can restart. However, on startup the pods always log:
W20260128 11:34:07.311956 1 server_family.cc:945] Load snapshot: No snapshot found

What is going on here?

Hi @adochan,

Thanks for sharing the issue with us. I assume you are running Dragonfly using the K8s operator? Do you mind sharing more information? I will then ask the engineering team to check.

  • Dragonfly version: 1.31.0 or 1.36.0
  • Dragonfly K8s operator version: (e.g., 1.4.0)
  • OS: (e.g., Ubuntu 20.04)
  • Kernel: (e.g., the output of uname -a)
  • Containerized?: Kubernetes

Here you go:

Dragonfly:
image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.31.0

Operator:
image: docker.dragonflydb.io/dragonflydb/operator:v1.2.1

OS:
Ubuntu 22.04.5 LTS

Kernel:
Linux vector-state-gen1-0 6.12.63-84.121.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Dec 31 02:07:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Yeah this is running on k8s - EKS specifically.

Hi @joezhou_df,

Did you look into this at all? I just returned to it and retested after significantly reducing the volume of data in our S3 bucket, and it now behaves better - it identifies the snapshots in the bucket in around 30 seconds and starts successfully. It seems like there may be some inefficient S3 path traversal happening in there somewhere.

Thanks