My Dragonfly instance is configured with --cache_mode and --maxmemory=160G, running on a GCP Compute Engine VM with 22 vCPUs and 176 GB of RAM (c3-highmem-22, Intel Sapphire Rapids, x86_64, Debian with kernel 6.1.140-1).
I’ve noticed that when my .dfs snapshots are around 50 GB, the restore process works smoothly. However, once the snapshot size grows above ~150 GB, Dragonfly detects the file, starts loading, and then gets stuck almost immediately, typically before even reaching 1 GB of memory usage.
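For reference, the memory figure above is roughly what the container itself reports; a minimal way to watch it during the load (a sketch, assuming the container name from the run command at the end of this post):

# Poll the container's memory usage once a second while the snapshot loads
while true; do
  docker stats app_cache --no-stream --format "{{.MemUsage}}"
  sleep 1
done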
Initially, I suspected snapshot corruption and rebuilt the cache from scratch. But after using the SAVE command and restarting Dragonfly, the issue reappeared with the newly generated snapshot.
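The reproduction is essentially just this (a rough sketch; the container name and port match the run command at the end of this post):

# Write a fresh .dfs snapshot, then restart so Dragonfly loads it on startup
redis-cli -p 6379 SAVE
docker restart app_cache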
I’ve tried several approaches to work around this:
- Reduced proactor threads to 6, then to 4
- Changed disk type
- Recreated the VM from scratch
- Tuned various runtime parameters
The only configurations that successfully restore large snapshots (~150 GB+) are (the single-thread invocation is sketched below):
- Setting --proactor_threads=1
- Or enabling --force_epoll=true
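For concreteness, the single-threaded variant is just the startup command from the end of this post with --force_epoll removed and the proactor flag added (a sketch; log options omitted, everything else identical):

docker run -d --name "app_cache" --network host -v /data:/data -m "172g" \
  docker.dragonflydb.io/dragonflydb/dragonfly \
  --port "6379" --logtostderr --dir /data --cache_mode --maxmemory 160G \
  --proactor_threads=1

The epoll variant is the full command shown at the end of this post.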
Currently, I’m using --force_epoll, since restoring with a single thread is too slow. But I understand that Dragonfly prefers the default io_uring I/O engine for performance.
I’d appreciate any insight into why this might be happening and whether there’s a way to restore large .dfs snapshots using multiple threads without disabling io_uring.
For reference, this is how I start my Dragonfly instance:
docker run -d --name "app_cache" \
--network host \
--log-driver=gcplogs \
--log-opt gcp-log-cmd=true \
-v /data:/data \
-m "172g" \
docker.dragonflydb.io/dragonflydb/dragonfly \
--port "6379" \
--logtostderr \
--force_epoll=true \
--dir /data \
--cache_mode \
--maxmemory 160G
Thanks in advance!