Unable to restore from .dfs unless --force_epoll=true

My Dragonfly instance is configured with --cache_mode and --maxmemory=160G, running on a GCP Compute Engine VM with 22 vCPUs and 176 GB of RAM (c3-highmem-22, Intel Sapphire Rapids, x86_64, Debian kernel 6.1.140-1).

I’ve noticed that when my .dfs snapshots are around 50 GB, the restore process works smoothly. However, once the snapshot size grows above ~150 GB, Dragonfly detects the file, starts loading, and then gets stuck almost immediately, typically before even reaching 1 GB of memory usage.

Initially, I suspected snapshot corruption and rebuilt the cache from scratch. But after using the SAVE command and restarting Dragonfly, the issue reappeared with the newly generated snapshot.

I’ve tried several approaches to work around this:

  • Reduced proactor threads to 6, then to 4
  • Changed disk type
  • Recreated the VM from scratch
  • Tuned various runtime parameters

The only configurations that successfully restore large snapshots (~150 GB+) are:

  • Setting --proactor_threads=1
  • Or enabling --force_epoll=true

Currently, I’m using --force_epoll, since restoring with a single thread is too slow. But I understand that Dragonfly prefers the default io_uring I/O engine for performance.

I’d appreciate any insight into why this might be happening and whether there’s a way to restore large .dfs snapshots using multiple threads without disabling io_uring.

For reference, this is how I start my Dragonfly instance:

docker run -d --name "app_cache" \
    --network host \
    --log-driver=gcplogs \
    --log-opt gcp-log-cmd=true \
    -v /data:/data \
    -m "172g" \
    docker.dragonflydb.io/dragonflydb/dragonfly \
    --port "6379" \
    --logtostderr \
    --force_epoll=true \
    --dir /data \
    --cache_mode \
    --maxmemory 160G

Thanks in advance!

Hey @hgorni,

Please take a look at this GitHub issue.

There seems to be a bug in the kernel version you’re using. For Dragonfly Cloud, we run on 6.8.x versions. Is it possible to upgrade on your side?
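To quickly compare your environment against ours, you can check the running kernel version on the VM (the exact command output below is just an illustration; your version string will differ):

```shell
# Print the running kernel release; the hang has been observed on 6.1.x,
# while Dragonfly Cloud runs on 6.8.x kernels.
uname -r
# e.g. 6.1.0-37-cloud-amd64 on your current Debian image

# On GCP, switching the VM to a newer Debian image (e.g. Debian 12 backports
# kernel or a newer OS release) is usually the simplest way to get 6.8.x.
apt list --installed 2>/dev/null | grep linux-image
```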

This could be related to the maxmemory issue you have in the other thread as well.