Unresponsive during full_sync

Hi. We are running Dragonfly 1.12.0 with two nodes set up as master-replica. It works great in general, but we found that whenever a full sync happens, the master node goes unresponsive for the entire duration of the full sync. This appears to be the case both for the clients (mostly doing MGET and SET) and for the Sentinel nodes.

  1. Is this unresponsiveness expected?
  2. Let’s say we are willing to accept some data loss, is there any way to skip or turn off full-sync? Does this even make sense?

hi @excellent-wolf thanks for reporting the issue

by unresponsiveness you mean executing MGET and SET will get stuck?

it never returns?

i chatted with our team, and unresponsiveness should not be expected

@excellent-wolf how do you determine the stage of the replication? is your replica unstable for some reason, so that it keeps reconnecting? full sync is supposed to happen once at the beginning of replication, and that's it
(that is of course not to say that the server should be unresponsive during that process)
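For reference, one way to watch the sync state is to poll INFO replication on the replica, roughly like the sketch below. This assumes redis-py and the usual Redis-style field names (e.g. master_link_status, master_sync_in_progress); double-check which fields your Dragonfly version actually reports.

```python
# Rough sketch: poll the replica's INFO replication section to watch the sync state.
# Assumes redis-py and Redis-style field names; verify against your Dragonfly version.
import time
import redis

replica = redis.Redis(host="replica-host", port=6379, decode_responses=True)  # illustrative host

for _ in range(60):
    info = replica.info("replication")
    print(info.get("master_link_status"), info.get("master_sync_in_progress"))
    time.sleep(1.0)
```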

  • We took the replica node out to do an upgrade, and a full sync started when it rejoined. It lasted for about 40 seconds.
  • Those 40 seconds were long enough that most (if not all) of the MGET and SET requests timed out. We set the client-side timeout to 200 ms; these requests usually take only 1-2 ms (see the sketch after this list).
  • The Sentinel logs also kept reporting sdown/odown on the master node during the full-sync period.
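The client pattern looks roughly like the sketch below (assuming redis-py; the host name and keys are illustrative, only the 200 ms timeout mirrors our actual setup):

```python
# Rough sketch of our client pattern: SET/MGET with a 200 ms client-side timeout.
# Host and key names are illustrative; during full sync these calls start raising
# redis.TimeoutError instead of completing in the usual 1-2 ms.
import random
import redis

client = redis.Redis(
    host="dragonfly-master",   # illustrative host name
    port=6379,
    socket_timeout=0.2,        # 200 ms client-side timeout, as in our setup
    socket_connect_timeout=0.2,
)

keys = [f"key:{i}" for i in range(10_000)]

while True:
    try:
        client.set(random.choice(keys), "x" * 128)
        client.mget(random.sample(keys, 10))
    except redis.TimeoutError:
        # This is what we see for most requests while full sync is running.
        print("request timed out")
```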

do you have scheduled snapshot saves? if so, do you experience similar behavior during save?

@temperate-echidna we only do snapshots on shutdown. Can’t really say if we have the same behavior or not.

Now that we are sure this behavior is not expected we’ll try to come up with some reproducible example and get back to you.

@excellent-wolf can you please run redis-cli -3 --raw DEBUG OBJHIST

histogram.txt (2.95 KB)

Hi @excellent-wolf can you please share INFO ALL output?

We do not expect the server to be unresponsive and would like to investigate this. To do that we need to understand your server state and configuration, which I can see in INFO ALL.
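If a script is handier than the CLI, something like this rough sketch with redis-py (host name is illustrative) would capture the same data:

```python
# Rough sketch: dump INFO ALL to a file so it can be attached here.
# Host name is illustrative.
import json
import redis

r = redis.Redis(host="dragonfly-master", port=6379, decode_responses=True)
info = r.info("all")   # same data as `redis-cli INFO ALL`, parsed into a dict

with open("info_all.json", "w") as f:
    json.dump(info, f, indent=2, default=str)
```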

Hi. Just to put some closure on this… we spent some time trying to reproduce the issue and found that it happens reliably when we have enough SETs mixed in with the MGETs. I was going to report back to you all on how exactly we can reproduce this, but I just ran the test against the latest v1.15.0 and the problem seems to be gone already!

Glad to hear this, but I would still like to check whether you use a swap disk; maybe that is what caused it.

please let us know if it happens again; AFAIK we have not made any specific fixes in 1.15 that would solve this issue

@inclusive-numbat No I don’t see swap being used.
Some more information I can share:

  • We were still able to reproduce this as of 1.14.5.
  • If we only do MGETs without SETs then there’s no problem.
  • We have one test setup with two clients, and the SETs came from only one of them. The client doing only MGETs still experienced the issue.

The minimal setup is to run two Dragonfly instances as containers (they can be on the same or on different hosts) with the --cache_mode=true flag. Put in some data (we used a snapshot with 30M keys, which should have the same distribution as the histogram above), run a load test client against one node, and about halfway through the load test run, trigger a full sync by issuing the REPLICAOF command on the other node.
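The load-test half of that looks roughly like the sketch below (assuming redis-py; hosts, key counts, and timings are illustrative, and the REPLICAOF call at the halfway point is what kicks off the full sync):

```python
# Rough sketch of the repro: mixed SET/MGET load against one node, then trigger a
# full sync halfway through by pointing the second node at it with REPLICAOF.
# Hosts, key counts, and durations are illustrative; the real data set came from
# a 30M-key snapshot loaded beforehand.
import random
import threading
import time
import redis

MASTER = ("dragonfly-a", 6379)    # illustrative container hosts
REPLICA = ("dragonfly-b", 6379)

master = redis.Redis(*MASTER, socket_timeout=0.2)
keys = [f"key:{i}" for i in range(1_000_000)]

def load(duration_s: float) -> None:
    deadline = time.time() + duration_s
    while time.time() < deadline:
        try:
            master.set(random.choice(keys), "x" * 128)
            master.mget(random.sample(keys, 10))
        except redis.TimeoutError:
            print("request timed out")   # these pile up once full sync starts

t = threading.Thread(target=load, args=(120.0,))
t.start()

time.sleep(60.0)                          # about halfway through the run...
redis.Redis(*REPLICA).replicaof(*MASTER)  # ...trigger the full sync

t.join()
```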

thank you @excellent-wolf, this is useful. What kind of load test client did you run?