Hi. We are running Dragonfly 1.12.0 with 2 nodes set up as master-replica. It works great in general, but we found that whenever a full sync happens, the master node goes unresponsive for the entire duration of the full sync. This appears to be the case both for the clients (mostly doing MGET and SET) and for the Sentinel nodes.
- Is this unresponsiveness expected?
- If we are willing to accept some data loss, is there any way to skip or turn off full sync? Does this even make sense?
hi @excellent-wolf thanks for reporting the issue
by unresponsiveness you mean executing MGET and SET will get stuck?
i chatted with our team, and unresponsiveness should not be expected
@excellent-wolf how do you determine the stage of the replication? is your replica unstable for some reason, so that it keeps reconnecting? because full sync is supposed to happen once at the beginning of the replication, and that’s it
(that is of course not to say that the server will be unresponsive during that process)
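(for reference, one way to check the replication stage from the master side is INFO REPLICATION; this is only a sketch, MASTER_HOST is a placeholder and the exact field names can vary between Dragonfly versions)

    # MASTER_HOST is a placeholder; look at the replica state fields in the output
    redis-cli -h "$MASTER_HOST" -p 6379 INFO REPLICATION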
do you have scheduled snapshot saves? if so, do you experience similar behavior during save?
@temperate-echidna we only do snapshots on shutdown. Can’t really say if we have the same behavior or not.
Now that we are sure this behavior is not expected, we’ll try to come up with a reproducible example and get back to you.
@excellent-wolf can you please run redis-cli -3 --raw DEBUG OBJHIST
Hi @excellent-wolf can you please share the INFO ALL output?
We do not expect the server to be unresponsive and would like to investigate it. Therefore we need to understand your server state and configuration, which we can see in the INFO ALL output.
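For example, something like the following would capture both outputs to files (MASTER_HOST and the port are placeholders, not from your setup):

    # capture the object histogram and the full server info for the investigation
    redis-cli -3 -h "$MASTER_HOST" -p 6379 --raw DEBUG OBJHIST > objhist.txt
    redis-cli -h "$MASTER_HOST" -p 6379 INFO ALL > info_all.txt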
Hi. Just to put some closure on this… we spent some time trying to reproduce the issue, and found that it happens reliably when we have enough SETs mixed in with the MGETs. I was going to report back to you all on exactly how we can reproduce this, but I just ran the test against the latest v1.15.0 and the problem seems to be gone already!
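To give a rough idea of the kind of mix we mean (this is not our actual load test client, just a crude illustration with made-up keys; MASTER_HOST is a placeholder):

    # crude illustration of a mixed SET/MGET workload, not our real client
    while true; do
      redis-cli -h "$MASTER_HOST" SET "key:$RANDOM" some-value > /dev/null
      redis-cli -h "$MASTER_HOST" MGET key:1 key:2 key:3 > /dev/null
    done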
Glad to hear this, but I would still like to see whether you use a swap disk, since maybe that is what caused it
please let us know if it happens again; AFAIK we have not made any specific fixes in 1.15 to solve this issue
@inclusive-numbat No, I don’t see swap being used.
Some more information I can share:
- We were still able to reproduce this as of 1.14.5.
- If we only do MGETs without SETs then there’s no problem.
- We have one test setup with two clients, and SETs came from only one of them. The client with only MGET still experienced the issue.
The minimal setup is to run two Dragonfly instances as containers, on the same or different hosts, with the "--cache_mode=true" flag. Put in some data (we used a snapshot with 30M keys, which should have the same distribution as the histogram above), run a load test client on one node, and about halfway through the load test run, trigger a full sync by issuing the REPLICAOF command.
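As a rough sketch of that setup (the image path, container names, ports, and MASTER_HOST below are illustrative placeholders, not exactly what we ran):

    # start two Dragonfly containers with cache mode enabled
    docker run -d --name df-master -p 6379:6379 docker.dragonflydb.io/dragonflydb/dragonfly --cache_mode=true
    docker run -d --name df-replica -p 6380:6379 docker.dragonflydb.io/dragonflydb/dragonfly --cache_mode=true
    # load the dataset on the master and run the load test client against port 6379,
    # then about halfway through the run trigger the full sync on the replica:
    redis-cli -p 6380 REPLICAOF "$MASTER_HOST" 6379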
thank you @excellent-wolf this is useful. What kind of load test client did you run?