Unresponsive during full_sync

Hi. We are running Dragonfly 1.12.0 with two nodes set up as master-replica. It works great in general, but we found that whenever a full sync happens, the master node goes unresponsive for the entire duration of the full sync. This appears to be the case both for the clients (mostly doing MGET and SET) and for the Sentinel nodes.

  1. Is this unresponsiveness expected?
  2. Let’s say we are willing to accept some data loss, is there any way to skip or turn off full-sync? Does this even make sense?

hi @excellent-wolf thanks for reporting the issue

by unresponsiveness you mean executing MGET and SET will get stuck?

it never returns?

i chatted with our team, and unresponsiveness should not be expected

@excellent-wolf how do you determine the stage of the replication? is your replica unstable for some reason, so that it keeps reconnecting? full sync is supposed to happen once at the beginning of replication, and that's it
(that is of course not to say that the server should be unresponsive during that process)
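For reference, one way to watch the sync state is to poll INFO replication on the replica, roughly like the sketch below. This assumes redis-py and the usual Redis-style field names (e.g. master_link_status, master_sync_in_progress); double-check which fields your Dragonfly version actually reports.

```python
# Rough sketch: poll the replica's INFO replication section to watch the sync state.
# Assumes redis-py and Redis-style field names; verify against your Dragonfly version.
import time
import redis

replica = redis.Redis(host="replica-host", port=6379, decode_responses=True)  # illustrative host

for _ in range(60):
    info = replica.info("replication")
    print(info.get("master_link_status"), info.get("master_sync_in_progress"))
    time.sleep(1.0)
```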

  • We took the replica node out to do an upgrade, and a full sync started when it rejoined. It lasted for about 40 seconds.
  • Those 40 seconds were long enough that most (if not all) of the MGET and SET requests timed out. We set the client-side timeout to 200 ms; these requests usually take only 1-2 ms (see the sketch after this list).
  • The Sentinel logs also kept reporting sdown/odown on the master node during the full-sync period.
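The client pattern looks roughly like the sketch below (assuming redis-py; the host name and keys are illustrative, only the 200 ms timeout mirrors our actual setup):

```python
# Rough sketch of our client pattern: SET/MGET with a 200 ms client-side timeout.
# Host and key names are illustrative; during full sync these calls start raising
# redis.TimeoutError instead of completing in the usual 1-2 ms.
import random
import redis

client = redis.Redis(
    host="dragonfly-master",   # illustrative host name
    port=6379,
    socket_timeout=0.2,        # 200 ms client-side timeout, as in our setup
    socket_connect_timeout=0.2,
)

keys = [f"key:{i}" for i in range(10_000)]

while True:
    try:
        client.set(random.choice(keys), "x" * 128)
        client.mget(random.sample(keys, 10))
    except redis.TimeoutError:
        # This is what we see for most requests while full sync is running.
        print("request timed out")
```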

do you have scheduled snapshot saves? if so, do you experience similar behavior during save?

@temperate-echidna we only do snapshots on shutdown. Can’t really say if we have the same behavior or not.

Now that we are sure this behavior is not expected we’ll try to come up with some reproducible example and get back to you.

@excellent-wolf can you please run redis-cli -3 --raw DEBUG OBJHIST

histogram.txt (2.95 KB)

Hi @excellent-wolf can you please share INFO ALL output?

We do not expect the server to be unresponsive and would like to investigate this. To do that we need to understand your server state and configuration, which I can see in INFO ALL.
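If a script is handier than the CLI, something like this rough sketch with redis-py (host name is illustrative) would capture the same data:

```python
# Rough sketch: dump INFO ALL to a file so it can be attached here.
# Host name is illustrative.
import json
import redis

r = redis.Redis(host="dragonfly-master", port=6379, decode_responses=True)
info = r.info("all")   # same data as `redis-cli INFO ALL`, parsed into a dict

with open("info_all.json", "w") as f:
    json.dump(info, f, indent=2, default=str)
```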

Hi. Just to put some closure on this… we spent some time trying to reproduce the issue and found that it happens reliably when we have enough SETs mixed in with the MGETs. I was going to report back to you all on how exactly we can reproduce this, but I just ran the test against the latest v1.15.0 and the problem seems to be gone already!

Glad to hear this, but I would still like to check whether you use a swap disk; maybe that is what caused it.

please let us know if it happens again; AFAIK we have not made any specific fixes in 1.15 that would solve this issue

@inclusive-numbat No I don’t see swap being used.
Some more information I can share:

  • We were still able to reproduce this as of 1.14.5.
  • If we only do MGETs without SETs then there’s no problem.
  • We have one test setup with two clients, and the SETs came from only one of them. The client doing only MGETs still experienced the issue.

The minimal setup is to run two Dragonfly instances as containers (they can be on the same or on different hosts) with the --cache_mode=true flag. Put in some data (we used a snapshot with 30M keys, which should have the same distribution as the histogram above), run a load test client against one node, and about halfway through the load test run, trigger a full sync by issuing the REPLICAOF command on the other node.
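The load-test half of that looks roughly like the sketch below (assuming redis-py; hosts, key counts, and timings are illustrative, and the REPLICAOF call at the halfway point is what kicks off the full sync):

```python
# Rough sketch of the repro: mixed SET/MGET load against one node, then trigger a
# full sync halfway through by pointing the second node at it with REPLICAOF.
# Hosts, key counts, and durations are illustrative; the real data set came from
# a 30M-key snapshot loaded beforehand.
import random
import threading
import time
import redis

MASTER = ("dragonfly-a", 6379)    # illustrative container hosts
REPLICA = ("dragonfly-b", 6379)

master = redis.Redis(*MASTER, socket_timeout=0.2)
keys = [f"key:{i}" for i in range(1_000_000)]

def load(duration_s: float) -> None:
    deadline = time.time() + duration_s
    while time.time() < deadline:
        try:
            master.set(random.choice(keys), "x" * 128)
            master.mget(random.sample(keys, 10))
        except redis.TimeoutError:
            print("request timed out")   # these pile up once full sync starts

t = threading.Thread(target=load, args=(120.0,))
t.start()

time.sleep(60.0)                          # about halfway through the run...
redis.Redis(*REPLICA).replicaof(*MASTER)  # ...trigger the full sync

t.join()
```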

thank you @excellent-wolf, this is useful. What kind of load test client did you run?