Strange behavior with ZRANGEBYSCORE

Hi!

We have recently migrated from Redis to DragonflyDB, and we are experiencing some weird behavior with ZRANGEBYSCORE.

For one of our features, we have sorted sets holding numeric identifiers that are also used as scores. We usually query these sorted sets with ZREVRANGEBYSCORE or ZRANGEBYSCORE. So far we have not noticed any issue with the former, but there are responses from the latter that we don’t understand.

Specifically, ZRANGEBYSCORE sometimes returns an empty result when it should not.

For instance, the following query returns an empty result:

ZRANGEBYSCORE feed:home:109241526707263377 (113925066244287253 (+inf WITHSCORES LIMIT 0 5

While this other query, with a higher min (so it should match strictly fewer values), does return results:

ZRANGEBYSCORE feed:home:109241526707263377 (113928342598657450 (+inf WITHSCORES LIMIT 0 5
 1) "113928362768143367"
 2) "113928362768143360"
 3) "113928379038194137"
 4) "113928379038194140"
 5) "113928414353135798"
 6) "113928414353135790"
 7) "113928434576300296"
 8) "113928434576300290"
 9) "113928465903069141"
10) "113928465903069140"

If that’s of significance, 113925066244287253 should be the score (and value) of an existing member of the set.

One of the users who reported this issue to us figured out a range of values for which this was broken for their set. We ran a few more queries based on their findings.

The following queries return an empty set:

ZRANGEBYSCORE feed:home:109241526707263377 (113925073952926167 (+inf WITHSCORES LIMIT 0 5
ZRANGEBYSCORE feed:home:109241526707263377 (113925066244287241 (+inf WITHSCORES LIMIT 0 5

While the following queries do return values:

> ZRANGEBYSCORE feed:home:109241526707263377 (113925073952926168 (+inf WITHSCORES LIMIT 0 5
 1) "113925075936347721"
 2) "113925075936347730"
 3) "113925099958545694"
 4) "113925099958545700"
 5) "113925103091804133"
 6) "113925103091804130"
 7) "113925103802178321"
 8) "113925103802178320"
 9) "113925111911762497"
10) "113925111911762500"
> ZRANGEBYSCORE feed:home:109241526707263377 (113925066244287240 (+inf WITHSCORES LIMIT 0 5
 1) "113925066244287253"
 2) "113925066244287250"
 3) "113925073952926174"
 4) "113925073952926180"
 5) "113925075936347721"
 6) "113925075936347730"
 7) "113925099958545694"
 8) "113925099958545700"
 9) "113925103091804133"
10) "113925103091804130"

Hi @Claire,

Thank you for reporting this. A few questions please:

  • Can you please provide your Dragonfly version & hosting environment? (Linux version, distro, containerized or not, etc.)
  • If possible, a reproducible environment would help. Or perhaps those numbers could be made easier to read: I really appreciate your explanation, but I still find them a bit hard to follow… Sorry about that. :sweat_smile:
  • And you mentioned that it’s never an issue for the ZREVRANGEBYSCORE command?

Appreciate your time and concern.

Hi!

We are running DragonflyDB 1.26.1 on our own servers. They are containerized (Kubernetes 1.30.5) running on Ubuntu 22.04, amd64.

We have not seen issues with ZREVRANGEBYSCORE so far, but maybe the issues would manifest in ways we haven’t noticed yet.

Sorry about the numbers being difficult to read! In this set we store identifiers that are large integers (encoding, among other things, the time of creation), and each member uses that same number as its score. Because sorted-set scores are stored as doubles and these numbers are large, a member and its score are not always strictly equal, but that is not an issue for us.
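To illustrate why a member and its score can differ: with integers this large, rounding to the nearest double is expected. A quick Python sketch using one of the IDs from the queries above (just an illustration, not part of our code):

```python
# Sorted-set scores are IEEE 754 doubles; integers above 2**53 cannot all be
# represented exactly, so the stored score is rounded to the nearest double.
member_id = 113925066244287253      # one of the IDs from the queries above

score = float(member_id)            # what the score effectively becomes
print(score)                        # 1.1392506624428725e+17
print(int(score))                   # 113925066244287248 (nearest representable value)
print(int(score) == member_id)      # False: member and score are no longer equal
```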

As for giving a reproducible environment… I’m not sure how to do that.

That said, we did export the set with `redis-cli -D "" --raw dump 'feed:home:109241526707263377' > timeline_key.dump` at the time we noticed the issue.
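For reference, the payload can be loaded back under a different key with RESTORE; a minimal sketch with redis-py (any client exposing RESTORE works):

```python
# Minimal sketch: replay a raw DUMP payload under a new key via RESTORE.
# Assumes the file was written with the redis-cli command above
# (the -D "" keeps redis-cli from appending a trailing newline).
import redis

r = redis.Redis(host="localhost", port=6379)

with open("timeline_key.dump", "rb") as f:
    payload = f.read()

# ttl=0 means no expiry; replace=True overwrites the key if it already exists.
r.restore("debug:dragonfly:timeline", 0, payload, replace=True)
```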

Restoring it (to the new key debug:dragonfly:timeline) in DragonflyDB does give correct results:

ZRANGEBYSCORE debug:dragonfly:timeline (113925066244287253 (+inf WITHSCORES LIMIT 0 5
 1) "113925073952926174"
 2) "113925073952926180"
 3) "113925075936347721"
 4) "113925075936347730"
 5) "113925099958545694"
 6) "113925099958545700"
 7) "113925103091804133"
 8) "113925103091804130"
 9) "113925103802178321"
10) "113925103802178320"

If that helps, unless I’m missing something, the operations we do on such sets are ZADD key id id, ZREM key id, ZREMRANGEBYRANK key 0 -801, ZREVRANGE key 40 40 WITHSCORES, ZRANGEBYSCORE and ZREVRANGEBYSCORE.
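In case it helps with reproduction, here is a rough redis-py sketch of that workload (the key and ID are made up for illustration; the real ones are large snowflake-style integers like the ones above):

```python
# Rough sketch of the workload described above (redis-py). The key and ID are
# illustrative; members and scores are the same large integer identifiers.
import redis

r = redis.Redis(host="localhost", port=6379)

key = "feed:home:109241526707263377"
status_id = 113925066244287253

# Add / remove entries: the member string and the score are the same number.
r.zadd(key, {str(status_id): status_id})
r.zrem(key, str(status_id))

# Trim the feed to its 800 highest-scored (most recent) entries.
r.zremrangebyrank(key, 0, -801)

# Peek at the 41st newest entry.
r.zrevrange(key, 40, 40, withscores=True)

# Paginate: entries strictly newer / older than a given ID.
r.zrangebyscore(key, f"({status_id}", "+inf", start=0, num=5, withscores=True)
r.zrevrangebyscore(key, f"({status_id}", "-inf", start=0, num=5, withscores=True)
```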

To be clear, we don’t know what is causing this, but it happens frequently enough that many users (so different sorted sets under different keys, but with the same kind of operations) have been reporting it since our switch from Redis to DragonflyDB last week.

We are strongly considering moving back to Redis as a result, but is there a way we could export e.g. the in-memory version of the sets causing issues? (The RDB dump mentioned earlier does not seem to be enough to reproduce the issue.)

Tracking this issue because I am seeing the same behavior. I can only add that it happens “after some time” and doesn’t seem to be consistent.

Folks, we would really like to help; I am still trying to figure out how we can catch this. How did you identify queries that returned wrong results? Were these queries reproducible?

A possible reason for this is that the run-time structure of the sorted set changes once it’s dumped and restored.

We got reports from users and Mastodon app developers that timelines were not updating; see “mastodon.{social,online} returning empty JSON arrays for some home timeline queries with `min_id` set” (mastodon/mastodon#33805 on GitHub).

We identified that those apps were paginating through our min_id attribute (to get statuses that are newer than the last known status), then looked at this specific code path, which led us to reproducing the issue directly on DragonflyDB using the queries above.

Yes, we can reproduce them in our production environment, but not on another DragonflyDB instance that we recreated from a dump.

As an additional data point, it does not seem directly related to our environment, because another Mastodon admin discussed the same issue a few weeks ago when they tried to migrate their own instance: Dustin Rue, "Having swapped real Redis back in, at this point …" (Château Deluxe).

Hi @roman, I’m the original reporter of this issue for Mastodon (mastodon/mastodon#33805, linked above).

I discovered it because I develop the client Pachli (https://pachli.app) and noticed that Pachli was not showing any new posts. I was able to issue queries to the Mastodon API endpoints to confirm that no results were being shown. In the posts above Claire has converted those Mastodon API queries into the underlying Dragonfly queries.

When I observed the issue happening (and to be clear, it is still happening) it was 100% reproducible. I would notice the timeline in the app get stuck because, by chance, an ID for a post in my Mastodon timeline was one where querying for any posts newer than that ID would return no results.

I could repeat the queries multiple times, separated by tens of minutes, and get no results when results definitely should have been returned.

But incrementing or decrementing the ID by some amount would return results. You can see me experimenting with that in the linked GitHub issue.

Hope that helps.

I am sorry about this; I must say it’s the first time we have heard about such an issue.
@nikclayton thank you for providing the context!
@renchap can you please confirm:

  1. You can easily find and run a problematic query
  2. You have “admin” access to the Dragonfly server and can read its logs

Can you please run “DEBUG OBJECT <key>” where the key is the sorted set key that contains the problematic set?

Yes I have admin access to the server and can read the logs.

I am just back from FOSDEM, so I do not yet have all the information on how to find such a query, but once @Claire is back online we will find a problematic key and run this.

I am available on Discord at all times. Please look for @romange.

This is what I would like to ask from you once you identify a key/query that returns an empty (invalid) response.

  1. Record the query (e.g. ZRANGEBYSCORE debug:dragonfly:timeline (113925066244287253 (+inf WITHSCORES LIMIT 0 5)
  2. DEBUG OBJECT <key>
  3. ZCARD <key>
  4. ZRANGE <key> -1 -1 WITHSCORES, to check whether Dragonfly correctly identifies the maximum score
  5. Try to tweak the query to get results similar to what @Claire did, and provide the output (a quick script covering steps 1-4 is sketched below).
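Putting steps 1-4 together, something along these lines would do (a redis-py sketch; replace the key and bound with the ones from the failing query):

```python
# Sketch of diagnostic steps 1-4 (redis-py). The key and bound below are
# placeholders: use the key/query that returned the empty (invalid) response.
import redis

r = redis.Redis(host="localhost", port=6379)

key = "feed:home:109241526707263377"
min_bound = "(113925066244287253"   # exclusive lower bound from the failing query

# 1. The failing query itself.
print(r.zrangebyscore(key, min_bound, "+inf", start=0, num=5, withscores=True))

# 2. Internal representation of the key.
print(r.execute_command("DEBUG", "OBJECT", key))

# 3. Number of members in the set.
print(r.zcard(key))

# 4. Highest-scored member, to check the maximum score Dragonfly sees.
print(r.zrange(key, -1, -1, withscores=True))
```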

Also, if you still have timeline_key.dump or any other dump with the problematic set, that might help as well.

127.0.0.1:6379> ZRANGEBYSCORE feed:home:1288245 (113933650512566899 (+inf WITHSCORES LIMIT 0 5
(empty array)
127.0.0.1:6379> ZRANGEBYSCORE feed:home:1288245 (113933650512566899 (+inf WITHSCORES LIMIT 0 5
(empty array)
127.0.0.1:6379> DEBUG OBJECT feed:home:1288245
encoding:btree bucket_id:54 slot:2 shard:6
127.0.0.1:6379> ZCARD feed:home:1288245
(integer) 800
127.0.0.1:6379> ZRANGE feed:home:1288245 -1 -1 WITHSCORES
1) "113941710586812899"
2) "113941710586812900"

I sent you the dump on Discord.

Thanks, here is what I suggest:
I am going to build a custom build with the “self-analysis” code.

  1. The code will be triggered every time ZRANGEBYSCORE returns an empty response for a query in the form (num, +inf), and it will check the maximum item in the tree.
  2. If the maximum number in the tree is greater than or equal to num, the code will identify the inconsistent state.
  3. Once such a state is identified, it will try to dump enough information into the error log so we can narrow down, or maybe even solve, the problem (a rough client-side sketch of the check follows this list).
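The real check will live inside the server, but in client-side terms it is roughly equivalent to this sketch (placeholder connection and key):

```python
# Client-side approximation of the planned self-check; the actual check will
# run inside the server against the tree itself.
import redis

r = redis.Redis(host="localhost", port=6379)

def check_open_ended_query(key: str, num: int) -> None:
    """Run ZRANGEBYSCORE key (num +inf and flag an inconsistency if the
    response is empty although the set's maximum score is >= num."""
    result = r.zrangebyscore(key, f"({num}", "+inf", withscores=True)
    if result:
        return  # non-empty response, nothing to report

    tail = r.zrange(key, -1, -1, withscores=True)  # member with the maximum score
    if tail and tail[0][1] >= num:
        # Inconsistent state: the set holds a score >= num, yet the open-ended
        # range query came back empty. Log enough context to debug it.
        print(f"inconsistency on {key}: max score {tail[0][1]} >= {num}, "
              f"but ({num}, +inf) returned an empty result")
```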

The custom build will be based on 1.26.2 to make it close to your version.
Will you be able to update your server, once the new container is released?

That works for me. Is there a way to trigger this only when the client is localhost, so that it fires only during my manual testing? Otherwise it may result in a lot of data in the error log, as this server handles a lot of commands per second.

Also, I need to correct the information above: we made changes to our infrastructure and moved DragonflyDB out of Kubernetes (the operator had an issue and somehow caused a data wipe; this is unrelated and we will investigate it next).

The current setup is DragonflyDB 1.26.1 installed from the official .deb, running on Ubuntu 24.04.1, without any container.

Good that you told me; this will require a different release.

I will make sure the error logs won’t be triggered frequently.

Update: it seems we found the bug. I’ve pushed the fix to the main branch and we will follow up with a patch release in a few days :pray:

We’ve released version 1.26.3, which includes a fix for this issue. Thank you again for your help in resolving it quickly.
