Redis/Dragonfly stream intermittently does not deliver chunks: unreliable behaviour, or am I missing something?

Thank you so much!

Context

I have a Python microservice that streams message chunks (LLM output) and a NestJS service that consumes them.

The producer uses XADD, while the consumer runs XREADGROUP in an infinite loop.
There are two consumer groups, each with a single replica. One saves the chunks to the database, and the other emits them to the frontend via socket.io.

Observations

Everything was working fine, but last month I suddenly started noticing that chunks were missing from my NestJS logs. The producer service logs show that the chunks are produced, and in the Redis Insight GUI I can see that they reach the Redis stream. But the consumer never received them: every chunk is logged at the point of receipt, and those log lines are absent.

What I tried

I played around with the COUNT and BLOCK arguments of XREADGROUP, and that seemed to fix it, but nothing was conclusive; I was getting mixed results. Even after I rolled the changes back, it kept working.

Things I tried: force-restarting the NestJS service, restarting Redis, and then switching from Redis to Dragonfly.
After switching to Dragonfly, it started working perfectly again; even after thorough testing I could not reproduce the problem.

Devil strikes back

Yesterday the same problem of chunks not being received came back. I tried a mix of things such as restarting services, but it was only solved after I deployed a fresh Redis (this time I switched from Dragonfly back to Redis).

New observations

The old stream entries accumulated over the last month totalled around 500 KB. Only a few devs are using the prototype, and Redis doesn't delete acknowledged stream messages by default. Maybe I need to use XTRIM.
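
If trimming turns out to be worthwhile, one option is to cap the stream at write time with the MAXLEN option of XADD instead of running XTRIM separately. The sketch below only builds the raw command arguments; the helper name buildXaddArgs, the stream key, the field names, and the cap of 10000 entries are all made up for illustration. It is written in TypeScript, but the raw argument order is the same from any client, including the Python producer.

```typescript
// Hypothetical helper: builds the arguments for
//   XADD <key> MAXLEN ~ <maxLen> * field value [field value ...]
// "~" asks the server for approximate trimming, which is cheaper than exact.
function buildXaddArgs(
  key: string,
  maxLen: number,
  fields: Record<string, string>,
): string[] {
  if (!Number.isInteger(maxLen) || maxLen <= 0) {
    throw new RangeError("maxLen must be a positive integer");
  }
  // "*" lets the server generate the entry ID.
  const args: string[] = [key, "MAXLEN", "~", String(maxLen), "*"];
  for (const [field, value] of Object.entries(fields)) {
    args.push(field, value);
  }
  return args;
}

// Illustrative values only.
const xaddArgs = buildXaddArgs("llm:chunks", 10000, {
  messageId: "m1",
  seq: "0",
  text: "Hello",
});
console.log(["XADD", ...xaddArgs].join(" "));
// prints "XADD llm:chunks MAXLEN ~ 10000 * messageId m1 seq 0 text Hello"
```

Whether stream size has anything to do with the missing chunks is not established here; at 500 KB it seems unlikely, so this is housekeeping rather than a fix.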

RAM and CPU consumption are also very low. I am currently using Coolify for all deployments. Maybe the way Coolify manages deployments and networking introduces some instability.

Other things I tried

I am totally unable to figure out the root cause of this behavior, because there is not much load on the system. I tried brainstorming with Claude/Gemini and added error logging to the consumer code, but those logs were never printed.

Conclusion

What kind of monitoring can I set up to understand this? From anyone's experience, what can cause such problems?
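
One lightweight form of monitoring is to have the producer stamp every chunk with a sequence number and let each consumer report gaps as soon as it sees them. This is a minimal sketch under assumptions that are not in the original code: a hypothetical messageId field identifying the LLM response, a hypothetical seq field that starts at 0 and increases by 1 per chunk, and a made-up class name ChunkGapDetector.

```typescript
// Tracks the highest sequence number seen per message and reports any
// sequence numbers that were skipped. Field names are assumptions.
class ChunkGapDetector {
  private readonly lastSeq = new Map<string, number>();

  // Returns the sequence numbers missing between the previously seen chunk
  // and this one (an empty array means no gap).
  observe(messageId: string, seq: number): number[] {
    const previous = this.lastSeq.get(messageId) ?? -1; // seq starts at 0
    const missing: number[] = [];
    for (let s = previous + 1; s < seq; s++) {
      missing.push(s);
    }
    if (seq > previous) {
      this.lastSeq.set(messageId, seq);
    }
    return missing;
  }

  // Call when a message is complete so the map does not grow forever.
  forget(messageId: string): void {
    this.lastSeq.delete(messageId);
  }
}

// Usage inside the consumer loop, right where chunks are logged on receipt.
const detector = new ChunkGapDetector();
detector.observe("m1", 0);
detector.observe("m1", 1);
const gap = detector.observe("m1", 4);
if (gap.length > 0) {
  console.warn(`message m1: missing chunks ${gap.join(", ")}`);
}
```

Logging the stream entry ID next to each gap would make it possible to look the missing entries up in the stream afterwards and check whether they were ever delivered to the group.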

For reference, I have attached the producer and consumer code, stripped of all unnecessary details.

It seems that the issue is on the consumer side, and it happens to both Redis and Dragonfly.

Given that, I strongly suspect the application itself has a logic error somewhere. Here are a few suggestions I can think of right now:

  • A TypeScript as assertion is checked only at compile time and does nothing at runtime. Is it possible that the as RedisStreamResponse cast hides a mismatch, because the type definition doesn't capture all possible response shapes, and the entries then get dropped silently?
  • Working directly with streams can be hard. If feasible, you could consider a mature library/framework such as BullMQ, which saves you from working with streams directly.
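
To make the first suggestion concrete, here is a minimal sketch of validating the reply at runtime instead of casting it. It assumes the client hands back the raw nested-array reply for XREADGROUP (null when BLOCK times out, otherwise a list of [streamName, entries] pairs where each entry is [id, flatFieldList]); the actual shape depends on the client library and should be checked against the real code. The names parseStreamReply, StreamEntry and ParseResult are made up for this example.

```typescript
interface StreamEntry {
  stream: string;
  id: string;
  fields: Record<string, string>;
}

interface ParseResult {
  entries: StreamEntry[];
  problems: string[]; // anything that did not match the assumed shape
}

// Validates the assumed raw reply shape and records every mismatch instead
// of silently skipping it.
function parseStreamReply(reply: unknown): ParseResult {
  const result: ParseResult = { entries: [], problems: [] };
  if (reply === null || reply === undefined) {
    return result; // BLOCK timed out, nothing to read
  }
  if (!Array.isArray(reply)) {
    result.problems.push(`reply is not an array: ${typeof reply}`);
    return result;
  }
  for (const streamPart of reply) {
    if (
      !Array.isArray(streamPart) ||
      typeof streamPart[0] !== "string" ||
      !Array.isArray(streamPart[1])
    ) {
      result.problems.push(`unexpected stream element: ${JSON.stringify(streamPart)}`);
      continue;
    }
    const stream: string = streamPart[0];
    for (const rawEntry of streamPart[1]) {
      if (!Array.isArray(rawEntry) || typeof rawEntry[0] !== "string") {
        result.problems.push(`unexpected entry: ${JSON.stringify(rawEntry)}`);
        continue;
      }
      const id: string = rawEntry[0];
      const flat: unknown = rawEntry[1];
      // The field list can be null, for example for a deleted entry that is
      // re-read from the pending list.
      if (!Array.isArray(flat) || flat.length % 2 !== 0) {
        result.problems.push(`entry ${id} has no usable field list`);
        continue;
      }
      const fields: Record<string, string> = {};
      for (let i = 0; i < flat.length; i += 2) {
        fields[String(flat[i])] = String(flat[i + 1]);
      }
      result.entries.push({ stream, id, fields });
    }
  }
  return result;
}
```

In the consumer loop, logging result.problems on every iteration would show whether replies are arriving in a shape the current type definition does not cover.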

Just my two cents above, and I hope it helps. I have a gut feeling that something is wrong with the application code. But on the other hand the loop looks fine to me, so it could be something subtle.