Thank you so much!
Context
Let me start with a bit of context. I have a Python microservice that streams message chunks (LLM output) and a NestJS service that consumes them.
The producer uses XADD, while the consumer runs XREADGROUP in an infinite loop.
There are two consumer groups, each with a single replica. One saves the chunks to the DB, and the other emits them to the frontend via socket.io.
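To make the fan-out concrete, here is a toy in-memory model of how I understand the setup. It is not my actual code and it is not Redis: the class `ToyStream` and the group names `db-writer`, `socket-emitter` and `shared` are made up for illustration, and consumer names and acknowledgements are not modelled at all.

```python
from collections import defaultdict


class ToyStream:
    """In-memory stand-in for one stream with consumer groups (illustration only)."""

    def __init__(self):
        self.entries = []                    # append-only list of chunks
        self.next_index = defaultdict(int)   # group name -> index of next undelivered entry

    def xadd(self, chunk):
        self.entries.append(chunk)

    def xreadgroup(self, group, count=10):
        # Mimics reading with the ">" ID: only entries never delivered to this group.
        start = self.next_index[group]
        batch = self.entries[start:start + count]
        self.next_index[group] = start + len(batch)
        return batch


stream = ToyStream()
for chunk in ["Hel", "lo ", "wor", "ld"]:
    stream.xadd(chunk)

# Each group keeps its own cursor, so both groups see every chunk.
db_chunks = stream.xreadgroup("db-writer")
socket_chunks = stream.xreadgroup("socket-emitter")

# Two readers sharing ONE group split the chunks between them.
first_reader = stream.xreadgroup("shared", count=2)
second_reader = stream.xreadgroup("shared", count=2)
```

In this model a chunk can only go "missing" for a group if something else reads from the same group, which is one thing I still have to rule out in the real deployment.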
Observations
Everything was working fine, but last month I suddenly started noticing that chunks were missing from my NestJS logs. The producer logs show that the chunks were produced, and in the Redis Insight GUI I can see that they reached the Redis stream. But the consumer never received them: every chunk is logged at the point of receipt, and these chunks never showed up in the logs.
What I tried
I played around with the COUNT and BLOCK arguments of XREADGROUP, and that seemed to fix it, but nothing was conclusive and I was getting mixed results. Even after rolling back those changes, it kept working.
Things I tried: force-restarting the NestJS service, restarting Redis, and then switching from Redis to Dragonfly.
After switching to Dragonfly, it started working perfectly again; even with thorough testing, I could not reproduce the problem.
Devil strikes back
Yesterday, the same problem of chunks not being received happened again. I tried a mix of things like restarting, but it was solved after I deployed a fresh Redis instance (this time I switched from Dragonfly back to Redis).
New observations
The old stream entries accumulated since last month totalled around 500 KB. Only a few devs are using the prototype, and Redis doesn't delete acknowledged stream messages by default. Maybe I need to use XTRIM.
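If I do start trimming, I would want to avoid cutting entries that one of the two groups has not finished with. Below is a rough sketch of how a safe cut-off ID could be computed. The input shape is my own invention: I am assuming `last_delivered` would come from the `last-delivered-id` field of XINFO GROUPS and `oldest_pending` from the smallest pending ID reported by XPENDING. The group names are placeholders.

```python
def parse_id(stream_id):
    """Split a stream ID such as '1700000000000-3' into (milliseconds, sequence)."""
    ms, _, seq = stream_id.partition("-")
    return int(ms), int(seq or 0)


def safe_trim_id(groups):
    """Return the smallest ID that any group might still need, or None if no groups.

    Each item in `groups` is a dict (hypothetical shape) with:
      'last_delivered': the group's last-delivered ID
      'oldest_pending': oldest delivered-but-unacknowledged ID, or None
    """
    candidates = [g["oldest_pending"] or g["last_delivered"] for g in groups]
    if not candidates:
        return None
    # Compare numerically; plain string comparison would mis-order the IDs.
    return min(candidates, key=parse_id)


groups = [
    {"name": "db-writer", "last_delivered": "1700000000500-0",
     "oldest_pending": "1700000000200-1"},
    {"name": "socket-emitter", "last_delivered": "1700000000300-0",
     "oldest_pending": None},
]
cutoff = safe_trim_id(groups)
```

The resulting ID could then be passed to XTRIM with the MINID option, which removes only entries older than that ID. As far as I know MINID needs Redis 6.2 or later, and I have not checked whether Dragonfly supports it.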
RAM and CPU consumption are also very low. I am currently using Coolify for all deployments. Maybe the way Coolify manages deployments and networking introduces some instability.
Other things I tried
I am totally unable to figure out the root cause of this behavior, because there is not much load on the system. I tried brainstorming with Claude/Gemini and added error logging to the consumer code, but those logs were never printed.
Conclusion
What kind of monitoring can I set up to understand this? From anyone's experience, what can cause such problems?
For reference, I have attached the producer and consumer code, stripped of all unnecessary details.
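On the monitoring side, one check I am considering is gap detection. The sketch below is only the idea, shown in Python for brevity, and not something that exists in my services yet. It assumes the producer adds a `seq` field to every XADD entry, starting at 0 for each LLM response, and that each response has some identifier (called `message_id` here). Both names are hypothetical.

```python
class GapDetector:
    """Tracks per-message sequence numbers and records any that were skipped."""

    def __init__(self):
        self.last_seq = {}   # message_id -> highest seq seen so far
        self.missing = []    # (message_id, seq) pairs that were skipped

    def observe(self, message_id, seq):
        """Record one received chunk; return True if it was the next expected one."""
        expected = self.last_seq.get(message_id, -1) + 1
        if seq > expected:
            # Everything between the expected seq and this one never arrived.
            self.missing.extend((message_id, s) for s in range(expected, seq))
        # Duplicates or redeliveries (seq < expected) must not move the cursor back.
        self.last_seq[message_id] = max(seq, expected - 1)
        return seq == expected


detector = GapDetector()
for seq in [0, 1, 2, 5, 6]:
    detector.observe("msg-1", seq)

print(detector.missing)
```

Running the same check in both consumer groups would at least tell me whether a chunk is lost for one group only or for both, and exactly which chunk it was.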