Hi there, I’ve been undertaking a migration to Dragonfly at my company, and I’ve been trying to find the easiest way to corroborate the metrics we’re seeing in Prometheus with the live conditions on the machine. I’ve noticed that the `LATENCY` command is mostly unimplemented. Is that something that’s planned? In addition, would it be possible to add an error like `const char kUnimplementedErr[] = "currently unimplemented";` here (https://github.com/dragonflydb/dragonfly/blob/main/src/facade/facade.cc#L72-L87)? That would make things clearer here: https://github.com/dragonflydb/dragonfly/blob/main/src/server/server_family.cc#L2204-L2214. When I see `syntax error` I assume I’ve done something wrong, which contradicts the information provided by `HELP LATENCY`, which suggests that all of the subcommands are present (it might also be worth adding a note in the `HELP` output that these are non-functional). There is a log line, but if you aren’t reading the logs, then the server message is all you see. I’d be happy to help create a PR for this if it’s something you’re interested in. I’m unsure of other places where this pattern is used, so maybe it’s a greater undertaking to add this error and keep everything uniform. Love the product, love the idea, keep up the good work!
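To illustrate what I mean, here’s a small standalone sketch of the pattern (not Dragonfly’s actual code; the reply path and handler shape are simplified stand-ins, and `kUnimplementedErr` is just the name I’m proposing):

```cpp
#include <iostream>
#include <string_view>

// Existing-style error string plus the new one I'm proposing to add next to it.
const char kSyntaxErr[] = "syntax error";
const char kUnimplementedErr[] = "currently unimplemented";

// Stand-in for the real RESP reply builder; prints instead of writing to a socket.
void SendError(std::string_view msg) { std::cout << "-ERR " << msg << "\r\n"; }

// Simplified LATENCY dispatcher: LATEST returns an empty array today, while the
// other subcommands would reply with an explicit "unimplemented" error instead
// of a generic syntax error.
void HandleLatency(std::string_view sub_cmd) {
  if (sub_cmd == "LATEST") {
    std::cout << "*0\r\n";  // RESP empty array
    return;
  }
  SendError(kUnimplementedErr);
}

int main() {
  HandleLatency("LATEST");  // *0
  HandleLatency("DOCTOR");  // -ERR currently unimplemented
  return 0;
}
```

The idea is just that the client gets an explicit message instead of having to dig through the server logs.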
I’m also happy to move this to a GitHub issue.
Also, I’m aware of the recent changes surrounding latency metrics on the Prometheus endpoint (which aren’t in a published version yet), and I would love a bit of clarity about what units I should use in Grafana for the current version we’re on, v1.12.1, and the metric `dragonfly_commands_duration_seconds_total`.
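For context, the panel query I’ve been experimenting with is roughly the following (the `cmd` label is a guess from memory, so treat it as a placeholder and check the actual /metrics output):

```promql
# Per-command time spent executing, per second of wall clock.
# Whether Grafana's unit should be "seconds (s)" or "milliseconds (ms)"
# depends on what v1.12.1 actually reports, which is what I'm asking above.
sum by (cmd) (rate(dragonfly_commands_duration_seconds_total[5m]))
```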
Could you clarify regarding the `LATENCY` command: which scenarios would you like to see?
probably `LATENCY DOCTOR`, `LATENCY LATEST`, and `LATENCY HISTOGRAM`, although I’m now realizing I may have been mistaken in running `HELP LATENCY`, because it appears that might run on the client rather than the server. I’ve been running the `HELP` command from `redis-cli` this whole time.
in general, more statistics on latency, both in the application and in the metrics, would be appreciated
`dragonfly_commands_duration_seconds_total` has the wrong units, but that will be fixed in our next version
yeah I saw that yesterday, thanks!
in our latest version the reported numbers are in ms instead of seconds
there are no immediate plans to implement the `latency` command. Having said that, please open an issue and explain why you need it in addition to the metrics. Basically, if the reasoning makes sense and several folks vote on the issue, we raise its priority.
certainly, I’ll consider doing that in the next few days. Long story short, we received new bare-metal hardware for our Dragonfly instance to run on, and I was a bit confused by the latency in the synthetic testing I was doing. That, combined with the confusing Prometheus numbers, led me to the `latency` command. I still need to run with `memtier_benchmark` instead of my scripts to get a better idea of performance.
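Something along these lines is what I was planning to start from (all of the numbers are placeholders for our workload, not a recommendation):

```sh
# Hypothetical starting point: 8 threads x 25 connections each, pipeline of 10,
# 1:10 SET:GET ratio, 256-byte values, random keys over a 1M keyspace, 2-minute run.
memtier_benchmark --server 127.0.0.1 --port 6379 \
  --threads 8 --clients 25 --pipeline 10 \
  --ratio 1:10 --data-size 256 \
  --key-pattern R:R --key-maximum 1000000 \
  --distinct-client-seed --test-time 120 --hide-histogram
```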
sure. if you need help with configuring memtier to get realistic results let us know
I will! thanks again for everything y’all are doing
may I ask what domain or area your company is in?
yeah, we’re in Mobile Adtech, mostly the Mobile Measurement Partner space (e.g., providing marketing teams with insights into the performance of their ad material and app interactions)
got it. and why dragonfly?
I’ve been following the project for quite a while; I picked it up initially from a HackerNews post and discussed with our CTO the opportunities for new technology to enter our stack. We maintain a roughly 50/50 split between colocation and cloud solutions, and in our colocation we are fighting against aging infrastructure: most of our machines are from 2014-2017 and showing their age at this point. We’ve historically been a Redis/Memcached house, and our traffic volumes and use cases dictated running multiple redis-server instances on one machine (up to 10 in some cases). I wasn’t around when the original architecture decisions were made that ended us up in that place, but I’m looking to centralize caching and modernize our infrastructure where I can. Dragonfly was a super convenient option given its “drop-in” ability and familiar interfaces.
most of our caching is to take load off our databases and support real-time fraud products