Let's say I have Java metrics being pushed to Sumo Logic for all my worker machines. I wish to set up an alert for when the 'used' JVM memory reaches a certain threshold relative to the total memory of each machine.
However, I only want this alert to fire when a certain number of my worker population are concurrently running out of memory.
In an attempt to set this up, I came up with two queries.
Query 1: Returns a count of unique sourceHosts in a 15-minute window, giving me the total number of my worker machines.
_sourceName="/var/log/jvm_metrics.log"
| parse regex "used=(?<used>\d*)" nodrop
| parse regex "total=(?<total>\d*)" nodrop
| count_distinct(_sourceHost) as total_workers
Query 2: Returns a count of any worker machines that have a ratio higher than a certain threshold for used memory/total memory.
_sourceName="/var/log/jvm_metrics.log"
| parse regex "used=(?<used>\d*)" nodrop
| parse regex "total=(?<total>\d*)" nodrop
| num(used)/1000/1000 as used
| num(total)/1000/1000 as total
| used/total as ratio
| where ratio > 0.9
| count_distinct(_sourceHost) as issue_workers
My plan was to create a second ratio showing how many of my workers were having JVM memory issues. However, I can't work out how to merge these two queries. I have researched joins and subqueries but have had no luck getting anything to work.
I want to combine these two queries to find how many issue workers there are compared to the total workers. If the proportion goes above a threshold, say 20%, it should trigger my monitor.
For your end goal (throw an alert if the number of workers running out of memory is greater than X) it would be better to use SLOs.
An SLO will calculate the ratio between "issue" and "total" workers for a given window (15m), and comes with a pre-built dashboard and the option to attach multiple monitors. I've found the pre-built dashboard quite useful and SLOs easier to manage overall.
The SLO would be "Window-Based". The "Unsuccessful Events" query would be the count of problematic workers (note the `timeslice` operator, which is needed before you can group by `_timeslice`):
_sourceName="/var/log/jvm_metrics.log"
| parse regex "used=(?<used>\d*)" nodrop
| parse regex "total=(?<total>\d*)" nodrop
| timeslice 15m
| num(used)/1000/1000 as used
| num(total)/1000/1000 as total
| used/total as ratio
| where ratio > 0.9
| count_distinct(_sourceHost) as issue_workers by _timeslice
And the "Total Events" query would be the count of all workers:
_sourceName="/var/log/jvm_metrics.log"
| parse regex "used=(?<used>\d*)" nodrop
| parse regex "total=(?<total>\d*)" nodrop
| timeslice 15m
| count_distinct(_sourceHost) as total_workers by _timeslice
Then you can define the target as needed.
Now, if you're determined to use a single query, you could try casting values manually and using them as counters for aggregation.
It's hard to confirm this exact query will work without seeing your data, but I hope it conveys the idea.
_sourceName="/var/log/jvm_metrics.log"
| parse regex "used=(?<used>\d*)" nodrop
| parse regex "total=(?<total>\d*)" nodrop
| num(used)/1000/1000 as used
| num(total)/1000/1000 as total
| used/total as ratio
| if(ratio > 0.9, 1, 0) as has_issues
| count_distinct(_sourceHost) as issue_workers by has_issues
| where has_issues = 1
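Building on that idea, here is a hedged sketch of getting the percentage itself in one query. It assumes Sumo Logic's `total` operator (which appends a grand-total column across all rows of the aggregated results) is available to you, and that no host emits both high-ratio and low-ratio messages in the same window (a host appearing in both groups would be counted twice):
```
_sourceName="/var/log/jvm_metrics.log"
| parse regex "used=(?<used>\d*)" nodrop
| parse regex "total=(?<total>\d*)" nodrop
| num(used)/1000/1000 as used
| num(total)/1000/1000 as total
| used/total as ratio
| if(ratio > 0.9, 1, 0) as has_issues
| count_distinct(_sourceHost) as workers by has_issues
| total workers as all_workers
| where has_issues = 1
| workers / all_workers as issue_pct
| where issue_pct > 0.2
```
A monitor on this query would then alert whenever any result row is returned, i.e. whenever more than 20% of workers are over the memory threshold.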