We have an issue where an unknown process (on Linux, Ubuntu 18.04) is maxing out the CPU and causing cloud-based instances to become unresponsive and require a power-cycle (we have no console access for troubleshooting). We are monitoring CPU usage as a whole, but this obviously doesn’t give us the granularity we need to identify the problematic process. Is there an easy way (or any way) of monitoring CPU usage on a per-process basis so we can establish which particular process could be causing this? I have tried extensively searching for something like this, but am yet to find a way to monitor CPU activity per process.

It doesn’t even have to be for all processes; it could be for, say, just the top 5, but I’m not sure how this would work for logging/display of the data? Or whether, if the CPU becomes maxed out and the CMK agent becomes unresponsive, we would actually end up catching the problematic process if it was outside the top 5 before the CPU maxed out and the instance stopped responding to polls.
Good morning! The issue with the “HTML display” in your browser is most likely attributable to the fact that HTML codes are escaped per default, since they are considered “insecure”. You can change this behaviour with a rule per host/service.

You could also try to use the “Process discovery” rule differently: try a sane regular expression instead of “Match all processes” and use “%s” as the “Process Name”. The “downside” is that this will result in a new service for every process it discovers that way. Subsequently, you could create a combined graph of all those processes, and you would have all the information in a single graph. Note: this approach might be a problem with the number of hosts you’re referring to, in terms of performance and/or licensing (if you use CEE).

Another possibility would be to write a local check that e.g. runs top in batch mode, returning the “top ten” processes. With properly formatted output for the CMK agent, you should get the information you need into CMK more easily.
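For the local-check idea, here is a minimal sketch of what such a script could look like. It is my own illustration, not a tested production check: it parses `ps` output rather than scraping top’s batch mode, and the service name `Top_CPU_Processes` and the rank-based metric names are invented. It emits one line in checkmk’s local-check format (`<state> <service_name> <metrics> <detail>`):

```python
#!/usr/bin/env python3
"""Sketch of a checkmk local check: report the five busiest processes.

Place in the agent's local/ directory (on Linux typically
/usr/lib/check_mk_agent/local/) and make it executable."""
import re
import subprocess

# ps gives stable, parseable columns; the CPU sort is done by ps itself
out = subprocess.check_output(
    ["ps", "-eo", "pcpu,comm", "--sort=-pcpu", "--no-headers"],
    universal_newlines=True,
)

metrics = []
details = []
for rank, line in enumerate(out.splitlines()[:5], start=1):
    pcpu, comm = line.split(None, 1)
    comm = comm.strip()
    # Metric names must be unique and plain, so key them by rank and
    # sanitize the command name (e.g. "kworker/0:1" contains '/' and ':').
    name = "rank%d_%s" % (rank, re.sub(r"[^A-Za-z0-9_]", "_", comm))
    metrics.append("%s=%s" % (name, pcpu))
    details.append("%s: %s%%" % (comm, pcpu))

# Local check format: <state> <service_name> <metrics> <status detail>
print("0 Top_CPU_Processes %s %s" % ("|".join(metrics), ", ".join(details)))
```

Keying the metrics by rank rather than by process name is one way around the “process drops out of the top 5” concern raised in this thread: each metric series stays continuous, but a given series then tracks whatever process holds that rank, not one particular process. And of course, if the host locks up hard enough that the agent stops answering, no local check will get data out either.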
If there’s a process that’s maxing resources (we assume CPU) and you’re unable to ssh to an affected instance, then the ability to monitor said instance may also be impaired: if a system is no longer externally responsive to ssh, it’s likely you won’t be able to pull much, if any, monitoring information from it either.

Also, high CPU isn’t often that fatal in my experience; it’s usually more likely to be memory. So your best bet to start with, IMHO, is to either learn the dark arts of tuning the kernel oom-killer, or look at a userspace oom-killer daemon like earlyoom or oomd. If problems persist, then something like monit is usually the answer. What you then want to look for is some kind of logging that indicates what was killed or restarted, and when (see the sketch below). Gather that information however you can, collate it, and see if you can start to come up with patterns. Per-process metrics and graphs etc. are cool and all, and checkmk could do better there, but I don’t think they’re the best way to diagnose your issue.

Hey, thanks for your reply! Yep, using “%s” is more the kind of thing I was after all along, but as you say it does mean a new CMK service for every process on each server. I think a combined graph may well not prove useful, as you say, although I have seen graphing tools that let you click/select items on the graph and filter, which might have helped here. I’m also concerned about the performance aspect, although I’m not sure how severe the detrimental effect would actually be (yet). I think your idea of a local check that grabs the top 5 processes and sends them back would be better, although I’m not sure how this would work if one of the processes being graphed then drops out of the top 5? I would assume it would be removed from the graph, and we then wouldn’t see it from that point. It’s very likely I will come back and test this, though!
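For the “logging that indicates what was killed and when” suggestion above, here is a rough sketch, assuming a systemd host (Ubuntu 18.04 qualifies): the kernel logs lines such as “Out of memory: Kill process …” when the oom-killer fires, and `journalctl -k` can replay them with timestamps. The two filter strings are just common markers, not an exhaustive list:

```python
#!/usr/bin/env python3
"""Sketch: pull oom-killer events out of the kernel log so kill times
can be correlated with the instance lockups."""
import subprocess

# -k limits output to kernel messages; journald keeps timestamps, and
# keeps them across reboots if persistent journalling is enabled
log = subprocess.check_output(
    ["journalctl", "-k", "--no-pager"], universal_newlines=True
)

for line in log.splitlines():
    # common markers the kernel emits when the oom-killer runs
    if "Out of memory" in line or "invoked oom-killer" in line:
        print(line)
```

If nothing shows up there, `dmesg` or `/var/log/kern.log` are the usual fallbacks on systems without persistent journald.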