Taking Server Monitoring Alerts up a Level

Apr 21, 2015

In this blog I wanted to tackle strategies for monitoring that go beyond watching for and responding to alerts. Raising alerts when things are not working correctly is a key feature of any monitoring product. One of the challenges with monitoring, however, is being able to step back from individual alerts and see what is really happening. That higher-level view is often what management is looking for. Let’s discuss a few ways to get a better, high-level view of the alerts being raised in any environment.

One good way to step back from alerts is to plot the number of times an alert is raised (# of incidents) against the length of time the alert remains critical (duration). This effectively creates a profile for each alert in our environment. Each alert will fit one of four profiles (a small sketch of classifying alerts this way follows the list):

  • # of incidents low, duration low – This is the preferred result. Ideally an alert is never raised, but if it is we don’t want it to recur and we want its duration to be minimal.
  • # of incidents low, duration high – This is not a good result but may not indicate a serious problem going forward. Alerts that fit this profile can be nasty, but they are often the result of a one-time (or infrequent) problem that has since been addressed. An example could be a SAN failure that, once fixed, is unlikely to recur.
  • # of incidents high, duration low – This is a bad result and should be looked at. We sometimes refer to alerts that fit this description as “death by a thousand cuts”. These alerts are often simple to solve and don’t cause significant downtime if you look at them one by one. However, if you add up the amount of time spent responding to them, the total can be significant. Effort to resolve the underlying problem will often save money in the long term.
  • # of incidents high, duration high – This is the worst possible result. It indicates a serious underlying problem that should be addressed. Maybe you have failing hardware or poorly written software. This is where we really want to invest in root cause analysis and fix the underlying problem. Investments made to fix these problems are almost always money savers.
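Here is a minimal sketch of that classification in Python. The alert names, counts, and the thresholds that separate “low” from “high” are assumptions for illustration; in practice the counts and durations would come from your monitoring tool’s alert history.

```python
# Profile alerts by incident count vs. total time spent critical.
# Data and thresholds below are illustrative assumptions.

# alert name -> (number of incidents, total minutes critical)
alert_history = {
    "web-cpu-high":   (42, 15),    # many short incidents
    "san-controller": (1, 480),    # one long outage
    "db-replication": (30, 600),   # frequent and long-running
    "disk-space":     (2, 5),      # rare and brief
}

INCIDENT_THRESHOLD = 10   # "high" if more incidents than this
DURATION_THRESHOLD = 120  # "high" if more critical minutes than this

def profile(incidents, minutes):
    """Place an alert into one of the four incident/duration quadrants."""
    freq = "high" if incidents > INCIDENT_THRESHOLD else "low"
    length = "high" if minutes > DURATION_THRESHOLD else "low"
    return f"# of incidents {freq}, duration {length}"

for name, (incidents, minutes) in alert_history.items():
    print(f"{name:15s} -> {profile(incidents, minutes)}")
```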

One of the best ways to step back from individual alerts is to roll up the status of a number of different alerts. We set alerts for a lot of different things in our environments, but what we often care about most are end-user applications. That is, we care whether end-users can run the applications they need in order to do their jobs. When I describe an application in this context I’m not thinking of a single application like Microsoft Word; I’m describing a multi-tiered application made up of a number of different components. For these applications we may create alerts around performance, application server status, or database status. Those individual alerts are critical to troubleshooting any problems that come up, but management and application users likely just want a single status that tells them whether the application is up or down. If you tie the individual alerts together in a meaningful way, you can give them exactly that; a small example of this kind of roll-up follows.
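As a sketch, assuming a hypothetical three-tier application with one status per component, the roll-up can be as simple as “the application is down if any required component is critical”:

```python
# Roll up per-component alert statuses into a single application status.
# Component names and statuses are hypothetical.

application_components = {
    "web-frontend": "ok",
    "app-server":   "ok",
    "database":     "critical",
}

def rollup(components):
    """The application is down if any required component is critical."""
    return "down" if "critical" in components.values() else "up"

print(f"Application status: {rollup(application_components)}")  # -> down
```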

An important aspect of rolling up application status is separating predictive alerts from failure alerts. For example, CPU running above a defined threshold is a predictive alert. It may indicate a problem is coming, but things may also continue working as expected. A different example might be a website not responding – this is a failure alert and indicates things are not working right now. When you roll up application status you would not want high CPU to indicate the application is down, but you would want the website failure to do so. The sketch below extends the roll-up with that distinction.
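Extending the roll-up sketch above, each alert carries a type, and only active failure alerts mark the application down; active predictive alerts merely add a warning. The alert names and types are illustrative assumptions.

```python
# Only failure alerts drive the application to "down";
# predictive alerts surface as a warning instead.

alerts = [
    {"name": "cpu-above-threshold",    "type": "predictive", "active": True},
    {"name": "website-not-responding", "type": "failure",    "active": False},
    {"name": "db-connection-failed",   "type": "failure",    "active": False},
]

def rollup_status(alerts):
    if any(a["active"] and a["type"] == "failure" for a in alerts):
        return "down"
    if any(a["active"] and a["type"] == "predictive" for a in alerts):
        return "up (warning)"
    return "up"

# High CPU alone leaves the application "up (warning)", not "down".
print(rollup_status(alerts))
```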

You can extend the application status idea one step further into service levels by defining a service level objective for an application. Service level objectives are typically measured as a percentage – the percentage of time you expect an application to be available. This can be a very useful tool for management: instead of being involved every time the application is down, they can set targets and be notified when those targets are not reached. It is also a great tool for the IT group to communicate their performance back to the organization.
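As a quick sketch, assuming a 99.9% availability target over a 30-day period and downtime measured from the rolled-up application status (both figures are made up for illustration):

```python
# Compare measured availability against a service level objective.
# The target and downtime figures are example values.

SLO_TARGET = 99.9                 # percent of time the application should be up
period_minutes = 30 * 24 * 60     # a 30-day reporting period
downtime_minutes = 38             # total time the rolled-up status was "down"

availability = 100.0 * (period_minutes - downtime_minutes) / period_minutes
print(f"Availability: {availability:.3f}% (target {SLO_TARGET}%)")

if availability < SLO_TARGET:
    print("Service level objective missed -- notify management")
else:
    print("Service level objective met")
```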

Those are just a couple of strategies for stepping back from alerts to get a better high-level view of what is happening in an environment.  If you have thoughts on other strategies that have been helpful in your environment please leave a comment or post any thoughts you have to the up.time forum in the Idera community.