I get asked a lot, "What does 'response time' really measure?" Or something close… Response time, simply put, is a measurement of how long it took the monitoring station to execute the service monitor to completion: from the point the service monitor thread is launched to the moment the data comes back and gets processed, and nothing after that. Let me elaborate with a typical (and problematic) example.
You have a bunch of servers added to uptime. When they were added, a default PING check was created for them. Whether you also got default response time alerts for them depends on which version you were running at the time. At one point (not sure when.. sorry), we had the default warning set to 200 ms and the default critical set to 1000 ms. Sounds legit, right? After all, a 200 ms ping time is pretty slow, and 1 second, well, that's pretty slow too. I think I can probably ping a server in Croatia in less time from my Houston data center. Well, the trouble is the assumption that ping time is what response time measures. It isn't. We are measuring how long the ping took plus the time it took to run the service monitor itself. In the beginning, this is usually not a problem. As you start adding hundreds or thousands of things to monitoring, your monitoring station is going to slow down (that is, if you're not keeping up with monitoring station performance, but that's another blog post…). So, being occasionally mean spirited, I'm going to pick on the PING monitor some more.
The parameters for PING include pretty much everything you should care about in a PING monitor: the number of pings to send, average round trip time (RTT), and percent loss. Oh, and yes, response time… Really, what you should be primarily concerned with is average round trip time. Percent loss too, ok, that's kind of important. What will you find in the defaults? Well, because these are used as HOST CHECKS (keyword here), we are pretty darn forgiving. We will ping something 5 times, and we're not accounting at all for round trip time or packet loss. What's that? HERESY, you say?? Well, perhaps. But understand this (if you haven't experienced it from the response time values already): if a service monitor that is the host check goes out of the OK state, we're going to stop monitoring that element. The idea is economy, not just in monitoring station power, but in alert fatigue. Why would you want more alerts than just "your server is down"? And why would we want to queue up service monitors to run, and fail (likely after quite some time), against a down element? Well, maybe you do, but this isn't the department of verbal abuse, that's down the hall! Anyway, having a response time threshold set on the host check is a bad thing, because if your monitoring station slows down a bit and those ping checks take more than 200 ms to process, the host check leaves the OK state and you don't get other service monitors processing against those elements.
Ok, now that I've gotten you all concerned, let's expand this idea. Response time monitors on ANYTHING, in my humble opinion, are pretty worthless for monitoring the things they monitor, because they are talking about the monitoring itself, not what is being monitored (we do this anyway in the background, and it is in the problem reports you send to support). Furthermore, why would you ever want to STORE this information? It just lives right next to all the other stuff you're storing and slows down everything as a result, never mind the space it consumes! So now you're all questioning your monitoring setups, right? Good. That's the point. But wait, I've come to the table with a solution, not just to complain. I wouldn't do that to you! Below are some queries we will use to investigate what's going on, and to remedy it, if you should choose. These queries are tested against MySQL and MSSQL. Oracle should be the same or super similar. Should I test it? Well, maybe, but if you work with Oracle on the regular, you probably will have no issue translating and providing commentary! Note: the -- lines are comments (in MSSQL). MySQL only treats -- as a comment when it is followed by a space, so if you're on MySQL, just don't run those lines, or change the -- to a #
--select all non null timer defaults. The first part of our discovery...
select * from erdc_parameter
where name = 'timer' AND default_value != '';
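If the raw rows from that first query are hard to read, a join against erdc_base shows which monitor type each timer default belongs to. This is only a sketch: it assumes erdc_parameter.erdc_base_id points at erdc_base.erdc_base_id (which the PING queries in this post imply), and the aliases p, b and monitor_type are just names I picked. It only reads data, so it is safe to run before deciding what to clear.

```sql
-- List every non-blank response time default alongside the monitor type it belongs to.
-- Assumes erdc_parameter.erdc_base_id references erdc_base.erdc_base_id.
select b.name as monitor_type,
       p.default_value,
       p.default_comp_rule
from erdc_parameter p
join erdc_base b on b.erdc_base_id = p.erdc_base_id
where p.name = 'timer' AND p.default_value != ''
order by b.name;
```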
-- use one of the following two to remove the defaults.
-- Set default values for the PING monitor response time to make them blank. Note this doesn't modify any existing ones.
update erdc_parameter
set default_value = '',
default_comp_rule = ''
where name = 'timer' AND erdc_base_id =
(select erdc_base_id from erdc_base where name like 'PING');
-- set ALL DEFAULT timer values to no time and no rule - this way when we make new service monitors the defaults will be cleared
update erdc_parameter
set default_value = '',
default_comp_rule = ''
where name = 'timer' AND default_value != '';
--Now let's look at all response time rules currently saved against entities
SELECT * FROM simple_rule where erdc_parameter_id IN
(select erdc_parameter_id from erdc_parameter where name = 'timer');
-- If you wish to remove the existing rules for response time use one of the following two queries.
--delete all the PING response time rules saved against entities for example or use next query to delete them ALL
delete from simple_rule
where erdc_parameter_id IN
(select erdc_parameter_id from erdc_parameter where name = 'timer' AND erdc_base_id =
(select erdc_base_id from erdc_base where name like 'PING'));
--delete ALL response time rules saved...
delete from simple_rule
where erdc_parameter_id IN
(select erdc_parameter_id from erdc_parameter where name = 'timer');
--count the response time data rows being saved for the PING monitor
select count(erdc_int_data_id) from erdc_int_data where erdc_parameter_id =
(select erdc_parameter_id from erdc_parameter where name = 'timer' AND erdc_base_id =
(select erdc_base_id from erdc_base where name like 'PING'));
--or for all of them...
select count(erdc_int_data_id) from erdc_int_data where erdc_parameter_id IN
(select erdc_parameter_id from erdc_parameter where name = 'timer');
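A single total doesn't tell you where the rows are coming from, so here is a rough per-monitor-type breakdown. Treat it as a sketch: it assumes the same relationships the queries in this post already rely on (erdc_int_data.erdc_parameter_id to erdc_parameter, and erdc_parameter.erdc_base_id to erdc_base), and the aliases and the stored_rows column name are my own.

```sql
-- Count stored response time rows per monitor type, biggest offenders first.
-- Assumes the table relationships used by the other queries in this post.
select b.name as monitor_type,
       count(d.erdc_int_data_id) as stored_rows
from erdc_int_data d
join erdc_parameter p on p.erdc_parameter_id = d.erdc_parameter_id
join erdc_base b on b.erdc_base_id = p.erdc_base_id
where p.name = 'timer'
group by b.name
order by stored_rows desc;
```

On a large erdc_int_data table this can take a while, so it may be worth running off hours.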
/*delete it... because.. it is worthless. But exercise caution: this query isn't running in batches, so it could do nasty things like growing your transaction logs or locking up uptime for some time, and it could run for a long time. Use the count query above to understand how many rows you're going to delete. Run the query if you're serious. If you've got hundreds of thousands or millions of rows, you shouldn't do this without uptime down during a maintenance window.. or just let the data expire and get archived. Your choice.*/
delete from erdc_int_data where erdc_parameter_id IN
(select erdc_parameter_id from erdc_parameter where name = 'timer')
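Since the delete above runs as one big statement, here is a sketch of how it could be done in batches on MSSQL instead. I have not run this against an uptime datastore, so take it as an illustration rather than a tested procedure. The variables @batch_size and @rows, and the value of 10000, are my own placeholders.

```sql
-- T-SQL (MSSQL) sketch: delete response time rows a batch at a time.
-- @batch_size and @rows are hypothetical names; 10000 is an arbitrary starting point.
DECLARE @batch_size INT = 10000;
DECLARE @rows INT = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (@batch_size) FROM erdc_int_data
    WHERE erdc_parameter_id IN
        (SELECT erdc_parameter_id FROM erdc_parameter WHERE name = 'timer');

    -- capture the row count right away; it drops to 0 once nothing is left
    SET @rows = @@ROWCOUNT;
END;

-- MySQL has no DELETE TOP; the rough equivalent is to append LIMIT 10000
-- to the plain delete and re-run it until it reports 0 rows affected.
```

Smaller batches keep each transaction short, which is the whole point: less transaction log growth and shorter locks per pass. The maintenance window advice still applies if you have millions of rows.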
Running through these queries shows you the scope of these checks and the data being stored. You're a step closer now to "Pragmatic Monitoring". I always encourage folks to think about whether they should alert and whether they should store data when they create monitors. This example really had to do with the things we do by default, which weren't always perfect. But we're always trying to improve. Don't forget, uptime allows you to alert and / or store data. Many times you want to store and not alert, and vice versa.
Drop me some comments here on any defaults you'd like to see changed, or any of the out of box behaviors you might not like, or hey, even stuff you like! Let's get the dialogue going folks!