Counter calculations in SQL Diagnostic Manager

by May 1, 2014

Probably the three most common questions I get about counters in SQL Diagnostic Manager are:

“How can I prevent alerting on temporary noise values?”
“How does collecting once every 6 minutes give enough data for alerting?”
and, the big one
“Why is SQLdm showing me different data than I see in PerfMon?”

The answer to all of these questions lies in an understanding of the kind of counters used by SQL Diagnostic Manager, and I will endeavor to explain that here.

Cooked and Raw Counters

Windows provides two basic kinds of counters, called “cooked counters” and “raw counters.” As a user you are almost certainly most used to looking at cooked counters – when you want to see a CPU percentage, you want to see a value between 0 and 100 which corresponds to the percentage of CPU being used by the system at that time. However, within a monitoring product like SQL Diagnostic Manager, it’s much more important to understand raw counters. These are the raw numbers which are used under the covers to calculate the values you are used to seeing.

While this logic is true both in Windows and in SQL Server counters, for the purposes of this demonstration I am going to focus on OS counters for a minute – specifically, on CPU counters, which are relatively well understood. SQL Diagnostic Manager utilizes the Win32_PerfRawData_PerfOS_Processor WMI class to collect data for the alertable values OS Processor Time, OS Privileged Time, and OS User Time. This is true both for direct WMI and OLE Automation configurations of SQLdm. These three alerts come from the PercentProcessorTime, PercentPrivilegedTime, and PercentUserTime counters from the Win32_PerfRawData_PerfOS_Processor class. You can see the values of these counters by running the following Powershell command:

Get-WmiObject -Class Win32_PerfRawData_PerfOS_Processor |
Where-Object { $_.name -eq '_Total'} |
Format-List -Property PercentProcessorTime, PercentPrivilegedTime, PercentUserTime

On my laptop, the values returned look like so:

PercentProcessorTime : 9341260631492

PercentPrivilegedTime : 119022171957

PercentUserTime : 322043160366

Converting Raw Counters to Cooked Values

These values obviously are not useful as they are. They only become useful as you sample them over time. If I run the command again, I’ll receive slightly different values, like so:

PercentProcessorTime : 9343379827076

PercentPrivilegedTime : 119083850853

PercentUserTime : 322093412188

This continues over the entire life of the computer – these counters keep going up, in small increments and large, and SQL Diagnostic Manager keeps sampling them. From there, the next step is to make these into readable “cooked” values.

Each of these counters has an associated formula which is used to convert these raw values, and the deltas between them, to usable “cooked” counters that can be easily consumed by users. If you refer to the Win32_PerfRawData_PerfOS_Processor documentation referenced earlier, you’ll notice each counter has a CounterType value associated with it. You can use that CounterType to look up the related cooking formula from the table here. Once you know the formula name (such as PERF_100NSEC_TIMER_INV) you can then look up the full formula documentation, such as that found here. Doing this kind of research for the hundreds of different counters used by SQL Diagnostic Manager may sound like a pretty tedious process – but believe me, I’ve done it!

Why Bother With Raw Counters?

So, by collecting these raw counters and running them through very specific, well-documented formulas, we’re able to get user-readable values. But why do we go through all this trouble when cooked counters are available? That goes back to the questions I raised at the very beginning of this article.

For the most part, cooked counters are brief samples plucked from the operating system and used to determine a broad trend. For a program like PerfMon, which is running a collection once per second, this is generally fine – you are probably looking at the overall direction of the trendline and perhaps the min, max, and average value for the timeframe. In such a case, the fact that each data point is only accurate to a few milliseconds of time does not have too much of an effect. It’s also not too surprising or distracting when you see the occasional outlier data point – it’s a quick blip and it has very little effect overall. But if you were to change the PerfMon settings such that it was collecting a data point only every 6 minutes, it would absolutely miss a great deal of information! Between one data collection and the next you could have a huge spike in activity that brings the server to its knees for 4 or 5 minutes, and then resolves itself, and you’d never know.

That kind of behavior would be unacceptable in a product like SQL Diagnostic Manager – both on the outlier side and on the “missing a spike in activity” side. By using the raw counter classes, we are able to instead get information about everything that happened between our two collections, not just the counters which happen to be active at the moment that a refresh happens. This means that DM naturally smooths out brief blips and also naturally captures spikes in activity between collections. This also means that the product almost never matches PerfMon precisely – the methods of collection are really just too different!

Try It For Yourself

To quote Reading Rainbow, “You don’t have to take my word for it!” I’ve put together a little script and an Excel spreadsheet you can use to collect your own raw values and cook them yourself. I encourage you to give it a try, and compare the output you see to what you see in PerfMon. Try decreasing the frequency of PerfMon’s collections, and try spinning up CPU cycles to see if you can cause PerfMon to miss a spike – and then see if you were able to capture it with the raw counters.

Here’s a Powershell script which collects raw counters every 30 seconds for 5 data points:

$i = 1

while ($i -lt 5)
{
"Sample " + $i
Get-WmiObject -Class Win32_PerfRawData_PerfOS_Processor |
where-object { $_.name -eq '_Total'} |
Format-List -Property PercentProcessorTime, PercentPrivilegedTime, PercentUserTime, TimeStamp_Sys100NS
Start-Sleep -Seconds 30
$i = $i + 1
}