Recently, I wrote a blog post about performance counters and why the threshold depends (Performance Counters: Why It Depends). In that post, I talked about why the counter Disk Queue Length is no longer a valid counter to use. Today, I want to elaborate on that and explain why the counter is not relevant today. The counter is not incorrect, per se, but it has a limited view and requires a lot more information than may be available to a DBA or even a system administrator.
Disk queue length was relevant at one time, when most SQL Servers were small systems running on local disks. The counters were very straightforward on those systems. The recommendation that everyone had learned was that disk queue length should be less than 2, but often up to 5 was allowable. Then RAID (redundant array of inexpensive disks, or redundant array of independent disks) came into use, and we had to consider the makeup of the RAID system when reviewing this counter. From the Windows perspective, we see a single disk queue for the entire array, because it is presented to Windows as one disk. Suddenly, people were noticing much higher disk queue lengths than before and raising alarms.
We had to adjust our understanding of this counter to take into account that each underlying disk of the array has its own queue and what we are seeing is the cumulative disk queue for all underlying disks. So we had to divide the total disk queue by the number of spindles backing the array. For example, if an array had 10 spindles and the average disk queue length for the drive was 20, it averages out to a disk queue length of 2 per spindle (20/10). But this raises the question of just how accurate that calculation is.
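The per-spindle arithmetic can be sketched in a few lines of Python. This is only an illustration: the function name `per_spindle_queue` is hypothetical, and the spindle count is an input you would have to get from whoever manages the storage.

```python
def per_spindle_queue(avg_disk_queue_length: float, spindles: int) -> float:
    """Divide the queue length reported for the array by its spindle count.

    Both arguments are assumptions supplied by the caller; Windows only
    reports the single cumulative queue for the disk it is presented with.
    """
    if spindles <= 0:
        raise ValueError("spindles must be a positive integer")
    return avg_disk_queue_length / spindles


# The example from the text: a queue length of 20 across 10 spindles.
print(per_spindle_queue(20, 10))  # 2.0
```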
If one or more of the disks in the array were performing poorly, we could end up with varying disk queues across the array. If we had 8 spindles with a queue length of 1 each and 2 spindles with a queue length of 6, we would see a disk queue length of 20 and count that as an average of 2 per spindle. Everything would appear like it was performing well even though some disks were not.
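To make the masking effect concrete, here is a small Python sketch using the made-up numbers from the paragraph above. The per-disk values are hypothetical; in practice we only see the single total for the array, which is exactly the problem.

```python
# Hypothetical per-spindle queue lengths: 8 healthy disks and 2 struggling ones.
per_disk_queues = [1] * 8 + [6] * 2

# What we would see reported for the array as a whole.
total_queue = sum(per_disk_queues)

# The "divide by spindles" rule of thumb.
average_per_spindle = total_queue / len(per_disk_queues)

# The number the average hides.
worst_spindle = max(per_disk_queues)

print(total_queue, average_per_spindle, worst_spindle)  # 20 2.0 6
```

The average lands right on the old threshold of 2 while two of the disks sit at three times that, and nothing in the cumulative number tells the two situations apart.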
Then along came SANs (storage area networks). SANs pose an even bigger problem with measuring disk queue length. SANs can have a mixture of RAID types, and LUNs (logical unit numbers) can be carved out of a subsection of an array. The number of spindles behind a given LUN is even less visible to a DBA than it was with direct or network attached arrays. SANs generally have their own built-in queues and caching systems as well. With no insight into these caches and queues, it is possible that there could be a lot of queuing on the SAN with no queuing reported at the Windows level. On a SAN, queuing can be completely hidden from us, making us think there are no issues when in fact there are.
There is no absolute number we can give you for what your disk queue counters should be, and without understanding all the details of the underlying disk system, there is no guidance we can give you to determine what those numbers should be. Besides, there are so many other counters that give us more valuable information about disk activity.