Knowing Your Storage Basics – 2 Life Lessons

by Jan 5, 2015

In the software world, everyone often wants to learn and master everything. The problem with this approach is that we end up a master of nothing. The best way to learn is one thing at a time or, put another way, to learn one thing completely before moving on to the next. It is critical to learn from our mistakes and from the errors we run into from time to time. There are no shortcuts to these life lessons.

Every organization has different personas: Developer, DBA, System Administrator, Network Administrator, Architect and many more. Each of them plays a significant role in the database's architecture, development, deployment, integration, etc. So in this blog post, I would like to take a step back and go over some of the storage basics that I have learned from SQL Server in the past couple of years.

I/O requests taking longer than 15 seconds

These are nice and useful messages that get printed to the SQL Server error log. Though they are evidence of stalled SQL Server I/O, they can sometimes also be false alarms.

2014-11-11 16:30:02.140 spid6s       SQL Server has encountered 2 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [F:\…

2014-11-11 16:32:08.780 spid6s       SQL Server has encountered 2 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [H:\…

Be careful before jumping to conclusions: revalidate the I/O behavior by looking at the SQL Server DMV sys.dm_io_pending_io_requests and the Windows Performance counters. You can also use trace flag 830 as a startup parameter to disable the detection and reporting of stalled and stuck I/O operations. The interesting thing about this error is that once it has been reported for a request, it is not raised again for the same request. Hence if we see 20 such occurrences reported in the SQL Server error log, it means 20 different I/O requests stalled in that period. The message also names the file under contention, which makes later troubleshooting easier.
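Since each message carries its own occurrence count and the affected file, the log can be summarized per drive before digging further. The following is a minimal Python sketch, assuming the message layout matches the two excerpts above; the function name stalled_io_by_drive and the regular expression are my own illustration, not part of SQL Server or any tool.

```python
import re
from collections import Counter

# Layout assumed from the two error log excerpts above:
# timestamp, spid, message text, then "[<file path>".
STALL_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<spid>\S+)\s+"
    r"SQL Server has encountered (?P<count>\d+) occurrence\(s\) of "
    r"I/O requests taking longer than 15 seconds to complete on file "
    r"\[(?P<path>[^\]]*)"
)


def stalled_io_by_drive(lines):
    """Sum the reported stalled I/O occurrences per drive letter."""
    totals = Counter()
    for line in lines:
        match = STALL_RE.match(line.strip())
        if match:
            # The first two characters of the path are the drive, e.g. "F:".
            drive = match.group("path")[:2].upper()
            totals[drive] += int(match.group("count"))
    return totals


# The two log lines quoted above; the file paths are truncated in the source.
sample = [
    r"2014-11-11 16:30:02.140 spid6s       SQL Server has encountered 2 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [F:\…",
    r"2014-11-11 16:32:08.780 spid6s       SQL Server has encountered 2 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [H:\…",
]

for drive, count in sorted(stalled_io_by_drive(sample).items()):
    print(drive, count)
```

A drive that keeps showing up with a high total is a reasonable first candidate for a closer look with the DMV and the Performance counters.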

If this is the case, what could be the possible causes? Here are some of the things to look out for:

  • Faulty hardware

  • Hardware that is not configured correctly

  • Firmware settings

  • Filter drivers

  • Compression

  • SQL Server is saturating disks

Troubleshooting Techniques

There is no set procedure for solving I/O problems. If query tuning has already been carried out and the problem persists, we can guess that it stems from external factors. Some areas to investigate are:

  • Exclude SQL Server .mdf and .ldf files from antivirus scan.

  • Do not place SQL Server data files on compressed drives.

  • Place the database .mdf and .ldf files on separate drives.

  • Follow best practices for TEMPDB, especially the placement and number of TEMPDB files.

These are not exhaustive, but a representative start. Some other checks I use to begin troubleshooting are:

  • Compare the Perfmon counters "PhysicalDisk: Disk Bytes/sec" and "Process: IO Data Bytes/sec" (sqlservr.exe instance). This shows how much I/O (bytes/sec) is being generated by the SQL Server process.

  • In Perfmon, the "PhysicalDisk: % Disk Time" counter monitors the percentage of time that the disk is busy with read/write activity. If it is high (more than 90 percent), check the "PhysicalDisk: Avg. Disk Queue Length" counter to see how many system requests are waiting for disk access.

  • If SQL Server is contributing the majority of the I/O, that is, "Process: IO Data Bytes/sec" (sqlservr.exe) is close to "PhysicalDisk: Disk Bytes/sec", compare this number with the throughput of the disk subsystem. For example, if SQL Server is contributing 50 MB/sec while the SAN throughput is 200 MB/sec, then SQL Server is not the cause.

  • If SQL Server is the cause, identify the top read- and/or write-intensive queries from a trace and tune them accordingly.
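The counter comparison above boils down to two ratios: SQL Server's share of the disk traffic, and SQL Server's traffic relative to what the disk subsystem can deliver. Here is a rough Python sketch of that arithmetic. The function name diagnose and both thresholds (majority_share, saturation) are illustrative assumptions of mine, not official guidance, and the 55 MB/sec disk figure in the usage example is made up to go with the 50 MB/sec and 200 MB/sec numbers from the text.

```python
MB = 1024 * 1024


def diagnose(sql_bytes_per_sec, disk_bytes_per_sec, subsystem_bytes_per_sec,
             majority_share=0.5, saturation=0.8):
    """Compare SQL Server I/O against disk traffic and subsystem throughput.

    sql_bytes_per_sec:       "Process: IO Data Bytes/sec" for sqlservr.exe
    disk_bytes_per_sec:      "PhysicalDisk: Disk Bytes/sec"
    subsystem_bytes_per_sec: rated throughput of the disk subsystem / SAN
    majority_share and saturation are assumed thresholds; tune them locally.
    """
    if subsystem_bytes_per_sec <= 0:
        raise ValueError("subsystem throughput must be positive")
    # Share of the observed disk traffic that comes from SQL Server.
    share = sql_bytes_per_sec / disk_bytes_per_sec if disk_bytes_per_sec else 0.0
    # How much of the subsystem's throughput SQL Server is using.
    utilisation = sql_bytes_per_sec / subsystem_bytes_per_sec
    if share < majority_share:
        verdict = "SQL Server is not the main I/O contributor"
    elif utilisation < saturation:
        verdict = "SQL Server dominates I/O but is well below subsystem throughput"
    else:
        verdict = "SQL Server is likely saturating the disks"
    return share, utilisation, verdict


# 50 MB/sec from SQL Server against a 200 MB/sec SAN, as in the example above.
share, utilisation, verdict = diagnose(50 * MB, 55 * MB, 200 * MB)
print(f"share={share:.0%} utilisation={utilisation:.0%} -> {verdict}")
```

With these numbers SQL Server uses a quarter of the SAN throughput, which matches the conclusion in the text that SQL Server is not the cause.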

These are some of the things that come to mind, drawn from what I have learned day to day across various consulting and troubleshooting engagements. If the problem is outside SQL Server, I generally ask System Administrators and SAN Administrators / Vendors for more insight before drawing conclusions.

Do let me know if these life lessons are useful and whether you would like to learn more of these in the future.

Additional Reading: SQL Server I/O Basics, Chapter 2, SQL Server 2000 I/O Basics.
