Internals of Statistics That You Were Not Aware Of

by Nov 25, 2014

As a developer statistics is one of the important things we need to know while doing SQL Server query performance tuning exercise. Let us try to make a correlation with real world example to make sure these concepts stay in our memory. Imagine if you need to travel from your home to a friend’s place. If there are multiple routes to reach to destination, which one you would take? Obvious answer would be to take a road which takes less time. Let us go one step further; how would you know the route which would take less time? Well, it would depend on traffic condition, vehicle used to commute, weather condition and may be few more. Having latest information about all deciding factors would help us in taking correct decision end of the day. 

Now, imagine yourself as SQL Server’s query optimizer. Your task is to execute the query with the lowest possible cost. If the query given to you is a select statement with one predicate, you have minimum choices out of few possible plans. If query becomes complex, the plan choices would increase exponentially but you need to make quick decision as well. Optimizer cannot spend time in the optimization phase too long as it needs to execute and finish the query too. 

As query optimizer, what is the information you would need to make correct plan choice? Number of rows in the table, data distribution of the column to decide seek or scan, number of CPU, memory available and many more. Will optimizer look at the actual data during optimizing phase? Of course it should not. This is where statistics comes into picture. Statistics is the collection of information about data stored in our table. During optimization phase, the optimizer would look at statistics to find how the data looks likes and based on that plan choice would be made.  

Now correlating with our initial example – what if the information provided is not up-to-date or closer to up-to-date? The way you might choose a wrong route to go to friend’s place, optimizer can also choose wrong plan. Let’s look at the below example to see statistics in action. 

First, let us see how statistics are created: 

set nocount on 

go 

Create database StatisticsDemo 

go 

Use StatisticsDemo 

go 

Create Table MyTable (i int, j char(4000), k char(4000 

go 

declare @i int = 65 

— Insert A..Z in i and a..z in k 

while (@i<=90) 

begin 

insert into MyTable values (@i64, CHAR(@i),CHAR(@i+32 

set @i = @i + 1 

end 

go 

Create index Idx_MyTable_i on MyTable (i 

go 

Create Statistics Stats_MyTable_j on Mytable(j 

go 

Select * from MyTable where k = 'm' 

go 

 

As soon an index is created, we can check in management studio and look at the statistics node. This is what we would see. 

 

 

In the script, we have created an index which has created “Index Stats”, created a manual statistics using create statistics command and fired a query on a different column which caused “auto stats”. It should be noted that auto statistics are created when database property of “create auto stats” is set to ON. 

 

If you are don’t like using the UI, then the DMV sys.stats catalog view can be used to get same information: 


Select
* from sys.stats 

 

Next obvious question is to understand what is stored as part of stats? We can look at the statistics object in two ways.

 

1. We can go to Statistics node in your SSMS (as shown in earlier image), right click on the statistics > Properties, and click on Details tab. 

 

 

2. We can also run the DBCC SHOW_STATISTICS command to see the same details using a query:

 

 

In the above images we can look at Rows Sampled and Rows in the table to identify whether the histogram has been created with all the rows or using sampled rows.  

As we can see above, histogram actually represents the data distribution of the rows in the table. This is the information used by optimizer to choose the query plan out of various combinations. 

Whenever you encounter performance issue, there are few things to look for: 

  1. Compare the values in Rows and Rows Samples. Make sure they are closer. If they are not, update the statistics manually. 

  2. Below query can be used to look at the number of rows and modified rows in the table. Table name needs to be modified. 

SELECT object_name(id) AS 'table_name',  

       name            AS 'index_name',  

       rowcnt          AS 'Row_count',  

       rowmodctr       AS 'Row_Mod_Counter'  

FROM   sys.sysindexes  

WHERE  id = Object_id('MyTable')  

       AND indid > 0  

 

If row mod counter is high, then again update the statistics manually. 

To update the statistics manually there are multiple ways.  

  1. Rebuild the index which would in turn update index statistics.

  2. Run UPDATE STATISTICS command (update statistics <table name> with fullscan) 

Generally “auto create” and “auto update stats” settings are ON by default unless recommended by the application vendor don’t touch these settings. Hope this would help you in understanding the basics of statistics.

Webinar: SQL Server Performance Architecture for BeginnersNext Steps

You may also be interested to watch a webinar by Rick Morelan, SQL Server Performance Architecture for Beginners, available on-demand.

Learn more about Embarcadero DB Optimizer , which provides automated SQL profiling and tuning for optimized performance, and try DB Optimizer for free.