Example of Obtaining Index Use Statistics

This section assumes you have run the steps at Examples of Debugging a Slow Query.

Using the PLAYERS table, we can look at index statistics.

Caution: Don't forget Analyze! Remember, run: ANALYZE TABLE PLAYERS to be sure index statistics are generated.

# Don't forget to analyze your table
ANALYZE TABLE players;
 
# Note the DISTINCTCOUNT for our IDX_PLAYERS_FIRSTNAME index - not too good...
SELECT * FROM system.indexstatistics WHERE tablename = 'PLAYERS';
 OBJECTID  CATALOGID  SCHEMA  TABLENAME        INDEXNAME       KEYCOUNT  
 --------- ---------- ------- ---------- --------------------- --------- 
     1        3567    HOCKEY   PLAYERS   PLAYERS..PRIMARY_KEY     7520   
    23        3567    HOCKEY   PLAYERS   IDX_PLAYERS_FIRSTNAME    7520 
 
SELECT * FROM system.indexprefixstatistics WHERE objectid = 23 and catalogid = 3567;
  OBJECTID  CATALOGID  PREFIXLENGTH  DISTINCTCOUNT  AVERAGEKEYLENGTH    SOURCE   
 --------- ---------- ------------- -------------- ----------------- ----------
     23        3567          1             845           5.32        statistics 

SELECT * FROM system.indexhistograms where objectid = 23 and catalogid = 3567;
 OBJECTID  CATALOGID  HISTOGRAMID  FIELDCOUNT  MAXRESOLUTION  
 --------- ---------- ------------ ----------- -------------- 
    23        3567         1            1            10       
 
# The selectivity of our Histogram buckets leaves a lot to be desired...
SELECT * FROM system.indexhistogrambuckets WHERE objectid = 23 AND catalogid = 3567;
 OBJECTID  CATALOGID  HISTOGRAMID  BUCKETINDEX     BOUNDARY     
 --------- ---------- ------------ ------------ --------------- 
    23        3567         1             0      [0] 
    23        3567         1             1      [Dan]         
    23        3567         1             2      [Jack]          
    23        3567         1             3      [Matt]          
    23        3567         1             4      [Mike]          
    23        3567         1             5      [Mike]          
    23        3567         1             6      [Mike]          
    23        3567         1             7      [Mike]          
    23        3567         1             8      [Mike]          
    23        3567         1             9      [Ray]         
    23        3567         1            10      [Ziggy]     

Now we see that there are eleven buckets in the histogram for the index IDX_PLAYERS_FIRSTNAME. The 0th bucket contains one element and that is the lowest key in the index, in this case a [0] zero length key. We can easily see that we do not have good selectivity on this index and that most buckets are filled with keys equal to "Mike".

In most cases, using the system default value of 10 for the number of histogram buckets that an index will use to store statistics will allow for cost estimates that are reasonable enough for the optimizer to choose the most efficient path to retrieve the result set of a query. However table size increases and data value distributions become highly skewed, having only 10 histogram buckets can hide these spikes in the distribution of data values. This can result in less than optimal cost estimates for index statistics. Therefore, the number of index histogram buckets can be increased at the time of index creation.

For example, using the above example, how would these cost estimates change if we increase the number of histogram buckets for this index.

# Create the index with an increased number of histogram buckets
DROP INDEX idx_players_firstname;
CREATE INDEX idx_players_firstname ON players(firstname) WITH (RESOLUTION 300);
 
#Analyze the new index to update statistics
ANALYZE INDEX idx_players_firstname;
 
SELECT * FROM system.indexhistogrambuckets WHERE objectid = 24 AND catalogid = 3567;
 OBJECTID  CATALOGID  HISTOGRAMID  BUCKETINDEX     BOUNDARY     
 --------- ---------- ------------ ------------ --------------- 
    24        3567         1             0      [0] 
    24        3567         1             1      [Al]            
    24        3567         1             2      [Aleksey]       
    24        3567         1             3      [Alexander]     
    24        3567         1             4      [Alexsandr]     
    24        3567         1             5      [Andre]         
...
    24        3567         1            93      [Mikael]        
    24        3567         1            94      [Mike]          
    24        3567         1            95      [Mike]          
...
    24        3567         1           256      [Mike]          
    24        3567         1           257      [Mikhail]       
    24        3567         1           258      [Morris]        
...
    24        3567         1           299      [Wayne]         
    24        3567         1           300      [Ziggy]         
EXPLAIN (opt_estimates on) SELECT firstname, lastname, playerID FROM players WHERE firstname='Mike';
 Select
  List
    Field HOCKEY.PLAYERS.FIRSTNAME (1)
    Field HOCKEY.PLAYERS.LASTNAME (1)
    Field HOCKEY.PLAYERS.PLAYERID (1)
  Boolean sieve
    Eql
      Field HOCKEY.PLAYERS.FIRSTNAME (1)
      "Mike" (varchar) 
    Inversion HOCKEY.PLAYERS  (1)
      Bitmap index IDX_PLAYERS_FIRSTNAME [cost: 12182.4, selectivity: 54.000%, rows: 4060]
        "Mike" (varchar) 

The cost estimates now being returned are more in line with actual data values. Distributing the data across an increased number of histogram buckets decreases the number of rows in each bucket which better relects the spikes in distribution of data.

Previously, the optimizer cost estimates for retrieving rows with a FIRSTNAME equal to "Mike" were:

Bitmap index FIRST_IDX [cost: 9024.0, selectivity: 40.000%, rows: 3008]

The new cost estimates after increasing the number of histogram buckets for the index are now calculated as:

Bitmap index FIRST_IDX [cost: 12182.4, selectivity: 54.000%, rows: 4060]

Both the selectivity and the number of rows are much closer to the actual values. There are 4093 rows for "Mike" out of 7520 total rows, or 54% of the total rows. However, there is a cost for both collecting and storing these statistics. This cost must outweigh the benefits gained in query performance.