Monday, July 22, 2013

Is the length of your inline LOBs optimal ?

From DB2 v10, you can specify that a portion of a LOB column will be stored in the base TableSpace. This improves the performance of applications that access LOB data, because it reduces the need to access the auxiliary LOB TableSpace. That also enables the creation of expression-based indexes on the inline portion of a LOB column, which can help improving the performance of searches through a LOB column [more info on IBM website].

Inline LOBs means that, for LOBs that are less than or equal to the size of the specified inline length, DB2 will store the entire LOB data in the base TableSpace, so more or less as if it would be a VARCHAR. However, for LOBs that are bigger than the inline length, the inline portion of the LOB resides in the base TableSpace, and the remainder of the LOB in the LOB auxiliary TableSpace. In order to use this feature, assuming you are currently using standard LOBs columns, you can alter existing LOB columns to become inline (using an ALTER TABLE statement).

But, what is the optimal INLINE LENGTH that you should specify ? The question is almost philosophical, since the answer greatly depends on the LOB Lengths Distribution in the column, in other words, it depends on the type of data that are stored in the LOB. And the DBA creating the table does not necessarily know what it contains, nor what it will contain…

Let’s examine 5 types of LOB length distributions that could exist in a LOB column.

1st case : almost all LOBs data are small, limited in size, with a few exceptions.



This case is almost too good to be true, because the answer of the optimal INLINE LENGTH is obvious. You should setup the INLINE LENGTH “threshold” so that it fits the “almost all small data” :


2nd case : the LOBs lengths are equally distributed, from the minimum length to the maximum length




This case is also a no brainer, and you will want to setup a threshold based on the percentage of LOBs row that should be self-contained in the inline LOB. For example, a 80 / 20 distribution as illustrated in the graph below:



3rd case : the LOBS lengths is are linearly distributed (ascending or descending)


Similarly to the 2nd case, you will want to setup a threshold based on the percentage of LOBs row that should be self-contained in the inline LOB. For example, a 80 / 20 distribution as illustrated in the graph below:


4th case : the LOBs length follow a normal distribution (Gauss).



This case might be more common, but essentially (if you forget about the small amount of low length LOBs on the left of the graph), you end up with a situation very similar as the 1st Case, where you will want to set the Inline LOB Length very close to the end of the Gauss curve. You might also use a 80 / 20 percent approach. 


5th case : the LOBs length distribution is sinusoidal


For this case, it is fairly complicated to provide an educated suggestion as to where to set the Inline LOB length “threshold”. In fact this looks like a succession of Gauss curves (4th case), so you could setup the threshold after any of these curves as illustrated below. Furthermore, you might want to analyze which data get accessed more frequently, an thus determine after which curve to place the threshold. Hopefully, this case will not be that common in real environements.
  


 From Theory to Practice …

The above describes distribution models, which is nice in theory… Well, in practice, you will want some sort of tool, to analyze existing length of LOB data, to understand in which case your LOBs are, and adapt the INLINE length accordingly (or confirm the setup is good the way it is). Read below …

The following SQL statement will list the columns in tables which are using INLINE LOB technique in your subsystem:

SELECT TBCREATOR,TBNAME,"NAME",LENGTH,LENGTH2,COLTYPE
  FROM "SYSIBM".SYSCOLUMNS                          
 WHERE ( COLTYPE = 'CLOB'                           
      OR COLTYPE = 'BLOB'                           
      OR COLTYPE = 'DBCLOB' )                       
    AND  LENGTH  > 4 ;                               

“LENGTH” > 4 indicates that the LOB column uses the INLINE LOB technique, the actual Inline LOB Length is LENGTH-4
“LENGTH2” reflects the maximum length of the LOB column (inline + stored in Auxiliary)

As an example, the below test was made on a SYSIBM catalog table SYSIBM.SYSVIEWS that contains a LOB column PARSETREE, a 1GB BLOB (1073741824) with Inline LENGTH 27674-4 = 27670  

Here the sample SQL statement that I used to visualize the length distribution of a LOB column. It simply categorizes the various lengths of the LOBs into LOB length ranges.

SELECT RANGE, COUNT(LENGTH) AS ROW_COUNT FROM (                      
SELECT LENGTH(PARSETREE) AS LENGTH,                                  
 (CEIL(27670/100)*(1+CEIL( 100 * LENGTH(PARSETREE) / 27670))) AS RANGE
FROM SYSIBM.SYSVIEWS) AS TABLE_WITH_RANGE                          
GROUP BY RANGE                                                       
ORDER BY RANGE ;                                                     

When exporting the output of the query into Excel, one can make a simple graph representation of the LOB length distribution and visualize if the INLINE LENGTH value is properly set. The analyzes shows that we are pretty much in the scenario of 1st Case, which I would assume to be the most common one anyway. The Inline LOB Length “threshold” seems adequately placed, for this LOB column.


The nice thing about this technique is that it is re-usable, and that the same query can be ran over time, when the LOB data values evolve, monitoring the LOB length distribution, to make sure the INLINE LENGTH value is still optimal.

No comments:

Post a Comment