Monday, July 29, 2013

What does Big Data mean to you?

Whether you are an insurance provider, a bank, an advertising company, or any company of medium-to-large size, you need to consume and exploit huge amounts of data internally. You would also benefit from analyzing additional, external data that is available to you, but can be difficult to collect and process in a simple and meaningful way.

As an insurance company, you might want to intercept Facebook updates from clients who publish skiing activities while collecting insurance benefits for a broken leg.

As an international bank, you might want to monitor the GPS locations of mobile transactions to intercept two transactions made on the same personal account within the same minute, one from South Africa and the other from China.

As an advertising company, you might want to monitor, analyze, and react in real time to tweets on a particular theme. Similarly, as a manufacturer (of shoes or cars, for example), you might want to monitor discussions and complaints on specialized forums to gather data regarding the quality of your products and understand which parts or models break more often, in order to improve the quality of your manufactured products (using, for instance, the Six Sigma model).

As a software company, you might want data regarding the real usage of each function or feature in your products, in order to make educated decisions and proactively improve quality and focus development accordingly (using, for instance, the Agile/Scrum methodology).

As a political party, you might want much better and much more granular information regarding voting intentions, and especially to be able to locate (and convince) undecided voters who are most likely to accept your political opinions (for example, if your party wants to abolish the 2nd Amendment, there is probably no chance of convincing an individual who renewed his NRA membership last month, whose favorite movie is Rambo II, and who just bought yet another rifle to add to the 20-piece collection in his living room).

And to some degree, any company would benefit from a better understanding of their clients’ needs, satisfaction levels, profiles… you name it.

The term “Big Data” is just emerging, and its meaning may evolve, but so far it covers:

  1. The idea of using vast amounts of data from various sources, ranging from internal company records to social media activity.
  2. The idea of analyzing the data described above in real time.
  3. The idea that these real-time analyses will help trigger actions (as automatically as possible) to:
  • Improve security
  • Detect frauds
  • Better understand the market / the clients
  • Monitor / Adjust ideas and discussions on social networking websites

For almost a decade, millions of people have openly shared their profiles (Facebook), broadcast short updates (Twitter) to the world, and published opinions and concerns on various blogs, forums, and the like. Big Data is a logical extension of this trend, as an exploitation of the data generated by this new mentality, a mentality that can be summarized by the openness and visibility of individuals’ profiles and opinions. So… Big Data = Big Brother? Yes, in a way, in a Big way. And as with any evolution in IT, some companies will make use of these new trends and dynamics, and some will stay behind. Which will you be?

Monday, July 22, 2013

Is the length of your inline LOBs optimal?

Starting with DB2 v10, you can specify that a portion of a LOB column be stored in the base TableSpace. This improves the performance of applications that access LOB data, because it reduces the need to access the auxiliary LOB TableSpace. It also enables the creation of expression-based indexes on the inline portion of a LOB column, which can help improve the performance of searches through a LOB column [more info on the IBM website].

Inline LOBs mean that, for LOBs that are smaller than or equal to the specified inline length, DB2 stores the entire LOB data in the base TableSpace, more or less as if it were a VARCHAR. For LOBs that are bigger than the inline length, the inline portion of the LOB resides in the base TableSpace and the remainder in the auxiliary LOB TableSpace. In order to use this feature, assuming you are currently using standard LOB columns, you can alter existing LOB columns to become inline (using an ALTER TABLE statement, as sketched below).
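
As a minimal sketch (MYSCHEMA.MYTABLE and MYLOB are placeholder names, and 1000 bytes is just an illustrative value), the ALTER could look like the statement below. Note that, to my understanding, a REORG of the base TableSpace is needed afterwards to actually move the existing LOB data inline:

-- Hypothetical example: give an existing LOB column a 1000-byte inline portion
-- (MYSCHEMA.MYTABLE and MYLOB are placeholder names)
ALTER TABLE MYSCHEMA.MYTABLE
  ALTER COLUMN MYLOB
  SET INLINE LENGTH 1000 ;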

But what is the optimal INLINE LENGTH that you should specify? The question is almost philosophical, since the answer greatly depends on the distribution of LOB lengths in the column; in other words, it depends on the type of data that is stored in the LOB. And the DBA creating the table does not necessarily know what it contains, nor what it will contain…

Let’s examine 5 types of LOB length distributions that could exist in a LOB column.

1st case: almost all LOB data is small, limited in size, with a few exceptions.



This case is almost too good to be true, because the answer for the optimal INLINE LENGTH is obvious. You should set up the INLINE LENGTH “threshold” so that it covers the “almost all small data”:


2nd case: the LOB lengths are evenly distributed, from the minimum length to the maximum length




This case is also a no-brainer, and you will want to set up a threshold based on the percentage of LOB rows that should be entirely contained in the inline portion. For example, an 80/20 distribution as illustrated in the graph below:



3rd case: the LOB lengths are linearly distributed (ascending or descending)


Similarly to the 2nd case, you will want to set up a threshold based on the percentage of LOB rows that should be entirely contained in the inline portion, for example an 80/20 distribution as illustrated in the graph below:


4th case: the LOB lengths follow a normal (Gaussian) distribution.



This case might be more common, but essentially (if you ignore the small number of short LOBs on the left of the graph), you end up with a situation very similar to the 1st case, where you will want to set the inline LOB length very close to the end of the Gaussian curve. You might also use an 80/20 approach; a query to estimate such a percentile threshold is sketched below.
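
As a minimal sketch of how such a percentile could be estimated (using the SYSIBM.SYSVIEWS.PARSETREE column discussed later in this post purely as an example, and assuming ROW_NUMBER is available, which it is since DB2 9), the following query returns the LOB length under which roughly 80% of the rows fall, a natural candidate for the inline length:

-- Hedged sketch: LOB length under which ~80% of the rows fall
-- (SYSIBM.SYSVIEWS.PARSETREE is used here only as an example column)
SELECT MAX(LEN) AS P80_LENGTH
  FROM ( SELECT LENGTH(PARSETREE) AS LEN,
                ROW_NUMBER() OVER (ORDER BY LENGTH(PARSETREE)) AS RN
           FROM SYSIBM.SYSVIEWS ) AS T
 WHERE RN <= 0.8 * (SELECT COUNT(*) FROM SYSIBM.SYSVIEWS) ;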


5th case: the LOB length distribution is sinusoidal


For this case, it is fairly complicated to provide an educated suggestion as to where to set the inline LOB length “threshold”. In fact, this looks like a succession of Gaussian curves (4th case), so you could set the threshold after any of these curves, as illustrated below. Furthermore, you might want to analyze which data get accessed more frequently, and thus determine after which curve to place the threshold. Hopefully, this case will not be that common in real environments.
  


From Theory to Practice …

The above described distribution models, which is nice in theory… In practice, you will want some sort of tool to analyze the existing lengths of your LOB data, understand which case your LOBs fall into, and adapt the INLINE LENGTH accordingly (or confirm the setup is good the way it is). Read below…

The following SQL statement will list the columns in tables that are using the INLINE LOB technique in your subsystem:

-- LOB columns defined with an inline portion (LENGTH > 4)
SELECT TBCREATOR, TBNAME, "NAME", LENGTH, LENGTH2, COLTYPE
  FROM "SYSIBM".SYSCOLUMNS
 WHERE ( COLTYPE = 'CLOB'
      OR COLTYPE = 'BLOB'
      OR COLTYPE = 'DBCLOB' )
   AND LENGTH > 4 ;

“LENGTH” > 4 indicates that the LOB column uses the inline LOB technique; the actual inline LOB length is LENGTH - 4.
“LENGTH2” reflects the maximum length of the LOB column (inline portion + the part stored in the auxiliary TableSpace).
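
As a small variant of the query above, the same catalog columns can be used to report the inline length and the maximum LOB length directly:

-- Variant of the previous query: compute the inline length (LENGTH - 4)
-- and report the maximum LOB length (LENGTH2) for each inline LOB column
SELECT TBCREATOR, TBNAME, "NAME",
       LENGTH - 4 AS INLINE_LENGTH,
       LENGTH2    AS MAX_LOB_LENGTH
  FROM SYSIBM.SYSCOLUMNS
 WHERE COLTYPE IN ('CLOB', 'BLOB', 'DBCLOB')
   AND LENGTH > 4 ;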

As an example, the test below was made on the catalog table SYSIBM.SYSVIEWS, which contains a LOB column PARSETREE: a 1 GB BLOB (1,073,741,824 bytes) with an inline length of 27674 - 4 = 27670.

Here is the sample SQL statement that I used to visualize the length distribution of a LOB column. It simply groups the various LOB lengths into length ranges.

-- Distribution of the LOB lengths, grouped into ranges of roughly 1/100th
-- of the inline length (27670 is the inline length of the PARSETREE column)
SELECT RANGE, COUNT(LENGTH) AS ROW_COUNT
  FROM ( SELECT LENGTH(PARSETREE) AS LENGTH,
                (CEIL(27670 / 100)
                 * (1 + CEIL(100 * LENGTH(PARSETREE) / 27670))) AS RANGE
           FROM SYSIBM.SYSVIEWS ) AS TABLE_WITH_RANGE
 GROUP BY RANGE
 ORDER BY RANGE ;

When exporting the output of the query into Excel, one can build a simple graph of the LOB length distribution and visualize whether the INLINE LENGTH value is properly set. The analysis shows that we are pretty much in the scenario of the 1st case, which I would assume to be the most common one anyway. The inline LOB length “threshold” seems adequately placed for this LOB column.


The nice thing about this technique is that it is re-usable: the same query can be run over time, as the LOB data values evolve, to monitor the LOB length distribution and make sure the INLINE LENGTH value is still optimal.
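
As a final note, to avoid hard-coding the inline length (27670 in the example above), the value can also be read from the catalog. Here is a hedged sketch of that idea for the same PARSETREE column (same logic as the query above, simply joined to SYSIBM.SYSCOLUMNS):

-- Hedged sketch: same range query, with the inline length (LENGTH - 4)
-- read from SYSIBM.SYSCOLUMNS instead of being hard-coded
SELECT RANGE, COUNT(*) AS ROW_COUNT
  FROM ( SELECT (CEIL(C.INLINE_LEN / 100)
                 * (1 + CEIL(100 * LENGTH(V.PARSETREE) / C.INLINE_LEN))) AS RANGE
           FROM SYSIBM.SYSVIEWS V,
                ( SELECT LENGTH - 4 AS INLINE_LEN
                    FROM SYSIBM.SYSCOLUMNS
                   WHERE TBCREATOR = 'SYSIBM'
                     AND TBNAME    = 'SYSVIEWS'
                     AND "NAME"    = 'PARSETREE' ) AS C
       ) AS TABLE_WITH_RANGE
 GROUP BY RANGE
 ORDER BY RANGE ;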

Monday, July 15, 2013

Impact of SELECT * versus SELECTing specific columns

A couple of weeks ago, I attended a DB2 for z/OS technical conference, and one of the sessions focused on tuning SQL statements and other topics related to DB2 performance. For the record, the presentation was named “DB2 10 for z/OS: selected performance topics” and the presenter was Cristian Molaro (Independent DB2 Consultant, IBM Gold Consultant & IBM Champion).

Anyway, I learned that there is a notable performance impact when doing a quick and dirty SELECT * compared to selecting only the specific columns you need. There were no benchmarks, however, so I thought it would be a good test to perform: create some simple tests with various tables and measure the impact of this SELECT * bad coding practice. First thing to mention, the performance degradation is not visible when dealing with small tables (small numbers of rows); my tests with a 1,000-row table did not show any difference. But when I performed tests on a more realistic table size (1,000,000 rows), I did notice meaningful differences.

The graphs below represent the results of a performance test (CPU time and elapsed time) for a 1,000,000-row table residing in a DB2 v10 New Function Mode (NFM) subsystem. This is a fairly simple table, with only 7 columns (integers, dates, and timestamps), and the test compared a SELECT * with a SELECT COL1, COL2, as sketched below.
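
For illustration only, the statements compared were of this general shape (TESTDB.TEST_TABLE and its columns are hypothetical names, not the actual test table):

-- Hypothetical table, similar in shape to the one used for the test
CREATE TABLE TESTDB.TEST_TABLE
       (COL1 INTEGER NOT NULL,
        COL2 DATE,
        COL3 DATE,
        COL4 TIMESTAMP,
        COL5 TIMESTAMP,
        COL6 INTEGER,
        COL7 INTEGER) ;

-- The two statements being compared
SELECT *          FROM TESTDB.TEST_TABLE ;   -- retrieves all 7 columns
SELECT COL1, COL2 FROM TESTDB.TEST_TABLE ;   -- retrieves only what is needed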


The results of this test speak for themselves: there is a 70% overhead in CPU time consumed and a 17% overhead in elapsed time when using the SELECT * statement, i.e. when retrieving all columns, some of which are probably not needed by “the application”. Other tests performed show similar results, although obviously the overhead of SELECT * varies depending on the number of columns specified in the SELECT COL1, COL2, … and the size of the table (number of rows).

Digging deeper: a few years ago I used a product that examines embedded SQL statements in COBOL programs, CA Plan Analyzer (it also works for other programming languages). While writing this blog article, I went back to this tool to see if it would detect a SELECT * statement. It did. Without entering into too much detail about this tool, there is a rule (a sort of trigger) called Expert Rule 0064 that is triggered if an embedded SELECT * SQL statement is discovered in the application / plan / package being analyzed. In addition to the performance impact detailed above, CA Plan Analyzer also notifies the user with the following recommendation (which makes a lot of sense, and is IMHO another good reason why application developers should not use SELECT * type SQL statements in their applications):
                                                                             
         This should be avoided because of problems that can be encountered 
         when adding and removing columns from the underlying table(s). Your
         application program host variables will not correspond to the      
         added/removed columns.                                             

In a nutshell, if you are a DB2 application developer and you care about your applications’ performance: do not use SELECT * statements!
If you are a DB2 administrator, you may want to share this post with your application developers, and/or look for products that can detect the use of embedded SELECT * statements running in your DB2 environment.

Monday, July 8, 2013

Improving the response time of Native SQL procedures in DB2 v10

IBM claims better performance “out of the box” when upgrading to DB2 version 10. Some additional manual steps are however recommended to gain additional performance from your DB2 applications and/or environment. One of these concerns Native SQL Procedures.

DB2 version 9 introduced a new type of stored procedure called Native SQL Procedures. These procedures execute in the DBM1 address space, and provide SQL performance improvements due to fewer cross-memory calls compared to external stored procedures.
Several improvements were made in this area in DB2 version 10. IBM however mentions that the response time improvement (up to 20%) can only be achieved if the existing Native SQL Procedures are dropped and re-created. That is, if you created Native SQL Procedures under DB2 version 9 and upgraded to DB2 version 10, you might want to DROP/CREATE them.

I created a SQL statement to locate Native SQL Procedures that were created prior to the migration to DB2 version 10 NFM. Since I did not know the exact date when the subsystems were upgraded, I figured out that I could use a trick to get this information. As in every new DB2 version, the DB2 catalog contains additional tables that are created during the migration process. I took one of them, SYSIBM.SYSAUTOALERTS, and queried SYSIBM.SYSTABLES to get its CREATEDTS value, a timestamp that indicates when a table was created. So the following SQL statement indicates when DB2 was either installed at the version 10 level, or upgraded to version 10 NFM:

-- When was DB2 upgraded to v10 NFM ?
SELECT CREATEDTS
  FROM SYSIBM.SYSTABLES
 WHERE NAME    = 'SYSAUTOALERTS'
   AND CREATOR = 'SYSIBM' ;

The game is then to locate Native SQL Procedures created prior to that date. Stored procedures are listed in SYSIBM.SYSROUTINES, and the column value ORIGIN = 'N' indicates that we are dealing with a Native SQL Procedure. The following statement will thus provide a list of the Native SQL Procedures created before DB2 was upgraded to version 10 NFM:

-- List of Native SQL Procedures to re-create
SELECT CREATEDBY,OWNER,NAME,ORIGIN,CREATEDTS        
  FROM SYSIBM.SYSROUTINES                           
 WHERE ORIGIN = 'N'                                 
   AND CREATEDTS < ( SELECT CREATEDTS               
                       FROM SYSIBM.SYSTABLES        
                      WHERE NAME    = 'SYSAUTOALERTS'
                        AND CREATOR = 'SYSIBM' );    

With that, I got the list of Native SQL Procedures that I want to DROP / RE-CREATE. Hopefully, you have the DDL for these objects stored in a dataset, but that is far from guaranteed. If not, you can use tools to generate the DDL statements from the information in the catalog; in this example I used CA RC/Query for DB2 for z/OS to locate a particular Native SQL Procedure and generate its DDL:


This command produces the following output :


All that’s needed is to update the SQL, add the DROP syntax and a couple of COMMITs, and the job is done! A minimal sketch of the resulting job is shown below.
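
Purely as an illustration (MYSCHEMA.MYPROC and its body are placeholder names, not DDL generated by the tool; use the DDL generated for your own procedures), the resulting job could look roughly like this:

-- Hypothetical skeleton: drop and re-create one Native SQL Procedure
-- (replace MYSCHEMA.MYPROC and the body with your generated DDL)
DROP PROCEDURE MYSCHEMA.MYPROC ;
COMMIT ;

CREATE PROCEDURE MYSCHEMA.MYPROC
       (IN  P_CREATOR     VARCHAR(128),
        OUT P_TABLE_COUNT INTEGER)
  LANGUAGE SQL
  SELECT COUNT(*) INTO P_TABLE_COUNT
    FROM SYSIBM.SYSTABLES
   WHERE CREATOR = P_CREATOR ;
COMMIT ;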

I hope that this post is useful to you. If you use the technique described above, I will be glad to read your comments, especially if you noticed and quantified performance improvements after dropping and re-creating your Native SQL Procedures!

Monday, July 1, 2013

DB2 v11 – expected timeframe & features

Although IBM does not provide an exact schedule for new versions of DB2, one can guess that DB2 v11 will be available pretty soon, probably in Fall 2013. Indeed, looking at the release schedule of DB2 over the last 15 years, you will notice that IBM has released a new version of DB2 consistently every 3 years, and that the support of each version lasted 8 years – with the exception of DB2 version 8, which had an additional year of support. Worth mentioning as well, it’s no secret that the IBM development labs transitioned several years ago to an Agile development methodology, which tends to favor a fixed schedule (date of delivery) over the amount of features delivered, and tends to make new version delivery cycles very predictable.

I represented this graphically, with my best guess of dates for DB2 v11 :

 

Only members of IBM’s ESP program (Early Support Program) have the chance to see which new features are really in DB2 v11 (with roughly monthly drops of new features (Agile Rally)), but some clues are given by IBM on their official marketing ESP page. In addition, IBM representatives have given more specific details during several DB2 Regional User Group meetings over the last couple of months. Let’s summarize the information here:

IBM DB2 marketing materials:

DB2® 11 for z/OS® expands the value delivered to your business by IBM's industry-leading mainframe data server in the following ways.
  • Help save money, save time, and reduce costs
  • Provide unmatched availability, reliability, and security for business critical information
  • Provide enhanced analytics for business growth
  • Offer simpler, faster migration for faster ROI
Selected features that deliver these valuable benefits to your business include:
  • CPU reductions and performance improvements for certain online transaction processing (OLTP), heavy insert, and select query workloads, and when running queries against compressed tables
  • Improved data sharing performance and efficiency
  • Improved utility performance and additional zIIP eligible workload
  • Cost-effective archiving of warm and cold data with easy access to both within a single query
  • Intelligent statistics gathering and advanced optimization technology for efficient query execution in dynamic workloads
  • Additional online schema changes that simplify management, reduce the need for planned outages, and minimize the need for REORG
  • Productivity improvements for DBAs, application developers, and system administrators
  • Efficient real-time scoring within your existing transaction environment
  • Enhanced analysis, forecasting, reporting, and presentation capabilities, as well as improved storage management, in QMF™
  • Expanded SQL, SQL PL, temporal, and XML function for better application performance
  • Faster migration with application protection from incompatible SQL and XML changes and simpler catalog migration

Additional information from User Groups’ events:

As seen at the SQLAdria event in Croatia in June 2013, IBM’s Curt Cotner presented “Future Directions for DB2 for z/OS”, which highlights the following “DB2 11 Major Themes”:

Performance Improvements
– Improving efficiency, reducing costs, no application changes
– 0-5% for OLTP, 5-15% for update intensive batch
– 5-20% for query workloads
– Less overhead for data de-compression
– Exploitation of new zEC12 hardware features
Continuous Availability Features
– Improved autonomics which reduces costs and improves availability
– Making online changes without affecting applications
– Online REORG improvements, less disruption
– DROP COLUMN, online change of partition limit keys
– Extended log record addressing capacity (1 yottabyte)
– BIND/REBIND, DDL break into persistent threads
Enhanced business analytics
– Faster, more efficient performance for query workloads
– Temporal and SQLPL enhancements
– Transparent archiving
– SQL improvements and IDAA enhancements
Simpler, faster DB2 version upgrades
– No application changes required for DB2 upgrade
– Access path stability improvements
– Product stability: support pre GA customer production

Some of these features look very interesting, such as the extended RBA and LRSN addressing and the transparent archiving via IDAA.

A word of caution: The following statements regarding IBM plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.