Do you have a topic that you'd like discussed? We'd love to hear from you. Send us your idea for a blog topic.

 |
 |
|
|
 |
 |
Location: Blogs
Richard To's Blog
|
|
| RichardTo |
Monday, October 30, 2006 2:28 PM |
For years, commercial database manufacturers have fought an endless battle to improve the performance of inserting, updating, deleting, and retrieving information stored in the database. Despite their continual efforts and hard work, we have not seen a significant improvement in the performance of most Relational Database Management Systems (RDBMS). Users still suffer from under-performing SQL statements and database experts are still spending countless hours tuning SQL statements.
Major deficiencies cause the database's internal SQL optimizer to fail to find the best, or at the least a good execution plan for a SQL statement. Two of these deficiencies are, having inaccurate statistics and having inaccurate cost estimation of the resources that may be used for the SQL statement.
Statistics and Index Statistics may not be Up-to-date
Two common methods are used to collect database statistics: batch mode statistics collection and a near real time data sampling technique. The batch mode statistics collection method processes a large set of data and uses system resources and takes time to collect the data statistics. Consequently, it is normally executed during non-peak hours when the systems administrator determines statistics need to be updated. If a lot of INSERT or DELETE operations have occurred since the last collection of the statistics, then the statistics will not reflect the most up-to-date information when a SQL statement is executed. This may cause the database SQL optimizer to generate a poor performing execution plan for the current data distribution.
Fortunately, in the last few years the database vendors have improved the collection of statistics with a new real time data sampling technique. With this technique, small amounts of data are sampled for the statistics collection. If the data is equally distributed, the sampling data can fully reflect the true data distribution. But if the data is not equally distributed, the batch mode statistics collection method should still be used to generate accurate statistics for the database.
Cost estimation may not be accurate
The second deficiency that can cause the database SQL optimizer to select a poor execution plan is to have an inaccurate cost estimation. It is obvious that incorrect statistics will result in inaccurate cost estimation for a SQL statement. Even if the database statistics are accurate, the database SQL optimizer cannot correctly determine the best execution plan because it does not have all the pertinent information. Let's examine the following SQL (see figures 1 and 2):
SELECT *
FROM A, B, C
WHERE A.key1 = B.key1
AND B.key2 = C.key2
AND C.f2 = 0
AND A.f1 = 0

Figure 1: This example is using the Nested Loop join operation. The driving path starts from table C to table B which matches 4 rows from table B. Then those 4 rows are matched to the rows in table A. The rows highlighted in gray in table B show the two records that caused extra scan operations using this driving path.

Figure 2: This example is also using the Nested Loop join operation. The driving path starts from table A to table B which matches 2 rows. Then those 2 rows are used to match to the rows in table C. So with this driving path, no unnecessary table scans are performed. Though starting with a condition of lower selectivity, the overall scan operations are more optimistic.
In this SQL statement, three tables are joined. We assume that database SQL optimizer will use a Nested Loop Join only when all of the statistics are available for the tables and indexes used in this SQL statement. In this example, we find from reviewing the data that driving path from table C→B→A takes more time than that of the driving path A→B→C. But most of database SQL optimizers will pick up the driving path C→B→A. This happens because C.f2 has a unique index and the database SQL optimizer does not know which values will be returned from C.key2. As you can see from this example, the data itself is determining the number of records retrieved and the cost to process the execution plan. The database SQL optimizer does not have this information so it cannot choose the best execution plan.
Database Internal SQL Rewrite
Today, the internal SQL optimizer in most databases has a limited built-in SQL Rewrite ability (see figure 3). The internal rewrite is used to correct some obvious programming mistakes or to rewrite the input SQL to an internal SQL syntax. This internal SQL syntax will make the later stage of the optimization easier such as transforming an IN or EXISTS sub-query to a JOIN statement or making use of materialized views. Some databases have weak internal SQL rewrite ability, which leaves a lot of room for database tuning experts to influence the database SQL optimizer so that it makes a better choice in the later stage of the execution plan generation. Of all the databases, Oracle has the most open architecture to accept the user’s influence with use of the optimization hints. Oracle is also the most syntax sensitive database which allows users to tune their SQL statements by restructuring the SQL syntax.
In contrast, a database like IBM DB2 UDB has a very strong internal SQL rewrite ability. In IBM DB2 UDB, most SQL statements can be transformed into their internal SQL syntax before they are further optimized. This appears to be good news that would make it unnecessary for programmers to tune their SQL statements since the database SQL optimizer would be doing the job for them. But the problem with this approach is that when the database SQL optimizer makes a mistake on a specific SQL statement, it is hard for the programmer to influence the database SQL optimizer to pick up other choices. This illustrates the dilemma that database engineers always face; do they make the database smart so that it makes intelligent choices for the user or do they give the user the control since the database SQL optimizer will not always make an intelligent choice?

Figure 3: Database SQL Optimizer processing diagram
Changing SQL syntax can Improve SQL Performance
Due to limitations in the database's internal SQL rewrite capability; it cannot transform a SQL statement into very many semantically equivalent SQL syntaxes. Therefore, it is not possible for it to generate every possible way to rewrite a SQL statement. For example, assume that 5 SQL statements from SQL 1 to SQL 5 are semantically equivalent, but syntactically different. The database SQL optimizer may generate a different set of execution plans accordingly (see figure 4). For each set of execution plans (SQL 1 has Plan A, B, C), the database SQL optimizer will carry out a cost estimation and will execute the execution plan with the lowest cost. If the database SQL optimizer fails to select a good execution plan due to an inaccurate cost estimation or limited by the number of execution plans generated, the corresponding SQL statement’s performance will be degraded.
In order to rectify this situation, a programmer can rewrite the original SQL multiple times. For each rewritten SQL, the database SQL optimizer creates a set of execution plans. Using the new set of execution plans, the database SQL optimizer may select to execute an execution plan that was not found in the execution plan set of the original SQL. If one of the execution plans from those rewritten SQL has better performance, the programmer can use the new syntax to replace the original SQL in the source to improve the SQL performance without changing the results. This is what we call SQL tuning without creating new indexes or database configuration changes.
SQL tuning experts may find that some rules of SQL syntax restructuring improve SQL statements in certain environment. For example, a Nested Loop join accessing a small table first and using an index search on a large table is normally better than joining the data from the large table to small table. But, for complex SQL statements with many table joins, a general rule like that may be hard to apply. This is especially true in situations with many join operations where sorting and filtering methods are combined into one long execution plan.

Figure 4: Relationships with SQL Syntax changes and the execution plans generation
Let me stop here, next month I will tell you “Why Join Path Matters?”
Richard To |
| Copyright ©2006 SQLExpert |
| Permalink |
Trackback |
|
 |
 |
|
 |
|
 |
|
|