Mar
13
Written by:
Richard To
Friday, March 13, 2009 4:11 AM
This blog is the first in a series about misconceptions surrounding SQL tuning that are quite common. The first one covers the misconception that you can use the estimated cost from the database SQL optimizer to accurately judge the performance of a SQL statement in comparison to its rewrites.
SQL tuning is a very interesting topic and most DBA or developers have at least some experience with tuning SQL. They learn some tuning concepts from different sources such as the internet, tuning books, or their colleagues. There is no single scientific tuning guideline or method in the field. This gives us the impression that SQL tuning is more of an art than scientific knowledge. I have been in this field for almost 20 years and have experience doing SQL tuning on at least 4 database platforms. Basically, I have found that each database internal optimizer’s behavior is different from the others. For example, IBM LUW(Unix) has a very strong internal rewrite that can transform your SQL to a relative standardized internal syntax in which you don’t have any control over the ultimate query plan’s generation. In contrast, the Oracle SQL optimizer is the most syntax sensitive SQL optimizer in the market which allows you to influence this optimizer to select a better plan. Microsoft SQL Server and Sybase came from the same architecture in the beginning. Consequently, their SQL optimizers behave similarly, but now they are getting more diverse from version to version.
I cannot say which database platform’s approach to SQL optimization is better, but what I can tell you is that Oracle is the most aggressive and open database SQL optimizer among all databases. I believe the architects of Oracle are the first few people who admit the limitation of their SQL optimization technology. They do not pretend to provide you with a mighty database internal SQL optimizer Instead they provide you Hints, SQL Outlines, SQL Profiles and even the latest SQL Plan Baselines features to help you to rectify their SQL optimization problems. Actually, this trend is getting popular. Now, every database platform provides at least some features for you to influence or freeze a SQL query plan.
It is no doubt that SQL optimization is still one of the challenges of today’s database research. We are still suffering from bad performance SQL from day to day. Since there are many common misconceptions floating around SQL optimization, I would like to make some clarifications in this corner and hope it will be helpful for you while you are tuning your SQL statements.
First Misconception:
The estimated cost is a good way to judge the performance of a SQL and its rewrites
It is quite a common misconception that people should use the cost to judge a SQL statement’s performance. Statistically, if we are talking about hundreds of SQL statements, this judgment is correct. For example, if we have a pool of two hundred SQL statements that is running on the same database and if we select the set of one hundred SQL statements with highest estimated cost from this pool and compare the execution time of the set with the other set of one hundred SQL statements with lowest cost from the same pool, the result is probably as expected, the set with the lowest cost will be the better performing set. (This explanation is assuming that there is no one SQL statement that dominates the total execution time.) But if you randomly select two SQL statements to compare their cost and execution time, unless the cost is significantly different up to multiple times, it will be hard to tell which SQL will run faster just by the cost estimation. The reason for this is that the cost estimation algorithm of the database SQL optimizer is not as accurate as what we expect it to be. There are many reasons for why it is not accurate, especially for complex SQL statements. The most common errors found in cost estimations are filtering and join cardinality estimation.
Here is an example to demonstrate the problems a database optimizer will encounter.
select emp_name
from employee, manager
where emp_mgr_id=mgr_id
and mgr_name like “% Peter %”
This SQL extracts all employees’ names whose reporting manager’s name is like “% Peter %”. Since there is no information for the database SQL optimizer to estimate how many managers will have a name like “% Peter %” before the execution, the SQL optimizer will fail to make an accurate estimation for this filtering step in the query plan which is at the first stage of the plan. It will create an arbitrary number at this stage. Another problem is the join cardinality estimation. In a very ideal situation where the emp_mgr_id is equally distributed in the employee table, the SQL optimizer can use the emp_mgr_id histogram to roughly estimate the matched records number, but the accuracy will depend on the arbitrary number created in the first stage. In contrast, in a real life situation, the emp_mgr_id distribution may be skewed in the employee table, so, it will only be by luck that the SQL optimizer can provides an accurate cost estimation of the join cardinality.
Let me give you another more mathematical example with the following:
select *
from A,B
where A.f1 = B.key
In this example, the database SQL optimizer has no problem in estimating an accurate cardinality of the join result, since both tables are retrieved without filter criteria. Theoretically, the database optimizer shouldn’t have any problem to do a 100% accurate cost estimation, but a correct cardinality estimation doesn’t mean that you will have a 100% accurate cost estimation. Why?
To simplify the discussion, let me assume the SQL optimizer can generate only two query plans for this SQL. The first query plan is a Nested Loop Join that uses table A to index search of table B using B.key. The other plan is a Sort Merge Join, in which table A and table B are sorted and then merged together. Let’s further simplify the discussion by assuming both tables have the same structure and the same number of records N.
The cost of the Nested Loop is ~ N*log2N/2 ~ O(NlogN),
where log2N/2 is the average depth of the a balance B-tree index and assuming that two nodes for each parent node.
The cost of Sort Merge will be in the order of O(NlogN).
So, you will find that both join methods have similar orders of speed. The difference will be the scale and cost of the individual operation. Let’s further simplify the situation and assume that the scale and the operation cost are the equal for both join methods, and see whether we can distinguish which join method will use less resource or run faster. The answer is still “not 100% sure!” Why?
You may be aware that the sort algorithm speed depends on the input data’s natural order. An ordered data will be a benefit to a Merge sort, but it will be a disadvantage to a Quick sort. It is quite common that the data in a database has at least some sort of order that is created by the users’ operation behavior. So, even though I made a lot of assumptions and tried to make everything isolated, I still cannot 100% accurately estimate the cost of a Nested Loop join and a Sort Merge join for this simple SQL statement, of course, I am talking about 100% accurate estimation here. But for the commercial database, we don’t need that rigid of a requirement, maybe 70% accuracy will be good enough, but please remember that this accuracy rate is not a constant; it will get lower in proportion as your SQL complexity gets higher.
Conclusion
The above mentioned examples are only part of the reasons that the database optimizer cannot carry out accurate cost estimation for your SQL statements. There are many other reasons that have not been discussed here. So, next time when you tune SQL statements, you will get the best result if you test run your rewrites or create physical indexes instead of relying on the virtual indexes cost estimation to judge which SQL rewrite or new indexes are better. I put this as the first discussion topic, since it is a very common mistake, and even some senior members in our team have this misconception sometimes!