New Oracle Bug Alert!

Oracle Community

New Oracle Bug Alert!

Follow / 8.13.2014 at 8:37pm

While regression testing for an Oracle 11.2.0.1 to 11.2.0.4 upgrade at my current client, a bug was found that I thought would be worth sharing. You can download  and then import a test data set if you'd like to recreate the bug behavior yourself. The bug is 19384287 : WRONG NUMBER OF ROWS RETURNED ON FIRST EXECUTION. This bug is confirmed as being reproducible on 11.2.0.4.0 through 12.1.0.2.

The problem is that when certain query formulations are used and the buffer cache does not contain any of the data that would satisfy the query request (i.e. this can be recreated by flushing the buffer cache), only partial results are returned for anywhere from 1-n (n has gone as high as 4 in our test cases) executions of the SQL. After the query finally does return the full set of rows, it will continue to do so until such time as the blocks are aged out of cache and must be physically read in again and the problem repeats.

You can imagine that this is worrisome for many reasons, not the least of which is that although this behavior was noted in just one case, we could not be sure if the same SQL formulation wouldn't be found elsewhere. Part of the reason why this was a bit difficult to know for sure is that the problem SQL is dynamically generated. That means that there are countless possible ways SQL could be formulated so it's difficult to know every possible specific SQL statement that might be affected until someone notices/catches the problem.

Take a look at the test case:

SQL> alter system flush buffer_cache; -- Forces no blocks to be present in buffer cache

System altered.

SQL>
SQL> alter system flush shared_pool; -- Not necessary to reproduce the bug but clears out previously executed SQL

System altered.

SQL>
SQL> desc oracle_sr_test_data.oracle_p;
Name Null? Type
---------------------------------------------------------------------- -------- -----------------------------------------------
OID NOT NULL NUMBER(19)
S_OID NOT NULL NUMBER(19)
SUB_OID NUMBER(19)
GUID NOT NULL RAW(16)

SQL>
SQL> desc oracle_sr_test_data.oracle_h;
Name Null? Type
---------------------------------------------------------------------- -------- -----------------------------------------------
P_OID NUMBER(19)

SQL>
SQL> select count(*)
2 from dba_unused_col_tabs
3 where table_name in ('ORACLE_H', 'ORACLE_P')
4 and owner = 'ORACLE_SR_TEST_DATA';

COUNT(*)
----------
0

SQL>
SQL> select object_type, object_name
2 from all_objects
3 where owner = 'ORACLE_SR_TEST_DATA'
4 order by object_type desc, object_name;

OBJECT_TYPE OBJECT_NAME
------------------- ------------------------------
TABLE ORACLE_H
TABLE ORACLE_P
INDEX OH_P_O
INDEX OP_G_IDX
INDEX OP_PK
INDEX OP_SUB_IDX
INDEX OP_SUB_OID_IDX
INDEX OP_S_NUK

8 rows selected.

SQL>
SQL> select count(*)
2 from (select p.guid, s.guid
3 from oracle_sr_test_data.oracle_p p
4 join oracle_sr_test_data.oracle_p s on s.oid = nvl(p.sub_oid, p.oid)
5 join oracle_sr_test_data.oracle_h h on h.p_oid = nvl(p.sub_oid, p.oid)
6 where p.s_oid = 5578308);

COUNT(*)
----------
1927

SQL>
SQL> /

COUNT(*)
----------
2183

SQL>
SQL> /

COUNT(*)
----------
2183

SQL>




With another dataset, I captured a SQL Monitor report that shows how the rows basically stop flowing up through the plan. Note that at line 6, there are 544 rows (that is the final/total number of rows the query should return). But, the NESTED LOOPS join at step 5 only ends up with 262 rows that eventually get produced in the final result set.


The key here appears to be the presence of the NVL in the join and a function-based index on the NVL expression:
join oracle_sr_test_data.oracle_p s on s.oid = nvl(p.sub_oid, p.oid)

Oracle has made several suggestions to work around the problem including:
Set statistics_level = ALL
Set "_disable_function_based_index" = TRUE

With either of these set, the bug does not reproduce. But, can we afford to disable all function-based indexes when we have them present all throughout the database? Probably not. And, can we afford the overhead of setting statistics_level to ALL in our production environment? Probably not.

They've also suggested we try a couple of different hints to see if they would eliminate the issue, but none of them worked. They were: NO_NLJ_BATCHING and NO_OUTER_JOIN_TO_INNER (I'd never seen this hint before).

We don't have a fix yet but if/when we do, I'll make sure to post an update.

So, heads up folks on 11.2.0.4 with similar SQL...you may be experiencing this bug and not even know it! 

1663 1 /
Follow / 10 Sep 2014 at 3:26pm

Thanks for sharing. This is serious, perhaps serious enough to delay upgrading or patching until it's resolved.