I just fixed a bug in Quest Code Tester for Oracle, and I thought I would share my experience with you.
The bug was reported by our most excellent QA tester, Danny Pham. He is not only good at going through the defined test cases to verify correct behavior. He also takes devilish delight in trying all sorts of "crazy" stuff to see what happens.
So Danny encountered this error after he set up a test case with three outcomes, all built on cursor variable expression tests (something that is very unlikely for a customer to have done):
PL/SQL: numeric or value error: NULL index table key value
ORA-06512: at "QCTO.QU_TEMPLATE_XP", line 1523
ORA-06512: at "QCTO.QU_TEMPLATE_XP", line 1570
ORA-06512: at "QCTO.QU_GENERATE", line 6062
ORA-06512: at "QCTO.QU_GENERATE", line 6138
I did some tracing, digging around, etc. and I concluded that this is very quirky behavior, likely specific to this unlikely scenario, and I was just about to write him an email suggesting that this one will not be easy to fix, is not a high priority bug to fix (we are in final QA testing - only able to fix "P0" bugs - the product is considered broken if a P0 is not addressed), and we should leave it for next version.
But then I thought: "Aw, heck, let's try to reproduce the problem with a simpler, more likely scenario." And, lo and behold, and much to my stunned disappointment, it was very easy to replicate the error with a much more likely scenario.
That left me with no choice: MUST FIX BUG. And it was very odd behavior, too. The error occurred when I first defined the test and any time I tried to run it in that session. But when I exported and imported the test definition, the bug went away, and I could run the test just fine.
After a little bit more analysis, I also found that if I closed Code Tester and reopened it, the bug would disappear. How could this be? What was different?
Well, any time a bug's behavior changes with a new connection, you should think about what was cached in the session, either by Oracle or by the application. That led me to remember that I had implemented some caching of key data in the Code Tester backend to improve performance when generating and running tests.
I cache the data when validating the test definition, and then reuse that same cached data when generating the test code and finally when running the test (at least in those situations where I need to generate some "just in time" code elements).
Caching is fine and wonderful - especially Oracle 11g's new function result cache, which is a thing of great beauty - but you have to make sure that the underlying data hasn't changed!
So maybe that was the problem...something had changed. But why would it change? So I opened up one of my program files and did some searching for DML operations. And I found a point in the validation phase when I go back and make a change to the underlying data (actually, I knew about this step and had studied it, but hadn't connected it to the caching issue).
Now, you could argue (and I did so to myself) that validation is not the place for data changing. You could also argue that there is no time to refactor this convoluted piece of code. You could, finally, argue that you should put it on your to do list to get back to this piece of code and clean it up.
I put the following on my to do list: "Get back to this piece of code and clean it up."
In the meantime, however, I put in a call to invalidate the cache and force new retrieval of data, immediately after the change I made to the qu_attributes table.
And then the bug disappeared. Ah...so nice.
Lessons learned (or of which I was reminded):
- Don't be hasty in drawing conclusions about the causes of bugs, especially if your conclusion seems anything like a rationalization for not doing something you know you should do.
- Be very, very, very careful when caching data. It's a powerful performance optimization technique, but it can lead to dirty data and bugs that are hard to identify and fix.