Saturday 7 September 2019

Curious case of SQLs restarting by itself and ORA -12842

We might have seen lots of ORA-12842 "Cursor Invalidated during parallel execution".
Usually a select query running in parallel would fail with this error. This can happen when the select is running concurrently with other sessions doing some sort of partition maintenance ( like exchange partitions or adding / dropping partitions). Well, this is an expected behaviour. Document 1322894.1 tells you to ignore this error and re-execute the query. Or you can just "suppress" the error message by setting a fix control.

SQL> alter system set "_fix_control"='7170213:OFF' ;


But the note doesn't tell you about how Oracle actually suppresses the error. And is it really a good idea to set this fix control ?

Let's try and understand the current database behaviour. If a parallel query encounters ORA 12842 midway during the execution, the error is not immediately reported back to the user. Instead, Oracle silently restarts the SQL on its own hoping that it will go through this time. If this "automated" second run also fails with the same error, Oracle again restarts the SQL. And this continues. If the 10th execution is unsuccessful, only then ORA 12842 is reported back to the user.

Now, if I have a select query that should complete in 15 minutes. Let's say this fails after running for first 10 minutes and then Oracle restarts the query all over again. Assuming that all retry attempts were unsuccessful, this query would effectively run for 100 minutes and then report ORA 12842.
And if the query succeeds in the 10th attempt, the total run time for this query would appear to be 105 minutes! So, the user might complain that his query is running slow but actually Oracle is silently restarting the query in the background. In the ASH data you would see that the sql_exec_id would change after every failure.

Now let's talk about how Oracle suppresses this error. When we turn off this fix control, ORA 12842 is never reported. But there is a catch to it. With the fix control turned off, Oracle does not retry the SQL for 10 times ( as explained above). Instead, it downgrades the SQL to serial in the second attempt itself. And serial SQLs never report ORA 12842.  So, suppressing the error effectively means that your SQL will be downgraded to serial. And this will again have huge performance implications.

Now that we know how Oracle handles these kind of situations, we can make a careful choice on whether this fix control should be set. I'd personally advise against it. There are better ways to handle this error. Like, you could run the partition maintenance on weekends ( outside of your  normal OLTP/ batch reporting jobs).







19c Multitenant RAC- Default service disables parallelism

We recently upgraded one of our RAC databases from 11.2 to 19.10 on Exadata. And to everyone's surprise, almost everything on the system...