Wednesday, 12 August 2020

Demystifying Data Sharing in Oracle 19c Multitenant

Oracle introduced the multitenant architecture in 12c. Its purpose is consolidation: instead of having multiple databases on a server, we can now create one consolidated database called the CDB, which can contain multiple pluggable databases (PDBs). To an application, a PDB appears to be a separate database, but behind the scenes there is one instance, one set of background processes, one shared pool and so on. This architecture is built on data sharing between the CDB and its PDBs: Oracle keeps the dictionary information in the CDB while the application data segments remain private to each PDB.

In this article we will understand how data sharing works across PDBs and CDBs.
Oracle provides three different sharing techniques:

1. Metadata Link
2. Data Link
3. Extended Data Link

These sharing methods are what Oracle uses internally to implement the multitenant architecture. From 12.2, Oracle also allows users to create their own application root containers and then create PDBs inside the application container. The application root container is itself a PDB relative to the main root container, CDB$ROOT.
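
To see how containers relate to each other, the V$CONTAINERS view flags application roots and application PDBs. A quick query sketch (12.2 and later):

SQL> select con_id, name, application_root, application_pdb from v$containers;

CDB$ROOT reports APPLICATION_ROOT = NO, while an application root reports YES.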

Let's get started with a test case. We will first create an application container and then a PDB inside this container. I'm running Oracle database 19c on a Windows Platform. I have a root container already installed called NC001.

Log in as SYSDBA on NC001

SQL> create pluggable database app_container as application container admin user admin identified by oracle file_name_convert=('D:\app\beriv\oradata\NC001','D:\app\beriv\oradata\app_container');

Pluggable database created.

SQL> alter session set container=app_container;

Session altered.

SQL> alter pluggable database open;

Pluggable database altered.

SQL> create pluggable database APP_PDB1 admin user  pdb1admin identified by oracle file_name_convert=('D:\app\beriv\oradata','D:\app\beriv\oradata\APP_PDB1');

Pluggable database created.

SQL> alter pluggable database APP_PDB1 open;

Pluggable database altered.

We now have an application container called APP_CONTAINER, which contains a PDB called APP_PDB1.
Now let us log in to APP_CONTAINER and create shared objects. We create three tables here:

ML_T : A metadata linked table
EL_T : An extended data linked table
DL_T : A data linked table

To be able to create shared objects, the application root container must be in install or upgrade mode.

SQL> alter session set container=APP_CONTAINER;

Session altered.

SQL> alter pluggable database application app_container begin install '1';

Pluggable database altered.

SQL> create table ML_T SHARING=METADATA (A NUMBER);

Table created.

SQL> create table EL_T SHARING=EXTENDED DATA(A NUMBER);

Table created.

SQL> create table DL_T SHARING=DATA (A NUMBER);

Table created.

SQL> alter pluggable database application app_container end install '1';

Pluggable database altered.

SQL> alter session set container=APP_PDB1;

Session altered.

SQL> alter pluggable database application app_container sync;

Pluggable database altered.

METADATA LINK TABLE
======================

Table ML_T is a metadata linked table. The table definition (metadata) lives in the application container, but the actual data segment is private to each container. The SYNC command I ran above applies the application changes to the PDB, so that Oracle can create the data segment there. Table ML_T now exists in APP_CONTAINER and also in APP_PDB1: they share the dictionary definition, but the data is private to each container.
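
As a quick check, the SHARING column of DBA_OBJECTS shows the link type of each object. A sketch (the column is available from 12.1 on):

SQL> select object_name, sharing from dba_objects where object_name in ('ML_T','EL_T','DL_T');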

SQL> alter session set container=APP_CONTAINER;

Session altered.

SQL> insert into ML_T values(1);

1 row created.

SQL> commit;

Commit complete.

SQL> select * from ML_T;

         A
----------
         1

SQL> alter session set container=APP_PDB1;

Session altered.

SQL> insert into ML_T values(2);

1 row created.

SQL> commit;

Commit complete.

SQL> select * from ML_T;

         A
----------
         2

As you can see, the ML_T data segment is private: APP_PDB1 sees only its own data (2), while APP_CONTAINER sees its own version of the data (1).


EXTENDED DATA LINK TABLE
===========================

Table EL_T is an extended data link table. The metadata is shared, and data exists in both containers: the PDB can access its own data and also the data from the application container.

SQL> alter session set container=APP_CONTAINER;

Session altered.

SQL> insert into EL_T values(1);

1 row created.

SQL> commit;

Commit complete.

SQL> select * from EL_T;

         A
----------
         1

SQL> alter session set container=APP_PDB1;

Session altered.

SQL> insert into EL_T values(2);

1 row created.

SQL> commit;

Commit complete.

SQL> select * from EL_T;

         A
----------
         2
         1

As you can see, this returns 1 and 2: data from both APP_CONTAINER and APP_PDB1.
APP_PDB1 can modify its own rows, but not the rows coming from the application container.

DATA LINK TABLE
=================

DL_T is a data linked table. The data segment exists only in APP_CONTAINER, and the data cannot be modified in the PDB; the PDB can only view the data stored in the application root.

SQL> alter session set container=APP_CONTAINER;

Session altered.

SQL> insert into DL_T values(1);

1 row created.

SQL> commit;

Commit complete.

SQL> alter session set container=APP_PDB1;

Session altered.

SQL> select * from DL_T;

         A
----------
         1

Any attempt to modify this data would result in an error. 

SQL> update DL_T set A=99;
update DL_T set A=99
       *
ERROR at line 1:
ORA-65097: DML into a data link table is outside an application action

So far we have seen shared tables; we can create shared views in the same way. Let's look at a couple of examples.

SHARED VIEWS
==============

Here I create a data link view DL_V that selects from ML_T, the metadata linked table we created above.
I create another view EL_V, an extended data link view, which also selects from the same metadata linked table ML_T.

SQL> alter session set container=APP_CONTAINER;

SQL> alter pluggable database application app_container begin upgrade '1' to '2';

Pluggable database altered.

SQL> create view DL_V SHARING=DATA AS (SELECT * FROM ML_T);

View created.

SQL> create view EL_V SHARING=EXTENDED DATA AS (SELECT * FROM ML_T);

View created.

SQL> alter pluggable database application app_container end upgrade;

Pluggable database altered.

SQL> alter session set container=APP_PDB1;

Session altered.

SQL> alter pluggable database application app_container sync;

Pluggable database altered.

SQL> select * from DL_V;

         A
----------
         1

Here I'm logged in to APP_PDB1. We know that selecting from table ML_T gives us '2'. But selecting from the view DL_V we get '1'. This is because the view itself is defined as a data link, which instructs Oracle to pull the data from the application root container rather than from the current PDB. If you want to pull the data from the current APP_PDB1, you can use the Oracle-supplied function NO_OBJECT_LINK:

SQL> select * from NO_OBJECT_LINK(DL_V);

         A
----------
         2

SQL> select * from EL_V;

         A
----------
         2
         1

We see similar behaviour here. We know that selecting from the table ML_T gives us '2', but selecting from the view EL_V gives 1 and 2. Because the view is an extended data link, Oracle pulls the data both from the application root container (APP_CONTAINER) and from the current PDB (APP_PDB1). Again, if you want to pull the data only from the current PDB, use NO_OBJECT_LINK:

SQL> select * from NO_OBJECT_LINK(EL_V);

         A
----------
         2

In the next article we will see how Oracle uses this data sharing framework to consolidate the AWR data in CDB-PDB.




Friday, 12 June 2020

Exadata: IMPX vs Smart Scans

Starting with 11.2, Oracle introduced a new feature called In-Memory Parallel Execution (IMPX). It allows parallel slaves to do full segment scans via the buffer cache (and not via direct path reads). Note that this feature is not related to the 12c Database In-Memory option. On the face of it, it appears to be a nice little feature that lets you cache a big table across the buffer caches of all RAC instances. But on Exadata, your query will then not take advantage of smart scans, storage indexes or column projection, because these features require direct path reads as a prerequisite.

So the next question that immediately comes to mind is: which is better, a parallel query doing buffered reads or a parallel query doing smart scans? Let's get started with a little demo. IMPX kicks in when parallel_degree_policy is set to AUTO and the segment size is between 2% and 80% of the global buffer cache. There may be other heuristics and algorithms that impact Oracle's decision to use direct reads or buffered reads.

In this setup, I have a table T of size 26 GB. I'm on an Exadata X7 machine, using a 3-node RAC. Each node has a buffer cache of 20 GB, so the global buffer cache size is 60 GB. The database version is 11.2.0.4.
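
A rough way to verify that T falls inside that 2-80% window (a sketch from my setup; the exact heuristics are internal):

SQL> select bytes/1024/1024/1024 segment_gb from dba_segments where segment_name = 'T';

SQL> select sum(bytes)/1024/1024/1024 global_cache_gb from gv$sgainfo where name = 'Buffer Cache Size';

Here, 26 GB is roughly 43% of the 60 GB global cache, so the table qualifies for IMPX.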

Scenario 1 : parallel_degree_policy is set to MANUAL. IMPX does not kick in. 

SQL> SELECT /*+ PARALLEL(8) */ COUNT(*) FROM T ;

This SQL has a wall clock time of 1 second. Looking at the execution stats

SELECT inst_id,
       elapsed_time / 1e6 elapsed_seconds,
       executions,
       buffer_gets,
       disk_reads,
       ROUND(physical_read_bytes/1024/1024/1024, 2) physical_read_gb,
       ROUND(IO_CELL_OFFLOAD_ELIGIBLE_BYTES/1024/1024/1024, 2) io_cell_offload_eligible_gb,
       ROUND(IO_INTERCONNECT_BYTES/1024/1024/1024, 2) io_interconnect_gb,
       ROUND(IO_CELL_UNCOMPRESSED_BYTES/1024/1024/1024, 2) io_cell_uncompressed_gb,
       ROUND(100 * (IO_CELL_OFFLOAD_ELIGIBLE_BYTES - IO_INTERCONNECT_BYTES)
                 / IO_CELL_OFFLOAD_ELIGIBLE_BYTES, 2) io_saved_perc
  FROM gv$sql
 WHERE sql_id = 'aavvpznpcf8b7'
 ORDER BY 1

[Screenshot: GV$SQL execution statistics for the first run]

As you can see, total disk reads amount to about 26 GB, and almost all of it is eligible for offloading.
IO_SAVED_PERC is at 99%, which indicates that a smart scan has taken place: only a few MB of data were transferred back from the cell servers to the database compute nodes.
This explains why the SQL was so quick. This is the hardware capability of Exadata: the ability to scan 26 GB of data at the cell servers in less than a second. AWESOME! Isn't it?
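
If you want session-level confirmation of a smart scan, the cell statistics tell the same story. A sketch (statistic names as on Exadata 11.2):

SQL> select n.name, round(s.value/1024/1024/1024, 2) gb
     from v$mystat s, v$statname n
     where s.statistic# = n.statistic#
     and n.name in ('cell physical IO bytes eligible for predicate offload',
                    'cell physical IO interconnect bytes returned by smart scan');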

Scenario 2: parallel_degree_policy is set to AUTO. IMPX kicks in.

SQL> alter system set parallel_degree_policy=AUTO SCOPE=BOTH SID='*' ;

Now we run the same select again. 

SQL> SELECT /*+ PARALLEL(8) */ COUNT(*) FROM T ;

Now the SQL wall clock time went up drastically from 1 second to 8 minutes and 45 seconds.
Here are the execution stats

[Screenshot: GV$SQL execution statistics for the IMPX run]

IO_CELL_OFFLOAD_ELIGIBLE_BYTES is now 0, and around 26 GB of data has been sent back from the cell servers to the database compute nodes. It was a buffered read; smart scans did not take place. Let's look at V$BH to see whether the blocks have been cached across all the RAC instances.

SELECT inst_id, status, COUNT (*)
FROM gv$bh
WHERE objd = 66837983 AND status != 'free'
GROUP BY inst_id, status
ORDER BY 2 DESC

[Screenshot: V$BH block counts per instance]

These are the numbers of blocks cached on each instance in SCUR mode.
Adding up the block counts gives 1680263; multiplied by the DB block size of 16 KB, that is about 26 GB.
Oracle has uniformly cached the entire table across the buffer caches of instances 1, 2 and 3, each instance holding around 8 GB of data blocks. This was done using cell multiblock physical reads (not cell smart scans), which explains why this execution took more than 8 minutes: it had to read 26 GB of data from storage, transfer it to the compute nodes and cache it in the instance buffer caches. This is certainly an overhead for the first execution.
Now let's run the same SQL again, hoping that this time Oracle reads the data from the global buffer cache.

SQL> SELECT /*+ PARALLEL(8) */ COUNT(*) FROM T ;

And yes, second execution of this SQL completes in 1 second. Looking at the stats

[Screenshot: GV$SQL execution statistics for the second IMPX run]

No smart scans here, as IO_CELL_OFFLOAD_ELIGIBLE_BYTES is 0. But disk reads and physical read bytes have come down to 0, because Oracle can now find all the blocks it needs in the buffer caches of instances 1, 2 and 3. Another important point to note is that the PX servers do not use cache fusion to transfer blocks between nodes. It is also possible to see one instance doing direct path reads while other instances go for buffered reads.

To summarise this behaviour: IMPX is not really useful in Exadata environments because it prevents the use of smart scans and storage indexes. Queries that benefit a lot from smart scans will most likely suffer from parallel buffered reads. Moreover, the global buffer cache is still limited in size; a couple of big tables in a warehouse environment can easily fill up 80% of the cache, and then Oracle will start to thrash the cache, which is again a huge overhead.
In my opinion, the AUTO parallel degree policy is like a beast in a cage: it exposes your database to IMPX and parallel statement queuing, and as a DBA you are never in complete control! It is best to set the policy to MANUAL, especially for large warehouse databases on Exadata.
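
For reference, reverting the policy is a one-liner (same scope and sid syntax as the AUTO change shown above):

SQL> alter system set parallel_degree_policy=MANUAL scope=both sid='*';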

Tuesday, 28 April 2020

Various methods of Partition Pruning and DML restrictions

Partition pruning is an optimization technique the optimizer applies to eliminate partitions during query execution. In this article we will discuss various methods of partition pruning and certain pruning restrictions that apply to Insert As Select (IAS) statements.

Let's get started with a test case. We create two tables, T1 and T2. T1 is non-partitioned;
T2 is partitioned by HASH on the partition_id column.

CREATE TABLE T1 ( event_name VARCHAR2(254) , partition_id NUMBER);

CREATE TABLE T2
(
    partition_id      NUMBER,
    partition_name    VARCHAR2 (254)
)
PARTITION BY HASH (partition_id)(PARTITION p1,
 PARTITION p2,
 PARTITION p3,
 PARTITION p4);
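
(The post does not show the data load; here is a minimal, hypothetical load that would produce cardinalities in the same ballpark as the plans below. Row counts are illustrative only.)

-- Hypothetical data load and stats gathering
INSERT INTO T1 SELECT 'str_event_' || ROWNUM, MOD(ROWNUM, 100) FROM dual CONNECT BY LEVEL <= 1000;
INSERT INTO T2 SELECT MOD(ROWNUM, 100), 'part_' || ROWNUM FROM dual CONNECT BY LEVEL <= 100000;
COMMIT;

EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'T1')
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'T2')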

Now we run a SELECT statement with different hints and examine the execution plans and look at the different types of partition pruning Oracle uses.

BLOOM PRUNING


EXPLAIN PLAN FOR
SELECT /*+ USE_HASH(T2) */ * FROM T1,T2 where 
T1.event_name like 'str%'
and T1.partition_id=T2.partition_id

Plan hash value: 2384829105

--------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |         |  5000 |   976K| 11158   (1)| 00:03:33 |       |       |
|*  1 |  HASH JOIN                   |         |  5000 |   976K| 11158   (1)| 00:03:33 |       |       |
|   2 |   JOIN FILTER CREATE         | :BF0001 |     1 |   100 |     4   (0)| 00:00:01 |       |       |
|   3 |    PART JOIN FILTER CREATE   | :BF0000 |     1 |   100 |     4   (0)| 00:00:01 |       |       |
|*  4 |     TABLE ACCESS STORAGE FULL| T1      |     1 |   100 |     4   (0)| 00:00:01 |       |       |
|   5 |   JOIN FILTER USE            | :BF0001 |   100K|  9765K| 11154   (1)| 00:03:32 |       |       |
|   6 |    PARTITION HASH JOIN-FILTER|         |   100K|  9765K| 11154   (1)| 00:03:32 |:BF0000|:BF0000|
|*  7 |     TABLE ACCESS STORAGE FULL| T2      |   100K|  9765K| 11154   (1)| 00:03:32 |:BF0000|:BF0000|
--------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("T1"."PARTITION_ID"="T2"."PARTITION_ID")
   4 - storage("T1"."EVENT_NAME" LIKE 'str%')
       filter("T1"."EVENT_NAME" LIKE 'str%')
   7 - storage(SYS_OP_BLOOM_FILTER(:BF0001,"T2"."PARTITION_ID"))
       filter(SYS_OP_BLOOM_FILTER(:BF0001,"T2"."PARTITION_ID"))

Here a bloom filter is created at operation 3, and at operation 7 Pstart and Pstop show :BF0000.
Based on this bloom filter, partition pruning occurs on T2, so only the partitions containing the relevant data are scanned. This is called bloom pruning.
It is controlled by the hidden parameter _bloom_pruning_enabled, which defaults to TRUE.
At operation 2 we see JOIN FILTER CREATE. This is a join filter, which also relies on bloom filters but does not eliminate partitions; it is a different kind of optimization and is not to be confused with bloom pruning.
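
To check the current values of such hidden parameters, a commonly used sketch (must be run as SYS; X$ structures are undocumented and version dependent):

select a.ksppinm parameter, b.ksppstvl value
from x$ksppi a, x$ksppsv b
where a.indx = b.indx
and a.ksppinm in ('_bloom_pruning_enabled', '_subquery_pruning_enabled');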


KEY-KEY PRUNING


EXPLAIN PLAN FOR
SELECT /*+ USE_NL(T1) */ * FROM T1,T2 where 
T1.event_name like 'str%'
and T1.partition_id=T2.partition_id

Plan hash value: 2185522821

----------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |      |  5000 |   976K|  2792   (1)| 00:00:54 |       |       |
|   1 |  NESTED LOOPS               |      |  5000 |   976K|  2792   (1)| 00:00:54 |       |       |
|*  2 |   TABLE ACCESS STORAGE FULL | T1   |     1 |   100 |     4   (0)| 00:00:01 |       |       |
|   3 |   PARTITION HASH ITERATOR   |      | 10000 |   976K|  2788   (1)| 00:00:53 |   KEY |   KEY |
|*  4 |    TABLE ACCESS STORAGE FULL| T2   | 10000 |   976K|  2788   (1)| 00:00:53 |   KEY |   KEY |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - storage("T1"."EVENT_NAME" LIKE 'str%')
       filter("T1"."EVENT_NAME" LIKE 'str%')
   4 - storage("T1"."PARTITION_ID"="T2"."PARTITION_ID")
       filter("T1"."PARTITION_ID"="T2"."PARTITION_ID")

With nested loop joins the execution plan changes completely: there are no bloom filters here.
All rows of T1 are scanned, and for each partition_id Oracle can prune down to a specific partition of T2. Pstart and Pstop show KEY-KEY. If T1 returns 10 records with the same partition_id, the same partition of T2 is scanned 10 times, so this kind of pruning is bad if T1 returns a lot of rows.

SUBQUERY PRUNING


ALTER SESSION SET "_bloom_pruning_enabled"=FALSE
ALTER SESSION SET "_bloom_filter_enabled"=FALSE

EXPLAIN PLAN FOR
SELECT /*+ USE_HASH(T2) SUBQUERY_PRUNING(T2 PARTITION) */ * FROM T1,T2 where 
T1.event_name like 'str%'
and T1.partition_id=T2.partition_id

Plan hash value: 2405306729

----------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |      |  5000 |   976K| 11158   (1)| 00:03:33 |       |       |
|*  1 |  HASH JOIN                  |      |  5000 |   976K| 11158   (1)| 00:03:33 |       |       |
|*  2 |   TABLE ACCESS STORAGE FULL | T1   |     1 |   100 |     4   (0)| 00:00:01 |       |       |
|   3 |   PARTITION HASH SUBQUERY   |      |   100K|  9765K| 11154   (1)| 00:03:32 |KEY(SQ)|KEY(SQ)|
|   4 |    TABLE ACCESS STORAGE FULL| T2   |   100K|  9765K| 11154   (1)| 00:03:32 |KEY(SQ)|KEY(SQ)|
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("T1"."PARTITION_ID"="T2"."PARTITION_ID")
   2 - storage("T1"."EVENT_NAME" LIKE 'str%')
       filter("T1"."EVENT_NAME" LIKE 'str%')

We have disabled bloom pruning here. With hash joins, Oracle now goes for subquery pruning: at operation 3 you see PARTITION HASH SUBQUERY. T1 is fully scanned, then Oracle runs a recursive query against TBL$OR$IDX$PART$NUM to discover which partitions it needs from T2, and T2 is pruned based on the result of that recursive query. Pstart and Pstop show KEY(SQ).
This is controlled by the parameter _subquery_pruning_enabled.
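
If you want to see the recursive pruning query for yourself, a 10046 trace of the execution captures it (a sketch; level 4 also records bind values):

ALTER SESSION SET EVENTS '10046 trace name context forever, level 4';

SELECT /*+ USE_HASH(T2) SUBQUERY_PRUNING(T2 PARTITION) */ COUNT(*) FROM T1,T2
where T1.event_name like 'str%'
and T1.partition_id=T2.partition_id;

ALTER SESSION SET EVENTS '10046 trace name context off';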

INSERT AS SELECT and BLOOM PRUNING


EXPLAIN PLAN FOR
INSERT /*+ APPEND */  INTO T3
SELECT /*+ USE_HASH(T2) */ * FROM T1,T2 where 
T1.event_name like 'str%'
and T1.partition_id=T2.partition_id

Plan hash value: 1934473477
-----------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT             |      |  5000 |   976K| 11158   (1)| 00:03:33 |       |       |
|   1 |  LOAD AS SELECT              | T3   |       |       |            |          |       |       |
|*  2 |   HASH JOIN                  |      |  5000 |   976K| 11158   (1)| 00:03:33 |       |       |
|*  3 |    TABLE ACCESS STORAGE FULL | T1   |     1 |   100 |     4   (0)| 00:00:01 |       |       |
|   4 |    PARTITION HASH ALL        |      |   100K|  9765K| 11154   (1)| 00:03:32 |     1 |     4 |
|   5 |     TABLE ACCESS STORAGE FULL| T2   |   100K|  9765K| 11154   (1)| 00:03:32 |     1 |     4 |
-----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."PARTITION_ID"="T2"."PARTITION_ID")
   3 - storage("T1"."EVENT_NAME" LIKE 'str%')
       filter("T1"."EVENT_NAME" LIKE 'str%')

In 11.2.0.4 Oracle does not consider bloom pruning for Insert As Select statements.
Looking at the plan, Pstart and Pstop are 1 and 4: all partitions of T2 are scanned. This can cause major performance issues when T2 has tens of thousands of partitions.
The restriction has been lifted in 18.1. One workaround is to use CTAS instead of IAS, since CTAS allows bloom pruning.

You can also request a backport of fix 16609749 on 11.2.0.4, which enables bloom pruning for direct path Insert As Select. Note that this fix is not enabled by default; you need to enable it explicitly:

ALTER SESSION SET "_fix_control" = '16609749:ON';

EXPLAIN PLAN FOR
INSERT /*+ APPEND */  INTO T3
SELECT /*+ USE_HASH(T2) */ * FROM T1,T2 where 
T1.event_name like 'str%'
and T1.partition_id=T2.partition_id


Plan hash value: 3019913043
---------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name    | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT              |         |  5000 |   976K| 12379   (1)| 00:03:31 |       |       |
|   1 |  LOAD AS SELECT               | T3      |       |       |            |          |       |       |
|*  2 |   HASH JOIN                   |         |  5000 |   976K| 12379   (1)| 00:03:31 |       |       |
|   3 |    JOIN FILTER CREATE         | :BF0001 |     1 |   100 |     4   (0)| 00:00:01 |       |       |
|   4 |     PART JOIN FILTER CREATE   | :BF0000 |     1 |   100 |     4   (0)| 00:00:01 |       |       |
|*  5 |      TABLE ACCESS STORAGE FULL| T1      |     1 |   100 |     4   (0)| 00:00:01 |       |       |
|   6 |    JOIN FILTER USE            | :BF0001 |   100K|  9765K| 12375   (1)| 00:03:31 |       |       |
|   7 |     PARTITION HASH JOIN-FILTER|         |   100K|  9765K| 12375   (1)| 00:03:31 |:BF0000|:BF0000|
|*  8 |      TABLE ACCESS STORAGE FULL| T2      |   100K|  9765K| 12375   (1)| 00:03:31 |:BF0000|:BF0000|
---------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."PARTITION_ID"="T2"."PARTITION_ID")
   5 - storage("T1"."EVENT_NAME" LIKE 'str%')
       filter("T1"."EVENT_NAME" LIKE 'str%')
   8 - storage(SYS_OP_BLOOM_FILTER(:BF0001,"T2"."PARTITION_ID"))
       filter(SYS_OP_BLOOM_FILTER(:BF0001,"T2"."PARTITION_ID"))

And BLOOM ....!

Friday, 24 April 2020

System Statistics on Exadata and its impact on Optimizer cost

In this post I will discuss Exadata mode system statistics and how they impact the cost of a full table scan.
Let's get started with the test case. I'm running it on an Exadata machine with database version 18.1.

SQL> create table T (ID number) segment creation immediate ;

I don't want to insert a lot of rows, so I just fudge the statistics.

SQL> exec dbms_stats.set_table_stats(user,'T',numrows=>50000,numblks=>50000);

The optimizer now believes that this table has 50000 rows and 50000 blocks.
I currently have NOWORKLOAD statistics:

FLAGS       0
CPUSPEEDNW  3326
IOSEEKTIM   10
IOTFRSPEED  4096
SREADTIM
MREADTIM
CPUSPEED
MBRC
FLAGS set to 0 indicates noworkload statistics. Note that SREADTIM and MREADTIM are not set; they are derived from the IO seek time and IO transfer speed. MBRC is also not set; it is derived from the parameter db_file_multiblock_read_count, which has been explicitly set to 64 on my database (why I use the word "explicitly" is a topic for another post).
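
The values above live in SYS.AUX_STATS$ and can be queried directly (as a suitably privileged user):

SQL> select sname, pname, pval1 from sys.aux_stats$;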
Now let's see the plan

SQL> explain plan for 
                                select * from T;

Plan hash value: 1601196873
----------------------------------------------------------------------------------
| Id  | Operation                 | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |      | 50000 |   634K| 14857   (1)| 00:00:02 |
|   1 |  TABLE ACCESS STORAGE FULL| T    | 50000 |   634K| 14857   (1)| 00:00:02 |
----------------------------------------------------------------------------------

Cost of this full table scan is 14857. 
We will see how Oracle computes the cost

Cost = IO Cost + CPU Cost

IO Cost of FTS = (Blocks / MBRC) * (MREADTIM / SREADTIM)

Blocks   = 50000
MBRC     = 64
SREADTIM = IOSEEKTIM + DB_BLOCK_SIZE / IOTFRSPEED
         = 10 + 16384/4096
         = 14
MREADTIM = IOSEEKTIM + (MBRC * DB_BLOCK_SIZE) / IOTFRSPEED
         = 10 + (64 * 16384)/4096
         = 266

Cost of FTS = (50000/64) * (266/14)
            = 14843

14843 is the IO cost, very close to the 14857 we see in the execution plan; the difference is the CPU cost, which is minimal here.

Now let's gather the statistics in Exadata Mode and compute the numbers again

SQL> exec dbms_stats.gather_system_stats('EXADATA')

These are my new values now

FLAGS       1
CPUSPEEDNW  3326
IOSEEKTIM   13
IOTFRSPEED  204800
SREADTIM
MREADTIM
CPUSPEED
MBRC        64

FLAGS set to 1 indicates workload statistics. Now let's compute the cost of the FTS again:

Blocks = 50000
MBRC = 64
SREADTIM = IOSEEKTIME + DB_BLOCK_SIZE/IOTFRSPEED
                     = 13 + 16384/204800
                     = 13.08
MREADTIM =IOSEEKTIME + (MBRC*DB_BLOCK_SIZE) / IOTFRSPEED
                      = 13 + (64*16384)/204800
                      = 18.12
Cost of FTS = (50000/64) * ( 18.12/13.08)
                    = 1082

Let's look at the plan to see if our calculations are correct

SQL> explain plan for 
                                select * from T;

Plan hash value: 1601196873
----------------------------------------------------------------------------------
| Id  | Operation                 | Name | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |      | 50000 |   634K|  1095   (2)| 00:00:01 |
|   1 |  TABLE ACCESS STORAGE FULL| T    | 50000 |   634K|  1095   (2)| 00:00:01 |
----------------------------------------------------------------------------------

Yes, our calculations look correct. The plan shows 1095 and we calculated 1082; the difference is the CPU cost.

The FTS cost dropped from 14843 to just 1082. Exadata is powerful hardware, and that has to be accounted for in the cost: full table scans are much faster on Exadata, so the cost of an FTS should be much lower. This is why you should gather system statistics in Exadata mode.

Thursday, 16 April 2020

Oracle is running truncates against your tables, but don't worry


Oracle never fails to surprise me, and this time it was running truncate statements against my tables.

Oracle introduced dbms_space_admin.drop_empty_segments in 11.2. It drops segments from tables that contain no rows, returning them to a segmentless (deferred segment) state and thereby reclaiming unused space. But why would it issue a truncate against tables that have data?
Let's get started with a test case

SQL> create table T1 (A number) ;

Since DEFERRED_SEGMENT_CREATION is set to TRUE, the segment for T1 is never created.

SQL> create table T2 (A number);
SQL> insert into T2 values (1);
SQL> commit;

The insert statement forces Oracle to create a segment for T2.

SQL> create table T3 (A number) ;
SQL> insert into T3 values (1);
SQL> commit;
SQL> delete from T3;
SQL> commit;


SQL> select table_name, segment_created from user_tables;

TABLE_NAME  SEGMENT_CREATED
T1          NO
T2          YES
T3          YES

T3 is the table that is fragmented: I inserted a row and then deleted it, so I have some free space below the HWM and no rows left in the segment. The drop_empty_segments procedure should take care of this.
Now let's run the procedure:

SQL> execute dbms_space_admin.drop_empty_segments('SCOTT') ;

A 10046 trace of the above procedure shows that Oracle recursively ran the following SQLs:

LOCK TABLE T2 IN EXCLUSIVE MODE NOWAIT
TRUNCATE TABLE T2
LOCK TABLE T3 IN EXCLUSIVE MODE NOWAIT
TRUNCATE TABLE T3

It locked tables T2 and T3 in exclusive mode and then ran truncates against them.
But why was a truncate issued against T2? That table has data in it and is not even eligible for a segment drop.


SQL> select table_name, segment_created from user_tables;

TABLE_NAME  SEGMENT_CREATED
T1          NO
T2          YES
T3          NO

As you can see, we have reclaimed the free space in T3 and its SEGMENT_CREATED flag has been reset to NO.
But surprisingly, a truncate was issued against T2 as well. I ran a select on T2; the data wasn't actually truncated.
As per the bug notes, this is a special no-op form of truncate: it first checks whether the segment has any rows and only then decides whether to truncate it.
But it still takes an exclusive lock on T2, which can become a major performance bottleneck. If T2 is a huge table with thousands of partitions, Oracle might take a long time to probe each partition to check whether a row exists, and while this probe is on, any other session trying to insert a row into the table will wait on enq: TM - contention.
Therefore, I'd never suggest running this procedure. There are better ways to reclaim space from a table, for example:
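
(A sketch of the usual alternatives; both are standard DDL.)

SQL> alter table T3 enable row movement;
SQL> alter table T3 shrink space;

Or rebuild the segment (note that a plain MOVE leaves dependent indexes unusable until they are rebuilt):

SQL> alter table T3 move;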

Sunday, 29 March 2020

Fine Grained Cursor Invalidation on 18c

In the previous post we discussed parallel query cursor invalidation and ORA-12842. In 11.2.0.4 we had a hard time dealing with the cursor invalidation error and serial downgrades; we had to choose between two evils.

In Oracle Database 12.2, we have a new feature called fine-grained cursor invalidation, controlled by the new parameter cursor_invalidation. Prior to 12.2, cursors were invalidated immediately after a DDL was issued; the DDL could be any of the partition maintenance operations such as move partition, truncate partition or partition exchange. With 12.2, Oracle does not invalidate them immediately. Instead, a set of rules based on each cursor's dependencies decides whether the cursor is invalidated immediately or in a deferred fashion.
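
A sketch of the two ways to request deferred invalidation (12.2 and later; the per-statement clause applies to DDL that supports it, shown here against the TMP_TN2 table from the test case below):

SQL> alter session set cursor_invalidation = DEFERRED;

SQL> alter table TMP_TN2 truncate partition P_2 deferred invalidation;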

Here is a small test case to demonstrate the problem on 11.2

/*PREPARE the test table*/

--DROP TABLE TMP_TN2;

CREATE TABLE TMP_TN2
(
    ID    NUMBER
)
PARTITION BY LIST (ID)(PARTITION P_1 VALUES (1), PARTITION P_2 VALUES (2));

BEGIN
  INSERT INTO tmp_tn2
  SELECT /*+parallel(8)*/
                 1
          FROM dba_objects t1,
               dba_objects t2,
               dba_objects t3
          WHERE ROWNUM < 100000000;
  COMMIT;
END;
/
/* START QUERY IN SESSION 1 */

DECLARE
  lv_c NUMBER;
BEGIN
  dbms_output.put_line('START');

  SELECT SUM(c)
  INTO lv_c
  FROM
  (
    SELECT /*+ PARALLEL(4)*/
           COUNT(1) c
    FROM tmp_tn2 partition(P_1) t1,
         tmp_tn2 partition(P_1) t2,
         tmp_tn2 partition(P_1) t3
    UNION ALL
    SELECT /*+ PARALLEL(4)*/
           COUNT(1) c
    FROM tmp_tn2 partition(P_1) t1,
         tmp_tn2 partition(P_1) t2,
         tmp_tn2 partition(P_1) t3
    WHERE ROWNUM < 1000000
  );
    
END;
/

/* INVALIDATE CURSOR IN SESSION 2. Let this run until session 1 fails with ORA-12842 */


BEGIN
  WHILE(TRUE)
  LOOP
      EXECUTE IMMEDIATE 'ALTER TABLE TMP_TN2 TRUNCATE PARTITION P_2';
      dbms_lock.sleep(5);
   END LOOP;
END;
/

In 11.2, the first session fails with ORA-12842, but on my 18c database the first session never fails.
The optimizer is now much more intelligent: it realizes that the current cursor accesses only partition P_1, and that the truncate of P_2 from the second session does not affect the existing cursor in any way. Therefore the cursor is not invalidated, avoiding both ORA-12842 and a serial downgrade.

Note that ADD and DROP PARTITION would still invalidate the cursors, because they change the partition numbers. Instead of a drop, it is better to exchange or truncate the partition.
