Sunday 2 January 2022

Bloom Min-Max optimization on Exadata

In this post we will discuss how bloom filters work and how queries can benefit from them, specifically in an Exadata environment. We will also look at the bloom filter min-max optimization that Oracle uses to take advantage of storage indexes. Let's get started with a test case.

Database version: 18.8 on an Exadata X7-2 machine

SQL> CREATE TABLE T1 ( event_name VARCHAR2(254) , partition_id NUMBER);

SQL> CREATE TABLE T2
(
    partition_id      NUMBER,
    partition_name    CHAR(254)
) STORAGE ( CELL_FLASH_CACHE NONE);

SQL> insert into T1 values ('str',1) ;

SQL> insert into T1 values ('str',2) ;

SQL> BEGIN
     FOR i in 1..1000000
     loop
     insert into T2 values(i,'ABC');
     end loop;
     END;
     /

SQL> commit;

-- Size of T2 is 296 MB. T1 is only a few KB with just two rows.
-- Run a few dummy selects on T2 just to ensure that storage indexes have been created
-- Force direct path reads to avoid buffered reads

SQL> alter session set "_serial_direct_read"=ALWAYS ;
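
For example, a couple of throwaway full scans with a predicate on partition_id are enough to get the regions indexed, since the cells build storage indexes as a side effect of smart scans with predicates. (These exact statements are an illustration, not the ones used for this post.)

SQL> SELECT /*+ FULL(T2) */ COUNT(*) FROM T2 WHERE partition_id < 0 ;
SQL> SELECT /*+ FULL(T2) */ COUNT(*) FROM T2 WHERE partition_id < 0 ;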

SQL> SELECT /*+  USE_HASH(T2) */ * FROM T1,T2 where 
     T1.event_name='str'
     and T1.partition_id=T2.partition_id ;

-- Returns two rows

Plan hash value: 1144549688
 
---------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |     2 |   822 |   421   (3)| 00:00:01 |
|*  1 |  HASH JOIN                  |         |     2 |   822 |   421   (3)| 00:00:01 |
|   2 |   JOIN FILTER CREATE        | :BF0000 |     2 |   284 |     2   (0)| 00:00:01 |
|*  3 |    TABLE ACCESS STORAGE FULL| T1      |     2 |   284 |     2   (0)| 00:00:01 |
|   4 |   JOIN FILTER USE           | :BF0000 |   887K|   227M|   417   (2)| 00:00:01 |
|*  5 |    TABLE ACCESS STORAGE FULL| T2      |   887K|   227M|   417   (2)| 00:00:01 |
---------------------------------------------------------------------------------------
 
Outline Data
-------------
 
  /*+
      BEGIN_OUTLINE_DATA
      PX_JOIN_FILTER(@"SEL$1" "T2"@"SEL$1")
      USE_HASH(@"SEL$1" "T2"@"SEL$1")
      LEADING(@"SEL$1" "T1"@"SEL$1" "T2"@"SEL$1")
      FULL(@"SEL$1" "T2"@"SEL$1")
      FULL(@"SEL$1" "T1"@"SEL$1")
      OUTLINE_LEAF(@"SEL$1")
      ALL_ROWS
      DB_VERSION('18.1.0')
      OPTIMIZER_FEATURES_ENABLE('18.1.0')
      IGNORE_OPTIM_EMBEDDED_HINTS
      END_OUTLINE_DATA
  */
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - access("T1"."PARTITION_ID"="T2"."PARTITION_ID")
   3 - storage("T1"."EVENT_NAME"='str')
       filter("T1"."EVENT_NAME"='str')

The execution plan above works as follows:

1. Oracle first scans table T1.
2. During this scan it also builds a bloom filter bit array on the join column (partition_id).
3. This bloom filter, along with the join column's min-max values, is sent to the storage layer.
4. Now T2 is scanned. At this point Oracle uses the min-max values to probe the storage indexes.
5. If the storage indexes cannot be used, Oracle reads all records from T2 and passes them through the bloom filter.
6. Only the 'probable' matching records are sent back to the compute nodes.

In the plan outline, you can see a hint called PX_JOIN_FILTER. "PX" usually denotes some sort of parallelism, but PX_JOIN_FILTER is simply the hint that forces the bloom filter; it applies to serial queries as well, as in this test.
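
If the optimizer does not pick the join filter on its own, it can be requested or suppressed explicitly with the standard hint pair. A quick sketch against our test tables:

SQL> SELECT /*+ USE_HASH(T2) PX_JOIN_FILTER(T2) */ * FROM T1,T2 where
     T1.event_name='str'
     and T1.partition_id=T2.partition_id ;

-- NO_PX_JOIN_FILTER(T2) does the opposite and suppresses the join filter.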

At step 2, Oracle creates a bloom filter on T1. For the details of how bloom filters work, please read the excellent Bloom Filter White Paper by Christian Antognini.
In our case, with just two records in T1, Bloom filter bit array would conceptually look like this.


This bloom filter consists of a bit vector of length 50 and uses 3 hash functions. When this SQL is executed, Oracle sends this bit array and the min-max of partition_id (1-2) to the storage servers.
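
As a rough sketch of the idea (ORA_HASH below is only a stand-in; Oracle's bloom filter uses its own internal hash functions and sizing), each join key can be mapped to three positions in a 50-slot bit vector like this:

SQL> -- conceptual illustration only: 3 'hash functions' = 3 different seeds
SQL> SELECT partition_id,
            ORA_HASH(partition_id, 49, 1) AS bit_1,
            ORA_HASH(partition_id, 49, 2) AS bit_2,
            ORA_HASH(partition_id, 49, 3) AS bit_3
       FROM T1
      WHERE event_name = 'str';

The bits at those positions are set to 1. At probe time, a T2 row whose partition_id hashes to any unset bit is guaranteed not to join and can be discarded at the cell.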

Storage indexes are maintained per 1 MB storage region. They store the min and max values of a column for each of these regions. A storage index can cover up to 8 columns, and with ESS 19c it can accommodate up to 24 columns. By looking at these indexes, Oracle can skip the storage regions that cannot possibly satisfy the query.
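
T2 here is about 296 MB, so as a back-of-the-envelope check it spans roughly 300 such regions (ignoring exactly how the cells chunk the segment):

SQL> SELECT ROUND(bytes/1024/1024) AS size_mb,
            CEIL(bytes/1048576)    AS approx_1mb_regions
       FROM user_segments
      WHERE segment_name = 'T2';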

Let's look at the SQL execution stats.
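
How the stats below were captured is not shown in the post; one common approach (used here purely as an assumption) is to diff the session statistics from v$mystat before and after the run:

SQL> SELECT n.name, s.value
       FROM v$mystat s, v$statname n
      WHERE s.statistic# = n.statistic#
        AND n.name IN ('physical read total bytes',
                       'cell physical IO bytes saved by storage index',
                       'cell physical IO interconnect bytes');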




Total physical reads for this query is 304250880 bytes. 
Cell physical IO bytes saved by storage index is 30400510 bytes.
Cell physical IO interconnect bytes is just 34728 bytes.

As you can see, storage indexes were used in this execution, eliminating 304 MB of disk IO. This is impressive. And only about 34 KB of data was sent back from the cell servers to the compute nodes. Elimination was possible because there is minimal overlap between the T1 min-max (1-2) and the T2 min-max (1-1000000).
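
The build-side min-max that gets shipped to the cells corresponds to a simple aggregate over the filtered T1 rows, which can be eyeballed with a quick query (illustrative only):

SQL> SELECT MIN(partition_id), MAX(partition_id)
       FROM T1
      WHERE event_name = 'str';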

Now let's insert just one more record into T1 and run the same query again.

SQL> insert into T1 values ('str',1000000) ;
SQL> commit ;

SQL> SELECT /*+  USE_HASH(T2) */ * FROM T1,T2 where 
     T1.event_name='str'
     and T1.partition_id=T2.partition_id ;

-- Returns three rows
-- Execution Plan remains the same as above

So, now with three rows in T1, the bit vector would look like this.

This time the min-max of T1 is 1-1000000. Even though we have just 3 records, the min-max range is now extremely wide. Let's look at the SQL stats for this execution.



Total physical reads for this query is 304250880 bytes. 
Cell physical IO bytes saved by storage index is 0 bytes.
Cell physical IO interconnect bytes is just 151512 bytes.

Storage indexes were not used during this execution. With just one additional row, 'cell physical IO bytes saved by storage index' went down from 304 MB to 0, so Oracle had to perform 304 MB of disk IO this time. IO elimination was simply not possible here because the min-max of T1 now overlaps the entire min-max range of T2.

151 KB of data was sent back from the cell servers to the compute nodes. Disk IO savings were not possible, but smart scans did take place: all 1 million records were read by the cells and passed through the bloom filter, and only the 'probable' matching rows were sent back.

The first query took advantage of both smart scans and storage indexes, while the second query benefited only from smart scans.
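
One convenient way to compare the two executions side by side (an assumption about how you might monitor this, not necessarily how the stats above were produced) is v$sql's offload columns:

SQL> SELECT sql_id,
            physical_read_bytes,
            io_cell_offload_eligible_bytes,
            io_interconnect_bytes
       FROM v$sql
      WHERE sql_text LIKE 'SELECT /*+  USE_HASH(T2) */%';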

BENEFITS of Bloom Filters on Exadata 

1. Depending on the min-max values, queries can benefit from Storage Indexes, thereby reducing disk reads.

2. Since the bloom filter structure is sent to the storage servers, queries benefit from smart scans, thereby sending fewer rows back to the compute nodes.

3. In a RAC environment, bloom filters on parallel queries reduce interconnect traffic and save CPU time, because fewer rows are exchanged between the slave sets (see the sketch after this list).
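
A minimal sketch of point 3 (illustrative only, not part of the original test): run the join in parallel and, while it is executing, check how many rows the join filter discarded.

SQL> SELECT /*+ PARALLEL(T1 4) PARALLEL(T2 4) USE_HASH(T2) */ COUNT(*)
       FROM T1, T2
      WHERE T1.event_name = 'str'
        AND T1.partition_id = T2.partition_id;

SQL> -- from another session, while the query above is still running:
SQL> SELECT filtered, probed FROM v$sql_join_filter;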

There is a hidden parameter, _bloom_minmax_enabled, that controls the min-max optimization. If this parameter is turned off, Oracle will not send the min-max of the join column to the storage layer, and therefore storage indexes will not be used.
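
To verify this behaviour (on a test system only; hidden parameters should not be changed without Oracle Support's guidance), something like the following could be used. With the parameter off, 'cell physical IO bytes saved by storage index' would be expected to stay at 0 even for the original two-row T1:

SQL> alter session set "_bloom_minmax_enabled" = false ;

SQL> SELECT /*+  USE_HASH(T2) */ * FROM T1,T2 where
     T1.event_name='str'
     and T1.partition_id=T2.partition_id ;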
