Saturday, 19 January 2013

Optimizer is Over Estimating Selectivity with the LIKE clause ?

I came across a problem where Optimizer was over estimating the selectivity for a predicate in the LIKE clause. Here is the set up.

SQL> create table mytest (col2 varchar2(2000));

SQL> begin
          for i in 1..100
          loop
          insert into mytest values ('AB_1' || to_char(i) );
          end loop;
          end;
          /

SQL> commit ;

I created a table mytest. Inserted 100 records in there. This is the pattern of data that I have.

AB_11
AB_12
AB_13
...
AB_1100

Collect statistics with 100% sample and without histograms.

SQL> exec dbms_stats.gather_table_stats(user,'mytest',estimate_percent=>100,method_opt=>'for all                 columns size 1');

SQL> select num_rows from user_tables where table_name='MYTEST';

  NUM_ROWS

----------
       100

SQL> select num_distinct from user_tab_col_statistics where table_name='MYTEST'
    and column_name='COL2';

NUM_DISTINCT
------------
         100

So, these statistics are correct. We have 100 distinct values and 100 number of rows in the table.
Now, consider this query.

SQL> select count(*) from mytest where col2 like 'AB_188%' ;


COUNT(*)
----------
         1

I know this was going to return count as 1. Now let us look at the execution plan.

Execution Plan

IdOperationNameRowsBytesCost (%CPU)Time
0SELECT STATEMENT2 (100)
1   SORT AGGREGATE17
2     TABLE ACCESS FULLMYTEST996932 (0)00:00:0


Estimated Number of Rows - 99. Why did optimizer just over estimated this cardinality ?
I know this was going to return just 1 row. Statistics are 100% accurate. What went wrong here ?

I thought this was something to do with the histograms. I created a frequency histogram with 100 buckets.

SQL> exec dbms_stats.gather_table_stats(user,'mytest',estimate_percent=>100,method_opt=>'for all                 columns size 100');

SQL> select count(*) from mytest where col2 like 'AB_188%' ;



COUNT(*)
----------
         1

Execution Plan

IdOperationNameRowsBytesCost (%CPU)Time
0SELECT STATEMENT2 (100)
1   SORT AGGREGATE17
2     TABLE ACCESS FULLMYTEST1006932 (0)00:00:0


With histograms in place, estimated number of rows is now 100.

So, why is optimizer overestimating the selectivity ? Why is it that optimizer believed that our search string will return 100 rows ignoring the fact that it actually returns 1 row. With this kind of data distribution and if I had million of records, optimizer will always choose a full table scan and not an index range scan.

Here is what is happening . There is a "_" character in our search string . Optimizer interprets it as "match any character". Therefore, selectivity is computed after ignoring everything that comes after this "_" character.
So, selectivity of AB_188% is considered as selectivity of AB%. And this leads to over-estimation.

Now let us run this query. I instruct optimizer to consider "_" as a regular character.

SQL> select count(*) from mytest where col2 like 'AB\_188%' escape '\';


COUNT(*)
----------
         1

Execution Plan

IdOperationNameRowsBytesCost (%CPU)Time
0SELECT STATEMENT2 (100)
1   SORT AGGREGATE17
2     TABLE ACCESS FULLMYTEST16932 (0)00:00:0


Wow ! Now optimizer guesses it correct. Estimated cardinality of 1.

So, you have to be careful when you use a "LIKE" clause.

No comments:

Post a Comment

19c Multitenant RAC- Default service disables parallelism

We recently upgraded one of our RAC databases from 11.2 to 19.10 on Exadata. And to everyone's surprise, almost everything on the system...