I came across a problem where the optimizer was overestimating the selectivity for a predicate in a LIKE clause. Here is the setup.
SQL> create table mytest (col2 varchar2(2000));
SQL> begin
for i in 1..100
loop
insert into mytest values ('AB_1' || to_char(i) );
end loop;
end;
/
SQL> commit ;
I created a table mytest and inserted 100 records into it. This is the pattern of the data:
AB_11
AB_12
AB_13
...
AB_1100
Collect statistics with a 100% sample and without histograms.
SQL> exec dbms_stats.gather_table_stats(user,'mytest',estimate_percent=>100,method_opt=>'for all columns size 1');
SQL> select num_rows from user_tables where table_name='MYTEST';
NUM_ROWS
----------
100
SQL> select num_distinct from user_tab_col_statistics where table_name='MYTEST'
and column_name='COL2';
NUM_DISTINCT
------------
100
So these statistics are accurate: we have 100 distinct values and 100 rows in the table.
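As a side check (my own addition, not part of the original test), with no histogram the column density should simply be 1/num_distinct = 0.01:

SQL> select density from user_tab_col_statistics
where table_name='MYTEST' and column_name='COL2';

That means an equality predicate such as col2 = 'AB_188' would be estimated at num_rows * density = 100 * 0.01 = 1 row. The problem described below is specific to the LIKE pattern.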
Now, consider this query.
SQL> select count(*) from mytest where col2 like 'AB_188%' ;
COUNT(*)
----------
1
I knew this was going to return a count of 1. Now let us look at the execution plan.
Execution Plan
| Id | Operation         | Name   | Rows | Bytes | Cost (%CPU) | Time     |
|----|-------------------|--------|------|-------|-------------|----------|
| 0  | SELECT STATEMENT  |        |      |       | 2 (100)     |          |
| 1  | SORT AGGREGATE    |        | 1    | 7     |             |          |
| 2  | TABLE ACCESS FULL | MYTEST | 99   | 693   | 2 (0)       | 00:00:01 |
Estimated number of rows: 99. Why did the optimizer overestimate this cardinality? I knew the query would return just 1 row, and the statistics are 100% accurate. What went wrong here?
I thought this had something to do with histograms, so I created a frequency histogram with 100 buckets.
SQL> exec dbms_stats.gather_table_stats(user,'mytest',estimate_percent=>100,method_opt=>'for all columns size 100');
SQL> select count(*) from mytest where col2 like 'AB_188%' ;
COUNT(*)
----------
1
Execution Plan
| Id | Operation         | Name   | Rows | Bytes | Cost (%CPU) | Time     |
|----|-------------------|--------|------|-------|-------------|----------|
| 0  | SELECT STATEMENT  |        |      |       | 2 (100)     |          |
| 1  | SORT AGGREGATE    |        | 1    | 7     |             |          |
| 2  | TABLE ACCESS FULL | MYTEST | 100  | 693   | 2 (0)       | 00:00:01 |
With the histogram in place, the estimated number of rows is now 100.
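As a quick verification (my own addition; the columns come from the standard USER_TAB_COL_STATISTICS view), we can confirm the histogram that was just gathered:

SQL> select histogram, num_buckets from user_tab_col_statistics
where table_name='MYTEST' and column_name='COL2';

With 100 distinct values and 100 requested buckets, this should report a FREQUENCY histogram with 100 buckets.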
So why is the optimizer overestimating the selectivity? Why did the optimizer believe that our search string would return 100 rows, when it actually returns just 1? With this kind of data distribution, if I had millions of records the optimizer would always choose a full table scan instead of an index range scan (see the index sketch at the end of this post).
Here is what is happening. There is a "_" character in our search string, and the optimizer interprets it as the single-character wildcard ("match any one character"). Selectivity is therefore computed by ignoring everything that comes after this "_" character.
So the selectivity of AB_188% is treated as the selectivity of AB%, which matches every row in this table, and that leads to the over-estimation.
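A quick way to see this (a check I am adding here, not part of the original write-up) is to count the rows matching the pattern the optimizer is effectively costing:

SQL> select count(*) from mytest where col2 like 'AB%' ;

Every one of the 100 rows starts with AB, so an estimate based on AB% naturally lands at or near the full row count, which is exactly the 99 and 100 we saw in the plans above.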
Now let us run the query again, this time instructing the optimizer to treat "_" as a regular character by escaping it.
SQL> select count(*) from mytest where col2 like 'AB\_188%' escape '\';
COUNT(*)
----------
1
Execution Plan
| Id | Operation         | Name   | Rows | Bytes | Cost (%CPU) | Time     |
|----|-------------------|--------|------|-------|-------------|----------|
| 0  | SELECT STATEMENT  |        |      |       | 2 (100)     |          |
| 1  | SORT AGGREGATE    |        | 1    | 7     |             |          |
| 2  | TABLE ACCESS FULL | MYTEST | 1    | 693   | 2 (0)       | 00:00:01 |
Wow! Now the optimizer gets it right, with an estimated cardinality of 1.
So, you have to be careful when your LIKE pattern contains a "_" character.
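To illustrate the plan-choice consequence mentioned earlier, here is a hypothetical follow-up (not part of the original test; the index name is my own). Create an index on col2 and compare the two forms of the query:

SQL> create index mytest_col2_idx on mytest(col2);

SQL> select count(*) from mytest where col2 like 'AB_188%' ;

SQL> select count(*) from mytest where col2 like 'AB\_188%' escape '\' ;

The first, unescaped query is still estimated at close to all rows, so on a large table the optimizer would favour a full table scan; the escaped query is estimated at 1 row, which is what makes an index range scan attractive. On this 100-row table the cost difference is negligible, but with millions of rows the unescaped version can easily end up with a full table scan even though only a handful of rows actually match.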