Saturday 19 January 2013

Is the Optimizer Over-Estimating Selectivity with the LIKE Clause?

I came across a problem where the optimizer was over-estimating the selectivity of a LIKE predicate. Here is the setup.

SQL> create table mytest (col2 varchar2(2000));

SQL> begin
       for i in 1..100 loop
         insert into mytest values ('AB_1' || to_char(i));
       end loop;
     end;
     /

SQL> commit ;

I created a table mytest and inserted 100 rows into it. This is the pattern of the data:

AB_11
AB_12
AB_13
...
AB_1100

Collect statistics with a 100% sample and without histograms ('size 1' means one bucket, i.e. no histogram).

SQL> exec dbms_stats.gather_table_stats(user,'mytest',estimate_percent=>100,method_opt=>'for all columns size 1');

SQL> select num_rows from user_tables where table_name='MYTEST';

  NUM_ROWS
----------
       100

SQL> select num_distinct from user_tab_col_statistics where table_name='MYTEST'
    and column_name='COL2';

NUM_DISTINCT
------------
         100

So these statistics are correct: we have 100 distinct values and 100 rows in the table.
Now, consider this query.

SQL> select count(*) from mytest where col2 like 'AB_188%' ;


  COUNT(*)
----------
         1

I knew this was going to return a count of 1. Now let us look at the execution plan.

Execution Plan

-----------------------------------------------------------------------------
| Id  | Operation          | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |       |       |     2 (100)|          |
|   1 |  SORT AGGREGATE    |        |     1 |     7 |            |          |
|   2 |   TABLE ACCESS FULL| MYTEST |    99 |   693 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------


The estimated number of rows is 99. Why did the optimizer over-estimate this cardinality so badly?
I knew the query was going to return just 1 row, and the statistics are 100% accurate. What went wrong here?

I thought this had something to do with histograms, so I created a frequency histogram with 100 buckets.

SQL> exec dbms_stats.gather_table_stats(user,'mytest',estimate_percent=>100,method_opt=>'for all columns size 100');
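
To confirm that a frequency histogram was actually created (a sanity check I am adding here; it was not part of the original test), you can query the dictionary. With 100 buckets covering 100 distinct values, HISTOGRAM should report FREQUENCY:

SQL> select histogram from user_tab_col_statistics
     where table_name='MYTEST' and column_name='COL2';   -- expect FREQUENCY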

SQL> select count(*) from mytest where col2 like 'AB_188%' ;



  COUNT(*)
----------
         1

Execution Plan

-----------------------------------------------------------------------------
| Id  | Operation          | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |       |       |     2 (100)|          |
|   1 |  SORT AGGREGATE    |        |     1 |     7 |            |          |
|   2 |   TABLE ACCESS FULL| MYTEST |   100 |   693 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------


With the histogram in place, the estimated number of rows is now 100. The histogram did not help.

So why is the optimizer over-estimating the selectivity? Why did the optimizer believe that our search string would return 100 rows, when it actually returns just 1? With this kind of data distribution, had the table held millions of records, the optimizer would always have chosen a full table scan over an index range scan.

Here is what is happening. Our search string contains a "_" character, and in a LIKE pattern the optimizer interprets "_" as a wildcard that matches any single character. The selectivity is therefore computed by ignoring everything that follows the "_": the selectivity of 'AB_188%' is treated as the selectivity of 'AB%', and since virtually every row matches 'AB%', this leads to the over-estimate.
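
If you want to verify this explanation yourself (a check of mine, not from the original post), explain the shorter pattern 'AB%' and compare its row estimate with the plan above; if the theory holds, both should show roughly the same cardinality:

SQL> explain plan for select count(*) from mytest where col2 like 'AB%';

SQL> select * from table(dbms_xplan.display);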

Now let us run the query again, this time instructing the optimizer to treat "_" as a regular character by escaping it.

SQL> select count(*) from mytest where col2 like 'AB\_188%' escape '\';


  COUNT(*)
----------
         1

Execution Plan

-----------------------------------------------------------------------------
| Id  | Operation          | Name   | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |        |       |       |     2 (100)|          |
|   1 |  SORT AGGREGATE    |        |     1 |     7 |            |          |
|   2 |   TABLE ACCESS FULL| MYTEST |     1 |   693 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------


Wow! Now the optimizer gets it right: an estimated cardinality of 1.
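
To see the practical impact on access paths (my own extension of the example; the index mytest_idx is hypothetical), create an index on col2 and compare the two forms of the query. With the unescaped pattern, the 99-row estimate makes a full table scan look cheap; with the escape clause, the 1-row estimate makes an index range scan attractive. At 100 rows the plans may not actually differ, but at millions of rows they will:

SQL> create index mytest_idx on mytest(col2);   -- index name is mine; any name works

SQL> select count(*) from mytest where col2 like 'AB_188%';

SQL> select count(*) from mytest where col2 like 'AB\_188%' escape '\';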

So, be careful when a LIKE pattern contains "_" (or "%") as literal data: escape it, or the optimizer will treat it as a wildcard when estimating selectivity.
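
More generally, a convenient way to catch this class of problem (my addition; the gather_plan_statistics hint and dbms_xplan.display_cursor are standard from 10g onwards) is to compare the optimizer's estimated rows (E-Rows) against the actual rows (A-Rows) for the same cursor:

SQL> select /*+ gather_plan_statistics */ count(*)
     from mytest where col2 like 'AB_188%';

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'ALLSTATS LAST'));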
