I came across a problem where the optimizer was overestimating the selectivity for a predicate in a LIKE clause. Here is the setup.
SQL> create table mytest (col2 varchar2(2000));
SQL> begin
for i in 1..100
loop
insert into mytest values ('AB_1' || to_char(i) );
end loop;
end;
/
SQL> commit ;
I created a table mytest and inserted 100 records into it. This is the pattern of the data:
AB_11
AB_12
AB_13
...
AB_1100
Collect statistics with a 100% sample and without histograms.
SQL> exec dbms_stats.gather_table_stats(user,'mytest',estimate_percent=>100,method_opt=>'for all columns size 1');
SQL> select num_rows from user_tables where table_name='MYTEST';
NUM_ROWS
----------
100
SQL> select num_distinct from user_tab_col_statistics where table_name='MYTEST'
and column_name='COL2';
NUM_DISTINCT
------------
100
So these statistics are accurate: we have 100 distinct values and 100 rows in the table.
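As a side check (my own addition, not part of the original test), with no histogram the column density should simply be 1/num_distinct = 0.01:

SQL> select density from user_tab_col_statistics
where table_name='MYTEST' and column_name='COL2';

That means an equality predicate such as col2 = 'AB_188' would be estimated at num_rows * density = 100 * 0.01 = 1 row. The problem described below is specific to the LIKE pattern.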
Now, consider this query.
SQL> select count(*) from mytest where col2 like 'AB_188%' ;
COUNT(*)
----------
1
I knew this was going to return a count of 1. Now let us look at the execution plan.
Execution Plan
| Id | Operation         | Name   | Rows | Bytes | Cost (%CPU) | Time     |
|----|-------------------|--------|------|-------|-------------|----------|
| 0  | SELECT STATEMENT  |        |      |       | 2 (100)     |          |
| 1  | SORT AGGREGATE    |        | 1    | 7     |             |          |
| 2  | TABLE ACCESS FULL | MYTEST | 99   | 693   | 2 (0)       | 00:00:01 |
Estimated number of rows: 99. Why did the optimizer overestimate this cardinality? I knew the query would return just 1 row, and the statistics are 100% accurate. What went wrong here?
I thought this had something to do with histograms, so I created a frequency histogram with 100 buckets.
SQL> exec dbms_stats.gather_table_stats(user,'mytest',estimate_percent=>100,method_opt=>'for all columns size 100');
SQL> select count(*) from mytest where col2 like 'AB_188%' ;
COUNT(*)
----------
1
Execution Plan
| Id | Operation         | Name   | Rows | Bytes | Cost (%CPU) | Time     |
|----|-------------------|--------|------|-------|-------------|----------|
| 0  | SELECT STATEMENT  |        |      |       | 2 (100)     |          |
| 1  | SORT AGGREGATE    |        | 1    | 7     |             |          |
| 2  | TABLE ACCESS FULL | MYTEST | 100  | 693   | 2 (0)       | 00:00:01 |
With the histogram in place, the estimated number of rows is now 100.
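As a quick verification (my own addition; the columns come from the standard USER_TAB_COL_STATISTICS view), we can confirm the histogram that was just gathered:

SQL> select histogram, num_buckets from user_tab_col_statistics
where table_name='MYTEST' and column_name='COL2';

With 100 distinct values and 100 requested buckets, this should report a FREQUENCY histogram with 100 buckets.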
So why is the optimizer overestimating the selectivity? Why did the optimizer believe that our search string would return 100 rows, when it actually returns just 1? With this kind of data distribution, if I had millions of records the optimizer would always choose a full table scan instead of an index range scan (see the index sketch at the end of this post).
Here is what is happening. There is a "_" character in our search string, and the optimizer interprets it as the single-character wildcard ("match any one character"). Selectivity is therefore computed by ignoring everything that comes after this "_" character.
So the selectivity of AB_188% is treated as the selectivity of AB%, which matches every row in this table, and that leads to the over-estimation.
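A quick way to see this (a check I am adding here, not part of the original write-up) is to count the rows matching the pattern the optimizer is effectively costing:

SQL> select count(*) from mytest where col2 like 'AB%' ;

Every one of the 100 rows starts with AB, so an estimate based on AB% naturally lands at or near the full row count, which is exactly the 99 and 100 we saw in the plans above.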
Now let us run the query again, this time instructing the optimizer to treat "_" as a regular character by escaping it.
SQL> select count(*) from mytest where col2 like 'AB\_188%' escape '\';
COUNT(*)
----------
1
Execution Plan
| Id | Operation         | Name   | Rows | Bytes | Cost (%CPU) | Time     |
|----|-------------------|--------|------|-------|-------------|----------|
| 0  | SELECT STATEMENT  |        |      |       | 2 (100)     |          |
| 1  | SORT AGGREGATE    |        | 1    | 7     |             |          |
| 2  | TABLE ACCESS FULL | MYTEST | 1    | 693   | 2 (0)       | 00:00:01 |
Wow! Now the optimizer gets it right, with an estimated cardinality of 1.
So, you have to be careful when your LIKE pattern contains a "_" character.
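To illustrate the plan-choice consequence mentioned earlier, here is a hypothetical follow-up (not part of the original test; the index name is my own). Create an index on col2 and compare the two forms of the query:

SQL> create index mytest_col2_idx on mytest(col2);

SQL> select count(*) from mytest where col2 like 'AB_188%' ;

SQL> select count(*) from mytest where col2 like 'AB\_188%' escape '\' ;

The first, unescaped query is still estimated at close to all rows, so on a large table the optimizer would favour a full table scan; the escaped query is estimated at 1 row, which is what makes an index range scan attractive. On this 100-row table the cost difference is negligible, but with millions of rows the unescaped version can easily end up with a full table scan even though only a handful of rows actually match.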