I was reading that AUTO sampling in 11g generates statistics that are as accurate as 100% sampling but takes less time than 100% sampling. I decided to look at the quality of these statistics. Are they good for the optimizer ?
SQL> create table sales (order_key number) ;
Table created.
Insert a few records:
SQL> begin
       for i in 1 .. 10000
       loop
         insert into sales values (10);
         insert into sales values (20);
         insert into sales values (30);
         insert into sales values (40);
         insert into sales values (50);
       end loop;
       commit;
     end;
     /
I inserted five distinct values (10, 20, 30, 40, 50), each 10,000 times, for 50,000 rows in total.
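As a sanity check before gathering statistics, a query like the following (illustrative; run in the same session) should show all five values with 10,000 rows each:

```sql
SQL> select order_key, count(*)
     from sales
     group by order_key
     order by order_key;
```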
Now gather statistics without specifying ESTIMATE_PERCENT, so the default, DBMS_STATS.AUTO_SAMPLE_SIZE, is used:
SQL> exec dbms_stats.gather_table_stats(user,'SALES',method_opt=>'for all columns size 254');
This command completed in just 30 seconds. WOW !!
Let us look at the quality of these statistics now.
SQL> select num_rows,sample_size from user_tables where table_name='SALES';
  NUM_ROWS SAMPLE_SIZE
---------- -----------
     50000       50000
SQL> select num_distinct from user_tab_col_statistics where table_name='SALES';
NUM_DISTINCT
--------------------
5
This tells me that the sample size used was 100%, and we see the correct NUM_ROWS and NUM_DISTINCT values.
These statistics are 100% accurate. Takes very little time. What is wrong then ?
Now let us look at the sample used to generate the column statistics (histograms).
When I use method_opt => 'for all columns size 254', I instruct Oracle to create a histogram (with up to 254 buckets) on every column.
SQL> select sample_size from user_tab_col_statistics where table_name='SALES'
and column_name='ORDER_KEY' ;
SAMPLE_SIZE
-------------------
5450
OK. So Oracle used different samples to collect different statistics. It looks like NUM_ROWS and NUM_DISTINCT were derived from a 100% sample, but the histogram was computed from a sample of only 5,450 rows! I ran the same test with 10 million rows, and the column sample size was still close to ~5,500. Is this something that is hard-coded in Oracle?
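To see what the histogram actually recorded, the bucket endpoints can be inspected in the USER_TAB_HISTOGRAMS dictionary view (a sketch; for a frequency histogram, ENDPOINT_NUMBER is cumulative, so the per-value frequency is the difference between consecutive rows):

```sql
SQL> select endpoint_value, endpoint_number
     from user_tab_histograms
     where table_name = 'SALES'
     and column_name = 'ORDER_KEY'
     order by endpoint_number;
```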
We create histograms to make the optimizer aware of the real data distribution. Of course, the quality of this particular statistic may be very poor given such a small sample size.
So why present the optimizer with a histogram when it is not even close to reality?
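If histogram quality matters more than gathering time, one option (at the cost of a longer run, and note that in 11g an explicit percentage also disables the new one-pass NDV algorithm) is to force a 100% sample explicitly:

```sql
SQL> exec dbms_stats.gather_table_stats(user, 'SALES', -
       method_opt => 'for all columns size 254', -
       estimate_percent => 100);
```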